
VALSE Paper Preview, Episode 113: Understanding the Failure of BN in Transformer

Published: 2023-04-26 16:53 | Publisher: 程一-计算所


To help practitioners in the vision and learning community keep up with the latest developments and frontier techniques in the field, VALSE has launched the Paper Preview column, which releases one or two recorded videos per week, each giving a detailed walkthrough of a single frontier work from a top conference or journal. This episode of VALSE Paper Preview features work from Tsinghua University on understanding Batch Normalization. The work was supervised by Associate Professor Lei Huang and Professor Ji Wu, and the video is presented by first author Jiaxi Wang.


Paper title: Understanding the Failure of Batch Normalization for Transformers in NLP

Authors: Jiaxi Wang (Tsinghua University), Ji Wu (Tsinghua University), Lei Huang (Beihang University)

Watch on Bilibili:

https://www.bilibili.com/video/BV1KT411H73H/



Abstract:

Batch Normalization (BN) is a core and prevalent technique in accelerating the training of deep neural networks and improving the generalization on Computer Vision (CV) tasks. However, it fails to defend its position in Natural Language Processing (NLP), which is dominated by Layer Normalization (LN). In this paper, we are trying to answer why BN usually performs worse than LN in NLP tasks with Transformer models. We find that the inconsistency between training and inference of BN is the leading cause that results in the failure of BN in NLP. We define Training Inference Discrepancy (TID) to quantitatively measure this inconsistency and reveal that TID can indicate BN's performance, supported by extensive experiments, including image classification, neural machine translation, language modeling, sequence labeling, and text classification tasks. We find that BN can obtain much better test performance than LN when TID keeps small through training. To suppress the explosion of TID, we propose Regularized BN (RBN) that adds a simple regularization term to narrow the gap between batch statistics and population statistics of BN. RBN improves the performance of BN consistently and outperforms or is on par with LN on 17 out of 20 settings, involving ten datasets and two common variants of Transformer.
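The abstract's core idea is that BN normalizes with batch statistics at training time but population (running) statistics at inference time, and the gap between the two causes the failure in NLP; RBN adds a regularization term to shrink that gap. The following is a minimal NumPy sketch of this idea, not the paper's exact implementation: the squared-gap penalty form, the weight `lam`, and the helper name `rbn_forward` are assumptions for illustration only.

```python
import numpy as np

def rbn_forward(x, running_mean, running_var, gamma, beta,
                lam=0.1, momentum=0.1, eps=1e-5):
    """Training-mode BN over x of shape (batch, features), plus a penalty
    on the gap between batch statistics and running (population) statistics.
    Returns the output, updated running stats, and the regularization term
    to be added to the training loss."""
    mu_b = x.mean(axis=0)    # batch mean
    var_b = x.var(axis=0)    # batch variance

    # Regularizer: squared gap between batch and population statistics
    # (an assumed form; the paper's exact term may differ).
    reg = lam * (np.sum((mu_b - running_mean) ** 2)
                 + np.sum((np.sqrt(var_b + eps)
                           - np.sqrt(running_var + eps)) ** 2))

    # Standard BN transform using batch statistics (training mode).
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)
    out = gamma * x_hat + beta

    # Exponential moving average of population statistics, as in plain BN.
    new_mean = (1 - momentum) * running_mean + momentum * mu_b
    new_var = (1 - momentum) * running_var + momentum * var_b
    return out, new_mean, new_var, reg
```

At inference, the running statistics replace the batch statistics in the normalization; when the penalized gap stays small through training, the train-time and test-time transforms agree, which is exactly the regime where the paper reports BN can match or beat LN.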


Citation:

[1] J. Wang, J. Wu, and L. Huang, "Understanding the Failure of Batch Normalization for Transformers in NLP," in Advances in Neural Information Processing Systems, 2022.


Paper link:

https://arxiv.org/abs/2210.05153


Code link:

https://github.com/wjxts/RegularizedBN


Speaker bio:

Jiaxi Wang is a Ph.D. student at Tsinghua University. His research focuses on understanding normalization layers in deep learning and on machine learning for drug discovery. He has published two first-author papers at CCF-A/B conferences.



Special thanks to the main organizers of this Paper Preview:

Monthly rotating ACs: 林迪 (Tianjin University), 彭春蕾 (Xidian University)


How to participate

1. The weekly VALSE Webinar is streamed live on Bilibili. Search for VALSE_Webinar on Bilibili and follow us!

Live stream:

https://live.bilibili.com/22300737

Archived videos:

https://space.bilibili.com/562085182/ 


2. The VALSE Webinar usually takes place on Wednesday evenings at 20:00, though the time occasionally shifts to accommodate speakers in other time zones. To stay up to date, follow the VALSE WeChat official account (valse_wechat) or join the VALSE QQ S group (group number: 317920537).


*Note: when applying to join the VALSE QQ group, you must provide your name, affiliation, and role; all three are required. After joining, please set your group nickname to name-role-affiliation. Roles: T for university and research-institute staff; I for industry R&D; D for Ph.D. students; M for master's students.


3. The VALSE WeChat official account generally announces the following week's Webinar on Thursday.


4. You can also check Webinar information directly on the VALSE homepage: http://valser.org/. Slides for each Webinar report (with the speaker's permission) are posted at the bottom of the corresponding announcement on the VALSE website.
