NeurIPS 2020

ConvBERT: Improving BERT with Span-based Dynamic Convolution


Meta Review

The paper proposes to replace some of the self-attention heads in Transformer/BERT models with span-based dynamic convolution. This is a good idea, since many of the dependencies modeled by self-attention are local. The proposed convolution heads are mixed with the original self-attention heads to reduce the FLOPs of the Transformer. The model achieves good results on the GLUE benchmark. The authors addressed the reviewers' questions in their response, and the reviewers agree that this paper would be a valuable addition to the conference.
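For readers unfamiliar with the idea described above, the following is a minimal, illustrative sketch of what a span-based dynamic convolution head might look like in PyTorch. It is not the authors' implementation: the module and parameter names (`SpanDynamicConv`, `span_conv`, `kernel_proj`, the kernel size) are assumptions, and the sketch only captures the core idea of generating per-position convolution kernels from a local span of the input rather than attending over the full sequence.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanDynamicConv(nn.Module):
    """Sketch of a span-based dynamic convolution head (not the official code).

    A per-position kernel is generated from a local span of the input and
    applied as a softmax-normalized lightweight convolution over values.
    """

    def __init__(self, dim, kernel_size=7):
        super().__init__()
        self.kernel_size = kernel_size
        # Span-aware summary of each position: depthwise conv over a local window.
        self.span_conv = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        # Produce one kernel weight per position per convolution tap.
        self.kernel_proj = nn.Linear(dim, kernel_size)
        self.value_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        # Span-aware representation of each position.
        span = self.span_conv(x.transpose(1, 2)).transpose(1, 2)
        # Dynamic kernels conditioned on the token and its local span.
        kernels = F.softmax(self.kernel_proj(x * span), dim=-1)  # (b, t, k)
        # Unfold values into local windows and mix them with the kernels.
        v = self.value_proj(x).transpose(1, 2)                   # (b, d, t)
        v = F.pad(v, (self.kernel_size // 2, self.kernel_size // 2))
        windows = v.unfold(2, self.kernel_size, 1)               # (b, d, t, k)
        return torch.einsum('bdtk,btk->btd', windows, kernels)   # (b, t, d)
```

In a mixed-attention block as described in the review, an output like this would be concatenated with the outputs of the remaining self-attention heads, so that local dependencies are handled by the cheaper convolution path while global ones are still captured by self-attention.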