NeurIPS 2020

SMYRF - Efficient Attention using Asymmetric Clustering


Meta Review

This paper proposes a method for reducing the quadratic bottleneck of transformer attention to O(N log N) using an asymmetric LSH clustering strategy. The paper also shows that finding an optimal assignment is NP-hard and thus heuristic approaches must be pursued. The authors propose a novel balanced clustering algorithm to approximate attention. The method can be applied directly to pre-trained models and achieves competitive or better performance with BigGAN/BERT/RoBERTa while reducing memory usage by 50%.

There was some disagreement among the reviewers about this paper, with R1 and R3 recommending solid acceptance, and R2 and R4 recommending weak reject. The reviewers mentioned as strengths that the proposed model saves memory for both training and inference; the experiments show significant reductions in the memory usage of BigGAN and BERT; the proposed model can be widely applied to both CV and NLP tasks; the evaluation shows good results on image and text data in multiple regimes (with/without fine-tuning, training from scratch, several compression levels); the analysis of the number of queries per cluster is quite interesting; and the drop-in nature makes the proposed method much more useful than alternatives. The main weakness is that the experiments do not compare against related efficient attention models (Reformer, Routing Transformer, Longformer), even in settings where such a comparison would be possible. The comparison with Reformer, in particular, seems important, given that it is also an LSH-based approach.

Overall, I think this is a good paper that provides a new solution to an important problem (efficient attention is highly relevant and sought after at the moment) and validates the proposed approach empirically in different regimes. The author response addressed the lack of comparison against other methods that the reviewers pointed out as the main weakness, and this was acknowledged in the discussion phase. Therefore I recommend acceptance. I urge the authors to add these new results to the final version of their paper. I also recommend that they add the citations suggested by R3.

Finally, R2 raised a point about comparison with distillation. In the discussion phase, R2 clarified that “as shown in Table 4 of TinyBERT (https://arxiv.org/pdf/1909.10351.pdf), a distilled 4-layer Transformer is comparable to the proposed method, and the distilled 6-layer model can be significantly better. But the author response makes sense because the proposed modification is not mutually exclusive with knowledge distillation.”