Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
The paper presents a way to incorporate sparse routing networks into the Transformer architecture to reduce the computational cost of attention for long sequences. The reviewers acknowledge that the idea is novel and that the experiments suggest the proposed architecture is potentially useful. However, the experiments do not demonstrate improved efficiency or accuracy on real-world tasks with long sequences, and a comparison with Transformer architectures that use sparse attention is lacking. On balance, I recommend acceptance as a poster.