NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
UPDATE: The authors answered my questions. I would like to keep my score unchanged and suggest focusing on clarity in the final version. Perhaps this is a case where I would really be interested in looking at the source code.

Originality: The paper borrows the general idea of product keys from the database community; however, the application to fast retrieval in neural memory systems seems quite novel to me.

Quality: The core ideas of the paper are sound, but I would appreciate more rigor in both the conceptual and the experimental comparison with other approaches that incorporate memory into the Transformer (see e.g. [1]). Another suggestion would be to discuss in more depth the issue of potential non-uniformity of the query distribution, which indeed seems quite relevant. A recent work [2] also uses distributed sparse memory and employs a seemingly more advanced way of addressing this than a simple batch norm.

Clarity: On this front the paper can be improved, and this is my main critique. The authors do not explain clearly how the keys and values are computed. Presumably they are the same as in the original Transformer network, produced from the full sequence processed so far, but I could not find any confirmation of this. Expressions such as "x = x + FFN(x)" (line 160) look a bit odd. I also find the modification of the multi-head attention (Section 3.3) unusual and in need of some explanation: if the outputs of the heads are summed rather than concatenated, how does this change the original Transformer? (See the sketch below for my reading of this point.)

Significance: I think this is a quite neat technique that is easy to implement, and it will find applications in the community.

[1] Chen, Mia Xu, et al. "The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation." Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018.
[2] Rae, Jack, Sergey Bartunov, and Timothy Lillicrap. "Meta-Learning Neural Bloom Filters." International Conference on Machine Learning. 2019.
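Below is a minimal, runnable sketch of my reading of the two Clarity points (this is not the authors' code; all names and sizes are illustrative, and flat keys are used for brevity, so the product-key search itself is omitted): the memory layer is applied residually, x = x + PKM(x), and every query head reads from the same shared key/value memory, with the head outputs summed rather than concatenated.

```python
# Illustrative sketch only -- my interpretation, not the paper's implementation.
import torch
import torch.nn as nn

class MultiHeadMemorySketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_keys=1024, d_query=32, topk=8):
        super().__init__()
        self.topk = topk
        # one query projection per head; keys and values are shared across heads
        self.queries = nn.ModuleList([nn.Linear(d_model, d_query) for _ in range(n_heads)])
        self.keys = nn.Parameter(torch.randn(n_keys, d_query))
        self.values = nn.Embedding(n_keys, d_model)

    def forward(self, x):                                   # x: (batch, d_model)
        out = torch.zeros_like(x)
        for proj in self.queries:
            scores = proj(x) @ self.keys.t()                # (batch, n_keys)
            w, idx = scores.topk(self.topk, dim=-1)         # sparse selection of k keys
            w = torch.softmax(w, dim=-1)
            out = out + (w.unsqueeze(-1) * self.values(idx)).sum(dim=1)
        return x + out                                      # residual: x = x + PKM(x)
```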
Reviewer 2
**SUMMARY
* Goal: Increase the capacity of memory in memory models while keeping the cost of attentional access feasible.
* Model innovation: Product-Key Memory (PKM). The key of a stored memory is a pair of half-sized sub-keys, and a query is treated as a pair of half-sized sub-queries. There are two codebooks of N sub-keys each, and the Cartesian product of these codebooks defines the set of K = N^2 full keys of the memory. For retrieval (attention): given a query, restrict retrieval to a linear combination of the values associated with the k keys that best match the query. For each sub-query, find the closest k sub-keys (maximal dot product). The Cartesian product of these two sets of k sub-keys is guaranteed to contain the k full keys with the largest dot product with the full query (see the sketch below).
* Complexity: inference is reduced from O(K × d_q) to O((sqrt(K) + k^2) × d_q).
* Tested models: The contents of memory are fixed at test time (it is like long-term memory, not working memory). In Transformer models, the FFN is replaced with a memory-retrieval network, typically in 1 layer [158]. Memory size is K = 512^2 = 262k, with k = 32 selected keys [226].
* Task: Language modeling on 28B words of news articles from Common Crawl corpora.
* Results: E.g., inserting one PKM layer in a dim-1600, 16-layer Transformer drops perplexity from 14.4 to 13.2 [Table 1]. "Models with 12 layers and a PKM outperform 24-layer models of the same dimension, while being almost twice faster at inference." [Fig 4] Compared to comparable models with standard flat keys, "models with product keys are not only faster but also they have a much better memory usage, and consequently obtain a better perplexity" [266].

**REVIEW
* The proposal is a low-level implementational modification that, for language modeling with huge training data, does seem to enable a performance improvement while lowering inference cost. The proposed memory layer is in principle insertable into any model.
* It is unclear how general the improvement would be across models and tasks.
* The thorough set of ablation studies lends support to the numerous design choices.
* The results are not presented with error bars, significance levels, or other indications of variability across runs.
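To make the sub-key search concrete, here is a small sketch of the candidate generation (my own code and naming, not the paper's): since the score of a full key decomposes as the sum of two sub-key scores, the exact top-k full keys are guaranteed to lie among the k × k Cartesian combinations of the per-codebook top-k sub-keys, so the search touches 2·sqrt(K) sub-keys plus k^2 candidates instead of all K = N^2 keys.

```python
# Sketch of the product-key candidate search, assuming the score of full key
# (a, b) is <q1, subkeys1[a]> + <q2, subkeys2[b]> and that full key (a, b) is
# stored at flat index a * N + b (my own layout choice, for illustration).
import torch

def product_key_topk(query, subkeys1, subkeys2, k):
    """query: (d,); subkeys1, subkeys2: (N, d/2). Returns (top-k scores, flat key indices)."""
    d = query.shape[0]
    q1, q2 = query[: d // 2], query[d // 2:]
    s1, i1 = (subkeys1 @ q1).topk(k)              # best k sub-keys in codebook 1
    s2, i2 = (subkeys2 @ q2).topk(k)              # best k sub-keys in codebook 2
    cand = s1[:, None] + s2[None, :]              # (k, k) candidate scores
    top, flat = cand.view(-1).topk(k)             # exact top-k among the k*k candidates
    a, b = flat // k, flat % k
    full_idx = i1[a] * subkeys2.shape[0] + i2[b]  # index into the K = N^2 value table
    return top, full_idx

# e.g.: scores, idx = product_key_topk(torch.randn(32), torch.randn(512, 16), torch.randn(512, 16), k=32)
```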
Reviewer 3
The paper provides a "memory layer" with a large number of parameters. The parameters are accessed by content-based attention: the attention takes a query and finds the top-K parameter vectors with the nearest keys. The innovation is in the efficient way of finding the top-K keys. The efficient search splits the key into 2 parts and searches over 2 * sqrt(N_{items}) sub-keys instead of searching over N_{items}. This still provides exact results, because the memory holds an item for each (subKey1, subKey2) combination. The experiments are done on a dataset with 28 billion words. The extra parameters are helpful there.

Comments: The paper is good in all aspects. The writing is clear and pleasant to read. The experiments convincingly show the benefits of the model on the very large dataset. The memory layer may become a standard building block for large networks. I like that conditional computation models are mentioned in the related work section. The memory layer provides a nice alternative way to increase network capacity.

Minor details:
- Line 122: The usage of top-K attention weights is not fully differentiable, because the k-th attention weight changes with a discontinuity when switching from rank k+1 to k. If needed, it would be possible to design a fully differentiable output by ensuring that the k-th attention weight is zero at the tie-break.
- Under line 140: The "in practice, the cost of ..." sentence should probably be: "in theory, the cost is O(k \log k \times d_q) with a priority list ..." (see the sketch below).

Update: Thank you for the polite feedback. The answers were informative. I hope the paper will be accepted.
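On the last minor detail, a hypothetical illustration of the priority-list idea (my own example, not from the paper): if the two lists of top-k sub-key scores are sorted in descending order, the top-k of the k × k candidate sums can be extracted lazily with a heap in roughly O(k log k) operations, rather than scoring all k^2 combinations.

```python
# Hypothetical illustration of the priority-list argument (not the paper's code):
# lazily extract the k largest pairwise sums of two descending-sorted score lists.
import heapq

def topk_pairwise_sums(s1, s2, k):
    """s1, s2: scores sorted in descending order. Returns the k largest s1[i] + s2[j]."""
    heap = [(-(s1[0] + s2[0]), 0, 0)]            # max-heap via negated sums
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        neg, i, j = heapq.heappop(heap)
        out.append(-neg)
        for ni, nj in ((i + 1, j), (i, j + 1)): # only neighbours can be the next largest
            if ni < len(s1) and nj < len(s2) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (-(s1[ni] + s2[nj]), ni, nj))
    return out                                   # k pops, each O(log k) heap work

# topk_pairwise_sums([9, 5, 1], [8, 4, 2], k=3) -> [17, 13, 13]
```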