{"title": "Mixtape: Breaking the Softmax Bottleneck Efficiently", "book": "Advances in Neural Information Processing Systems", "page_first": 5775, "page_last": 5783, "abstract": "The softmax bottleneck has been shown to limit the expressiveness of neural lan-\r\nguage models. Mixture of Softmaxes (MoS) is an effective approach to address such a theoretical limitation, but are expensive compared to softmax in terms of both memory and time. We propose Mixtape, an output layer that breaks the softmax bottleneck more efficiently with three novel techniques\u2014logit space vector gating, sigmoid tree decomposition, and gate sharing. On four benchmarks including language modeling and machine translation, the Mixtape layer substantially improves the efficiency over the MoS layer by 3.5x to 10.5x while obtaining similar performance. A network equipped with Mixtape is only 20% to 34% slower than a\r\nsoftmax-based network with 10-30K vocabulary sizes, and outperforms softmax in perplexity and translation quality.", "full_text": "Mixtape: Breaking the Softmax Bottleneck Ef\ufb01ciently\n\nZhilin Yang1, Thang Luong2, Ruslan Salakhutdinov1, Quoc Le2\n\n1Carnegie Mellon University, 2Google Brain\n\n{zhiliny,rsalakhu}@cs.cmu.edu, {thangluong,qvl}@google.com\n\nAbstract\n\nThe softmax bottleneck has been shown to limit the expressiveness of neural lan-\nguage models. Mixture of Softmaxes (MoS) is an effective approach to address\nsuch a theoretical limitation, but are expensive compared to softmax in terms of\nboth memory and time. We propose Mixtape, an output layer that breaks the soft-\nmax bottleneck more ef\ufb01ciently with three novel techniques\u2014logit space vector\ngating, sigmoid tree decomposition, and gate sharing. On four benchmarks includ-\ning language modeling and machine translation, the Mixtape layer substantially\nimproves the ef\ufb01ciency over the MoS layer by 3.5x to 10.5x while obtaining similar\nperformance. A network equipped with Mixtape is only 20% to 34% slower than a\nsoftmax-based network with 10-30K vocabulary sizes, and outperforms softmax in\nperplexity and translation quality.\n\n1\n\nIntroduction\n\nSoftmax has been a standard output layer for a wide variety of neural networks, including the\nmajority of neural language models [5, 2, 3, 8, 11]. However, as pointed out by [19], softmax is a\nfundamental limitation of the expressiveness of neural language models, because it constrains the\noutput representations to be low-rank, which might not be suf\ufb01cient for modeling the complexity\nof natural language. Such a limitation is called the softmax bottleneck. To break the softmax\nbottleneck, [19] proposed Mixture of Softmaxes (MoS) that introduces discrete latent variables into\nthe output layer so that the log probability matrix is high-rank because of the log-sum-exp nonlinear\ntransformation. However, MoS is expensive compared to softmax in terms of both memory and time,\nwhich makes it less practically useful when computational budgets are limited.\nTo reduce the computational cost of MoS, we propose a novel output layer Mixtape to break the\nsoftmax bottleneck ef\ufb01ciently. Mixtape can be plugged into any existing networks as an additional\nlayer before the cross entropy loss. Instead of employing a scalar mixture in the probability space as in\nMoS, Mixtape applies a vector gating mechanism in the logit space to avoid using multiple expensive\nsoftmaxes. In addition, Mixtape uses two more novel techniques to further reduce the computational\ncost. 
First, the vector gating mechanism is expensive because we need to compute a softmax gate for each word in the vocabulary. We propose sigmoid tree decomposition, which decomposes a softmax probability gating distribution into a depth-2 binary tree structure, where each branch carries a portion of the probability mass determined by a sigmoid function. Sigmoid tree decomposition is much more efficient because it avoids the reduction and division operations in softmax. The other technique, gate sharing, is to share the gate values among all infrequent words, resulting in partially high-rank representations. This technique saves a considerable amount of memory and computation without affecting performance, because the gate values of infrequent words are usually hard to estimate accurately even without sharing the gates.

With all the above techniques combined, Mixtape substantially improves the efficiency of MoS while obtaining comparable or even better performance on four benchmarks, including language modeling and machine translation. With normal vocabulary sizes (e.g., 10K-30K), the Mixtape layer is 1.6x to 11.5x faster than the MoS layer given the same batch size, and is 3.5x to 10.5x faster given the same memory budget. With normal vocabulary sizes, a Mixtape-based network is only 5% to 18% slower than a softmax-based network given the same batch size, and is only 20% to 34% slower given the same memory budget. With a large vocabulary of 100K tokens, a Mixtape-based network is still only 60% slower than a softmax-based network. Both Mixtape and MoS outperform softmax in perplexity and translation quality. Notably, these benchmarks have varied vocabulary sizes ranging from 10K to 100K and different input representations including words and BPE subwords, which demonstrates that Mixtape is effective and robust with a variety of inputs.

2 Softmax Bottleneck

In the following, we introduce our notation and review the softmax bottleneck problem pointed out by [19].

Consider a general setting of language modeling and text generation, where given the context $C$ we want to estimate the conditional distribution of the next token $P^*(X|C)$. Here we use $P^*$ to denote the true data distribution. The context $C$ denotes the tokens that have occurred so far. For example, given a corpus $(X_1, X_2, \ldots, X_T)$, for each time step $t$ we aim to estimate the probability $P^*(X_t | C = X_{<t})$. For conditional generation, the probability is additionally conditioned on other inputs, which are omitted in our discussion without loss of generality.

We consider a natural language modeling task as the problem of modeling a finite set of pairs of a context and its conditional next-token distribution $L = \{(c_1, P^*(X|c_1)), \ldots, (c_N, P^*(X|c_N))\}$, where $N$ is the number of possible contexts. The validity of the finiteness assumption has been discussed in [19] and does not affect the conclusion that follows.

A commonly-used approach for language modeling is to use neural networks to encode the context and the next token into vector representations $h_c$ and $w_x$ respectively.
The conditional distribution is then modeled by a softmax function,
$$P_\theta(x|c) = \frac{\exp h_c^\top w_x}{\sum_{x'} \exp h_c^\top w_{x'}}$$
where $\theta$ denotes the model parameters. The dot products between the two embeddings are called logits, and the corresponding feature space is termed the logit space.

We write down the context embeddings, token embeddings, and log probabilities in matrix form as follows:
$$H_\theta = \begin{bmatrix} h_{c_1}^\top \\ h_{c_2}^\top \\ \vdots \\ h_{c_N}^\top \end{bmatrix};\quad W_\theta = \begin{bmatrix} w_{x_1}^\top \\ w_{x_2}^\top \\ \vdots \\ w_{x_M}^\top \end{bmatrix};\quad A = \begin{bmatrix} \log P^*(x_1|c_1) & \cdots & \log P^*(x_M|c_1) \\ \log P^*(x_1|c_2) & \cdots & \log P^*(x_M|c_2) \\ \vdots & \ddots & \vdots \\ \log P^*(x_1|c_N) & \cdots & \log P^*(x_M|c_N) \end{bmatrix}$$
where $M$ is the number of possible next tokens.

The language modeling problem now turns into a matrix factorization problem: finding model parameters $\theta$ such that
$$H_\theta W_\theta^\top = A + \text{row-wise shift} \quad (1)$$
The row-wise shift operation is defined as $A + \Lambda J_{N,M}$, where $\Lambda$ is a diagonal matrix of size $N \times N$ and $J_{N,M}$ is an all-ones matrix of size $N \times M$.

Given the matrix factorization formulation, it follows that the rank of the LHS of Eq. (1) is upper bounded by the embedding size $d$. Based on this key observation, the softmax bottleneck problem is identified as follows.

Corollary 1 (Softmax bottleneck) [19] If $d < \operatorname{rank}(A) - 1$, then for any function family $U$ and any model parameter $\theta$, there exists a context $c$ in $L$ such that $P_\theta(X|c) \neq P^*(X|c)$.

In other words, given that most neural language models use distributed low-dimensional context and token embeddings, the softmax bottleneck indicates that these models do not have sufficient expressiveness to model complex, high-rank natural language.

3 Breaking the Softmax Bottleneck Efficiently

[Figure 1: (a) The Mixtape layer. (b) Sigmoid tree decomposition with K = 4. Left: S is the number of frequent tokens that use their own gate priors, M is the vocabulary size, and the blue boxes denote the gate priors shared by all infrequent tokens. In the diagram, the number of gates is set to K = 4, which is the value we use throughout the experiments. In our implementation, we do not explicitly compute the gate logits for infrequent tokens. Instead, we perform a scalar mixture using the shared gate priors and the context embeddings of Eq. (5) before multiplication with the token embeddings, to save memory. Right: each edge $\gamma_*$ in the sigmoid tree is a probability computed using sigmoid functions. Each gate prior is the product of the probabilities along the path from the root to the leaf; e.g., $\pi_1 = \gamma_1\gamma_2$.]

Mixture of Softmaxes (MoS) [19] is an effective approach to break the softmax bottleneck. Specifically, MoS uses the following formulation for the conditional distribution:
$$P_\theta(x|c) = \sum_{k=1}^{K} \pi_{c,k} \frac{\exp h_{c,k}^\top w_x}{\sum_{x'} \exp h_{c,k}^\top w_{x'}};\quad \text{s.t.}\ \sum_{k=1}^{K} \pi_{c,k} = 1$$
where the priors $\pi_{c,k}$ are obtained by another softmax-based function of the last-layer hidden states, and $K$ is the number of mixture components. This formulation is not limited by the softmax bottleneck, because the log probability matrix $A$ is modeled by $\hat{A}_{\text{MoS}} = \log \sum_{k=1}^{K} \Pi_k \exp(H_{\theta,k} W_\theta^\top)$, where the log-sum-exp nonlinearity produces a high-rank matrix $\hat{A}_{\text{MoS}}$.
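To make the rank contrast concrete, here is a small numerical check. This sketch is our own illustration rather than part of the paper; all dimensions, variable names, and the NumPy setup are assumptions chosen for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, d, K = 50, 40, 5, 3      # contexts, vocab size, embedding dim, mixtures

def softmax(logits):
    # numerically stable softmax over the vocabulary axis
    z = np.exp(logits - logits.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

W = rng.normal(size=(M, d))    # token embeddings

# Single softmax: the log-probability matrix has rank at most d + 1,
# since Eq. (1) writes it as a rank-d product plus a rank-one shift.
H = rng.normal(size=(N, d))
A = np.log(softmax(H @ W.T))
print(np.linalg.matrix_rank(A))           # expected: 6, i.e. d + 1

# Mixture of K softmaxes: the log of a sum of softmaxes is no longer a
# shifted low-rank product, so the rank bound disappears.
Hk = rng.normal(size=(K, N, d))           # per-component context embeddings
pi = rng.dirichlet(np.ones(K), size=N)    # mixture priors, rows sum to 1
P = sum(pi[:, [k]] * softmax(Hk[k] @ W.T) for k in range(K))
print(np.linalg.matrix_rank(np.log(P)))   # expected: 40, i.e. full rank
```

The first matrix hits the bound of Eq. (1) exactly; the second generically reaches full rank, which is the property MoS exploits.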
However, softmax involves applying a nonlinear exp transformation for each token in the vocabulary, performing a reduction across the vocabulary, and then a division, all of which are computationally intensive. Moreover, softmax is memory intensive because it has to store the pre-activations $h_*^\top w_*$, the post-activations $\exp(\cdot)$, and the output probabilities for each token in the vocabulary. Since a normal vocabulary size is on the order of $10^4$, MoS dramatically increases the computational cost by using multiple softmaxes.

Another approach to break the softmax bottleneck was recently introduced [9]. However, this approach is less efficient than MoS because it computes a mixture of sigmoid functions in addition to softmaxes.

To alleviate the efficiency issue, we introduce our novel method Mixtape, which improves the efficiency over MoS without sacrificing the ability to learn high-rank representations.

3.1 Logit Space Vector Gating

Since the most expensive part of MoS is computing the $K$ softmaxes, a significant computational budget can be saved if we manage to use only one softmax to compute the final probability distribution. It is tempting to move the mixture from the probability space into the logit space, i.e., to mix the representations before the softmax operation. This leads to the following conditional distribution:
$$P_\theta(x|c) = \frac{\exp\left(\left(\sum_{k=1}^{K} \pi_{c,k} h_{c,k}\right)^\top w_x\right)}{\sum_{x'} \exp\left(\left(\sum_{k=1}^{K} \pi_{c,k} h_{c,k}\right)^\top w_{x'}\right)}$$
However, as pointed out in [19], such a formulation results in a low-rank representation because the matrix factorization form in Eq. (1) still applies.

Nevertheless, we now show that with a small modification, applying mixture operations in the logit space leads to high-rank representations. The key idea is to use a vector gating mechanism instead of scalar mixtures. In other words, instead of using a shared set of mixture weights for every token, we use a different set of weights for different tokens. Formally, with vector gating, the conditional distribution can be written as
$$P_\theta(x|c) = \frac{\exp \sum_{k=1}^{K} \pi_{c,x,k}\, h_{c,k}^\top w_x}{\sum_{x'} \exp \sum_{k=1}^{K} \pi_{c,x',k}\, h_{c,k}^\top w_{x'}};\quad \text{s.t.}\ \sum_{k=1}^{K} \pi_{c,x,k} = 1 \quad (2)$$
The log probability matrix $A$ is now modeled as $\hat{A}_{\text{Mixtape}} = \sum_{k=1}^{K} \Pi_k \odot (H_{\theta,k} W_\theta^\top)$. Due to the elementwise multiplication, the matrix factorization form in Eq. (1) does not apply, and the log probability matrix is therefore high-rank. In addition, the vector gating mechanism removes the need to compute $K$ softmax probability distributions, which makes an efficiency improvement possible.

However, there is still a remaining obstacle before Mixtape is actually efficient enough.
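Before describing the fix, a minimal sketch of Eq. (2) helps make both the gating and the obstacle concrete. This is our own illustration for a single context $c$, with naively softmax-normalized priors and randomly stubbed inputs; all shapes and names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, K = 40, 5, 4          # vocab size, embedding dim, number of gates

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

# For one context c: per-gate context embeddings h_{c,k}, token embeddings
# w_x, and pre-activation priors l_{c,x,k}.
h = rng.normal(size=(K, d))
W = rng.normal(size=(M, d))
l = rng.normal(size=(M, K))

# Eq. (2): a different set of mixture weights for every token (vector gating).
# The naive normalization below is the obstacle: one K-way softmax per token.
pi = softmax(l, axis=1)                   # (M, K), rows sum to 1
logits = ((W @ h.T) * pi).sum(axis=1)     # sum_k pi_{c,x,k} h_{c,k}^T w_x
p = softmax(logits)                       # a single M-way softmax over tokens
print(p.sum())                            # 1.0
```

Only one vocabulary-sized softmax remains at the end; the per-token prior normalization is what Section 3.2 replaces.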
Notice that since the priors $\pi_{c,x,k}$ need to sum to one for each context-token pair $(c, x)$ (we were not able to obtain good performance with unnormalized priors), a naive implementation requires computing a softmax over the prior probabilities for each token $x$ given the context $c$. Letting $l_{c,x,k}$ be the pre-activation priors, we have
$$\pi_{c,x,k} = \frac{\exp l_{c,x,k}}{\sum_{k'=1}^{K} \exp l_{c,x,k'}}$$
Unfortunately, this will be even slower than MoS because the number of tokens in the vocabulary is usually large. In the following, we introduce a novel technique that avoids this efficiency trap.

3.2 Sigmoid Tree Decomposition

We now introduce how to efficiently compute the priors $\pi_{c,x,k}$. Instead of using a softmax, we propose to decompose the softmax distribution into a tree structure of sigmoid functions. Specifically, we compute $(K - 1)$ sigmoid outputs and use them to define the probabilities along the tree branches. For example, with $K = 4$, the priors are defined as
$$\begin{aligned} \gamma_{c,x,k} &= \sigma(l_{c,x,k}) \quad \text{for } k = 1 \ldots K-1 \\ \pi_{c,x,1} &= \gamma_{c,x,1}\,\gamma_{c,x,2} \\ \pi_{c,x,2} &= \gamma_{c,x,1}\,(1-\gamma_{c,x,2}) \\ \pi_{c,x,3} &= (1-\gamma_{c,x,1})\,\gamma_{c,x,3} \\ \pi_{c,x,4} &= (1-\gamma_{c,x,1})\,(1-\gamma_{c,x,3}) \end{aligned} \quad (3)$$
where $\gamma_*$ denotes the sigmoid probabilities and $\sigma$ is the sigmoid function. The above equations are illustrated in Figure 1(b).

We call this technique sigmoid tree decomposition. Such a decomposition is able to fully recover a $K$-way probability distribution with $(K - 1)$ sigmoid functions. Using sigmoid functions removes the reduction and division operations in softmax and is more efficient.

Although the sigmoid tree decomposition technique can be used with any $K$, in our experiments we always use $K = 4$, for two reasons. First, we find that Mixtape is effective with $K = 4$ for all the tasks in our experiments. Second, speed is core to Mixtape, and we fix $K$ to be the minimal possible value. Compared to MoS, using a fixed number of components $K$ means that Mixtape requires less hyperparameter tuning effort. Moreover, $K = 4$ is relatively small compared to the number of components in MoS, which further reduces the computational cost.

Let $g_c$ be the $d_1$-dimensional last-layer hidden state given context $c$. The pre-activation priors $l_*$ are computed as
$$l_{c,x,k} = v_x^\top \tanh(U_k g_c) + u_k^\top g_c + b_{x,k} \quad (4)$$
where $v_x \in \mathbb{R}^{d_2}$, $U_k \in \mathbb{R}^{d_2 \times d_1}$, $u_k \in \mathbb{R}^{d_1}$, and $b_{x,k} \in \mathbb{R}$ are model parameters. Here $d_2$ is a hyperparameter that denotes the gate embedding size and is usually chosen to be much smaller than the normal word embedding size $d$. The context embeddings are obtained by
$$h_{c,k} = \tanh(H_k g_c) \quad (5)$$
where $H_k \in \mathbb{R}^{d \times d_1}$ is a model parameter.
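Below is a minimal sketch of Eqs. (3)-(5) for one context, again under our own toy shape assumptions: compute the pre-activation priors, decompose them with the K = 4 sigmoid tree, and verify that each token's gate prior sums to one.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, d1, d2, K = 40, 5, 8, 3, 4   # vocab, token emb, hidden, gate emb sizes; gates

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

gc = rng.normal(size=d1)                     # last-layer hidden state g_c
U = rng.normal(size=(K - 1, d2, d1))         # U_k; only K-1 gates feed the tree
u = rng.normal(size=(K - 1, d1))             # u_k
v = rng.normal(size=(M, d2))                 # v_x
b = rng.normal(size=(M, K - 1))              # b_{x,k}
Hk = rng.normal(size=(K, d, d1))             # H_k

# Eq. (5): context embeddings h_{c,k}, one per gate.
h = np.tanh(Hk @ gc)                         # (K, d)

# Eq. (4): pre-activation priors l_{c,x,k} for every token x.
l = np.stack([v @ np.tanh(U[k] @ gc) + u[k] @ gc + b[:, k]
              for k in range(K - 1)], axis=1)           # (M, K-1)

# Eq. (3): sigmoid tree decomposition, 4 priors from 3 sigmoids, no reduction
# or division across gates.
g1, g2, g3 = sigmoid(l).T                    # each of shape (M,)
pi = np.stack([g1 * g2, g1 * (1 - g2),
               (1 - g1) * g3, (1 - g1) * (1 - g3)], axis=1)
print(np.allclose(pi.sum(axis=1), 1.0))      # True: a valid 4-way prior per token
```

The priors sum to one by construction: the two leaves under each branch split that branch's mass, so the tree always distributes exactly the total probability.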
3.3 Gate Sharing

So far we have arrived at an efficient high-rank model, but there is still room for further improvement. One observation is that we still have to compute a gate prior for each token in the vocabulary, which becomes the efficiency bottleneck. However, for infrequent tokens, it is hard to estimate the gate priors accurately due to the lack of training samples, so learning gate priors for infrequent tokens might simply be a waste of computation. To leverage this observation, the core idea of gate sharing is to share the same gate priors among all infrequent words. Specifically, for an infrequent token $x$, the pre-activation gate priors are defined as
$$l_{c,x,k} = u_k^\top g_c \quad (6)$$
which remains constant given $c$ and $k$ for different infrequent tokens $x$.

The resulting representations are partially high-rank. Suppose the token indices are ranked by frequency. The log probability matrix is now modeled by
$$\hat{A} = \left[\ \sum_{k=1}^{K} \Pi_k^{(1)} \odot \left(H_{\theta,k}^{(1)} W_\theta^{(1)\top}\right);\ \ H_\theta^{(2)} W_\theta^{(2)\top}\ \right]$$
where the superscripts (1) and (2) denote the representations for frequent and infrequent tokens respectively. We thus have high-rank representations for frequent tokens and low-rank representations for infrequent tokens. For infrequent tokens, our formulation is equivalent to performing logit space scalar mixtures, also known as Mixture of Contexts in [19]. Similar ideas have been demonstrated in previous work [8], where infrequent tokens use less-expressive representations (smaller embedding sizes) to save memory and computation without affecting performance.

With gate sharing, we use the shared gate prior to mix the context embeddings $h_{c,k}$ before multiplication with the token embeddings $w_x$, which saves memory because no gate logits are stored for infrequent tokens. Gate sharing also speeds up the computation by computing only one set of gate priors for all infrequent tokens.

Let $S$ be the number of frequent tokens and let $r = S/M$, with $M$ being the vocabulary size. In our experiments, we set $r = 0.5$ for machine translation and $r = 0.1$ for language modeling.

3.4 Summary and Discussion

The Mixtape layer is summarized as follows:
1. Given the last-layer hidden states $g_c$, compute the context embeddings $h_{c,k}$ using Eq. (5).
2. For each frequent token $x$, compute the pre-activation gate priors $l_{c,x,k}$ using Eq. (4).
3. For all infrequent tokens, compute a shared pre-activation gate prior $l_{c,x,k}$ using Eq. (6).
4. Use sigmoid tree decomposition to compute the gate priors $\pi_{c,x,k}$ as in Eq. (3).
5. Use vector gating to obtain the next-token probabilities using Eq. (2).
The architecture of the Mixtape layer is illustrated in Figure 1, and a sketch of these steps is given after this section's cost analysis below.

In our implementation, we also add biases to the matrix multiplication operations in Eqs. (2), (4), and (5), which were omitted in the text above for simplicity. It is also optional to employ weight normalization [14] for the parameter $U_k$ in Eq. (4). Different from [14], we use a constant scale instead of a learnable one, as it leads to more stable optimization. In our experiments, we use weight normalization for language modeling but did not observe improvement on the machine translation tasks. We also apply dropout on $\tanh(U_k g_c)$ and $h_{c,k}$ in Eqs. (4) and (5). To further regularize the networks, we also add a small amount of Gaussian noise to the pre-activation priors $l_*$ in the forward pass.

If we neglect cheap operations and only consider matrix multiplication and softmax, MoS has $2(d_1 d K + dKM)$ FLOPs for matrix multiplication and $K$ $M$-way softmaxes. For comparison, Mixtape has $2(d_1 d K + dKS)$ FLOPs for matrix multiplication and one $M$-way softmax. The speedup of Mixtape comes from a smaller number of softmaxes, a smaller $K$, and a smaller $S < M$. Suppose an $M$-way softmax uses $8M$ bytes for storing intermediate and final results. If we again only consider the major operations of matrix multiplication and softmax, then with FP32 tensors, MoS roughly uses $(4dK + 12MK)$ bytes and Mixtape uses $(4dK + 12SK + 8M)$ bytes. Mixtape uses less memory due to a smaller $S$ and a smaller $K$.
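To tie the five steps together, here is a minimal end-to-end sketch for a single context. It is our own illustration under stated assumptions: parameter shapes, variable names, and toy dimensions are ours, and the biases, weight normalization, dropout, and noise mentioned above are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
M, S, d, d1, d2, K = 40, 8, 5, 8, 3, 4   # vocab, frequent tokens, dims, gates

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
softmax = lambda x: np.exp(x - x.max()) / np.exp(x - x.max()).sum()

def tree_priors(l):
    """Eq. (3): map (..., K-1) sigmoid pre-activations to (..., 4) priors."""
    t = sigmoid(l)
    t1, t2, t3 = t[..., 0], t[..., 1], t[..., 2]
    return np.stack([t1 * t2, t1 * (1 - t2),
                     (1 - t1) * t3, (1 - t1) * (1 - t3)], axis=-1)

gc = rng.normal(size=d1)                 # last-layer hidden state
Hk = rng.normal(size=(K, d, d1))
U = rng.normal(size=(K - 1, d2, d1))
u = rng.normal(size=(K - 1, d1))
v = rng.normal(size=(S, d2))             # v_x, b_x exist only for frequent tokens
b = rng.normal(size=(S, K - 1))
W = rng.normal(size=(M, d))              # token embeddings, frequent tokens first

h = np.tanh(Hk @ gc)                                       # step 1, Eq. (5): (K, d)
l_freq = np.stack([v @ np.tanh(U[k] @ gc) + u[k] @ gc + b[:, k]
                   for k in range(K - 1)], axis=1)         # step 2, Eq. (4): (S, K-1)
l_shared = np.array([u[k] @ gc for k in range(K - 1)])     # step 3, Eq. (6): (K-1,)
pi_freq = tree_priors(l_freq)                              # step 4: (S, K)
pi_shared = tree_priors(l_shared)                          # step 4: (K,)

# Step 5: vector gating for frequent tokens; for infrequent tokens the shared
# prior mixes the context embeddings once (a scalar mixture), as in Figure 1.
logit_freq = ((W[:S] @ h.T) * pi_freq).sum(axis=1)         # (S,)
logit_rare = W[S:] @ (pi_shared @ h)                       # (M - S,)
p = softmax(np.concatenate([logit_freq, logit_rare]))      # one M-way softmax
print(np.allclose(p.sum(), 1.0))                           # True
```

Note how the per-token gate computation only runs over the $S$ frequent tokens; the single $M$-way softmax at the end is the only operation that touches the full vocabulary, matching the cost accounting above.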
4 Experiments

Our experiments consist of three parts. First, we demonstrate that the proposed Mixtape layer is able to improve state-of-the-art machine translation systems by breaking the softmax bottleneck. Second, we compare the perplexity, translation quality, speed, and memory consumption of Mixtape, MoS, and softmax, to demonstrate that Mixtape achieves a good balance between effectiveness and efficiency. Third, through ablation studies, we show the benefits of gate sharing.

4.1 Datasets

We test Mixtape on two tasks, language modeling and machine translation. For language modeling, we exactly follow the settings in [19] on Penn Treebank [12] and One Billion Word [4] for fair comparison. We implement the same recurrent network architectures and follow the regularization and optimization techniques used in [19]. We tune the model size of Mixtape such that Mixtape has the same number of parameters as MoS in the corresponding settings. On One Billion Word, we also replicate the data preprocessing pipeline that lower-cases the text and chooses the top 100K tokens as the vocabulary. This results in a non-standard setting, but it enables fair comparison with MoS as well as excluding the orthogonal effects of techniques for larger vocabularies such as adaptive softmax [8].

For machine translation, our experiments are based on two widely-used WMT'14 benchmarks, English to German (En-De) and English to French (En-Fr), following the setups in [13, 18]. For En-De, we train on the WMT'16 training data and test on newstest14. For En-Fr, we train on the WMT'14 training data and test on newstest14. We use BPE encodings [15] with a vocabulary size of 32K. Following [17], we use sacrebleu for evaluation.

The statistics of the different datasets and settings are shown in Table 2. The selected datasets present a degree of diversity in sizes, input units, and vocabulary sizes, which enables evaluating the robustness of Mixtape.

| Dataset | Size | Unit | Vocab Size |
|---|---|---|---|
| PTB | 1M tokens | Word | 10K |
| 1B | 1B tokens | Word | 100K |
| En-De | 4.5M pairs | BPE | 32K |
| En-Fr | 36M pairs | BPE | 32K |

Table 2: Dataset statistics. "PTB" and "1B" denote Penn Treebank and One Billion Word respectively.

4.2 WMT'14 Results

We apply Mixtape on top of Transformers [18] for a comparison with state-of-the-art systems on the WMT'14 benchmarks. We also incorporate relative positional encodings [16] in our architecture. On En-De, we employ a 6-layer Transformer with embedding size 1024, inner layer size 4096, and 16 attention heads. We train for 300K steps with a learning rate of 2.5, a batch size of 4096, and 16K warmup steps.
We apply a dropout of 0.3 on the layer outputs, a dropout of 0.15 on attention probabilities, a dropout of 0.2 on $\tanh(U_k g_c)$ in Eq. (4), and a Gaussian noise with 0.1 stdev on the pre-activation gate priors. On En-Fr, we employ a 6-layer Transformer with embedding size 2048, inner layer size 8192, and 16 attention heads. We train for 1.2M steps with a learning rate of 2.0, a batch size of 4096, and 16K warmup steps. We apply a dropout of 0.25 on the layer outputs, dropouts of 0.15 on attention probabilities and $\tanh(U_k g_c)$ in Eq. (4), and a Gaussian noise with 0.1 stdev on the pre-activation gate priors.

| Method | En-De BLEU | En-Fr BLEU |
|---|---|---|
| [18] – Transformer | 28.4 | 41.0 |
| [6] – Universal Transformer | 28.9 | - |
| [1] – Weighted Transformer | 28.9 | 41.4 |
| [16] – Transformer + Relative encodings | 29.2 | 41.5 |
| [10] – Transformer + MoS | 29.5 | 42.1 |
| [13] – Large-batch training | 29.3 | 43.2 |
| [17] – Mesh Tensorflow (2.9B params) | 26.7 | 43.9 |
| [17] – Mesh Tensorflow (0.8B params) | 27.5 | 43.5 |
| Ours – Transformer + Mixtape (0.2B/0.8B params) | 29.3 | 43.9 |

Table 1: Comparison with state-of-the-art systems on WMT En-De and En-Fr. Mixtape uses 0.2 and 0.8 billion parameters for the En-De and En-Fr tasks respectively.

The results of our method are shown in Table 1. Mixtape with Transformers achieves state-of-the-art results on both En-De and En-Fr. Interestingly, Mixtape outperforms baselines that use MoS [10]. This demonstrates that breaking the softmax bottleneck significantly contributes to achieving state-of-the-art performance for machine translation, and that Mixtape is an effective approach to break such a bottleneck. On En-Fr, Mixtape obtains the same performance as Transformers trained with Mesh Tensorflow [17]. However, Mixtape is much more parameter-efficient, using only 0.8 billion parameters vs. 2.9 billion parameters in Mesh Tensorflow. Moreover, Mixtape outperforms Mesh Tensorflow by a large margin on En-De, demonstrating more robustness and generalization capability on relatively small datasets.
Note that [7] reports better performance with back translation, which is not comparable with our setting.

4.3 Ablation Study and Comparison with Baselines

We now compare the performance of Mixtape with MoS and softmax, and study the effects of gate sharing. We report the training time for both the output layer alone and the entire network. To take the memory usage of different methods into consideration, in addition to reporting training time with the same batch size, we also report training time with the same memory budget. In other words, a model that uses more memory will have a smaller batch size, and thus training time per instance will increase.

The results of the different methods on Penn Treebank, One Billion Word, WMT'14 En-De, and WMT'14 En-Fr are shown in Tables 3, 4, 5, and 6. We use baseline MoS results from [19, 10] whenever possible and avoid using our own implementation, for fair comparison.

| Method | Perplexity (prior work) | Perplexity (our impl.) | Out layer (same bsz) | All layers (same bsz) | Out layer (same mem) | All layers (same mem) |
|---|---|---|---|---|---|---|
| Softmax | 58.8 | 59.19 | 14 | 328 | 0.40 | 3.06 |
| MoS-3 | 58.62 | 57.62 | 147 | 439 | 3.37 | 6.25 |
| MoS-5 | 57.36 | 57.24 | 173 | 488 | 4.40 | 8.38 |
| MoS-10 | 56.33 | 56.49 | 242 | 609 | 6.96 | 12.99 |
| MoS-15 | 55.97 | 56.14 | 310 | 731 | 9.62 | 15.73 |
| Mixtape | - | 56.37 | 27 | 345 | 0.90 | 3.66 |
| - no sharing | - | 56.33 | 95 | 487 | 3.88 | 10.44 |

Table 3: Perplexity and training time comparison on Penn Treebank. "MoS-K" means MoS with K mixture components; "no sharing" means Mixtape without the gate sharing technique; "our impl." means results from our own implementation; "out layer" means training time for the output layer only; "all layers" means training time for the entire network; "same bsz" means using the same batch size of 48; "same mem" means using the same GPU memory budget of 12GB with the maximum possible batch size. Results from prior work are taken from [19]. In the setting with a fixed batch size ("same bsz"), training time per batch in seconds is reported. In the setting with a fixed memory budget ("same mem"), training time per instance in seconds is reported.

| Method | Perplexity | Out layer (same bsz) | All layers (same bsz) | Out layer (same mem) | All layers (same mem) |
|---|---|---|---|---|---|
| Softmax | 42.77 | 53 | 119 | 3.2 | 7.0 |
| MoS-7 | 37.10 | 794 | 856 | 52.5 | 59.8 |
| Mixtape | 36.52 | 114 | 170 | 8.6 | 11.3 |
| - no sharing | 36.77 | 364 | 414 | 34.3 | 48.6 |

Table 4: Perplexity and training time comparison on One Billion Word. Text abbreviations are the same as in Table 3. In the setting with a fixed batch size ("same bsz"), we use a batch size of 20. Results of Softmax and MoS-7 are taken from [19].

| Method | BLEU | Out layer (same bsz) | All layers (same bsz) | Out layer (same mem) | All layers (same mem) |
|---|---|---|---|---|---|
| Softmax | 29.0 | 2.16 | 18.15 | 5.4 | 37.0 |
| MoS-9 | 29.5* | 14.36 | 30.08 | 61.1 | 97.9 |
| Mixtape | 29.3 | 5.83 | 21.48 | 17.6 | 49.8 |

Table 5: BLEU and training time comparison on WMT'14 En-De. Text abbreviations are the same as in Table 3. * indicates results taken from [10]. In the setting with a fixed batch size ("same bsz"), we use a batch size of 256 and report the training time per 100 batches in seconds. In the setting with a fixed memory budget ("same mem"), training time per instance in milliseconds is reported.

| Method | BLEU | Out layer (same bsz) | All layers (same bsz) | Out layer (same mem) | All layers (same mem) |
|---|---|---|---|---|---|
| Softmax | 43.0 | 2.43 | 24.06 | 15.0 | 159.3 |
| MoS-9 | 42.1* | 7.90 | 29.05 | 254.0 | 936.0 |
| Mixtape | 43.9 | 4.88 | 26.81 | 40.5 | 197.2 |

Table 6: BLEU and training time comparison on WMT'14 En-Fr. Text abbreviations are the same as in Table 3. The training times for the different settings are reported in the same way as in Table 5.

There are three main messages delivered in these results.

First, compared to softmax, Mixtape is comparably efficient while being more accurate at language modeling and translation. On tasks with normal vocabulary sizes, including Penn Treebank, WMT'14 En-De, and WMT'14 En-Fr, a Mixtape-based network is only 5% to 18% slower than a softmax-based network given the same batch size, and only 20% to 34% slower given the same memory budget. Even on One Billion Word with a 100K vocabulary, a Mixtape-based network is only 60% slower than a softmax-based network. On the other hand, Mixtape improves the perplexity over softmax by 2.8 points and 6.25 points on Penn Treebank and One Billion Word respectively. On the translation tasks, Mixtape improves the BLEU scores from 29.0 to 29.3 on En-De and from 43.0 to 43.9 on En-Fr.

Second, compared to MoS, Mixtape achieves similar or better performance in perplexity and BLEU while being much more efficient. Mixtape is 1.6x to 11.5x faster than MoS given the same batch size and 3.5x to 10.5x faster given the same memory budget. The speedup is usually more significant under the memory budget constraint, demonstrating that the ability to save memory also contributes to the efficiency of Mixtape.
Mixtape has better performance than MoS on translation and comparable performance on language modeling.

Third, gate sharing substantially reduces the computational cost without sacrificing accuracy. In Tables 3 and 4, the perplexities of Mixtape with and without gate sharing differ only negligibly. Gate sharing improves the speed by 4.3x and 4.0x on Penn Treebank and One Billion Word respectively given the same memory budget. The speedup is 3.5x and 3.2x given the same batch size. This indicates that gate sharing reduces the memory cost as well as the training time per forward-backward pass.

5 Conclusions

We propose Mixtape to break the softmax bottleneck more efficiently. Compared to MoS, Mixtape is more computationally efficient. Compared to softmax, Mixtape has comparable efficiency and is superior in terms of accuracy. Based on the above results, it is possible that Mixtape can be used as a plug-and-play layer to improve conditional and unconditional text generation in general. In the future, it will be intriguing to investigate further applications of Mixtape.

References

[1] Karim Ahmed, Nitish Shirish Keskar, and Richard Socher. Weighted transformer network for machine translation. arXiv preprint arXiv:1711.02132, 2017.

[2] Rami Al-Rfou, Dokook Choe, Noah Constant, Mandy Guo, and Llion Jones. Character-level language modeling with deeper self-attention. arXiv preprint arXiv:1808.04444, 2018.

[3] Alexei Baevski and Michael Auli. Adaptive input representations for neural language modeling. arXiv preprint arXiv:1809.10853, 2018.

[4] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[5] Zihang Dai, Zhilin Yang, Yiming Yang, William W. Cohen, Jaime Carbonell, Quoc V. Le, and Ruslan Salakhutdinov. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

[6] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

[7] Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381, 2018.

[8] Edouard Grave, Armand Joulin, Moustapha Cissé, Hervé Jégou, et al. Efficient softmax approximation for GPUs. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pages 1302–1310. JMLR.org, 2017.

[9] Sekitoshi Kanai, Yasuhiro Fujiwara, Yuki Yamanaka, and Shuichi Adachi. Sigsoftmax: Reanalysis of the softmax bottleneck. In Advances in Neural Information Processing Systems, pages 284–294, 2018.

[10] Xiang Kong, Qizhe Xie, Zihang Dai, and Eduard Hovy. Fast and simple mixture of softmaxes with BPE and hybrid-LightRNN for language generation. arXiv preprint arXiv:1809.09296, 2018.

[11] Stephen Merity, Nitish Shirish Keskar, and Richard Socher. Regularizing and optimizing LSTM language models. arXiv preprint arXiv:1708.02182, 2017.

[12] Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černocký, and Sanjeev Khudanpur. Recurrent neural network based language model.
In Eleventh Annual Conference of the International Speech Communication Association, 2010.

[13] Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. Scaling neural machine translation. arXiv preprint arXiv:1806.00187, 2018.

[14] Tim Salimans and Durk P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

[15] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.

[16] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.

[17] Noam Shazeer, Youlong Cheng, Niki Parmar, Dustin Tran, Ashish Vaswani, Penporn Koanantakool, Peter Hawkins, HyoukJoong Lee, Mingsheng Hong, Cliff Young, et al. Mesh-TensorFlow: Deep learning for supercomputers. In Advances in Neural Information Processing Systems, pages 10435–10444, 2018.

[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[19] Zhilin Yang, Zihang Dai, Ruslan Salakhutdinov, and William W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. arXiv preprint arXiv:1711.03953, 2017.
", "award": [], "sourceid": 3094, "authors": [{"given_name": "Zhilin", "family_name": "Yang", "institution": "Recurrent AI"}, {"given_name": "Thang", "family_name": "Luong", "institution": "Google Brain"}, {"given_name": "Russ", "family_name": "Salakhutdinov", "institution": "Carnegie Mellon University"}, {"given_name": "Quoc", "family_name": "Le", "institution": "Google"}]}