{"title": "GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking", "book": "Advances in Neural Information Processing Systems", "page_first": 10988, "page_last": 10998, "abstract": "Model compression is essential for serving large deep neural nets on devices with limited resources or applications that require real-time responses. For advanced NLP problems, a neural language model usually consists of recurrent layers (e.g., using LSTM cells), an embedding matrix for representing input tokens, and a softmax layer for generating output tokens. For problems with a very large vocabulary size, the embedding and the softmax matrices can account for more than half of the model size. For instance, the bigLSTM model achieves state-of-the-art performance on the One-Billion-Word (OBW) dataset with around 800k vocabulary, and its word embedding and softmax matrices use more than 6GBytes space, and are responsible for over 90\\% of the model parameters. In this paper, we propose GroupReduce, a novel compression method for neural language models, based on vocabulary-partition (block) based low-rank matrix approximation and the inherent frequency distribution of tokens (the power-law distribution of words). We start by grouping words into $c$ blocks based on their frequency, and then refine the clustering iteratively by constructing weighted low-rank approximation for each block, where the weights are based the frequencies of the words in the block. The experimental results show our method can significantly outperform traditional compression methods such as low-rank approximation and pruning. On the OBW dataset, our method achieved 6.6x compression rate for the embedding and softmax matrices, and when combined with quantization, our method can achieve 26x compression rate without losing prediction accuracy.", "full_text": "GroupReduce: Block-Wise Low-Rank Approximation\n\nfor Neural Language Model Shrinking\n\nPatrick H. 
Chen\u2217\n\nUCLA\n\nLos Angeles, CA\n\npatrickchen@g.ucla.edu\n\nSi Si\n\nGoogle Research\nMountain View, CA\n\nsisidaisy@google.com\n\nYang Li\n\nGoogle Research\nMountain View, CA\nliyang@google.com\n\nCiprian Chelba\nGoogle Research\nMountain View, CA\n\nciprianchelba@google.com\n\nCho-Jui Hsieh\n\nUCLA\n\nLos Angeles, CA\n\nchohsieh@cs.ucla.edu\n\nAbstract\n\nModel compression is essential for serving large deep neural nets on devices with\nlimited resources or applications that require real-time responses. As a case study, a\nneural language model usually consists of one or more recurrent layers sandwiched\nbetween an embedding layer used for representing input tokens and a softmax\nlayer for generating output tokens. For problems with a very large vocabulary\nsize, the embedding and the softmax matrices can account for more than half of\nthe model size. For instance, the bigLSTM model achieves great performance\non the One-Billion-Word (OBW) dataset with around 800k vocabulary, and its\nword embedding and softmax matrices use more than 6GBytes space, and are\nresponsible for over 90% of the model parameters. In this paper, we propose\nGroupReduce, a novel compression method for neural language models, based\non vocabulary-partition (block) based low-rank matrix approximation and the\ninherent frequency distribution of tokens (the power-law distribution of words).\nThe experimental results show our method can signi\ufb01cantly outperform traditional\ncompression methods such as low-rank approximation and pruning. On the OBW\ndataset, our method achieved 6.6 times compression rate for the embedding and\nsoftmax matrices, and when combined with quantization, our method can achieve\n26 times compression rate, which translates to a factor of 12.8 times compression\nfor the entire model with very little degradation in perplexity.\n\nIntroduction\n\n1\nDeep neural nets with a large number of parameters have a great capacity for modeling complex\nproblems. 
However, the large size of these models is a major obstacle for serving them on-device\nwhere computational resources are limited. As such, compressing deep neural nets has become a\ncrucial problem that draws an increasing amount of interest from the research community. Given\na large neural net, the goal of compression is to build a light-weight approximation of the original\nmodel, which can offer a much smaller model size while maintaining the same (or similar) prediction\naccuracy.\nIn this paper, we focus on compressing neural language models, which have been successfully\napplied in a range of important NLP tasks including language modeling (e.g., next word prediction)\n\n\u2217Work is done when interning at Google.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fand machine translation. A neural language model often consists of three major components: one\nor more recurrent layers (often using LSTM), an embedding layer for representing input tokens,\nand a softmax layer for generating output tokens. The dimension of recurrent layers (e.g., LSTM),\nwhich corresponds to the hidden state, is typically small and independent of the vocabulary size\nof input/output tokens. In contrast, the dimension of the embedding and the softmax layers grow\nwith the vocabulary size, which can easily be at the scale of hundreds of thousands. As a result, the\nparameter matrices of the embedding and softmax layers are often responsible for the major memory\nconsumption of a neural language model. For example, DE-EN Neural Machine Translation task has\nroughly a vocabulary size around 30k and around 80% of the memory is used to store embedding\nand softmax matrices. Furthermore, the One Billion Word language modeling task has a vocabulary\nsize around 800k, and more than 90% of the memory footprint is due to storing the embedding and\nsoftmax matrices. 
Therefore, to reduce the size of a neural language model, it is highly valuable to\ncompress these layers, which is the focus of our paper.\nThere have been extensive studies for compressing fully connected and convolutional networks [20,\n5, 7, 6, 25, 27, 9]. The mainstream algorithms from these work such as low-rank approximation,\nquantization, and pruning can also be directly applied to compress the embedding and softmax\nmatrices. However, it has been reported in previous papers that these algorithms, though ef\ufb01cient for\nCNN compression, are not able to achieve a good compression rate for word embedding matrices.\nFor instance, [9] proposed a very successful quantization method for CNNs, but for language models\nthe compression rate is less than 3 times.\nOne important aspect that has not been well explored in the literature is that the embedding matrix\nhas several speci\ufb01c properties that do not exist in a general weight matrix of CNNs. Each column of\nthe input embedding and softmax matrix represents a token, which implies that on a given training or\ntest set the parameters in that column are used with a frequency which obeys Zipf\u2019s law distribution.\nBy exploiting these structures, we propose GroupReduce, a novel method for compressing the\nembedding and softmax matrices using block-wise, weighted low-rank approximation. Our method\nstarts by grouping words into blocks based on their frequencies, and then re\ufb01nes the clustering\niteratively by constructing weighted low-rank approximation for each block. This allows word\nvectors to be projected into a better subspace during compression. Our experiments show that\nGroupReduce is more effective than standard low-rank approximation methods for compressing these\nlayers. 
It is easy to implement and can handle very large embedding and softmax matrices.
Our method achieves good performance on compressing a range of benchmark models for language modeling and neural machine translation tasks, and outperforms previous methods. For example, on the DE-EN NMT task, our method achieves a 10 times compression rate on the embedding and softmax matrices without much degradation of performance; this improves to a 24 times compression rate when combined with a quantization scheme. On the One Billion Word dataset, our method achieves a 6.6 times compression rate on the embedding and softmax matrices, which originally occupy more than 6GB, and more than a 26 times compression rate when combined with quantization, while maintaining similar perplexity.

2 Related Work

2.1 Model Compression for CNN

Low-rank matrix/tensor factorization. To compress a deep net, a natural direction is to approximate each of its weight matrices, W, by a low-rank approximation computed with SVD. Based on this idea, [20] compressed the fully connected layers in neural nets. For convolution layers, the kernels can be viewed as 3D tensors, so [10, 5] applied higher-order tensor decomposition to compress CNNs. In the same vein, [8] developed another structural approximation, and [12] proposed an algorithm to select the rank for each layer. More recently, [27] reconstructed the weight matrices using a sparse-plus-low-rank approximation.

Pruning. Pruning algorithms remove unimportant weights from deep neural nets, which requires defining the importance of each weight. For example, [15] showed that the importance can be estimated using the Hessian of the loss function. 
[7] considered adding ℓ1 or ℓ2 regularization and applied iterative thresholding to achieve very good compression rates. Later, [6] demonstrated that CNNs can be compressed by combining pruning, weight sharing and quantization.

Quantization. Storing parameters in lower-precision representations is another common route to model compression. [9] showed that a simple uniform quantization scheme can effectively reduce both the model size and the prediction time of a deep neural net, and [16] showed that non-uniform quantization can further improve performance. Several advanced quantization techniques have recently been proposed for CNN compression [26, 4].

2.2 Model Compression for RNN/LSTM

Although model compression has been studied extensively for CNN models, fewer works have focused on compressing recurrent neural nets (RNNs), another widely-used category of deep models in NLP applications. Since an RNN involves a collection of fully connected layers, many of the aforementioned approaches apply naturally. For example, [9] applied their quantization and retraining procedure to compress an LSTM (a popular type of RNN) language model on the Penn Tree Bank (PTB) dataset. [24] applied a matrix/tensor factorization approach to compress the transition matrices of LSTM and GRU cells, and tested their algorithm on image and music classification problems (which do not need word embedding matrices). [19, 17] proposed pruning algorithms for compressing LSTM models.
Among the previous work, we found only [9, 17] tried to compress the word embedding matrix in NLP applications. [9] showed that the quantization-plus-retraining approach can achieve less than a 3 times compression rate on PTB data with no performance loss. [17] showed that for word-level LSTM models, pruning can only reach 87% sparsity at the cost of more than 5% performance loss. 
Since this approach also needs to store the indices of the non-zero locations, the pruned model still holds roughly 26% of the original parameters. Very recently, [14] compressed the word embeddings computed by the word2vec algorithm and applied them to similarity/analogy tasks and Question Answering. [21] applied compositional coding to compress the input embedding matrix of an LSTM, but it is challenging to compress the softmax (output) layer matrix with the same algorithm, so the overall compressed model from this approach remains large. One main issue of the approach is that multiple words share the same coding, which makes these words indistinguishable in the output layer during inference.
These previous results indicate that compressing embedding matrices in natural language tasks is a difficult problem: it is extremely challenging to achieve a 4 times compression rate without sacrificing performance. In this paper, we show that by exploiting the inherent structure of natural languages, rather than treating the embedding or softmax parameters as a generic matrix, the GroupReduce algorithm achieves much better compression rates.

3 Proposed Algorithms

We now introduce a novel algorithm for compressing both the embedding and the softmax layer, the two major components of a neural language model discussed earlier. Assume the word embedding matrix has size N-by-D, where N is the vocabulary size and D is the embedding dimension. We use A ∈ R^{N×D} to denote the embedding matrix (either the input or the softmax layer); each row of A is the embedding vector of a word, i.e., the vector representation of that word.
Our goal is to compress the embedding matrix A so that it uses less memory while achieving similar prediction performance. 
For a typical language model, especially one with a large vocabulary, most of the memory is spent storing the input and output word embedding matrices. Table 1 shows an anatomy of memory consumption for several classic models trained on publicly available datasets. For three of the four setups, the embedding matrices contribute more than 75% of the overall memory usage; in the bigLSTM model that achieved state-of-the-art performance on OBW, more than 90% of the memory is used to store the two (input and output) word embedding matrices. For such models, the main obstacle to on-device serving is the enormous memory usage of the word embedding matrices, so compressing them is highly valuable.
Given a word embedding matrix A, a standard way to compress it while preserving the information is low-rank approximation via singular value decomposition (SVD), which yields the best rank-k approximation:

    A ≈ U S V^T,    (1)

Figure 1: Illustration on the Penn Treebank (PTB) dataset, with a vocabulary size of 10k and an embedding dimension of 1500. (a) Log word frequency vs. word rank, where a word's rank is defined as the log of the number of words that occur less often than it; the power-law distribution of word frequency is clearly visible. (b) Singular values of the two embedding matrices (input embedding layer and softmax layer) vs. approximation rank; the singular values remain very large. (c) Low-rank reconstruction error based on singular value decomposition for the two embedding matrices. 
This further suggests that vanilla SVD may not work well for the embedding matrix.

Table 1: The size of each layer in the model. The number in parentheses is the ratio relative to the entire model size.

Models        vocabulary size  dimension  model size  embedding layer(s)  softmax layer  LSTM cell
PTB-Small     10k              200        17.7MB      7.6MB (42.9%)       7.6MB (42.9%)  2.5MB (14.2%)
PTB-Large     10k              1500       251MB       57MB (22.7%)        57MB (22.7%)   137MB (54.6%)
NMT: DE-EN    30k              500        195MB       115MB (59.0%)       47MB (24.1%)   33MB (16.9%)
OBW-BigLSTM   793k             1024       6.8GB       3.1GB (45.6%)       3.1GB (45.6%)  0.6GB (8.8%)

In Eq. (1), U ∈ R^{N×k} and V ∈ R^{D×k} with k < min(D, N) the target rank, and S is a diagonal matrix of singular values. After the rank-k low-rank approximation, the memory footprint of A drops from O(ND) to O(Nk + Dk).
There are two issues with using vanilla SVD to compress an embedding matrix. First, an embedding matrix is not necessarily low-rank: Figure 1(b) shows that all the singular values of the PTB word embedding matrices are quite large, which leads to the poor reconstruction error of low-rank approximation in Figure 1(c). Second, SVD treats A as a generic matrix, but each row of A is the embedding of a word, an additional structure we can exploit in the language model setting.

3.1 The Word Frequency Matters

One important statistical property of natural language is that the distribution of word frequencies can be approximated by a power law: a small fraction of words occur many times, while many words appear only a few times. Figure 1(a) shows the power-law distribution of word frequency in the PTB dataset.
None of the previous compression methods takes word frequency into consideration when approximating the embedding matrix. 
Intuitively, to construct a good compressed model with low-rank approximation under a limited memory budget, more frequent words should receive better approximations. In this paper, we consider two strategies to exploit frequency information in low-rank approximation: weighted low-rank approximation and block low-rank approximation.

3.2 Improved Low-rank Approximation by Exploiting Frequency

Weighted low-rank approximation. First, we introduce a weighted low-rank approximation to compress the embedding matrix A. It replaces the original SVD and serves as the basic building block of our proposed algorithm. The main idea is to assign a different weight to each word's approximation, penalizing errors on higher-frequency words more heavily when constructing the low-rank approximation.

Figure 2: Illustration of our method. Given an embedding matrix A in (a), we first group the words by their frequency (step (b)), and then perform weighted SVD inside each group as in Eq. (2) (step (c)). Finally, we refine the clustering by considering the low-rank reconstruction error of each word as in Eq. (5) (step (d)).

Mathematically, letting q_i be the i-th word's frequency, we approximate the embedding A by minimizing

    min_{U ∈ R^{N×k}, V ∈ R^{D×k}}  Σ_{i=1}^{N} Σ_{j=1}^{D} q_i (A_ij − U_i V_j^T)^2,    (2)

where k is the reduced rank, A_ij is the j-th feature of the i-th word, and U_i and V_j are the i-th and j-th rows of U and V respectively. Note that we do not require U, V to be orthonormal. Although weighted SVD with general element-wise weights has no closed-form solution [23], in our case all elements in the same row of A share the same weight, which admits a simple solution. Define Q = diag(√q_1, . . . , √q_N); then the optimization problem (2) is equivalent to

    min_{U ∈ R^{N×k}, V ∈ R^{D×k}}  ‖QA − QUV^T‖_F^2.    (3)

Therefore, assuming all q_i are nonzero, we can solve (2) by a low-rank approximation of QA: if [Ū, S̄, V̄] = svd(QA), then (U*, V*) = (Q^{-1} Ū S̄, V̄) is a solution of (2). Solving Eq. (2) is thus easy, with the solution obtained directly from the SVD of QA.

Block low-rank approximation. As can be seen from Figure 1(b), the embedding matrix is in general not low-rank. Instead of constructing one low-rank approximation for the entire matrix, we consider a block-wise low-rank approximation in which each block has its own approximation, achieving better compression. A similar strategy has been exploited in [22] for kernel approximation (a symmetric PSD matrix). Mathematically, suppose we partition the words into c disjoint blocks V_1, · · · , V_c, where each V_p contains a set of words. For each block V_p, with corresponding embedding rows A_{V_p}, we generate a low-rank approximation with rank k_p, A_{V_p} ≈ U^p (V^p)^T. The block low-rank approximation of A is then

    A = [A_{V_1}; A_{V_2}; · · · ; A_{V_c}] ≈ [U^1(V^1)^T; U^2(V^2)^T; · · · ; U^c(V^c)^T].    (4)

The challenge in Eq. (4) is how to construct the clustering structure. Intuitively, we want words of similar frequency grouped in the same block, so that we can assign different ranks to different blocks based on their average frequency: clusters of higher-frequency words receive a larger rank budget for better approximation. Meanwhile, we want the approximation error to be small under the same memory budget. Therefore, in this paper we consider two factors, word frequency and reconstruction quality, when constructing the partition. 
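The closed-form solution of Eqs. (2)-(3) can be sketched as follows (our rendering with our own variable names, not the authors' code; it assumes all frequencies are strictly positive):

```python
import numpy as np

def weighted_svd(A, q, k):
    """Solve min_{U,V} sum_i q_i * ||A_i - U_i V^T||^2 (Eq. 2) via the SVD of QA,
    Q = diag(sqrt(q)): if [Ub, Sb, Vb] = svd(QA), then U* = Q^{-1} Ub Sb, V* = Vb."""
    sq = np.sqrt(np.asarray(q, dtype=float))        # assumes all q_i > 0
    Ub, Sb, Vbt = np.linalg.svd(sq[:, None] * A, full_matrices=False)
    return (Ub[:, :k] * Sb[:k]) / sq[:, None], Vbt[:k].T

rng = np.random.default_rng(1)
A = rng.standard_normal((200, 32))
q = np.geomspace(1000.0, 1.0, 200)                  # power-law-like frequencies
U, V = weighted_svd(A, q, 8)                        # A ~ U @ V.T, frequent rows favored
```

By construction, the weighted objective of (U, V) is no larger than that of factors obtained from an unweighted SVD of the same rank, since the latter are a feasible rank-k candidate.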
Next, we explain how to construct the partition.
Block weighted low-rank approximation. To account for both approximation quality and frequency information when forming the block structure in Eq. (4), we propose to initialize the blocks by frequency grouping and then refine them to achieve lower reconstruction error. In the refinement stage, we move words between blocks while simultaneously learning a clustering structure and a low-rank approximation inside each cluster of the word embedding matrix.

Table 2: PTB-Small with 5 blocks and a 5 times compression rate. We add the proposed strategies one by one to see the effectiveness of each, using perplexity as the performance metric. Note that in practice, when applying GroupReduce, we keep a certain percentage of the most frequent words uncompressed; the numbers in this table are obtained without preserving any frequent words.

vanilla SVD  weighted SVD  block SVD  block weighted SVD  block weighted SVD with dynamic rank  refinement
161.44       155.10        143.88     135.19              129.63                                127.26

Mathematically, given an embedding matrix A, we first initialize the blocks by frequency grouping, and then jointly learn both the clustering V_1, V_2, · · · , V_c and the low-rank factors U^p, V^p of each block by minimizing the clustering objective

    min_{{V_p}, {U^p}, {V^p}}  Σ_{p=1}^{c} ‖Q_{V_p} A_{V_p} − Q_{V_p} U^p (V^p)^T‖_F^2,    (5)

where Q_{V_p} = diag_{j∈V_p}(√q_j). Intuitively, the inner term minimizes the weighted low-rank approximation error within one cluster, while the outer sum searches over partitions to minimize the overall reconstruction error.
Optimization: Eq. (5) is non-convex. In this paper, we use alternating minimization to minimize the above objective. 
When the cluster assignment is fixed, we use weighted SVD to solve for U^p and V^p for each A_{V_p}: as described for Eq. (2), we perform SVD on Q_{V_p} A_{V_p} to obtain the approximation, with the same time complexity as a traditional SVD on A_{V_p}. To find the clustering structure, we first initialize the assignment by frequency and then refine the block structure by moving words from one cluster to another whenever the move decreases the reconstruction error in Eq. (5). To compute the error reduction, we project each A_i onto each basis V^p and measure how much the reconstruction error improves: if

    ‖A_i − V^p (V^p)^T A_i‖ > ‖A_i − V^p̄ (V^p̄)^T A_i‖,    (6)

we move the i-th word A_i from the p-th cluster to the p̄-th cluster. This strategy decreases the reconstruction error.
Figure 2 illustrates the overall GroupReduce algorithm. First, we group the words into c blocks based on frequency. After that, we perform the weighted low-rank approximation of Eq. (2) for each block, and then solve Eq. (5) to iteratively refine the clusters and obtain a block-wise approximation based on reconstruction error.
Some implementation details of Algorithm 1 deserve mention. After the initial grouping, we assign each block a rank proportional to the average frequency of the words inside it: if the block with the smallest average frequency f_c is assigned rank r, then cluster p with average frequency f_p receives rank (f_p / f_c) · r. 
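The alternating procedure described above, weighted SVD per block, frequency-proportional ranks, and error-driven reassignment, can be sketched as follows. This is a simplified rendering under our own variable names, not the authors' implementation; in particular the rank cap and the empty-block guard are our simplifications:

```python
import numpy as np

def weighted_lowrank(A, q, k):
    # min sum_i q_i ||A_i - U_i V^T||^2  ==  truncated SVD of diag(sqrt(q)) @ A
    sq = np.sqrt(q)[:, None]
    Ub, Sb, Vbt = np.linalg.svd(sq * A, full_matrices=False)
    return (Ub[:, :k] * Sb[:k]) / sq, Vbt[:k].T

def group_reduce(A, freq, c=4, r=2, t_max=5):
    """Sketch of the alternating procedure: frequency-based init, weighted SVD
    per block, dynamic ranks, and reassignment by reconstruction error (Eq. 6)."""
    N, D = A.shape
    assign = np.empty(N, dtype=int)
    for p, idx in enumerate(np.array_split(np.argsort(-freq), c)):
        assign[idx] = p                               # init: group by frequency
    ranks = np.full(c, r)
    for _ in range(t_max):
        # dynamic rank assignment: rank_p proportional to mean block frequency
        means = np.array([freq[assign == p].mean() for p in range(c)])
        ranks = np.minimum(np.maximum(1, np.round(r * means / means.min()).astype(int)),
                           min(N, D))
        Vs = [weighted_lowrank(A[assign == p], freq[assign == p], ranks[p])[1]
              for p in range(c)]
        # move each word to the basis that reconstructs it best
        errs = np.stack([np.linalg.norm(A - A @ V @ V.T, axis=1) for V in Vs], axis=1)
        new = errs.argmin(axis=1)
        if (new == assign).all() or any((new == p).sum() == 0 for p in range(c)):
            break                                     # converged, or a block emptied
        assign = new
    parts = [np.flatnonzero(assign == p) for p in range(c)]
    factors = [weighted_lowrank(A[idx], freq[idx], ranks[p])
               for p, idx in enumerate(parts)]
    return assign, parts, factors

rng = np.random.default_rng(2)
A = rng.standard_normal((300, 24))
freq = np.geomspace(5000.0, 1.0, 300)                 # power-law-like frequencies
assign, parts, factors = group_reduce(A, freq, c=3, r=3, t_max=4)
```

Note that the reassignment in Eq. (6) is valid because V^p consists of right singular vectors of Q_{V_p} A_{V_p} and thus has orthonormal columns, so V^p(V^p)^T is a projection.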
r is related to the budget requirement.\nThis dynamic rank assignment can signi\ufb01cantly boost the performance, as it assigns more ranks to\nhigh-frequency words and approximates them better.\nIn Table 2, we compare the effectiveness of different strategies in our algorithm. We test on PTB-\nSmall setting with statistics shown in Table 1. Every method in the table has the same compression\nrate, and we report perplexity number. We compare using vanilla SVD, weighted SVD, weighted\nSVD for each block (10 blocks), assigning different ranks for different blocks, and re\ufb01ning the blocks.\nWe can see that all the operations involved can improve the \ufb01nal performance and are necessary for\nour algorithm. The overall memory usage to represent A after our algorithm is O(N k + ckD), where\nN is the vocabulary size; c is the number of clusters; k the average rank of each cluster.\n\n4 Experiments\n4.1 Datasets and Pretrained Models\nWe evaluate our method (GroupReduce) on two tasks: language modeling (LM) and neural machine\ntranslation (NMT). For LM, we evaluate GroupReduce on two datasets: Penn Treebank Bank (PTB)\nand One-billion-Word Benchmark (OBW). 
Algorithm 1: GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking
Input: embedding matrix A; number of clusters c; smallest rank r; maximal number of iterations t_max; minimal size of the candidate set m_min
Output: compact representation Ā
1:  Initialize the word clusters V_1, V_2, · · · , V_c by clustering on word frequency;
2:  Compute the desired rank of each cluster from its average frequency and r;
3:  for p = 1, · · · , c do
4:      Compute the rank-k_p weighted low-rank approximation of each sub-matrix A_{V_p} as A_{V_p} ≈ U^p(V^p)^T;
5:  for t = 1, · · · , t_max do
6:      M = [];
7:      for i = 1, · · · , N do
8:          Compute the reconstruction errors of the i-th word, e_i^p = ‖A_i − V^p(V^p)^T A_i‖_2^2 for p = 1, · · · , c;
9:          Find the cluster with the smallest reconstruction error, g_i = argmin_{p=1···c} e_i^p;
10:         if g_i ≠ π_i (π_i is the current cluster index of the i-th word) then
11:             put i into the candidate set M;
12:     Choose the top m words in M with the smallest reconstruction error;
13:     Move those m words (we choose 10% in the paper) into the clusters with the smallest reconstruction error;
14:     if m < m_min then
15:         stop and output;
16:     for p = 1, · · · , c do
17:         if cluster V_p changed then
18:             recompute the rank-k_p weighted low-rank approximation from Eq. (2), A_{V_p} ≈ U^p(V^p)^T;
19: Output: Ā = [U^1(V^1)^T, · · · , U^c(V^c)^T]

OBW is introduced by [2], and it contains a vocabulary of 793,471 words, with the sentences shuffled and the duplicates removed. For NMT, we evaluate our method on the IWSLT 2014 German-to-English translation task [1]. On these three benchmark datasets, we compress four models, whose details are shown in Table 1. All four models use a 2-layer LSTM. 
Two of them (OBW and NMT) are based on existing model checkpoints; the other two (based on PTB) are trained from scratch due to the lack of publicly released checkpoints. We train a 2-layer LSTM-based language model on PTB from scratch in two setups, PTB-Small and PTB-Large, with LSTM hidden state sizes (and embedding sizes) of 200 and 1500 respectively. For OBW, we use the "2-LAYER LSTM-8192-1024" model shown in Table 1 of [11]. For NMT, we use the PyTorch checkpoint provided by OpenNMT [13] for the German-to-English translation task. We verified that all four models achieve the benchmark performance reported in the literature on the corresponding datasets, and then apply our method to compress them.
For experiments measured by BLEU, we report results whose BLEU score after compression is within 3 percent of the original score. For experiments measured by perplexity (PPL), such as on the PTB dataset, we likewise target at most a 3 percent drop; for OBW, which has a much larger vocabulary, we report results within 10 percent of the original PPL. For each method in Tables 3, 4 and 5, we tested various parameters and report the smallest model size fulfilling the above criteria. The compression rate and the corresponding performance naturally form a spectrum: the more we compress, the larger the performance drop. We plot this trade-off on PTB-Large in the supplementary material. The number of clusters also affects the compression rate; in the experiments, we set it to 5 for the PTB and IWSLT datasets and 20 for the OBW dataset. We show the performance of GroupReduce with different
We show the performance of GroupReduce with different\nnumbers of clusters under the PTB-Large setting in the supplementary.\nNote that the goal of this work is to compress an existing model to a signi\ufb01cantly-reduced size while\nmaintaining accuracy (e.g., perplexity or BLEU scores), rather than attempting to achieve higher\n\n7\n\n\fTable 3: Embedding compression results on three datasets comparing our method GroupReduce\nwith Low-rank and Pruning. Compression rate is compared to both input embedding and softmax\nlayer. For example, 10x means approximated embedding uses 10 times smaller memory compared to\noriginal input layer and softmax layer.\n\nModel\n\nPTB-Small\n\nPTB-Large\n\nEmbedding Memory\nPPL(before retrain)\nPPL(after retrain)\nEmbedding Memory\nPPL(before retrain)\nPPL(after retrain)\nOBW-bigLSTM Embedding Memory\nPPL(before retrain)\nPPL(after retrain)\nEmbedding Memory\nBLEU(before retrain)\nBLEU(after retrain)\n\nNMT: DE-EN\n\nMetric Original Low-rank\n2x\n117.11\n113.83\n5x\n84.63\n80.04\n2x\n39.41\n38.03\n3.3x\n29.65\n29.96\n\n1x\n112.28\n\u2013\n1x\n78.32\n\u2013\n1x\n31.04\n\u2013\n1x\n30.33\n\u2013\n\nPruning GroupReduce\n4x\n115.38\n113.81\n8x\n84.79\n79.83\n6.6x\n32.47\n32.50\n8x\n29.31\n29.96\n\n2x\n115.9\n113.78\n3.3x\n84.23\n78.38\n1.14x\n128.31\n84.11\n3.3x\n25.96\n29.34\n\naccuracy. It is possible that there are models that could achieve higher accuracy, in which case our\nmethod can be applied to compress these models as well.\n\n4.2 Comparison with Low-Rank and Pruning\nWe compare GroupReduce with two standard model compression strategies: low-rank approximation\nand pruning.These two techniques are widely used for language model compression, such as [17,\n19, 18] We compress both input embedding and softmax matrices. For the low-rank approximation\napproach, we perform standard SVD on the embedding and softmax matrices and obtain the low-rank\napproximation. 
For pruning, we set the entries whose magnitude falls below a certain threshold to zero. Note that storing the resulting sparse matrix requires the Compressed Sparse Row or Compressed Sparse Column format, so the memory usage is twice the number of non-zeros remaining after pruning. After approximation, we retrain the remaining parameters with an SGD optimizer at an initial learning rate of 0.1, decreasing the learning rate by an order of magnitude whenever the validation perplexity stops improving. As shown in Table 3, GroupReduce compresses both the input embedding and softmax layers 5-10 times without losing much accuracy. In particular, it achieves a 6.6 times compression on the language model trained on the OBW benchmark, saving more than 5 GB of memory.
Notice that GroupReduce achieves good results even before retraining. This is important, as retraining may be infeasible or slow to converge. We experimented with different learning rates and retrained for 100k steps (about 3 hours), but none of the retraining schemes for the OBW-bigLSTM model led to significant accuracy improvements after approximation. One reason is that retraining requires keeping the approximated embedding matrices fixed, re-initializing the other parameters, and training them from scratch as done in [21]; on OBW-bigLSTM this retraining would take more than 3 weeks, which is impractical when the goal is to compress a model within a short period of time. Performance before retraining is therefore important, and GroupReduce generally obtains good results there.

4.3 Comparison with Quantization

As noted in the related work, quantization has been shown to be a competitive model compression method [9]. We implement b-bit quantization by equally spacing the range of a matrix into 2^b intervals and using one value to represent each interval. 
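A minimal sketch of such a uniform quantizer (our illustration, not the paper's code; we snap each entry to the nearest of 2^b equally spaced levels, a close variant of the interval scheme described above, and assume b ≤ 8 and a non-constant matrix):

```python
import numpy as np

def quantize_uniform(W, b):
    """Uniform b-bit quantization: map each entry of W to one of 2^b equally
    spaced values spanning [W.min(), W.max()]. Codes fit in b bits each
    (stored as uint8 here for simplicity; assumes b <= 8, W not constant)."""
    lo, hi = float(W.min()), float(W.max())
    scale = (hi - lo) / (2 ** b - 1)
    codes = np.round((W - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize(codes, lo, scale):
    return lo + codes.astype(float) * scale

rng = np.random.default_rng(3)
W = rng.standard_normal((100, 50))
codes, lo, scale = quantize_uniform(W, 4)   # 4-bit: at most 16 distinct values
W_hat = dequantize(codes, lo, scale)        # round-off error is at most scale / 2
```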
For example, 4-bit quantization maps the original matrix to a matrix with at most 16 distinct values.
We point out that quantization is not orthogonal to other methods; in fact, GroupReduce can be combined with quantization to achieve a better compression rate. We first approximate the embedding or the softmax matrices by GroupReduce to obtain the low-rank matrices of each block, and then apply 4- or 8-bit quantization to these low-rank matrices. After retraining, quantized GroupReduce achieves at least 26 times compression for both the input embedding and the softmax matrix on OBW, as shown in Table 4. In addition, comparisons to other coding schemes, including deep compositional coding [21] and dictionary coding [3], are shown in the supplementary.

Table 4: Embedding compression results on three datasets comparing our method Quantized GroupReduce with traditional Quantization. 10x means the approximated embedding uses 10 times less memory than the original input embedding layer and softmax layer.

Model         Metric                 Original  Quantization  Quantized GroupReduce
PTB-Small     Embedding Memory       1x        6.4x          16x
              PPL (before retrain)   112.28    115.81        116.54
              PPL (after retrain)    –         114.14        114.39
PTB-Large     Embedding Memory       1x        6.4x          20x
              PPL (before retrain)   78.32     81.69         81.53
              PPL (after retrain)    –         79.22         78.61
OBW-bigLSTM   Embedding Memory       1x        6.4x          26x
              PPL (before retrain)   31.04     32.63         34.43
              PPL (after retrain)    –         33.86         33.60
NMT: DE-EN    Embedding Memory       1x        6.4x          32x
              BLEU (before retrain)  30.33     27.41         29.33
              BLEU (after retrain)   –         30.19         29.65

Table 5: Compression rate of overall model compression using Quantized GroupReduce.
The compression rates shown in columns 4-6 are relative to the corresponding part of the model.

Models        Original PPL/BLEU  PPL/BLEU after approximation  Input layer  Softmax layer  LSTM cell   Overall Compression
NMT: DE-EN    30.33 (BLEU)       29.68 (BLEU)                  24x (45.9%)  24x (31.8%)    4x (22.3%)  11.3x
OBW-BigLSTM   31.04 (PPL)        33.61 (PPL)                   26x (45.6%)  26x (45.6%)    2x (8.8%)   12.8x

4.4 Overall Compression

The results above show that GroupReduce is an effective compression method when frequency information is available. We point out that part of the model (e.g., the LSTM cells) cannot leverage this information, since the transition matrices in an LSTM cell do not correspond to the representation of a word. We adopt a simple quantized low-rank approximation to compress this part: we first compute the SVD of each LSTM matrix to obtain 2 times compression, and then quantize the entries of the low-rank factors to 16 bits, making this part of the model 4 times smaller in total. However, we found that for the OBW-bigLSTM model the LSTM matrices do not have a clear low-rank structure; even slight compression of the LSTM part causes a significant drop in performance. Therefore, we only apply 16-bit quantization on OBW-bigLSTM, which compresses the LSTM cells by 2 times. The overall compression rates are shown in Table 5. With the aid of GroupReduce, we achieve over 10 times compression on both the language modeling and the neural machine translation task.

5 Conclusion

In this paper, we propose a novel compression method for neural language models. Our method leverages the statistical properties of words in language to form block-wise low-rank matrix approximations for the embedding and softmax layers. Experimental results show that our method significantly outperforms traditional compression methods such as low-rank approximation and pruning.
In particular, on the OBW dataset, our method combined with quantization achieves a 26 times compression rate for both the embedding and softmax matrices, which saves more than 5 GB of memory. This provides practical benefits when deploying neural language models on memory-constrained devices. In future work, we will investigate different retraining schemes, such as training the block low-rank parameterization of the model end-to-end.

6 Acknowledgement

This research was mainly done during Patrick Chen's internship at Google Research. We also acknowledge support from NSF via IIS-1719097, an Intel faculty award, Google Cloud, and Nvidia.

References
[1] Mauro Cettolo, Jan Niehues, Sebastian Stücker, Luisa Bentivogli, and Marcello Federico. Report on the 11th IWSLT evaluation campaign, IWSLT 2014. In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, Vietnam, 2014.

[2] Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005, 2013.

[3] Yunchuan Chen, Lili Mou, Yan Xu, Ge Li, and Zhi Jin. Compressing neural language models by sparse word representations. arXiv preprint arXiv:1610.03950, 2016.

[4] Yoojin Choi, Mostafa El-Khamy, and Jungwon Lee. Universal deep neural network compression. CoRR, abs/1802.02271, 2018.

[5] Emily L. Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, and Rob Fergus. Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems, pages 1269–1277, 2014.

[6] Song Han, Huizi Mao, and William J. Dally. Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015.

[7] Song Han, Jeff Pool, John Tran, and William J. Dally.
Learning both weights and connections for ef\ufb01cient\n\nneural networks. CoRR, abs/1506.02626, 2015.\n\n[8] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco\nAndreetto, and Hartwig Adam. Mobilenets: Ef\ufb01cient convolutional neural networks for mobile vision\napplications. arXiv preprint arXiv:1704.04861, 2017.\n\n[9] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized\nneural networks: Training neural networks with low precision weights and activations. arXiv preprint\narXiv:1609.07061, 2016.\n\n[10] Max Jaderberg, Andrea Vedaldi, and Andrew Zisserman. Speeding up convolutional neural networks with\n\nlow rank expansions. arXiv preprint arXiv:1405.3866, 2014.\n\n[11] Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. Exploring the limits of\n\nlanguage modeling. arXiv preprint arXiv:1602.02410, 2016.\n\n[12] Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. Compression\n\nof deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.\n\n[13] Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M Rush. Opennmt: Open-source\n\ntoolkit for neural machine translation. arXiv preprint arXiv:1701.02810, 2017.\n\n[14] Maximilian Lam. Word2bits - quantized word vectors. arXiv preprint arXiv:1803.05651, 2018.\n\n[15] Yann LeCun, John S Denker, and Sara A Solla. Optimal brain damage. In Advances in neural information\n\nprocessing systems, pages 598\u2013605, 1990.\n\n[16] Darryl Lin, Sachin Talathi, and Sreekanth Annapureddy. Fixed point quantization of deep convolutional\n\nnetworks. In International Conference on Machine Learning, pages 2849\u20132858, 2016.\n\n[17] Ekaterina Lobacheva, Nadezhda Chirkova, and Dmitry Vetrov. Bayesian sparsi\ufb01cation of recurrent neural\n\nnetworks. arXiv preprint arXiv:1708.00077, 2017.\n\n[18] Zhiyun Lu, Vikas Sindhwani, and Tara N. 
Sainath. Learning compact recurrent neural networks. CoRR, abs/1604.02594, 2016.

[19] Sharan Narang, Erich Elsen, Gregory Diamos, and Shubho Sengupta. Exploring sparsity in recurrent neural networks. In ICLR, 2017.

[20] Tara N. Sainath, Brian Kingsbury, Vikas Sindhwani, Ebru Arisoy, and Bhuvana Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6655–6659. IEEE, 2013.

[21] Raphael Shu and Hideki Nakayama. Compressing word embeddings via deep compositional code learning. In ICLR, 2018.

[22] Si Si, Cho-Jui Hsieh, and Inderjit S. Dhillon. Memory efficient kernel approximation. J. Mach. Learn. Res., 32:701–709.

[23] Nathan Srebro and Tommi Jaakkola. Weighted low-rank approximations. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 720–727, 2003.

[24] Andros Tjandra, Sakriani Sakti, and Satoshi Nakamura. Compressing recurrent neural network with tensor train. In Neural Networks (IJCNN), 2017 International Joint Conference on, pages 4451–4458. IEEE, 2017.

[25] Jiaxiang Wu, Cong Leng, Yuhang Wang, Qinghao Hu, and Jian Cheng. Quantized convolutional neural networks for mobile devices. CoRR, abs/1512.06473, 2015.

[26] Yuhui Xu, Yongzhuang Wang, Aojun Zhou, Weiyao Lin, and Hongkai Xiong. Deep neural network compression with single and multiple level quantization. CoRR, abs/1803.03289, 2018.

[27] Xiyu Yu, Tongliang Liu, Xinchao Wang, and Dacheng Tao. On compressing deep models by low rank and sparse decomposition.
In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 67–76, 2017.