{"title": "Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 4943, "page_last": 4953, "abstract": "In extreme classification settings, embedding-based neural network models are currently not competitive with sparse linear and tree-based methods in terms of accuracy. Most prior works attribute this poor performance to the low-dimensional bottleneck in embedding-based methods. In this paper, we demonstrate that theoretically there is no limitation to using low-dimensional embedding-based methods, and provide experimental evidence that overfitting is the root cause of the poor performance of embedding-based methods. These findings motivate us to investigate novel data augmentation and regularization techniques to mitigate overfitting. To this end, we propose GLaS, a new regularizer for embedding-based neural network approaches. It is a natural generalization from the graph Laplacian and spread-out regularizers, and empirically it addresses the drawback of each regularizer alone when applied to the extreme classification setup. 
With the proposed techniques, we attain or improve upon the state-of-the-art on most widely tested public extreme classification datasets with hundreds of thousands of labels.", "full_text": "Breaking the Glass Ceiling for Embedding-Based\n\nClassi\ufb01ers for Large Output Spaces\n\nChuan Guo\u21e4\u2020\n\nCornell University\n\ncg563@cornell.edu\n\nAli Mousavi\u21e4\nGoogle Research\n\nalimous@google.com\n\nXiang Wu\u2020\nByteDance\n\nxiang.wu@bytedance.com\n\nDaniel Holtmann-Rice\n\nGoogle Research\ndhr@google.com\n\nSatyen Kale\n\nGoogle Research\n\nsatyenkale@google.com\n\nSashank Reddi\nGoogle Research\n\nsashank@google.com\n\nSanjiv Kumar\nGoogle Research\n\nsanjivk@google.com\n\nAbstract\n\nIn extreme classi\ufb01cation settings, embedding-based neural network models are\ncurrently not competitive with sparse linear and tree-based methods in terms of\naccuracy. Most prior works attribute this poor performance to the low-dimensional\nbottleneck in embedding-based methods. In this paper, we demonstrate that theo-\nretically there is no limitation to using low-dimensional embedding-based methods,\nand provide experimental evidence that over\ufb01tting is the root cause of the poor per-\nformance of embedding-based methods. These \ufb01ndings motivate us to investigate\nnovel data augmentation and regularization techniques to mitigate over\ufb01tting. To\nthis end, we propose GLaS, a new regularizer for embedding-based neural network\napproaches. It is a natural generalization from the graph Laplacian and spread-out\nregularizers, and empirically it addresses the drawback of each regularizer alone\nwhen applied to the extreme classi\ufb01cation setup. 
With the proposed techniques, we attain or improve upon the state-of-the-art on most widely tested public extreme classification datasets with hundreds of thousands of labels.\n\n\u21e4Equal Contribution\n\u2020Work done at Google\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n1 Introduction\n\nWe study the problem of multi-label classification with large output space, which has garnered significant attention in recent years [36, 6, 14, 3, 33, 23]. This problem differs from the traditional classification setting in that the number of labels is potentially in the millions, presenting significant computational challenges. Many real-world applications such as product recommendation and text retrieval can be formulated under this framework and thus, practical solutions to this problem can have significant and far-reaching impact.\nIn this unusual yet practical setting, both the number of input feature dimensions D and the number of labels K could be upwards of hundreds of thousands or even millions. This renders most traditional machine learning models, such as logistic regression and SVM, infeasible due to an excessive number of model parameters, approximately O(DK). Most recent approaches resort to using sparse linear models or tree-based methods in order to tackle this challenge [29, 23, 24, 34, 33]. An alternate approach to address this problem is through low-dimensional embeddings. Here, the model consists of an embedding function φ : R^D → R^d, where d is the embedding dimension, and a classifier f : R^d → {0, 1}^K. Thus, for any input x ∈ R^D, f(φ(x)) is the indicator vector or label vector of the predicted labels. 
To handle a large number of labels, the embedding dimension d is chosen to be\nsmall in comparison to D; thereby, signi\ufb01cantly reducing the number of model parameters.\nDespite their accomplishments in computer vision and natural language processing domains [17, 27],\nembedding-based deep neural networks (DNNs) have not achieved the same level of success in\nlearning with large output spaces. This point is often attributed to low-dimensional bottleneck layers\nin neural networks that cannot represent enough information for the downstream learning task when\nthe number of potential labels is substantially larger than the embedding dimensionality [24, 6, 32].\nAttempts to circumvent this limitation have been met with limited success [6, 31]. As a result, sparse\nlinear models and tree-based methods are favored in comparison to embedding-based methods for\nlarge-scale multi-label classi\ufb01cation problems.\nIn this paper, we investigate embedding-based methods for the problem of our interest. Our main\nobservation is that, contrary to the widespread belief of limited representation power, over\ufb01tting is\nthe cause for the inferior performance of embedding-based methods, which suggests that efforts to\neither augment the training set or regularize the model may dramatically boost test set performance.\nInspired by this, we show that a number of regularization techniques can shrink the generalization gap\nfor embedding-based methods and allow them to achieve, or improve upon, state-of-the-art accuracy\non a variety of widely tested public datasets. The most discernible improvement comes from a novel\nregularizer that promotes embeddings for frequently co-occurring labels to be close.\nContributions. In the light of this background, we state the following key contributions of this paper:\n1. We demonstrate experimentally that the main reason for the poor performance of neural network\nembedding-based models is over\ufb01tting. 
Our empirical observation is further supported by theoretical analysis, where we prove that there exists a low-dimensional embedding-based linear classifier with perfect accuracy in the limit of infinite expressivity of the embedding map. This shows that, contrary to speculations in existing literature, low-dimensional embeddings are sufficiently expressive and are not an inherent bottleneck.\n2. Based on this finding, we propose a suite of principled data augmentation and regularization techniques, including a novel regularizer called GLaS, to shrink the gap between training and test performance.\n3. Finally, on several widely tested public datasets, with our proposed techniques, we achieve state-of-the-art results with very simple network architectures and little tuning. We achieve high precision and propensity scores, thus demonstrating the effectiveness of our method even on infrequent tail labels. We also provide an ablation study to highlight the effectiveness of each individual factor. This provides a strong baseline and several new avenues for future research on applying embedding-based methods to the large output space setting.\n\n1.1 Related Work\n\nThere is a vast amount of literature on text classification; therefore, we only mention the works that are most relevant to our problem setting. Existing approaches to our problem setting can be broadly classified into three categories: (i) Embedding-based methods, (ii) Tree-based methods and (iii) Sparse and One-vs-all methods. We discuss these approaches briefly here.\nEmbedding-based methods learn a model of the form f(φ(x)) where φ(x) ∈ R^d and d is small. Embedding methods mainly differ in their choice of the functional form and approaches to learn the parameters of the function. A variety of approaches such as compressed sensing [12], Bloom filters [10], and SVD [36] are applied to train these models. 
While most of these approaches assume a linear functional form [7, 9, 18, 28], non-linear forms have also been proposed [6]. One criticism of embedding-based approaches is that label embeddings are compressed to a very small dimensionality d, which is believed to degrade performance greatly [24, 6]; these approaches are thus less favored for large-scale settings.\nTree-based methods learn a hierarchical structure over the label space and predict the path from the root to the target label [1, 15, 29, 24, 14, 22, 35]. While this greatly reduces inference time and the number of parameters that need to be learned, it typically comes at the cost of low prediction accuracy. Although the partitioning is traditionally done over the label set [29], more recent methods [24, 14] partition the feature space instead, relying on the assumption that only a small set of features is relevant for any label. These methods are heavily affected by the so-called cascading effect, where a prediction error at the top cannot be corrected at a lower level.\nSparse and One-vs-all methods restrict the model capacity and improve efficiency by applying sparse linear methods to learn only a small fraction of non-zero parameters. This allows the sparse model to be kept in main memory while ensuring that matrix-vector products can be carried out efficiently. Methods such as DiSMEC [3], ProXML [4], PD-Sparse [34] and PPD-Sparse [33] are representative of this strategy and have enjoyed great success recently. DiSMEC and PPD-Sparse are, in particular, highly parallelizable since they are based on the one-vs-all approach for training extreme multi-label classification models. However, these models are typically simple linear models and hence do not capture complex non-linear relationships.\n\n2 Discussion on Embedding-based Methods\n\nIn this section, we describe our problem setup more formally and investigate the validity of the criticism on embedding-based methods. 
The general learning problem of multi-label classification can be defined as follows. Given an input x ∈ X ⊂ R^D, its label y ∈ Y ⊂ {0, 1}^K is a K-dimensional vector with multiple non-zero entries, where y(k) = 1 if and only if label k is relevant for input x. Let L_y denote the set of indices that are non-zero in y. The elements of the set L_y are, hereafter, referred to as relevant labels in y. The number of distinct labels K is assumed to be large (on the order of hundreds of thousands or even millions). The goal of all embedding-based methods is to learn a model of the form f(φ(x)) : X → {0, 1}^K where φ(x) ∈ R^d with d ≪ D, K, and f : R^d → {0, 1}^K is a classifier on top of the embedding.\nThe most common form of f is a linear classifier. A linear classifier is parameterized by a label embedding matrix V ∈ R^{d×K} which is used to predict scores for all labels by computing φ(x)ᵀV. V is called a label embedding matrix since its columns can be interpreted as embeddings of the K labels in the same embedding space, R^d. In the following, for a label y, we will use the notation v_y to denote the embedding of y given by V, i.e., the y-th column of V. Depending on the specific formulation, the set of labels predicted for the input x can then be obtained by thresholding the scores at some value τ, i.e., {y : φ(x)ᵀv_y ≥ τ}, or taking the top m largest scores, i.e., Top(φ(x)ᵀV, m).\nThe use of a linear classifier on top of embeddings naturally leads to a low-rank structure for the score vectors of the labels: the set {φ(x)ᵀV : x ∈ X} has rank at most d. This restriction on the score vectors has frequently been cited as a reason for the poor performance of embedding-based approaches for extreme classification problems. However, several studies [31, 6] show that the set of label vectors violates the low-rank structure on large-scale datasets. 
We should note that the label\nvectors are generated by either thresholding the scores or taking the top m highest scores, which is a\nhighly non-linear transformation. Thus, it is not immediately clear if the low-rank structure of the\nscore vectors directly translates to a low-rank structure on the label vectors.\nThere have been efforts to tackle this presumed issue of embedding-based methods, primarily by using\na more complex \ufb01nal classi\ufb01er f than simple linear ones. For instance, Xu et al. [31] decomposed\nthe label matrix into a low-rank and a sparse part, where the sparse part captures tail labels as outliers.\nBhatia et al. [6] developed an ensemble of local distance preserving embeddings to predict tail labels.\nIn particular, they cluster data points into sub-regions and use a k-nearest neighbor classi\ufb01er in the\nlocally learned embedding space. However, these modern embedding-based approaches have several\ndrawbacks [3] and cannot outperform other approaches on all large-scale datasets.\nWhile most sparse linear and tree-based methods outperform embedding-based approaches, there has\nnot been any de\ufb01nitive proof that the inherent problem with embedding-based methods is their use\nof low-dimensional representations for the score vectors. To the contrary, we provide experimental\nevidence that a low-dimensional embedding produced by training a simple neural network extractor\ncan attain near-perfect training accuracy but generalize poorly, suggesting that over\ufb01tting is the root\ncause of the poor performance of embedding-based methods that has been reported in the literature. 
In fact, we will show that theoretically there is no limitation to using low-dimensional embedding-based methods, even with simple linear classifiers.\n\n2.1 Validity of Low-Dimensional Bottleneck Criticism\n\nWe first present a different perspective regarding embedding-based models, showing that their inferior performance in large output spaces is due to overfitting to the training set rather than an inability to represent the input-label relationship with low-dimensional label embeddings.\nLet φ_w be the embedding function parameterized by some vector w that takes as input x ∈ X and outputs a feature embedding φ_w(x) ∈ R^d. In practice, we may take φ_w to be a linear function φ_w(x) = wᵀx or a neural network with multiple linear layers and ReLU activation. We use a linear classifier on top of the embedding, parameterized by a matrix V ∈ R^{d×K}, whose columns give the label embeddings v_y for all labels y. Define the scoring function h : X → R^K as h(x) = φ_w(x)ᵀV. At training time, we sample an input-label pair (x, y) uniformly and compute the margin loss [20]:\n\nℓ(h(x), y) := Σ_{y ∈ L_y} Σ_{y' ∉ L_y} [h(x)_{y'} − h(x)_y + c]_+    (1)\n\nRecall that L_y denotes the set of indices that are non-zero in y. This loss encourages the scores for all relevant labels to be higher than the scores for irrelevant labels by a margin of c > 0. However, since the set of labels is large, computing this sum over the entire set is prohibitively expensive during training. Instead, we use a stochastic estimate of the loss by sampling a small subset of labels from L_y and computing the sum over that subset only. This loss function can be efficiently minimized using batched stochastic gradient descent. An alternative option is to use the so-called stochastic negative mining loss [25]. Algorithm 1 summarizes the training procedure.\nWe now illustrate the overfitting issue on this embedding-based model setup. 
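The margin loss in (1) and its sampled estimate can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `margin_loss` and `sampled_margin_loss`, the default margin c = 1.0, and the choice to subsample only the relevant labels while keeping every irrelevant label as a negative are assumptions made for the example.

```python
import numpy as np


def margin_loss(scores, relevant, c=1.0, positives=None):
    # scores: length-K vector h(x); relevant: label indices in L_y.
    # positives: optional subset of L_y to sum over (defaults to all of L_y).
    scores = np.asarray(scores, dtype=float)
    relevant = np.asarray(list(relevant), dtype=int)
    is_relevant = np.zeros(scores.shape[0], dtype=bool)
    is_relevant[relevant] = True
    if positives is None:
        positives = relevant
    pos = scores[np.asarray(list(positives), dtype=int)]
    neg = scores[~is_relevant]  # every label outside L_y is a negative
    # Hinge term [h(x)_{y'} - h(x)_y + c]_+ for each (relevant, irrelevant) pair.
    pairwise = neg[None, :] - pos[:, None] + c
    return float(np.maximum(pairwise, 0.0).sum())


def sampled_margin_loss(scores, relevant, rng, num_samples=1, c=1.0):
    # Stochastic estimate: sum only over a random subset of the relevant labels.
    relevant = np.asarray(list(relevant), dtype=int)
    size = min(num_samples, relevant.size)
    subset = rng.choice(relevant, size=size, replace=False)
    return margin_loss(scores, relevant, c=c, positives=subset)
```

With scores (2.0, 0.0, 1.5, -1.0) and L_y = {0}, only label 2 violates the unit margin, so the loss is 0.5; adding label 2 to L_y removes the violation and the loss drops to zero.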
Figure 1 shows the results of training our model on the AMAZONCAT-13K dataset. The statistics of this dataset are summarized in Table 5 in the supplementary material. The blue line shows that training accuracy continues to improve throughout optimization, culminating in near-perfect accuracy towards the end of training. We emphasize that this disputes the argument made by previous works that embedding-based models are ill-suited for this dataset due to the dimensionality constraint. However, Figure 1 also shows that our embedding-based model has severely overfitted to the training set. This observation highlights the need for regularization techniques to improve the performance of embedding-based methods.\n\nFigure 1: Training (blue) and test (red) accuracy of Alg. 1 on the AMAZONCAT-13K dataset. The non-regularized embedding-based method severely overfits to the training data.\n\nAlgorithm 1 Training the basic embedding model\n1: Input: Dataset {(x_1, y_1), . . . , (x_n, y_n)}\n2:   Feature embedding model φ_w : X → R^d\n3:   Label embedding matrix V ∈ R^{d×K}\n4:   Loss function ℓ : R^K × [K] → R\n5:   Learning rates η_w, η_V\n6: Initialize w, V\n7: repeat\n8:   Sample a batch x_1, . . . , x_B\n9:   Sample indices k_1, . . . , k_B uniformly from non-zero indices of y_1, . . . , y_B\n10:   Compute loss L ← (1/B) Σ_{i=1}^{B} ℓ(φ_w(x_i)ᵀV, k_i)\n11:   Compute gradients dL/dw and dL/dV via backpropagation\n12:   Update w ← w − η_w dL/dw, V ← V − η_V dL/dV\n13: until convergence\n\n2.2 Existence of Perfect Accuracy Low-Dimensional Embedding Classifiers\n\nWe further support our argument theoretically and demonstrate that the ability of embedding-based models to attain near-perfect accuracy is not limited to any specific dataset, but is feasible in general. 
We make the following mild assumption on the data: for every x there exists a unique label vector y = y(x), and the number of non-zero entries in y(x) is bounded by s ≪ K, i.e., the number of true labels associated with any feature vector is at most some small constant s. Under this assumption, the following result shows that low-dimensional embedding-based models do not suffer from an inability to represent the input-label relationship. The proof can be found in the supplementary material.\nTheorem 2.1. Let S ⊆ X be a sample set. Under the assumption on the data specified above, there exists a function φ : X → R^d, and a label embedding matrix V ∈ R^{d×K} such that:\n\n1. d = O(min{s log(K|S|), s^2 log(K)}).\n2. For every label y, we have ‖v_y‖_2 = 1.\n3. For all x ∈ S and y ∈ L_{y(x)}, we have φ(x)ᵀv_y ≥ 2/3.\n4. For all x ∈ S and y ∉ L_{y(x)}, we have φ(x)ᵀv_y ≤ 1/3.\n5. For every pair of labels y, y' with y ≠ y', we have v_yᵀv_{y'} ≤ √(2 log(4K^2)/d).\n6. For any x ∈ S, we have ‖φ(x)‖_2 = O(s (log(K)/d)^{1/4}).\n\nThis theorem shows that in the limit of infinite model capacity for constructing the embedding map, there exists a low-dimensional embedding-based linear classifier that thresholds at 1/2 and has perfect training accuracy. Furthermore, the label embeddings v_y are normalized to unit length. Since deep neural networks have been demonstrated to have excellent function approximation capabilities, this result naturally motivates a model architecture which uses a deep neural network to mimic the optimal infinitely expressive embedding map φ, followed by a linear classifier. Another consequence of the bound on the dimension in terms of |S| is that it shows how overfitting is possible with small training sets: the dependence of the dimension d on s improves to linear from quadratic at the price of a (mild) logarithmic factor in the size of the sample set. 
On the other hand, applying the theorem with S = X shows that d = O(s^2 log(K)) suffices to obtain a classifier with perfect test accuracy.\n\n3 Regularizing Embedding-Based Models\n\nMotivated by our findings, in this section we propose a novel regularization framework and discuss its effectiveness for the classification problem with large output spaces.\n\n3.1 Embedding Normalization\n\nWe first apply weight normalization proposed in [26]. In each layer, weight vectors of all output neurons share a single trainable length and each weight vector maintains its own trainable direction. Weight normalization not only helps stabilize training and accelerate convergence, but also improves generalization. For ease of exposition, we assume all label embeddings are ℓ2-normalized to unit norm, i.e., v_i ∈ S^{d-1}, where S^{d-1} denotes the unit sphere in R^d. In a similar vein, we can assume all input embeddings are normalized as well: φ_w(x) ∈ S^{d-1}. Our regularizer can be easily generalized to cases where the label embeddings are not unit norm.\n\n3.2 GLaS Regularizer\n\nIn large-scale multi-label classification, the output space is both large and sparse: most feature vectors are associated with only very few true labels. Thus it may be desirable for an embedding-based classifier to have near-orthogonal label embeddings, as suggested by Theorem 2.1. As a result, it is natural to consider regularizers such as spread-out [37] that explicitly promote such structure.\n\nSpread-out Regularization. Zhang et al. [37] introduced the spread-out regularization technique, which encourages local feature descriptors of images to be uniformly dispersed over the sphere. We consider a variant of spread-out regularization that brings the inner product of the embeddings of two different labels close to zero, i.e., v_yᵀv_{y'} ≈ 0 if y ≠ y'. 
More formally, the spread-out regularizer corresponds to the following:\n\nℓ_spreadout = (1/K^2) Σ_{y=1}^{K} Σ_{y'=1}^{K} (v_yᵀv_{y'})^2.    (2)\n\nNote that due to embedding normalization, diagonal entries v_yᵀv_y = 1 and hence these terms will not play a role in the regularization loss function in (2). Zhang et al. [37] have shown the effectiveness of this technique in learning good local feature descriptors for images. However, one major drawback of this regularizer is that it over-penalizes the embeddings of two different labels that occur frequently together (e.g., apple and fruit tend to co-occur for many inputs). In other words, embeddings of labels that co-occur frequently are also encouraged to be far apart, which is clearly undesirable.\n\nCorrecting Over-penalization: GLaS Regularization. The spread-out regularizer suffers from the lack of modeling the co-occurrences of labels. Thus, to correct for this over-penalization, we need to estimate the degree of co-occurrence between labels from training data and explicitly model it with the regularizer.\nLet Y ∈ {0, 1}^{n×K} be the training set label matrix where each row corresponds to a single training example. Let A = YᵀY so that A_{y,y'} = number of times labels y and y' co-occur, and let Z = diag(A) ∈ R^{K×K} be the matrix containing only the diagonal component of A. Observe that AZ^{-1} represents the conditional frequency of observing one label given the other. Indeed,\n\n(AZ^{-1})_{y,y'} = A_{y,y'} / A_{y',y'} = (number of times y and y' co-occur) / (number of times y' occurs) =: F(y|y').\n\nSimilarly, Z^{-1}A = (AZ^{-1})ᵀ contains the conditional frequencies in reverse, i.e., (Z^{-1}A)_{y,y'} = F(y'|y). 
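The conditional frequencies above can be computed directly from the label matrix. The sketch below is a dense NumPy illustration under stated assumptions: the function name `conditional_frequencies` is hypothetical, and the guard that replaces a zero label count by one (so that labels never seen in training do not cause a division by zero) is an addition for the example, not something the text specifies. For the label counts considered here, a sparse matrix representation would be needed in practice.

```python
import numpy as np


def conditional_frequencies(Y):
    # Y: (n, K) binary label matrix, one row per training example.
    Y = np.asarray(Y, dtype=float)
    A = Y.T @ Y                    # A[y, y'] = number of co-occurrences
    counts = np.diag(A).copy()     # diagonal of A = per-label frequency
    counts[counts == 0] = 1.0      # assumption: guard labels that never occur
    # Dividing column y' by A[y', y'] gives (A Z^{-1})[y, y'] = F(y | y').
    return A / counts[None, :]
```

The transpose of the returned matrix gives the reverse conditional frequencies F(y'|y), since A is symmetric.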
These conditional frequencies encode the degree of co-occurrence between labels y and y', and we would like their embeddings v_y and v_{y'} to reflect this co-occurrence pattern:\n\nℓ_GLaS = (1/K^2) ‖VᵀV − (1/2)(AZ^{-1} + Z^{-1}A)‖_F^2.    (3)\n\nIn the case where all labels are uncorrelated, this loss recovers the spread-out regularizer. While we choose to define the degree of label correlation as the average of conditional frequencies between labels, other measures of similarity such as pointwise mutual information (PMI) and Jaccard distance can also be used. In Appendix B, we give a theoretical justification for using the geometric mean of the conditional frequencies (see Theorem B.1). In experiments, however, we found empirically that using the arithmetic mean of the conditional frequencies gives a slight but noticeable boost in accuracy compared to other measures, motivating the definition (3) of the GLaS regularizer.\nOne issue that arises when using this regularizer is that calculating ℓ_GLaS requires O(K^2) operations and becomes prohibitively expensive when K is large. Instead, we select a batch of rows from V and compute a stochastic version of the loss on that batch only.\n\nRelationship to Graph Laplacian and Spread-out Regularization. While the definition for the GLaS regularizer is intuitive, it may seem arbitrary and one can arrive at other regularizers by following a similar intuition. However, we show that the GLaS regularizer can be recovered as a sum of the well-known graph Laplacian regularizer and the spread-out regularizer, thus giving our regularizer its name (Graph Laplacian and Spreadout).\nGraph Laplacian as a general technique has been successfully applied to representation learning problems such as metric learning [5] and hashing [21]. By adding a graph Laplacian based loss, we can impose the right structure on the off-diagonal values in the Gram matrix of label embeddings. 
More specifically, to assign similar embeddings to labels that co-occur frequently, we can explicitly penalize the ℓ2 distance between two label embeddings with a weight proportional to their co-occurrence statistics. As a result, the graph Laplacian regularization makes the label embeddings consistent with the connectivity pattern of label nodes in the item-label graph (Figure 2). We can write the graph Laplacian regularizer as\n\nℓ_Laplacian = (1/K^2) Σ_{y=1}^{K} Σ_{y'=1}^{K} ‖v_y − v_{y'}‖_2^2 u_{yy'},    (4)\n\nwhere u_{yy'} denotes the amount of “adjacency” between graph nodes of labels y and y' and is only dependent on the graph structure. However, this loss formulation admits a trivial optimal solution that assigns all labels the same embedding.\n\nFigure 2: The item-label bipartite graph. The edge between a label node and an item node represents an assignment of the label to the item. Labels i and j have co-occurred in two items.\n\nRecall that the spread-out regularizer suffers from a completely opposite weakness of encouraging all label embeddings to be orthogonal regardless of any correlation. Thus, combining the two regularizers has the effect of compensating their respective weaknesses and promoting their strengths. Summing the graph Laplacian regularizer (4) and the spread-out regularizer (2) we get\n\nℓ_Laplacian + ℓ_spreadout = (1/K^2) Σ_{y=1}^{K} Σ_{y'=1}^{K} [‖v_y − v_{y'}‖_2^2 u_{yy'} + (v_yᵀv_{y'})^2]\n    =(a) (1/K^2) Σ_{y=1}^{K} Σ_{y'=1}^{K} [(v_yᵀv_{y'} − u_{yy'})^2 − (u_{yy'}^2 − 2u_{yy'})],\n\nwhere (a) holds since ‖v_y‖_2^2 = 1. One can see that Σ_{y,y'} (u_{yy'}^2 − 2u_{yy'}) is a constant that only depends on the graph structure. The non-constant part of the sum can be written as (1/K^2) ‖VᵀV − U‖_F^2, which is exactly the form of GLaS given in (3) with U = (1/2)(AZ^{-1} + Z^{-1}A) being the measure of degree of adjacency in the label graph. 
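The decomposition above can be checked numerically. The following NumPy sketch evaluates the spread-out regularizer (2), the graph Laplacian regularizer (4) and the Frobenius-norm form of GLaS for unit-norm label embeddings, and confirms that they differ only by the constant that depends on U. The function names and the random test matrices are illustrative assumptions; U here is an arbitrary matrix rather than one estimated from co-occurrence data.

```python
import numpy as np


def spreadout_reg(V):
    # Equation (2): mean squared inner product over all label pairs.
    K = V.shape[1]
    G = V.T @ V
    return float((G ** 2).sum()) / K ** 2


def laplacian_reg(V, U):
    # Equation (4): squared distances between label embeddings, weighted by U.
    K = V.shape[1]
    diff = V[:, :, None] - V[:, None, :]   # shape (d, K, K)
    sq_dist = (diff ** 2).sum(axis=0)      # shape (K, K)
    return float((sq_dist * U).sum()) / K ** 2


def glas_reg(V, U):
    # Frobenius-norm form: (1/K^2) * ||V^T V - U||_F^2.
    K = V.shape[1]
    G = V.T @ V
    return float(((G - U) ** 2).sum()) / K ** 2


def check_decomposition(V, U):
    # Requires unit-norm columns in V; the constant depends only on U.
    K = V.shape[1]
    const = float((U ** 2 - 2.0 * U).sum()) / K ** 2
    lhs = laplacian_reg(V, U) + spreadout_reg(V)
    return bool(np.isclose(lhs, glas_reg(V, U) - const))
```

Because the constant does not depend on V, minimizing the sum of (2) and (4) over unit-norm embeddings is equivalent to minimizing the Frobenius-norm term alone.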
Note that the graph Laplacian regularizer ℓ_Laplacian encourages frequently co-occurring labels to have similar label embeddings. However, labels that do not co-occur frequently but have similar embeddings are not penalized by the graph Laplacian regularizer. That penalty is supplied by the spread-out regularizer ℓ_spreadout. Thus, our regularizer GLaS captures the essence of label relation.\n\nAlgorithm 2 Training with regularization\n1: Input: Dataset {(x_1, y_1), . . . , (x_n, y_n)}\n2:   Feature embedding model φ_w : X → R^d\n3:   Label embedding matrix V ∈ R^{d×K}\n4:   Loss function ℓ : R^K × Y → R\n5:   GLaS loss ℓ_GLaS : R^{B×B} × R^{B×B} → R\n6:   Regularization weight λ\n7:   Input dropout keep probability ρ ∈ [0, 1]\n8:   Learning rates η_w, η_V\n9: Initialize w, V\n10: repeat\n11:   Sample a batch x_1, . . . , x_B\n12:   Sample labels y_1, . . . , y_B uniformly from non-zero indices of y_1, . . . , y_B\n13:   Apply input dropout x_i ← x_i ⊙ Bernoulli(ρ, D)\n14:   Compute loss L ← (1/B) Σ_{i=1}^{B} ℓ(φ_w(x_i)ᵀV, y_i)\n15:   Y ← [y_1| · · · |y_B]\n16:   U ← B × B submatrix of Equation (3) corresponding to indices y_1, . . . , y_B\n17:   V ← [v_{y_1}| · · · |v_{y_B}] ∈ R^{d×B}\n18:   Regularize L ← L + λ ℓ_GLaS(VᵀV, U)\n19:   Compute gradients dL/dw and dL/dV via backpropagation\n20:   Update w ← w − η_w dL/dw, V ← V − η_V dL/dV\n21: until convergence\n\n3.3 Input Dropout\n\nInput dropout [13] is a simple regularization and data augmentation technique for text classification models with sparse features. For a selected keep probability ρ ∈ [0, 1] and an input feature x, the method produces an augmented input x' = x ⊙ Bernoulli(ρ, D), where ⊙ denotes element-wise multiplication. Thus, non-zero feature coordinates are set to zero with probability 1 − ρ. This can be interpreted as data augmentation, where features in the input are uniformly removed with probability 1 − ρ. It discourages the model from fitting spurious patterns in input features when training data is scarce and it also encourages the model to be robust to corruption of the input features. The complete learning algorithm that integrates all techniques described in this section is presented as Algorithm 2.\n\n4 Experiments\n\nIn this section, we present experimental results of our method on several widely used extreme multi-label classification datasets: AMAZONCAT-13K, AMAZON-670K, WIKILSHTC-325K, DELICIOUS-200K, EURLEX-4K, and WIKIPEDIA-500K. The statistics of these datasets are presented in Table 5 in the supplementary material.\n\nAblation Study. We begin by studying the performance of Algorithm 2 under different settings of its hyperparameters. In particular, we investigate variations in the regularization type and weight, input dropout, batch size, and embedding type and size. Table 1 shows the effects of different parameters on the performance of our method on the AMAZONCAT-13K dataset. We first list our base setting that we have derived through cross validation. In the base setting, we use the GLaS regularizer (discussed in Sec. 3.2) with regularization weight λ = 10, input dropout with ρ = 0.8, batch size B = 4096, and a non-linear embedding map φ_w with embedding dimension d = 1024. In each row of Table 1, we alter one parameter from the base setting to study its impact. 
For the regularization method, we compare our method with the spread-out regularizer [37] and the Gravity regularizer [16] and show that our method outperforms both (94.21 versus 93.34 and 93.42). We can observe that the regularization weight and input dropout rate should be neither excessively small nor excessively large, as these settings hurt the test accuracy.\nAs one can expect, embeddings of higher dimensionalities outperform those of lower dimensionalities. Batch sizes in the range of 1000s do not have a significant impact on the performance; however, we do note that the largest batch size 4096 gives us the highest test accuracy. Finally, as shown in Table 1, adding the ReLU nonlinearity boosts the performance of φ_w in learning the embedding.\n\nVariable | Parameters\nRegularizer | None: 92.34 | Spread-out: 93.34 | Gravity: 93.42 | GLaS: 94.21\nRegularization Weight | λ = 1: 93.68 | λ = 10: 94.21 | λ = 100: 93.75\nInput Dropout | ρ = 0.6: 93.39 | ρ = 0.8: 94.21 | ρ = 1.0: 94.08\nBatch Size | 1024: 94.04 | 2048: 93.98 | 4096: 94.21\nEmbedding Size | d = 256: 93.24 | d = 512: 93.82 | d = 1024: 94.21\nEmbedding Type | Linear: 91.77 | Nonlinear (ReLU): 94.21\n\nTable 1: Sensitivity of Algorithm 2 to variations in different parameters for AMAZONCAT-13K. Each row shows the effect of a single parameter. Our GLaS regularizer outperforms spread-out and gravity. A moderate regularization weight and input dropout, a large embedding size, and using non-linearity lead to a better result.\n\nGeneralization Gap. As discussed previously, one of the main goals of this paper is to propose regularization techniques that mitigate the overfitting (Figure 1) of neural network embedding-based methods for extreme multi-label classification problems. Table 2 studies the effect of our regularization technique on the generalization gap, i.e., the difference between training and test accuracies. In particular, we have studied two datasets, AMAZONCAT-13K and AMAZON-670K, in two different settings: with and without the regularization technique we discussed in Section 3. 
The table shows that regularizing embedding-based models with our method reduces the generalization gap over the unregularized setting while improving test accuracy. As an example, the GLaS regularizer reduces the generalization gap of Algorithm 1 by more than 30% (from 6.89 to 4.56) when applied to the AMAZONCAT-13K dataset.\n\nDataset | Regularization | Train Acc. | Test Acc. | Gen. Gap\nAMAZONCAT-13K | GLaS | 98.77 | 94.21 | 4.56\nAMAZONCAT-13K | None | 99.23 | 92.34 | 6.89\nAMAZON-670K | GLaS | 96.10 | 46.32 | 49.78\nAMAZON-670K | None | 98.21 | 44.53 | 53.68\n\nTable 2: The comparison of generalization gap in Algorithm 1 and Algorithm 2 when they are applied to AMAZONCAT-13K and AMAZON-670K datasets. The GLaS regularizer (Section 3.2) significantly improves the generalization gap.\n\nComparison with Previous Work. We compare our method with several other recent works on the extreme classification problem, listed in Table 3. As shown in the table, on all datasets except Delicious-200K and EURLex-4K our method matches or outperforms all previous work in terms of precision@k (footnote 3). Even on the Delicious-200K dataset, our method’s performance is close to that of the state-of-the-art, which belongs to another embedding-based method, SLEEC [6]. One thing to note about the Delicious-200K dataset is that its average number of labels per training point is significantly larger than that of other datasets. Due to this, we observed that it took a long time for training to show steady progress with the fixed margin loss. Hence, we have used the softmax-cross-entropy loss for the Delicious-200K dataset instead of the loss function in (1). 
Softmax-cross-entropy loss relaxes the margin requirement and significantly stabilizes training.\n\nFootnote 3: P@k = $\\frac{1}{k} \\sum_{l \\in \\mathrm{rank}_k(\\hat{y})} y_l$, where $\\hat{y}$ is the predicted score vector and $y \\in \\{0, 1\\}^L$ is the ground-truth label vector.\n\nTable 3: Performance comparison (based on precision@k) with several other methods on large-scale datasets. Our method attains or improves upon the state-of-the-art results. Results of other methods are derived from the extreme classification repository. Italic underlined numbers are the best of the entire row and bold numbers are the best among embedding-based methods.\n\nEmbedding-based methods: Ours, SLEEC [6], LEML [36], RobustXML [31], XML-CNN [19]. Other methods: PfastreXML [14], FastXML [24], Parabel [23], DiSMEC [3], PD-Sparse [34], PPD-Sparse [33].\n\nDataset | P@k | Ours | SLEEC | LEML | RobustXML | XML-CNN | PfastreXML | FastXML | Parabel | DiSMEC | PD-Sparse | PPD-Sparse\nAMAZONCAT-13K | P@1 | 94.21 | 90.53 | - | 88.4 | - | 91.75 | 93.11 | 93.03 | 93.40 | 90.60 | -\nAMAZONCAT-13K | P@3 | 79.70 | 76.33 | - | 74.6 | - | 77.97 | 78.2 | 79.16 | 79.10 | 75.14 | -\nAMAZONCAT-13K | P@5 | 64.84 | 61.52 | - | 60.6 | - | 63.68 | 63.41 | 64.52 | 64.10 | 60.69 | -\nWIKILSHTC-325K | P@1 | 65.46 | 54.83 | 19.82 | 53.5 | - | 56.05 | 49.75 | 65.04 | 64.40 | 61.26 | 64.08\nWIKILSHTC-325K | P@3 | 45.44 | 33.42 | 11.43 | 31.8 | - | 36.79 | 33.10 | 43.23 | 42.50 | 39.48 | 41.26\nWIKILSHTC-325K | P@5 | 34.51 | 23.85 | 8.39 | 29.9 | - | 27.09 | 24.45 | 32.05 | 31.50 | 28.79 | 30.12\nAMAZON-670K | P@1 | 46.38 | 35.05 | 8.13 | 31.0 | 35.39 | 39.46 | 36.99 | 44.89 | 44.70 | - | 45.32\nAMAZON-670K | P@3 | 42.09 | 31.25 | 6.83 | 28.0 | 31.93 | 35.81 | 33.28 | 39.80 | 39.70 | - | 40.37\nAMAZON-670K | P@5 | 38.56 | 28.56 | 6.03 | 24.0 | 29.32 | 33.05 | 30.53 | 36.00 | 36.10 | - | 36.92\nDELICIOUS-200K | P@1 | 46.4 | 47.85 | 40.73 | 45.0 | - | 41.72 | 43.07 | 46.97 | 45.50 | 34.37 | -\nDELICIOUS-200K | P@3 | 40.49 | 42.21 | 37.71 | 40.0 | - | 37.83 | 38.66 | 40.08 | 38.70 | 29.48 | -\nDELICIOUS-200K | P@5 | 38.1 | 39.43 | 35.84 | 38.0 | - | 35.58 | 36.19 | 36.63 | 35.50 | 27.04 | -\nEURLEX-4K | P@1 | 77.5 | 79.26 | 63.4 | - | 76.38 | 75.45 | 71.36 | 81.73 | 82.4 | 76.43 | 83.83\nEURLEX-4K | P@3 | 65.01 | 64.3 | 50.35 | - | 62.81 | 62.7 | 59.9 | 68.78 | 68.5 | 60.37 | 70.72\nEURLEX-4K | P@5 | 54.37 | 52.33 | 41.28 | - | 51.41 | 52.51 | 50.39 | 57.44 | 57.7 | 49.72 | 59.21\nWIKIPEDIA-500K | P@1 | 69.91 | 48.2 | 41.3 | - | 59.85 | 59.52 | 54.1 | 66.73 | 70.2 | - | 70.16\nWIKIPEDIA-500K | P@3 | 49.08 | 29.4 | 30.1 | - | 39.28 | 40.24 | 35.5 | 47.48 | 50.6 | - | 50.57\nWIKIPEDIA-500K | P@5 | 38.35 | 21.2 | 19.8 | - | 29.81 | 30.72 | 26.2 | 36.78 | 39.7 | - | 39.66\n\nTable 4: Performance comparison (based on propensity scored precision@k, PSP@k) with several other methods on large-scale datasets. Propensity weights are higher for rarer labels, hence this metric better reflects the model\u2019s ability to generalize to tail labels than precision. Italic underlined numbers are the best of the entire row and bold numbers are the best among embedding-based methods.\n\nEmbedding-based methods: Ours, SLEEC [6], LEML [36]. Other methods: PfastreXML [14], FastXML [24], Parabel [23], DiSMEC [3], PD-Sparse [34], PPD-Sparse [33].\n\nDataset | PSP@k | Ours | SLEEC | LEML | PfastreXML | FastXML | Parabel | DiSMEC | PD-Sparse | PPD-Sparse\nAMAZONCAT-13K | PSP@1 | 47.53 | 46.75 | - | 69.52 | 48.31 | 50.93 | 59.10 | 49.58 | -\nAMAZONCAT-13K | PSP@3 | 62.74 | 58.46 | - | 73.22 | 60.26 | 64.00 | 67.10 | 61.63 | -\nAMAZONCAT-13K | PSP@5 | 71.66 | 65.96 | - | 75.48 | 69.30 | 72.08 | 71.20 | 68.23 | -\nWIKILSHTC-325K | PSP@1 | 46.22 | 20.27 | 3.48 | 30.66 | 16.35 | 26.76 | 29.1 | 28.34 | 27.47\nWIKILSHTC-325K | PSP@3 | 46.15 | 23.18 | 3.79 | 31.55 | 20.99 | 33.27 | 35.6 | 33.50 | 33.00\nWIKILSHTC-325K | PSP@5 | 47.28 | 25.08 | 4.27 | 33.12 | 23.56 | 37.36 | 39.5 | 36.62 | 36.29\nAMAZON-670K | PSP@1 | 38.94 | 20.62 | 2.07 | 29.30 | 19.37 | 25.43 | 27.8 | - | 26.64\nAMAZON-670K | PSP@3 | 39.72 | 23.32 | 2.26 | 30.80 | 23.26 | 29.43 | 30.6 | - | 30.65\nAMAZON-670K | PSP@5 | 41.24 | 25.98 | 2.47 | 32.43 | 26.85 | 32.85 | 34.2 | - | 34.65\nDELICIOUS-200K | PSP@1 | 28.68 | 7.17 | 6.06 | 3.15 | 6.48 | 7.25 | 6.5 | 5.29 | -\nDELICIOUS-200K | PSP@3 | 24.93 | 8.16 | 7.24 | 3.87 | 7.52 | 7.94 | 7.6 | 5.80 | -\nDELICIOUS-200K | PSP@5 | 23.87 | 8.96 | 8.10 | 4.43 | 8.31 | 8.52 | 8.4 | 6.24 | -\nEURLEX-4K | PSP@1 | 49.77 | 34.25 | 24.10 | 43.86 | 26.62 | 36.36 | 41.20 | 36.28 | -\nEURLEX-4K | PSP@3 | 51.05 | 38.35 | 26.37 | 45.23 | 32.07 | 41.95 | 44.30 | 40.96 | -\nEURLEX-4K | PSP@5 | 53.82 | 40.30 | 27.62 | 46.03 | 35.23 | 44.78 | 46.90 | 42.84 | -\n\nOne of the biggest challenges for learning in large output spaces comes from tail labels that are only assigned to a few inputs, but make up the majority of the whole label set. The propensity scored precision@k (PSP@k; footnote 4) metric corrects for this bias by up-weighting rare labels. To demonstrate the effectiveness of our method at predicting tail labels, we report results using this evaluation metric in Table 4. While many previous methods that we compare against have to explicitly change their training objective or algorithm to account for the re-weighting, our simple embedding-based models learn to predict these tail labels remarkably well without any adjustment of our training loss or procedure. On Amazon-670K, the dataset with the largest number of labels in Table 4, our method improves the PSP@1 metric by an absolute margin of 9.6% (38.94 versus 29.30 for PfastreXML).\n\nTraining and Inference Speed. We train all models for up to 10 epochs and apply early stopping when evaluation accuracy ceases to improve. The overall training process takes minutes to hours; its time complexity is $O(d \\sum_{x \\in S} \\mathrm{nnz}(x))$, where d is the embedding dimensionality, S is the set of training samples, and nnz(x) is the number of non-zero features of the sparse input x.\nAt inference time, we apply efficient Maximum Inner Product Search techniques such as [11, 30]. The non-exhaustive search achieves low latency due to highly effective clustering-based tree indices [2] and hardware-based acceleration [11, 8]. 
For all datasets with up to a few million labels, the inference latency is below 10 ms, and it is below 1 ms for datasets with fewer than 100K labels.\n\n5 Conclusions\n\nIn this paper, we showed from both theoretical and empirical perspectives that neural network models suffer from overfitting, rather than from a low-dimensional embedding bottleneck, when applied to extreme multi-label classification problems. To address this, we introduced the GLaS regularization framework and demonstrated its effectiveness with new state-of-the-art results on several widely tested large-scale datasets. We hope that future work can build on our theoretical and empirical findings and that more competitive embedding-based methods can be developed along this direction.\n\nFootnote 4: Similar to P@k, PSP@k = $\\frac{1}{k} \\sum_{l \\in \\mathrm{rank}_k(\\hat{y})} \\frac{y_l}{p_l}$, where $p_l$ denotes the propensity weights.\n\nReferences\n\n[1] R. Agrawal, A. Gupta, Y. Prabhu, and M. Varma. Multi-label learning with millions of labels: Recommending advertiser bid phrases for web pages. In Proceedings of the 22nd international conference on World Wide Web, pages 13\u201324. ACM, 2013.\n\n[2] A. Auvolat and P. Vincent. Clustering is efficient for approximate maximum inner product search. CoRR, abs/1507.05910, 2015.\n\n[3] R. Babbar and B. Sch\u00f6lkopf. DiSMEC: Distributed sparse machines for extreme multi-label classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM 2017, Cambridge, United Kingdom, February 6-10, 2017, pages 721\u2013729, 2017.\n\n[4] R. Babbar and B. Sch\u00f6lkopf. Data scarcity, robustness and extreme multi-label classification. Machine Learning, pages 1\u201323, 2019.\n\n[5] A. Bellet, A. Habrard, and M. Sebban. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.\n\n[6] K. Bhatia, H. Jain, P. Kar, M. Varma, and P. Jain. Sparse local embeddings for extreme multi-label classification. 
In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pages 730\u2013738, 2015.\n\n[7] W. Bi and J. Kwok. Efficient multi-label classification with many labels. In International Conference on Machine Learning, pages 405\u2013413, 2013.\n\n[8] D. W. Blalock and J. V. Guttag. Bolt: Accelerated data mining with fast vector compression. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 727\u2013735, 2017.\n\n[9] Y.-N. Chen and H.-T. Lin. Feature-aware label space dimension reduction for multi-label classification. In Advances in Neural Information Processing Systems, pages 1529\u20131537, 2012.\n\n[10] M. Ciss\u00e9, N. Usunier, T. Arti\u00e8res, and P. Gallinari. Robust bloom filters for large multilabel classification tasks. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 1851\u20131859, 2013.\n\n[11] R. Guo, S. Kumar, K. Choromanski, and D. Simcha. Quantization based fast inner product search. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, AISTATS 2016, Cadiz, Spain, May 9-11, 2016, pages 482\u2013490, 2016.\n\n[12] D. J. Hsu, S. Kakade, J. Langford, and T. Zhang. Multi-label prediction via compressed sensing. In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009. Proceedings of a meeting held 7-10 December 2009, Vancouver, British Columbia, Canada, pages 772\u2013780, 2009.\n\n[13] M. Iyyer, V. Manjunatha, J. L. Boyd-Graber, and H. Daum\u00e9 III. Deep unordered composition rivals syntactic methods for text classification. 
In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, pages 1681\u20131691, 2015.\n\n[14] H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 935\u2013944, 2016.\n\n[15] H. Jain, Y. Prabhu, and M. Varma. Extreme multi-label loss functions for recommendation, tagging, ranking & other missing label applications. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 935\u2013944, 2016.\n\n[16] W. Krichene, N. Mayoraz, S. Rendle, X. Lin, X. Yi, L. Hong, E. H. Chi, and J. R. Anderson. Efficient training on very large corpora via gramian estimation. CoRR, abs/1807.07187, 2018.\n\n[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097\u20131105, 2012.\n\n[18] Z. Lin, G. Ding, M. Hu, and J. Wang. Multi-label classification via feature-aware implicit label space encoding. In International conference on machine learning, pages 325\u2013333, 2014.\n\n[19] J. Liu, W.-C. Chang, Y. Wu, and Y. Yang. Deep learning for extreme multi-label text classification. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 115\u2013124. ACM, 2017.\n\n[20] T.-Y. Liu et al. Learning to rank for information retrieval. Foundations and Trends\u00ae in Information Retrieval, 3(3):225\u2013331, 2009.\n\n[21] W. Liu, J. Wang, S. 
Kumar, and S.-F. Chang. Hashing with graphs. In Proceedings of the 28th International Conference on Machine Learning, 2011.\n\n[22] Y. Prabhu, A. Kag, S. Gopinath, K. Dahiya, S. Harsola, R. Agrawal, and M. Varma. Extreme multi-label learning with label features for warm-start tagging, ranking & recommendation. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, pages 441\u2013449. ACM, 2018.\n\n[23] Y. Prabhu, A. Kag, S. Harsola, R. Agrawal, and M. Varma. Parabel: Partitioned label trees for extreme classification with application to dynamic search advertising. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, WWW 2018, Lyon, France, April 23-27, 2018, pages 993\u20131002, 2018.\n\n[24] Y. Prabhu and M. Varma. FastXML: a fast, accurate and stable tree-classifier for extreme multi-label learning. In The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD \u201914, New York, NY, USA - August 24 - 27, 2014, pages 263\u2013272, 2014.\n\n[25] S. J. Reddi, S. Kale, F. Yu, D. Holtmann-Rice, J. Chen, and S. Kumar. Stochastic negative mining for learning with large output spaces. In AISTATS, 2019.\n\n[26] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29, pages 901\u2013909, 2016.\n\n[27] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, pages 3104\u20133112, 2014.\n\n[28] F. Tai and H.-T. Lin. Multilabel classification with principal label space transformation. Neural Computation, 24(9):2508\u20132542, 2012.\n\n[29] J. Weston, A. Makadia, and H. Yee. Label partitioning for sublinear ranking. 
In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, Atlanta, GA, USA, 16-21 June 2013, pages 181\u2013189, 2013.\n\n[30] X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, and F. Yu. Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems 30, pages 5745\u20135755, 2017.\n\n[31] C. Xu, D. Tao, and C. Xu. Robust extreme multi-label learning. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pages 1275\u20131284. ACM, 2016.\n\n[32] Z. Yang, Z. Dai, R. Salakhutdinov, and W. W. Cohen. Breaking the softmax bottleneck: A high-rank RNN language model. In International Conference on Learning Representations, 2018.\n\n[33] I. E. Yen, X. Huang, W. Dai, P. Ravikumar, I. S. Dhillon, and E. P. Xing. PPDsparse: A parallel primal-dual sparse method for extreme classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, August 13 - 17, 2017, pages 545\u2013553, 2017.\n\n[34] I. E. Yen, X. Huang, P. Ravikumar, K. Zhong, and I. S. Dhillon. PD-Sparse: A primal and dual sparse approach to extreme multiclass and multilabel classification. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19-24, 2016, pages 3069\u20133077, 2016.\n\n[35] R. You, S. Dai, Z. Zhang, H. Mamitsuka, and S. Zhu. AttentionXML: Extreme multi-label text classification with multi-label attention based recurrent neural networks. arXiv preprint arXiv:1811.01727, 2018.\n\n[36] H. Yu, P. Jain, P. Kar, and I. S. Dhillon. Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, pages 593\u2013601, 2014.\n\n[37] X. Zhang, F. X. Yu, S. Kumar, and S. Chang. 
Learning spread-out local feature descriptors. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, pages 4605\u20134613, 2017.\n", "award": [], "sourceid": 2742, "authors": [{"given_name": "Chuan", "family_name": "Guo", "institution": "Cornell University"}, {"given_name": "Ali", "family_name": "Mousavi", "institution": "Google Brain"}, {"given_name": "Xiang", "family_name": "Wu", "institution": "ByteDance"}, {"given_name": "Daniel", "family_name": "Holtmann-Rice", "institution": "Google Inc"}, {"given_name": "Satyen", "family_name": "Kale", "institution": "Google"}, {"given_name": "Sashank", "family_name": "Reddi", "institution": "Google"}, {"given_name": "Sanjiv", "family_name": "Kumar", "institution": "Google Research"}]}