{"title": "One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities", "book": "Advances in Neural Information Processing Systems", "page_first": 4161, "page_last": 4169, "abstract": "The softmax representation of probabilities for categorical variables plays a prominent role in modern machine learning with numerous applications in areas such as large scale classification, neural language modeling and recommendation systems. However, softmax estimation is very expensive for large scale inference because of the high cost associated with computing the normalizing constant. Here, we introduce an efficient approximation to softmax probabilities which takes the form of a rigorous lower bound on the exact probability. This bound is expressed as a product over pairwise probabilities and it leads to scalable estimation based on stochastic optimization. It allows us to perform doubly stochastic estimation by subsampling both training instances and class labels. We show that the new bound has interesting theoretical properties and we demonstrate its use in classification problems.", "full_text": "One-vs-Each Approximation to Softmax for Scalable\n\nEstimation of Probabilities\n\nMichalis K. Titsias\n\nDepartment of Informatics\n\nAthens University of Economics and Business\n\nmtitsias@aueb.gr\n\nAbstract\n\nThe softmax representation of probabilities for categorical variables plays a promi-\nnent role in modern machine learning with numerous applications in areas such as\nlarge scale classi\ufb01cation, neural language modeling and recommendation systems.\nHowever, softmax estimation is very expensive for large scale inference because\nof the high cost associated with computing the normalizing constant. Here, we\nintroduce an ef\ufb01cient approximation to softmax probabilities which takes the form\nof a rigorous lower bound on the exact probability. 
This bound is expressed as a product over pairwise probabilities and it leads to scalable estimation based on stochastic optimization. It allows us to perform doubly stochastic estimation by subsampling both training instances and class labels. We show that the new bound has interesting theoretical properties and we demonstrate its use in classification problems.

1 Introduction

Based on the softmax representation, the probability of a variable y taking the value k ∈ {1, . . . , K}, where K is the number of categorical symbols or classes, is modeled by

p(y = k|x) = \frac{e^{f_k(x;w)}}{\sum_{m=1}^{K} e^{f_m(x;w)}},   (1)

where each f_k(x; w) is often referred to as the score function; it is a real-valued function indexed by an input vector x and parameterized by w. The score function measures the compatibility of the input x with the symbol y = k: the higher the score, the more compatible x is with y = k. The most common application of softmax is multiclass classification, where x is an observed input vector and f_k(x; w) is often chosen to be a linear function or, more generally, a non-linear function such as a neural network [3, 8]. Several other applications of softmax arise, for instance, in neural language modeling for learning word vector embeddings [15, 14, 18] and in collaborative filtering for representing probabilities of (user, item) pairs [17]. In such applications the number of symbols K can be very large, e.g. of the order of tens of thousands or millions, which makes the computation of softmax probabilities very expensive due to the large sum in the normalizing constant of Eq. (1). Thus, exact training procedures based on maximum likelihood or Bayesian approaches are computationally prohibitive and approximations are needed. 
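To make the cost concrete, here is a minimal numpy sketch (our illustration, not code from the paper) of Eq. (1): the normalizing constant forces an O(K) sum for every probability evaluation, which is the bottleneck the paper targets.

```python
import numpy as np

def softmax(scores):
    """Softmax probabilities as in Eq. (1).

    The normalizer sums over all K scores, so each evaluation is O(K);
    a full likelihood pass over N points costs O(N * K) score evaluations.
    """
    scores = scores - np.max(scores)  # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()

f = np.array([2.0, 1.0, 0.5, -1.0, 0.0])  # example scores for K = 5 classes
p = softmax(f)
```

The max-subtraction does not change the result (it cancels between numerator and denominator) but avoids overflow for large scores.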
While some rigorous bound-based approximations to the softmax exist [5], they are not very accurate or scalable, and therefore it would be highly desirable to develop accurate and computationally efficient approximations.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

In this paper we introduce a new efficient approximation to softmax probabilities which takes the form of a lower bound on the probability of Eq. (1). This bound draws an interesting connection between the exact softmax probability and all its one-vs-each pairwise probabilities, and it has several desirable properties. Firstly, for the non-parametric estimation case it leads to an approximation of the likelihood that shares the same global optimum with exact maximum likelihood, and thus estimation based on the approximation is a perfect surrogate for the initial estimation problem. Secondly, the bound allows for scalable learning through stochastic optimization, where data subsampling can be combined with subsampling categorical symbols. Thirdly, whenever the initial exact softmax cost function is convex the bound remains convex as well.

Regarding related work, there exist several other methods that try to deal with the high cost of softmax, such as methods that attempt to perform the exact computations [9, 19], methods that change the model based on hierarchical or stick-breaking constructions [16, 13], and sampling-based methods [1, 14, 7, 11]. Our method is a lower bound based approach that follows the variational inference framework. Other rigorous variational lower bounds on the softmax have been used before [4, 5]; however, they are not easily scalable since they require optimizing data-specific variational parameters. In contrast, the bound we introduce in this paper does not contain any variational parameters, which greatly facilitates stochastic minibatch training. 
At the same time it can be much tighter than previous bounds [5], as we will demonstrate empirically in several classification datasets.

2 One-vs-each lower bound on the softmax

Here, we derive the new bound on the softmax (Section 2.1) and we prove its optimality property when performing approximate maximum likelihood estimation (Section 2.2). Such a property holds for the non-parametric case, where we estimate probabilities of the form p(y = k), without conditioning on some x, so that the score functions f_k(x; w) reduce to unrestricted parameters f_k; see Eq. (2) below. Finally, we also analyze the related bound derived by Bouchard [5] and we compare it with our approach (Section 2.3).

2.1 Derivation of the bound

Consider a discrete random variable y ∈ {1, . . . , K} that takes the value k with probability

p(y = k) = \text{Softmax}_k(f_1, \ldots, f_K) = \frac{e^{f_k}}{\sum_{m=1}^{K} e^{f_m}},   (2)

where each f_k is a free real-valued scalar parameter. We wish to express a lower bound on p(y = k), and the key step of our derivation is to re-write p(y = k) as

p(y = k) = \frac{1}{1 + \sum_{m \neq k} e^{-(f_k - f_m)}}.   (3)

Then, by exploiting the fact that for any non-negative numbers \alpha_1 and \alpha_2 it holds that 1 + \alpha_1 + \alpha_2 \leq 1 + \alpha_1 + \alpha_2 + \alpha_1 \alpha_2 = (1 + \alpha_1)(1 + \alpha_2), and more generally that (1 + \sum_i \alpha_i) \leq \prod_i (1 + \alpha_i) where each \alpha_i \geq 0, we obtain the following lower bound on the above probability,

p(y = k) \geq \prod_{m \neq k} \frac{1}{1 + e^{-(f_k - f_m)}} = \prod_{m \neq k} \frac{e^{f_k}}{e^{f_k} + e^{f_m}} = \prod_{m \neq k} \sigma(f_k - f_m),   (4)

where \sigma(\cdot) denotes the sigmoid function. Clearly, the terms in the product are pairwise probabilities, each corresponding to the event y = k conditional on the union of a pair of events, i.e. y ∈ {k, m}, where m is one of the remaining values. 
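The inequality in Eq. (4) is easy to check numerically. The following sketch (our illustration, using only numpy) draws random scores and verifies that the one-vs-each product never exceeds the exact softmax probability, and that for K = 2 the two coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ove_bound(f, k):
    """One-vs-each lower bound of Eq. (4): prod_{m != k} sigma(f_k - f_m)."""
    mask = np.arange(len(f)) != k
    return np.prod(sigmoid(f[k] - f[mask]))

# The bound should never exceed the exact probability, for any scores.
for _ in range(1000):
    f = rng.normal(size=10)
    p = softmax(f)
    for k in range(10):
        assert ove_bound(f, k) <= p[k] + 1e-12
```

For K = 2 there is a single remaining event, so the bound reduces to the exact conditional probability and becomes tight.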
We will refer to this bound as the one-vs-each bound on the softmax probability, since it involves K − 1 comparisons of a specific event y = k versus each of the K − 1 remaining events. Furthermore, the above result can be stated more generally to define bounds on arbitrary probabilities, as the following statement shows.

Proposition 1. Assume a probability model with state space Ω and probability measure P(·). For any event A ⊂ Ω and an associated countable set of disjoint events {B_i} such that ∪_i B_i = Ω \ A, it holds that

P(A) \geq \prod_i P(A \,|\, A \cup B_i).   (5)

Proof. Given that P(A) = \frac{P(A)}{P(\Omega)} = \frac{P(A)}{P(A) + \sum_i P(B_i)}, the result follows by applying the inequality (1 + \sum_i \alpha_i) \leq \prod_i (1 + \alpha_i) exactly as done above for the softmax parameterization.

Remark. If the set {B_i} consists of a single event B, then by definition B = Ω \ A and the bound is exact, since in such a case P(A | A ∪ B) = P(A).

Furthermore, based on the above construction we can express a full class of hierarchically ordered bounds. For instance, if we merge two events B_i and B_j into a single one, then the term P(A|A ∪ B_i)P(A|A ∪ B_j) in the initial bound is replaced with P(A|A ∪ B_i ∪ B_j), and the associated new bound, obtained after this merge, can only become tighter. To see a more specific example in the softmax probabilistic model, assume a small subset of categorical symbols C_k that does not include k, and denote the remaining symbols excluding k as \bar{C}_k, so that k ∪ C_k ∪ \bar{C}_k = {1, . . . , K}. Then, a tighter bound, which exists higher in the hierarchy than the one-vs-each bound (see Eq. 4), takes the form

p(y = k) \geq \text{Softmax}_k(f_k, f_{C_k}) \times \text{Softmax}_k(f_k, f_{\bar{C}_k}) \geq \text{Softmax}_k(f_k, f_{C_k}) \times \prod_{m \in \bar{C}_k} \sigma(f_k - f_m),   (6)

where \text{Softmax}_k(f_k, f_{C_k}) = \frac{e^{f_k}}{e^{f_k} + \sum_{m \in C_k} e^{f_m}} and \text{Softmax}_k(f_k, f_{\bar{C}_k}) = \frac{e^{f_k}}{e^{f_k} + \sum_{m \in \bar{C}_k} e^{f_m}}. For simplicity of presentation, in the remainder of the paper we do not discuss these more general bounds further and focus only on the one-vs-each bound.

The computationally useful aspect of the bound in Eq. (4) is that it factorizes into a product, where each factor depends only on a pair of parameters (f_k, f_m). Crucially, this avoids the evaluation of the normalizing constant associated with the global probability in Eq. (2) and, as discussed in Section 3, it leads to scalable training using stochastic optimization that can deal with very large K. Furthermore, approximate maximum likelihood estimation based on the bound can be very accurate and, as shown in the next section, it is exact for the non-parametric estimation case.

The fact that the one-vs-each bound in (4) is a product of pairwise probabilities suggests a connection with Bradley-Terry (BT) models [6, 10] for learning individual skills from paired comparisons, and with the associated multiclass classification systems obtained by combining binary classifiers, such as one-vs-rest and one-vs-one approaches [10]. Our method differs from BT models, since we do not combine binary probabilistic models to a posteriori form a multiclass model. Instead, we wish to develop scalable approximate algorithms that can surrogate the training of multiclass softmax-based models by maximizing lower bounds on the exact likelihoods of these models.

2.2 Optimality of the bound for maximum likelihood estimation

Assume a set of observations (y_1, . . . , y_N) where each y_i ∈ {1, . . . , K}. 
The log likelihood of the data takes the form

\mathcal{L}(f) = \log \prod_{i=1}^{N} p(y_i) = \log \prod_{k=1}^{K} p(y = k)^{N_k},   (7)

where f = (f_1, . . . , f_K) and N_k denotes the number of data points with value k. By substituting p(y = k) from Eq. (2) and then taking derivatives with respect to f, we arrive at the standard stationary conditions of the maximum likelihood solution,

\frac{e^{f_k}}{\sum_{m=1}^{K} e^{f_m}} = \frac{N_k}{N}, \quad k = 1, \ldots, K.   (8)

These stationary conditions are satisfied for f_k = \log N_k + c, where c ∈ R is an arbitrary constant. What is rather surprising is that the same solutions f_k = \log N_k + c also satisfy the stationary conditions obtained when maximizing a lower bound on the exact log likelihood formed from the product of one-vs-each probabilities.

More precisely, by replacing p(y = k) with the bound from Eq. (4) we obtain a lower bound on the exact log likelihood,

\mathcal{F}(f) = \log \prod_{k=1}^{K} \left[ \prod_{m \neq k} \frac{e^{f_k}}{e^{f_k} + e^{f_m}} \right]^{N_k} = \sum_{k > m} \log P(f_k, f_m),   (9)

where P(f_k, f_m) = \left[ \frac{e^{f_k}}{e^{f_k} + e^{f_m}} \right]^{N_k} \left[ \frac{e^{f_m}}{e^{f_k} + e^{f_m}} \right]^{N_m} is a likelihood involving only the data of the pair of states (k, m), while there exist K(K − 1)/2 such possible pairs. If instead of maximizing the exact log likelihood from Eq. (7) we maximize the lower bound, we obtain the same parameter estimates.

Proposition 2. The maximum likelihood parameter estimates f_k = \log N_k + c, k = 1, . . . , K, for the exact log likelihood from Eq. (7) also globally maximize the lower bound from Eq. (9).

Proof. By computing the derivatives of \mathcal{F}(f) we obtain the following stationary conditions,

K - 1 = \sum_{m \neq k} \frac{N_k + N_m}{N_k} \, \frac{e^{f_k}}{e^{f_k} + e^{f_m}}, \quad k = 1, \ldots, K,   (10)

which form a system of K non-linear equations over the unknowns (f_1, . . . , f_K). By substituting the values f_k = \log N_k + c we can observe that all K equations are simultaneously satisfied, which means that these values are solutions. Furthermore, since \mathcal{F}(f) is a concave function of f, we can conclude that the solutions f_k = \log N_k + c globally maximize \mathcal{F}(f).

Remark. Not only is \mathcal{F}(f) globally maximized by setting f_k = \log N_k + c, but each pairwise likelihood P(f_k, f_m) in Eq. (9) is also separately maximized by the same setting of parameters.

2.3 Comparison with Bouchard's bound

Bouchard [5] proposed a related bound, which we next analyze in terms of its ability to approximate exact maximum likelihood training in the non-parametric case, before comparing it against our method. Bouchard [5] was motivated by the problem of applying variational Bayesian inference to multiclass classification and derived the following upper bound on the log-sum-exp function,

\log \sum_{m=1}^{K} e^{f_m} \leq \alpha + \sum_{m=1}^{K} \log\left(1 + e^{f_m - \alpha}\right),   (11)

where α ∈ R is a variational parameter that needs to be optimized in order for the bound to become as tight as possible. The above induces a lower bound on the softmax probability p(y = k) from Eq. (2) that takes the form

p(y = k) \geq \frac{e^{f_k - \alpha}}{\prod_{m=1}^{K} (1 + e^{f_m - \alpha})}.   (12)

This is not the same as Eq. (4), since there is no value of α for which the above bound reduces to our proposed one. For instance, if we set α = f_k, then Bouchard's bound becomes half the one in Eq. (4) due to the extra term 1 + e^{f_k − f_k} = 2 in the product in the denominator.¹ Furthermore, such a value for α may not be the optimal one, and in practice α must be chosen by minimizing the upper bound in Eq. (11). While such an optimization is a convex problem, it requires iterative optimization since there is, in general, no analytical solution for α. 
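To get a feel for the relative tightness of the two bounds, the following sketch (our illustration, not from the paper) evaluates both lower bounds on random scores; α is optimized here by a crude grid search over the convex objective of Eq. (11), purely for demonstration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def ove_bound(f, k):
    """One-vs-each lower bound, Eq. (4)."""
    mask = np.arange(len(f)) != k
    return np.prod(sigmoid(f[k] - f[mask]))

def bouchard_bound(f, k, alpha):
    """Bouchard's lower bound, Eq. (12)."""
    return np.exp(f[k] - alpha) / np.prod(1.0 + np.exp(f - alpha))

rng = np.random.default_rng(1)
f = rng.normal(size=8)
p = softmax(f)

# Grid search for alpha over the convex upper bound of Eq. (11);
# in practice this would be done by iterative optimization.
grid = np.linspace(f.min() - 5.0, f.max() + 5.0, 2001)
ub = grid + np.array([np.sum(np.log1p(np.exp(f - a))) for a in grid])
alpha = grid[np.argmin(ub)]
```

Both quantities are valid lower bounds on p(y = k) for every class, which the assertions below confirm; on draws like this the one-vs-each bound is typically the tighter of the two, in line with the empirical comparisons later in the paper.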
However, for the simple case where K = 2 we can analytically find the optimal α and the optimal f parameters. The following proposition carries out this analysis and provides a clear understanding of how Bouchard's bound behaves when applied to approximate maximum likelihood estimation.

Proposition 3. Assume that K = 2 and that we approximate the probabilities p(y = 1) and p(y = 2) from (2) with the corresponding Bouchard bounds, given by \frac{e^{f_1 - \alpha}}{(1 + e^{f_1 - \alpha})(1 + e^{f_2 - \alpha})} and \frac{e^{f_2 - \alpha}}{(1 + e^{f_1 - \alpha})(1 + e^{f_2 - \alpha})}. These bounds are used to approximate the maximum likelihood solution by maximizing a bound \mathcal{F}(f_1, f_2, \alpha), which is globally maximized for

\alpha = \frac{f_1 + f_2}{2}, \quad f_k = 2 \log N_k + c, \quad k = 1, 2.   (13)

The proof of the above is given in the Supplementary material. Notice that the above estimates are biased, so that the probability of the most populated class (say y = 1, for which N_1 > N_2) is overestimated while the other probability is underestimated. This is due to the factor 2 that multiplies \log N_1 and \log N_2 in (13).

Also notice that the solution \alpha = \frac{f_1 + f_2}{2} is not a general trend, i.e. for K > 2 the optimal α is not the mean of the f_k's. In such cases approximate maximum likelihood estimation based on Bouchard's bound requires iterative optimization. Figure 1a shows some estimated softmax probabilities, using a dataset of 200 points each taking one out of ten values, where f is found by exact maximum likelihood, the proposed one-vs-each bound and Bouchard's method. As expected, estimation based on the bound in Eq. (4) gives the exact probabilities, while Bouchard's bound tends to overestimate large probabilities and underestimate small ones.

¹ Notice that the product in Eq. (4) excludes the value k, while Bouchard's bound includes it.

Figure 1: (a) shows the probabilities estimated by exact softmax (blue bar), one-vs-each approximation (red bar) and Bouchard's method (green bar). (b) shows the 5-class artificial data together with the decision boundaries found by exact softmax (blue line), one-vs-each (red line) and Bouchard's bound (green line). (c) shows the maximized (approximate) log likelihoods for the different approaches when applied to the data of panel (b) (see Section 3). Notice that the blue line in (c) is the exact maximized log likelihood, while the remaining lines correspond to lower bounds.

3 Stochastic optimization for extreme classification

Here, we return to the general form of the softmax probabilities as defined by Eq. (1), where the score functions are indexed by input x and parameterized by w. We consider a classification task where, given a training set {x_n, y_n}_{n=1}^{N} with y_n ∈ {1, . . . , K}, we wish to fit the parameters w by maximizing the log likelihood,

\mathcal{L} = \log \prod_{n=1}^{N} \frac{e^{f_{y_n}(x_n;w)}}{\sum_{m=1}^{K} e^{f_m(x_n;w)}}.   (14)

When the number of training instances is very large, the above maximization can be carried out by applying stochastic gradient descent (by minimizing −\mathcal{L}) where we cycle over minibatches. However, this stochastic optimization procedure cannot deal with large values of K, because the normalizing constant in the softmax couples all score functions, so that the log likelihood cannot be expressed as a sum across class labels. To overcome this, we can use the one-vs-each lower bound on the softmax probability from Eq. 
(4) and obtain the following lower bound on the previous log likelihood,

\mathcal{F} = \log \prod_{n=1}^{N} \prod_{m \neq y_n} \frac{1}{1 + e^{-[f_{y_n}(x_n;w) - f_m(x_n;w)]}} = - \sum_{n=1}^{N} \sum_{m \neq y_n} \log\left(1 + e^{-[f_{y_n}(x_n;w) - f_m(x_n;w)]}\right),   (15)

which now consists of a sum over both data points and labels. Interestingly, the sum over the labels, \sum_{m \neq y_n}, runs over all remaining classes that are different from the label y_n assigned to x_n. Each term in the sum is a logistic regression cost that depends on the pairwise score difference f_{y_n}(x_n; w) − f_m(x_n; w) and encourages the n-th data point to be separated from the m-th remaining class. The above lower bound can be optimized by stochastic gradient descent by subsampling terms in the double sum in Eq. (15), thus resulting in a doubly stochastic approximation scheme. Next we further discuss the stochasticity associated with subsampling remaining classes.

The gradient for the cost associated with a single training instance (x_n, y_n) is

\nabla \mathcal{F}_n = \sum_{m \neq y_n} \sigma\left(f_m(x_n; w) - f_{y_n}(x_n; w)\right) \left[\nabla_w f_{y_n}(x_n; w) - \nabla_w f_m(x_n; w)\right].   (16)

This gradient consists of a weighted sum where the sigmoidal weights \sigma(f_m(x_n; w) − f_{y_n}(x_n; w)) quantify the contribution of the remaining classes to the whole gradient; the more a remaining class overlaps with y_n (given x_n), the higher its contribution is. A simple way to get an unbiased stochastic estimate of (16) is to randomly subsample a small subset of remaining classes from the set {m | m \neq y_n}. 
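A minimal sketch of one such doubly stochastic update for linear scores f_k(x) = w_k^T x follows (our illustration, not the authors' implementation; the helper name and the (K − 1)/S rescaling, which keeps the uniformly subsampled estimate of the sum in Eq. (16) unbiased, are our own choices).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ove_sgd_step(W, x, y, num_neg, lr, rng):
    """One stochastic ascent step on the bound of Eq. (15) for linear
    scores f_k(x) = W[k] @ x, subsampling `num_neg` remaining classes.

    Rescaling by (K - 1) / num_neg makes the estimate of the sum over
    the K - 1 remaining classes in Eq. (16) unbiased under uniform
    subsampling.
    """
    K = W.shape[0]
    neg = rng.choice(np.delete(np.arange(K), y), size=num_neg, replace=False)
    scale = (K - 1) / num_neg
    for m in neg:
        w = sigmoid(W[m] @ x - W[y] @ x)  # sigmoidal weight from Eq. (16)
        W[y] += lr * scale * w * x        # pull the true class score up
        W[m] -= lr * scale * w * x        # push the sampled class score down
    return W

rng = np.random.default_rng(0)
W = np.zeros((5, 2))                      # K = 5 classes, 2-dimensional inputs
x_ex = np.array([1.0, -0.5])
W = ove_sgd_step(W, x_ex, y=2, num_neg=1, lr=0.1, rng=rng)
```

A single update from zero weights already raises the true class score on its own input while leaving the total parameter mass balanced, since every update to W[y] is mirrored by an opposite update to a sampled class.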
More advanced schemes could be based on importance sampling, where we introduce a proposal distribution p_n(m) defined on the set {m | m \neq y_n} that could favor selecting classes with large sigmoidal weights. While such more advanced schemes could reduce variance, they require prior knowledge (or on-the-fly learning) about how classes overlap with one another. Thus, in Section 4 we shall experiment only with the simple random subsampling approach and leave the above advanced schemes for future work.

To illustrate the above stochastic gradient descent algorithm, we simulated a two-dimensional data set of 200 instances, shown in Figure 1b, that belong to five classes. We consider a linear classification model where the score functions take the form f_k(x_n, w) = w_k^T x_n and where the full set of parameters is w = (w_1, . . . , w_K). We consider minibatches of size ten to approximate the sum \sum_n and subsets of remaining classes of size one to approximate \sum_{m \neq y_n}. Figure 1c shows the stochastic evolution of the approximate log likelihood (dashed red line), i.e. the unbiased subsampling based approximation of (15), together with the maximized exact softmax log likelihood (blue line), the non-stochastically maximized approximate lower bound from (15) (red solid line) and Bouchard's method (green line). To apply Bouchard's method we construct a lower bound on the log likelihood by replacing each softmax probability with the bound from (12), where we also need to optimize a separate variational parameter \alpha_n for each data point. As shown in Figure 1c, our method provides a tighter lower bound than Bouchard's method despite the fact that it does not contain any variational parameters. 
Also, Bouchard's method can become very slow when combined with stochastic gradient descent, since it requires tuning a separate variational parameter \alpha_n for each training instance. Figure 1b also shows the decision boundaries discovered by the exact softmax, the one-vs-each bound and Bouchard's bound. Finally, the actual parameter values found by maximizing the one-vs-each bound were remarkably close (although not identical) to the parameters found by the exact softmax.

4 Experiments

4.1 Toy example in large scale non-parametric estimation

Here, we illustrate the ability to stochastically maximize the bound in Eq. (9) for the simple non-parametric estimation case. In this case we can also maximize the bound based on the analytic formulas, and therefore we are able to test how well the stochastic algorithm approximates the optimal/known solution. We consider a data set of N = 10^6 instances, each taking one out of K = 10^4 possible categorical values. The data were generated from a distribution p(k) \propto u_k^2, where each u_k was randomly chosen in [0, 1]. The probabilities estimated based on the analytic formulas are shown in Figure 2a. To stochastically estimate these probabilities we follow the doubly stochastic framework of Section 3, so that we subsample data instances with minibatch size b = 100 and for each instance we subsample 10 remaining categorical values. We use a learning rate initialized to 0.5/b (and then decreased by a factor of 0.9 after each epoch) and performed 2 × 10^5 iterations. Figure 2b shows the final values for the estimated probabilities, while Figure 2c shows the evolution of the estimation error during the optimization iterations. We can observe that the algorithm performs well and exhibits a typical stochastic approximation convergence.

Figure 2: (a) shows the optimally estimated probabilities, which have been sorted for visualization purposes. (b) shows the corresponding probabilities estimated by stochastic optimization. (c) shows the absolute norm for the vector of differences between the exact estimates and the stochastic estimates.

4.2 Classification

Small scale classification comparisons. Here, we wish to investigate whether the proposed lower bound on the softmax is a good surrogate for exact softmax training in classification. More precisely, we wish to compare the parameter estimates obtained by the one-vs-each bound with the estimates obtained by exact softmax training. To quantify closeness we use the normalized absolute norm

\text{norm} = \frac{|w_{\text{softmax}} - w_*|}{|w_{\text{softmax}}|},   (17)

where w_{\text{softmax}} denotes the parameters obtained by exact softmax training and w_* denotes the estimates obtained by approximate training. Further, we also report predictive performance measured by the classification error and the negative log predictive density (nlpd), averaged across test data,

\text{error} = (1/N_{\text{test}}) \sum_{i=1}^{N_{\text{test}}} I(y_i \neq t_i), \quad \text{nlpd} = (1/N_{\text{test}}) \sum_{i=1}^{N_{\text{test}}} - \log p(t_i|x_i),   (18)

where t_i denotes the true label of a test point and y_i the predicted one. We trained the linear multiclass model of Section 3 with the following alternative methods: exact softmax training (SOFT), the one-vs-each bound (OVE), the stochastically optimized one-vs-each bound (OVE-SGD) and Bouchard's bound (BOUCHARD). For all approaches, the associated cost function was maximized together with an added regularization penalty term, -\frac{1}{2}\lambda ||w||^2, which ensures that the global maximum of the cost function is achieved for finite w. 
Since we want to investigate how well we surrogate exact softmax training, we used the same fixed value λ = 1 in all experiments.

We considered three small scale multiclass classification datasets: MNIST², 20NEWS³ and BIBTEX [12]; see Table 1 for details. Notice that BIBTEX is originally a multi-label classification dataset [2], where each example may have more than one label. Here, we maintained only a single label for each data point in order to apply standard multiclass classification. The maintained label was the first label appearing in each data entry in the repository files⁴ from which we obtained the data.

Figure 3 displays the convergence of the lower bounds (and of the exact softmax cost) for all methods. Recall that the methods SOFT, OVE and BOUCHARD are non-stochastic, and therefore their optimization can be carried out by standard gradient descent. Notice that in all three datasets the one-vs-each bound gets much closer to the exact softmax cost than Bouchard's bound does. Thus, OVE tends to give a tighter bound despite not containing any variational parameters, while BOUCHARD has N extra variational parameters, i.e. as many as the training instances. The application of the OVE-SGD method (the stochastic version of OVE) is based on a doubly stochastic scheme where we subsample minibatches of size 200 and subsample remaining classes of size one. We can observe that OVE-SGD is able to stochastically approach its maximum value, which corresponds to OVE.

Table 2 shows the parameter closeness score from Eq. (17) as well as the classification predictive scores. We can observe that OVE and OVE-SGD provide parameters closer to those of SOFT than the parameters provided by BOUCHARD. Also, the predictive scores for OVE and OVE-SGD are similar to SOFT, although they tend to be slightly worse. 
Interestingly, BOUCHARD gives the best classification error, even better than exact softmax training, but at the same time it always gives the worst nlpd, which suggests sensitivity to overfitting. However, recall that the regularization parameter λ was fixed to the value one and was not optimized separately for each method using cross validation. Also notice that BOUCHARD cannot be easily scaled up (with stochastic optimization) to massive datasets, since it introduces an extra variational parameter for each training instance.

Large scale classification. Here, we consider AMAZONCAT-13K (see footnote 4), which is a large scale classification dataset. This dataset is originally multi-labelled [2] and here we maintained only a single label, as done for the BIBTEX dataset, in order to apply standard multiclass classification. This dataset is also highly imbalanced, since there are about 15 classes holding half of the training instances while there are many classes having very few (or just a single) training instances.

Further, notice that in this large dataset the number of parameters we need to estimate for the linear classification model is very large: K × (D + 1) = 2919 × 203883 parameters, where the plus one accounts for the biases. All methods apart from OVE-SGD are impractically slow on this massive dataset, and therefore we consider only OVE-SGD, which is scalable.

We applied OVE-SGD where at each stochastic gradient update we consider a single training instance (i.e. the minibatch size was one), and for that instance we randomly select five remaining classes. 
² http://yann.lecun.com/exdb/mnist
³ http://qwone.com/~jason/20Newsgroups/
⁴ http://research.microsoft.com/en-us/um/people/manik/downloads/XC/XMLRepository.html

Table 1: Summaries of the classification datasets.

Name            Dimensionality   Classes   Training examples   Test examples
MNIST           784              10        60000               10000
20NEWS          61188            20        11269               7505
BIBTEX          1836             148       4880                2515
AMAZONCAT-13K   203882           2919      1186239             306759

Table 2: Score measures for the small scale classification datasets.

          SOFT (error, nlpd)   BOUCHARD (norm, error, nlpd)   OVE (norm, error, nlpd)   OVE-SGD (norm, error, nlpd)
MNIST     (0.074, 0.271)       (0.64, 0.073, 0.333)           (0.50, 0.082, 0.287)      (0.53, 0.080, 0.278)
20NEWS    (0.272, 1.263)       (0.65, 0.249, 1.337)           (0.05, 0.276, 1.297)      (0.14, 0.276, 1.312)
BIBTEX    (0.622, 2.793)       (0.25, 0.621, 2.955)           (0.09, 0.636, 2.888)      (0.10, 0.633, 2.875)

Figure 3: (a) shows the evolution of the lower bound values for MNIST, (b) for 20NEWS and (c) for BIBTEX. For clearer visualization, the bounds of the stochastic OVE-SGD have been smoothed using a rolling window of 400 previous values. (d) shows the evolution of the OVE-SGD lower bound (scaled to correspond to a single data point) in the large scale AMAZONCAT-13K dataset. Here, the plotted values have also been smoothed, using a rolling window of size 4000, and then thinned by a factor of 5.

This leads to sparse parameter updates, where the score function parameters of only six classes (the class of the current training instance plus the five remaining ones) are updated at each iteration. We used a very small learning rate with value 10^{-8} and performed five epochs across the full dataset, that is, in total 5 × 1186239 stochastic gradient updates. After each epoch we halve the value of the learning rate before the next epoch starts. 
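The update pattern just described can be sketched as follows (a hypothetical helper, not the authors' code): only the rows of the true class and the sampled remaining classes are touched, within each row only the features that are non-zero in the input change, and the learning rate is halved between epochs.

```python
import numpy as np

def sparse_ove_update(W, b, x_idx, x_val, y, neg, lr):
    """Sketch of one sparse OVE-SGD update (hypothetical helper).

    `x_idx`/`x_val` hold the indices and values of the non-zero input
    features, so only len(neg) + 1 rows of W, and only those columns,
    are modified.
    """
    def score(k):
        return W[k, x_idx] @ x_val + b[k]
    for m in neg:
        w = 1.0 / (1.0 + np.exp(-(score(m) - score(y))))  # sigmoid weight
        W[y, x_idx] += lr * w * x_val
        b[y] += lr * w
        W[m, x_idx] -= lr * w * x_val
        b[m] -= lr * w

# Learning-rate schedule from the text: halve after each epoch.
lr0 = 1e-8
schedule = [lr0 / 2**epoch for epoch in range(5)]

# One example update on a sparse input with two non-zero features.
W = np.zeros((5, 10))
b = np.zeros(5)
sparse_ove_update(W, b, x_idx=np.array([0, 3]), x_val=np.array([1.0, 2.0]),
                  y=1, neg=[4], lr=0.1)
```

With high-dimensional sparse inputs, touching only the active features of six rows is what makes each of the 5 × 1186239 updates cheap.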
By also taking into account the sparsity of the input vectors, each iteration is very fast and full training completes in just 26 minutes on a stand-alone PC. The evolution of the variational lower bound, which indicates convergence, is shown in Figure 3d. Finally, the classification error on the test data was 53.11%, which is significantly better than random guessing or than a method that always predicts the most populated class (where in AMAZONCAT-13K the most populated class occupies 19% of the data, so the error of that method is around 79%).

5 Discussion

We have presented the one-vs-each lower bound on softmax probabilities and have analyzed its theoretical properties. This bound is the most extreme case of a full family of hierarchically ordered bounds. We have explored the ability of the bound to perform parameter estimation through stochastic optimization in models having a large number of categorical symbols, and we have demonstrated this ability on classification problems.

There are several directions for future research. Firstly, it is worth investigating the usefulness of the bound in applications other than classification, such as learning word embeddings in natural language processing and training recommendation systems. Another interesting direction is to consider the bound not for point estimation, as done in this paper, but for Bayesian estimation using variational inference.

Acknowledgments

We thank the reviewers for insightful comments. We would also like to thank Francisco J. R. 
Ruiz for useful discussions and David Blei for suggesting the name one-vs-each for the proposed method.

References

[1] Yoshua Bengio and Jean-Sébastien Sénécal. Quick training of probabilistic neural nets by importance sampling. In Proceedings of the Conference on Artificial Intelligence and Statistics (AISTATS), 2003.

[2] Kush Bhatia, Himanshu Jain, Purushottam Kar, Manik Varma, and Prateek Jain. Sparse local embeddings for extreme multi-label classification. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 730-738. Curran Associates, Inc., 2015.

[3] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[4] D. Böhning. Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44:197-200, 1992.

[5] Guillaume Bouchard. Efficient bounds for the softmax function and applications to approximate inference in hybrid models. Technical report, 2007.

[6] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324-345, 1952.

[7] Jacob Devlin, Rabih Zbib, Zhongqiang Huang, Thomas Lamar, Richard Schwartz, and John Makhoul. Fast and robust neural network joint models for statistical machine translation.
In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1370-1380, Baltimore, Maryland, June 2014. Association for Computational Linguistics.

[8] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Book in preparation for MIT Press, 2016.

[9] Siddharth Gopal and Yiming Yang. Distributed training of large-scale logistic models. In Sanjoy Dasgupta and David McAllester, editors, Proceedings of the 30th International Conference on Machine Learning (ICML-13), pages 289-297. JMLR Workshop and Conference Proceedings, 2013.

[10] Tzu-Kuo Huang, Ruby C. Weng, and Chih-Jen Lin. Generalized Bradley-Terry models and multi-class probability estimates. Journal of Machine Learning Research, 7:85-115, December 2006.

[11] Shihao Ji, S. V. N. Vishwanathan, Nadathur Satish, Michael J. Anderson, and Pradeep Dubey. BlackOut: Speeding up recurrent neural network language models with very large vocabularies. CoRR, abs/1511.06909, 2015.

[12] Ioannis Katakis, Grigorios Tsoumakas, and Ioannis Vlahavas. Multilabel text classification for automated tag suggestion. In Proceedings of the ECML/PKDD-08 Workshop on Discovery Challenge, 2008.

[13] Mohammad Emtiyaz Khan, Shakir Mohamed, Benjamin M. Marlin, and Kevin P. Murphy. A stick-breaking likelihood for categorical data analysis with latent Gaussian models. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2012, La Palma, Canary Islands, April 21-23, 2012, pages 610-618, 2012.

[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111-3119. Curran Associates, Inc., 2013.

[15] Andriy Mnih and Yee Whye Teh.
A fast and simple algorithm for training neural probabilistic language models. In Proceedings of the 29th International Conference on Machine Learning, pages 1751-1758, 2012.

[16] F. Morin and Y. Bengio. Hierarchical probabilistic neural network language model. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, pages 246-252, 2005.

[17] Ulrich Paquet, Noam Koenigstein, and Ole Winther. Scalable Bayesian modelling of paired symbols. CoRR, abs/1409.2824, 2014.

[18] Jeffrey Pennington, Richard Socher, and Christopher Manning. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532-1543, Doha, Qatar, October 2014. Association for Computational Linguistics.

[19] Sudheendra Vijayanarasimhan, Jonathon Shlens, Rajat Monga, and Jay Yagnik. Deep networks with large output spaces. CoRR, abs/1412.7479, 2014.