NIPS 2016
Mon Dec 5th through Sun the 11th, 2016 at Centre Convencions Internacional Barcelona
Paper ID: 2068 One-vs-Each Approximation to Softmax for Scalable Estimation of Probabilities

### Reviewer 1

#### Summary

The authors introduce a lower bound on the softmax probability frequently used in multi-class learning, based on combining one-vs-one probabilities. They demonstrate that this bound can be used to train large-scale multi-class models.

#### Qualitative Assessment

In my view, the main reason the proposed lower bound is interesting is that it offers a potential way to speed up training for multi-class models with a very large number of classes. While it is useful to understand other properties of the lower bound, the paper could be improved by emphasizing this primary use case in machine learning. Figure 1c and Figure 3 need a more clear explanation of what is being displayed, and why it is important. In particular, what value is being plotted on the y-axis, and at what setting of the parameters w. Here is how I understand it, for Figure 1c say: Blue Line - value of Eq. (13) at the setting of parameters w that maximize 13 Red Line - value of Eq. (13) at the setting of parameters w that maximize 14 Green Line - value of Eq. (13)? at the setting of parameters w that maximize the Bouchard lower bound (?) Red dashed line - value of Eq. (13)? at parameters w based on the given iterations of training? But if it is always the actual log-likelihood on the y axis (Eq 13), then I'm not sure why the y-axis is labelled "lower bound" and you use the phrase "maximized (approximate) log likelihoods". The same comments apply to Figure 3. In order to effectively demonstrate that the proposed bound can significantly speed up training, additional experiments and comparisons are needed. Most importantly, another natural proxy loss that can be used in training is to only compute the softmax over a random sample of the full output space (vocabulary.) I believe this is a fairly standard trick (e.g., implemented in tensorflow as tf.nn.sampled_softmax_loss, https://www.tensorflow.org/versions/r0.9/api_docs/python/nn.html#sampled_softmax_loss). See also Section 3 of http://arxiv.org/pdf/1412.2007v2.pdf and references therein. A critical parameter in both approaches is the number of classes used in the approximation of the softmax. I'd like to see that parameter explored, with some guidance on how to choose it in practice. For example, in Figure 2(c) I'd like to see lines for the number of classes in the stochastic approximation ranging from 1 to 10^4, as well as a line for SGD with computation of the full softmax. But ideally this parameter would be explored on the real datasets as well, with the key metric being wall-clock-time to reach a given level of accuracy. For the small-scale experiments, in Table 2, the Softmax column has its fields in a different order, and is (critically) missing the error column. It's not clear from the text if the metrics are on the training set or the test set --- I assume the test set. But in that case it's hard to dismiss the Bouchard method as overfitting if it is actually getting the lowest test-set error. This method is more computationally expensive, but if that is the only downside, you should include experiments that show this (in particular to justify not including it in the comparison on the large-scale problem). For the amazon-13k results, no comparisons are given, so it is hard to draw conclusions. To be meaningful, there should be a comparison to the sampled softmax mentioned above. You could also compare to full softmax but with a random projection or hashing trick on the input features to a lower-dim space (few thousands) where this would be more practical --- this would be a natural baseline. Again, since computational speedups are the main potential advantage here, those should be clearly illustrated. Minor comments: Figure 1(a): I assume these are probabilities estimated by optimizing using the given loss, but then with probabilities all computed using Eq (2) (rather than say computing estimated probabilities for the Bouchard method using Eq. (11)). Please clarify this. Several typos and missing words, e.g. line 10 (allows *us*), 37 (*wheneven*), 114 (Not only *is F(f)* globally …), 122 (in order *for* the bound). Line 244 should refer to Table 2 not Table 1. Lines 235-237: These methods are all naturally singly stochastic, and can (and probably should) be solved with SGD (sampling examples) not GD. It should be clarified that while the global optimum is the same for Eqs. (6) and (8), it is not the case that (13) and (14) share the same minimum.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 2

#### Summary

To reduce the expensive computation of softmax over multiple classes, this paper presents an one-vs-one approximation for the estimation of probabilities.

#### Qualitative Assessment

Huang et al. presented "Generalized Bradley-Terry Models and Multi-class Probability Estimates", JMLR 7 (2006) 85–115, which also estimates the multi-class probability based on pairwise models. Even BT model has been proposed for very long time. I am surprised that the authors did not aware this paper, and missed the comparison. Huang's model is also very simple to compute. Therefore, I am not clear what is the significance of this paper. The authors should do proper analysis and comparison with this classic work before submission.

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 3

#### Summary

Proposes a better lower-bound for popular softmax multi-class probability models that decomposes into a product over pairs-of-classes, which the authors then suggest stochastically approximating. Good set of experimental results suggests this lower bound approximation works better than prior art and works well in general.

#### Qualitative Assessment

An elegantly simple bound for a practical problem that leads to a practical strategy to stochastically approximate the bound by sampling pairs of classes, producing an unbiased estimate. Nice side-comments on reducing variance. The resulting strategy of stochastically sampling classes has been found in practice to work well, see e.g. Weston et al. JMLR 2015 "Training Highly Multiclass Classifiers" for survey of such WARP/WSABIE methods, and this work seems to add to the theoretical motivation for such strategies. Some of that work should probably be mentioned here. My biggest concern is this seems a lot like a 2005 NIPS paper by Huang and has a better modeling story (Bradley-Terry): http://papers.nips.cc/paper/2705-a-generalized-bradley-terry-model-from-group-competition-to-individual-skill.pdf hope the authors can help differentiate any novelty here? Curious how the stochastic approximation of the lower bound (4) compares to a stochastic approximation of the original (2). Consider adding a couple comments on what types of distributions (e.g. uniform vs K-sparse for small K) this approximation performs best at. Well-written and well-argued. Typos: "in order the bound to become" "bayesian" in refs -> "Bayesian"

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 4

#### Summary

The paper takes a softmax to model a categorical random variable. Estimating its parameters is hard: for K classes, on observing class y=k, the softmax's normalizing constant incorporates all other K-1 classes too. The problem is everywhere. In a neural language model, for instance, on seeing word y="cat" given a context, the parameters of all other K-1 words would also have to be adjusted to tell us that they don't describe the context. Traditionally, there's been a number of ways around it. (1) Ignore the softmax's complexity, and evaluate its gradient's fully. (2) Hack it by "negative sampling", like word2vec. (3) Lower-bound it, and optimize the bound -- but the bound could be too loose to be practical. (4) Use some contrastive (background noise) distribution, like "noise contrastive estimation", and turn the problem into classification problems that approach the real gradient as noise samples increase to infinity... This paper introduces a neat new bound. It's a lower bound, but with the property that at the bound's maximum, the parameters would also maximize the softmax. Even though the bound is not tight at its maximum (as is evident in Fig 3). We like this property, as Proposition 2 in the paper tells us that we can plug in the bound when we learn parameters, and find the same maximum. The main point that the paper makes (and could do more to reinforce through experimental illustrations) is that even though the bound *still* makes use of around K terms, it decomposes into a sum, which works well with SGD.

#### Qualitative Assessment

I was tempted to put "oral level" for potential impact, as this could have substantial practical implications. The reason for "poster level" and not "oral" is that some more convincing arguments are needed for it's widescale use. In Ilya Sutskever's NIPS talk / paper on Sequence to Sequence learning, I found his remark that they used the full softmax gradients all the way quite strange, as there are many techniques to speed it up: why not try Gutmann & Hyvarinen's "Noise Contrastive Estimation"? As the softmax sits at the heart of so many state of the art (deep) Neural Network models, why not show that you can get a 100-fold speed-up in training time, on some machine translation task, just by plugging in the paper's bound? The comparison stated above could also tease out some other effects: if the softmax is used at the end or output of a highly non-linear model, does plugging in the bound give you a different solution? It changes the objective function, despite the equivalent maximums (given in Prop 2). We know that the 1:1 bound is a bound, so in Fig 3 we'd expect OVO to be below SOFT. The paper could be improved by plugging OVO's parameters into SOFT, and plotting *that* curve. It should be the same as SOFT, and if it is, it brings home the point of the paper a lot better. One can differentiate between using the bound for training, and for inference. As maximizing is doubly-stochastic, a nice addition could be to show how much more variance is introduced into the gradients, and how to mitigate that. In summary, the bound is practically useful due to its decomposition as a product (or sum) and the way it fits SGD, and it having the same maximizing parameter settings as a softmax. If the doubly-stochastic SGD's variance is accounted for, it might even become a fairly commonly used technique. ---------------- Authors, after your rebuttal response: I realised I missed some useful pointers. To Böhning's bound, and Khan et al.'s approach from "Variational bounds for mixed data factor analysis" (NIPS 2010).

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)

### Reviewer 5

#### Summary

This paper proposes a lower bound on softmax probabilities based on estimating all one-vs-one class probabilities allowing efficient estimation of the class probabilities by subsampling both data samples and classes. The proposed lower bound is compared to exact softmax probabilities in both a stochastic and non-stochastic version as well as to previously published variational lower bound using both artificial and real data sets.

#### Qualitative Assessment

Overall the paper is well written and easy to follow. The theory section gives a concise presentation of the proposed method and the proofs seems sound. The proposed method is of most interest when used as a stochastic approximation to the exact softmax probabilities for the case of large N and K. In Figure 2 the authors compare the stochastic and exactly estimated lower bound on the class probabilities showing good agreement. However how close are these estimates to the exact softmax probabilities? In table 2 a number of results are presented for different datasets (I assume that the labels of the softmax results should have been [error, nlpd], also please use SOFT, BOCHARD, OVO, OVO-SGD as column names as in the text). The nlpd results are close to the softmax values for the ovo methods. However the classification performances are slightly worse and the BOUCHARD method actually gives the best performance. I don’t understand the explanation of this, L249: “BOUCHARD gives the best classification error, even better than the exact softmax training, but at the same time it always gives the worst nlpd which suggests sensitivity to overfitting”. ? I assume the reported values are test-set performances obtained by optimizing performance on a validation set? Since BOUCHARD is a less tight lower bound it makes sense that nlpd is worse, however it could be interesting with a better explanation of why this method gives better classification performance compared to the authors method which is usually what people care about in the end? Overall I find this paper interesting. What I currently miss in the presented work are 1) The proposed method is presented as an efficient approximation to the softmax distribution. I would like some quantitative results on this, i.e. runtime for the different approaches. Secondly I would like some more discussion on the trade-off between number of stochastic samples (classes and data-points) versus the obtained accuracy of the estimated lower bound? 2) A larger comparison on a real dataset using a realistic sized model such as substituting the OVO for softmax in i.e. an ImageNET model or in a NLP-task such as translation? 3) Related to the above question. How is the convergence rate of OVO compared to Softmax using stochastic optimization, i.e. is more iteration/runtime needed to reach similar solutions?

#### Confidence in this Review

2-Confident (read it all; understood it all reasonably well)

### Reviewer 6

#### Summary

This paper presents a new lower bound on softmax probability that takes the form of a product over 1-vs-1 pairwise probability, and this enables a scalable estimation based on stochastic optimization. It shows that this bound is tighter and more computation efficient than Bouchard’s bound, and the experiments demonstrate the superiority of the new approximation over the Bouchard’s bound, and effectiveness of the scalable estimation on small and large datasets.

#### Qualitative Assessment

The derivation and analysis of the new bound is solid. My concern is mainly around the experimental methods. I think the comparison to the Bouchard’s bound isn’t enough, which isn’t used anywhere in this context before and clearly not the state-of-the-art. So, additional comparisons with many other state-of-the-art alternatives should be made. These alternative methods include important sampling-based [Bengio et al, 2003], NCE, and their extension blackout “Ji et al, 2015, blackout: speeding up RNNLM with very large vocabularies”. The authors cited the NCE paper without any comparisons, but missed IS-based [Bengio et al, 2003] and the blackout paper, which is a unification and extension of IS and NCE. All of them are shown to be a good approximation of softmax gradient, which is what really matters; see “Mnih, The 2012, A fast and simple algorithm for training NLM”. Compared to the gradient approximation in Eq. 15, it’s unclear if the proposed one will be better. Related work: in addition to the IS-based and the blackout paper mentioned above, there is at least one more related work is missing “Devlin et al, 2014, Fast and robust NN joint models for SMT”, which considers optimization approaches to make the partition function to be close to 1 (self normalization). Experiments: Additional comparisons to IS, NCE or BlackOut are needed for a fair comparison since they are state-of-the-arts; In table 2, the “errors” of “Softmax” are missing. This data is important to see the overfitting argument below. Only one result (ovo-sgd) on large scale data set (Amazon) is provided. Given it only takes ovo-sgd 26 mins to produce result on a stand-alone PC, it should be feasible to do all the experiments on this dataset as in table 2. This will make the claims stronger. Why does Bouchard give the best classification error while has the worst NLPD? The authors only say a few words and suspect the overfitting. Since all the methods optimize NLPD, does this mean all the other methods except Bouchard have overfitting issues? Typos and inconsistencies: Line 37, wheneven -> whenever Line 125, Eq. (3) -> Eq. (4) Line 223: lineal -> linear Line 244: Table 1 -> Table 2 Table 2: Product -> ovo

#### Confidence in this Review

3-Expert (read the paper in detail, know the area, quite certain of my opinion)