Paper ID: 1112
Title: Stochastic Multiple Choice Learning for Training Diverse Deep Ensembles
This paper proposes a new stochastic gradient descent algorithm for training ensemble models. The authors build upon the work of Guzman-Rivera et al. [8], in which the loss of the ensemble is the loss of the best output produced by any single classifier in the ensemble; this yields a diverse set of classifiers. Their contribution is a stochastic gradient descent version of this training procedure, which makes the approach applicable to deep neural networks. The paper is interesting and clearly written.
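For reference, the oracle loss being described, in the notation I would assume from [8] (ensemble members $f_1,\dots,f_M$ and a per-example loss $\ell$), is

$$\mathcal{L}_{\text{oracle}} = \sum_i \min_{m \in \{1,\dots,M\}} \ell\big(y_i, f_m(x_i)\big),$$

so each example is charged only to the member that currently handles it best, which is what pushes the members to specialize.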
My biggest concern is whether the algorithmic novelty of the paper is more than an incremental step over [8]. Since the authors convinced me that the small change they introduce makes the algorithm much more useful, I am in favor of acceptance. I believe an interesting comparison would be to run the original MCL algorithm on mini-batches (rather than the entire dataset). I would also like to see more analysis of the "diversity" effect.
2-Confident (read it all; understood it all reasonably well)
The paper proposes a new method for training an ensemble of Deep Neural Networks (DNNs) (or any other gradient-based learners) by using an oracle (which knows which member of the ensemble is optimal for the current instance) over each data instance used for training, applying backpropagation only to the network whose output is selected. The approach has the advantage of letting the networks specialize, giving ensembles that cover well all (or most) possible outputs of interest. Results are reported on a variety of domains, with overall strong performance.
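As a minimal sketch of the training step described here, assuming hypothetical PyTorch models and optimizers (this is my reading of the procedure, not the authors' code):

```python
import torch
import torch.nn.functional as F

def smcl_step(members, optimizers, x, y):
    """One sMCL-style update: each example sends its gradient only to the
    ensemble member that currently achieves the lowest loss on it."""
    # Per-member, per-example losses, shape (M, batch_size).
    losses = torch.stack([F.cross_entropy(m(x), y, reduction='none') for m in members])
    winners = losses.argmin(dim=0)  # index of the best ('oracle') member per example
    # Average only the winning losses; backprop then touches only those members.
    oracle_loss = losses.gather(0, winners.unsqueeze(0)).mean()
    for opt in optimizers:
        opt.zero_grad()
    oracle_loss.backward()
    for opt in optimizers:
        opt.step()
    return oracle_loss.item()
```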
I think this is a pretty good paper. It presents well an interesting idea: training an ensemble of DNNs with an oracle that directs the backpropagation update to the network with the best output (according to the true label). It appears to enforce specialization quite well in a classification context. The overall presentation is really good and the experiments cover several domains. Globally, a strong piece of work that deserves to be published. I would have considered this paper outstanding if the exploitation of ensembles of DNNs trained with the proposed approach had also been addressed. Currently, training is done with the oracle, but the use of the resulting ensemble is not developed further, beyond saying that it can provide a broad pool of possible results and that the approach is consistent with the top-k metric used for some problems (e.g., ImageNet). This is an important missing piece for making the proposed approach fully usable in practice. Nevertheless, the contribution stands in its current form, the missing part being relatively problem-specific. Fig. 2 presents pseudo-code that is in fact a raster image embedded in the paper. Why not typeset the pseudo-code directly in the paper, or at least embed the pseudo-code figure as a vector image?
2-Confident (read it all; understood it all reasonably well)
This paper proposes a simple solution to the multiple choice learning (MCL) problem when the learners are deep networks. Current solutions to MCL do not directly extend to deep networks. This paper shows that a simple winner-takes-the-gradient rule in backpropagation allows the authors to combine the training of the learners with the assignment problem in an ensemble setting. Experimental results show that the method performs well across a wide range of problems: image classification, image segmentation, and captioning.
Strengths:
- In general, the problem of MCL for deep learning is of sufficient importance to the NIPS community.
- I like the simplicity of the proposed method.
- Experiments show results on 3 different tasks.

Weaknesses:
- My biggest concern with this paper is that it motivates "diversity" extensively (the word diversity is even in the title) but the model does not enforce diversity explicitly. I was excited to see how the authors managed to get a diversity term into their model and was disappointed to learn that there is none.
- The proposed solution is an incremental step given the relaxation proposed by Guzman-Rivera et al.

Minor suggestions:
- The first sentence of the abstract needs to be rewritten.
- The emphasis on diversity should be toned down.
- Line 108: the first "f" should be "g" in "we fixed the form of ...".
- There is an extra "." in the middle of a sentence in line 115.

One question: For the MCL-with-deep-learning baseline, how did the authors ensure that each of the networks converged to reasonable results? Cutting the learners off early might significantly affect ensemble performance.
3-Expert (read the paper in detail, know the area, quite certain of my opinion)
The authors formulate a stochastic gradient descent method for multiple choice learning. They claim advantages for the proposed framework on multiple tasks such as classification, segmentation, and captioning.
1. The sMCL method is somewhat incremental in novelty. The motivation behind MCL is clear, but it cannot be considered a contribution of this work. This work is essentially a neural-network version of [8] with some adjustments.
2. The paper is not well written. The technical part of this work (Sec. 3) takes only one page. What is the difference between the proposed framework and alternating training, which has been widely adopted when multiple tasks are trained together?
3. The authors conduct experiments on many computer vision tasks, such as classification, image segmentation, and captioning. However, the experiments lack strong baselines and the improvement in performance is small. To demonstrate the effectiveness of the framework, the authors should perform experiments on PASCAL VOC 2012 to compare against stronger baselines.
1-Less confident (might not have understood significant parts)
The authors propose a method to generate multiple outputs by jointly learning an ensemble of deep networks that minimizes the oracle loss, which is useful when interacting with users. The oracle-metric experiments on three tasks demonstrate the effectiveness of the proposed method.
This paper integrates deep networks into the MCL paradigm and proposes a stochastic block gradient descent optimization method to minimize the oracle loss of MCL. The proposed optimization method can easily be implemented within the back-propagation learning framework, and the whole model can be trained in an end-to-end fashion.
* My concern is how important the initialization is. What if the models are initialized with random values instead of a pre-trained network?
* The authors do not explain much about the comparison setting. Do they use the same pre-trained network to initialize the baseline methods as they do for sMCL?
* The authors use oracle accuracy as the evaluation metric. What about the diversity of the ensemble members? I expect to see more analysis of this (see the sketch below).
* Besides, there is work with a very similar idea that the authors are encouraged to compare with: Stefan Lee, Senthil Purushwalkam, Michael Cogswell, David J. Crandall, Dhruv Batra. Why M Heads are Better than One: Training a Diverse Ensemble of Deep Networks. CoRR abs/1511.06314 (2015).
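To make the diversity point concrete, a small sketch (hypothetical helper, assuming M >= 2 classification members; not from the paper) of how oracle accuracy and a simple pairwise-disagreement measure of diversity could be reported together:

```python
import torch

@torch.no_grad()
def oracle_and_diversity(members, x, y):
    """Oracle accuracy: an example counts as correct if ANY member predicts it
    correctly. Diversity: average pairwise disagreement between member predictions."""
    preds = torch.stack([m(x).argmax(dim=1) for m in members])  # (M, batch_size)
    oracle_acc = (preds == y.unsqueeze(0)).any(dim=0).float().mean().item()
    M = preds.size(0)  # assumes M >= 2 so at least one pair exists
    pairwise = [
        (preds[i] != preds[j]).float().mean().item()
        for i in range(M) for j in range(i + 1, M)
    ]
    return oracle_acc, sum(pairwise) / len(pairwise)
```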
2-Confident (read it all; understood it all reasonably well)
The paper provides an ensemble training algorithm for deep learners. It is unique in that only one learner of the ensemble is active for each training example, unlike previous deep and classical ensembles. During training, the member with the lowest loss with respect to the true label is identified and updated with the gradient. Overall it is a simple idea that leads to surprisingly good experimental results and fixes some deficiencies of existing methods; thus I recommend weak acceptance.
(Cont'd from summary)
1. The testing phase is not clearly explained; I suspect they use the ground truth to find the best member, as explained in line 159 on page 5.
2. In the figures and text they refer to 'independent ensembles' and 'regular ensembles' (denoted 'Indp.' in the figures and in line 181 on page 5). I could not find an explanation of this baseline.
3. The authors repeatedly point to 'specialization' of ensemble members towards different output classes. However, the reassignment during training is on a per-example basis, so I feel this argument is not sufficiently justified. Can you incorporate specialization into the loss function?
4. Related to the above, the authors did not discuss the closely related work on diversity regularization in neural networks (e.g., Chen et al.). That technique is more rigorous but has yet to be studied in the deep setting.
2-Confident (read it all; understood it all reasonably well)