Paper ID: 711
Title: Reservoir Boosting: Between Online and Offline Ensemble Learning
Reviews

Submitted by Assigned_Reviewer_2

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a boosting algorithm, GEEM, which keeps a limited number of samples in a reservoir while learning incrementally. The key to GEEM is a Gaussian approximation of the distribution of the match between hypotheses and samples, and the paper shows the effectiveness of this simple approximation. GEEM keeps updating the approximation, chooses the optimal hypothesis based on it, and selects the samples to retain via greedy subset selection. In experiments against 3 reservoir strategies, 3 online boosting methods and 3 sub-sampling methods on typical datasets, GEEM is shown to outperform the other methods on most of the datasets. I think quality, clarity, originality and significance are good.
Q2: Please summarize your review in 1-2 sentences
This paper proposes a pool-based boosting algorithm, GEEM, based on a Gaussian approximation of the match between hypotheses and samples, and shows the effectiveness of this approximation in experiments against typical online and subsampling methods. I think quality, clarity, originality and significance are good.

Submitted by Assigned_Reviewer_5

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
Summary of the paper:

The authors describe a boosting technique in which each weak learner is trained
on a small subset of the training sample as in FilterBoost and related
techniques. In each iteration, the subset is doubled and half of the total set
is then discarded. The deletion is based on maximizing the expected edge of the
classifier on the full reservoir set, given a Gaussian model of the pointwise
classifications.
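
To fix the structure under discussion, a minimal sketch of this loop in Python (all names are placeholders of my own, not the authors' code; the random select_half stands in for the paper's Gaussian expected-edge criterion, and the mean-threshold stump for an arbitrary weak learner):

import random

def train_weak_learner(samples):
    # Placeholder weak learner: a mean-threshold stump on scalar inputs.
    threshold = sum(x for x, _ in samples) / len(samples)
    return lambda x: 1 if x > threshold else -1

def select_half(pool, r):
    # Placeholder for the paper's selection criterion; here, a random half.
    return random.sample(pool, r)

def reservoir_boosting(stream, r, T):
    # stream yields (x, y) pairs; r is the reservoir size, T the number of rounds.
    reservoir = [next(stream) for _ in range(r)]   # initial fill
    ensemble = []
    for _ in range(T):
        fresh = [next(stream) for _ in range(r)]   # double the working set
        pool = reservoir + fresh                   # |pool| = 2r
        reservoir = select_half(pool, r)           # discard half of the pool
        ensemble.append(train_weak_learner(reservoir))
    return ensemble

Boosting weights and ensemble coefficients are omitted; the sketch only shows
the grow-then-halve reservoir mechanics.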

Detailed remarks (in order of appearance):

26: computer

36: learns

86: 23?

96: What is A here?

214: It's strange to put an index in the exponent, especially d, since it
looks like an exponentiation.

232: It's strange to use this pseudo-code-like notation instead of "static"
math. Why not simply define Sigma_AA as a sum over d of the right-hand side?

295: enter

330: samples don't have an edge.

331: The instability using the exponential loss is a major concern. Where does
it come from? Why is it stable for the logistic loss? Can you show any results
on whether the algorithm converges using any loss?

The experiments are meaningful in the sense that they show that the proposed
sampling strategy is better than the other strategies, given the same number
of iterations and pool size. Still, what is the practical goal? We do
subsampling to save computational time or memory, so I would like to see the
errors not after a given number of iterations, but after spending the same
computational time. The overhead is asymptotically not that big, but still,
the small difference between WSam and GEEM might be compensated by the fact
that WSam can use more samples or can iterate a bit more, while GEEM uses its
overhead for smarter sampling.

Q2: Please summarize your review in 1-2 sentences
The idea is interesting, the paper is well-written, the computational
complexity, which is a crucial aspect of the algorithm, is well analyzed. The
experiments show a slight but significant improvement over the simpler WSam
strategy, although I would like to see the results not at the same number of
iterations, but at the same computational (training) budget. My main theoretical
concern is that no boosting-like convergence theorems are shown.

Submitted by Assigned_Reviewer_6

Q1: Comments to author(s). First provide a summary of the paper, and then address the following criteria: Quality, clarity, originality and significance. (For detailed reviewing guidelines, see http://nips.cc/PaperInformation/ReviewerInstructions)
This paper proposes a new boosting method that represents a tradeoff between online and offline learning. The main idea of the method is to maintain a reservoir of training examples (of fixed size) from which to train the weak learners. At each boosting iteration, new examples are added to the reservoir and then a selection strategy is used to reduce the reservoir to its original fixed size before the weak learner is trained. Several naive selection strategies are proposed but the main contribution of the paper is a more sophisticated selection strategy whose goal is to remove examples from the reservoir so that a weak learner trained on the reduced set will minimize the error computed on the whole set before reduction. The resulting algorithm is applied on four computer vision datasets, where it is shown to outperform several other online boosting methods.

The idea of using a reservoir is original and very interesting. The derivation of the GEEM algorithm from this initial idea is technically sound and the authors have been very careful in designing an efficient solution. Experiments are done on only four datasets but these datasets represent real computer vision problems of practical interest. GEEM is compared with many different methods, including some well-chosen baselines for the reservoir framework. The authors seem to have taken great care to put other methods in the best conditions for the comparison and despite this, the performance of GEEM is clearly superior to all other methods.

While I think that the reservoir idea is very interesting per se, it's not clear which particular learning setting or constraints the authors want to address with this idea in the paper. Do they want to reduce computing times, to reduce memory usage, to obtain the best possible predictive accuracy, or a combination of those? This lack of a clear definition of the problem makes the assessment of the significance of the method difficult. It clearly works better than other online methods in terms of accuracy, but computing times and memory usage seem worse. In terms of predictive performance, GEEM is also probably worse than offline boosting (but this baseline is actually not provided). I think that the paper should more clearly define the specific constraints the method is targeting and then make the experiments better reflect such constraints. At the moment, only test accuracies are compared in the experiments and the authors have only put a constraint on the number T of weak classifiers. In such conditions, offline boosting is probably the best method.

Some design choices are also not totally convincing. In Section 3.1, the idea of selecting the subset B so that it is as representative as possible of the entire set A does not seem so natural to me. Ideally, the subset B should be selected so that the weak learner that will be obtained from it will improve the current model as much as possible. Making it as close as possible to the weak learner trained from A does not make sense when A is not a good subset to train a weak learner. So, a more natural selection criterion for B should not necessarily depend on A. The choice of a perfect model h* in 3.3 is justified, but why not instead simply look for the optimal weak learner from A and then project it onto all Bs during the greedy search? The computational cost of this operation would probably be negligible with respect to the cost of the computation of the covariance matrix.
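
For concreteness, here is one plausible reading of the selection criterion under discussion, reconstructed from the summary above rather than from the authors' exact formulas: model the match vector e_i = y_i h(x_i) over the pool A as jointly Gaussian with moments (mu, Sigma) estimated from a few probe weak learners, posit a perfect h* on a candidate subset B (e_B = 1), and score B by the conditional expectation of the weighted edge on A.

import numpy as np

def expected_edge(mu, Sigma, w, B):
    # E[w . e_A | e_B = 1] under a joint Gaussian model of the match vector e.
    # mu, Sigma: estimated moments of e over A; w: boosting weights over A;
    # B: list of candidate indices. A small ridge term keeps the solve stable.
    S_BB = Sigma[np.ix_(B, B)] + 1e-6 * np.eye(len(B))
    S_AB = Sigma[:, B]
    # Gaussian conditioning: E[e_A | e_B = 1] = mu + S_AB S_BB^{-1} (1 - mu_B)
    cond_mean = mu + S_AB @ np.linalg.solve(S_BB, 1.0 - mu[B])
    return float(w @ cond_mean)

def greedy_select(mu, Sigma, w, r):
    # Grow B one index at a time, each time adding the sample that maximizes
    # the expected edge of a learner trained (perfectly) on B.
    n = len(mu)
    B = []
    for _ in range(r):
        rest = [i for i in range(n) if i not in B]
        B.append(max(rest, key=lambda i: expected_edge(mu, Sigma, w, B + [i])))
    return B

The naive search above re-solves a linear system for every candidate and is only meant to fix ideas; the paper presumably uses incremental updates to keep the selection tractable.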

The paper is well written and mostly clear, but it would be very difficult to re-implement the algorithm given its complexity, and also to reproduce the experiments, as the settings of all tested algorithms are mostly unknown (see some questions below). The authors will, however, share their implementation upon publication.

Other comments
- I find it difficult to see how LogitBoost could be combined with the algorithm in Table 2, since LogitBoost as defined in (Friedman, Hastie, and Tibshirani, 1998) uses weak learners in the form of regressors and updates both the weights and the responses of the examples. Giving a description of the whole algorithm could help.
- It's not clear how comparable the different algorithms are in the experiments. In particular, are all algorithms actually seeing the exact same number of examples, i.e., is the number of calls to the filter function the same in each case?
- In the algorithm of Table 2, why is h^t built from R_t and not from R_{t-1} U Q_{t-1}? Since R_t is selected so that the weak classifier built from it will look as much as possible like the weak classifier built from R_{t-1} U Q_{t-1} and since the computational cost for deriving R_t is higher than the computational cost for selecting h^t from R_{t-1} U Q_{t-1}, I don't see any reason not to train h^t from R_{t-1} U Q_{t-1}.
- The authors show that the FIX method needs 10 times more examples at each iteration than GEEM to reach similar accuracy. However, FIX is forced to always use the same examples at each iteration while GEEM samples 100 or 250 new examples at each iteration, meaning that in the end, GEEM has seen 25,000 or 62,500 fresh examples in total. So, I'm not sure the comparison is really fair. I would be interested to see the results of applying offline boosting to the same number of examples as seen by GEEM, to see what we lose by working with a fixed reservoir.
- Also, why is FIX using AdaBoost while GEEM uses LogitBoost?
- The size r seems to be a sensitive parameter that should not be chosen too small (as the weak learner is trained on r examples). The algorithm can thus not be turned into a truly online method. I would like to see a study of the impact of the value of this parameter, in particular to see how large r has to be to reach a reasonable accuracy.
Q2: Please summarize your review in 1-2 sentences
A technically sound and original boosting algorithm representing an interesting tradeoff between offline and online learning, with good performance on four computer vision datasets. The problem setting or constraints targeted by this algorithm are however not clear.
Author Feedback

Q1:Author rebuttal: Please respond to any concerns raised in the reviews. There are no constraints on how you want to argue your case, except for the fact that your text should be limited to a maximum of 6000 characters. Note however that reviewers and area chairs are very busy and may not read long vague rebuttals. It is in your own interest to be concise and to the point.
We would like to thank the reviewers for their helpful comments and take the opportunity to address some of their concerns.

Assigned Reviewer 6 asks for some clarification concerning the motivation behind the work. Our goal is to investigate learning, and specifically boosting, in a setting of limited memory, i.e. a setting where only a limited amount of data can be loaded in memory and used for training at each iteration. We would thus like to investigate the area between online (no memory) and offline (no memory constraints) learning, showing that GEEM outperforms online methods (i.e. the use of a reservoir helps) as well as sampling techniques such as MadaBoost and WSS, which can be used when memory is limited (though these methods typically have access to the entire dataset, something not afforded to the reservoir strategies).

To our knowledge there is no prior work in this field, as previous work in budgeted learning seems to budget the classifier representation (kernel expansion) rather than the amount of working data allowed to be kept in memory.

Concerning the Fix baseline, we meant it to serve as an indication of the performance of offline boosting. We show that offline boosting needs 10 times more memory to obtain the same results as GEEM. If Fix were to see as many samples as GEEM (i.e. the entire dataset), it would need, for CIFAR, 200 times more memory than GEEM. We could add results for such a setting, as the reviewer suggested, to show the effect of memory constraints. Of the presented baselines, only Fix is restricted in the samples it views; the remaining baselines view the same number of samples. Finally, we note that Fix uses AdaBoost because in this case there is no instability and we obtained slightly better results than with the logistic loss.
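
As a rough check of the factor of 200 above (under the assumption, consistent with that factor, that the reservoir holds r = 250 samples): the CIFAR training set contains 50,000 samples, and 50,000 / 250 = 200.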

Assigned Reviewer 5 raises concerns with regard to the instability of the exponential loss. With such an aggressive loss, the weight of misclassified samples may be orders of magnitude higher than the weights of all properly classified ones. Hence, particularly in the case of noisy labels, the choice of weak learners oscillates, taking care of one population of samples, then another, etc. The obvious way of fixing such behavior would be through the use of a learning rate, close in spirit to stochastic gradient descent. However, to keep the article's message clear, we preferred to use a less aggressive loss as a simpler way of regularizing the learning.
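
Concretely, in the gradient view of boosting with a strong classifier F and labels y_i in {-1, +1}, the exponential loss exp(-y_i F(x_i)) induces sample weights w_i = exp(-y_i F(x_i)), which grow without bound as a sample becomes more misclassified, whereas the logistic loss log(1 + exp(-y_i F(x_i))) yields w_i = 1 / (1 + exp(y_i F(x_i))) <= 1, so each sample's influence saturates.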

Assigned Reviewer 5 also mentions that he would like to see results with equalized budgets. We would like to point to the paragraph at lines 410-415, where we note that even when equalizing computation time, GEEM outperforms the other reservoir strategies. These results should probably be highlighted and expanded.

Assigned Reviewer 6 argues that perhaps h_t should be trained on R_{t-1} \cup Q_{t-1}. We could train h_t in that way and then use GEEM to select R_t; we have run experiments in such a framework, and the results showed that GEEM still outperforms the other reservoir strategies (and the other baselines). We would like to note, however, that the cost of training on the set A is not necessarily negligible compared to computing the covariance matrix \Sigma: as noted in Section 3.7, the cost of computing \Sigma is O(|A|^3), which, depending on the dimensionality D of the data, can be similar to O(D|A|log|A|), the cost of training on A.
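
As a rough illustration with hypothetical values: for |A| = 500 and D = 30,000, |A|^3 is about 1.25 x 10^8 while D|A|log|A| is about 1.3 x 10^8, i.e. the two costs are of the same order for high-dimensional data.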

Another concern raised by Assigned Reviewer 6 is the strategy for choosing B and h*. For the former, we aim to construct a set B such that training on B will yield a weak learner that performs well on the entire set A; given that at iteration t the data at our disposal is the set A, this seems to us a natural strategy: pick a training set B that is representative of the entire set A. For the latter, we refer to our argument in the previous paragraph, as well as to the fact that the current choice of h* has a cost of O(1).