Sun Dec 2 through Sat Dec 8, 2018, at Palais des Congrès de Montréal
The paper presents an ensemble of deep neural networks (DNNs) whose members are obtained by moving away (in weight space) from a pretrained DNN. The process works in an exploration/exploitation fashion by cyclically increasing and decreasing the learning rate. The DNN is trained on minibatches for a few epochs during these cycles, and a "snapshot" of the weights is kept each time the learning rate reaches its lowest value. Each snapshot constitutes one member of the ensemble. In addition, the paper shows that the loss minima found by two DNNs trained from different random weight initializations can be connected by paths of low loss. These paths are not straight lines in weight space, but the paper shows that they can follow simple routes with a single bend.

I found the paper very clear, easy to follow, and full of compelling ideas, especially those related to the paths between modes. However, I still have some doubts/concerns:

* The first part of the article is very interesting but raises many doubts about its generality. Can the paths between modes always be found? Are the paths always simple and single-bended? How far apart were the modes of the two networks? In the case of many modes, can a path be found joining any pair of them? I find that the article is missing the evidence/theory needed to answer these questions.

* Regarding the second part, I found that the proposed search, which differs only marginally from prior work, is not really related to "sec 3 finding path between modes". To see this relation, it would be interesting to plot the path in a way similar to Fig 2, or to show that a straight path behaves differently. In addition, as Fig 3 (left, middle line) suggests, the followed path does not really pass through a low-cost region as in sec 3.

* Finally, the following reference, which is closely related, is missing: Lee, Stefan et al. Stochastic multiple choice learning for training diverse deep ensembles. NIPS 2016.
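The cyclic snapshot scheme described above can be sketched on a toy problem. This is illustrative only, not the paper's implementation: the linear model, the triangular schedule, and names like `cyclic_lr` are my own stand-ins.

```python
import numpy as np

def cyclic_lr(step, cycle_len, lr_max=0.1, lr_min=0.001):
    """Triangular schedule: lr falls from lr_max to lr_min within each cycle."""
    t = (step % cycle_len) / cycle_len          # position in current cycle, in [0, 1)
    return lr_max * (1 - t) + lr_min * t

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=200)

w = np.zeros(5)                                  # "pretrained" start (zeros for brevity)
cycle_len, snapshots = 50, []
for step in range(4 * cycle_len):
    lr = cyclic_lr(step, cycle_len)
    grad = 2 * X.T @ (X @ w - y) / len(y)        # MSE gradient
    w -= lr * grad
    if step % cycle_len == cycle_len - 1:        # lowest-lr point: take a snapshot
        snapshots.append(w.copy())

# Ensemble prediction = average of the snapshot models' predictions
ens_pred = np.mean([X @ s for s in snapshots], axis=0)
print(len(snapshots), float(np.mean((ens_pred - y) ** 2)))
```

Four cycles yield four snapshots, each taken where the learning rate bottoms out; the ensemble averages their predictions, mirroring the exploration (high lr) / exploitation (low lr) pattern the review describes.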
This paper proposes a method to identify paths between local optima in the error surface of deep neural networks (as a function of the weights) such that the error along these paths does not increase significantly. Beyond providing useful insights on the error surface, the authors develop an ensemble learning algorithm that builds an ensemble of deep networks found by exploring weight space along these paths.

The paper produces novel and useful insights on deep neural networks, with implications for training, and develops an ensemble learning method that uses the same total training budget as single-network training by exploiting the characteristics of the path-finding method. The experimental results need improvement, though: standard errors and statistical significance are needed, especially given the random nature of the algorithm. It would also be beneficial to compare the errors of the individual networks and the diversity of the networks between the proposed algorithm and an ensemble of independently trained networks. I suspect that the base-model errors are better for the proposed algorithm but that the diversity is inferior; this should be confirmed. Figures 2 and 3 are especially insightful---please keep these no matter what.

Having read the authors' feedback, my opinion of the paper has improved. This paper represents an excellent submission, and one that I hope the authors and others will develop further in the future.
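The diversity comparison requested above could be made concrete with any of several diversity measures; one simple, hedged choice (my own, not from the paper) is average pairwise disagreement between members' hard predictions:

```python
import numpy as np

def pairwise_disagreement(preds):
    """preds: (n_models, n_examples) array of predicted class labels."""
    n = len(preds)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean([np.mean(preds[i] != preds[j]) for i, j in pairs]))

# Three hypothetical members' predictions on 6 test examples
preds = np.array([
    [0, 1, 1, 0, 2, 1],
    [0, 1, 0, 0, 2, 1],
    [0, 2, 1, 0, 2, 0],
])
# average over the 3 pairs: (1/6 + 2/6 + 3/6) / 3 = 1/3
print(pairwise_disagreement(preds))
```

Reporting this number for both the proposed ensemble and an independently trained ensemble would let readers check the suspected accuracy/diversity trade-off directly.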
Update after author response: Thank you for the response. Additional details showing that the curves between local optima are not unique would also be interesting to see. Nice work!

Summary: This paper first presents a very interesting finding about the loss surfaces of deep neural nets, and then proposes a new ensembling method called Fast Geometric Ensembling (FGE). Given two well-trained deep neural nets (with no limitations on their architectures, apparently), we have two weight vectors w1 and w2 in a very high-dimensional space. The paper states the (surprising) fact that for any such w1 and w2, we can (always?) find a connecting path between them along which the training and test accuracy is nearly constant. Figure 1 demonstrates this: Left shows the training accuracy on the 2D subspace through independent weights w1, w2, w3 of ResNet-164 (from different random starts), whereas Middle and Right show the 2D subspace through independent weights w1, w2 and one bend point w3 on the curve (Middle: Bezier, Right: polygonal chain). Learning a curve with one bend w3 for given weights w1 and w2 is based on minimizing, with respect to w3, the expected loss over a uniform distribution on the curve. Investigating the test accuracy of ensembles of w1 and w(t) along the curve, as t varies, suggests that even an ensemble of w1 and a point w(t) close to w1 has better accuracy than either alone. So by perturbing the weights around w1 via gradient descent with a cyclical learning-rate schedule, we can collect weights at the low-learning-rate points, and the ensemble of these weights (FGE) has much better accuracy. The experiments provide enough evidence that this idea works, using state-of-the-art architectures such as ResNet-110 and WRN-28-10 as well as the more standard VGG-16.
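The curve-finding step summarized above can be sketched in 2D: a quadratic Bezier curve phi(t) = (1-t)^2 w1 + 2t(1-t) w3 + t^2 w2, whose bend w3 is trained by gradient descent on the loss averaged over t. The ring-shaped toy loss below is my own illustration, not the paper's objective: its two endpoints sit in a zero-loss "valley" (the circle of radius 2) that the straight segment between them leaves, so a bent path helps.

```python
import numpy as np

def bezier(t, w1, w2, w3):
    return (1 - t) ** 2 * w1 + 2 * t * (1 - t) * w3 + t ** 2 * w2

def loss(w):                               # low on the circle |w| = 2, high at the origin
    return (w @ w - 4.0) ** 2

w1, w2 = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
w3 = np.array([0.0, 0.5])                  # bend point, initialized near the segment
ts = np.linspace(0.0, 1.0, 21)

for _ in range(200):                       # gradient descent on the t-averaged loss
    grad = np.zeros(2)
    for t in ts:
        w = bezier(t, w1, w2, w3)
        dl_dw = 2 * (w @ w - 4.0) * 2 * w  # d loss / d phi(t)
        grad += dl_dw * 2 * t * (1 - t)    # chain rule: d phi / d w3 = 2t(1-t)
    w3 -= 0.1 * grad / len(ts)

straight = np.mean([loss(bezier(t, w1, w2, np.array([0.0, 0.0]))) for t in ts])
curved = np.mean([loss(bezier(t, w1, w2, w3)) for t in ts])
print(float(straight), float(curved))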
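placeholder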
Strength: - The finding that the loss surfaces have the connecting curves between any converged weights w1 and w2 where the training and test accuracy is nearly constant is very interesting. This phenomena is presented clearly with some illustrative examples for the modern CNN architectures such as ResNet and WRN. - It is also interesting to confirm how far we need to move along a connecting curve to find a point nice to be mixed into the ensemble. This well motivates a proposed ensembling method FGE. - Compared to the other ensembling such as Snapshot Ensembles , the proposed FGE would be nicely presented, and shows good empirical performance. Weakness: - The uniqueness of connecting curves between two weights would be unclear, and there might be a gap between the curve and FGE. A natural question would be, for example, if we run the curve findings several times, we will see many different curves? Or, those curves would be nearly unique? - The evidences are basically empirical, and it would be nice if we have some supportive explanations on why this curve happens (and whether it always happens). - The connections of the curve finding (the first part) and FGE (the second part) would be rather weak. When I read the first part and the title, I imagined that take random weights, learn curves between weights, and find nice wights to be mixed into the final ensemble, but it was not like that. (this can work, but also computationally demanding) Comment: - Overall I liked the paper even though the evidences are empirical. It was fun to read. The reported phenomena are quite mysterious, and interesting enough to inspire some subsequent research. - To be honest, I'm not sure the first curve-finding part explains well why the FGE work. The cyclical learning rate scheduling would perturb the weight around the initial converged weight, but it cannot guarantee that weight is changing along the curve described in the first part.