NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Reviewer 1
Pro: +It provides a better theoretical guidance for learning the hyper parameters for model selection using the PEA theory, which accelerates DARTS . +The results are impressive, though conjunction with a bunch of data augmentation methods. + Theoretical proofs are valid. Cons: - The paper is somehow difficult to read, check the comments below. It needs clearer description. One thing for CIFAR10 is the author adopted many tricks (275 page 7) for learning. However, some of the works such as DARTS do not use part of it such as auto augmentation etc. Baseline algorithms should add same tricks for fair comparison. There are two proposals that the author proposed, 1) regret for evaluation 2) wipeout. I think it also worths to see how different components affects the results in ablation rather than simply showing the final results. Also ablation about the model FLOPs or inference time are valuable Comments of the paper. The author us (#) as citation, which save space, but raise confusion between equation and citation (especially in checking the proof at supplementary). Change () to [] for better distinguish the two. In PEA setting, there are p_t for prediction from the network. p^(j, k) as prediction from edge. I found this is confusing, better clarify the two p are predicting different things. Bigger problem is at Fig.1, the visualization is not clear, it should have better notation with correspondence to PEA setting and the fonts are too small for a clear understanding of the figure. Make it better and clearer is a must. Please relocate Fig.1 and All.1 to Page 3 for better readability. ================================================== The author's feedback solved my concern, I keep the score the same. However, it also needs better clarification about the holding of assumptions.
Reviewer 2
The paper proposes a new optimization algorithm for differentiable architecture search. The main difference compared to other differentiable methods is that it adds a wipe-out step, which eliminates the "experts" that have very small scores. The paper is not clearly written. It seems to be that the formulation of the paper is very similar to DARTS. It changes the name of "component" in DARTS to experts, but it does not make a fundamental difference. In terms of optimization algorithm, the main difference compared to DARTS method is that it has a wipe-out method. I think the main benefits of the wipe out is the search speed, but it has the risk of eliminating useful "expert" at the beginning. Not all the notations in Algorithm 1 are well defined. This makes understanding of the algorithm difficult. In general, I do not see enough novelty in the proposed method, and the writing is not clear enough.
Reviewer 3
[Based on the author feedback, I have upgraded my score. I think the empirical evidence for exponentiated gradient descent (+wipeout) in NAS is valuable for the community. I would suggest that the authors clearly state the limitations associated with the analysis.] Summary: The proposed approach for NAS treats architecture choices, i.e., intra-cell node connections as in DARTS (Liu et al., 2018), as selection among "experts". The expert weights are found via EG descent (Kivinen & Warmuth, 1997) that somehow utilizes the standard "back-propagated loss gradient" (line 113) to perform the multiplicative weight updates. It's at this point that I had trouble following the theoretical analysis given that the EG algorithm requires a loss function on the expert prediction to be specified for the development of a regret bound. The paper does not define this loss function (\ell_s() in (2)), and it isn't immediately clear to me what it could be in this context: what is the loss at the output of an intermediate node of a cell in the middle of a parent network during search? Further, the requirement in the main result that this loss be convex in its prediction makes me wonder whether the regret analysis actually holds for the setting of the paper. The experimental results, however, appear to be good enough to at least warrant further investigation into these kinds of multiplicative weight updates. Specifically, the speed of search is fast owing to either the EG descent algorithm, the proposed wipeout operation, or some combination. Originality: I can't speak to the application of EG to training neural networks generally, but I thought it was a clever idea in the context of NAS where hard selections among finite sets must be made. Quality: For the main result, a convex loss in the forecaster prediction is required. However, the paper also states that explicit loss is not assigned (line 113). I may be missing something simple, but I don't see how the regret bound holds. Can the authors provide the form of this loss function? The experimental setup leading to the performance comparisons provided in the tables of Section 5 appears to be fair, but this reviewer is not 100% certain. How much of the speed-up is due to the use of EG vs. EG+wipeout? (It would have been nice to see what the experimental performance would have been without the wipeout step.) Clarity: Both the regret bound itself, and the determination of an optimal learning rate require an absolute bound on the loss. What was this set to for the experiments? Also, I don't understand the relevance of the note in line 228 regarding the gradient clipping value. How is this important?