Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
This work includes original ideas and empirical findings. The formulation of weight sharing and the use of Lasso to find sparse solutions are quite neat, and the implementation trick for including weak learners in the existing model is clear and clean. The effect of the initial model has been ignored in the NAS field, and this work brings its importance back to the table, which we should all pay attention to. The quality of the work is quite good. Related work is carefully reviewed and understood. Many choices in the proposed method are justified either by previous work or by the ablation studies in the appendix. The experiments are extensive and informative, with well-rounded results. The paper is written clearly, and I believe this will be a significant work in the NAS field. One minor question concerns the stopping criterion for the experiments: are the experiments run for up to 5 days, with the best result then reported? Is it possible to develop some form of early stopping for NAS?
This paper proposes a novel neural architecture search method dubbed Petridish, which is based on gradient boosting of "weak learners" (i.e. small subnetworks attached to the main network). Originality: The main contribution of the paper is applying basic ideas from gradient boosting of weak learners to the task of neural architecture search. This is an original idea, which allows a more guided exploration of the space of neural architectures compared to the random steps taken, e.g., in evolutionary algorithms. Most related work is adequately discussed. The connections to and differences from NAS methods that combine network morphisms with evolutionary algorithms should be discussed in more detail, as these explore the search space with similar steps (modifying a model by small incremental additions) but select steps randomly rather than by gradient boosting. Quality: The authors motivate and evaluate the main design decisions of the method carefully. A short summary in the main document of the main results from the supplementary material would be helpful. I am also proposing two control experiments in Point 5 which could further strengthen the paper. Only including models with fewer than 3.5M parameters in Table 1 (which rules out e.g. ProxylessNAS) is somewhat arbitrary. I propose to include at least the models corresponding to the ones in Table 2 (SNAS, ProxylessNAS) for completeness. Clarity: Generally, the paper is very well written and organized. One piece of information missing for reproducing the results (without looking into the code) is a complete summary of the entire training pipeline of Petridish, including data augmentation, regularization, etc. Significance: The proposed work is competitive with other recent NAS methods but does not clearly advance the state of the art in terms of search time, test error, number of network parameters, or other dimensions.
The main significance of the method is, in my opinion, that it is not restricted to architectures that are subnetworks of a manually defined supergraph. Thus, it allows in principle a more open-ended architecture search without requiring excessive compute resources (since it still allows for weight sharing). This point, however, is only briefly mentioned in the introduction and not explored more thoroughly later on (e.g. in the experiments). A more detailed discussion and some experimental evidence on whether lifting the requirement of a predefined supergraph is helpful would greatly increase the significance of the paper. Overall, the work introduces an approach for NAS which is novel and presented clearly. The significance of this work is at the moment limited to "yet another NAS approach" (albeit with a nice connection to gradient boosting of weak learners). More clearly carving out the unique advantages of the approach would increase its significance. Minor comment: * The bibliography entry for ProxylessNAS uses the wrong order for the first and last names of the authors.
The paper proposes to perform architecture search in the following way. A basic architecture is extended by adding a layer to the side. However, during the forward pass that side layer is ignored. During the backward pass, backpropagation is used to update the parameters of this layer as if it were contributing to the actual computation (but gradients are not propagated beyond the layer). Each of the components has an L1-regularised scalar alpha, also trained, that represents its "contribution". These alphas are used in the selection stage to pick the components to add. To me, the similarity to gradient boosting is that the weak learner is selected based on the learning process (the gradient), while its computation does not take part in the forward pass. The paper is mostly well written and clear. I am mainly struggling with the iterative process. Are the cells extended once, or is this done in an incrementally growing manner? This is not described properly in the paper and I would like to see a clarification. Could you also clarify this for macro search vs. cell search? Are the different layers updated all at once or one by one for macro search? How exactly is the coupling done in cell search? Do you share the same alpha parameters across cells? I think the idea is original, but the evaluation could be improved. Many of the choices were experimentally validated and these results are presented in the appendix. However, key experiments are missing. A problem is that the methods it is compared against all use different search spaces. It is unclear whether the benefits come from growing the model, from the search space, or from the actual implementation of the algorithm. For this reason I think the following experiments need to be included:
1. (required) Compare the method to the baseline in which you would select a specific model and grow it by randomly selecting operations. This would show that selecting operations with the boosting trick is effective.
2. (required) Take the seed model and scale it up/down until it is as expensive as the final model. This would show that the architectural changes are actually important and that the performance gains do not just come from the additional capacity.
Additionally, it would be interesting to see whether a more advanced model could be further improved.
# Post rebuttal
The authors added further control experiments. I have increased my score.
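For readers who want a concrete picture of the mechanism summarised at the start of this review (candidates ignored in the forward pass, trained from the backward gradient, and ranked by an L1-regularised alpha), a toy sketch in plain Python follows. It is only one reading of that description, not the authors' implementation. The scalar regression task, the three candidate operations, the frozen main weight `w`, the soft-thresholding update, and all function names and hyperparameters are assumptions made for illustration.

```python
# Toy sketch (not the authors' code): candidates are ignored in the forward
# pass but still receive the loss gradient, and an L1 penalty on each alpha
# decides which candidate to add. All names and values are hypothetical.

def soft_threshold(value, threshold):
    """Proximal step for an L1 penalty: shrink towards zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

# Hypothetical candidate operations that could be attached to the main model.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "square": lambda x: x * x,
    "cube": lambda x: x ** 3,
}

def main_forward(w, x):
    """Forward pass of the current model; candidates contribute nothing here."""
    return w * x

def train_alphas(xs, ys, w, steps=50, lr=0.05, l1=0.1):
    """Train one alpha per candidate while the main weight w stays frozen."""
    alphas = {name: 0.0 for name in CANDIDATE_OPS}
    n = len(xs)
    for _ in range(steps):
        # dLoss/dOutput for squared error, evaluated at the main-only output.
        out_grads = [main_forward(w, x) - y for x, y in zip(xs, ys)]
        for name, op in CANDIDATE_OPS.items():
            # Gradient alpha would receive if alpha * op(x) were added to the
            # output. op treats x as a constant, so nothing flows back into
            # the main model.
            grad = sum(g * op(x) for g, x in zip(out_grads, xs)) / n
            # Assumed update rule: gradient step followed by L1 shrinkage.
            alphas[name] = soft_threshold(alphas[name] - lr * grad, lr * l1)
    return alphas

def select_candidates(alphas, k=1):
    """Pick the k candidates with the largest absolute alpha."""
    return sorted(alphas, key=lambda name: abs(alphas[name]), reverse=True)[:k]

# The main model y = w * x misses a quadratic term in the target.
xs = [i * 0.5 for i in range(-4, 5)]
ys = [x + 0.5 * x * x for x in xs]
alphas = train_alphas(xs, ys, w=1.0)
selected = select_candidates(alphas, k=1)
print(alphas, selected)
```

Because the candidates never change the output in this sketch, the alphas serve only as a ranking signal for selection and should not be read as fitted coefficients. Replacing `select_candidates` with a uniformly random choice would give a toy version of the random-growth baseline requested in experiment 1.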