Reviews: DATA: Differentiable ArchiTecture Approximation

The paper takes the Gumbel-softmax trick used in SNAS [1] a step further by ensembling the Gumbel-softmax estimator. As a result, it has a richer sample space while remaining efficient. Rather than the credit-assignment approach of SNAS, DATA uses differentiability to update the probability vector. The paper is well written and clearly motivates the proposed approach. I am convinced that the proposed EGS estimator can bridge the gap between the architectures used in searching and validating, which is a well-known issue in DARTS [2]. The argument that the EGS estimator gives a richer search space is backed up by the experiments: it clearly outperforms both DARTS and SNAS on image classification and language modelling.

However, from my point of view, there is still important analysis that I expect to see in the paper.

1. Since the paper claims that the proposed approach bridges the gap between the architectures in searching and validating, I expect to see a table comparing the validation accuracy of the search network and the child network (such as Table 1 in [1]). It would make the claim stronger. The search progress (Figure 3 in [1]) should be included, at least in the supplementary material.

2. A common complaint is that NAS approaches have high variance over different initializations. The results in Table 1 and Table 3 are means over 5 independent runs. I don't think it is good enough to report only the mean. A variance plot such as Figure 3 is expected.

3. The interesting part of the proposed approach is its richer search space. Judging only by the final results in the tables, DATA appears to benefit from the proposed EGS estimator. But this could be further justified by comparing the search results for M=4 and M=7. Unfortunately, I only found the search result for M=1 in the supplementary material.

4. The proposed method is still sampling based, so rather than saying it is differentiable end to end, I would say the gradient can be obtained through a gradient estimator. The gradient estimator should be written out explicitly, at least in the supplementary material.

5. There is no resource constraint in this estimator. SNAS has a regularization on the forward pass time of the child networks [1]. Since the objective for both the architecture weights and the probability vector is the training loss, is overfitting an issue in this work? As a toy example, suppose I am fitting a polynomial to a regression problem and want to use a NAS method to search for the best polynomial degree. I would imagine that searching based on the training loss would lead to the largest possible degree, which obviously overfits. DARTS, in comparison, has no such issue because it uses the training loss as the objective for the shared weights and the validation loss for the architecture vector.

6. It seems DATA should be slower per search iteration than SNAS because it samples multiple times. But DATA is faster than SNAS in the table. Can the authors provide more intuition on this?

7. One minor issue is that the proposed method is incremental given the existing work [1, 2]. But the work does contribute to the NAS community by introducing an efficient estimator that leads to a richer search space.

Overall, the paper is interesting work but leaves some open questions. I think it needs further polish and is not yet ready for acceptance.

After reading the authors' response, most of my concerns are addressed. I increase my score to 6.

[1]: Xie, S., Zheng, H., Liu, C., & Lin, L. (2019). SNAS: Stochastic Neural Architecture Search. ICLR 2019.
[2]: Liu, H., Simonyan, K., & Yang, Y. (2019). DARTS: Differentiable Architecture Search. ICLR 2019.
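
The polynomial toy example raised in this review can be made concrete with a small sketch. Everything in it is an illustrative assumption and is not taken from the paper: the quadratic ground truth, the noise level, the sample sizes, and the helper names `fit_poly`, `mse` and `make_data`. It fits polynomials of increasing degree by least squares and compares the degree selected by training loss with the degree selected by held-out loss.

```python
import random

def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients c[0..degree]."""
    n = degree + 1
    # Normal equations (V^T V) c = V^T y for the monomial basis 1, x, ..., x^degree.
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n):
                a[r][k] -= f * a[col][k]
            b[r] -= f * b[col]
    # Back substitution.
    c = [0.0] * n
    for i in reversed(range(n)):
        s = sum(a[i][k] * c[k] for k in range(i + 1, n))
        c[i] = (b[i] - s) / a[i][i]
    return c

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial on (xs, ys)."""
    preds = [sum(c * x ** k for k, c in enumerate(coeffs)) for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

def make_data(n, rng):
    # Hypothetical ground truth: a quadratic plus Gaussian noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [1.0 - 2.0 * x + 0.5 * x * x + rng.gauss(0.0, 0.3) for x in xs]
    return xs, ys

rng = random.Random(0)
train_x, train_y = make_data(15, rng)
val_x, val_y = make_data(200, rng)

degrees = list(range(0, 7))  # candidate "architectures"
fits = {d: fit_poly(train_x, train_y, d) for d in degrees}
train_loss = {d: mse(fits[d], train_x, train_y) for d in degrees}
val_loss = {d: mse(fits[d], val_x, val_y) for d in degrees}

picked_by_train = min(degrees, key=train_loss.get)
picked_by_val = min(degrees, key=val_loss.get)
print("degree picked by training loss:  ", picked_by_train)
print("degree picked by validation loss:", picked_by_val)
```

Because the candidate models are nested, the training loss cannot increase with the degree (up to rounding), so selection by training loss drifts toward the largest candidate. Selection by held-out loss is not pushed in that direction, which is the separation of objectives the review attributes to DARTS.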

This paper introduces a new approach to Neural Architecture Search that uses the Gumbel-Softmax operator as a trick to overcome the problem that selecting operators for edges is discrete, which is a challenge for gradient descent. The authors justify their approach by arguing that existing approaches do not evaluate, at validation, the exact architecture that was optimized during search. Given that each edge might need more than a single operator, the authors introduce an ensemble Gumbel-Softmax that allows their framework to be used when multiple operators are required for each edge. The paper is well written and easy to follow, and the experimental study shows that the framework performs better than many other state-of-the-art baselines.

One major problem that I have with this work and its motivation is that I am not sure why the operations used on each edge should be mutually exclusive. In general, I would expect gradient descent to determine which combination of operators should be used for the best performance. If an operator is not good for an edge, its weight will decrease during optimization. So I do not see the practical reason for applying a one-hot vector after optimization in the methods that do so ([33] and [52]), and I believe the methods in [33] and [52] can be used without the one-hot step. As mentioned in those papers, they have also used the top-k strongest operations (not only a single one). So the justification for why the current method is better is not strong enough, even though the authors show that their framework performs better and faster.
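
For readers unfamiliar with the mechanism discussed in this review, here is a minimal sketch of Gumbel-Softmax sampling followed by an ensemble step. The reviews do not state how the M samples are combined, so the element-wise OR used here is an assumption for illustration only and may differ from the paper's definition. The function names, logits and temperature are likewise hypothetical.

```python
import math
import random

def gumbel_softmax(logits, tau, rng):
    """One relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
    noisy = [
        (l - math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))) / tau
        for l in logits
    ]
    top = max(noisy)  # subtract the max for numerical stability
    exps = [math.exp(v - top) for v in noisy]
    z = sum(exps)
    return [e / z for e in exps]

def one_hot_argmax(probs):
    """Discretize a relaxed sample to a one-hot vector."""
    k = max(range(len(probs)), key=probs.__getitem__)
    return [1 if i == k else 0 for i in range(len(probs))]

def ensemble_code(logits, tau, m_samples, rng):
    """ASSUMED ensemble rule: element-wise OR of M discretized samples."""
    code = [0] * len(logits)
    for _ in range(m_samples):
        hot = one_hot_argmax(gumbel_softmax(logits, tau, rng))
        code = [c | h for c, h in zip(code, hot)]
    return code

rng = random.Random(0)
edge_logits = [0.5, 0.1, -0.3, 1.2]  # hypothetical scores for 4 candidate operators
for m in (1, 4, 7):
    print("M =", m, "->", ensemble_code(edge_logits, 1.0, m, rng))
```

With M=1 the code is always one-hot, so operators on an edge are mutually exclusive. With larger M an edge can keep up to M operators. The sketch covers only the forward sampling; it does not show the gradient estimator.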

Paper ID:	477
Title:	DATA: Differentiable ArchiTecture Approximation
