{"title": "Neural Architecture Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 7816, "page_last": 7827, "abstract": "Automatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning (RL) or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform gradient based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. Experiments show that the architectures discovered by our method are very competitive for the image classification task on CIFAR-10 and the language modeling task on PTB, outperforming or on par with the best results of previous architecture search methods while using significantly fewer computational resources. Specifically, we obtain a $2.11\\%$ test set error rate for the CIFAR-10 image classification task and a $56.0$ test set perplexity for the PTB language modeling task. The best discovered architectures on both tasks are successfully transferred to other tasks such as CIFAR-100 and WikiText-2. 
Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 (with error rate $3.53\\%$) and on PTB (with test set perplexity $56.6$), with very limited computational resources (less than $10$ GPU hours) for both tasks.", "full_text": "Neural Architecture Optimization\n\n1Renqian Luo\u2020\u2217, 2Fei Tian\u2020, 2Tao Qin, 1Enhong Chen, 2Tie-Yan Liu\n\n1University of Science and Technology of China, Hefei, China\n\n2Microsoft Research, Beijing, China\n\n1lrq@mail.ustc.edu.cn, cheneh@ustc.edu.cn\n2{fetia, taoqin, tie-yan.liu}@microsoft.com\n\nAbstract\n\nAutomatic neural architecture design has shown its potential in discovering powerful neural network architectures. Existing methods, whether based on reinforcement learning (RL) or evolutionary algorithms (EA), conduct architecture search in a discrete space, which is highly inefficient. In this paper, we propose a simple and efficient method for automatic neural architecture design based on continuous optimization. We call this new approach neural architecture optimization (NAO). There are three key components in our proposed approach: (1) An encoder embeds/maps neural network architectures into a continuous space. (2) A predictor takes the continuous representation of a network as input and predicts its accuracy. (3) A decoder maps a continuous representation of a network back to its architecture. The performance predictor and the encoder enable us to perform gradient based optimization in the continuous space to find the embedding of a new architecture with potentially better accuracy. Such a better embedding is then decoded to a network by the decoder. 
Experiments show that the architectures discovered by our method are very competitive for the image classification task on CIFAR-10 and the language modeling task on PTB, outperforming or on par with the best results of previous architecture search methods while using significantly fewer computational resources. Specifically, we obtain a 2.11% test set error rate for the CIFAR-10 image classification task and a 56.0 test set perplexity for the PTB language modeling task. The best discovered architectures on both tasks are successfully transferred to other tasks such as CIFAR-100 and WikiText-2. Furthermore, combined with the recently proposed weight sharing mechanism, we discover powerful architectures on CIFAR-10 (with error rate 3.53%) and on PTB (with test set perplexity 56.6), with very limited computational resources (less than 10 GPU hours) for both tasks.\n\n1 Introduction\n\nAutomatic design of neural network architectures without human intervention has been an interest of the community from decades ago [12, 22] to very recent years [45, 46, 27, 36, 7]. The latest algorithms for automatic architecture design usually fall into two categories: reinforcement learning (RL) based methods [45, 46, 34, 3] and evolutionary algorithm (EA) based methods [40, 32, 36, 27, 35]. In RL based methods, the choice of a component of the architecture is regarded as an action. A sequence of actions defines an architecture of a neural network, whose dev set accuracy is used as the reward. In EA based methods, search is performed through mutations and re-combinations of architectural components, where architectures with better performance are picked to continue evolution. It can be easily observed that both RL and EA based methods essentially perform search within the discrete architecture space. 
This is natural since the choices of neural network architectures are typically discrete, such as the filter size in a CNN and the connection topology in an RNN cell.\n\n\u2217The work was done when the first author was an intern at Microsoft Research Asia.\n\u2020The first two authors contribute equally to this work.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nHowever, directly searching for the best architecture within a discrete space is inefficient, since the search space grows exponentially with the number of choices. In this work, we instead propose to optimize network architectures by mapping them into a continuous vector space (i.e., network embeddings) and conducting optimization in this continuous space via gradient based methods. On one hand, similar to the distributed representation of natural language [33, 25], the continuous representation of an architecture is more compact and efficient in representing its topological information; on the other hand, optimizing in a continuous space is much easier than directly searching within a discrete space due to better smoothness.\n\nWe call this optimization based approach Neural Architecture Optimization (NAO), which is briefly shown in Fig. 1. The core of NAO is an encoder model responsible for mapping a neural network architecture into a continuous representation (the blue arrow in the left part of Fig. 1). On top of the continuous representation we build a regression model to approximate the final performance (e.g., classification accuracy on the dev set) of an architecture (the yellow surface in the middle part of Fig. 1). It is noteworthy here that the regression model is similar to the performance predictor in previous works [4, 26, 10]. 
What distinguishes our method is how the performance predictor is leveraged: different from previous work [26] that uses the performance predictor as a heuristic to select already generated architectures to speed up the search process, we directly optimize the module to obtain the continuous representation of a better network (the green arrow in the middle and bottom part of Fig. 1) by gradient descent. The optimized representation is then leveraged to produce a new neural network architecture that is predicted to perform better. To achieve that, another key module of NAO is designed to act as the decoder, recovering the discrete architecture from the continuous representation (the red arrow in the right part of Fig. 1). The decoder is an LSTM model equipped with an attention mechanism that makes exact recovery easy. The three components (i.e., encoder, performance predictor and decoder) are jointly trained in a multi-task setting, which benefits the continuous representation: the decoder objective of recovering the architecture further improves the quality of the architecture embedding, making it more effective in predicting the performance.\n\nWe conduct thorough experiments to verify the effectiveness of NAO, on both image classification and language modeling tasks. Using the same architecture space commonly used in previous works [45, 46, 34, 26], the architecture found via NAO achieves a 2.11% test set error rate (with cutout [11]) on CIFAR-10. Furthermore, on the PTB dataset we achieve 56.0 perplexity, also surpassing the best performance found via previous methods on neural architecture search. 
Furthermore, we show that equipped with the recently proposed weight sharing mechanism in ENAS [34] to reduce the large complexity in the parameter space of child models, we can achieve improved efficiency in discovering powerful convolutional and recurrent architectures, e.g., both taking less than 10 hours on 1 GPU. Our code and model checkpoints are available at https://github.com/renqianluo/NAO.\n\n2 Related Work\n\nRecently the design of neural network architectures has largely shifted from leveraging human knowledge to automatic methods, sometimes referred to as Neural Architecture Search (NAS) [40, 45, 46, 26, 34, 6, 36, 35, 27, 7, 21]. As mentioned above, most of these methods are built upon one of two basic algorithms: reinforcement learning (RL) [45, 46, 7, 3, 34, 8] and evolutionary algorithms (EA) [40, 36, 32, 35, 27]. For example, [45, 46, 34] use policy networks to guide the choice of the next-step architecture component. The evolution processes in [36, 27] guide the mutation and recombination of candidate architectures. Some recent works [17, 18, 26] try to improve the efficiency of architecture search by exploring the search space incrementally and sequentially, typically from simple to complex. Among them, [26] additionally utilizes a performance predictor to select promising candidates. Similar performance predictors have been specifically studied in parallel works such as [10, 4]. Although different in terms of search algorithms, all these works target improving the quality of discrete decisions in the process of searching architectures.\n\nThe most recent work parallel to ours is DARTS [28], which relaxes the discrete architecture space to a continuous one via a mixture model and utilizes gradient based optimization to derive the best architecture. 
On one hand, both NAO and DARTS conduct continuous optimization via gradient based methods; on the other hand, the continuous spaces in the two works are different: in DARTS it is the mixture weights and in NAO it is the embedding of neural architectures.\n\nFigure 1: The general framework of NAO. Better viewed in color mode. The original architecture x is mapped to a continuous representation e_x via the encoder network. Then e_x is optimized into e_x' by maximizing the output of the performance predictor f using gradient ascent (the green arrow). Afterwards e_x' is transformed into a new architecture x' using the decoder network.\n\nThe difference in optimization space leads to a difference in how the best architecture is derived from the continuous space: DARTS simply assumes the best decision (among different choices of architectures) is the argmax of the mixture weights, while NAO uses a decoder to exactly recover the discrete architecture.\n\nAnother line of work with similar motivation to our research uses Bayesian optimization (BO) to perform automatic architecture design [37, 21]. Using BO, an architecture's performance is typically modeled as a sample from a Gaussian process (GP). The induced posterior of the GP, a.k.a. the acquisition function, denoted as a : X \u2192 R+ where X represents the architecture space, is tractable to optimize. However, the effectiveness of GP heavily relies on the choice of the covariance function K(x, x'), which essentially models the similarity between two architectures x and x'. One needs to put extra effort into setting a good K(x, x') in the context of architecture design, bringing additional manual effort, whereas the performance might still be unsatisfactory [21]. 
As a comparison, we do not build our method on the complicated GP setup, and empirically find that our model, which is simpler and more intuitive, works much better in practice.\n\n3 Approach\n\nWe introduce the details of neural architecture optimization (NAO) in this section.\n\n3.1 Architecture Space\n\nFirst we introduce the design space for neural network architectures, denoted as X. For fair comparison with previous NAS algorithms, we adopt the same architecture space commonly used in previous works [45, 46, 34, 26, 36, 35].\n\nFor searching CNN architectures, we assume that the CNN architecture is hierarchical, in that a cell is stacked a certain number of times (denoted as N) to form the final CNN architecture. The goal is to design the topology of the cell. A cell is a convolutional neural network containing B nodes. Each node contains two branches, with each branch taking the output of one of the previous nodes as input and applying an operation to it. The operation set includes 11 operators, listed in the Appendix. The node adds the outputs of its two branches as its output. The inputs of the cell are the outputs of the two previous cells, respectively denoted as node -2 and node -1. Finally, the outputs of all the nodes that are not used by any other nodes are concatenated to form the final output of the cell. Therefore, for each of the B nodes we need to: 1) decide which two previous nodes are used as the inputs to its two branches; 2) decide the operation to apply to each of its two branches. We set B = 5 in our experiments as in [46, 34, 26, 35].\n\nFor searching RNN architectures, we use the same architecture space as in [34]. The architecture space is imposed on the topology of an RNN cell, which computes the hidden state h_t using input i_t and the previous hidden state h_{t-1}. 
The cell contains B nodes and we have to make two decisions for each node, similar to the CNN cell: 1) a previous node as its input; 2) the activation function to apply. For example, if we sample node index 2 and ReLU for node 3, the output of the node will be o_3 = ReLU(o_2 \u00b7 W^h_3). An exception is the first node, for which we only decide its activation function a_1; its output is o_1 = a_1(i_t \u00b7 W^i + h_{t-1} \u00b7 W^h_1). Note that all W matrices are the weights associated with each node. The available activation functions are: tanh, ReLU, identity and sigmoid. Finally, the output of the cell is the average of the outputs of all the nodes. In our experiments we set B = 12 as in [34].\n\nWe use a sequence of discrete string tokens to describe a CNN or RNN architecture. Taking the description of a CNN cell as an example, each branch of a node is represented via three tokens: the index of the node selected as input, the operation type and the operation size. For example, the sequence "node -2 conv 3x3 node 1 max-pooling 3x3" means that the two branches of one node respectively take the outputs of node -2 and node 1 as inputs, and respectively apply 3 x 3 convolution and 3 x 3 max pooling. For ease of exposition, we use the notation x = {x_1, ..., x_T} to denote such a string sequence of an architecture x, where x_t is the token at the t-th position and all architectures x \u2208 X share the same sequence length, denoted as T. T is determined by the number of nodes B in each cell in our experiments.\n\n3.2 Components of Neural Architecture Optimization\n\nThe overall framework of NAO is shown in Fig. 1. 
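Before detailing the components, the string-token description of architectures from subsection 3.1 can be illustrated with a short sketch. The helper name is made up for this illustration, and for brevity each branch is folded into two tokens (input node, single operation token) rather than the three tokens described above:

```python
# Hypothetical illustration of the string-token encoding of a CNN cell.
# `branches` holds one (input_a, op_a, input_b, op_b) tuple per node;
# inputs -2 and -1 denote the outputs of the two previous cells.

def encode_cell(branches):
    """Flatten a cell description into a token sequence x = (x_1, ..., x_T)."""
    tokens = []
    for input_a, op_a, input_b, op_b in branches:
        tokens += [f"node{input_a}", op_a, f"node{input_b}", op_b]
    return tokens

# A toy cell with B = 2 nodes; with two tokens per branch, T = 4 * B.
cell = [(-2, "conv 3x3", -1, "max-pooling 3x3"),
        (0, "conv 5x5", 1, "identity")]
x = encode_cell(cell)
```

Since every cell has exactly B nodes, every architecture flattens to the same length T, which is what lets the encoder treat architectures as fixed-length sequences.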
To be concrete, there are three major parts that constitute NAO: the encoder, the performance predictor and the decoder.\n\nEncoder. The encoder of NAO takes the string sequence describing an architecture as input, and maps it into a continuous space E. Specifically, the encoder is denoted as E : X \u2192 E. For an architecture x, we have its continuous representation (a.k.a. embedding) e_x = E(x). We use a single layer LSTM as the basic model of the encoder, and the hidden states of the LSTM are used as the continuous representation of x. Therefore we have e_x = {h_1, h_2, ..., h_T} \u2208 R^{T x d}, where h_t \u2208 R^d is the LSTM hidden state at the t-th timestep, with dimension d.\n\nPerformance predictor. The performance predictor f : E \u2192 R+ is another important module accompanying the encoder. It maps the continuous representation e_x of an architecture x into its performance s_x measured by dev set accuracy. Specifically, f first conducts mean pooling on e_x = {h_1, ..., h_T} to obtain \u0113_x = (1/T) \u03a3_t h_t, and then maps \u0113_x to a scalar value using a feed-forward network as the predicted performance. For an architecture x and its performance s_x as training data, the optimization of f aims at minimizing the least-squares regression loss (s_x - f(E(x)))^2.\n\nConsidering the objective of performance prediction, an important requirement for the encoder is to guarantee the permutation invariance of architecture embeddings: for two architectures x_1 and x_2, if they are symmetric (e.g., x_2 is formed by swapping the two branches within a node of x_1), their embeddings should be close and produce the same performance prediction scores. 
To achieve that, we adopt a simple data augmentation approach inspired by data augmentation in computer vision (e.g., image rotation and flipping): for each (x_1, s_x), we add an additional pair (x_2, s_x) where x_2 is symmetric to x_1, and use both pairs (i.e., (x_1, s_x) and (x_2, s_x)) to train the encoder and the performance predictor. Empirically we found that acting in this way brings a non-trivial gain: on CIFAR-10, about a 2% improvement when we measure the quality of the performance predictor via the pairwise accuracy among all the architectures (and their performances).\n\nDecoder. Similar to the decoder in neural sequence-to-sequence models [38, 9], the decoder in NAO is responsible for decoding the string tokens of x, taking e_x as input, in an autoregressive manner. Mathematically, the decoder is denoted as D : E \u2192 X, which decodes the string tokens x from the continuous representation: x = D(e_x). We set D as an LSTM model with the initial hidden state s_0 = h_T(x). Furthermore, an attention mechanism [2] is leveraged to make decoding easier, which outputs a context vector ctx_r combining all encoder outputs {h_t}, t = 1, ..., T, at each timestep r. 
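The gradient-based move in the continuous space that the framework relies on, optimizing e_x into e_x' by ascending the performance predictor f, can be sketched with a toy, analytically differentiable stand-in for f. The quadratic surrogate, step size and step count below are illustrative only, not values from the paper:

```python
# A minimal sketch of gradient ascent in the embedding space:
# e_x' = e_x + eta * df/de_x, repeated for a few steps. The quadratic
# `predictor` is a stand-in for the learned f, chosen only so that its
# gradient is analytic; embeddings are plain Python lists.

def predictor(e):
    # Toy surrogate: predicted performance peaks at the all-0.5 embedding.
    return -sum((v - 0.5) ** 2 for v in e)

def predictor_grad(e):
    # d f / d e_v = -2 * (v - 0.5), component-wise.
    return [-2.0 * (v - 0.5) for v in e]

def optimize_embedding(e, eta=0.1, steps=10):
    """Move the embedding uphill on the predictor's output surface."""
    for _ in range(steps):
        g = predictor_grad(e)
        e = [v + eta * gv for v, gv in zip(e, g)]
    return e

e_x = [0.0, 1.0, 0.2]
e_x_new = optimize_embedding(e_x)
```

In NAO the gradient comes from backpropagating through the learned predictor network rather than an analytic formula, but the update rule has the same shape.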
The decoder defines a factorized distribution over the token sequence, P_D(x|e_x) = \u03a0_{r=1..T} P_D(x_r|e_x, x_{<r}), and is trained to maximize the log-likelihood \u03a3_{r=1..T} log P_D(x_r|e_x, x_{<r}) of recovering x from e_x.\n\nWe measure the quality of the performance predictor f by its pairwise accuracy on a held-out architecture set X_test, acc_f = \u03a3_{x_1, x_2 \u2208 X_test} 1_{f(E(x_1)) \u2265 f(E(x_2))} \u00b7 1_{s_{x_1} \u2265 s_{x_2}} / \u03a3_{x_1, x_2 \u2208 X_test} 1_{s_{x_1} \u2265 s_{x_2}}, and the quality of the decoder D by the Hamming distance between an architecture string x \u2208 X_test and its reconstruction D(E(x)).\n\nModel | B | N | F | #op | Error (%) | #params | M | GPU Days\nDenseNet-BC [19] | / | 100 | 40 | / | 3.46 | 25.6M | / | /\nResNeXt-29 [41] | / | / | / | / | 3.58 | 68.1M | / | /\nNASNet-A [45] | 5 | 6 | 32 | 13 | 3.41 | 3.3M | 20000 | 2000\nNASNet-B [45] | 5 | 4 | N/A | 13 | 3.73 | 2.6M | 20000 | 2000\nNASNet-C [45] | 5 | 4 | N/A | 13 | 3.59 | 3.1M | 20000 | 2000\nHier-EA [27] | 5 | 2 | 64 | 6 | 3.75 | 15.7M | 7000 | 300\nAmoebaNet-A [35] | 5 | 6 | 36 | 10 | 3.34 | 3.2M | 20000 | 3150\nAmoebaNet-B [35] | 5 | 6 | 36 | 19 | 3.37 | 2.8M | 27000 | 3150\nAmoebaNet-B [35] | 5 | 6 | 80 | 19 | 3.04 | 13.7M | 27000 | 3150\nAmoebaNet-B [35] | 5 | 6 | 128 | 19 | 2.98 | 34.9M | 27000 | 3150\nAmoebaNet-B + Cutout [35] | 5 | 6 | 128 | 19 | 2.13 | 34.9M | 27000 | 3150\nPNAS [26] | 5 | 3 | 48 | 8 | 3.41 | 3.2M | 1280 | 225\nENAS [34] | 5 | 5 | 36 | 5 | 3.54 | 4.6M | / | 0.45\nRandom-WS | 5 | 5 | 36 | 5 | 3.92 | 3.9M | / | 0.25\nDARTS + Cutout [28] | 5 | 6 | 36 | 7 | 2.83 | 4.6M | / | 4\nNAONet | 5 | 6 | 36 | 11 | 3.18 | 10.6M | 1000 | 200\nNAONet | 5 | 6 | 64 | 11 | 2.98 | 28.6M | 1000 | 200\nNAONet + Cutout | 5 | 6 | 128 | 11 | 2.11 | 128M | 1000 | 200\nNAONet-WS | 5 | 5 | 36 | 5 | 3.53 | 2.5M | / | 0.3\n\nTable 1: Performances of different CNN models on the CIFAR-10 dataset. 'B' is the number of nodes within a cell, introduced in subsection 3.1. 'N' is the number of times the discovered normal cell is unrolled to form the final CNN architecture. 'F' represents the filter size. '#op' is the number of different operations for one branch in the cell, which is an indicator of the scale of the architecture space for an automatic architecture design algorithm. 'M' is the total number of network architectures that are trained to obtain the claimed performance. '/' denotes that the criterion is meaningless for a particular algorithm. 'NAONet-WS' represents the architecture discovered by NAO with the weight sharing method as described in subsection 3.4. 'Random-WS' represents the random search baseline, conducted in the weight sharing setting of ENAS [34].\n\nThe performance predictor f achieves satisfactory quality (over 78% pairwise accuracy) with only roughly 500 evaluated architectures. Furthermore, the decoder D is powerful in that it can almost exactly recover the network architecture from its embedding, with an averaged Hamming distance between the description strings of two architectures of less than 0.5, which means that on average the difference between the decoded sequence and the original one is less than 0.5 tokens (out of 60 tokens in total).\n\nFigure 2: Left: the accuracy acc_f of the performance predictor f (red line) and the performance dist_D of the decoder D (blue line) on the test set, w.r.t. the number of training data (i.e., evaluated architectures). Right: the mean dev set accuracy, together with its value predicted by f, of the candidate architecture set X_eval in each NAO optimization iteration l = 1, 2, 3. The architectures are trained for 25 epochs.\n\nFurthermore, we would like to inspect whether the gradient update in Eqn. (2) really helps to generate better architecture representations that are further decoded to architectures via D. In Fig. 2(b) we show the average performances of architectures in X_eval discovered via NAO at each optimization iteration. The red bar indicates the mean of the real performance values, (1/|X_eval|) \u03a3_{x \u2208 X_eval} s_x, while the blue bar indicates the mean of the predicted values, (1/|X_eval|) \u03a3_{x \u2208 X_eval} f(E(x)). We can observe that the performances of architectures in X_eval generated via NAO gradually increase with each iteration. 
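The pairwise accuracy used to evaluate the predictor f can be computed with a short helper. This is a sketch; `pairwise_accuracy` is an illustrative name, not a function from the released NAO code:

```python
from itertools import product

# Pairwise accuracy of a performance predictor: among ordered test pairs
# (x1, x2) whose true dev performances satisfy s_x1 >= s_x2, the fraction
# on which the predicted scores agree, i.e. f(E(x1)) >= f(E(x2)).

def pairwise_accuracy(true_scores, pred_scores):
    n = len(true_scores)
    pairs = [(i, j) for i, j in product(range(n), repeat=2)
             if true_scores[i] >= true_scores[j]]
    agree = sum(pred_scores[i] >= pred_scores[j] for i, j in pairs)
    return agree / len(pairs)
```

Any predictor that preserves the true ranking of the test architectures scores 1.0 on this metric, regardless of how far its absolute accuracy estimates are from the truth, which is why ranking quality rather than regression error is what matters for the search.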
Furthermore, the performance predictor f produces predictions well aligned with the real performance, as shown by the small gap between the paired red and blue bars.\n\n4.2 Transferring the discovered architecture to CIFAR-100\n\nTo evaluate the transferability of the discovered NAONet, we apply it to CIFAR-100. We use the best architecture discovered on CIFAR-10 and follow exactly the same training settings. Meanwhile, we evaluate the performance of other automatically discovered neural networks on CIFAR-100 by strictly using the architectures reported in previous NAS papers [35, 34, 26]. All results are listed in Table 2. NAONet achieves a test error rate of 14.75, better than the previous SOTA obtained with cutout [11] (15.20). The results show that NAONet derived on CIFAR-10 is indeed transferable to a more complicated task such as CIFAR-100.\n\nModel | B | N | F | #op | Error (%) | #params\nDenseNet-BC [19] | / | 100 | 40 | / | 17.18 | 25.6M\nShake-shake [15] | / | / | / | / | 15.85 | 34.4M\nShake-shake + Cutout [11] | / | / | / | / | 15.20 | 34.4M\nNASNet-A [45] | 5 | 6 | 32 | 13 | 19.70 | 3.3M\nNASNet-A [45] + Cutout | 5 | 6 | 32 | 13 | 16.58 | 3.3M\nNASNet-A [45] + Cutout | 5 | 6 | 128 | 13 | 16.03 | 50.9M\nPNAS [26] | 5 | 3 | 48 | 8 | 19.53 | 3.2M\nPNAS [26] + Cutout | 5 | 3 | 48 | 8 | 17.63 | 3.2M\nPNAS [26] + Cutout | 5 | 6 | 128 | 8 | 16.70 | 53.0M\nENAS [34] | 5 | 5 | 36 | 5 | 19.43 | 4.6M\nENAS [34] + Cutout | 5 | 5 | 36 | 5 | 17.27 | 4.6M\nENAS [34] + Cutout | 5 | 5 | 36 | 5 | 16.44 | 52.7M\nAmoebaNet-B [35] | 5 | 6 | 128 | 19 | 17.66 | 34.9M\nAmoebaNet-B [35] + Cutout | 5 | 6 | 128 | 19 | 15.80 | 34.9M\nNAONet + Cutout | 5 | 6 | 36 | 11 | 15.67 | 10.8M\nNAONet + Cutout | 5 | 6 | 128 | 11 | 14.75 | 128M\n\nTable 2: Performances of different CNN models on the CIFAR-100 dataset. 
'NAONet' represents the best architecture discovered by NAO on CIFAR-10.\n\n4.3 Results of Language Modeling on PTB\n\nModels and Techniques | #params | Test Perplexity | GPU Days\nVanilla LSTM [43] | 66M | 78.4 | /\nLSTM + Zoneout [23] | 66M | 77.4 | /\nVariational LSTM [14] | 19M | 73.4 | /\nPointer Sentinel-LSTM [31] | 51M | 70.9 | /\nVariational LSTM + weight tying [20] | 51M | 68.5 | /\nVariational Recurrent Highway Network + weight tying [44] | 23M | 65.4 | /\n4-layer LSTM + skip connection + averaged weight drop + weight penalty + weight tying [29] | 24M | 58.3 | /\nLSTM + averaged weight drop + Mixture of Softmax + weight penalty + weight tying [42] | 22M | 56.0 | /\nNAS + weight tying [45] | 54M | 62.4 | 1e4 CPU days\nENAS + weight tying + weight penalty [34] | 24M | 58.6 (5) | 0.5\nRandom-WS + weight tying + weight penalty | 27M | 58.81 | 0.4\nDARTS + weight tying + weight penalty [28] | 23M | 56.1 | 1\nNAONet + weight tying + weight penalty | 27M | 56.0 | 300\nNAONet-WS + weight tying + weight penalty | 27M | 56.6 | 0.4\n\nTable 3: Performance of different models and techniques on the PTB dataset. Similar to the CIFAR-10 experiment, 'NAONet-WS' represents NAO accompanied with weight sharing, and 'Random-WS' is the corresponding random search baseline.\n\n(5) We adopt the number reported in [28], which is similar to our reproduction.\n\nWe leave the model training details for PTB to the Appendix. The encoder in NAO is an LSTM with embedding size 64 and hidden size 128. The hidden state of the LSTM is further normalized to have unit length. The performance predictor is a two-layer MLP with layer sizes 200 and 1. The decoder is a single layer LSTM with an attention mechanism, and its hidden size is 128. The trade-off parameter in Eqn. (1) is \u03bb = 0.8. The encoder, performance predictor and decoder are trained using Adam with a learning rate of 0.001. We perform the optimization process in Alg. 1 for two iterations (i.e., L = 2). 
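Algorithm 1 itself is not reproduced in this excerpt, but the iterative procedure it refers to, train the encoder/predictor/decoder on the evaluated architectures, push the best embeddings uphill on f, decode and evaluate the new candidates, and repeat for L iterations, can be sketched as follows. All helper callables here are hypothetical placeholders, not functions from the NAO codebase:

```python
def nao_search(initial_archs, train_and_eval, fit_nao, ascend, decode, L=2):
    """Hypothetical sketch of the NAO outer loop.

    train_and_eval: trains an architecture and returns its dev performance.
    fit_nao: jointly trains encoder/predictor/decoder on the evaluated
             pool and returns (encoder, predictor) callables.
    ascend:  moves an embedding along the predictor's gradient.
    decode:  maps an optimized embedding back to an architecture.
    """
    pool = {a: train_and_eval(a) for a in initial_archs}
    for _ in range(L):
        encoder, predictor = fit_nao(pool)          # joint multi-task training
        ranked = sorted(pool, key=pool.get, reverse=True)
        new_embeddings = [ascend(encoder(a), predictor) for a in ranked]
        for a in map(decode, new_embeddings):       # candidate set X_eval
            if a not in pool:
                pool[a] = train_and_eval(a)
    return max(pool, key=pool.get)
```

With toy stand-ins (integer "architectures", an identity encoder/decoder and a unit ascent step), two iterations move the pool toward better-scoring candidates; in the real system each `train_and_eval` call is a full (or weight-shared) child-model training run, which is why the iteration count L stays small.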
We train the sampled RNN models for a shorter time (600 epochs) during the training phase of NAO, and afterwards train the best architecture discovered so far for 2000 epochs for the sake of better results. We use 200 P100 GPU cards to complete the whole process within 1.5 days. Similar to CIFAR-10, we furthermore explore the possibility of combining weight sharing with NAO; the resulting architecture is denoted as 'NAO-WS'.\n\nWe report all the results in Table 3, separated into three blocks, respectively reporting the results of expert-designed methods, architectures discovered via previous automatic neural architecture search methods, and our NAO. As can be observed, NAO successfully discovered an architecture that achieves a quite competitive perplexity of 56.0, surpassing previous NAS methods and on par with the best performance from LSTM methods with advanced manually designed techniques such as averaged weight drop [29]. Furthermore, NAO combined with weight sharing (i.e., NAO-WS) again demonstrates its efficiency in discovering competitive architectures (e.g., achieving 56.6 perplexity via searching for 10 hours).\n\n4.4 Transferring the discovered architecture to WikiText-2\n\nWe also apply the best discovered RNN architecture on PTB to another language modeling task based on a much larger dataset, WikiText-2 (WT2 for short in the following). 
Table 4 shows that NAONet discovered by our method is on par with, or surpasses, ENAS and DARTS.\n\nModels and Techniques | #params | Test Perplexity\nVariational LSTM + weight tying [20] | 28M | 87.0\nLSTM + continuous cache pointer [16] | - | 68.9\nLSTM [30] | 33M | 66.0\n4-layer LSTM + skip connection + averaged weight drop + weight penalty + weight tying [29] | 24M | 65.9\nLSTM + averaged weight drop + Mixture of Softmax + weight penalty + weight tying [42] | 33M | 63.3\nENAS + weight tying + weight penalty [34] (searched on PTB) | 33M | 70.4\nDARTS + weight tying + weight penalty (searched on PTB) | 33M | 66.9\nNAONet + weight tying + weight penalty (searched on PTB) | 36M | 67.0\n\nTable 4: Performance of different models and techniques on the WT2 dataset. 'NAONet' represents the best architecture discovered by NAO on PTB.\n\n5 Conclusion\n\nWe design a new automatic architecture design algorithm named neural architecture optimization (NAO), which performs the optimization within a continuous space (using gradient based methods) rather than searching over discrete decisions. The encoder, performance predictor and decoder together make it more effective and efficient to discover better architectures, and we achieve quite competitive results on both the image classification task and the language modeling task. For future work, first we would like to try other methods to further improve the performance of the discovered architectures, such as mixture of softmax [42] for language modeling. Second, we would like to apply NAO to discovering better architectures for more applications such as Neural Machine Translation. Third, we plan to design better neural models from the view of teaching and learning to teach [13, 39].\n\n6 Acknowledgement\n\nWe thank Hieu Pham for the discussion on some details of the ENAS implementation, and Hanxiao Liu for the code base of the language modeling task in DARTS [28]. 
We furthermore thank the anonymous reviewers for their constructive comments.