{"title": "Neural Architecture Search with Bayesian Optimisation and Optimal Transport", "book": "Advances in Neural Information Processing Systems", "page_first": 2016, "page_last": 2025, "abstract": "Bayesian Optimisation (BO) refers to a class of methods for global optimisation\nof a function f which is only accessible via point evaluations. It is\ntypically used in settings where f is expensive to evaluate. A common use case\nfor BO in machine learning is model selection, where it is not possible to\nanalytically model the generalisation performance of a statistical model, and\nwe resort to noisy and expensive training and validation procedures to choose\nthe best model. Conventional BO methods have focused on Euclidean and\ncategorical domains, which, in the context of model selection, only permits\ntuning scalar hyper-parameters of machine learning algorithms. However, with\nthe surge of interest in deep learning, there is an increasing demand to tune\nneural network architectures. In this work, we develop NASBOT, a Gaussian\nprocess based BO framework for neural architecture search. To accomplish this,\nwe develop a distance metric in the space of neural network architectures which\ncan be computed efficiently via an optimal transport program. This distance\nmight be of independent interest to the deep learning community as it may find\napplications outside of BO. 
We demonstrate that NASBOT outperforms other\nalternatives for architecture search in several cross validation based model\nselection tasks on multi-layer perceptrons and convolutional neural networks.", "full_text": "Neural Architecture Search\n\nwith Bayesian Optimisation and Optimal Transport\n\nKirthevasan Kandasamy, Willie Neiswanger, Jeff Schneider, Barnab\u00e1s P\u00f3czos, Eric P Xing\n\n{kandasamy, willie, schneide, bapoczos, epxing}@cs.cmu.edu\n\nCarnegie Mellon University,\n\nPetuum Inc.\n\nAbstract\n\nBayesian Optimisation (BO) refers to a class of methods for global optimisation of\na function f which is only accessible via point evaluations. It is typically used in\nsettings where f is expensive to evaluate. A common use case for BO in machine\nlearning is model selection, where it is not possible to analytically model the gener-\nalisation performance of a statistical model, and we resort to noisy and expensive\ntraining and validation procedures to choose the best model. Conventional BO\nmethods have focused on Euclidean and categorical domains, which, in the context\nof model selection, only permits tuning scalar hyper-parameters of machine learn-\ning algorithms. However, with the surge of interest in deep learning, there is an\nincreasing demand to tune neural network architectures. In this work, we develop\nNASBOT, a Gaussian process based BO framework for neural architecture search.\nTo accomplish this, we develop a distance metric in the space of neural network\narchitectures which can be computed ef\ufb01ciently via an optimal transport program.\nThis distance might be of independent interest to the deep learning community as it\nmay \ufb01nd applications outside of BO. 
We demonstrate that NASBOT outperforms\nother alternatives for architecture search in several cross validation based model\nselection tasks on multi-layer perceptrons and convolutional neural networks.\n\n1 Introduction\n\nIn many real world problems, we are required to sequentially evaluate a noisy black-box function\nf with the goal of finding its optimum in some domain X . Typically, each evaluation is expensive\nin such applications, and we need to keep the number of evaluations to a minimum. Bayesian\noptimisation (BO) refers to an approach for global optimisation that is popularly used in such settings.\nIt uses Bayesian models for f to infer function values at unexplored regions and guide the selection\nof points for future evaluations. BO has been successfully applied to many optimisation problems in\noptimal policy search, industrial design, and scientific experimentation. That said, the quintessential\nuse case for BO in machine learning is model selection [14, 40]. For instance, consider selecting\nthe regularisation parameter \u03bb and kernel bandwidth h for an SVM. We can set this up as a zeroth\norder optimisation problem where our domain is a two dimensional space of (\u03bb, h) values, and each\nfunction evaluation trains the SVM on a training set, and computes the accuracy on a validation set.\nThe goal is to find the model, i.e. hyper-parameters, with the highest validation accuracy.\nThe majority of the BO literature has focused on settings where the domain X is either Euclidean\nor categorical. This suffices for many tasks, such as the SVM example above. However, with\nrecent successes in deep learning, neural networks are increasingly becoming the method of choice\nfor many machine learning applications. A number of recent works have designed novel neural\nnetwork architectures that significantly outperform the previous state of the art [12, 13, 37, 45]. 
This\nmotivates studying model selection over the space of neural architectures to optimise for generalisation\nperformance. A critical challenge in this endeavour is that evaluating a network via train and validation\nprocedures is very expensive. This paper proposes a BO framework for this problem.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fWhile there are several approaches to BO, those based on Gaussian processes (GP) [35] are most\ncommon in the BO literature. In its most unadorned form, a BO algorithm operates sequentially,\nstarting at time 0 with a GP prior for f; at time t, it incorporates results of evaluations from 1, . . . , t\u22121\nin the form of a posterior for f. It then uses this posterior to construct an acquisition function \u03d5t,\nwhere \u03d5t(x) is a measure of the value of evaluating f at x at time t if our goal is to maximise f.\nAccordingly, it chooses to evaluate f at the maximiser of the acquisition, i.e. xt = argmaxx\u2208X \u03d5t(x).\nThere are two key ingredients to realising this plan for GP based BO. First, we need to quantify the\nsimilarity between two points x, x(cid:48) in the domain in the form of a kernel \u03ba(x, x(cid:48)). The kernel is\nneeded to de\ufb01ne the GP, which allows us to reason about an unevaluated value f (x(cid:48)) when we have\nalready evaluated f (x). Secondly, we need a method to maximise \u03d5t.\nThese two steps are fairly straightforward in conventional domains. For example, in Euclidean spaces,\nwe can use one of many popular kernels such as Gaussian, Laplacian, or Mat\u00e9rn; we can maximise\n\u03d5t via off the shelf branch-and-bound or gradient based methods. However, when each x \u2208 X is a\nneural network architecture, this is not the case. Hence, our challenges in this work are two-fold.\nFirst, we need to quantify (dis)similarity between two networks. Intuitively, in Fig. 
1, network 1a is\nmore similar to network 1b, than it is to 1c. Secondly, we need to be able to traverse the space of\nsuch networks to optimise the acquisition function. Our main contributions are as follows.\n1. We develop a (pseudo-)distance for neural network architectures called OTMANN (Optimal\nTransport Metrics for Architectures of Neural Networks) that can be computed ef\ufb01ciently via an\noptimal transport program.\n\n2. We develop a BO framework for optimising functions on neural network architectures called\nNASBOT (Neural Architecture Search with Bayesian Optimisation and Optimal Transport). This\nincludes an evolutionary algorithm to optimise the acquisition function.\n\n3. Empirically, we demonstrate that NASBOT outperforms other baselines on model selection tasks\nfor multi-layer perceptrons (MLP) and convolutional neural networks (CNN). Our python imple-\nmentations of OTMANN and NASBOT are available at github.com/kirthevasank/nasbot.\n\nRelated Work: Recently, there has been a surge of interest in methods for neural architecture\nsearch [1, 6, 8, 21, 25, 26, 30, 32, 36, 41, 51\u201354]. We discuss them in detail in the Appendix due to\nspace constraints. Broadly, they fall into two categories, based on either evolutionary algorithms (EA)\nor reinforcement learning (RL). EA provide a simple mechanism to explore the space of architectures\nby making a sequence of changes to networks that have already been evaluated. However, as we will\ndiscuss later, they are not ideally suited for optimising functions that are expensive to evaluate. While\nRL methods have seen recent success, architecture search is in essence an optimisation problem \u2013\n\ufb01nd the network with the lowest validation error. There is no explicit need to maintain a notion of\nstate and solve credit assignment [43]. 
Since RL is a fundamentally more dif\ufb01cult problem than\noptimisation [16], these approaches need to try a very large number of architectures to \ufb01nd the\noptimum. This is not desirable, especially in computationally constrained settings.\nNone of the above methods have been designed with a focus on the expense of evaluating a neural\nnetwork, with an emphasis on being judicious in selecting which architecture to try next. Bayesian\noptimisation (BO) uses introspective Bayesian models to carefully determine future evaluations and\nis well suited for expensive evaluations. BO usually consumes more computation to determine future\npoints than other methods, but this pays dividends when the evaluations are very expensive. While\nthere has been some work on BO for architecture search [2, 15, 28, 40, 44], they have only been\napplied to optimise feed forward structures, e.g. Fig. 1a, but not Figs. 1b, 1c. We compare NASBOT\nto one such method and demonstrate that feed forward structures are inadequate for many problems.\n2 Set Up\nOur goal is to maximise a function f de\ufb01ned on a space X of neural network architectures. When\nwe evaluate f at x \u2208 X , we obtain a possibly noisy observation y of f (x). In the context of\narchitecture search, f is the performance on a validation set after x is trained on the training set. If\nx(cid:63) = argmaxX f (x) is the optimal architecture, and xt is the architecture evaluated at time t, we\nwant f (x(cid:63)) \u2212 maxt\u2264n f (xt) to vanish fast as the number of evaluations n \u2192 \u221e. We begin with a\nreview of BO and then present a graph theoretic formalism for neural network architectures.\n\n2.1 A brief review of Gaussian Process based Bayesian Optimisation\nA GP is a random process de\ufb01ned on some domain X , and is characterised by a mean function\n\u00b5 : X \u2192 R and a (covariance) kernel \u03ba : X 2 \u2192 R. 
Given n observations $D_n = \\{(x_i, y_i)\\}_{i=1}^{n}$, where\nxi \u2208 X , yi = f (xi) + \u03b5i \u2208 R, and \u03b5i \u223c N (0, \u03b7\u00b2), the posterior process f|Dn is also a GP with mean\n\u00b5n and covariance \u03ban. Denote Y \u2208 Rn with Yi = yi, k, k\u2032 \u2208 Rn with ki = \u03ba(x, xi), k\u2032i = \u03ba(x\u2032, xi),\nand K \u2208 Rn\u00d7n with Ki,j = \u03ba(xi, xj). Then, \u00b5n, \u03ban can be computed via\n\n$$\\mu_n(x) = k^\\top (K + \\eta^2 I)^{-1} Y, \\qquad \\kappa_n(x, x') = \\kappa(x, x') - k^\\top (K + \\eta^2 I)^{-1} k'. \\tag{1}$$\n\nFigure 1: An illustration of some CNN architectures. In each layer, i: indexes the layer, followed by\nthe label (e.g. conv3), and then the number of units (e.g. number of filters). The input and output\nlayers are pink while the decision (softmax) layers are green. From Section 3: The layer mass is\ndenoted in parentheses. The following are the unnormalised and normalised distances d, \u00afd. All self\ndistances are 0, i.e. d(G,G) = \u00afd(G,G) = 0. Unnormalised: d(a, b) = 175.1, d(a, c) = 1479.3,\nd(b, c) = 1621.4. Normalised: \u00afd(a, b) = 0.0286, \u00afd(a, c) = 0.2395, \u00afd(b, c) = 0.2625.\n\nFor more background on GPs, we refer readers to Rasmussen and Williams [35]. When tasked with\noptimising a function f over a domain X , BO models f as a sample from a GP. At time t, we have\nalready evaluated f at points $\\{x_i\\}_{i=1}^{t-1}$ and obtained observations $\\{y_i\\}_{i=1}^{t-1}$. To determine the next point\nfor evaluation xt, we first use the posterior GP to define an acquisition function \u03d5t : X \u2192 R, which\nmeasures the utility of evaluating f at any x \u2208 X according to the posterior. We then maximise the\nacquisition, xt = argmaxX \u03d5t(x), and evaluate f at xt. 
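To make Eq. (1) concrete, the following is a minimal sketch of the posterior computation. It is not the NASBOT implementation: the inputs are scalars standing in for architectures, the kernel is an exponentiated absolute difference rather than OTMANN, and the names kernel, solve and gp_posterior are hypothetical helpers introduced only for this illustration.

```python
import math


def kernel(x, y, beta=1.0, p=2):
    # Exponentiated-distance kernel exp(-beta * |x - y|^p); p = 2 gives a Gaussian kernel.
    # Assumption: scalar inputs stand in for architectures in this sketch.
    return math.exp(-beta * abs(x - y) ** p)


def solve(A, b):
    # Solve the linear system A z = b by Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][j] * z[j] for j in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def gp_posterior(xs, ys, x, x2, eta=0.1):
    # Posterior mean at x and posterior covariance between x and x2, following Eq. (1).
    n = len(xs)
    K = [[kernel(xs[i], xs[j]) + (eta ** 2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k = [kernel(x, xi) for xi in xs]
    k2 = [kernel(x2, xi) for xi in xs]
    mean = sum(ki * ai for ki, ai in zip(k, solve(K, ys)))
    cov = kernel(x, x2) - sum(ki * bi for ki, bi in zip(k, solve(K, k2)))
    return mean, cov
```

Calling gp_posterior(xs, ys, x, x) returns the posterior mean and variance at a single point x. Near an evaluated point the variance shrinks towards the noise level, while far from all evaluations it returns to the prior value of 1, which is the behaviour that the acquisition function exploits.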
The expected improvement acquisition [31],\n\n$$\\varphi_t(x) = \\mathbb{E}\\big[ \\max\\{0, f(x) - \\tau_{t-1}\\} \\,\\big|\\, \\{(x_i, y_i)\\}_{i=1}^{t-1} \\big], \\tag{2}$$\n\nmeasures the expected improvement over the current maximum value according to the posterior GP.\nHere $\\tau_{t-1} = \\max_{i \\le t-1} f(x_i)$ denotes the current best value. This expectation can be computed in\nclosed form for GPs. We use EI in this work, but the ideas apply just as well to other acquisitions [3].\nGP/BO in the context of architecture search: Intuitively, \u03ba(x, x\u2032) is a measure of similarity\nbetween x and x\u2032. If \u03ba(x, x\u2032) is large, then f (x) and f (x\u2032) are highly correlated. Hence, the GP\neffectively imposes a smoothness condition on f : X \u2192 R; i.e. since networks a and b in Fig. 1\nare similar, they are likely to have similar cross validation performance. In BO, when selecting the\nnext point, we balance between exploitation, choosing points that we believe will have high f value,\nand exploration, choosing points that we do not know much about so that we do not get stuck at a\nbad optimum. For example, if we have already evaluated f (a), then exploration incentivises us to\nchoose c over b since we can reasonably gauge f (b) from f (a). On the other hand, if f (a) has high\nvalue, then exploitation incentivises choosing b, as it is more likely to be the optimum than c.\n\n2.2 A Mathematical Formalism for Neural Networks\nOur formalism will view a neural network as a graph whose vertices are the layers of the network.\nWe will use the CNNs in Fig. 1 to illustrate the concepts. A neural network G = (L,E) is defined\nby a set of layers L and directed edges E. An edge (u, v) \u2208 E is an ordered pair of layers. In Fig. 1,\nthe layers are depicted by rectangles and the edges by arrows. 
A layer u \u2208 L is equipped with\na layer label \u2113\u2113(u) which denotes the type of operations performed at the layer. For instance, in\nFig. 1a, \u2113\u2113(1) = conv3, \u2113\u2113(5) = max-pool denote a 3 \u00d7 3 convolution and a max-pooling operation.\nThe attribute \u2113u denotes the number of computational units in a layer. In Fig. 1b, \u2113u(5) = 32 and\n\u2113u(7) = 16 are the number of convolutional filters and fully connected nodes.\n\n[Figure 1 layer labels, in the form index: label, units (mass).\n(a) 0: ip (235); 1: conv3, 16 (16); 2: conv3, 16 (256); 3: conv3, 32 (512); 4: conv5, 32 (1024); 5: max-pool, 1 (32); 6: fc, 16 (512); 7: softmax (235); 8: op (235).\n(b) 0: ip (235); 1: conv3, 16 (16); 2: conv3, 16 (256); 3: conv3, 16 (256); 4: conv3, 16 (256); 5: conv5, 32 (1024); 6: max-pool, 1 (32); 7: fc, 16 (512); 8: softmax (235); 9: op (235).\n(c) 0: ip (240); 1: conv7, 16 (16); 2: conv5, 32 (512); 3: conv3 /2, 16 (256); 4: conv3, 16 (256); 5: avg-pool, 1 (32); 6: max-pool, 1 (16); 7: max-pool, 1 (16); 8: fc, 16 (512); 9: conv3, 16 (256); 10: softmax (120); 11: max-pool, 1 (16); 12: fc, 16 (512); 13: softmax (120); 14: op (240).]\n\nIn addition, each network has decision layers which are used to obtain the predictions of the\nnetwork. For a classification task, the decision layers perform softmax operations and output the\nprobabilities an input datum belongs to each class. For regression, the decision layers perform\nlinear combinations of the outputs of the previous layers and output a single scalar. All networks\nhave at least one decision layer. When a network has multiple decision layers, we average the output\nof each decision layer to obtain the final output. The decision layers are shown in green in Fig. 1.\nFinally, every network has a unique input layer uip and output layer uop with labels \u2113\u2113(uip) = ip and\n\u2113\u2113(uop) = op. It is instructive to think of the role of uip as feeding a data point to the network and the\nrole of uop as averaging the results of the decision layers. The input and output layers are shown in\npink in Fig. 1. 
We refer to all layers that are not input, output or decision layers as processing layers.\nThe directed edges are to be interpreted as follows. The output of each layer is fed to each of its\nchildren; so both layers 2 and 3 in Fig. 1b take the output of layer 1 as input. When a layer has\nmultiple parents, the inputs are concatenated; so layer 5 sees an input of 16 + 16 filtered channels\ncoming in from layers 3 and 4. Finally, we mention that neural networks are also characterised by the\nvalues of the weights/parameters between layers. In architecture search, we typically do not consider\nthese weights. Instead, an algorithm will (somewhat ideally) assume access to an optimisation oracle\nthat can minimise the loss function on the training set and find the optimal weights.\nWe next describe a distance d : X 2 \u2192 R+ for neural architectures. Recall that our eventual goal is\na kernel for the GP; given a distance d, we will aim for $\\kappa(x, x') = e^{-\\beta d(x, x')^p}$, where \u03b2, p \u2208 R+,\nas the kernel. Many popular kernels take this form. For example, when X \u2282 Rn and d is the L2 norm,\np = 1, 2 correspond to the Laplacian and Gaussian kernels respectively.\n\n3 The OTMANN Distance\nTo motivate this distance, note that the performance of a neural network is determined by the amount\nof computation at each layer, the types of these operations, and how the layers are connected. A\nmeaningful distance should account for these factors. To that end, OTMANN is defined as the\nminimum of a matching scheme which attempts to match the computation at the layers of one\nnetwork to the layers of the other. We incur penalties for matching layers with different types of\noperations or those at structurally different positions. 
We will find a matching that minimises these\npenalties, and the total penalty at the minimum will give rise to a distance. We first describe two\nconcepts, layer masses and path lengths, which we will use to define OTMANN.\nLayer masses: The layer masses \u2113m : L \u2192 R+ will be the quantity that we match between the layers\nof two networks when comparing them. \u2113m(u) quantifies the significance of layer u. For processing\nlayers, \u2113m(u) will represent the amount of computation carried out by layer u and is computed via the\nproduct of \u2113u(u) and the number of incoming units. For example, in Fig. 1b, \u2113m(5) = 32 \u00d7 (16 + 16)\nas there are 16 filtered channels coming from each of layers 3 and 4. As there is no\ncomputation at the input and output layers, we cannot define the layer mass directly as we did for the\nprocessing layers. Therefore, we use $\\ell_m(u_{ip}) = \\ell_m(u_{op}) = \\zeta \\sum_{u \\in PL} \\ell_m(u)$ where PL denotes the\nset of processing layers, and \u03b6 \u2208 (0, 1) is a parameter to be determined. Intuitively, we are using an\namount of mass that is proportional to the amount of computation in the processing layers. Similarly,\nthe decision layers occupy a significant role in the architecture as they directly influence the output.\nWhile there is computation being performed at these layers, this might be problem dependent \u2013 there\nis more computation performed at the softmax layer in a 10 class classification problem than in a\n2 class problem. Furthermore, we found that setting the layer mass for decision layers based on\ncomputation underestimates their contribution to the network. Following the same intuition as we did\nfor the input/output layers, we assign an amount of mass proportional to the mass in the processing\nlayers. 
Since the outputs of the decision layers are averaged, we distribute the mass among all\ndecision layers; that is, if DL are decision layers, \u2200 u \u2208 DL, $\\ell_m(u) = \\frac{\\zeta}{|DL|} \\sum_{u' \\in PL} \\ell_m(u')$. In all\nour experiments, we use \u03b6 = 0.1. In Fig. 1, the layer masses for each layer are shown in parentheses.\n\nTable 1: An example label mismatch cost matrix M. There is zero cost for matching identical layers,\n< 1 cost for similar layers, and infinite cost for disparate layers.\n\n          conv3  conv5  max-pool  avg-pool  fc\nconv3     0      0.2    \u221e         \u221e         \u221e\nconv5     0.2    0      \u221e         \u221e         \u221e\nmax-pool  \u221e      \u221e      0         0.25      \u221e\navg-pool  \u221e      \u221e      0.25      0         \u221e\nfc        \u221e      \u221e      \u221e         \u221e         0\n\nPath lengths from/to uip/uop: In a neural network G, a path from u to v is a sequence of layers\nu1, . . . , us where u1 = u, us = v and (ui, ui+1) \u2208 E for all i \u2264 s \u2212 1. The length of this path is\nthe number of hops from one node to another in order to get from u to v. For example, in Fig. 1c,\n(2, 5, 8, 13) is a path from layer 2 to 13 of length 3. Let the shortest (longest) path length from u to\nv be the smallest (largest) number of hops from one node to another among all paths from u to v.\nAdditionally, define the random walk path length as the expected number of hops to get from u to v, if,\nfrom any layer we hop to one of its children chosen uniformly at random. For example, in Fig. 1c, the\nshortest, longest and random walk path lengths from layer 1 to layer 14 are 5, 7, and 5.67 respectively.\nFor any u \u2208 L, let $\\delta^{sp}_{op}(u)$, $\\delta^{lp}_{op}(u)$, $\\delta^{rw}_{op}(u)$ denote the length of the shortest, longest and random walk\npaths from u to the output uop. Similarly, let $\\delta^{sp}_{ip}(u)$, $\\delta^{lp}_{ip}(u)$, $\\delta^{rw}_{ip}(u)$ denote the corresponding lengths\nfor walks from the input uip to u. 
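The three path lengths to the output can be obtained in a single pass over the graph. The sketch below is one possible way to do this, assuming the network is a directed acyclic graph in which every layer has a path to the output; the function name path_lengths_to_output and the small example graph are hypothetical and do not correspond to any network in Fig. 1.

```python
from collections import defaultdict


def path_lengths_to_output(edges, output):
    # edges: list of (u, v) pairs of a directed acyclic graph.
    # Returns three dicts mapping each layer to its shortest, longest and
    # random-walk path length to the output layer.
    children = defaultdict(list)
    nodes = set()
    for u, v in edges:
        children[u].append(v)
        nodes.update((u, v))
    sp, lp, rw = {output: 0}, {output: 0}, {output: 0.0}

    def visit(u):
        # Memoised depth-first recursion, so each edge is examined once.
        if u in sp:
            return
        for c in children[u]:
            visit(c)
        sp[u] = 1 + min(sp[c] for c in children[u])
        lp[u] = 1 + max(lp[c] for c in children[u])
        # A random walk hops to a child chosen uniformly at random.
        rw[u] = 1 + sum(rw[c] for c in children[u]) / len(children[u])

    for u in nodes:
        visit(u)
    return sp, lp, rw


# Hypothetical five-layer graph: layer 1 feeds layers 2 and 3, and layer 2 also feeds layer 3.
toy_edges = [(0, 1), (1, 2), (1, 3), (2, 3), (3, 4)]
toy_sp, toy_lp, toy_rw = path_lengths_to_output(toy_edges, 4)
```

In this toy graph, layer 1 has shortest, longest and random walk path lengths of 2, 3 and 2.5 to the output. The lengths from the input can be computed in the same way after reversing every edge, with the caveat that the random walk from the input still hops forward along the original edges, so that case needs a forward pass rather than a simple reversal.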
As the layers of a neural network can be topologically ordered\u00b9, the\nabove path lengths are well defined and finite. Further, for any s \u2208 {sp,lp,rw} and t \u2208 {ip,op}, $\\delta^{s}_{t}(u)$\ncan be computed for all u \u2208 L, in O(|E|) time (see Appendix A.3 for details).\nWe are now ready to describe OTMANN. Given two networks G1 = (L1,E1), G2 = (L2,E2) with\nn1, n2 layers respectively, we will attempt to match the layer masses in both networks. We let\n$Z \\in \\mathbb{R}_{+}^{n_1 \\times n_2}$ be such that Z(i, j) denotes the amount of mass matched between layer i \u2208 G1 and\nj \u2208 G2. The OTMANN distance is computed by solving the following optimisation problem.\n\n$$\\min_{Z} \\; \\phi_{lmm}(Z) + \\phi_{nas}(Z) + \\nu_{str} \\phi_{str}(Z) \\quad \\text{subject to} \\quad \\sum_{j \\in L_2} Z_{ij} \\le \\ell_m(i), \\;\\; \\sum_{i \\in L_1} Z_{ij} \\le \\ell_m(j), \\;\\; \\forall i, j. \\tag{3}$$\n\nThe label mismatch term \u03c6lmm penalises matching masses that have different labels, while the\nstructural term \u03c6str penalises matching masses at structurally different positions with respect to each\nother. If we choose not to match any mass in either network, we incur a non-assignment penalty \u03c6nas.\n\u03bdstr > 0 determines the trade-off between the structural and other terms. The inequality constraints\nensure that we do not over assign the masses in a layer. We now describe \u03c6lmm, \u03c6nas, and \u03c6str.\nLabel mismatch penalty \u03c6lmm: We begin with a label penalty matrix M \u2208 RL\u00d7L where L is\nthe number of all label types and M (x, y) denotes the penalty for transporting a unit mass from\na layer with label x to a layer with label y. 
We then construct a matrix Clmm \u2208 Rn1\u00d7n2 with\nClmm(i, j) = M (\u2113\u2113(i), \u2113\u2113(j)) corresponding to the mislabel cost for matching unit mass from each\nlayer i \u2208 L1 to each layer j \u2208 L2. We then set $\\phi_{lmm}(Z) = \\langle Z, C_{lmm} \\rangle = \\sum_{i \\in L_1, j \\in L_2} Z(i, j) C_{lmm}(i, j)$\nto be the sum of all matchings from L1 to L2 weighted by the label penalty terms. This matrix M,\nillustrated in Table 1, is a parameter that needs to be specified for OTMANN. Its entries can be specified\nwith an intuitive understanding of the functionality of the layers; e.g. many values in M are \u221e, while\nfor similar layers, we choose a value less than 1.\nNon-assignment penalty \u03c6nas: We set this to be the amount of mass that is unassigned in both networks,\ni.e. $\\phi_{nas}(Z) = \\sum_{i \\in L_1} \\big( \\ell_m(i) - \\sum_{j \\in L_2} Z_{ij} \\big) + \\sum_{j \\in L_2} \\big( \\ell_m(j) - \\sum_{i \\in L_1} Z_{ij} \\big)$. This essentially\nimplies that the cost for not assigning unit mass is 1. The costs in Table 1 are defined relative to\nthis. For similar layers x, y, M (x, y) \u226a 1 and for disparate layers M (x, y) \u226b 1. That is, we would\nrather match conv3 to conv5 than not assign it, provided the structural penalty for doing so is small;\nconversely, we would rather not assign a conv3 than assign it to fc. This also explains why we did\nnot use a trade-off parameter like \u03bdstr for \u03c6lmm and \u03c6nas \u2013 it is simple to specify reasonable values for\nM (x, y) from an understanding of their functionality.\nStructural penalty \u03c6str: We define a matrix Cstr \u2208 Rn1\u00d7n2 where Cstr(i, j) is small if layers i \u2208 L1\nand j \u2208 L2 are at structurally similar positions in their respective networks. We then set \u03c6str(Z) =\n\u27e8Z, Cstr\u27e9. 
For i \u2208 L1, j \u2208 L2, we let $C_{str}(i, j) = \\frac{1}{6} \\sum_{s \\in \\{sp, lp, rw\\}} \\sum_{t \\in \\{ip, op\\}} |\\delta^{s}_{t}(i) - \\delta^{s}_{t}(j)|$ be the\naverage of all path length differences, where $\\delta^{s}_{t}$ are the path lengths defined previously. We define\n\u03c6str in terms of the shortest/longest/random-walk path lengths from/to the input/output, because they\ncapture various notions of information flow in a neural network; a layer\u2019s input is influenced by the\npaths the data takes before reaching the layer and its output influences all layers it passes through\nbefore reaching the decision layers. If the path lengths are similar for two layers, they are likely to be\nat similar structural positions. Further, this form allows us to solve (3) efficiently via an OT program\nand prove distance properties about the solution. If we need to compute pairwise distances for several\nnetworks, as is the case in BO, the path lengths can be pre-computed in O(|E|) time, and used to\nconstruct Cstr for two networks at the moment of computing the distance between them.\nThis completes the description of our matching program. In Appendix A, we prove that (3) can be\nformulated as an Optimal Transport (OT) program [47]. OT is a well studied problem with several\nefficient solvers [33]. Our theorem below shows that the solution of (3) is a distance.\n\nFootnote 1: A topological ordering is an ordering of the layers u1, . . . , u|L| such that u comes before v if (u, v) \u2208 E.\n\nTheorem 1. Let d(G1,G2) be the solution of (3) for networks G1,G2. Under mild regularity conditions\non M, d(\u00b7,\u00b7) is a pseudo-distance. That is, for all networks G1,G2,G3, it satisfies d(G1,G2) \u2265 0,\nd(G1,G2) = d(G2,G1), d(G1,G1) = 0 and d(G1,G3) \u2264 d(G1,G2) + d(G2,G3).\n\nTable 2: Descriptions of modifiers to transform one network to another. The first four change the number of\nunits in the layers but do not change the architecture, while the last five change the architecture.\n\nOperation     Description\ndec_single    Pick a layer at random and decrease the number of units by 1/8.\ndec_en_masse  Pick several layers at random and decrease the number of units by 1/8 for all of them.\ninc_single    Pick a layer at random and increase the number of units by 1/8.\ninc_en_masse  Pick several layers at random and increase the number of units by 1/8 for all of them.\ndup_path      Pick a random path u1, . . . , uk, duplicate u2, . . . , uk\u22121 and connect them to u1 and uk.\nremove_layer  Pick a layer at random and remove it. Connect the layer\u2019s parents to its children if necessary.\nskip          Randomly pick layers u, v where u is topologically before v. Add (u, v) to E.\nswap_label    Randomly pick a layer and change its label.\nwedge_layer   Randomly remove an edge (u, v) from E. Create a new layer w and add (u, w), (w, v) to E.\n\nFor what follows, define \u00afd(G1,G2) = d(G1,G2)/(tm(G1)+tm(G2)) where $tm(G_i) = \\sum_{u \\in L_i} \\ell_m(u)$\nis the total mass of a network. Note that \u00afd \u2264 1. While \u00afd does not satisfy the triangle inequality, it\nprovides a useful measure of dissimilarity normalised by the amount of computation. Our experience\nsuggests that d puts more emphasis on the amount of computation at the layers over structure and\nvice versa for \u00afd. Therefore, it is prudent to combine both quantities in any downstream application.\nThe caption in Fig. 1 gives d, \u00afd values for the examples in that figure when \u03bdstr = 0.5.\nWe conclude this section with a couple of remarks. First, OTMANN shares similarities with Wasserstein\n(earth mover\u2019s) distances which also have an OT formulation. However, it is not a Wasserstein\ndistance itself; in particular, the supports of the masses and the cost matrices change depending\non the two networks being compared. 
Second, while there has been prior work on defining various\ndistances and kernels on graphs, we cannot use them in BO because neural networks have complex\nproperties beyond their graph structure, such as the type of operations performed at\neach layer, the number of neurons, etc. The above works either define the distance/kernel between\nvertices or assume the same vertex (layer) set [9, 23, 29, 38, 49], none of which apply in our setting.\nWhile some methods do allow different vertex sets [48], they cannot handle layer masses and layer\nsimilarities. Moreover, the computation of the above distances is more expensive than OTMANN.\nHence, these methods cannot be directly plugged into a BO framework for architecture search.\nIn Appendix A, we provide additional material on OTMANN. This includes the proof of Theorem 1,\na discussion on some design choices, and implementation details such as the computation of the path\nlengths. Moreover, we provide illustrations to demonstrate that OTMANN is a meaningful distance\nfor architecture search. For example, a t-SNE embedding places similar architectures close to each\nother. Further, scatter plots showing the validation error vs distance on real datasets demonstrate that\nnetworks with small distance tend to perform similarly on the problem.\n4 NASBOT\nWe now describe NASBOT, our BO algorithm for neural architecture search. Recall that in order\nto realise the BO scheme outlined in Section 2.1, we need to specify (a) a kernel \u03ba for neural\narchitectures and (b) a method to optimise the acquisition \u03d5t over these architectures. Due to space\nconstraints, we will only describe the key ideas and defer all details to Appendix B.\nAs described previously, we will use a negative exponentiated distance for \u03ba. Precisely,\n$\\kappa = \\alpha e^{-\\beta d} + \\bar{\\alpha} e^{-\\bar{\\beta} \\bar{d}}$, where d, \u00afd are the OTMANN distance and its normalised version. 
We mention\nthat while this has the form of popular kernels, we do not know yet if it is in fact a kernel. In our\nexperiments, we did not encounter an instance where the eigenvalues of the kernel matrix were\nnegative. In any case, there are several methods to circumvent this issue in kernel methods [42].\n\nFigure 2: Cross validation results: In all figures, the x axis is time. The y axis is the mean squared error\n(MSE) in the first 6 figures and the classification error in the last. Lower is better in all cases. The title of each\nfigure states the dataset and the number of parallel workers (GPUs). All figures were averaged over at least 5\nindependent runs of each method. Error bars indicate one standard error.\n\nWe use an evolutionary algorithm (EA) approach to optimise the acquisition function (2). For this,\nwe begin with an initial pool of networks and evaluate the acquisition \u03d5t on those networks. Then\nwe generate a set of Nmut mutations of this pool as follows. First, we stochastically select Nmut\ncandidates from the set of networks already evaluated such that those with higher \u03d5t values are more\nlikely to be selected than those with lower values. Then we modify each candidate to produce a\nnew architecture. These modifications, described in Table 2, might change the architecture either by\nincreasing or decreasing the number of computational units in a layer, by adding or deleting layers,\nor by changing the connectivity of existing layers. Finally, we evaluate the acquisition on these Nmut\nmutations, add them to the initial pool, and repeat for the prescribed number of steps. While EA works\nfine for cheap functions, such as the acquisition \u03d5t which is analytically available, it is not suitable\nwhen evaluations are expensive, such as training a neural network. 
This is because EA selects points\nfor future evaluations that are already close to points that have been evaluated, and is hence inefficient\nat exploring the space. In our experiments, we compare NASBOT to the same EA scheme used to\noptimise the acquisition and demonstrate that the former outperforms the latter.\nWe conclude this section by observing that this framework for NASBOT/OTMANN has additional\nflexibility beyond what has been described. If one wishes to tune over drop-out probabilities, regularisation\npenalties and batch normalisation at each layer, they can be treated as part of the layer label, via an\naugmented label penalty matrix M which accounts for these considerations. If one wishes to jointly\ntune other scalar hyper-parameters (e.g. learning rate), one can use an existing kernel for Euclidean\nspaces and define the GP over the joint architecture + hyper-parameter space via a product kernel.\nBO methods for early stopping in iterative training procedures [17\u201320, 22] can be easily incorporated\nby defining a fidelity space. Using a line of work in scalable GPs [39, 50], one can apply our methods\nto challenging problems which might require trying a very large number (\u223c100K) of architectures.\nThese extensions will enable deploying NASBOT in large scale settings, but are tangential to our\ngoal of introducing a BO method for architecture search.\n\n5 Experiments\nMethods: We compare NASBOT to the following baselines. RAND: random search. EA (Evolutionary\nalgorithm): the same EA procedure described above. TreeBO [15]: a BO method which only\nsearches over feed forward structures. Random search is a natural baseline against which to compare optimisation\nmethods. However, unlike in Euclidean spaces, there is no natural way to randomly explore the space\nof architectures. 
Our RAND implementation operates in exactly the same way as NASBOT, except\nthat the EA procedure is fed a random sample from Unif(0, 1) instead of the GP acquisition each\ntime it evaluates an architecture. Hence, RAND is effectively picking a random network from the\nsame space explored by NASBOT; neither method gains an unfair advantage from considering a\ndifferent space. While there are other methods for architecture search, their implementations are\nhighly nontrivial and are not made available.\nDatasets: We use the following datasets: blog feedback [4], indoor location [46], slice localisation [11],\nnaval propulsion [5], protein tertiary structure [34], news popularity [7], Cifar10 [24]. The\nfirst six are regression problems for which we use MLPs. The last is a classification task on images\nfor which we use CNNs. Table 3 gives the size and dimensionality of each dataset. For the first 6\ndatasets, we use a 0.6/0.2/0.2 train-validation-test split and normalise the input and output to\nhave zero mean and unit variance. Hence, a constant predictor will have a mean squared error of\napproximately 1. 
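To see why a constant predictor has an MSE of about 1 after this normalisation, here is a small numerical sketch. It assumes, as one plausible reading, that the normalisation statistics are computed on the training portion of the 0.6/0.2/0.2 split. The synthetic targets and all variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=3.0, size=10_000)  # synthetic regression targets

# 0.6 / 0.2 / 0.2 train-validation-test split.
n_train = int(0.6 * len(y))
n_val = int(0.2 * len(y))
y_train, y_val, y_test = np.split(y, [n_train, n_train + n_val])

# Normalisation statistics taken from the training portion (an assumption).
mu = y_train.mean()
sigma = y_train.std()

def normalise(v):
    return (v - mu) / sigma

# The best constant predictor on normalised training outputs is their mean, 0.
const_pred = 0.0
mse_train = float(np.mean((normalise(y_train) - const_pred) ** 2))
mse_test = float(np.mean((normalise(y_test) - const_pred) ** 2))
print(mse_train, mse_test)
```

On the training portion the MSE equals 1 up to floating point error, since it is the variance of the normalised outputs. On held-out data it is only approximately 1.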
For Cifar10 we use 40K for training and 10K each for validation and testing.\n\n[Figure 2 plot panels: Blog Feedback (#workers=2), Indoor Location (#workers=2), Slice Localisation (#workers=2), Naval Propulsion (#workers=2), Protein (#workers=2), News (#workers=4), Cifar10 (#workers=4). x axis: time (hours). y axis: cross validation MSE (cross validation error for Cifar10). Methods: EA, RAND, TreeBO, NASBOT.]\n\nMethod | Blog (60K, 281) | Indoor (21K, 529) | Slice (54K, 385) | Naval (12K, 17) | Protein (46K, 9) | News (40K, 61) | Cifar10 (60K, 3K) | Cifar10, 150K iters\nRAND | 0.780\u00b10.034 | 0.115\u00b10.023 | 0.758\u00b10.041 | 0.0103\u00b10.002 | 0.948\u00b10.024 | 0.762\u00b10.013 | 0.1342\u00b10.002 | 0.0914\u00b10.008\nEA | 0.806\u00b10.040 | 0.147\u00b10.010 | 0.733\u00b10.041 | 0.0079\u00b10.004 | 1.010\u00b10.038 | 0.758\u00b10.038 | 0.1411\u00b10.002 | 0.0915\u00b10.010\nTreeBO | 0.928\u00b10.053 | 0.168\u00b10.023 | 0.759\u00b10.079 | 0.0102\u00b10.002 | 0.998\u00b10.007 | 0.866\u00b10.085 | 0.1533\u00b10.004 | 0.1121\u00b10.004\nNASBOT | 0.731\u00b10.029 | 0.117\u00b10.008 | 0.615\u00b10.044 | 0.0075\u00b10.002 | 0.902\u00b10.033 | 0.752\u00b10.024 | 0.1209\u00b10.003 | 0.0869\u00b10.004\n\nTable 3: The first row gives the number of samples N and the dimensionality D of each dataset in the form\n(N, D). The subsequent rows show the regression MSE or classification error (lower is better) on the test set\nfor each method. The last column is for Cifar10 where we took the best models found by each method in 24K\niterations and trained them for 120K iterations. When we trained the VGG-19 architecture using our training\nprocedure, we got test errors 0.1718 (60K iterations) and 0.1018 (150K iterations).\n\nExperimental Set up: Each method is executed in an asynchronously parallel set up of 2-4 GPUs.\nThat is, it can evaluate multiple models in parallel, with each model on a single GPU. When the\nevaluation of one model finishes, the methods can incorporate the result and immediately re-deploy\nthe next job without waiting for the others to finish. For the blog, indoor, slice, naval and protein\ndatasets we use 2 GeForce GTX 970 (4GB) GPUs and a computational budget of 8 hours for each\nmethod. For the news popularity dataset we use 4 GeForce GTX 980 (6GB) GPUs with a budget of\n6 hours and for Cifar10 we use 4 K80 (12GB) GPUs with a budget of 10 hours. For the regression\ndatasets, we train each model with stochastic gradient descent (SGD) with a fixed step size of 10^{\u22125} and a\nbatch size of 256 for 20K batch iterations. For Cifar10, we start with a step size of 10^{\u22122} and reduce\nit gradually. We train in batches of 32 images for 60K batch iterations. The methods evaluate between\n70-120 networks depending on the size of the networks chosen and the number of GPUs.\n\nResults: Fig. 2 plots the best validation score for each method against time. In Table 3, we present\nthe results on the test set with the best model chosen on the basis of validation set performance. On\nthe Cifar10 dataset, we also trained the best models for longer (150K iterations). These results are in\nthe last column of Table 3. We see that NASBOT is the most consistent of all methods. The average\ntime taken by NASBOT to determine the next architecture to evaluate was 46.13s. 
For RAND, EA,\nand TreeBO this was 26.43s, 0.19s, and 7.83s respectively. The time taken to train and validate\nmodels was on the order of 10-40 minutes depending on the model size. Fig. 2 includes this time\ntaken to determine the next point. As with many BO algorithms, NASBOT\u2019s selection criterion is\ntime consuming, but it pays off when evaluations are expensive. In Appendices B and C, we provide\nadditional details on the experiment set up and conduct synthetic ablation studies by holding out\ndifferent components of the NASBOT framework. We also illustrate some of the best architectures\nfound; on many datasets, common features were long skip connections and multiple decision layers.\n\nFinally, we note that while our Cifar10 experiments fall short of the current state of the art [25, 26, 53],\nthe amount of computation in these works is several orders of magnitude more than ours (both the\ncomputation invested to train a single model and the number of models trained). Further, they use\nconstrained spaces specialised for CNNs, while NASBOT is deployed in a very general model space.\nWe believe that our results can also be improved by employing enhanced training techniques such as\nimage whitening, image flipping, drop out, etc. For example, using our training procedure on the\nVGG-19 architecture [37] yielded a test set error of 0.1018 after 150K iterations. However, VGG-19\nis known to do significantly better on Cifar10. That said, we believe our results are encouraging and\nlay out the premise for BO for neural architectures.\n\n6 Conclusion\nWe described NASBOT, a BO framework for neural architecture search. NASBOT finds better\narchitectures for MLPs and CNNs more efficiently than other baselines on several datasets. A\nkey contribution of this work is the efficiently computable OTMANN distance for neural network\narchitectures, which may be of independent interest as it might find applications outside of BO. 
Our\ncode for NASBOT and OTMANN will be made available.\n\nAcknowledgements\n\nWe would like to thank Guru Guruganesh and Dougal Sutherland for the insightful discussions. This\nresearch is partly funded by DOE grant DESC0011114, NSF grant IIS1563887, and the DARPA D3M\nprogram. KK is supported by a Facebook fellowship and a Siebel scholarship.\n\nReferences\n[1] Bowen Baker, Otkrist Gupta, Nikhil Naik, and Ramesh Raskar. Designing neural network architectures\n\nusing reinforcement learning. arXiv preprint arXiv:1611.02167, 2016.\n\n[2] James Bergstra, Daniel Yamins, and David Daniel Cox. Making a science of model search: Hyperparameter\n\noptimization in hundreds of dimensions for vision architectures. 2013.\n\n[3] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A Tutorial on Bayesian Optimization of Expensive Cost\nFunctions, with Application to Active User Modeling and Hierarchical Reinforcement Learning. CoRR,\n2010.\n\n[4] Krisztian Buza. Feedback prediction for blogs. In Data analysis, machine learning and knowledge\n\ndiscovery, pages 145\u2013152. Springer, 2014.\n\n[5] Andrea Coraddu, Luca Oneto, Alessandro Ghio, Stefano Savio, Davide Anguita, and Massimo Figari.\nMachine learning approaches for improving condition-based maintenance of naval propulsion plants.\nProceedings of the Institution of Mechanical Engineers, Part M: Journal of Engineering for the Maritime\nEnvironment, 230(1):136\u2013153, 2016.\n\n[6] Corinna Cortes, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. Adanet: Adaptive\n\nstructural learning of artificial neural networks. arXiv preprint arXiv:1607.01097, 2016.\n\n[7] Kelwin Fernandes, Pedro Vinagre, and Paulo Cortez. A proactive intelligent decision support system for\n\npredicting the popularity of online news. In Portuguese Conference on Artificial Intelligence, 2015.\n\n[8] Dario Floreano, Peter D\u00fcrr, and Claudio Mattiussi. 
Neuroevolution: from architectures to learning.\n\nEvolutionary Intelligence, 1(1):47\u201362, 2008.\n\n[9] Xinbo Gao, Bing Xiao, Dacheng Tao, and Xuelong Li. A survey of graph edit distance. Pattern Analysis\n\nand applications, 13(1):113\u2013129, 2010.\n\n[10] David Ginsbourger, Janis Janusevskis, and Rodolphe Le Riche. Dealing with asynchronicity in parallel\n\ngaussian process based global optimization. In ERCIM, 2011.\n\n[11] Franz Graf, Hans-Peter Kriegel, Matthias Schubert, Sebastian P\u00f6lsterl, and Alexander Cavallaro. 2d image\nregistration in ct images using radial image descriptors. In International Conference on Medical Image\nComputing and Computer-Assisted Intervention, pages 607\u2013614. Springer, 2011.\n\n[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n[13] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolu-\n\ntional networks. In CVPR, 2017.\n\n[14] Frank Hutter, Holger H Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general\n\nalgorithm con\ufb01guration. In LION, 2011.\n\n[15] Rodolphe Jenatton, Cedric Archambeau, Javier Gonz\u00e1lez, and Matthias Seeger. Bayesian optimization\n\nwith tree-structured dependencies. In International Conference on Machine Learning, 2017.\n\n[16] Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E Schapire. Contextual\n\ndecision processes with low bellman rank are pac-learnable. arXiv preprint arXiv:1610.09512, 2016.\n\n[17] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnab\u00e1s P\u00f3czos. Gaussian\nprocess bandit optimisation with multi-\ufb01delity evaluations. 
In Advances in Neural Information Processing\nSystems, pages 992\u20131000, 2016.\n\n[18] Kirthevasan Kandasamy, Gautam Dasarathy, Junier B Oliva, Jeff Schneider, and Barnabas Poczos. Multi-\n\n\ufb01delity gaussian process bandit optimisation. arXiv preprint arXiv:1603.06288, 2016.\n\n[19] Kirthevasan Kandasamy, Gautam Dasarathy, Barnabas Poczos, and Jeff Schneider. The multi-\ufb01delity\n\nmulti-armed bandit. In Advances in Neural Information Processing Systems, pages 1777\u20131785, 2016.\n\n[20] Kirthevasan Kandasamy, Gautam Dasarathy, Jeff Schneider, and Barnabas Poczos. Multi-\ufb01delity Bayesian\n\nOptimisation with Continuous Approximations. arXiv preprint arXiv:1703.06240, 2017.\n\n[21] Hiroaki Kitano. Designing neural networks using genetic algorithms with graph generation system.\n\nComplex systems, 4(4):461\u2013476, 1990.\n\n[22] Aaron Klein, Stefan Falkner, Simon Bartels, Philipp Hennig, and Frank Hutter. Fast bayesian optimization\n\nof machine learning hyperparameters on large datasets. arXiv preprint arXiv:1605.07079, 2016.\n\n[23] Risi Imre Kondor and John Lafferty. Diffusion kernels on graphs and other discrete input spaces. In ICML,\n\nvolume 2, pages 315\u2013322, 2002.\n\n[24] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.\n[25] Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang,\n\nand Kevin Murphy. Progressive neural architecture search. arXiv preprint arXiv:1712.00559, 2017.\n\n[26] Hanxiao Liu, Karen Simonyan, Oriol Vinyals, Chrisantha Fernando, and Koray Kavukcuoglu. Hierarchical\n\nrepresentations for ef\ufb01cient architecture search. arXiv preprint arXiv:1711.00436, 2017.\n\n[27] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning\n\nresearch, 9(Nov):2579\u20132605, 2008.\n\n[28] Hector Mendoza, Aaron Klein, Matthias Feurer, Jost Tobias Springenberg, and Frank Hutter. 
Towards\nautomatically-tuned neural networks. In Workshop on Automatic Machine Learning, pages 58\u201365, 2016.\n[29] Bruno T Messmer and Horst Bunke. A new algorithm for error-tolerant subgraph isomorphism detection.\n\nIEEE Transactions on Pattern Analysis and Machine Intelligence, 20(5):493\u2013504, 1998.\n\n[30] Risto Miikkulainen, Jason Liang, Elliot Meyerson, Aditya Rawal, Dan Fink, Olivier Francon, Bala Raju,\nArshak Navruzyan, Nigel Duffy, and Babak Hodjat. Evolving deep neural networks. arXiv preprint\narXiv:1703.00548, 2017.\n\n[31] J.B. Mockus and L.J. Mockus. Bayesian approach to global optimization and application to multiobjective\n\nand constrained problems. Journal of Optimization Theory and Applications, 1991.\n\n[32] Renato Negrinho and Geoff Gordon. Deeparchitect: Automatically designing and training deep architec-\n\ntures. arXiv preprint arXiv:1704.08792, 2017.\n\n[33] Gabriel Peyr\u00e9 and Marco Cuturi. Computational Optimal Transport. Available online, 2017.\n[34] PS Rana. Physicochemical properties of protein tertiary structure data set, 2013.\n[35] C.E. Rasmussen and C.K.I. Williams. Gaussian Processes for Machine Learning. Adaptative computation\n\nand machine learning series. University Press Group Limited, 2006.\n\n[36] Esteban Real, Sherry Moore, Andrew Selle, Saurabh Saxena, Yutaka Leon Suematsu, Quoc Le, and Alex\n\nKurakin. Large-scale evolution of image classi\ufb01ers. arXiv preprint arXiv:1703.01041, 2017.\n\n[37] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. arXiv preprint arXiv:1409.1556, 2014.\n\n[38] Alexander J Smola and Risi Kondor. Kernels and regularization on graphs. In Learning theory and kernel\n\nmachines, pages 144\u2013158. Springer, 2003.\n\n[39] Edward Snelson and Zoubin Ghahramani. Sparse gaussian processes using pseudo-inputs. 
In Advances in\n\nneural information processing systems, pages 1257\u20131264, 2006.\n\n[40] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian Optimization of Machine Learning\n\nAlgorithms. In Advances in Neural Information Processing Systems, 2012.\n\n[41] Kenneth O Stanley and Risto Miikkulainen. Evolving neural networks through augmenting topologies.\n\nEvolutionary computation, 10(2):99\u2013127, 2002.\n\n[42] Dougal J Sutherland. Scalable, Active and Flexible Learning on Distributions. PhD thesis, Carnegie\n\nMellon University Pittsburgh, PA, 2015.\n\n[43] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press\n\nCambridge, 1998.\n\n[44] Kevin Swersky, David Duvenaud, Jasper Snoek, Frank Hutter, and Michael A Osborne. Raiders of the\nlost architecture: Kernels for bayesian optimization in conditional parameter spaces. arXiv preprint\narXiv:1409.4011, 2014.\n\n[45] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru\nErhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of\nthe IEEE conference on computer vision and pattern recognition, pages 1\u20139, 2015.\n\n[46] Joaqu\u00edn Torres-Sospedra, Ra\u00fal Montoliu, Adolfo Mart\u00ednez-Us\u00f3, Joan P Avariento, Tom\u00e1s J Arnau, Mauri\nBenedito-Bordonau, and Joaqu\u00edn Huerta. Ujiindoorloc: A new multi-building and multi-\ufb02oor database for\nwlan \ufb01ngerprint-based indoor localization problems. In Indoor Positioning and Indoor Navigation (IPIN),\n2014 International Conference on, pages 261\u2013270. IEEE, 2014.\n\n[47] C\u00e9dric Villani. Optimal transport: old and new, volume 338. Springer Science & Business Media, 2008.\n[48] S Vichy N Vishwanathan, Nicol N Schraudolph, Risi Kondor, and Karsten M Borgwardt. 
Graph kernels.\n\nJournal of Machine Learning Research, 11(Apr):1201\u20131242, 2010.\n\n[49] Walter D Wallis, Peter Shoubridge, M Kraetz, and D Ray. Graph distances using graph union. Pattern\n\nRecognition Letters, 22(6-7):701\u2013704, 2001.\n\n[50] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured gaussian processes\n\n(kiss-gp). In International Conference on Machine Learning, pages 1775\u20131784, 2015.\n\n[51] Lingxi Xie and Alan Yuille. Genetic cnn. arXiv preprint arXiv:1703.01513, 2017.\n[52] Zhao Zhong, Junjie Yan, and Cheng-Lin Liu. Practical network blocks design with q-learning. arXiv\n\npreprint arXiv:1708.05552, 2017.\n\n[53] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint\n\narXiv:1611.01578, 2016.\n\n[54] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le. Learning transferable architectures for\n\nscalable image recognition. arXiv preprint arXiv:1707.07012, 2017.", "award": [], "sourceid": 1012, "authors": [{"given_name": "Kirthevasan", "family_name": "Kandasamy", "institution": "Carnegie Mellon University"}, {"given_name": "Willie", "family_name": "Neiswanger", "institution": "Carnegie Mellon University"}, {"given_name": "Jeff", "family_name": "Schneider", "institution": "CMU"}, {"given_name": "Barnabas", "family_name": "Poczos", "institution": "Carnegie Mellon University"}, {"given_name": "Eric", "family_name": "Xing", "institution": "Petuum Inc. /  Carnegie Mellon University"}]}