{"title": "Practical Bayesian Optimization of Machine Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2951, "page_last": 2959, "abstract": "The use of machine learning algorithms frequently involves careful tuning of learning parameters and model hyperparameters. Unfortunately, this tuning is often a \u201cblack art\u201d requiring expert experience, rules of thumb, or sometimes brute-force search. There is therefore great appeal for automatic approaches that can optimize the performance of any given learning algorithm to the problem at hand. In this work, we consider this problem through the framework of Bayesian optimization, in which a learning algorithm\u2019s generalization performance is modeled as a sample from a Gaussian process (GP). We show that certain choices for the nature of the GP, such as the type of kernel and the treatment of its hyperparameters, can play a crucial role in obtaining a good optimizer that can achieve expert-level performance. We describe new algorithms that take into account the variable cost (duration) of learning algorithm experiments and that can leverage the presence of multiple cores for parallel experimentation. We show that these proposed algorithms improve on previous automatic procedures and can reach or surpass human expert-level optimization for many algorithms including Latent Dirichlet Allocation, Structured SVMs and convolutional neural networks.", "full_text": "Practical Bayesian Optimization of Machine\n\nLearning Algorithms\n\nJasper Snoek\n\nDepartment of Computer Science\n\nUniversity of Toronto\n\nHugo Larochelle\n\nDepartment of Computer Science\n\nUniversity of Sherbrooke\n\njasper@cs.toronto.edu\n\nhugo.larochelle@usherbrooke.edu\n\nRyan P. 
Adams\n\nSchool of Engineering and Applied Sciences\n\nHarvard University\n\nrpa@seas.harvard.edu\n\nAbstract\n\nThe use of machine learning algorithms frequently involves careful tuning of\nlearning parameters and model hyperparameters. Unfortunately, this tuning is of-\nten a \u201cblack art\u201d requiring expert experience, rules of thumb, or sometimes brute-\nforce search. There is therefore great appeal for automatic approaches that can\noptimize the performance of any given learning algorithm to the problem at hand.\nIn this work, we consider this problem through the framework of Bayesian opti-\nmization, in which a learning algorithm\u2019s generalization performance is modeled\nas a sample from a Gaussian process (GP). We show that certain choices for the\nnature of the GP, such as the type of kernel and the treatment of its hyperparame-\nters, can play a crucial role in obtaining a good optimizer that can achieve expert-\nlevel performance. We describe new algorithms that take into account the variable\ncost (duration) of learning algorithm experiments and that can leverage the pres-\nence of multiple cores for parallel experimentation. We show that these proposed\nalgorithms improve on previous automatic procedures and can reach or surpass\nhuman expert-level optimization for many algorithms including latent Dirichlet\nallocation, structured SVMs and convolutional neural networks.\n\nIntroduction\n\n1\nMachine learning algorithms are rarely parameter-free: parameters controlling the rate of learning\nor the capacity of the underlying model must often be speci\ufb01ed. These parameters are often con-\nsidered nuisances, making it appealing to develop machine learning algorithms with fewer of them.\nAnother, more \ufb02exible take on this issue is to view the optimization of such parameters as a proce-\ndure to be automated. 
Speci\ufb01cally, we could view such tuning as the optimization of an unknown\nblack-box function and invoke algorithms developed for such problems. A good choice is Bayesian\noptimization [1], which has been shown to outperform other state of the art global optimization\nalgorithms on a number of challenging optimization benchmark functions [2]. For continuous func-\ntions, Bayesian optimization typically works by assuming the unknown function was sampled from\na Gaussian process and maintains a posterior distribution for this function as observations are made\nor, in our case, as the results of running learning algorithm experiments with different hyperpa-\nrameters are observed. To pick the hyperparameters of the next experiment, one can optimize the\nexpected improvement (EI) [1] over the current best result or the Gaussian process upper con\ufb01dence\nbound (UCB)[3]. EI and UCB have been shown to be ef\ufb01cient in the number of function evaluations\nrequired to \ufb01nd the global optimum of many multimodal black-box functions [4, 3].\n\n1\n\n\fMachine learning algorithms, however, have certain characteristics that distinguish them from other\nblack-box optimization problems. First, each function evaluation can require a variable amount of\ntime: training a small neural network with 10 hidden units will take less time than a bigger net-\nwork with 1000 hidden units. Even without considering duration, the advent of cloud computing\nmakes it possible to quantify economically the cost of requiring large-memory machines for learn-\ning, changing the actual cost in dollars of an experiment with a different number of hidden units.\nSecond, machine learning experiments are often run in parallel, on multiple cores or machines. 
In\nboth situations, the standard sequential approach of GP optimization can be suboptimal.\nIn this work, we identify good practices for Bayesian optimization of machine learning algorithms.\nWe argue that a fully Bayesian treatment of the underlying GP kernel is preferred to the approach\nbased on optimization of the GP hyperparameters, as previously proposed [5]. Our second contri-\nbution is the description of new algorithms for taking into account the variable and unknown cost of\nexperiments or the availability of multiple cores to run experiments in parallel.\nGaussian processes have proven to be useful surrogate models for computer experiments and good\npractices have been established in this context for sensitivity analysis, calibration and prediction [6].\nWhile these strategies are not considered in the context of optimization, they can be useful to re-\nsearchers in machine learning who wish to understand better the sensitivity of their models to various\nhyperparameters. Hutter et al. [7] have developed sequential model-based optimization strategies for\nthe con\ufb01guration of satis\ufb01ability and mixed integer programming solvers using random forests. The\nmachine learning algorithms we consider, however, warrant a fully Bayesian treatment as their ex-\npensive nature necessitates minimizing the number of evaluations. Bayesian optimization strategies\nhave also been used to tune the parameters of Markov chain Monte Carlo algorithms [8]. Recently,\nBergstra et al. [5] have explored various strategies for optimizing the hyperparameters of machine\nlearning algorithms. 
They demonstrated that grid search strategies are inferior to random search [9],\nand suggested the use of Gaussian process Bayesian optimization, optimizing the hyperparameters\nof a squared-exponential covariance, and proposed the Tree Parzen Algorithm.\n2 Bayesian Optimization with Gaussian Process Priors\nAs in other kinds of optimization, in Bayesian optimization we are interested in \ufb01nding the mini-\nmum of a function f (x) on some bounded set X , which we will take to be a subset of RD. What\nmakes Bayesian optimization different from other procedures is that it constructs a probabilistic\nmodel for f (x) and then exploits this model to make decisions about where in X to next evaluate\nthe function, while integrating out uncertainty. The essential philosophy is to use all of the informa-\ntion available from previous evaluations of f (x) and not simply rely on local gradient and Hessian\napproximations. This results in a procedure that can \ufb01nd the minimum of dif\ufb01cult non-convex func-\ntions with relatively few evaluations, at the cost of performing more computation to determine the\nnext point to try. When evaluations of f (x) are expensive to perform \u2014 as is the case when it\nrequires training a machine learning algorithm \u2014 then it is easy to justify some extra computation\nto make better decisions. For an overview of the Bayesian optimization formalism and a review of\nprevious work, see, e.g., Brochu et al. [10]. In this section we brie\ufb02y review the general Bayesian\noptimization approach, before discussing our novel contributions in Section 3.\nThere are two major choices that must be made when performing Bayesian optimization. First, one\nmust select a prior over functions that will express assumptions about the function being optimized.\nFor this we choose the Gaussian process prior, due to its \ufb02exibility and tractability. 
Second, we\nmust choose an acquisition function, which is used to construct a utility function from the model\nposterior, allowing us to determine the next point to evaluate.\n2.1 Gaussian Processes\nThe Gaussian process (GP) is a convenient and powerful prior distribution on functions, which we\nwill take here to be of the form f : X \u2192 R. The GP is defined by the property that any finite set of N\npoints $\\{x_n \\in \\mathcal{X}\\}_{n=1}^N$ induces a multivariate Gaussian distribution on $\\mathbb{R}^N$. The nth of these points\nis taken to be the function value $f(x_n)$, and the elegant marginalization properties of the Gaussian\ndistribution allow us to compute marginals and conditionals in closed form. The support and prop-\nerties of the resulting distribution on functions are determined by a mean function m : X \u2192 R and\na positive definite covariance function K : X \u00d7 X \u2192 R. We will discuss the impact of covariance\nfunctions in Section 3.1. For an overview of Gaussian processes, see Rasmussen and Williams [11].\n2.2 Acquisition Functions for Bayesian Optimization\nWe assume that the function f (x) is drawn from a Gaussian process prior and that our observa-\ntions are of the form $\\{x_n, y_n\\}_{n=1}^N$, where $y_n \\sim \\mathcal{N}(f(x_n), \\nu)$ and \u03bd is the variance of noise intro-\nduced into the function observations. This prior and these data induce a posterior over functions;\nthe acquisition function, which we denote by $a : \\mathcal{X} \\to \\mathbb{R}^+$, determines what point in X should be\nevaluated next via a proxy optimization $x_{\\mathrm{next}} = \\arg\\max_x a(x)$, where several different functions\nhave been proposed. In general, these acquisition functions depend on the previous observations,\nas well as the GP hyperparameters; we denote this dependence as $a(x ; \\{x_n, y_n\\}, \\theta)$. There are\nseveral popular choices of acquisition function. 
Under the Gaussian process prior, these functions\ndepend on the model solely through its predictive mean function $\\mu(x ; \\{x_n, y_n\\}, \\theta)$ and predictive\nvariance function $\\sigma^2(x ; \\{x_n, y_n\\}, \\theta)$.\nIn what follows, we will denote the best current value\nas $x_{\\mathrm{best}} = \\arg\\min_{x_n} f(x_n)$ and the cumulative distribution function of the standard normal as \u03a6(\u00b7).\nProbability of Improvement. One intuitive strategy is to maximize the probability of improving\nover the best current value [12]. Under the GP this can be computed analytically as\n\n$$a_{\\mathrm{PI}}(x ; \\{x_n, y_n\\}, \\theta) = \\Phi(\\gamma(x)), \\qquad \\gamma(x) = \\frac{f(x_{\\mathrm{best}}) - \\mu(x ; \\{x_n, y_n\\}, \\theta)}{\\sigma(x ; \\{x_n, y_n\\}, \\theta)}. \\tag{1}$$\n\nExpected Improvement. Alternatively, one could choose to maximize the expected improvement\n(EI) over the current best. This also has closed form under the Gaussian process:\n\n$$a_{\\mathrm{EI}}(x ; \\{x_n, y_n\\}, \\theta) = \\sigma(x ; \\{x_n, y_n\\}, \\theta) \\left( \\gamma(x)\\, \\Phi(\\gamma(x)) + \\mathcal{N}(\\gamma(x) ; 0, 1) \\right) \\tag{2}$$\n\nGP Upper Confidence Bound. A more recent development is the idea of exploiting lower confi-\ndence bounds (upper, when considering maximization) to construct acquisition functions that mini-\nmize regret over the course of their optimization [3]. These acquisition functions have the form\n\n$$a_{\\mathrm{LCB}}(x ; \\{x_n, y_n\\}, \\theta) = \\mu(x ; \\{x_n, y_n\\}, \\theta) - \\kappa\\, \\sigma(x ; \\{x_n, y_n\\}, \\theta), \\tag{3}$$\n\nwith a tunable \u03ba to balance exploitation against exploration.\nIn this work we will focus on the EI criterion, as it has been shown to be better-behaved than\nprobability of improvement, but unlike the method of GP upper confidence bounds (GP-UCB), it\ndoes not require its own tuning parameter. 
Although the EI algorithm performs well in minimization\nproblems, we wish to note that the regret formalization may be more appropriate in some settings.\nWe perform a direct comparison between our EI-based approach and GP-UCB in Section 4.1.\n3 Practical Considerations for Bayesian Optimization of Hyperparameters\nAlthough Bayesian optimization is an elegant framework for optimizing expensive functions, several limitations\nhave prevented it from becoming a widely-used technique for optimizing hyperparameters in\nmachine learning problems. First, it is unclear for practical problems what an appropriate choice is\nfor the covariance function and its associated hyperparameters. Second, as the function evaluation\nitself may involve a time-consuming optimization procedure, problems may vary significantly in\nduration and this should be taken into account. Third, optimization algorithms should take advantage\nof multi-core parallelism in order to map well onto modern computational environments. In this\nsection, we propose solutions to each of these issues.\n3.1 Covariance Functions and Treatment of Covariance Hyperparameters\nThe power of the Gaussian process to express a rich distribution on functions rests solely on the\nshoulders of the covariance function. While non-degenerate covariance functions correspond to\ninfinite bases, they nevertheless can correspond to strong assumptions regarding likely functions. In\nparticular, the automatic relevance determination (ARD) squared exponential kernel\n\n$$K_{\\mathrm{SE}}(x, x') = \\theta_0 \\exp\\left\\{ -\\frac{1}{2} r^2(x, x') \\right\\}, \\qquad r^2(x, x') = \\sum_{d=1}^{D} (x_d - x'_d)^2 / \\theta_d^2 \\tag{4}$$\n\nis often a default choice for Gaussian process regression. However, sample functions with this co-\nvariance function are unrealistically smooth for practical optimization problems. 
We instead propose\nthe use of the ARD Mat\u00e9rn 5/2 kernel:\n\n$$K_{\\mathrm{M52}}(x, x') = \\theta_0 \\left( 1 + \\sqrt{5 r^2(x, x')} + \\frac{5}{3} r^2(x, x') \\right) \\exp\\left\\{ -\\sqrt{5 r^2(x, x')} \\right\\}. \\tag{5}$$\n\nThis covariance function results in sample functions which are twice-differentiable, an assumption\nthat corresponds to those made by, e.g., quasi-Newton methods, but without requiring the smooth-\nness of the squared exponential.\n\nFigure 1: Illustration of integrated expected improvement. (a) Posterior samples under varying hyperparameters: three posterior samples are shown, each\nwith different length scales, after the same five observations. (b) Expected improvement under varying hyperparameters: three expected improvement acquisition\nfunctions, with the same data and hyperparameters. The maximum of each is shown. (c) The integrated\nexpected improvement, with its maximum shown.\n\nFigure 2: Illustration of the acquisition with pending evaluations. (a) Posterior samples after three data: three data have been observed\nand three posterior functions are shown, with \u201cfantasies\u201d for three pending evaluations. (b) Expected improvement under three fantasies: expected im-\nprovement, conditioned on each joint fantasy of the\npending outcome. (c) Expected improvement across fantasies: expected improvement after in-\ntegrating over the fantasy outcomes.\n\nAfter choosing the form of the covariance, we must also manage the hyperparameters that govern its\nbehavior, as well as that of the mean function (note that these \u201chyperparameters\u201d are distinct from those being subjected to the overall\nBayesian optimization). 
For our problems of interest, typically\nwe would have D + 3 Gaussian process hyperparameters: D length scales $\\theta_{1:D}$, the covariance\namplitude $\\theta_0$, the observation noise \u03bd, and a constant mean m. The most commonly advocated ap-\nproach is to use a point estimate of these parameters by optimizing the marginal likelihood under the\nGaussian process, $p(y \\mid \\{x_n\\}_{n=1}^N, \\theta, \\nu, m) = \\mathcal{N}(y \\mid m\\mathbf{1}, \\Sigma_\\theta + \\nu I)$, where $y = [y_1, y_2, \\cdots, y_N]^{\\mathsf{T}}$,\nand $\\Sigma_\\theta$ is the covariance matrix resulting from the N input points under the hyperparameters \u03b8.\nHowever, for a fully-Bayesian treatment of hyperparameters (summarized here by \u03b8 alone), it is\ndesirable to marginalize over hyperparameters and compute the integrated acquisition function:\n\n$$\\hat{a}(x ; \\{x_n, y_n\\}) = \\int a(x ; \\{x_n, y_n\\}, \\theta)\\, p(\\theta \\mid \\{x_n, y_n\\}_{n=1}^N)\\, d\\theta, \\tag{6}$$\n\nwhere a(x) depends on \u03b8 and all of the observations. For probability of improvement and EI, this\nexpectation is the correct generalization to account for uncertainty in hyperparameters. We can\ntherefore blend acquisition functions arising from samples from the posterior over GP hyperparam-\neters and have a Monte Carlo estimate of the integrated expected improvement. These samples can\nbe acquired efficiently using slice sampling, as described in Murray and Adams [13]. As both opti-\nmization and Markov chain Monte Carlo are computationally dominated by the cubic cost of solving\nan N-dimensional linear system (and our function evaluations are assumed to be much more expen-\nsive anyway), the fully-Bayesian treatment is sensible and our empirical evaluations bear this out.\nFigure 1 shows how the integrated expected improvement changes the acquisition function.\n3.2 Modeling Costs\nUltimately, the objective of Bayesian optimization is to find a good setting of our hyperparameters\nas quickly as possible. 
Greedy acquisition procedures such as expected improvement try to make\nthe best progress possible in the next function evaluation. From a practical point of view, however,\nwe are not so concerned with function evaluations as with wallclock time. Different regions of\nthe parameter space may result in vastly different execution times, due to varying regularization,\nlearning rates, etc. To improve our performance in terms of wallclock time, we propose optimizing\nwith the expected improvement per second, which prefers to acquire points that are not only likely\nto be good, but that are also likely to be evaluated quickly. This notion of cost can be naturally\ngeneralized to other budgeted resources, such as reagents or money.\nJust as we do not know the true objective function f (x), we also do not know the duration func-\ntion $c(x) : \\mathcal{X} \\to \\mathbb{R}^+$. We can nevertheless employ our Gaussian process machinery to model ln c(x)\nalongside f (x). In this work, we assume that these functions are independent of each other, although\ntheir coupling may be usefully captured using GP variants of multi-task learning (e.g., [14, 15]).\nUnder the independence assumption, we can easily compute the predicted expected inverse duration\nand use it to compute the expected improvement per second as a function of x.\n3.3 Monte Carlo Acquisition for Parallelizing Bayesian Optimization\nWith the advent of multi-core computing, it is natural to ask how we can parallelize our Bayesian\noptimization procedures. More generally than simply batch parallelism, however, we would like to\nbe able to decide what x should be evaluated next, even while a set of points are being evaluated.\nClearly, we cannot use the same acquisition function again, or we will repeat one of the pending\nexperiments. 
Ideally, we could perform a roll-out of our acquisition policy, to choose a point that\nappropriately balanced information gain and exploitation. However, such roll-outs are generally\nintractable. Instead we propose a sequential strategy that takes advantage of the tractable inference\nproperties of the Gaussian process to compute Monte Carlo estimates of the acquisition function\nunder different possible results from pending function evaluations.\nConsider the situation in which N evaluations have completed, yielding data $\\{x_n, y_n\\}_{n=1}^N$, and in\nwhich J evaluations are pending at locations $\\{x_j\\}_{j=1}^J$. Ideally, we would choose a new point based\non the expected acquisition function under all possible outcomes of these pending evaluations:\n\n$$\\hat{a}(x ; \\{x_n, y_n\\}, \\theta, \\{x_j\\}) = \\int_{\\mathbb{R}^J} a(x ; \\{x_n, y_n\\}, \\theta, \\{x_j, y_j\\})\\, p(\\{y_j\\}_{j=1}^J \\mid \\{x_j\\}_{j=1}^J, \\{x_n, y_n\\}_{n=1}^N)\\, dy_1 \\cdots dy_J. \\tag{7}$$\n\nThis is simply the expectation of a(x) under a J-dimensional Gaussian distribution, whose mean and\ncovariance can easily be computed. As in the covariance hyperparameter case, it is straightforward\nto use samples from this distribution to compute the expected acquisition and use this to select the\nnext point. Figure 2 shows how this procedure would operate with queued evaluations. We note that\na similar approach is touched upon briefly by Ginsbourger and Riche [16], but they view it as too\nintractable to warrant attention. 
We have found our Monte Carlo estimation procedure to be highly\neffective in practice, however, as will be discussed in Section 4.\n4 Empirical Analyses\nIn this section, we empirically analyse the algorithms introduced in this paper and compare to ex-\nisting strategies and human performance on a number of challenging machine learning problems\n(all experiments were conducted on identical machines using the Amazon EC2 service).\nWe refer to our method of expected improvement while marginalizing GP hyperparameters as \u201cGP\nEI MCMC\u201d, optimizing hyperparameters as \u201cGP EI Opt\u201d, EI per second as \u201cGP EI per Second\u201d, and\nN times parallelized GP EI MCMC as \u201cNx GP EI MCMC\u201d. Each results figure plots the progres-\nsion of $\\min_{x_n} f(x_n)$ over the number of function evaluations or time, averaged over multiple runs\nof each algorithm. If not specified otherwise, $x_{\\mathrm{next}} = \\arg\\max_x a(x)$ is computed using gradient-\nbased search with multiple restarts (see supplementary material for details). The code used is made\npublicly available at http://www.cs.toronto.edu/~jasper/software.html.\n4.1 Branin-Hoo and Logistic Regression\nWe first compare to standard approaches and the recent Tree Parzen Algorithm (TPA) of Bergstra\net al. [5] (using the publicly available code from https://github.com/jaberg/hyperopt/wiki) on two standard problems. 
The Branin-Hoo function is a common benchmark for Bayesian\noptimization techniques [2] that is defined over $x \\in \\mathbb{R}^2$ where $0 \\le x_1 \\le 15$ and $-5 \\le x_2 \\le 15$. We\nalso compare to TPA on a logistic regression classification task on the popular MNIST data. The\nalgorithm requires choosing four hyperparameters: the learning rate for stochastic gradient descent,\non a log scale from 0 to 1; the $\\ell_2$ regularization parameter, between 0 and 1; the mini batch size,\nfrom 20 to 2000; and the number of learning epochs, from 5 to 2000. Each algorithm was run on the\nBranin-Hoo and logistic regression problems 100 and 10 times respectively and mean and standard\nerror are reported. The results of these analyses are presented in Figures 3a and 3b in terms of\nthe number of times the function is evaluated. On Branin-Hoo, integrating over hyperparameters is\nsuperior to using a point estimate and the GP EI significantly outperforms TPA, finding the minimum\nin less than half as many evaluations, in both cases. For logistic regression, 3b and 3c show that\nalthough EI per second is less efficient in function evaluations it outperforms standard EI in time.\n\nFigure 3: Comparisons on the Branin-Hoo function (3a) and training logistic regression on MNIST (3b). (3c)\nshows GP EI MCMC and GP EI per Second from (3b), but in terms of time elapsed.\n\nFigure 4: Different strategies of optimization on the Online LDA problem compared in terms of function\nevaluations (4a), walltime (4b) and constrained to a grid or not (4c).\n\n4.2 Online LDA\nLatent Dirichlet Allocation (LDA) is a directed graphical model for documents in which words\nare generated from a mixture of multinomial \u201ctopic\u201d distributions. Variational Bayes is a popular\nparadigm for learning and, recently, Hoffman et al. [17] proposed an online learning approach in\nthat context. 
Online LDA requires 2 learning parameters, $\\tau_0$ and \u03ba, that control the learning rate\n$\\rho_t = (\\tau_0 + t)^{-\\kappa}$ used to update the variational parameters of LDA based on the tth minibatch of\ndocument word count vectors. The size of the minibatch is a third parameter that must be\nchosen. Hoffman et al. [17] relied on an exhaustive grid search of size 6 \u00d7 6 \u00d7 8, for a total of 288\nhyperparameter configurations.\nWe used the code made publicly available by Hoffman et al. [17] to run experiments with online\nLDA on a collection of Wikipedia articles. We downloaded a random set of 249 560 articles, split\ninto training, validation and test sets of size 200 000, 24 560 and 25 000 respectively. The documents\nare represented as vectors of word counts from a vocabulary of 7702 words. As reported in Hoffman\net al. [17], we used a lower bound on the per word perplexity of the validation set documents as the\nperformance measure. One must also specify the number of topics and the hyperparameters \u03b7 for\nthe symmetric Dirichlet prior over the topic distributions and \u03b1 for the symmetric Dirichlet prior\nover the per document topic mixing weights. We followed Hoffman et al. [17] and used 100 topics\nand \u03b7 = \u03b1 = 0.01 in our experiments in order to emulate their analysis and repeated exactly the grid\nsearch reported in the paper3. Each online LDA evaluation generally took between five and ten hours\nto converge, thus the grid search requires approximately 60 to 120 processor days to complete.\n\n3 I.e., the only difference was the randomly sampled collection of articles in the data set and the choice of the\nvocabulary. 
We ran each evaluation for 10 hours or until convergence.\n\n[Plots for Figures 3 and 4 omitted: each shows the minimum function value against function evaluations or elapsed time for the compared strategies.]\n\nFigure 5: A comparison of various strategies for optimizing the hyperparameters of M3E models on the protein\nmotif finding task in terms of walltime (5a), function evaluations (5b) and different covariance functions (5c).\n\nIn Figures 4a and 4b we compare our various strategies of optimization over the same grid on this\nexpensive problem. That is, the algorithms were restricted to only the exact parameter settings as\nevaluated by the grid search. Each optimization was then repeated 100 times (each time picking two\ndifferent random experiments to initialize the optimization with) and the mean and standard error\nare reported4. Figure 4c also presents a 5 run average of optimization with 3 and 5 times parallelized\nGP EI MCMC, but without restricting the new parameter setting to be on the pre-specified grid (see\nsupplementary material for details). A comparison with their \u201con grid\u201d versions is illustrated.\nClearly integrating over hyperparameters is superior to using a point estimate in this case. 
While\nGP EI MCMC is the most ef\ufb01cient in terms of function evaluations, we see that parallelized GP EI\nMCMC \ufb01nds the best parameters in signi\ufb01cantly less time. Finally, in Figure 4c we see that the\nparallelized GP EI MCMC algorithms \ufb01nd a signi\ufb01cantly better minimum value than was found in\nthe grid search used by Hoffman et al. [17] while running a fraction of the number of experiments.\n4.3 Motif Finding with Structured Support Vector Machines\nIn this example, we consider optimizing the learning parameters of Max-Margin Min-Entropy\n(M3E) Models [18], which include Latent Structured Support Vector Machines [19] as a special\ncase. Latent structured SVMs outperform SVMs on problems where they can explicitly model\nproblem-dependent hidden variables. A popular example task is the binary classi\ufb01cation of pro-\ntein DNA sequences [18, 20, 19]. The hidden variable to be modeled is the unknown location of\nparticular subsequences, or motifs, that are indicators of positive sequences.\nSetting the hyperparameters, such as the regularisation term, C, of structured SVMs remains a chal-\nlenge and these are typically set through a time consuming grid search procedure as is done in\n[18, 19]. Indeed, Kumar et al. [20] avoided hyperparameter selection for this task as it was too\ncomputationally expensive. However, Miller et al. [18] demonstrate that results depend highly on\nthe setting of the parameters, which differ for each protein. M3E models introduce an entropy term,\nparameterized by \u03b1, which enables the model to outperform latent structured SVMs. This additional\nperformance, however, comes at the expense of an additional problem-dependent hyperparameter.\nWe emulate the experiments of Miller et al. [18] for one protein with approximately 40 000 se-\nquences. 
We explore 25 settings of the parameter C, on a log scale from $10^{-1}$ to $10^{6}$, 14 settings of\n\u03b1, on a log scale from 0.1 to 5 and the model convergence tolerance, $\\epsilon \\in \\{10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\\}$.\nWe ran a grid search over the 1400 possible combinations of these parameters, evaluating each over\n5 random 50-50 training and test splits.\nIn Figures 5a and 5b, we compare the randomized grid search to GP EI MCMC, GP EI per Second\nand their 3x parallelized versions, all constrained to the same points on the grid. Each algorithm\nwas repeated 100 times and the mean and standard error are shown. We observe that the Bayesian\noptimization strategies are considerably more efficient than grid search, which is the status quo. In\nthis case, GP EI MCMC is superior to GP EI per Second in terms of function evaluations but GP\nEI per Second finds better parameters faster than GP EI MCMC as it learns to use a less strict\nconvergence tolerance early on while exploring the other parameters. Indeed, 3x GP EI per Second\nis the least efficient in terms of function evaluations but finds better parameters faster than all the\nother algorithms. Figure 5c compares the use of various covariance functions in GP EI MCMC\noptimization on this problem, again repeating the optimization 100 times. 
It is clear that the selection\nof an appropriate covariance significantly affects performance and the estimation of length scale\nparameters is critical. The assumption of infinite differentiability as imposed by the commonly\nused squared exponential is too restrictive for this problem.\n\n4 The restriction of the search to the same grid was chosen for efficiency reasons: it allowed us to repeat\nthe experiments several times efficiently, by first computing all function evaluations over the whole grid and\nreusing these values within each repeated experiment.\n\n[Plots for Figure 5 omitted: minimum function value against time in hours and against function evaluations; the covariance comparison covers Matern 52 ARD, SqExp, SqExp ARD and Matern 32 ARD.]\n\nFigure 6: Validation error on the CIFAR-10 data for different optimization strategies.\n\n4.4 Convolutional Networks on CIFAR-10\nNeural networks and deep learning methods notoriously require careful tuning of numerous hyper-\nparameters. Multi-layer convolutional neural networks are an example of such a model for which a\nthorough exploration of architectures and hyperparameters is beneficial, as demonstrated in Saxe\net al. [21], but often computationally prohibitive. While Saxe et al. [21] demonstrate a methodology\nfor efficiently exploring model architectures, numerous hyperparameters, such as regularisation\nparameters, remain. In this empirical analysis, we tune nine hyperparameters of a three-layer con-\nvolutional network [22] on the CIFAR-10 benchmark dataset using the code provided 5. 
This model\nhas been carefully tuned by a human expert [22] to achieve a highly competitive result of 18% test\nerror on the unaugmented data, which matches the published state-of-the-art result [23] on\nCIFAR-10. The parameters we explore include the number of epochs to run the model, the learning rate,\nfour weight costs (one for each layer and the softmax output weights), and the width, scale and\npower of the response normalization on the pooling layers of the network.\nWe optimize over the nine parameters for each strategy on a withheld validation set and report the\nmean validation error and standard error over five separate randomly initialized runs. Results are\npresented in Figure 6 and contrasted with the average results achieved using the best parameters\nfound by the expert. The best hyperparameters found by the GP EI MCMC approach achieve an\nerror on the test set of 14.98%, which is over 3 percentage points better than the expert and the state of the art on\nCIFAR-10. The same procedure was repeated on the CIFAR-10 data augmented with horizontal\nreflections and translations, similarly improving on the expert from 11% to 9.5% test error. To our\nknowledge this is the lowest error reported, compared to the 11% state of the art and a recently\npublished 11.21% [24] using similar methods, on the competitive CIFAR-10 benchmark.\n5 Conclusion\nWe presented methods for performing Bayesian optimization for hyperparameter selection of\ngeneral machine learning algorithms. We introduced a fully Bayesian treatment for EI, and algorithms\nfor dealing with variable time regimes and running experiments in parallel. The effectiveness of our\napproaches was demonstrated on three challenging recently published problems spanning different\nareas of machine learning.\n
The resulting Bayesian optimization finds better hyperparameters\nsignificantly faster than the approaches used by the authors and surpasses a human expert at selecting\nhyperparameters on the competitive CIFAR-10 dataset, beating the state of the art by over 3 percentage points.\n\n5 Available at: http://code.google.com/p/cuda-convnet/\n\n[Figure 6 plots, two panels: min function value vs. function evaluations for GP EI MCMC, GP EI Opt, GP EI per Second, GP EI MCMC 3x Parallel and Human Expert; min function value vs. time (hours) for GP EI MCMC, GP EI Opt, GP EI per Second and GP EI MCMC 3x Parallel.]\n\nAcknowledgements\nThe authors thank Alex Krizhevsky, Hoffman et al. [17] and Miller et al. [18] for making their code\nand data available, and George Dahl for valuable feedback. This work was funded by DARPA Young\nFaculty Award N66001-12-1-4219, NSERC and an Amazon AWS in Research grant.\nReferences\n[1] Jonas Mockus, Vytautas Tiesis, and Antanas Zilinskas. The application of Bayesian methods for seeking the extremum. Towards Global Optimization, 2:117\u2013129, 1978.\n\n[2] D. R. Jones. A taxonomy of global optimization methods based on response surfaces. Journal of Global Optimization, 21(4):345\u2013383, 2001.\n\n[3] Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In Proceedings of the 27th International Conference on Machine Learning, 2010.\n\n[4] Adam D. Bull. Convergence rates of efficient global optimization algorithms. Journal of Machine Learning Research, 12:2879\u20132904, 2011.\n\n[5] James S. Bergstra, R\u00e9mi Bardenet, Yoshua Bengio, and Bal\u00e1zs K\u00e9gl. Algorithms for hyper-parameter optimization. In Advances in Neural Information Processing Systems 25, 2011.\n\n[6] Marc C. Kennedy and Anthony O\u2019Hagan. Bayesian calibration of computer models.\n
Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63(3), 2001.\n\n[7] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Learning and Intelligent Optimization 5, 2011.\n\n[8] Nimalan Mahendran, Ziyu Wang, Firas Hamze, and Nando de Freitas. Adaptive MCMC with Bayesian optimization. In AISTATS, 2012.\n\n[9] James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 13:281\u2013305, 2012.\n\n[10] Eric Brochu, Vlad M. Cora, and Nando de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. pre-print, 2010. arXiv:1012.2599.\n\n[11] Carl E. Rasmussen and Christopher Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.\n\n[12] H. J. Kushner. A new method for locating the maximum point of an arbitrary multipeak curve in the presence of noise. Journal of Basic Engineering, 86, 1964.\n\n[13] Iain Murray and Ryan P. Adams. Slice sampling covariance hyperparameters of latent Gaussian models. In Advances in Neural Information Processing Systems 24, pages 1723\u20131731, 2010.\n\n[14] Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In AISTATS, 2005.\n\n[15] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In Advances in Neural Information Processing Systems 22, 2008.\n\n[16] David Ginsbourger and Rodolphe Le Riche. Dealing with asynchronicity in parallel Gaussian process based global optimization. http://hal.archives-ouvertes.fr/hal-00507632, 2010.\n\n[17] Matthew Hoffman, David M. Blei, and Francis Bach. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems 24, 2010.\n\n[18] Kevin Miller, M.\n
Pawan Kumar, Benjamin Packer, Danny Goodman, and Daphne Koller. Max-margin min-entropy models. In AISTATS, 2012.\n\n[19] Chun-Nam John Yu and Thorsten Joachims. Learning structural SVMs with latent variables. In Proceedings of the 26th International Conference on Machine Learning, 2009.\n\n[20] M. Pawan Kumar, Benjamin Packer, and Daphne Koller. Self-paced learning for latent variable models. In Advances in Neural Information Processing Systems 25, 2010.\n\n[21] Andrew Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Ng. On random weights and unsupervised feature learning. In Proceedings of the 28th International Conference on Machine Learning, 2011.\n\n[22] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, Department of Computer Science, University of Toronto, 2009.\n\n[23] Adam Coates and Andrew Y. Ng. Selecting receptive fields in deep networks. In Advances in Neural Information Processing Systems 25, 2011.\n\n[24] Dan Claudiu Ciresan, Ueli Meier, and J\u00fcrgen Schmidhuber. Multi-column deep neural networks for image classification. In Computer Vision and Pattern Recognition, 2012.\n", "award": [], "sourceid": 1338, "authors": [{"given_name": "Jasper", "family_name": "Snoek", "institution": null}, {"given_name": "Hugo", "family_name": "Larochelle", "institution": null}, {"given_name": "Ryan", "family_name": "Adams", "institution": null}]}