{"title": "An Empirical Bayes Approach to Optimizing Machine Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 2712, "page_last": 2721, "abstract": "There is rapidly growing interest in using Bayesian optimization to tune model and inference hyperparameters for machine learning algorithms that take a long time to run. For example, Spearmint is a popular software package for selecting the optimal number of layers and learning rate in neural networks. But given that there is uncertainty about which hyperparameters give the best predictive performance, and given that fitting a model for each choice of hyperparameters is costly, it is arguably wasteful to \"throw away\" all but the best result, as per Bayesian optimization. A related issue is the danger of overfitting the validation data when optimizing many hyperparameters. In this paper, we consider an alternative approach that uses more samples from the hyperparameter selection procedure to average over the uncertainty in model hyperparameters. The resulting approach, empirical Bayes for hyperparameter averaging (EB-Hyp) predicts held-out data better than Bayesian optimization in two experiments on latent Dirichlet allocation and deep latent Gaussian models. EB-Hyp suggests a simpler approach to evaluating and deploying machine learning algorithms that does not require a separate validation data set and hyperparameter selection procedure.", "full_text": "An Empirical Bayes Approach to Optimizing\n\nMachine Learning Algorithms\n\nJames McInerney\nSpotify Research\n\n45 W 18th St, 7th Floor\nNew York, NY 10011\njamesm@spotify.com\n\nAbstract\n\nThere is rapidly growing interest in using Bayesian optimization to tune model and\ninference hyperparameters for machine learning algorithms that take a long time to\nrun. For example, Spearmint is a popular software package for selecting the optimal\nnumber of layers and learning rate in neural networks. 
But given that there is\nuncertainty about which hyperparameters give the best predictive performance, and\ngiven that \ufb01tting a model for each choice of hyperparameters is costly, it is arguably\nwasteful to \u201cthrow away\u201d all but the best result, as per Bayesian optimization.\nA related issue is the danger of over\ufb01tting the validation data when optimizing\nmany hyperparameters. In this paper, we consider an alternative approach that\nuses more samples from the hyperparameter selection procedure to average over\nthe uncertainty in model hyperparameters. The resulting approach, empirical\nBayes for hyperparameter averaging (EB-Hyp) predicts held-out data better than\nBayesian optimization in two experiments on latent Dirichlet allocation and deep\nlatent Gaussian models. EB-Hyp suggests a simpler approach to evaluating and\ndeploying machine learning algorithms that does not require a separate validation\ndata set and hyperparameter selection procedure.\n\n1\n\nIntroduction\n\nThere is rapidly growing interest in using Bayesian optimization (BayesOpt) to tune model and\ninference hyperparameters for machine learning algorithms that take a long time to run (Snoek\net al., 2012). Tuning algorithms by grid search is a time consuming task. Tuning by hand is\nalso time consuming and requires trial, error, and expert knowledge of the model. To capture this\nknowledge, BayesOpt uses a performance model (usually a Gaussian process) as a guide to regions\nof hyperparameter space that perform well. BayesOpt balances exploration and exploitation to decide\nwhich hyperparameter to evaluate next in an iterative procedure.\nBayesOpt for machine learning algorithms is a form of model selection in which some objective, such\nas predictive likelihood or root mean squared error, is optimized with respect to hyperparameters \u03b7.\nThus, it is an empirical Bayesian procedure where the marginal likelihood is replaced by a proxy\nobjective. 
Empirical Bayes optimizes the marginal likelihood of data set X (a summary of symbols\nis provided in Table 1),\n\n\\hat{\\eta} := \\arg\\max_{\\eta} \\mathbb{E}_{p(\\theta \\mid \\eta)}[p(X \\mid \\theta)], \\qquad (1)\n\nthen uses p(\u03b8 | X, \u02c6\u03b7) as the posterior distribution over the unknown model parameters \u03b8 (Carlin\nand Louis, 2000). Empirical Bayes is applied in different ways, e.g., gradient-based optimization\nof Gaussian process kernel parameters, optimization of hyperparameters to conjugate priors in\nvariational inference. What is special about BayesOpt is that it performs empirical Bayes in a way\nthat requires calculating the posterior p(\u03b8 | X, \u03b7(s)) for each member in a sequence 1, . . . , S of\ncandidate hyperparameters \u03b7(1), \u03b7(2), . . . , \u03b7(S). Often these posteriors are approximate, such as a\npoint estimate, a Monte Carlo estimate, or a variational approximation. Nonetheless, these operations\nare usually expensive to compute.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Performance in negative logarithm of the predictive likelihood for the validation data\n(left plot) and test data (right plot) ordered by validation error. Each iteration represents a different\nhyperparameter setting. Panels: (a) negative log likelihood on validation data; (b) negative log likelihood on test data.\n\nTable 1: Summary of Symbols\n\nSymbol | Meaning\n\u03b8 | the model parameters\n\u03b7 | the hyperparameters\n\u03bb | the hyper-hyperparameters\n\u02c6\u03b7 | the hyperparameters fit by empirical Bayes\n\u02c6\u03bb | the hyper-hyperparameters fit by empirical Bayes\nX | the dataset\nX\u2217 | unseen data\n\nTherefore, what is surprising about BayesOpt for approximate inference is that it disregards most\nof the computed posteriors and keeps only the posterior p(\u03b8 | X, \u02c6\u03b7) that optimizes the marginal\nlikelihood. 
It is surprising because the intermediate posteriors have something to say about the data,\neven if they condition on hyperparameter configurations that do not maximize the marginal likelihood.\nIn other words, when we harbour uncertainty about \u03b7, should we be more Bayesian? We argue for\nthis approach, especially if one believes there is a danger of overfitting \u03b7 on the validation set, a danger that\ngrows with the dimensionality of the hyperparameters. As an illustrative example,\nFigure 1 shows the predictive performance of a set of 115 posteriors (each corresponding to a\ndifferent hyperparameter) of latent Dirichlet allocation on validation data and testing data. Overfitting\nthe validation data means that the posterior that performs best on test data would not be selected as the final answer in BayesOpt.\n\n[Figure 1 axes: x, iteration ordered by validation error; y, negative log likelihood on validation data (a) and on test data (b).]\n\nBayes empirical Bayes (Carlin and Louis, 2000) extends the empirical Bayes paradigm by introducing\na family of hyperpriors p(\u03b7 | \u03bb) indexed by \u03bb and calculates the posterior over the model parameters\nby integrating,\n\np(\\theta \\mid X, \\lambda) = \\mathbb{E}_{p(\\eta \\mid X, \\lambda)}[p(\\theta \\mid X, \\eta)]. \\qquad (2)\n\nThis leads to the question of how to select the hyper-hyperparameter \u03bb. A natural answer is a\nhierarchical empirical Bayes approach where \u03bb is maximized (see footnote 1),\n\n\\hat{\\lambda} = \\arg\\max_{\\lambda} \\mathbb{E}_{p(\\eta \\mid \\lambda)} \\mathbb{E}_{p(\\theta \\mid \\eta)}[p(X \\mid \\theta, \\eta)], \\qquad (3)\n\n[Footnote 1: This approach could also be called type-III maximum likelihood because it involves marginalizing over\nmodel parameters \u03b8 and hyperparameters \u03b7, and maximizing over hyper-hyperparameters \u03bb.]\n\nand p(\u03b8 | X, \u02c6\u03bb) is used as the posterior. Comparing Eq. 3 to Eq. 
1 highlights that we are adding an\nextra layer of marginalization that can be exploited with the intermediate posteriors in hand. Note\nthe distinction between marginalizing the hyperparameters to the model vs. hyperparameters to the\nGaussian process of model performance. Eq. 3 describes the former; the latter is already a staple of\nBayesOpt (Osborne, 2010).\nIn this paper, we present empirical Bayes for hyperparameter averaging (EB-Hyp), an extension to\nBayesOpt that makes use of this hierarchical approach to incorporate the intermediate posteriors in\nan approximate predictive distribution over unseen data X\u2217.\n\nThe Train-Marginalize-Test Pipeline EB-Hyp is an alternative procedure for evaluating and\ndeploying machine learning algorithms that reduces the need for a separate validation data set.\nValidation data is typically used to avoid over\ufb01tting. Over\ufb01tting is a danger in selecting both\nparameters and hyperparameters. The state of the art provides sophisticated ways of regularizing or\nmarginalizing over parameters to avoid over\ufb01tting on training data. But there is no general method\nfor regularizing hyperparameters and typically there is a requirement of conjugacy or continuity in\norder to simultaneously \ufb01t parameters and hyperparameters in the same training procedure.\nTherefore, the standard practice for dealing with the hyperparameters of machine learning models and\nalgorithms is to use a separate validation data set (Murphy, 2012). One selects the hyperparameter\nthat results in the best performance on validation data after \ufb01tting the training data. 
The best\nhyperparameter and corresponding posterior are then applied to a held-out test data set and the\nresulting performance is the \ufb01nal estimate of the generalization performance of the entire system.\nThis practice of separate validation has carried over to BayesOpt.\nEB-Hyp avoids over\ufb01tting training data through marginalization and allows us to train, marginalize,\nand test without a separate validation data set. It consists of three steps:\n\n1. Train a set of parameters on training data Xtrain, each one conditioned on a choice of\n\nhyperparameter.\n\n2. Marginalize the hyperparameters out of the set of full or approximate posteriors.\n3. Test (or Deploy) the marginal predictive distribution on test data Xtest and report the\n\nperformance.\n\nIn this paper, we argue in favour of this framework as a way of simplifying the evaluation and\ndeployment pipeline. We emphasize that the train step admits a broad category of posterior approxi-\nmation methods for a large number of models, including maximum likelihood, maximum a posteriori,\nvariational inference, or Markov chain Monte Carlo.\nIn summary, our contributions are the following:\n\n\u2022 We highlight the three main shortcomings of the current prevalent approach to tuning\nhyperparameters of machine learning algorithms (computationally wasteful, potentially\nover\ufb01tting validation, added complexity of a separate validation data set) and propose a new\nempirical Bayes procedure, EB-Hyp, to address those issues.\n\u2022 We develop an ef\ufb01cient algorithm to perform EB-Hyp using Monte Carlo approximation\nto both sample hyperparameters from the marginal posterior and to optimize over the\nhyper-hyperparameters.\n\u2022 We apply EB-Hyp to two models and real world data sets, comparing to random search and\nBayesOpt, and \ufb01nd a signi\ufb01cant improvement in held out predictive likelihood validating\nthe approach and approximation in practice.\n\n2 Related Work\n\nEmpirical Bayes has a 
long history, from the nonparametric approach of Robbins (1955)\nto parametric EB (Efron and Morris, 1972) and modern applications of EB (Snoek et al., 2012;\nRasmussen and Williams, 2006). Our work builds on these hierarchical Bayesian approaches.\nBayesOpt uses a GP to model performance of machine learning algorithms. A previous attempt at\nreducing the wastefulness of BayesOpt has focused on directing computational resources toward\nmore promising regions of hyperparameter space (Swersky et al., 2014). Another use of the GP as a\nperformance model arises in Bayesian quadrature, which uses a GP to approximately marginalize over\nparameters (Osborne et al., 2012). However, quadrature is computationally infeasible for forming a\npredictive density after marginalizing hyperparameters because that requires knowing p(\u03b8 | X, \u03b7) for\nthe whole space of \u03b7. In contrast, the EB-Hyp approximation depends on the posterior only at the\nsampled points, which has already been calculated to estimate the marginals.\nFinally, EB-Hyp resembles ensemble methods, such as boosting and bagging, because it is a weighted\nsum over posteriors. Boosting trains models on data reweighted to emphasize errors from previous\nmodels (Freund et al., 1999) while bagging takes an average of models trained on bootstrapped data\n(Breiman, 1996).\n\n3 Empirical Bayes for Hyperparameter Averaging\n\nAs introduced in Section 1, EB-Hyp adds another layer in the model hierarchy with the addition of\na hyperprior p(\u03b7 | \u03bb). The Bayesian approach is to marginalize over \u03b7 but, as usual, the question\nof how to select the hyper-hyperparameter \u03bb lingers. Empirical Bayes provides a response to the\nselection of hyperprior in the form of a maximum marginal likelihood approach (see Eq. 3). 
It is\nuseful to incorporate maximization into the posterior approximation when tuning machine learning\nalgorithms because of the small number of samples we can collect (due to the underlying assumption\nthat the inner training procedure is expensive to run).\nOur starting point is to approximate the posterior predictive distribution under EB-Hyp using Monte\nCarlo samples of \u03b7(s) \u223c p(\u03b7 | X, \u02c6\u03bb),\n\np(X^{*} \\mid X) \\approx \\frac{1}{S} \\sum_{s=1}^{S} \\mathbb{E}_{p(\\theta \\mid X, \\eta^{(s)})}[p(X^{*} \\mid \\theta, \\eta^{(s)})] \\qquad (4)\n\nfor a choice of hyperprior p(\u03b7 | \u03bb).\nThere are two main challenges that Eq. 4 presents. The first is that the marginal posterior p(\u03b7 | X, \u02c6\u03bb)\nis not readily available to sample from. We address this in Section 3.1. The second is the choice of\nhyperprior p(\u03b7 | \u03bb) and how to find \u02c6\u03bb. We describe our approach to this in Section 3.2.\n\n3.1 Acquisition Strategy\n\nThe acquisition strategy describes which hyperparameter to evaluate next during tuning. A na\u00efve way\nto choose evaluation point \u03b7 is to sample from the uniform distribution or the hyperprior. However,\nthis is likely to select a number of points where p(X|\u03b7, \u03bb) has low density, squandering computational\nresources.\nBayesOpt addresses this by using an acquisition function conditioned on the current performance\nmodel posterior then maximizing this function to select the next evaluation point. BayesOpt offers\nseveral choices for the acquisition function. The most prominent are expected improvement, upper\nconfidence bound, and Thompson sampling (Brochu et al., 2010; Chapelle and Li, 2011). Expected\nimprovement and the upper confidence bound result in deterministic acquisition functions and are\ntherefore hard to incorporate into Eq. 4, which is a Monte Carlo average. 
In contrast, Thompson\nsampling is a stochastic procedure that is competitive with the non-stochastic procedures (Chapelle\nand Li, 2011), so we use it as a starting point for our acquisition strategy.\nThompson sampling maintains a model of rewards for actions performed in an environment and\nrepeats the following for iteration s = 1, . . . , S:\n\n1. Draw a simulation of rewards from the current reward posterior conditioned on the history, r(s) \u223c p(r | {\u03b7(t), f(t) | t < s}).\n\n2. Choose the action that gives the maximum reward in the simulation, \u03b7(s) = arg max_\u03b7 r(s)(\u03b7).\n\n3. Observe reward f(s) from the environment for performing action \u03b7(s).\n\nThompson sampling balances exploration with exploitation because actions with large posterior\nmeans and actions with high variance are both more likely to appear as the optimal action in the\nsample r(s). However, the arg max presents difficulties in the reweighting required to perform Bayes\nempirical Bayes approaches. We discuss these difficulties in more depth in Section 3.2. Furthermore,\nit is unclear exactly what the sample set {\u03b7(1), . . . , \u03b7(S)} represents. This question becomes pertinent\nwhen we care about more than just the optimal hyperparameter. To address these issues, we next\npresent a procedure that generalizes Thompson sampling when it is used for hyperparameter tuning.\n\nPerformance Model Sampling Performance model sampling is based on the idea that the set of\nsimulated rewards r(s) can themselves be treated as a probability distribution of hyperparameters,\nfrom which we can also draw samples. In a hyperparameter selection context, let \u02dcp(s)(X | \u03b7) \u2261 r(s),\nthe marginal likelihood. The procedure repeats for iterations s = 1, . . . , S:\n\n1. draw \\tilde{p}^{(s)}(X \\mid \\eta) \\sim \\mathcal{P}(p(X \\mid \\eta) \\mid \\{\\eta^{(t)}, f_X^{(t)} \\mid t < s\\})\n2. draw \\eta^{(s)} \\sim \\tilde{p}^{(s)}(\\eta \\mid X)\n3. evaluate f_X^{(s)} = \\int p(X \\mid \\theta) \\, p(\\theta \\mid \\eta^{(s)}) \\, d\\theta \\qquad (5)\n\nwhere \\tilde{p}^{(s)}(\\eta \\mid X) := Z^{-1} \\tilde{p}^{(s)}(X \\mid \\eta) \\, p(\\eta),\nP is the performance model distribution, and Z is the normalization constant (see footnote 2). The marginal\nlikelihood p(X | \u03b7(s)) may be evaluated exactly (e.g., Gaussian process marginal given kernel\nhyperparameters) or estimated using methods that approximate the posterior p(\u03b8 | X, \u03b7(s)) such as\nmaximum likelihood estimation, Markov chain Monte Carlo sampling, or variational inference.\nThompson sampling is recovered from performance model sampling when the sample in Step 2 of\nEq. 5 is replaced with the maximum a posteriori approximation (with a uniform prior over the bounds\nof the hyperparameters) to select where to obtain the next hyperparameter sample \u03b7(s). Given the\neffectiveness of Thompson sampling in various domains (Chapelle and Li, 2011), this is likely to\nwork well for hyperparameter selection. Furthermore, Eq. 5 admits a broader range of acquisition\nstrategies, the simplest being a full sample. And importantly, it allows us to consider the convergence\nof EB-Hyp.\nThe sample \u02dcp(s)(X | \u03b7) of iteration s from the procedure in Eq. 5 converges to the true probability\ndensity function p(X|\u03b7) as s \u2192 \u221e under the assumptions that p(X|\u03b7) is smooth and the performance\nmodel P is drawn from a log Gaussian process with smooth mean and covariance over a finite input\nspace. Consistency of the Gaussian process in one dimension has been shown for fixed Borel\nprobability measures (Choi and Schervish, 2004). Furthermore, rates of convergence are favourable\nfor a variety of covariance functions using the log Gaussian process for density estimation (van der\nVaart and van Zanten, 2008). Performance model sampling additionally changes the sampling\ndistribution of \u03b7 on each iteration. 
Simulation \u02dcp(s)(\u03b7 | X) from the posterior of P conditioned on\nthe evaluation history has non-zero density wherever the prior p(\u03b7) is non-zero by the definition of\n\u02dcp(s)(\u03b7 | X) in Eq. 5 and the fact that draws from a log Gaussian process are non-zero. Therefore, as\ns \u2192 \u221e, the input-output set {\u03b7(t), f_X^(t) | t < s} on which P is conditioned will cover the input space.\nIt follows from the above discussion that the samples {\u03b7(s) | s \u2208 [1, S]} from the procedure in\nEq. 5 converge to the posterior distribution p(\u03b7 | X) as S \u2192 \u221e. The sample \u02dcp(s)(X | \u03b7)\nconverges to the true pdf p(X | \u03b7) as s \u2192 \u221e. Since the members of {\u03b7(s) | s \u2208 [1, S]} are sampled independently from\nthe respective members of {\u02dcp(s)(X | \u03b7) | s \u2208 [1, S]}, the set of samples therefore tends to p(\u03b7 | X) as S \u2192 \u221e.\n\n[Footnote 2: Z can be easily calculated if \u03b7 is discrete or if p(\u03b7) is conjugate to p(X | \u03b7). In non-conjugate continuous\ncases, \u03b7 may be discretized to a high granularity. Since EB-Hyp is an active procedure, the limiting computational\nbottleneck is to calculate the posterior of the performance model. For GPs, this is an O(S^3) operation in the\nnumber of hyperparameter evaluations S. If onerous, the operation is amenable to well established fast\napproximations, e.g., the inducing points method (Hensman et al., 2013).]\n\nAlgorithm 1: Empirical Bayes for hyperparameter averaging (EB-Hyp)\n\ninputs: training data Xtrain and inference algorithm A : (X, \u03b7) \u2192 p(\u03b8 | X, \u03b7)\noutput: predictive density p(X\u2217 | Xtrain)\ninitialize evaluation history V = {}\nwhile V not converged do\n    draw performance function from GP posterior \u02dcp(s)(X | \u03b7) \u223c GP(\u00b7 | V)\n    calculate hyperparameter posterior \u02dcp(s)(\u03b7 | X) := Z^{-1} \u02dcp(s)(X | \u03b7) p(\u03b7)\n    draw next evaluation point \u03b7(s) := arg max_\u03b7 \u02dcp(s)(\u03b7 | X)\n    run parameter inference conditioned on hyperparameter p(\u03b8 | \u03b7(s)) := A(Xtrain, \u03b7(s))\n    evaluate performance f_X^(s) := \u222b p(Xtrain | \u03b8) p(\u03b8 | \u03b7(s)) d\u03b8\n    append (\u03b7(s), f_X^(s)) to history V\nend\nfind optimal \u02c6\u03bb using Eq. 3 (discussed in Section 3.2)\nreturn: approximation to p(X\u2217 | Xtrain) using Eq. 4\n\nTable 2: Predictive log likelihood for latent Dirichlet allocation (LDA), 20 Newsgroup dataset\n\nMethod | Predictive Log Lik. (% Improvement on BayesOpt)\nBayesOpt with validation | -357648 (0.00%)\nBayesOpt without validation | -361661 (-1.12%)\nEB-Hyp with validation | -357650 (-0.00%)\nEB-Hyp without validation | -351911 (+1.60%)\nRandom | -2666074 (-645%)\n\nA key limitation to the above discussion for continuous hyperparameters is the assumption that the\ntrue marginal p(X | \u03b7) is smooth. This may not always be the case; for example, an infinitesimal\nchange in the learning rate for gradient descent on a non-convex objective could result in finding a\ncompletely different local optimum. This affects asymptotic convergence, but discontinuities in the\nmarginal likelihood are not likely to affect the outcome at the scale of the number of evaluations typical\nin hyperparameter tuning. Importantly, the smoothness assumption does not pose a problem to\ndiscrete hyperparameters (e.g., number of units in a hidden layer). Another limitation of performance\nmodel sampling is that it focuses on the marginal likelihood as the metric to be optimized. This\nis less of a restriction than it may first appear. 
Various performance metrics are often equivalent or\napproximations to a particular likelihood, e.g., mean squared error is, up to scaling and an additive constant, the negative log likelihood of a\nGaussian-distributed observation with fixed variance.\n\n3.2 Weighting Strategy\n\nPerformance model sampling provides a set of hyperparameter samples, each with a performance\nf_X^(s) and a computed posterior p(\u03b8 | X, \u03b7(s)). These three elements can be combined in a Monte Carlo\naverage to provide a prediction over unseen data or a mean parameter value.\nFollowing from Section 3.1, the samples of \u03b7 from Eq. 5 converge to the distribution of p(\u03b7 | X, \u03bb).\nA standard Bayesian treatment of the hierarchical model requires selecting a fixed \u03bb, equivalent to\na predetermined weighted or unweighted average of the models of a BayesOpt run. However, we\nfound that fixing \u03bb is not competitive with approaches to hyperparameter tuning that involve some\nmaximization. This is likely to arise from the small number of samples collected during tuning\n(recall that collecting more samples involves new entire runs of parameter training and is usually\ncomputationally expensive).\nThe empirical Bayes selection of \u02c6\u03bb selects the best hyper-hyperparameter and reintroduces maximization\nin a way that makes use of the intermediate posteriors during tuning, as in Eq. 4. In addition, it\nuses hyper-hyperparameter optimization to find \u02c6\u03bb. This depends on the choice of hyperprior. There is\nflexibility in this choice; we found that a nonparametric hyperprior that places a uniform distribution\nover the top T < S samples (by value of f_X(\u03b7(t))) from Eq. 4 works well in practice, and this is\nwhat we use in Section 4 with T = \u230aS/10\u230b. This choice of hyperprior avoids converging on a point\nmass in the limit of infinitely large data X and forces the approximate marginal to spread probability\nmass across a well-performing set of models, any one of which is likely to dominate the prediction\nfor any given data point (though, importantly, it will not always be the same model).\nAfter the Markov chain in Eq. 5 converges, the samples {\u03b7(s) | s = 1, . . . , S} and the (approximated)\nposteriors p(\u03b8 | X, \u03b7(s)) can be used in Eq. 4. The EB-Hyp algorithm is summarized in Algorithm 1.\nThe dominating computational cost comes from running inference to evaluate A(Xtrain, \u03b7(s)). All\nthe other steps combined are negligible in comparison.\n\n4 Experiments\n\nWe apply EB-Hyp and BayesOpt to two approximate inference algorithms and data sets. We also\napply uniform random search, which is known to outperform a grid or manual search (Bergstra and\nBengio, 2012).\nIn the first experiment, we consider stochastic variational inference on latent Dirichlet allocation\n(SVI-LDA) applied to the 20 Newsgroups data (see footnote 3). In the second, we consider a deep latent Gaussian model (DLGM)\non the Labeled Faces in the Wild data set (Huang et al., 2007). We find that EB-Hyp outperforms\nBayesOpt and random search as measured by predictive likelihood.\n\nTable 3: Predictive log likelihood for deep latent Gaussian model (DLGM), Labeled Faces in the Wild\n\nMethod | Predictive Log Lik. (% Improvement on BayesOpt)\nBayesOpt with validation | -17071 (0.00%)\nBayesOpt without validation | -15970 (+6.45%)\nEB-Hyp with validation | -16375 (+4.08%)\nEB-Hyp without validation | -15872 (+7.02%)\nRandom | -17271 (-1.17%)\n\nFor the performance model, we use the log Gaussian process in our experiments implemented in\nthe GPy package (GPy, 2012). The performance model uses the Mat\u00e9rn 3/2 kernel to express the\nassumption that nearby hyperparameters typically perform similarly; but this kernel has the advantage\nof being less smooth than the squared exponential, making it more suitable to capture abrupt changes\nin the marginal likelihood (Stein, 1999). 
Between each hyperparameter sample, we optimize the\nkernel parameters and the independent noise distribution for the observations so far by maximizing\nthe marginal likelihood of the Gaussian process.\nThroughout, we randomly split the data into training, validation, and test sets. To assess the ne-\ncessity of a separate validation set we consider two scenarios: (1) training and validating on the\ntrain+validation data, (2) training on the train data and validating on the validation data. In either\ncase, the test data is used only at the \ufb01nal step to report overall performance.\n\n4.1 Latent Dirichlet Allocation\n\nLatent Dirichlet allocation (LDA) is an unsupervised model that \ufb01nds topic structure in a set of text\ndocuments expressed as K word distributions (one per topic) and D topic distributions (one per\ndocument). We apply stochastic variational inference to LDA (Hoffman et al., 2013), a method that\napproximates the posterior over parameters p(\u03b8| X, \u03b7) in Eq. 4 with variational distribution q(\u03b8| v, \u03b7).\nThe algorithm minimizes the KL divergence between q and p by adjusting the variational parameters.\nWe explored four hyperparameters of SVI-LDA in the experiments: K \u2208 [50, 200], the number\nof topics; log(\u03b1) \u2208 [\u22125, 0], the hyperparameter to the Dirichlet document-topic prior; log(\u03b7) \u2208\n[\u22125, 0], the hyperparameter to the Dirichlet topic-word distribution prior; \u03ba \u2208 [0.5, 0.9], the decay\nparameter to the learning rate (t0 + t)\u2212\u03ba, where t0 was \ufb01xed at 10 for this experiment. Several other\nhyperparameters are required and were kept \ufb01xed during the experiment. 
The minibatch size was\nfixed at 100 documents and the vocabulary was selected from the top 1,000 words, excluding stop\nwords, words that appear in over 95% of documents, and words that appear in only one document.\nThe 11,314 resulting documents were randomly split 80%-10%-10% into training, validation, and\ntest sets.\n\n[Footnote 3: http://qwone.com/~jason/20Newsgroups/]\n\nTable 2 shows performance in log likelihood on the test data of the compared approaches. The percentage\nchange over the BayesOpt benchmark is reported in parentheses. EB-Hyp performs significantly\nbetter than BayesOpt in this problem. To understand why, Figure 1 examines the error (negative log\nlikelihood) on both the validation and test data for all the hyperparameters selected during BayesOpt.\nIn the test scenario, BayesOpt chooses the hyperparameters corresponding to the left-most bar in\nFigure 1b because those hyperparameters minimized error on the validation set. However, Figure 1b\nshows that other hyperparameter settings outperform this selection when testing. For finite validation\ndata, there is no way of knowing how the optimal hyperparameter will behave on test data before\nseeing it, motivating an averaging approach like EB-Hyp. In addition, Table 2 shows that a separate\nvalidation data set is not necessary with EB-Hyp. In contrast, BayesOpt does need separate validation\nand overfits the training data without it.\n\nFigure 2: A 2D slice of the performance model posterior after a run of EB-Hyp on LDA. The two\nhyperparameters control the sparsity of the Dirichlet priors. The plot indicates a negative relationship\nbetween them.\n\nFigure 2 shows a slice of the posterior mean function of the performance model for two of the\nhyperparameters, \u03b1 and \u03b7, controlling the sparsity of the document-topics and the topic-word\ndistributions, respectively. 
There is a negative relationship between the two hyperparameters, meaning\nthat the sparser we make the topic distribution for documents, the denser we need to make the word\ndistribution for topics to maintain the same performance (and vice versa). EB-Hyp combines several\nmodels of different degrees of sparsity in a way that respects this trade-off.\n\n4.2 Supervised Deep Latent Gaussian Models\n\nStochastic backpropagation for deep latent Gaussian models (DLGMs) approximates the posterior\nof an unsupervised deep model using variational inference and stochastic gradient ascent (Rezende\net al., 2014). In addition to a generator network, a recognition network is introduced that amortizes\ninference (i.e., once trained, the recognition network \ufb01nds variational parameters for new data in a\nclosed-form expression). In this experiment, we use an extension of the DLGM with supervision\n(Li et al., 2015) to perform label prediction on a subset of the Labeled Faces in the Wild data set\n(Huang et al., 2007). The data consist of 1,288 images of 1,850 pixels each, split 60%-20%-20% into\ntraining, validation, and test data (respectively).\nWe considered 4 hyperparameters for the DLGM with a one-layered recognition model: N1 \u2208\n[10, 200], the number of hidden units in the \ufb01rst layer of the generative and recognition models;\nN2 \u2208 [0, 200], the number of hidden units in the second layer of the generative model only (when\nN2 = 0, only one layer is used); log(\u03ba) \u2208 [\u22125,\u22120.05], the variance of the prior of the weights\nin the generative model; and log(\u03c1) \u2208 [\u22125,\u22120.05], the gradient ascent step size. Table 3 shows\nperformance for the DLGM. The single best performing hyperparameters were (N1 = 91, N2 =\n86, log(\u03ba) = \u22125, log(\u03c1) = \u22125). We \ufb01nd again that, EB-Hyp outperforms all the other methods on\ntest data. 
This is achieved without validation.\n\n[Figure 2 plot: axes log(alpha) and log(eta)]\n\n5 Conclusions\n\nWe introduced a general-purpose procedure for dealing with unknown hyperparameters that control\nthe behaviour of machine learning models and algorithms. Our approach is based on approximately\nmarginalizing the hyperparameters by taking a weighted average of posteriors calculated by existing\ninference algorithms that are time intensive. To do this, we introduced a procedure for sampling\ninformative hyperparameters from a performance model. Our approaches are supported by an efficient\nalgorithm. In two sets of experiments, we found this algorithm outperforms optimization and random\napproaches.\nThe arguments and evidence presented in this paper point toward a tendency of the standard\noptimization-based methodologies to overfit hyperparameters. Other things being equal, this tendency\npunishes (in reported performance on test data) methods that are more sensitive to hyperparameters\ncompared to methods that are less sensitive. The result is a bias in the literature towards methods\nwhose generalization performance is less sensitive to hyperparameters. Averaging approaches like\nEB-Hyp help reduce this bias.\n\nAcknowledgments\n\nMany thanks to Scott Linderman, Samantha Hansen, Eric Humphrey, Ching-Wei Chen, and the\nreviewers of the workshop on Advances in Approximate Bayesian Inference (2016) for their insightful\ncomments and feedback.\n\nReferences\n\nBergstra, J. and Bengio, Y. (2012). Random search for hyper-parameter optimization. Journal of Machine\nLearning Research, 13, 281\u2013305.\n\nBreiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123\u2013140.\n\nBrochu, E., Cora, V. M., and De Freitas, N. (2010). A tutorial on Bayesian optimization of expensive cost\nfunctions, with application to active user modeling and hierarchical reinforcement learning. arXiv preprint\narXiv:1012.2599.\n\nCarlin, B. P. and Louis, T. A. (2000). 
Empirical Bayes: Past, present and future. Journal of the American\nStatistical Association, 95(452), 1286\u20131289.\n\nChapelle, O. and Li, L. (2011). An empirical evaluation of Thompson sampling. In Advances in Neural\nInformation Processing Systems, pages 2249\u20132257.\n\nChoi, T. and Schervish, M. J. (2004). Posterior consistency in nonparametric regression problems under Gaussian\nprocess priors.\n\nEfron, B. and Morris, C. (1972). Limiting the risk of Bayes and empirical Bayes estimators\u2014Part II: The\nempirical Bayes case. Journal of the American Statistical Association, 67(337), 130\u2013139.\n\nFreund, Y., Schapire, R., and Abe, N. (1999). A short introduction to boosting. Journal-Japanese Society For\nArtificial Intelligence, 14(771-780), 1612.\n\nGPy (2012). GPy: A Gaussian process framework in Python. http://github.com/SheffieldML/GPy.\n\nHensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. arXiv preprint\narXiv:1309.6835.\n\nHoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. W. (2013). Stochastic variational inference. Journal of\nMachine Learning Research, 14(1), 1303\u20131347.\n\nHuang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. (2007). Labeled faces in the wild: A database\nfor studying face recognition in unconstrained environments. Technical Report 07-49, University of\nMassachusetts, Amherst.\n\nLi, C., Zhu, J., Shi, T., and Zhang, B. (2015). Max-margin deep generative models. In Advances in Neural\nInformation Processing Systems, pages 1837\u20131845.\n\nMurphy, K. P. (2012). Machine learning: a probabilistic perspective. MIT Press.\n\nOsborne, M. (2010). Bayesian Gaussian Processes for Sequential Prediction, Optimisation and Quadrature.\nPh.D. thesis, University of Oxford.\n\nOsborne, M., Garnett, R., Ghahramani, Z., Duvenaud, D. K., Roberts, S. J., and Rasmussen, C. E. (2012). Active\nLearning of Model Evidence Using Bayesian Quadrature. In F. 
Pereira, C. J. C. Burges, L. Bottou, and K. Q.\nWeinberger, editors, Advances in Neural Information Processing Systems 25, pages 46\u201354. Curran Associates,\nInc.\n\nRasmussen, C. E. and Williams, C. K. (2006). Gaussian processes for machine learning. The MIT Press, 2(3), 4.\n\nRezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference\nin deep generative models. In Proceedings of The 31st International Conference on Machine Learning, pages\n1278\u20131286.\n\nRobbins, H. (1955). The empirical Bayes approach to statistical decision problems. In Herbert Robbins Selected\nPapers, pages 49\u201368. Springer.\n\nSnoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning\nalgorithms. In Advances in Neural Information Processing Systems, pages 2951\u20132959.\n\nStein, M. L. (1999). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media.\n\nSwersky, K., Snoek, J., and Adams, R. P. (2014). Freeze-thaw Bayesian optimization. arXiv preprint\narXiv:1406.3896.\n\nvan der Vaart, A. W. and van Zanten, J. H. (2008). Rates of contraction of posterior distributions based on\nGaussian process priors. The Annals of Statistics, pages 1435\u20131463.\n", "award": [], "sourceid": 1538, "authors": [{"given_name": "James", "family_name": "McInerney", "institution": "Spotify Research"}]}