{"title": "Meta-Surrogate Benchmarking for Hyperparameter Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 6270, "page_last": 6280, "abstract": "Despite the recent progress in hyperparameter optimization (HPO), available benchmarks that resemble real-world scenarios consist of a few and very large problem instances that are expensive to solve. This blocks researchers and practitioners no only from systematically running large-scale comparisons that are needed to draw statistically significant results but also from reproducing experiments that were conducted before.\nThis work proposes a method to alleviate these issues by means of a meta-surrogate model for HPO tasks trained on off-line generated data. The model combines a probabilistic encoder with a multi-task model such that it can generate inexpensive and realistic tasks of the class of problems of interest.\nWe demonstrate that benchmarking HPO methods on samples of the generative model allows us to draw more coherent and statistically significant conclusions that can be reached orders of magnitude faster than using the original tasks. We provide evidence of our findings for various HPO methods on a wide class of problems.", "full_text": "Meta-Surrogate Benchmarking for Hyperparameter\n\nOptimization\n\nAaron Klein1\n\nZhenwen Dai2\n\nFrank Hutter1\n\nNeil Lawrence3\n\nJavier Gonz\u00e1lez2\n\n1University of Freiburg\n\n2Amazon Cambridge\n\n3University of Cambridge\n\n{kleinaa,fh}@cs.uni-freiburg.de\n\n{zhenwend, gojav}@amazon.com\n\nndl21@cam.ac.uk\n\nAbstract\n\nDespite the recent progress in hyperparameter optimization (HPO), available bench-\nmarks that resemble real-world scenarios consist of a few and very large problem\ninstances that are expensive to solve. 
This blocks researchers and practitioners not only from systematically running large-scale comparisons that are needed to draw statistically significant results but also from reproducing experiments that were conducted before. This work proposes a method to alleviate these issues by means of a meta-surrogate model for HPO tasks trained on off-line generated data. The model combines a probabilistic encoder with a multi-task model such that it can generate inexpensive and realistic tasks of the class of problems of interest. We demonstrate that benchmarking HPO methods on samples of the generative model allows us to draw more coherent and statistically significant conclusions that can be reached orders of magnitude faster than using the original tasks. We provide evidence of our findings for various HPO methods on a wide class of problems.

1 Introduction

Automated Machine Learning (AutoML) [19] is an emerging field that studies the progressive automation of machine learning. A core part of an AutoML system is the hyperparameter optimization (HPO) of a machine learning algorithm. It has already shown promising results by outperforming human experts in finding better hyperparameters [34], and thereby, for example, substantially improved AlphaGo [7].

Despite recent progress (see e.g. the review by Feurer and Hutter [11]), during the phases of developing and evaluating new HPO methods one frequently faces the following problems:

• Evaluating the objective function is often expensive in terms of wall-clock time; e.g., the evaluation of a single hyperparameter configuration may take several hours or days.
This renders extensive HPO or repeated runs of HPO methods computationally infeasible.
• Even though repositories of datasets, such as OpenML [41], provide thousands of datasets, a large fraction cannot meaningfully be used for HPO since they are too small or too easy (in the sense that even simple methods achieve top performance). Hence, useful available datasets are scarce, making it hard to produce a comprehensive evaluation of how well an HPO method will generalize across tasks.

Due to these two problems researchers can only carry out a limited number of comparisons within a reasonable computational budget. This delays the progress of the field, as statistically significant conclusions about the performance of different HPO methods may not be possible to draw. See Figure 1 for an illustrative experiment on the HPO of XGBoost [4]. It is well known that Bayesian optimization with Gaussian processes (BO-GP) [33] outperforms naive random search (RS) in terms of the number of function evaluations on most HPO problems. While we show clear evidence for this in Appendix B on a larger set of datasets, this conclusion cannot be reached when optimizing on the three unluckily picked datasets in Figure 1.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Common pitfalls in the evaluation of HPO methods: we compare two different HPO methods for optimizing the hyperparameters of XGBoost on three UCI regression datasets (see Appendix B for more datasets). The small number of tasks makes it hard to draw any conclusions, since the ranking of the methods varies between tasks. Furthermore, a full run might take several hours, which makes it prohibitively expensive to average across a large number of runs.
Surprisingly, the community has not paid much attention to this issue of proper benchmarking, which is a key step required to generate new scientific knowledge.

In this work we present a generative meta-model that, conditioned on off-line generated data, allows us to sample an unlimited number of new tasks that share properties with the original ones. There are several advantages to this approach. First, the new problem instances are inexpensive to evaluate as they are generated in a parametric form, which drastically reduces the resources needed to compare HPO methods, bounded only by the optimizer's computational overhead (see Figure 2 for an example). Second, there is no limit on the number of tasks that can be generated, which helps to draw statistically more reliable conclusions. Third, the shape and properties of the tasks are not predefined but learned using a few real tasks of an HPO problem. While the global properties of the initial tasks are preserved in the samples, the generative model allows the exploration of instances with diverse local properties, making comparisons more robust and reliable (see Appendix D for some example tasks).

In light of the recent call for more reproducibility, we are convinced that our meta-surrogate benchmarks enable more reproducible research in AutoML: First of all, these cheap-to-evaluate surrogate benchmarks allow researchers to reproduce experiments or perform many repeats of their own experiments without relying on tremendous computational resources. Second, based on our proposed method, we provide a more thorough benchmarking protocol that reduces the risk of extensively tuning an optimization method on single tasks.
Third, surrogate benchmarks in general are less dependent on hardware and technical details, such as complicated training routines or preprocessing strategies.

Figure 2: The three blue bars on the left show the total wall-clock time of executing 20 independent runs of GP-BO, RS and Bohamiann (see Section 5) with 100 function evaluations for the HPO of a feed forward neural network on MNIST. The orange bars show the same for optimizing a task sampled from our proposed meta-model, where benchmarking is orders of magnitude cheaper in terms of wall-clock time than the original benchmarks; there, the computational time is almost exclusively spent on the optimizer (hence the larger bars for GP-BO and Bohamiann compared to RS).

2 Related Work

The use of meta-models that learn across tasks has been investigated by others before. To warm-start HPO on new tasks from previously optimized tasks, Swersky et al. [38] extended Bayesian optimization to the multi-task setting by using a Gaussian process that also takes the correlation between tasks into account. Instead of a Gaussian process, Springenberg et al. [36] used a Bayesian neural network inside multi-task Bayesian optimization, which learns an embedding of tasks during optimization. Similarly, Perrone et al. [32] used Bayesian linear regression, where the basis functions are learned by a neural network, to warm-start the optimization from previous tasks. Feurer et al. [13] used a set of dataset statistics as meta-features to measure the similarity between tasks, such that hyperparameter configurations that were superior on previously optimized similar tasks can be evaluated during the initial design.
This technique is also applied inside the auto-sklearn framework [12]. In a similar vein, Fusi et al. [14] proposed to use a probabilistic matrix factorization approach to exploit knowledge gathered on previously seen tasks. van Rijn and Hutter [40] evaluated random hyperparameter configurations on a large range of tasks to learn priors for the hyperparameters of support vector machines, random forests and AdaBoost. The idea of using a latent variable to represent correlation among multiple outputs of a Gaussian process has been exploited by Dai et al. [8].

Besides the aforementioned work on BO, other methods have also been proposed for efficient HPO. Li et al. [29] proposed Hyperband, which, based on the bandit strategy successive halving [21], dynamically allocates resources across a set of random hyperparameter configurations. Similarly, Jaderberg et al. [20] presented an evolutionary algorithm, dubbed PBT, which adapts a population of hyperparameter configurations during training by either random perturbations or exploiting values of well-performing configurations in the population.

In the context of benchmarking HPO methods, HPOlib [9] is a benchmarking library that provides a fixed and rather small set of common HPO problems. In earlier work, Eggensperger et al. [10] proposed surrogates to speed up the empirical benchmarking of HPO methods. Similar to our work, these surrogates are trained on data generated in an off-line step. Afterwards, evaluating the objective function only requires querying the surrogate model instead of actually running the benchmark. However, these surrogates only mimic one particular task and do not allow for generating new tasks as presented in this work.
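As a minimal sketch, the surrogate idea of Eggensperger et al. [10] amounts to fitting a regressor on off-line (configuration, error) data and then treating the model's prediction as the benchmark objective. The regressor choice and the synthetic data below are illustrative assumptions, not the setup used in that work:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Off-line generated data: hyperparameter configurations with measured errors.
# The synthetic objective here is a stand-in for real, expensive training runs.
rng = np.random.default_rng(0)
X_offline = rng.uniform(0.0, 1.0, size=(200, 2))            # 200 configs, 2 hyperparameters
y_offline = np.sin(3.0 * X_offline[:, 0]) + X_offline[:, 1]

# Fit once; afterwards "running the benchmark" is a cheap model query.
surrogate = RandomForestRegressor(n_estimators=50, random_state=0)
surrogate.fit(X_offline, y_offline)

def benchmark_objective(x):
    """Cheap surrogate evaluation replacing the expensive training run."""
    return float(surrogate.predict(np.asarray(x, dtype=float).reshape(1, -1))[0])
```

Such a surrogate mimics exactly one task; it cannot generate new ones, which is the limitation the meta-model proposed here removes.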
Recently, tabular benchmarks were introduced for neural architecture search [43] and hyperparameter optimization [24], which first perform an exhaustive search of a discrete benchmark problem to store all results in a database and then replace expensive function evaluations by efficient table lookups. While this does not introduce any bias due to a model (see Section 6 for a more detailed discussion), tabular benchmarks are only applicable for problems with few, discrete hyperparameters. Related to our work, but for benchmarking general blackbox optimization methods, is the COCO platform [17]. However, compared to our approach, it is based on handcrafted synthetic functions that do not resemble real-world HPO problems.

3 Benchmarking HPO methods with generative models

We now describe the generative meta-model to create HPO tasks. First we give a formal definition of benchmarking HPO methods across tasks sampled from an unknown distribution, and then describe how we approximate this distribution by our newly proposed meta-model.

3.1 Problem Definition

We denote by t_1, …, t_M a set of related objectives/tasks with the same input domain X, for example X ⊂ ℝ^d. We assume that each t_i, for i = 1, …, M, is an instantiation of an unknown distribution of tasks, t_i ∼ p(t). Every task t has an associated objective function f_t : X → ℝ, where x ∈ X represents a hyperparameter configuration, and we assume that we can observe f_t only through noise: y_t ∼ N(f_t(x), σ_t²).

Let us denote by r(α, t) the performance of an optimization method α on a task t; for instance, a common example for r is the regret of the best observed solution (called the incumbent). To compare two different methods α_A and α_B, the standard practice is to compare r(α_A, t_i) with r(α_B, t_i) on a set of hand-picked tasks t_i ∈ {t_1, …, t_M}.
However, to draw statistically more significant conclusions, we would ideally like to integrate over all tasks:

S_{p(t)}(α) = ∫ r(α, t) p(t) dt.   (1)

Unfortunately, the above integral is intractable as p(t) is unknown. The main contribution of this paper is to approximate p(t) with a generative meta-model p̂(t | D) based on some off-line generated data D = {{(x_{tn}, y_{tn})}_{n=1}^{N}}_{t=1}^{T}. This enables us to sample an arbitrary number of tasks t_i ∼ p̂(t | D) in order to perform a Monte-Carlo approximation of Equation 1.

3.2 Meta-Model for Task Generation

In order to reason across tasks, we define a probabilistic encoder p(h_t | D) that learns a latent representation h_t ∈ ℝ^Q of a task t. More precisely, we use the Bayesian GP-LVM [39], which assumes that the target values belonging to task t, stacked into a vector y_t = (y_{t1}, …, y_{tN}), follow the generative process:

y_t = g(h_t) + ε,   g ∼ GP(0, k),   ε ∼ N(0, σ²),   (2)

where k is the covariance function of the GP. By assuming that the latent variable h_t has an uninformative prior h_t ∼ N(0, I), the latent embedding of each task is inferred as the posterior distribution p(h_t | D). The exact formulation of the posterior distribution is intractable, but following the variational inference presented in Titsias and Lawrence [39], we can estimate a variational posterior distribution q(h_t) = N(m_t, Σ_t) ≈ p(h_t | D) for each task t.

Similar to Multi-Task Bayesian Optimization [38, 36], we define a probabilistic model for the objective function p(y_t | x, h_t) across tasks, which gets as an additional input a task embedding based on our independently trained probabilistic encoder. Following Springenberg et al. [36], we use a Bayesian neural network with M weight vectors {θ_1, …
, θ_M} to model

p(y_t | x, h_t, D) = ∫ p(y_t | x, h_t, θ) p(θ | D) dθ ≈ (1/M) Σ_{i=1}^{M} p(y_t | x, h_t, θ_i),   (3)

where θ_i ∼ p(θ | D) is sampled from the posterior of the neural network weights. By approximating p(y_t | x, h_t) = N(μ(x, h_t), σ²(x, h_t)) to be Gaussian [36], we can compute the predictive mean and variance by:

μ(x, h_t) = (1/M) Σ_{i=1}^{M} μ̂(x, h_t | θ_i);   σ²(x, h_t) = (1/M) Σ_{i=1}^{M} (μ̂(x, h_t | θ_i) − μ(x, h_t))² + (1/M) Σ_{i=1}^{M} σ̂²_{θ_i},

where μ̂(x, h_t | θ_i) and σ̂²_{θ_i} are the output of a single neural network with parameters θ_i.¹ To get a set of weights {θ_1, …, θ_M}, we use stochastic gradient Hamiltonian Monte-Carlo [5] to sample θ_i ∼ p(θ, D) from:

p(θ, D) = (1/N) (1/H) Σ_{n=1}^{N} Σ_{j=1}^{H} log p(y_n | x_n, h_{nj}),

with N = |D| the number of datapoints in our training set and H the number of samples we draw from the latent space, h_{tj} ∼ q(h_t).

3.3 Sampling New Tasks

In order to generate a new task t⋆ ∼ p̂(t | D), we need the associated objective function f_{t⋆} in a parametric form such that we can later evaluate it at any x ∈ X. Given the meta-model above, we perform the following steps: (i) we sample a new latent task vector h_{t⋆} ∼ q(h_t); (ii) given h_{t⋆}, we pick a random θ_i from the set of weights {θ_1, …, θ_M} of our Bayesian neural network and set the new task to be f_{t⋆}(x) = μ̂(x, h_{t⋆} | θ_i).

Note that using f_{t⋆}(x) makes our new task unrealistically smooth.
Instead, we can emulate the typical noise appearing in HPO benchmarks by returning y_{t⋆}(x) ∼ N(μ̂(x, h_{t⋆} | θ_i), σ̂²_{θ_i}), which can be done at an insignificant cost.

4 Profet

We now present our PRObabilistic data-eFficient Experimentation Tool, called PROFET, a benchmarking suite for HPO methods (an open-source implementation is available here: https://github.com/amzn/emukit). We provide pseudo code in Appendix G. The following section first describes how we collected the data to train our meta-model based on three typical HPO problem classes. We then explain how we generated T = 1000 different tasks for each problem class from our meta-model. As described above, we provide a noisy and a noiseless version of each task. Last, we discuss two ways that are commonly used in the literature to assess and aggregate the performance of HPO methods across tasks.

¹Note that we model homoscedastic noise; because of that, σ̂²_{θ_i} does not depend on the input.

Figure 3: Latent space representations of our probabilistic encoder. Left: Representation of task pairs (same color) generated by partitioning eleven datasets from the fully connected network benchmark detailed in Section 4.1. Means of tasks are visualized with different markers and ellipses represent 4 standard deviations. Right: Latent space learned for a model where the input tasks are generated by training an SVM on subsets of MNIST (see Klein et al. [25] for more details).

4.1 Data Collection

We consider three different HPO problems, two for classification and one for regression, with varying dimensions D. For classification, we considered a support vector machine (SVM) with D = 2 hyperparameters and a feed forward neural network (FC-Net) with D = 6 hyperparameters, on 16 OpenML [41] tasks each.
We used gradient boosting (XGBoost)² with D = 8 hyperparameters for regression on 11 different UCI datasets [30]. For further details about the datasets and the configuration spaces see Appendix A. To make sure that our meta-model learns a descriptive representation, we need a solid coverage of the whole input space. For that we drew 100·D quasi-randomly generated configurations from a Sobol grid [35].

Details of our meta-model are described in Appendix F. We show some qualitative examples of our probabilistic encoder in Section 5.1. We can also apply the same machinery to model the cost, in terms of computation time, of evaluating a hyperparameter configuration, in order to use time rather than function evaluations as the budget. This enables future work to benchmark HPO methods that explicitly take the cost into account (e.g. EIperSec [34]).

4.2 Performance Assessment

To assess the performance of an HPO method aggregated over tasks, we consider two different ways commonly used in the literature. First, we measure the runtime r(α, t, y_target) that an HPO method α needs to find a configuration that achieves a performance equal to or lower than a certain target value y_target on task t [16]. Here we define runtime either in terms of function evaluations or in estimated wall-clock time predicted by our meta-model. Using a fixed-target approach allows us to make quantitative statements, such as: method A is, on average, twice as fast as method B. See Hansen et al. [16] for a more detailed discussion. We average across target values of different complexity by evaluating the Sobol grid from above on each generated task. We use the corresponding function values as targets, which, by the same argument as in Section 4.1, provides a good coverage of the error surface.
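The fixed-target runtime measure just described, and its aggregation over (task, target) pairs, can be sketched as follows; `trace` is a hypothetical list of function values observed by one HPO run, in evaluation order:

```python
def runtime_to_target(trace, y_target):
    """Number of function evaluations until the incumbent (best value
    seen so far) first reaches y_target; None if the run never solves it."""
    best = float("inf")
    for i, y in enumerate(trace, start=1):
        best = min(best, y)
        if best <= y_target:
            return i
    return None

def ecdf(runtimes, budgets):
    """Fraction of (task, target) pairs solved within each budget;
    unsolved pairs (None) count toward the denominator only."""
    return [sum(r is not None and r <= b for r in runtimes) / len(runtimes)
            for b in budgets]
```

For example, `runtime_to_target([0.9, 0.5, 0.7, 0.1], 0.5)` returns 2, since the incumbent first reaches the target at the second evaluation.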
To aggregate the runtimes we use the empirical cumulative distribution function (ECDF) [31], which, intuitively, shows for each budget on the x-axis the fraction of solved (task, target) pairs on the y-axis (see Figure 5, left, for an example).

Another common way to compare different HPO methods is to compute the average ranking score in every iteration and for every task [1]. We follow the procedure described by Feurer et al. [13] and compute the average ranking score as follows: assuming we run K different HPO methods M times for each task, we draw a bootstrap sample of 1000 runs out of the KM possible combinations. For each of these samples, we compute the average fractional ranking (ties are broken by the average of the ordinal ranks) after each iteration. At the end, all the assigned ranks are further averaged over all tasks. Note that averaged ranks are a relative performance measurement and can worsen for one method if another method improves (see Figure 5, right, for an example).

²We used the implementation from Chen and Guestrin [4].

5 Experiments

In this section we present: (i) some qualitative insights into our meta-model by showing how it is able to coherently represent a set of tasks in its latent space, (ii) an illustration of why PROFET helps to obtain statistically meaningful results, and (iii) a comparison of various methods from the literature on our new benchmark suite. In particular, we show results for the following state-of-the-art BO methods as well as two popular evolutionary algorithms:

• BO with Gaussian processes (BO-GP) [22]. We used expected improvement as the acquisition function and marginalize over the Gaussian process' hyperparameters as described by Snoek et al. 
[34].
• SMAC [18], a variant of BO that uses random forests to model the objective function and stochastic local search to optimize expected improvement. We used the implementation from https://github.com/automl/SMAC3.
• The BO method TPE by Bergstra et al. [3], which models the density of good and bad configurations in the input space with a kernel density estimator. We used the implementation provided by the Hyperopt package [28].
• BO with Bayesian neural networks (BOHAMIANN) as described by Springenberg et al. [36]. To avoid introducing any bias, we used a different architecture with fewer parameters (3 layers, 50 units in each) than we used for our meta-model (see Section 3).
• Differential Evolution (DE) [37] (we used our own implementation) with the rand1 strategy for the mutation operators and a population size of 10.
• Covariance Matrix Adaptation Evolution Strategy (CMA-ES) by Hansen [15], where we used the implementation from https://github.com/CMA-ES/pycma.
• Random Search (RS) [2], which samples configurations uniformly at random.

For BO-GP, BOHAMIANN and RS we used the implementation provided by the RoBO package [26]. We provide more details for every method in Appendix E.

5.1 Tasks Representation in the Latent Space

We demonstrate the interpretability of the learned latent representations of tasks in two examples. For the first experiment we used the fully connected network benchmark described in Section 4.1. To visualize that our meta-model learns a meaningful latent space, we doubled 11 out of the 18 original tasks used to train the model by splitting each one of them randomly in two of the same size. Thereby, we guarantee that there are pairs of tasks that are similar to each other. In Figure 3 (left), each color represents the partition of the original task and each ellipse represents the mean and four times the standard deviation of the latent task representations.
One can see that the closest neighbour of each task is the other task that belongs to the same original task.

The second experiment targets multi-fidelity problems that arise when training a machine learning model on large datasets and approximate versions of the target objective are generated by considering subsamples of different sizes. We used the SVM surrogate for different dataset subsets from Klein et al. [25], which consists of a random forest trained on a grid of hyperparameter configurations of an SVM evaluated on different subsets of the training data. In particular, we defined the following subsets: {1/512, 1/256, 1/128, 1/64, 1/32, 1/16, 1/8, 1/4, 1/2, 1} as tasks and sampled 100 configurations per task to train our meta-model. Note that we only provide the observed targets, and not the subset size, to our model. Figure 3 (right) shows the latent space of the trained meta-model: the latent representation of the model captures that similar data subsets are also close in the latent space. In particular, the first latent dimension h0 coherently captures the sample size, which is learned using exclusively the correlation between the datasets and with no further information about their size.

Figure 4: Heatmaps of the p-values of the pairwise Mann-Whitney U test on three scenarios. Small p-values should be interpreted as finding evidence that the method in the column outperforms the method in the row. Using tasks from our meta-model leads to results that are close to using the large set of original tasks. Left: results with 1000 real tasks. Middle: subset of only 9 real tasks. Right: results with 1000 tasks generated from our meta-model.

5.2 Benchmarking with PROFET

Comparing HPO methods using a small number of instances affects our ability to properly perform statistical tests.
To illustrate this we consider a distribution of tasks that are variations of the Forrester function f(x) = (αx − 2)² sin(βx − 4). We generated 1000 tasks by uniformly sampling random α and β in [0, 1] and compared six HPO methods: RS, DE, TPE, SMAC, BOHAMIANN and BO-GP (we left CMA-ES out because the Python version does not support 1-dimensional optimization problems).

Figure 4 (left) shows the p-values of all pairwise comparisons with the null hypothesis "Method_column achieves a higher error after 50 function evaluations averaged over 20 runs than Method_row" under the Mann-Whitney U test. Squares in the figure with a p-value smaller than 0.05 are comparisons in which, with 95% confidence, we have evidence that the method in the column is better than the method in the row (we have evidence to reject the null hypothesis). To reproduce a realistic setting where one has access to only a small set of tasks, we picked 9 out of the 1000 tasks randomly. Now, in order to acquire a comparable number of samples to perform a statistical test, we performed 2220 runs of each method on every task, and then computed the average of groups of 20 runs, such that we obtained 999 samples per method to compute the statistical test. One can see in Figure 4 (middle) that although the results are statistically significant, they are misleading: for example, BOHAMIANN dominates all other methods (except BO-GP), whereas it is significantly worse than all other methods if we consider all 1000 tasks.

To solve this issue and obtain more information from the same limited subset of 9 tasks, we use PROFET. We first train the meta-model on the same 9 selected tasks and then use it to generate 1000 new surrogate tasks (see Appendix C for a visualization). Next, we use these tasks to run the comparison of the HPO methods. Results are shown in Figure 4 (right).
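The pairwise test above can be sketched with SciPy. The Forrester-style task sampler follows the text; the two "methods" are plain random search with different budgets, an illustrative stand-in for the optimizers actually compared:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

def sample_task():
    """Variation of the Forrester function with alpha, beta ~ U[0, 1]."""
    alpha, beta = rng.uniform(0.0, 1.0, size=2)
    return lambda x: (alpha * x - 2.0) ** 2 * np.sin(beta * x - 4.0)

def random_search(f, n_evals):
    """Best value found by uniform random search on [0, 1]."""
    return min(f(x) for x in rng.uniform(0.0, 1.0, size=n_evals))

# Final errors of two hypothetical methods on 1000 sampled tasks;
# method B has a 5x larger budget, so it should tend to win.
tasks = [sample_task() for _ in range(1000)]
errs_a = [random_search(f, 10) for f in tasks]
errs_b = [random_search(f, 50) for f in tasks]

# One-sided test of the null "method A achieves a higher error than method B".
_, p_value = mannwhitneyu(errs_a, errs_b, alternative="greater")
```

A small p-value is evidence that method B outperforms method A, mirroring how the heatmaps in Figure 4 are read.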
The heatmap of statistical comparisons reaches very similar conclusions to those obtained with the original 1000 tasks, contrary to what happened when we did the comparisons with only 9 tasks (i.e. the p-values are closer to the original ones). We conclude that our meta-model captures the variability across tasks, such that using samples from it (generated based on a subset of tasks) allows us to draw conclusions that are more in line with experiments on the full set of tasks than running directly on the subset of tasks.

5.3 Comparing State-of-the-art HPO Methods

We conducted 20 independent runs for each method on every task of all three problem classes described in Section 4.1 with different random seeds. Each method had a budget of 200 function evaluations per task, except for BO-GP and BOHAMIANN, where, due to their computational overhead, we were only able to perform 100 function evaluations. Note that conducting this kind of comparison on the original benchmarks would have been prohibitively expensive. In Figure 5 we show the ECDF curves and the average ranking for the noiseless version of the SVM benchmark. The results for all other benchmarks are shown in Appendix E. We can make the following observations:

• Given enough budget, all methods are able to outperform RS. BO approaches can exploit their internal model such that they start to outperform RS earlier than evolutionary algorithms (DE, CMA-ES).
Moreover, more sophisticated models, such as Gaussian processes or Bayesian neural networks, are more sample efficient than somewhat simpler methods, e.g. random forests or kernel density estimators.

Figure 5: Comparison of various HPO methods on 1000 tasks of the noiseless SVM benchmark. Left: ECDF for the runtime. Right: average ranks. See Appendix D for the results of all benchmarks.

• The performance of BO methods that model the objective function (BO-GP, BOHAMIANN, SMAC) instead of just the distribution of the input space (TPE) decays if we evaluate the function through noise. Evolutionary algorithms also seem to struggle with noise.
• Standard BO (BO-GP) performs best on these benchmarks, but its performance decays rapidly with the number of dimensions.
• The runner-up is BOHAMIANN, which works slightly worse than BO-GP but seems to suffer less under noisy function values.
Note that this result can only be obtained by using PROFET, as we could not have evaluated with and without noise on the original datasets.
• Given a sufficient budget, DE starts to outperform CMA-ES as well as BO with simpler (and cheaper) models of the objective function (SMAC, TPE), making it a competitive baseline, particularly for higher dimensional benchmarks.

6 Discussion and future work

We presented PROFET, a new tool for benchmarking HPO algorithms. The key idea is to use a generative meta-model, trained on offline generated data, to produce new tasks, possibly perturbed by noise. The new tasks retain the properties of the original ones but can be evaluated inexpensively, which represents a major advance in speeding up comparisons of HPO methods. In a battery of experiments we have illustrated the representation power of PROFET and its utility when comparing HPO methods in families of problems where only a few tasks are available.

Besides these strong benefits, there are certain drawbacks to our proposed method: First, since we encode new tasks based on a machine learning model, our approach rests on the assumptions that come with this model. Second, while we show in Section 5 empirical evidence that conclusions based on PROFET are virtually identical to the ones based on the original tasks, there are no theoretical guarantees that results translate one-to-one to the original benchmarks. Nevertheless, we believe that PROFET sets the ground for further research in this direction to provide more realistic use-cases than commonly used synthetic functions, e.g. Branin, such that future work on HPO can rapidly perform reliable experiments during development and only execute the final evaluation on expensive real benchmarks.
Ultimately, we think this is an important step towards more reproducibility, which is paramount in such an empirically-driven field as AutoML.
A possible extension of PROFET would be to consider multi-fidelity benchmarks [25, 23, 27, 29] where cheap but approximate fidelities of the objective function are available, e.g. learning curves or dataset subsets. Also, different types of observation noise, e.g. non-stationary or heavy-tailed distributions, as well as higher-dimensional input spaces with discrete and continuous hyperparameters could be investigated. Furthermore, since PROFET also provides gradient information, it could serve as a training distribution for learning-to-learn approaches [6, 42].

Acknowledgement

We want to thank Noor Awad for providing an implementation of differential evolution.

References

[1] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In Proceedings of the 30th International Conference on Machine Learning (ICML'13), 2013.

[2] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012.

[3] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proceedings of the 24th International Conference on Advances in Neural Information Processing Systems (NIPS'11), 2011.

[4] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2016.

[5] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In Proceedings of the 31st International Conference on Machine Learning (ICML'14), 2014.

[6] Y.
Chen, M. W. Hoffman, S. G. Colmenarejo, M. Denil, T. P. Lillicrap, M. Botvinick, and N. de Freitas. Learning to learn without gradient descent by gradient descent. In Proceedings of the 34th International Conference on Machine Learning (ICML'17), 2017.

[7] Y. Chen, A. Huang, Z. Wang, I. Antonoglou, J. Schrittwieser, D. Silver, and N. de Freitas. Bayesian optimization in AlphaGo. arXiv:1812.06855 [cs.LG], 2018.

[8] Z. Dai, M. A. Álvarez, and N. Lawrence. Efficient modeling of latent information in supervised learning using Gaussian processes. In Proceedings of the 30th International Conference on Advances in Neural Information Processing Systems (NIPS'17), 2017.

[9] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization (BayesOpt'13), 2013.

[10] K. Eggensperger, F. Hutter, H. H. Hoos, and K. Leyton-Brown. Efficient benchmarking of hyperparameter optimizers via surrogates. In Proceedings of the 29th National Conference on Artificial Intelligence (AAAI'15), 2015.

[11] M. Feurer and F. Hutter. Hyperparameter optimization. In Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018.

[12] M. Feurer, A. Klein, K. Eggensperger, J. Springenberg, M. Blum, and F. Hutter. Efficient and robust automated machine learning. In Proceedings of the 28th International Conference on Advances in Neural Information Processing Systems (NIPS'15), 2015.

[13] M. Feurer, T. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proceedings of the 29th National Conference on Artificial Intelligence (AAAI'15), 2015.

[14] N. Fusi, R. Sheth, and M. Elibol. Probabilistic matrix factorization for automated machine learning.
In Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NIPS'18), 2018.

[15] N. Hansen. The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation. Advances on estimation of distribution algorithms. Springer Berlin Heidelberg, 2006.

[16] N. Hansen, A. Auger, D. Brockhoff, D. Tusar, and T. Tusar. COCO: performance assessment. arXiv:1605.03560 [cs.NE], 2016.

[17] N. Hansen, A. Auger, O. Mersmann, T. Tušar, and D. Brockhoff. COCO: A platform for comparing continuous optimizers in a black-box setting. arXiv:1603.08785 [cs.AI], 2016.

[18] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11), 2011.

[19] F. Hutter, L. Kotthoff, and J. Vanschoren, editors. Automatic Machine Learning: Methods, Systems, Challenges. Springer, 2018.

[20] M. Jaderberg, V. Dalibard, S. Osindero, W. M. Czarnecki, J. Donahue, A. Razavi, O. Vinyals, T. Green, I. Dunning, K. Simonyan, C. Fernando, and K. Kavukcuoglu. Population based training of neural networks. arXiv:1711.09846 [cs.LG], 2017.

[21] K. Jamieson and A. Talwalkar. Non-stochastic best arm identification and hyperparameter optimization. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS'16), 2016.

[22] D. Jones, M. Schonlau, and W. Welch. Efficient global optimization of expensive black box functions. Journal of Global Optimization, 1998.

[23] K. Kandasamy, G. Dasarathy, J. Schneider, and B. Póczos. Multi-fidelity Bayesian optimisation with continuous approximations. In Proceedings of the 34th International Conference on Machine Learning (ICML'17), 2017.

[24] A. Klein and F. Hutter.
Tabular benchmarks for joint architecture and hyperparameter optimization. arXiv:1905.04970 [cs.LG], 2019.

[25] A. Klein, S. Falkner, S. Bartels, P. Hennig, and F. Hutter. Fast Bayesian hyperparameter optimization on large datasets. Electronic Journal of Statistics, 2017.

[26] A. Klein, S. Falkner, N. Mansur, and F. Hutter. RoBO: A flexible and robust Bayesian optimization framework in Python. In NIPS Workshop on Bayesian Optimization (BayesOpt'17), 2017.

[27] A. Klein, S. Falkner, J. T. Springenberg, and F. Hutter. Learning curve prediction with Bayesian neural networks. In International Conference on Learning Representations (ICLR'17), 2017.

[28] B. Komer, J. Bergstra, and C. Eliasmith. Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn. In ICML 2014 AutoML Workshop, 2014.

[29] L. Li, K. Jamieson, G. DeSalvo, A. Rostamizadeh, and A. Talwalkar. Hyperband: Bandit-based configuration evaluation for hyperparameter optimization. In International Conference on Learning Representations (ICLR'17), 2017.

[30] M. Lichman. UCI machine learning repository, 2013.

[31] J. J. Moré and S. M. Wild. Benchmarking derivative-free optimization algorithms. SIAM Journal on Optimization, 2009.

[32] V. Perrone, R. Jenatton, M. Seeger, and C. Archambeau. Scalable hyperparameter transfer learning. In Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NIPS'18), 2018.

[33] B. Shahriari, K. Swersky, Z. Wang, R. Adams, and N. de Freitas. Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 2016.

[34] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proceedings of the 25th International Conference on Advances in Neural Information Processing Systems (NIPS'12), 2012.

[35] I. M. Sobol.
Distribution of points in a cube and approximate evaluation of integrals. USSR Computational Mathematics and Mathematical Physics, 1967.

[36] J. T. Springenberg, A. Klein, S. Falkner, and F. Hutter. Bayesian optimization with robust Bayesian neural networks. In Proceedings of the 29th International Conference on Advances in Neural Information Processing Systems (NIPS'16), 2016.

[37] R. Storn and K. Price. Differential evolution – a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 1997.

[38] K. Swersky, J. Snoek, and R. Adams. Multi-task Bayesian optimization. In Proceedings of the 26th International Conference on Advances in Neural Information Processing Systems (NIPS'13), 2013.

[39] M. Titsias and N. Lawrence. Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS'10), 2010.

[40] J. van Rijn and F. Hutter. Hyperparameter importance across datasets. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD'18), 2018.

[41] J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 2014.

[42] M. Volpp, L. Fröhlich, A. Doerr, F. Hutter, and C. Daniel. Meta-learning acquisition functions for Bayesian optimization. arXiv:1904.02642 [stat.ML], 2019.

[43] C. Ying, A. Klein, E. Real, E. Christiansen, K. Murphy, and F. Hutter. NAS-Bench-101: Towards reproducible neural architecture search.
arXiv:1902.09635 [cs.LG], 2019.