{"title": "Efficient and Robust Automated Machine Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2962, "page_last": 2970, "abstract": "The success of machine learning in a broad range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts. To be effective in practice, such systems need to automatically choose a good algorithm and feature preprocessing steps for a new dataset at hand, and also set their respective hyperparameters. Recent work has started to tackle this automated machine learning (AutoML) problem with the help of efficient Bayesian optimization methods. In this work we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters). This system, which we dub auto-sklearn, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization. Our system won the first phase of the ongoing ChaLearn AutoML challenge, and our comprehensive analysis on over 100 diverse datasets shows that it substantially outperforms the previous state of the art in AutoML. 
We also demonstrate the performance gains due to each of our contributions and derive insights into the effectiveness of the individual components of auto-sklearn.", "full_text": "Efficient and Robust Automated Machine Learning

Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter

Department of Computer Science, University of Freiburg, Germany
{feurerm,kleinaa,eggenspk,springj,mblum,fh}@cs.uni-freiburg.de

Abstract

The success of machine learning in a broad range of applications has led to an ever-growing demand for machine learning systems that can be used off the shelf by non-experts. To be effective in practice, such systems need to automatically choose a good algorithm and feature preprocessing steps for a new dataset at hand, and also set their respective hyperparameters. Recent work has started to tackle this automated machine learning (AutoML) problem with the help of efficient Bayesian optimization methods. Building on this, we introduce a robust new AutoML system based on scikit-learn (using 15 classifiers, 14 feature preprocessing methods, and 4 data preprocessing methods, giving rise to a structured hypothesis space with 110 hyperparameters). This system, which we dub AUTO-SKLEARN, improves on existing AutoML methods by automatically taking into account past performance on similar datasets, and by constructing ensembles from the models evaluated during the optimization. Our system won the first phase of the ongoing ChaLearn AutoML challenge, and our comprehensive analysis on over 100 diverse datasets shows that it substantially outperforms the previous state of the art in AutoML.
We also demonstrate the performance gains due to each of our contributions and derive insights into the effectiveness of the individual components of AUTO-SKLEARN.

1 Introduction

Machine learning has recently made great strides in many application areas, fueling a growing demand for machine learning systems that can be used effectively by novices in machine learning. Correspondingly, a growing number of commercial enterprises aim to satisfy this demand (e.g., BigML.com, Wise.io, SkyTree.com, RapidMiner.com, Dato.com, Prediction.io, DataRobot.com, Microsoft's Azure Machine Learning, Google's Prediction API, and Amazon Machine Learning). At its core, every effective machine learning service needs to solve the fundamental problems of deciding which machine learning algorithm to use on a given dataset, whether and how to preprocess its features, and how to set all hyperparameters. This is the problem we address in this work.

More specifically, we investigate automated machine learning (AutoML), the problem of automatically (without human input) producing test set predictions for a new dataset within a fixed computational budget. Formally, this AutoML problem can be stated as follows:

Definition 1 (AutoML problem). For i = 1, ..., n+m, let x_i ∈ R^d denote a feature vector and y_i ∈ Y the corresponding target value. Given a training dataset D_train = {(x_1, y_1), ..., (x_n, y_n)} and the feature vectors x_{n+1}, ..., x_{n+m} of a test dataset D_test = {(x_{n+1}, y_{n+1}), ..., (x_{n+m}, y_{n+m})} drawn from the same underlying data distribution, as well as a resource budget b and a loss metric L(·,·), the AutoML problem is to (automatically) produce test set predictions ŷ_{n+1}, ..., ŷ_{n+m}. The loss of a solution ŷ_{n+1}, . . .
, ŷ_{n+m} to the AutoML problem is given by

    (1/m) Σ_{j=1}^{m} L(ŷ_{n+j}, y_{n+j}).

In practice, the budget b would comprise computational resources, such as CPU and/or wallclock time and memory usage. This problem definition reflects the setting of the ongoing ChaLearn AutoML challenge [1]. The AutoML system we describe here won the first phase of that challenge.

Here, we follow and extend the AutoML approach first introduced by AUTO-WEKA [2] (see http://automl.org). At its core, this approach combines a highly parametric machine learning framework F with a Bayesian optimization [3] method for instantiating F well for a given dataset.

The contribution of this paper is to extend this AutoML approach in various ways that considerably improve its efficiency and robustness, based on principles that apply to a wide range of machine learning frameworks (such as those used by the machine learning service providers mentioned above). First, following successful previous work for low dimensional optimization problems [4, 5, 6], we reason across datasets to identify instantiations of machine learning frameworks that perform well on a new dataset and warmstart Bayesian optimization with them (Section 3.1). Second, we automatically construct ensembles of the models considered by Bayesian optimization (Section 3.2). Third, we carefully design a highly parameterized machine learning framework from high-performing classifiers and preprocessors implemented in the popular machine learning framework scikit-learn [7] (Section 4).
Finally, we perform an extensive empirical analysis using a diverse collection of datasets to demonstrate that the resulting AUTO-SKLEARN system outperforms previous state-of-the-art AutoML methods (Section 5), to show that each of our contributions leads to substantial performance improvements (Section 6), and to gain insights into the performance of the individual classifiers and preprocessors used in AUTO-SKLEARN (Section 7).

2 AutoML as a CASH problem

We first review the formalization of AutoML as a Combined Algorithm Selection and Hyperparameter optimization (CASH) problem used by AUTO-WEKA's AutoML approach. Two important problems in AutoML are that (1) no single machine learning method performs best on all datasets and (2) some machine learning methods (e.g., non-linear SVMs) crucially rely on hyperparameter optimization. The latter problem has been successfully attacked using Bayesian optimization [3], which nowadays forms a core component of an AutoML system. The former problem is intertwined with the latter since the rankings of algorithms depend on whether their hyperparameters are tuned properly. Fortunately, the two problems can efficiently be tackled as a single, structured, joint optimization problem:

Definition 2 (CASH). Let A = {A^(1), ..., A^(R)} be a set of algorithms, and let the hyperparameters of each algorithm A^(j) have domain Λ^(j). Further, let D_train = {(x_1, y_1), ..., (x_n, y_n)} be a training set which is split into K cross-validation folds {D^(1)_valid, ..., D^(K)_valid} and {D^(1)_train, ..., D^(K)_train} such that D^(i)_train = D_train \ D^(i)_valid for i = 1, ..., K. Finally, let L(A^(j)_λ, D^(i)_train, D^(i)_valid) denote the loss that algorithm A^(j) achieves on D^(i)_valid when trained on D^(i)_train with hyperparameters λ. Then, the Combined Algorithm Selection and Hyperparameter optimization (CASH) problem is to find the joint algorithm and hyperparameter setting that minimizes this loss:

    A*, λ* ∈ argmin_{A^(j) ∈ A, λ ∈ Λ^(j)} (1/K) Σ_{i=1}^{K} L(A^(j)_λ, D^(i)_train, D^(i)_valid).    (1)

This CASH problem was first tackled by Thornton et al. [2] in the AUTO-WEKA system using the machine learning framework WEKA [8] and tree-based Bayesian optimization methods [9, 10]. In a nutshell, Bayesian optimization [3] fits a probabilistic model to capture the relationship between hyperparameter settings and their measured performance; it then uses this model to select the most promising hyperparameter setting (trading off exploration of new parts of the space vs. exploitation in known good regions), evaluates that hyperparameter setting, updates the model with the result, and iterates. While Bayesian optimization based on Gaussian process models (e.g., Snoek et al. [11]) performs best in low-dimensional problems with numerical hyperparameters, tree-based models have been shown to be more successful in high-dimensional, structured, and partly discrete problems [12] – such as the CASH problem – and are also used in the AutoML system HYPEROPT-SKLEARN [13]. Among the tree-based Bayesian optimization methods, Thornton et al. [2] found the random-forest-based SMAC [9] to outperform the tree Parzen estimator TPE [10], and we therefore use SMAC to solve the CASH problem in this paper. Next to its use of random forests [14], SMAC's main distinguishing feature is that it allows fast cross-validation by evaluating one fold at a time and discarding poorly-performing hyperparameter settings early.

Figure 1: Our improved AutoML approach.
We add two components to Bayesian hyperparameter optimization of an ML framework: meta-learning for initializing the Bayesian optimizer and automated ensemble construction from configurations evaluated during optimization.

3 New methods for increasing efficiency and robustness of AutoML

We now discuss our two improvements of the AutoML approach. First, we include a meta-learning step to warmstart the Bayesian optimization procedure, which results in a considerable boost in efficiency. Second, we include an automated ensemble construction step, allowing us to use all classifiers that were found by Bayesian optimization.

Figure 1 summarizes the overall AutoML workflow, including both of our improvements. We note that we expect their effectiveness to be greater for flexible ML frameworks that offer many degrees of freedom (e.g., many algorithms, hyperparameters, and preprocessing methods).

3.1 Meta-learning for finding good instantiations of machine learning frameworks

Domain experts derive knowledge from previous tasks: they learn about the performance of machine learning algorithms. The area of meta-learning [15] mimics this strategy by reasoning about the performance of learning algorithms across datasets. In this work, we apply meta-learning to select instantiations of our given machine learning framework that are likely to perform well on a new dataset. More specifically, for a large number of datasets, we collect both performance data and a set of meta-features, i.e., characteristics of the dataset that can be computed efficiently and that help to determine which algorithm to use on a new dataset.

This meta-learning approach is complementary to Bayesian optimization for optimizing an ML framework.
Meta-learning can quickly suggest some instantiations of the ML framework that are likely to perform quite well, but it is unable to provide fine-grained information on performance. In contrast, Bayesian optimization is slow to start for hyperparameter spaces as large as those of entire ML frameworks, but can fine-tune performance over time. We exploit this complementarity by selecting k configurations based on meta-learning and use their result to seed Bayesian optimization. This approach of warmstarting optimization by meta-learning has already been successfully applied before [4, 5, 6], but never to an optimization problem as complex as that of searching the space of instantiations of a full-fledged ML framework. Likewise, learning across datasets has also been applied in collaborative Bayesian optimization methods [16, 17]; while these approaches are promising, they are so far limited to very few meta-features and cannot yet cope with the high-dimensional partially discrete configuration spaces faced in AutoML.

More precisely, our meta-learning approach works as follows. In an offline phase, for each machine learning dataset in a dataset repository (in our case 140 datasets from the OpenML [18] repository), we evaluated a set of meta-features (described below) and used Bayesian optimization to determine and store an instantiation of the given ML framework with strong empirical performance for that dataset. (In detail, we ran SMAC [9] for 24 hours with 10-fold cross-validation on two thirds of the data and stored the resulting ML framework instantiation which exhibited best performance on the remaining third.)
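The subsequent online step, matching a new dataset to its nearest neighbors in meta-feature space and reusing their stored configurations, can be sketched as follows. This is only an illustration: the meta-feature values and stored configurations below are made up, whereas the real system uses 38 meta-features, L1 distance, and k = 25.

```python
# Minimal sketch of the meta-learning warmstart: rank previously seen datasets
# by L1 distance in meta-feature space and return the stored framework
# instantiations of the k nearest ones. Repository contents are hypothetical.

def warmstart_configurations(new_meta_features, repository, k=25):
    """repository: list of (meta_features, stored_configuration) pairs."""
    def l1(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))
    ranked = sorted(repository, key=lambda entry: l1(entry[0], new_meta_features))
    return [config for _, config in ranked[:k]]

# Toy repository: (num. data points, num. features, num. classes) per dataset.
repo = [
    ([1000.0, 20.0, 2.0], {"classifier": "random_forest"}),
    ([50000.0, 784.0, 10.0], {"classifier": "kernel_svm"}),
    ([1200.0, 25.0, 2.0], {"classifier": "gradient_boosting"}),
]
# A small, low-dimensional dataset is closest to the first and third entries.
print(warmstart_configurations([900.0, 18.0, 2.0], repo, k=2))
```

The returned configurations are then evaluated first, seeding the Bayesian optimizer's model before it starts exploring on its own.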
Then, given a new dataset D, we compute its meta-features, rank all datasets by their L1 distance to D in meta-feature space and select the stored ML framework instantiations for the k = 25 nearest datasets for evaluation before starting Bayesian optimization with their results.

To characterize datasets, we implemented a total of 38 meta-features from the literature, including simple, information-theoretic and statistical meta-features [19, 20], such as statistics about the number of data points, features, and classes, as well as data skewness, and the entropy of the targets. All meta-features are listed in Table 1 of the supplementary material. Notably, we had to exclude the prominent and effective category of landmarking meta-features [21] (which measure the performance of simple base learners), because they were computationally too expensive to be helpful in the online evaluation phase. We note that this meta-learning approach draws its power from the availability of a repository of datasets; due to recent initiatives, such as OpenML [18], we expect the number of available datasets to grow ever larger over time, increasing the importance of meta-learning.

3.2 Automated ensemble construction of models evaluated during optimization

While Bayesian hyperparameter optimization is data-efficient in finding the best-performing hyperparameter setting, we note that it is a very wasteful procedure when the goal is simply to make good predictions: all the models it trains during the course of the search are lost, usually including some that perform almost as well as the best. Rather than discarding these models, we propose to store them and to use an efficient post-processing method (which can be run in a second process on-the-fly) to construct an ensemble out of them.
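The store-instead-of-discard idea can be sketched as a thin layer around the optimization loop that archives each evaluated model's validation loss and hold-out predictions; `record` below stands in for one (hypothetical) Bayesian-optimization iteration and is not auto-sklearn's actual interface.

```python
# Hypothetical sketch: keep every model evaluated during the search so an
# ensemble can be built from them later, instead of keeping only the best.

model_store = []  # (validation_loss, holdout_predictions) pairs

def record(validation_loss, holdout_predictions):
    """Archive one evaluated configuration's results and pass the loss on."""
    model_store.append((validation_loss, holdout_predictions))
    return validation_loss

# Simulated losses/predictions from three evaluated configurations.
record(0.20, [0, 1, 1, 0])
record(0.10, [0, 1, 0, 0])
record(0.15, [1, 1, 0, 0])

best_loss, _ = min(model_store)
print(len(model_store), best_loss)  # all three models remain available
```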
This automatic ensemble construction avoids committing to a single hyperparameter setting and is thus more robust (and less prone to overfitting) than using the point estimate that standard hyperparameter optimization yields. To the best of our knowledge, we are the first to make this simple observation, which can be applied to improve any Bayesian hyperparameter optimization method.

It is well known that ensembles often outperform individual models [22, 23], and that effective ensembles can be created from a library of models [24, 25]. Ensembles perform particularly well if the models they are based on (1) are individually strong and (2) make uncorrelated errors [14]. Since this is much more likely when the individual models are different in nature, ensemble building is particularly well suited for combining strong instantiations of a flexible ML framework.

However, simply building a uniformly weighted ensemble of the models found by Bayesian optimization does not work well. Rather, we found it crucial to adjust these weights using the predictions of all individual models on a hold-out set. We experimented with different approaches to optimize these weights: stacking [26], gradient-free numerical optimization, and the method ensemble selection [24]. While we found both numerical optimization and stacking to overfit to the validation set and to be computationally costly, ensemble selection was fast and robust. In a nutshell, ensemble selection (introduced by Caruana et al. [24]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance (with uniform weight, but allowing for repetitions). Procedure 1 in the supplementary material describes it in detail.
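A minimal sketch of this greedy procedure, in the spirit of Caruana et al. [24]: each round adds (with replacement) the model whose inclusion minimizes the validation error of the averaged predictions. The toy data, the 0.5 decision threshold, and the tiny ensemble size are illustrative simplifications, not the paper's exact protocol.

```python
# Sketch of ensemble selection: greedily grow an ensemble from stored models,
# re-using each model's hold-out predictions. Repetitions act as weights.

def ensemble_selection(val_predictions, y_valid, ensemble_size):
    """val_predictions: per-model lists of predicted probabilities for class 1."""
    chosen = []

    def error(members):
        avg = [sum(val_predictions[m][i] for m in members) / len(members)
               for i in range(len(y_valid))]
        return sum(int(p > 0.5) != y for p, y in zip(avg, y_valid)) / len(y_valid)

    for _ in range(ensemble_size):
        best = min(range(len(val_predictions)), key=lambda m: error(chosen + [m]))
        chosen.append(best)  # with replacement -> implicit weighting
    return chosen

preds = [[0.8, 0.3, 0.2, 0.3],   # strong but misses example 1
         [0.4, 0.9, 0.3, 0.6],   # weak, yet fixes example 1
         [0.6, 0.7, 0.6, 0.2]]   # decorrelated errors
y = [1, 1, 0, 0]
print(ensemble_selection(preds, y, ensemble_size=3))  # [0, 1, 1]
```

Note how the weak second model is still selected because its errors are uncorrelated with the first model's, which is exactly the property the paragraph above highlights.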
We used this technique in all our experiments, building an ensemble of size 50.

4 A practical automated machine learning system

To design a robust AutoML system, as our underlying ML framework we chose scikit-learn [7], one of the best known and most widely used machine learning libraries. It offers a wide range of well established and efficiently-implemented ML algorithms and is easy to use for both experts and beginners. Since our AutoML system closely resembles AUTO-WEKA, but – like HYPEROPT-SKLEARN – is based on scikit-learn, we dub it AUTO-SKLEARN.

Figure 2 depicts AUTO-SKLEARN's overall components. It comprises 15 classification algorithms, 14 feature preprocessing methods, and 4 data preprocessing methods. We parameterized each of them, which resulted in a space of 110 hyperparameters. Most of these are conditional hyperparameters that are only active if their respective component is selected. We note that SMAC [9] can handle this conditionality natively.

Figure 2: Structured configuration space. Squared boxes denote parent hyperparameters whereas boxes with rounded edges are leaf hyperparameters. Grey colored boxes mark active hyperparameters which form an example configuration and machine learning pipeline. Each pipeline comprises one feature preprocessor, classifier and up to three data preprocessor methods plus respective hyperparameters.

All 15 classification algorithms in AUTO-SKLEARN are listed in Table 1a (and described in detail in Section A.1 of the supplementary material). They fall into different categories, such as general linear models (2 algorithms), support vector machines (2), discriminant analysis (2), nearest neighbors (1), naïve Bayes (3), decision trees (1) and ensembles (4).

name                      #λ   cat (cond)   cont (cond)
AdaBoost (AB)             4    1 (-)        3 (-)
Bernoulli naïve Bayes     2    1 (-)        1 (-)
decision tree (DT)        4    1 (-)        3 (-)
extreml. rand. trees      5    2 (-)        3 (-)
Gaussian naïve Bayes      -    -            -
gradient boosting (GB)    6    -            6 (-)
kNN                       3    2 (-)        1 (-)
LDA                       4    1 (-)        3 (1)
linear SVM                4    2 (-)        2 (-)
kernel SVM                7    2 (-)        5 (2)
multinomial naïve Bayes   2    1 (-)        1 (-)
passive aggressive        3    1 (-)        2 (-)
QDA                       2    -            2 (-)
random forest (RF)        5    2 (-)        3 (-)
Linear Class. (SGD)       10   4 (-)        6 (3)

(a) classification algorithms

name                          #λ   cat (cond)   cont (cond)
extreml. rand. trees prepr.   5    2 (-)        3 (-)
fast ICA                      4    3 (-)        1 (1)
feature agglomeration         4    3 (-)        1 (-)
kernel PCA                    5    1 (-)        4 (3)
rand. kitchen sinks           2    -            2 (-)
linear SVM prepr.             3    1 (-)        2 (-)
no preprocessing              -    -            -
nystroem sampler              5    1 (-)        4 (3)
PCA                           2    1 (-)        1 (-)
polynomial                    3    2 (-)        1 (-)
random trees embed.           4    -            4 (-)
select percentile             2    1 (-)        1 (-)
select rates                  3    2 (-)        1 (-)

one-hot encoding              2    1 (-)        1 (1)
imputation                    1    1 (-)        -
balancing                     1    1 (-)        -
rescaling                     1    1 (-)        -

(b) preprocessing methods

Table 1: Number of hyperparameters for each possible classifier (left) and feature preprocessing method (right) for a binary classification dataset in dense representation. Tables for sparse binary classification and sparse/dense multiclass classification datasets can be found in Section E of the supplementary material, Tables 2a, 3a, 4a, 2b, 3b and 4b. We distinguish between categorical (cat) hyperparameters with discrete values and continuous (cont) numerical hyperparameters. Numbers in brackets are conditional hyperparameters, which are only relevant when another parameter has a certain value.

In contrast to AUTO-WEKA [2], we focused our configuration space on base classifiers and excluded meta-models and ensembles that are themselves parameterized by one or more base classifiers. While such ensembles increased AUTO-WEKA's number of hyperparameters by almost a factor of five (to 786), AUTO-SKLEARN "only" features 110 hyperparameters. We instead construct complex ensembles using our post-hoc method from Section 3.2. Compared to AUTO-WEKA, this is much more data-efficient: in AUTO-WEKA, evaluating the performance of an ensemble with 5 components requires the construction and evaluation of 5 models; in contrast, in AUTO-SKLEARN, ensembles come largely for free, and it is possible to mix and match models evaluated at arbitrary times during the optimization.

The preprocessing methods for datasets in dense representation in AUTO-SKLEARN are listed in Table 1b (and described in detail in Section A.2 of the supplementary material). They comprise data preprocessors (which change the feature values and are always used when they apply) and feature preprocessors (which change the actual set of features, and only one of which [or none] is used). Data preprocessing includes rescaling of the inputs, imputation of missing values, one-hot encoding and balancing of the target classes.
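The conditional structure of such a configuration space (a hyperparameter is active only when its parent component is selected) can be illustrated with a toy sketch; the component and hyperparameter names below are invented for the example and are not auto-sklearn's actual identifiers.

```python
# Toy structured configuration space: selecting a component activates only the
# conditional hyperparameters that belong to it, mirroring Figure 2.

space = {
    "classifier": {
        "random_forest": ["n_estimators", "max_depth"],
        "kernel_svm": ["C", "gamma", "kernel"],
    },
    "feature_preprocessor": {
        "pca": ["n_components"],
        "none": [],
    },
}

def active_hyperparameters(pipeline):
    """Return the hyperparameters activated by one pipeline choice."""
    active = []
    for role, component in pipeline.items():
        active.extend(f"{component}:{h}" for h in space[role][component])
    return sorted(active)

print(active_hyperparameters({"classifier": "random_forest",
                              "feature_preprocessor": "pca"}))
```

Only the selected components' hyperparameters need to be set, which is the conditionality that SMAC handles natively in the full 110-dimensional space.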
The 14 possible feature preprocessing methods can be categorized into feature selection (2), kernel approximation (2), matrix decomposition (3), embeddings (1), feature clustering (1), polynomial feature expansion (1) and methods that use a classifier for feature selection (2). For example, L1-regularized linear SVMs fitted to the data can be used for feature selection by eliminating features corresponding to zero-valued model coefficients.

As with every robust real-world system, we had to handle many more important details in AUTO-SKLEARN; we describe these in Section B of the supplementary material.

5 Comparing AUTO-SKLEARN to AUTO-WEKA and HYPEROPT-SKLEARN

As a baseline experiment, we compared the performance of vanilla AUTO-SKLEARN (without our improvements) to AUTO-WEKA and HYPEROPT-SKLEARN, reproducing the experimental setup with 21 datasets of the paper introducing AUTO-WEKA [2]. We describe this setup in detail in Section G in the supplementary material.

Table 2 shows that AUTO-SKLEARN performed statistically significantly better than AUTO-WEKA in 6/21 cases, tied it in 12 cases, and lost against it in 3. For the three datasets where AUTO-WEKA performed best, we found that in more than 50% of its runs the best classifier it chose is not implemented in scikit-learn (trees with a pruning component). So far, HYPEROPT-SKLEARN is more of a proof-of-concept – inviting the user to adapt the configuration space to her own needs – than a full AutoML system. The current version crashes when presented with sparse data and missing values.
dataset           AS     AW     HS
Abalone           73.50  73.50  76.21
Amazon            16.00  30.00  16.22
Car               0.39   0.00   0.39
Cifar-10          51.70  56.95  -
Cifar-10 Small    54.81  56.20  57.95
Convex            17.53  21.80  19.18
Dexter            5.56   8.33   -
Dorothea          5.51   6.38   -
German Credit     27.00  28.33  27.67
Gisette           1.62   2.29   2.29
KDD09 Appetency   1.74   1.74   -
KR-vs-KP          0.42   0.31   0.42
Madelon           12.44  18.21  14.74
MNIST Basic       2.84   2.84   2.82
MRBI              46.92  60.34  55.79
Secom             7.87   8.09   -
Semeion           5.24   5.24   5.87
Shuttle           0.01   0.01   0.05
Waveform          14.93  14.13  14.07
Wine Quality      33.76  33.36  34.72
Yeast             40.67  37.75  38.45

Table 2: Test set classification error of AUTO-WEKA (AW), vanilla AUTO-SKLEARN (AS) and HYPEROPT-SKLEARN (HS), as in the original evaluation of AUTO-WEKA [2]. We show median percent error across 100,000 bootstrap samples (based on 10 runs), simulating 4 parallel runs. Bold numbers indicate the best result. Underlined results are not statistically significantly different from the best according to a bootstrap test with p = 0.05.

Figure 3: Average rank of all four AUTO-SKLEARN variants (ranked by balanced test error rate (BER)) across 140 datasets. Note that ranks are a relative measure of performance (here, the rank of all methods has to add up to 10), and hence an improvement in BER of one method can worsen the rank of another. The supplementary material shows the same plot on a log-scale to show the time overhead of meta-feature and ensemble computation.

It also crashes on Cifar-10 due to a memory limit which we set for all optimizers to enable a
On the 16 datasets on which it ran, it statistically tied the best optimizer in 9 cases\nand lost against it in 7.\n\n6 Evaluation of the proposed AutoML improvements\n\nIn order to evaluate the robustness and general applicability of our proposed AutoML system on\na broad range of datasets, we gathered 140 binary and multiclass classi\ufb01cation datasets from the\nOpenML repository [18], only selecting datasets with at least 1000 data points to allow robust\nperformance evaluations. These datasets cover a diverse range of applications, such as text classi\ufb01-\ncation, digit and letter recognition, gene sequence and RNA classi\ufb01cation, advertisement, particle\nclassi\ufb01cation for telescope data, and cancer detection in tissue samples. We list all datasets in Table 7\nand 8 in the supplementary material and provide their unique OpenML identi\ufb01ers for reproducibility.\nSince the class distribution in many of these datasets is quite imbalanced we evaluated all AutoML\nmethods using a measure called balanced classi\ufb01cation error rate (BER). We de\ufb01ne balanced error\nrate as the average of the proportion of wrong classi\ufb01cations in each class. In comparison to standard\nclassi\ufb01cation error (the average overall error), this measure (the average of the class-wise error)\nassigns equal weight to all classes. We note that balanced error or accuracy measures are often used\nin machine learning competitions (e.g., the AutoML challenge [1] uses balanced accuracy).\nWe performed 10 runs of AUTO-SKLEARN both with and without meta-learning and with and without\nensemble prediction on each of the datasets. To study their performance under rigid time constraints,\nand also due to computational resource constraints, we limited the CPU time for each run to 1 hour; we\nalso limited the runtime for a single model to a tenth of this (6 minutes). 
To not evaluate performance\non data sets already used for meta-learning, we performed a leave-one-dataset-out validation: when\nevaluating on dataset D, we only used meta-information from the 139 other datasets.\nFigure 3 shows the average ranks over time of the four AUTO-SKLEARN versions we tested. We\nobserve that both of our new methods yielded substantial improvements over vanilla AUTO-SKLEARN.\nThe most striking result is that meta-learning yielded drastic improvements starting with the \ufb01rst\n\n6\n\n500100015002000250030003500time [sec]1.82.02.22.42.62.83.0average rankvanilla auto-sklearnauto-sklearn + ensembleauto-sklearn + meta-learningauto-sklearn + meta-learning + ensemble\fD\n\nI\n\nt\ne\ns\na\nt\na\nd\n\nL\nM\nn\ne\np\nO\n\n38\n46\n179\n184\n554\n772\n917\n1049\n1111\n1120\n1128\n293\n389\n\nN\nR\nA\nE\nL\nK\nS\n\n-\nO\nT\nU\nA\n\n2.15\n3.76\n16.99\n10.32\n1.55\n46.85\n10.22\n12.93\n23.70\n13.81\n4.21\n2.86\n19.65\n\nt\ns\no\no\nB\na\nd\nA\n\n2.68\n4.65\n17.03\n10.52\n2.42\n49.68\n9.11\n12.53\n23.16\n13.54\n4.89\n4.07\n22.98\n\ns\ne\ny\na\nB\ne\nv\n\u00a8\u0131\na\nn\n\ni\nl\nl\nu\no\nn\nr\ne\nB\n\n50.22\n-\n19.27\n-\n-\n47.90\n25.83\n15.50\n28.40\n18.81\n4.71\n24.30\n-\n\nn\no\ni\ns\ni\nc\ne\nd\n\ne\ne\nr\nt\n\n2.15\n5.62\n18.31\n17.46\n12.00\n47.75\n11.00\n19.31\n24.40\n17.45\n9.30\n5.03\n33.14\n\ns\ne\ne\nr\nt\n\n.\nd\nn\na\nr\n\n.\nl\n\nm\ne\nr\nt\nx\ne\n\n18.06\n4.74\n17.09\n11.10\n2.91\n45.62\n10.22\n17.18\n24.47\n13.86\n3.89\n3.59\n19.38\n\ns\ne\ny\na\nB\ne\nv\n\u00a8\u0131\na\nn\n\nn\na\ni\ns\ns\nu\na\nG\n\n11.22\n7.88\n21.77\n64.74\n10.52\n48.83\n33.94\n26.23\n29.59\n21.50\n4.77\n32.44\n29.18\n\nt\nn\ne\ni\nd\na\nr\ng\n\ng\nn\ni\nt\ns\no\no\nb\n\n1.77\n3.49\n17.00\n10.42\n3.86\n48.15\n10.11\n13.38\n22.93\n13.61\n4.58\n24.48\n19.20\n\nN\nN\nk\n\nA\nD\nL\n\n50.00\n7.57\n22.23\n31.10\n2.68\n48.00\n11.11\n23.80\n50.30\n17.23\n4.59\n4.86\n30.87\n\n8.55\n8.67\n18.93\n35.44\n3.34\n46.74\n34.22\n25.12\n24.11\n15.48\n4.58\n24.40\n19.68\n\nM\nV\nS\nr
\na\ne\nn\ni\nl\n\n16.29\n8.31\n17.30\n15.76\n2.23\n48.38\n18.67\n17.28\n23.99\n14.94\n4.83\n14.16\n17.95\n\nM\nV\nS\n\nl\ne\nn\nr\ne\nk\n\n17.89\n5.36\n17.57\n12.52\n1.50\n48.66\n6.78\n21.44\n23.56\n14.17\n4.59\n100.00\n22.04\n\nl\na\ni\nm\no\nn\ni\nt\nl\nu\nm\n\ns\ne\ny\na\nB\ne\nv\n\u00a8\u0131\na\nn\n\n46.99\n7.55\n18.97\n27.13\n10.37\n47.21\n25.50\n26.40\n27.67\n18.33\n4.46\n24.20\n20.04\n\ne\nv\ni\ns\ne\nr\ng\ng\na\n\ne\nv\ni\ns\ns\na\np\n\n50.00\n9.23\n22.29\n20.01\n100.00\n48.75\n20.67\n29.25\n43.79\n16.37\n5.65\n21.34\n20.14\n\nA\nD\nQ\n\n8.78\n7.57\n19.06\n47.18\n2.75\n47.67\n30.44\n21.38\n25.86\n15.62\n5.59\n28.68\n39.57\n\nt\ns\ne\nr\no\nf\n\nm\no\nd\nn\na\nr\n\n2.34\n4.20\n17.24\n10.98\n3.08\n47.71\n10.83\n13.75\n28.06\n13.70\n3.83\n2.57\n20.66\n\n.\ns\ns\na\nl\nC\n\nr\na\ne\nn\ni\nL\n\n)\n\nD\nG\nS\n(\n\n15.82\n7.31\n17.01\n12.76\n2.50\n47.93\n18.33\n19.92\n23.36\n14.66\n4.33\n15.54\n17.99\n\nTable 3: Median balanced test error rate (BER) of optimizing AUTO-SKLEARN subspaces for each classi\ufb01cation\nmethod (and all preprocessors), as well as the whole con\ufb01guration space of AUTO-SKLEARN, on 13 datasets.\nAll optimization runs were allowed to run for 24 hours except for AUTO-SKLEARN which ran for 48 hours.\nBold numbers indicate the best result; underlined results are not statistically signi\ufb01cantly different from the best\naccording to a bootstrap test using the same setup as for Table 
Balanced test error (%) per OpenML dataset ID:

                                    38     46    179    184    554    772    917   1049   1111   1120   1128    293    389
AUTO-SKLEARN                      2.15   3.76  16.99  10.32   1.55  46.85  10.22  12.93  23.70  13.81   4.21   2.86  19.65
densifier                            -      -      -      -      -      -      -      -      -      -      -  24.40  20.63
extreml. rand. trees prepr.       4.03   4.98  17.83  55.78   1.56  47.90   8.33  20.36  23.36  16.29   4.90   3.41  21.40
feature agglomeration             2.24   4.40  16.92  11.31   1.65  48.62  10.33  13.14  23.73  13.73   4.76      -      -
fast ICA                          7.27   7.95  17.24  19.96   2.52  48.65  16.06  19.92  24.69  14.22   4.96      -      -
kernel PCA                        5.84   8.74 100.00  36.52 100.00  47.59  20.94  19.57 100.00  14.57   4.21 100.00  17.50
rand. kitchen sinks               8.57   8.41  17.34  28.05 100.00  47.68  35.44  20.06  25.25  14.82   5.08  19.30  19.66
linear SVM prepr.                 2.28   4.25  16.84   9.92   2.21  47.72   8.67  13.28  23.43  14.02   4.52   3.01  19.89
no preproc.                       2.28   4.52  16.97  11.43   1.60  48.34   9.44  15.84  22.27  13.85   4.59   2.66  20.87
nystroem sampler                  7.70   8.48  17.30  25.53   2.21  48.06  37.83  18.96  23.95  14.66   4.08  20.94  18.46
polynomial                        2.90   4.21  16.94  10.54 100.00  48.00   9.11  12.95  26.94  13.22  50.00      -      -
PCA                               7.23   8.40  17.64  21.15   1.65  47.30  22.33  17.22  23.25  14.23   4.59      -      -
random trees embed.              18.50   7.51  17.05  12.68   3.48  47.84  17.67  18.52  26.68  15.03   9.23   8.05  44.83
select percentile classification  2.20   4.17  17.09  45.03   1.46  47.56  10.00  11.94  23.53  13.65   4.33   2.86  20.17
truncated SVD                        -      -      -      -      -      -      -      -      -      -      -   4.05  21.58
select rates                      2.28   4.68  16.86  10.47   1.70  48.43  10.44  14.38  23.33  13.67   4.08   2.74  19.18

Table 4: Like Table 3, but instead optimizing subspaces for each preprocessing method (and all classifiers).

configuration it selected and lasting until the end of the experiment. We note that the improvement was most pronounced in the beginning and that, over time, vanilla AUTO-SKLEARN also found good solutions without meta-learning, letting it catch up on some datasets (thus improving its overall rank).

Moreover, both of our methods complement each other: our automated ensemble construction improved both vanilla AUTO-SKLEARN and AUTO-SKLEARN with meta-learning. Interestingly, the ensemble's influence on performance started earlier for the meta-learning version. We believe that this is because meta-learning produces better machine learning models earlier, which can be directly combined into a strong ensemble; but when run longer, vanilla AUTO-SKLEARN without meta-learning also benefits from automated ensemble construction.

7 Detailed analysis of AUTO-SKLEARN components

We now study AUTO-SKLEARN's individual classifiers and preprocessors, compared to jointly optimizing all methods, in order to obtain insights into their peak performance and robustness. Ideally, we would have liked to study all combinations of a single classifier and a single preprocessor in isolation, but with 15 classifiers and 14 preprocessors this was infeasible; rather, when studying the performance of a single classifier, we still optimized over all preprocessors, and vice versa. To obtain a more detailed analysis, we focused on a subset of datasets but extended the configuration budget for optimizing all methods from one hour to one day, and to two days for AUTO-SKLEARN.
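The entries in Tables 3 and 4 appear to be balanced error rates (the y-axes of Figure 4 are labeled as such), commonly defined as the mean of the per-class error rates. A minimal stdlib-only sketch of that metric, using toy labels rather than anything from the paper's experiments:

```python
# Balanced error rate: average the error rate of each class, so that
# rare classes count as much as frequent ones. Labels below are toy data.

def balanced_error_rate(y_true, y_pred):
    classes = sorted(set(y_true))
    per_class_errors = []
    for c in classes:
        members = [i for i, y in enumerate(y_true) if y == c]
        wrong = sum(y_pred[i] != c for i in members)
        per_class_errors.append(wrong / len(members))
    return sum(per_class_errors) / len(classes)

# A majority-class predictor is right on 4 of 5 examples here, yet its
# balanced error rate is 50% because it misses class 1 entirely.
ber = balanced_error_rate([0, 0, 0, 0, 1], [0, 0, 0, 0, 0])
```

This weighting explains why a 100.00 table entry signals complete failure on a class-balanced scale, regardless of the dataset's label skew.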
Specifically, we clustered our 140 datasets with g-means [27] based on the dataset meta-features and used one dataset from each of the resulting 13 clusters (see Table 6 in the supplementary material for the list of datasets). We note that, in total, these extensive experiments required 10.7 CPU years.

(a) MNIST (OpenML dataset ID 554)   (b) Promise pc4 (OpenML dataset ID 1049)

Figure 4: Performance of a subset of classifiers compared to AUTO-SKLEARN over time. We show the median test error rate and the 5th and 95th percentiles over time for optimizing three classifiers separately, compared with optimizing the joint space. A plot with all classifiers can be found in Figure 4 in the supplementary material. While AUTO-SKLEARN is inferior in the beginning, in the end its performance is close to that of the best method.

Table 3 compares the results of the various classification methods against AUTO-SKLEARN. Overall, as expected, random forests, extremely randomized trees, AdaBoost, and gradient boosting showed the most robust performance, and SVMs showed strong peak performance on some datasets. Besides a variety of strong classifiers, there are also several models which could not compete: the decision tree, passive aggressive, kNN, Gaussian NB, LDA, and QDA were statistically significantly inferior to the best classifier on most datasets. Finally, the table indicates that no single method was the best choice for all datasets.
As shown in the table, and also visualized for two example datasets in Figure 4, optimizing the joint configuration space of AUTO-SKLEARN led to the most robust performance. A plot of ranks over time (Figures 2 and 3 in the supplementary material) quantifies this across all 13 datasets, showing that AUTO-SKLEARN starts with reasonable but not optimal performance and effectively searches its more general configuration space to converge to the best overall performance over time.

Table 4 compares the results of the various preprocessors against AUTO-SKLEARN. As in the comparison of classifiers above, AUTO-SKLEARN showed the most robust performance: it performed best on three of the datasets and was not statistically significantly worse than the best preprocessor on another 8 of the 13.

8 Discussion and Conclusion

We demonstrated that our new AutoML system AUTO-SKLEARN performs favorably against the previous state of the art in AutoML, and that our meta-learning and ensemble improvements for AutoML yield further efficiency and robustness. This finding is backed by the fact that AUTO-SKLEARN won the auto-track in the first phase of ChaLearn's ongoing AutoML challenge. In this paper, we did not evaluate the use of AUTO-SKLEARN for interactive machine learning with an expert in the loop and weeks of CPU power, but we note that this mode has also led to a third place in the human track of the same challenge. As such, we believe that AUTO-SKLEARN is a promising system for use by both machine learning novices and experts. The source code of AUTO-SKLEARN is available under an open source license at https://github.com/automl/auto-sklearn.

Our system also has some shortcomings, which we would like to remove in future work. As one example, we have not yet tackled regression or semi-supervised problems.
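The ensemble improvements recapped above build on ensemble selection (Caruana et al. [24]): greedily adding, with replacement, whichever evaluated model most reduces validation error. A minimal stdlib-only sketch of that greedy loop, with hypothetical model names, toy validation predictions, and hard majority voting for brevity:

```python
# Greedy ensemble selection in the spirit of Caruana et al. [24]:
# starting from an empty ensemble, repeatedly add (with replacement)
# the model whose inclusion minimizes validation error under majority
# vote. Model names and predictions here are hypothetical toy data.

def ensemble_selection(predictions, y_valid, ensemble_size=5):
    """predictions maps a model name to its list of predicted labels
    on the validation set; returns the selected names (with repeats)."""
    selected, member_preds = [], []
    for _ in range(ensemble_size):
        best = None  # (error count, name, prediction vector)
        for name, preds in predictions.items():
            candidate = member_preds + [preds]
            # majority vote over the candidate ensemble, per example
            voted = [max(set(col), key=col.count) for col in zip(*candidate)]
            err = sum(v != y for v, y in zip(voted, y_valid))
            if best is None or err < best[0]:
                best = (err, name, preds)
        selected.append(best[1])
        member_preds.append(best[2])
    return selected

y_valid = [0, 1, 1, 0, 1, 0]
predictions = {                      # hypothetical validation predictions
    "random forest": [0, 1, 1, 0, 1, 1],
    "kernel SVM":    [0, 1, 0, 0, 1, 1],
    "kNN":           [1, 1, 1, 0, 1, 1],
}
selection = ensemble_selection(predictions, y_valid, ensemble_size=3)
```

AUTO-SKLEARN applies this idea to the models evaluated during the Bayesian optimization run; the production system averages predicted class probabilities rather than taking hard votes, but the greedy selection loop is the same.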
Most importantly, though, the focus on scikit-learn implied a focus on small to medium-sized datasets, and an obvious direction for future work will be to apply our methods to modern deep learning systems that yield state-of-the-art performance on large datasets; we expect that, in that domain in particular, automated ensemble construction will lead to tangible performance improvements over Bayesian optimization.

Acknowledgments

This work was supported by the German Research Foundation (DFG), under Priority Programme Autonomous Learning (SPP 1527, grant HU 1900/3-1), under Emmy Noether grant HU 1900/2-1, and under the BrainLinks-BrainTools Cluster of Excellence (grant number EXC 1086).

References

[1] I. Guyon, K. Bennett, G. Cawley, H. Escalante, S. Escalera, T. Ho, N. Macià, B. Ray, M. Saeed, A. Statnikov, and E. Viegas. Design of the 2015 ChaLearn AutoML Challenge. In Proc. of IJCNN'15, 2015.
[2] C. Thornton, F. Hutter, H. Hoos, and K. Leyton-Brown. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In Proc. of KDD'13, pages 847–855, 2013.
[3] E. Brochu, V. Cora, and N. de Freitas. A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning. CoRR, abs/1012.2599, 2010.
[4] M. Feurer, J. Springenberg, and F. Hutter. Initializing Bayesian hyperparameter optimization via meta-learning. In Proc. of AAAI'15, pages 1128–1135, 2015.
[5] M. Reif, F. Shafait, and A. Dengel. Meta-learning for evolutionary parameter optimization of classifiers. Machine Learning, 87:357–380, 2012.
[6] T. Gomes, R. Prudêncio, C. Soares, A. Rossi, and A. Carvalho. Combining meta-learning and search techniques to select parameters for support vector machines. Neurocomputing, 75(1):3–13, 2012.
[7] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. JMLR, 12:2825–2830, 2011.
[8] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten. The WEKA data mining software: An update. SIGKDD, 11(1):10–18, 2009.
[9] F. Hutter, H. Hoos, and K. Leyton-Brown. Sequential model-based optimization for general algorithm configuration. In Proc. of LION'11, pages 507–523, 2011.
[10] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl. Algorithms for hyper-parameter optimization. In Proc. of NIPS'11, pages 2546–2554, 2011.
[11] J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Proc. of NIPS'12, pages 2960–2968, 2012.
[12] K. Eggensperger, M. Feurer, F. Hutter, J. Bergstra, J. Snoek, H. Hoos, and K. Leyton-Brown. Towards an empirical foundation for assessing Bayesian optimization of hyperparameters. In NIPS Workshop on Bayesian Optimization in Theory and Practice, 2013.
[13] B. Komer, J. Bergstra, and C. Eliasmith. Hyperopt-sklearn: Automatic hyperparameter configuration for scikit-learn. In ICML Workshop on AutoML, 2014.
[14] L. Breiman. Random forests. MLJ, 45:5–32, 2001.
[15] P. Brazdil, C. Giraud-Carrier, C. Soares, and R. Vilalta. Metalearning: Applications to Data Mining. Springer, 2009.
[16] R. Bardenet, M. Brendel, B. Kégl, and M. Sebag. Collaborative hyperparameter tuning. In Proc. of ICML'13 [28], pages 199–207.
[17] D. Yogatama and G. Mann. Efficient transfer learning method for automatic hyperparameter tuning. In Proc. of AISTATS'14, pages 1077–1085, 2014.
[18] J. Vanschoren, J. van Rijn, B. Bischl, and L. Torgo. OpenML: Networked science in machine learning. SIGKDD Explorations, 15(2):49–60, 2013.
[19] D. Michie, D. Spiegelhalter, C. Taylor, and J. Campbell. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994.
[20] A. Kalousis. Algorithm Selection via Meta-Learning. PhD thesis, University of Geneva, 2002.
[21] B. Pfahringer, H. Bensusan, and C. Giraud-Carrier. Meta-learning by landmarking various learning algorithms. In Proc. of ICML'00, pages 743–750, 2000.
[22] I. Guyon, A. Saffari, G. Dror, and G. Cawley. Model selection: Beyond the Bayesian/Frequentist divide. JMLR, 11:61–87, 2010.
[23] A. Lacoste, M. Marchand, F. Laviolette, and H. Larochelle. Agnostic Bayesian learning of ensembles. In Proc. of ICML'14, pages 611–619, 2014.
[24] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In Proc. of ICML'04, page 18, 2004.
[25] R. Caruana, A. Munson, and A. Niculescu-Mizil. Getting the most out of ensemble selection. In Proc. of ICDM'06, pages 828–833, 2006.
[26] D. Wolpert. Stacked generalization. Neural Networks, 5:241–259, 1992.
[27] G. Hamerly and C. Elkan. Learning the k in k-means. In Proc. of NIPS'04, pages 281–288, 2004.
[28] Proc. of ICML'13, 2014.