{"title": "Regularized Gradient Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 5449, "page_last": 5458, "abstract": "Gradient Boosting (\\GB) is a popular and very successful ensemble method for binary trees. While various types of regularization of the base predictors are used with this algorithm, the theory that connects such regularizations with  generalization guarantees is poorly understood. We fill this gap by deriving data-dependent learning guarantees for \\GB\\ used with \\emph{regularization}, expressed in terms of the Rademacher complexities of the constrained families of base predictors. We introduce a new algorithm, called \\rgb\\, that directly benefits from these generalization bounds and that, at every boosting round, applies the \\emph{Structural Risk Minimization} principle to search for a base predictor with the best empirical fit versus complexity trade-off.\nInspired by \\emph{Randomized Coordinate Descent} we provide a scalable implementation of our algorithm, able to search over large families of base predictors. Finally, we provide experimental results, demonstrating that our algorithm achieves significantly better out-of-sample performance on multiple datasets than the standard \\GB\\ algorithm used with its regularization.", "full_text": "Regularized Gradient Boosting\n\nCorinna Cortes\nGoogle Research\n\nNew York, NY 10011\ncorinna@google.com\n\nMehryar Mohri\n\nDmitry Storcheus\n\nGoogle & Courant Institute\n\nCourant Institute & Google\n\nNew York, NY 10012\nmohri@google.com\n\nNew York, NY 10012\n\ndstorcheus@google.com\n\nAbstract\n\nGradient Boosting (GB) is a popular and very successful ensemble method for\nbinary trees. While various types of regularization of the base predictors are used\nwith this algorithm, the theory that connects such regularizations with generaliza-\ntion guarantees is poorly understood. 
We \ufb01ll this gap by deriving data-dependent\nlearning guarantees for GB used with regularization, expressed in terms of the\nRademacher complexities of the constrained families of base predictors. We intro-\nduce a new algorithm, called RGB, that directly bene\ufb01ts from these generalization\nbounds and that, at every boosting round, applies the Structural Risk Minimization\nprinciple to search for a base predictor with the best empirical \ufb01t versus complexity\ntrade-off. Inspired by Randomized Coordinate Descent we provide a scalable\nimplementation of our algorithm, able to search over large families of base predic-\ntors. Finally, we provide experimental results, demonstrating that our algorithm\nachieves signi\ufb01cantly better out-of-sample performance on multiple datasets than\nthe standard GB algorithm used with its regularization.\n\n1\n\nIntroduction\n\nEnsemble methods form a powerful family of techniques in machine learning that combine multiple\nbase predictors to create more accurate ones. These methods are often very effective in practice and\ncan achieve a signi\ufb01cant performance improvement over the individual base predictors [Quinlan\net al., 1996, Caruana et al., 2004, Freund et al., 1996, Dietterich, 2000]. ADABOOST [Freund and\nSchapire, 1997] and its variants are among the most prominent ensemble methods since they are both\nvery effective in practice and bene\ufb01t from well-studied theoretical margin guarantees [Freund and\nSchapire, 1997, Koltchinskii and Panchenko, 2002].\n\nGradient Boosting (GB) [Friedman, 2001] is another popular tree-based ensemble method that has\ninspired a number of widely-used software libraries (e.g., XGBOOST [Chen and Guestrin, 2016],\nMART [Friedman, 2002], and DART [Rashmi and Gilad-Bachrach, 2015]) and has frequently\nranked among the top in benchmark competitions such as Kaggle. 
But, while it is often introduced\nand presented differently, GB exactly coincides with AdaBoost, when the objective function used is\nthe exponential function, as shown for example by [Schapire and Freund, 2012]. More generally,\nboth of these algorithms are instances of Functional Gradient Descent [Mason et al., 2000, Grubb and\nBagnell, 2011] when non-increasing convex and differentiable upper bounds on the zero-one loss are\nused. Viewed from the Functional Gradient Descent perspective, at every boosting step, GB seeks a\npredictor function h that is closest to the functional gradient of the objective within some constrained\nfamily of base predictors H. Specifying this base predictor family H such that the selected function\ndoes not over\ufb01t the gradient, as well as de\ufb01ning an ef\ufb01cient search procedure over H is crucial for\nthe success of the algorithm. In most practical instances, several types of constraints are imposed to\ndo so. As an example, for binary regression trees, XGBOOST bounds the number of leaves and the\nnorm of the leaf values vector. This can be viewed as a regularization. However, to our knowledge,\nno theoretical analysis has been provided for these commonly-used constraints.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fA natural question is whether one can derive learning guarantees that explain how this regularization\non H, and, perhaps even more general forms of constraints on functions h \u2208 H, are connected to\nthe generalization performance of GB. We seek inspiration from the margin-based learning bounds\ngiven for ADABOOST [Schapire et al., 1997, Mohri et al., 2012]. 
These guarantees, however, do\nnot provide a detailed analysis of the constraints on the families of tree base predictors, nor do they\nprovide guidance on how to conduct an ef\ufb01cient search of these families to select a predictor during\neach boosting round.\n\nWe \ufb01ll this gap by providing a comprehensive analysis of regularization in GB and derive learning\nguarantees that explain what type of regularization should be used and how. We give data-dependent\nlearning bounds for GB with regularization, expressed in terms of the Rademacher complexities of\nthe constrained hypotheses\u2019 sub-families, from which the base predictors are selected, as well as the\nensemble mixture weights. We present a new algorithm, called RGB for Regularized Gradient Boost-\ning, which generalizes the existing gradient boosting methods by introducing a general functional\nq-norm constraint for the families of the tree base predictors.\n\nOur algorithm and its objective function are directly guided by the theory we develop. Our bound\nsuggests that the Structural Risk Minimization principle (SRM) [Vapnik, 1992] should be used to\nbreak down H into subsets of varying complexities and, at each round, select a base learner h\nfrom a subset that provides the best trade-off between proximity to the functional gradient and the\ncomplexity.\n\nApplying SRM to search over subsets of H is challenging, since often these subsets are extremely\nrich, possibly in\ufb01nite. 
One example is the family of decision trees with bounded depth used in GB.\nWe provide a solution to the problem of expensive search and show how Randomized Coordinate\nDescent [Nesterov, 2012] can be used to search over H efficiently, using our generalization bounds.\n\nFinally, this paper provides experimental results, demonstrating that our algorithm achieves significantly\nbetter out-of-sample performance than baselines such as XGBOOST on multiple datasets.\nWe give specific bounds, as well as the pseudocode and experimental results, for the families of\nbinary regression trees, but our analysis can be extended to broader families of functions, such as\nSVMs [Cortes and Vapnik, 1995] and Deep Neural Networks [LeCun et al., 2015].\n\nThe paper is organized as follows. In Section 2, we introduce what we name a Regularized Gradient\nBoosting framework. In Section 3, we derive a Rademacher complexity bound on the families of\nregularized regression trees, which allows us to establish learning guarantees for Regularized Gradient\nBoosting. This bound directly inspires the optimization objective and the RGB algorithm presented\nin Section 4, which benefits from the guarantees following from the SRM principle. A non-uniform\nrandomized search over the families of base predictors provides an efficient solution. In Section 5,\nwe present our experimental results, which illustrate the benefits of the RGB algorithm.\n\n2 Regularized Gradient Boosting\n\nIn this section, we examine the correspondence between gradient descent in functional spaces and\ncoordinate descent in vector spaces. This connection will help us rigorously define a Regularized\nGradient Boosting learning scenario and develop a scalable implementation for it.\n\n2.1 Gradient Boosting as Functional Gradient Descent\n\nLet X denote the input space, and let F be an inner product space of functions from X to R. 
We\ndefine a restricted family of functions H \u2286 F to be a set of base hypotheses. In a standard supervised\nlearning scenario, the training and test points are drawn i.i.d. according to some distribution D over\nX \u00d7 {\u22121, 1}, and $S = \\{(x_1, y_1), \\ldots, (x_m, y_m)\\}$ is a training sample of size m drawn from $D^m$. In\nthis scenario, a general boosting algorithm selects a sequence of functions $h_1, \\ldots, h_T$ from H to\nminimize a certain empirical loss L : F \u2192 R [Friedman, 2001, Grubb and Bagnell, 2011, Mason\net al., 2000, Schapire, 1999, Cortes et al., 2014]. The specification of H and the method of selecting\neach $h_t \\in H$ are essential for the success of the boosting algorithms. In fact, different answers to these\ntwo questions have resulted in distinct and separately-studied algorithms, such as GB, ADABOOST,\nand LOGITBOOST [Friedman et al., 1998].\n\nThe goal of boosting algorithms is typically to minimize an empirical loss functional:\n\n$$L(F) = \\frac{1}{m} \\sum_{i=1}^{m} \\Phi\\big(y_i, F(x_i)\\big), \\tag{1}$$\n\nwhere $F(x) = \\sum_{t=1}^{T} \\alpha_t h_t(x)$ such that $\\forall t \\in [1, T] : h_t \\in H$. Popular ensemble learning algorithms,\nsuch as ADABOOST and GB, despite having originated in different research communities at different\ntimes, are particular instances of a more general algorithm, Functional Gradient Descent. Functional\nGradient Descent views the objective in Equation 1 as a functional rather than a\nvector-valued function, with the goal of minimizing L over F by taking steps in the direction of\nsteepest descent, $F \\leftarrow F - \\eta \\nabla L(F)$, for some positive learning rate $\\eta \\in \\mathbb{R}$. In the learning scenario\ndescribed above, only the trace of F on $x_1, \\ldots, x_m$ is observable; therefore, the functional gradient\nof L is $\\nabla L = \\big[\\frac{\\partial L(F)}{\\partial F(x_1)}, \\ldots, \\frac{\\partial L(F)}{\\partial F(x_m)}\\big]$. This makes the Functional Gradient Descent update equal to\n$F(x_i) \\leftarrow F(x_i) - \\eta \\frac{\\partial L(F)}{\\partial F(x_i)}$. Of course, to make sure this functional update is well defined and to\navoid over-fitting, it is natural to restrict F(x) to some hypothesis set H, which implies the following\nform of the functional update:\n\n$$h = \\operatorname*{argmin}_{h \\in H} d(\\nabla L, h), \\tag{2}$$\n\nwhere d is some distance measure. This means that h is chosen to be the function in H closest\nto the projection of \u2207L onto H. The update in Equation 2 is a fundamental but not well-studied\ncomponent of virtually all boosting methods. Simply by varying the choice of H and d, this single\nequation recovers most widely-used boosting algorithms.\n\nIf we restrict the optimization steps to a set of base hypotheses H, then each step is chosen to be the\nfunction closest in direction to the negative gradient, which means it maximizes\n\n$$-\\nabla L \\cdot h = -\\sum_{i=1}^{m} \\frac{\\partial L(F)}{\\partial F(x_i)} h(x_i). \\tag{3}$$\n\nIn particular, if $\\Phi(y_i, f(x_i)) = e^{-y_i f(x_i)}$, then Functional Gradient Descent recovers ADABOOST,\nand if $\\Phi(y_i, f(x_i)) = \\log\\big(1 + e^{-y_i f(x_i)}\\big)$, then it recovers LOGITBOOST. When, instead of maximizing the\nnegative inner product $-\\nabla L \\cdot h$, we minimize the distance $\\| -\\nabla L(F) - h \\|_2^2$, we recover the GB\nalgorithm.\n\n2.2 Gradient Boosting as Vector Space Coordinate Descent\n\nThere is an equivalence relation between gradient descent in functional spaces and coordinate descent\nin vector spaces that often helps to obtain efficient algorithms for ensemble learning. 
At each of the\nT steps of the Functional Gradient Descent, \u2207L is projected onto H, hence the final solution F can\nbe expressed as $F_\\alpha = \\sum_{t=1}^{T} \\alpha_t h_t$ for some $\\alpha \\in \\mathbb{R}^T$, where $\\forall 1 \\leq t \\leq T : h_t \\in H_{I_t} \\subseteq H$ and\n$H_{I_t}$ indicates the subset of H selected at the t-th step. The subsets $H_{I_t}$ can be viewed as coordinate\nblocks in H. In this view, at boosting step t a particular subspace $H_{I_t}$ out of $\\{H_1, \\ldots, H_K\\}$ is\nselected; then a base predictor $h_t \\in H_{I_t}$ from that subspace is added to the ensemble.\n\nThis allows switching from minimizing the loss functional L(F) to minimizing the loss function\n$L(\\alpha) = L(F_\\alpha)$:\n\n$$L(\\alpha) = \\frac{1}{m} \\sum_{i=1}^{m} \\Phi\\Big(y_i, \\sum_{t=1}^{T} \\alpha_t h_t(x_i)\\Big) \\tag{4}$$\n\nover the ensemble weights vector $\\alpha \\in \\mathbb{R}^T$. Selecting a projection $h_t$ and a step size $\\alpha_t$ on the t-th\nstep of the Functional Gradient Descent on $L(F_\\alpha)$ or alternatively selecting a coordinate $\\alpha_t$ on the\nt-th step of the vector space coordinate descent on $L(\\alpha)$ both result in the same form of the update\n$F_{\\alpha,t} = F_{\\alpha,t-1} + \\alpha_t h_t$. Additionally, the full sequence of these updates for t from 1 to T is equal\nsince, by the chain rule,\n\n$$\\forall 1 \\leq t \\leq T : -\\frac{\\partial L(\\alpha_t)}{\\partial \\alpha_t} = -\\sum_{i=1}^{m} \\frac{\\partial L(F_\\alpha)}{\\partial F_\\alpha(x_i)} h_t(x_i) = -\\nabla L \\cdot h_t, \\tag{5}$$\n\nwhich means that $\\min_{1 \\leq t \\leq T} -\\frac{\\partial L(\\alpha_t)}{\\partial \\alpha_t}$ selected by the coordinate descent is equal to $\\min_{1 \\leq t \\leq T} -\\nabla L \\cdot h_t$\nselected by Functional Gradient Descent.\n\nThis equivalence illustrates two important points. First, coordinate descent methods can be used to\nprovide efficient numerical solutions for boosting. Second, the proper construction of the subsets\n$H_{I_t}$ such that $h_t \\in H_{I_t} \\subseteq H$ is crucial for the success of boosting algorithms. 
We rely on this\nequivalence when presenting a coordinate-descent-style algorithm for minimizing the regularized\nboosting objective that scales well to large families of base predictors.\n\n2.3 Regularized Gradient Boosting\n\nIn this subsection, we describe the main novelty of our work \u2013 the analysis of regularization applied\nto GB. We formulate what we name a Regularized Gradient Boosting framework and show the\nsubtle connection between the regularization and the properties of $H_k \\subseteq H$. As we shall see, the\nregularization terms are not explicitly introduced in the definition of the objective, but only in the\ndefinition of an approximation to the functional gradient.\n\nWhile the unregularized projection step, as in Equation 2, has been extensively studied for GB,\nthe fundamental theory of the regularization commonly used is missing. However, a number of\nempirical studies and software frameworks [Sun et al., 2014, Chen and Guestrin, 2016] indicate that\nintroducing regularization to this step is extremely beneficial. For example, the popular XGBOOST\nlibrary, dedicated to boosted decision trees, regularizes the norm of the leaf values, as well as the\nnumber of leaves. We fill this gap by providing a theory that links regularization with learning\nguarantees for GB algorithms.\n\nFor a convex function \u2126 : F \u2192 R, a closed subspace H \u2286 F and $\\beta \\in \\mathbb{R}_+$, let the Regularized\nGradient Boosting step be defined by\n\n$$h = \\operatorname*{argmin}_{h \\in H} d(\\nabla L, h) + \\beta \\Omega(h). \\tag{6}$$\n\nGiven the convexity of \u2126, this step is equivalent to $h = \\operatorname*{argmin}_{h \\in \\widehat{H}} d(\\nabla L, h)$, where $\\widehat{H} = H \\cap \\{h : \\Omega(h) \\leq \\beta\\}$.\nSuch a reduction illustrates the subtle, yet extremely important, connection between\nregularization and the definition of hypothesis set H. 
The equivalence between vector space coordinate\ndescent and Functional Gradient Descent presented in Section 2.2, meaning that both of these methods\niteratively select the same sequence of functions $h_t \\in H_{I_t} \\subseteq H$, suggests that a natural way to use\nregularization for boosting is to define $F = \\operatorname{conv}\\big(\\cup_{k=1}^{K} H_k\\big)$, where $H_k = \\{h : \\theta_{k-1} < \\Omega(h) \\leq \\theta_k\\}$\nare disjoint sets of functions for a set of parameters $[\\theta_1, \\ldots, \\theta_K]$. Note that, with this formulation,\nthe regularization is not in the objective function; instead the search for the gradient approximation is\nconstrained by a regularization.\n\nWe show, in the following section, that such a definition of F allows us to obtain margin-based\nlearning guarantees for the Regularized Gradient Boosting that are dependent on the complexities of\neach individual $H_k$.\n\n3 Learning Guarantees\n\nAs described in the previous section, by projecting the functional gradient onto $F = \\operatorname{conv}\\big(\\cup_{k=1}^{K} H_k\\big)$\nat each step, we are able to learn an ensemble function $f = \\sum_{t=1}^{T} \\alpha_t f_t \\in F$, where the $H_k$s\nare families of functions with varying complexity. Thus, it is natural to seek learning guarantees\ndepending on the properties of each $H_k$ and the mixture weight vector $\\alpha = [\\alpha_1, \\ldots, \\alpha_T]$.\n\nThe first margin bound based on the VC-dimension for ensembles $\\sum_{t=1}^{T} \\alpha_t f_t$ was given by Freund\nand Schapire [1997]. Later, tighter data-dependent bounds in terms of the Rademacher complexity of\nthe underlying function class H were given by Koltchinskii and Panchenko [2002], see also [Mohri\net al., 2018]. For the specific case where $H = \\operatorname{conv}\\big(\\cup_{k=1}^{K} H_k\\big)$, Rademacher complexity-based\nguarantees were given in [Cortes et al., 2014]. In this section, we will use these theoretical results to\nderive margin-based guarantees based on the Rademacher complexities of the families of regularized\ndecision trees $H_k$ and the mixture weights \u03b1. 
The bounds that we show, being data-dependent, will\nnot only fill in the missing generalization theory for the existing gradient tree boosting frameworks but\nalso motivate a new scalable learning algorithm for the Regularized Gradient Boosting framework,\ncalled RGB, in Section 4.\n\nHere, we restrict our analysis to the hypothesis families $H_k$ of regression trees. However, our results\ncan be extended to other families, such as kernel-based hypotheses and neural networks, so long as\nthe sample Rademacher complexities of these families can be bounded.\n\nEach leaf l in a regression tree contains a real-valued number $w_l$ providing the output value of the\ntree for any sample point allocated to that leaf; thus, we let w be a vector of stacked leaf values. The\nfunction computed by a regression tree can thus be represented by $h(x) = \\sum_{l \\in \\mathrm{leaves}(h)} w_l I\\{x \\in \\mathrm{leaf}_l\\}$,\nwhere $I\\{x \\in \\mathrm{leaf}_l\\}$ is the indicator function for sample point $x \\in \\mathbb{R}^d$ being allocated to $\\mathrm{leaf}_l$; this\nvalue h(x) can be used for classification in a straightforward manner by thresholding.\n\nThe node partition functions in binary regression trees are of the form $[x]_j \\leq \\theta$ for some feature index\n$j \\in [1, d]$ and $\\theta \\in \\mathbb{R}$, which means that if $[x_i]_j \\leq \\theta$ for a sample point $x_i \\in \\mathbb{R}^d$, then $x_i$ is allocated\nto the left subtree and to the right subtree otherwise. Let $H_{n,\\lambda,q}$ be the set of all regularized binary\nregression trees with the number of internal nodes bounded by n and a leaf values vector w such\nthat $\\|w\\|_q \\leq \\lambda$, $q \\geq 1$. Special instances of these families of trees are widely used in practice. For\nexample, $H_{n,\\lambda,1}$ and $H_{n,\\lambda,2}$ are implemented in XGBOOST.\nTheorem 1. For any sample $S = (x_1, \\ldots, x_m)$, the empirical Rademacher complexity of a hypothesis\nset H is defined by $\\widehat{\\mathfrak{R}}_S(H) = \\mathbb{E}_\\sigma\\big[\\sup_{h \\in H} \\sum_{i=1}^{m} \\sigma_i h(x_i)\\big]$, where the $\\sigma_i$s, $i \\in [m]$, are independent\nuniformly distributed random variables taking values in {\u22121, 1}. Let d be the input data dimension.\nThe following upper bound holds for the empirical Rademacher complexity of $H_{n,\\lambda,q}$:\n\n$$\\widehat{\\mathfrak{R}}_S(H_{n,\\lambda,q}) \\leq \\lambda \\sqrt{\\frac{(4n + 2) \\log_2(d + 2) \\log(m + 1)}{m}}.$$\n\nThe proof of Theorem 1 is given in the Appendix. This bound shows how the empirical Rademacher\ncomplexity of the regularized decision trees depends both on the number of internal nodes n and\non the upper bound \u03bb on the q-norm of leaf values.\n\nUsing this bound, we can now derive our margin-based learning guarantees for the family F.\nLet R(f) denote the binary classification error of f \u2208 F, $R(f) = \\mathbb{E}_{(x,y) \\sim D}\\, I\\{y f(x) \\leq 0\\}$, and\n$R_\\rho(f)$ its \u03c1-margin loss, $R_\\rho(f) = \\mathbb{E}_{(x,y) \\sim D}\\, I\\{y f(x) \\leq \\rho\\}$. Let $\\widehat{R}_{S,\\rho}(f) = \\mathbb{E}_{(x,y) \\sim S}\\, I\\{y f(x) \\leq \\rho\\}$\ndenote its empirical \u03c1-margin loss for a sample S.\nTheorem 2. Fix \u03c1 > 0. Let $H_k = H_{n_k,\\lambda_k,q_k}$, where $(n_k)$, $(\\lambda_k)$ are sequences of constraints on\nthe number of internal nodes n and the leaf vector norm $\\|w\\|_q$. Define $F = \\operatorname{conv}\\big(\\cup_{k=1}^{K} H_k\\big)$. Then,\nfor any \u03b4 > 0, with probability at least 1 \u2212 \u03b4 over the draw of a sample S of size m, the following\ninequality holds for all $f = \\sum_{t=1}^{T} \\alpha_t h_t \\in F$:\n\n$$R(f) \\leq \\widehat{R}_{S,\\rho}(f) + \\frac{4}{\\rho} \\sum_{t=1}^{T} \\alpha_t \\lambda_{I_t} \\sqrt{\\frac{(4 n_{I_t} + 2) \\log_2(d + 2) \\log(m + 1)}{m}} + C(m, K),$$\n\nwhere $I_t$ is the index of the subclass selected at time t and $C(m, K) = O\\Big(\\sqrt{\\frac{\\log(K)}{\\rho^2 m} \\log\\big[\\frac{\\rho^2 m}{\\log(K)}\\big]}\\Big)$.\n\nThe proof of Theorem 2 is given in the Appendix. 
The generalization bound of Theorem 2 motivates\na specific algorithm for Regularized Gradient Boosting, described and discussed in the next section.\n\n4 Algorithm\n\nThe multiplicative structure of the bound in Theorem 2 with respect to the mixture weights\n$[\\alpha_1, \\ldots, \\alpha_T]$ and the complexities of $H_{I_t}$ suggests the use of these complexities (or their upper bounds)\nin the regularization \u2126(h). Additionally, one may upper-bound the empirical loss function of\nu \u21a6 I{u \u2264 0} in Theorem 2, leading to the following objective:\n\n$$L(\\alpha) = \\frac{1}{m} \\sum_{i=1}^{m} \\Phi\\Big(y_i, \\sum_{t=1}^{T} \\alpha_t h_t(x_i)\\Big) + \\beta \\sum_{t=1}^{T} |\\alpha_t| \\lambda_{I_t} \\sqrt{\\frac{(4 n_{I_t} + 2) \\log_2(d + 2) \\log(m + 1)}{m}}. \\tag{7}$$\n\nMinimizing this objective with vector space coordinate descent is equivalent to solving for a projection\nat each Functional Gradient Descent step of the form\n\n$$h_t = \\operatorname*{argmin}_{h \\in H} d(\\nabla L, h) + \\beta \\sum_{k=1}^{K} \\lambda_k \\sqrt{\\frac{(4 n_k + 2) \\log_2(d + 2) \\log(m + 1)}{m}} \\, I\\{h \\in H_k\\}. \\tag{8}$$\n\nIn this section we will devise an algorithm for minimizing the regularized objective $L(\\alpha)$, called\nRGB, that is able to scale to large families of base predictors.\n\n4.1 Randomized Coordinate Descent\n\nThe practical challenge of building an ensemble of base predictors in the Regularized Gradient\nBoosting scenario is to both define the hypothesis sets $H_k$ and implement an efficient search across\nthese sets to select the best update direction $h_t$ at each optimization step. Applying coordinate\ndescent to the objective in Equation 7 may be feasible for finite hypothesis sets; however, we are\noften required to work with infinite spaces of subfamilies of functions. A typical example would be\none where each subfamily is a decision tree with a fixed topology and fixed leaf values. 
It is common\nto resort to heuristics or to discretize the search space to define an approximate search.\n\nTo solve the problem of an extensive search over $H_k$, we propose a novel method for boosting updates\nusing randomization applied to the functional space. Random selection of base learners for GB in\nthe context of Randomized Coordinate Descent has been shown to be successful in practice. For\nexample, Lu and Mazumder [2018] demonstrated that uniform sampling helps make the search\nover base hypothesis classes more scalable, and gave favorable convergence guarantees for this method.\nNesterov [2012] introduced probabilistic convergence guarantees for Randomized Coordinate Descent\nexpressed in terms of the local smoothness properties of the objective and suggested a distribution to\nsample the coordinates.\n\nInspired by the analysis of Nesterov [2012], our work is the first one to provide a fundamentally-justified\nmethod of searching over the subspaces $H_k$, an algorithm that is both scalable and admits\nconvergence guarantees. The RGB algorithm picks at each round at random a subset $\\{H_{t_1}, \\ldots, H_{t_S}\\}$.\nGiven a meaningful distribution over H that captures the steepness of the objective $L(\\alpha)$ within\neach of these subsets, RGB is able to learn an ensemble of functions from families $H_k$ of varying\ncomplexity. In the following, we show how to apply the Randomized Coordinate Descent method, as\nin [Nesterov, 2012], to the objective in Equation 7.\n\n4.2 Lipschitz-Continuous Gradients\n\nConsider the problem of minimizing $L(\\alpha)$ as in Equation 7. The following lemma describes the\ncontinuity properties of the partial derivatives of $L(\\alpha)$, which are needed for the application of\nRandomized Coordinate Descent.\nLemma 3. Assume that \u03a6(y, h) is differentiable with respect to the second argument, and that $\\frac{\\partial \\Phi}{\\partial h}$ is\n$C_\\Phi(y)$-Lipschitz with respect to the second argument, for any fixed value y of the first argument. 
For\nall k \u2208 [0, K], define $L'_k(\\alpha) = \\frac{\\partial L}{\\partial \\alpha_k}$. Then, $L'_k(\\alpha)$ is $C_k$-Lipschitz with $C_k$ bounded as follows:\n\n$$C_k \\leq \\frac{1}{m} \\sum_{i=1}^{m} h_k^2(x_i) C_\\Phi(y_i). \\tag{9}$$\n\nRandomized Coordinate Descent samples the k-th coordinate with probability $p_k = C_k / \\sum_{j=1}^{K} C_j$.\nThe convergence guarantees for this procedure are given in [Nesterov, 2012] as a function of the\nLipschitz constants $C_k$.\n\nWe can further give upper bounds for the Lipschitz constants above to avoid the computation of the\nsums $\\sum_{i=1}^{m} h_k^2(x_i)$. If we introduce the vectors $\\mathbf{h}_k$ and $\\mathbf{C}_\\Phi$ in $\\mathbb{R}^m$ such that $[\\mathbf{h}_k]_i = h_k^2(x_i)$ and\n$[\\mathbf{C}_\\Phi]_i = C_\\Phi(y_i)$, then, by H\u00f6lder\u2019s inequality,\n\n$$C_k \\leq \\frac{1}{m} \\sum_{i=1}^{m} h_k^2(x_i) C_\\Phi(y_i) = \\frac{1}{m} \\mathbf{h}_k \\cdot \\mathbf{C}_\\Phi \\leq \\frac{1}{m} \\|\\mathbf{h}_k\\|_r \\|\\mathbf{C}_\\Phi\\|_q, \\tag{10}$$\n\nwhere $\\frac{1}{r} + \\frac{1}{q} = 1$. Various (r, q)-dual norms can be used, depending on the computational constraints\nand the complexity of the hypothesis classes for the application of the Randomized Coordinate\nDescent method. For example, using $\\|\\mathbf{h}_k\\|_1$ and $\\|\\mathbf{C}_\\Phi\\|_\\infty$ gives the following upper bound:\n$C_k \\leq \\frac{1}{m} \\big[\\max_{1 \\leq i \\leq m} C_\\Phi(y_i)\\big] \\sum_{i=1}^{m} h_k^2(x_i)$.\n\nThe developed generalization bounds imply the Lipschitz constants and thus define the Randomized\nCoordinate Descent steps for the minimization of $L(\\alpha)$, controlling its convergence. To illustrate this\npoint, the convergence rate stated in [Nesterov, 2012] is as follows:\n\n$$\\mathbb{E}_{t-1}\\big[L(\\alpha_t)\\big] - L(\\alpha^\\star) \\leq \\frac{2}{t + 1} \\Big(\\sum_{j=1}^{K} C_j\\Big) R_0^2(\\alpha_0), \\tag{11}$$\n\nwhere $\\alpha_0$ is the starting point, $\\alpha^\\star$ is the global minimizer of $L(\\alpha)$ and $R_0(\\alpha_0)$ is the size of the\ninitial level set of the objective. The conditional expectation is taken over the random choice of the\nnext coordinate. The regularization applied to the base predictor families in our Regularized Gradient\nBoosting framework implies the bounds on $C_k$, thus controlling the convergence of the algorithm.\n\n4.3 Pseudocode\n\nAlgorithm 1 RGB. Input: \u03b1 = 0, F = 0\n1: for $t \\in [1, T]$ do\n2:     $[t_1, \\cdots, t_S] \\leftarrow P$\n3:     for $s \\in [1, S]$ do\n4:         $h_s \\leftarrow \\operatorname{argmin}_{h \\in H_{t_s}} \\frac{1}{m} \\sum_{i=1}^{m} \\Phi\\big(y_i, F - \\frac{1}{C_{t_s}} L'_{t_s}(\\alpha) h\\big)$\n5:     end for\n6:     $s^\\star = \\operatorname{argmin}_{s \\in [1, S]} \\big[\\frac{1}{m} \\sum_{i=1}^{m} \\Phi\\big(y_i, F - \\frac{1}{C_{t_s}} L'_{t_s}(\\alpha) h_s\\big) + \\beta \\Omega(h_{t_s})\\big]$\n7:     $\\alpha \\leftarrow \\alpha - \\frac{1}{C_{s^\\star}} L'_{s^\\star}(\\alpha) e_{t_{s^\\star}}$\n8:     $F \\leftarrow F - \\frac{1}{C_{s^\\star}} L'_{s^\\star}(\\alpha) h_{s^\\star}$\n9: end for\n\nThe pseudocode of our RGB algorithm is given in Algorithm 1. The algorithm seeks to minimize the\nobjective given in Equation 7, using Randomized Coordinate Descent. Let P be a discrete probability\ndistribution over [1, K] with $p_k = C_k / \\sum_{j=1}^{K} C_j$. Equivalently, P is the distribution over the base\nhypothesis sets $H_1, \\cdots, H_K$. 
At each draw from P, we select a sample $H_{t_1}, \\cdots, H_{t_S}$ of size S and,\nfrom this sample, select one function that provides the best trade-off between the decrease in the objective $L(\\alpha)$\nand the complexity bound of Theorem 1 for the underlying hypothesis family.\n\nThe local optimization procedure in Line 6 is an extra step required in the coordinate descent\nprocedure to select a single function from $H_{t_s}$. The step in Line 8 is required to select, out of S\nsampled directions, the one with the best trade-off between sample fit and complexity bounds. Note\nthat the evaluation of sampled candidates in Line 5 can be done in parallel, making the time of\nRGB per thread comparable to that of standard GB. More specifically, given a fixed sample of S\ncoordinates, the runtime of one RGB round is equal to the runtime of S rounds of GB when the same\nsubroutine is used for tree splitting.\n\n5 Experiments\n\nIn this section, we present the results of experiments with our RGB algorithm. We restrict our\nattention to learning an ensemble of the regularized regression trees as defined in the family $H_{n,\\lambda,q}$,\nand to simplify the presentation, we let q = 2, although similar experiments can be easily done for\nother norms. For the complexity of these base classifiers we use the bound derived in Theorem 1.\n\nTo define the subfamilies of base learners we impose a grid of size 7 on the maximum number of\ninternal nodes n \u2208 {2, 4, 8, 16, 32, 64, 256} and a grid of size 7 on \u03bb \u2208 {0.001, 0.01, 0.1, 0.5, 1, 2, 4}.\nFor each element from the Cartesian product of these grids, we assign $(n_k, \\lambda_k)$, thus defining the base\nfamilies $H_{n_k,\\lambda_k,2}$ and $F = \\operatorname{conv}\\big(\\cup_{k=1}^{49} H_{n_k,\\lambda_k,2}\\big)$. Given such a decomposition of the functional\nspace, we directly minimize the regularized objective in Equation 7 using Randomized Coordinate\nDescent with the distribution over the coordinate blocks as described above. 
We use the logistic loss\nas the per-instance loss \u03a6. For a given training sample, we normalize the regularization \u2126(h) to be in\n[0, 1] and tune the RGB parameter \u03b2 using a grid search over \u03b2 \u2208 {0.001, 0.01, 0.1, 0.3, 1}.\n\nSection 4 describes multiple ways to bound the coordinate-wise Lipschitz constants of the derivative\nof the objective function, resulting in various coordinate sampling distributions for the Randomized\nCoordinate Descent. For our experiments, and specifically for the families $H_{n_k,\\lambda_k,2}$, we bound the\nLipschitz constants by $C_k \\leq \\lambda_k \\big[\\max_{1 \\leq i \\leq m} C_\\Phi(y_i)\\big]$, which implies that the k-th coordinate is\nsampled with probability $p_k = \\lambda_k / \\sum_{j=1}^{K} \\lambda_j$, since the $\\max_{1 \\leq i \\leq m} C_\\Phi(y_i)$ terms cancel out (see\nLemma 4 in the Appendix for the derivation of this bound).\n\nTable 1: Experimental Results\n\nError %        sonar    cancer   diabetes  ocr17    ocr49    mnist17  mnist49  higgs\nRGB  Mean      26.94    5.19     28.86     0.90     3.10     0.43     1.53     28.60\n     (Std)     (2.10)   (0.97)   (4.85)    (0.45)   (0.69)   (0.10)   (0.38)   (0.41)\nGB   Mean      28.64    6.14     28.39     1.35     3.50     0.55     1.66     29.11\n     (Std)     (2.13)   (0.94)   (5.08)    (0.52)   (0.65)   (0.11)   (0.32)   (0.37)\nSignif.        5%       5%       -         2.5%     2.5%     2%       5%       2.5%\n\nSignif.: one-tailed, paired sample t-test\n\nAs a comparison, we run the standard GB, using the XGBOOST library with \u21132 regularization on the\nvector of leaf scores w. We use grid search to tune the hyperparameters of XGBOOST with a grid\nthat makes the families of trees explored comparable to the H defined for RGB above. Specifically,\nwe let the \u21132 norm regularization parameter be in {0.001, 0.01, 0.1, 0.5, 1, 2, 4}, the maximum tree\ndepth parameter in {1, 2, 3, 4, 5, 6, 7}, and the learning rate parameter in {0.001, 0.01, 0.1, 0.5, 1}.\nBoth GB and RGB are run for T = 100 boosting rounds. 
The hyperparameters are chosen via 5-fold\ncross-validation, and the standard errors for the best set of hyperparameters are reported.\n\nTable 1 shows the classification errors on the test sets for the UCI datasets studied, for both RGB and\nGB; see Table 2 in the appendix for details on the datasets. A one-tailed, paired sample t-test on the\npairs of results from the different trials demonstrates that these results are in general significant at a\n5% level or better. Only for one of the datasets, diabetes, with an input dimension of just 8, do we\nnot observe an improvement of RGB over GB. One natural hypothesis is that the larger the input\ndimension, the greater the need for proper regularization of the binary regression trees forming the base\nlearner, and the larger the advantage of the RGB algorithm.\n\nIn general, the results demonstrate that by randomly taking steps into coordinates that correspond to\nsubspaces H_t with a theoretically justified distribution, RGB can explore larger hypothesis families\nmore efficiently than the baseline methods. Furthermore, compared to baselines that operate on the\nsame hypothesis space H, by optimizing for the trade-off between sample fit and functional subclass\ncomplexity, RGB can reduce over-fitting, thereby achieving greater test accuracy on multiple datasets.\n\n6 Conclusion\n\nWe introduced and analyzed a general framework of Regularized Gradient Boosting, for which we also\ndevised an effective algorithm, RGB. In this framework, regularization is not directly incorporated as\na term in the loss function. Instead, its definition affects each boosting step by restricting the search\nfor the gradient approximation to a constrained subset of base functions. Our analysis is based upon\nstrong margin-based Rademacher complexity learning guarantees. 
These bounds suggest a natural\napproach for our optimization solution, which consists of dividing the space of base learners into\nsubfamilies of increasing complexity. For the special case of binary regression trees, we derived\nexplicit Rademacher complexity bounds that we subsequently exploited in the definition of our RGB\nalgorithm. Randomization over the subfamilies of base functions allows us to scale our algorithm to\nlarge families of base predictors. Our experimental results suggest improved performance, thanks to\na more efficient and theoretically motivated exploration of large function spaces without over-fitting.\nAlso, as already stated, the run-times of the algorithms are comparable, thereby making RGB a\nstrong alternative to XGBOOST. Finally, our analysis can be extended in a similar way to\nboosting with other families of base predictors, such as kernel-based hypothesis sets and Deep Neural\nNetworks.\n\nAcknowledgments\n\nWe thank our colleagues Natalia Ponomareva and Vitaly Kuznetsov for insightful discussions and\nfeedback. This work was partly supported by NSF CCF-1535987, NSF IIS-1618662, and a Google\nResearch Award.\n\nReferences\n\nR. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of\nmodels. In Proceedings of ICML, page 18. ACM, 2004.\n\nT. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM\nSIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785\u2013794. ACM,\n2016.\n\nC. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20(3):273\u2013297, 1995.\n\nC. Cortes, M. Mohri, and U. Syed. Deep boosting. In Proceedings of ICML, 2014.\n\nT. G. Dietterich. An experimental comparison of three methods for constructing ensembles of\ndecision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139\u2013157, 2000.\n\nY. Freund and R. E. Schapire. 
A decision-theoretic generalization of on-line learning and application\nto boosting. Journal of Computer and System Sciences, 55(1):119\u2013139, 1997.\n\nY. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96,\npages 148\u2013156. Citeseer, 1996.\n\nJ. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of boosting.\nAnnals of Statistics, 28:2000, 1998.\n\nJ. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of Statistics,\npages 1189\u20131232, 2001.\n\nJ. H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):\n367\u2013378, 2002.\n\nA. Grubb and J. A. Bagnell. Generalized boosting algorithms for convex optimization. arXiv preprint\narXiv:1105.2054, 2011.\n\nV. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization\nerror of combined classifiers. Annals of Statistics, 30, 2002.\n\nY. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436, 2015.\n\nH. Lu and R. Mazumder. Randomized gradient boosting machine. arXiv preprint arXiv:1810.10158,\n2018.\n\nL. Mason, J. Baxter, P. L. Bartlett, and M. R. Frean. Boosting algorithms as gradient descent. In\nAdvances in Neural Information Processing Systems, pages 512\u2013518, 2000.\n\nP. Massart and J. Picard. Concentration inequalities and model selection: Ecole d\u2019Et\u00e9 de Probabilit\u00e9s\nde Saint-Flour XXXIII - 2003. Number 1896 in Ecole d\u2019Et\u00e9 de Probabilit\u00e9s de Saint-Flour.\nSpringer-Verlag, 2007.\n\nM. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.\n\nM. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. The MIT Press,\nsecond edition, 2018.\n\nY. Nesterov. Efficiency of coordinate descent methods on huge-scale optimization problems. SIAM\nJournal on Optimization, 22(2):341\u2013362, 2012.\n\nJ. 
R. Quinlan et al. Bagging, boosting, and C4.5. In AAAI/IAAI, Vol. 1, pages 725\u2013730, 1996.\n\nK. V. Rashmi and R. Gilad-Bachrach. DART: Dropouts meet multiple additive regression trees. In\nAISTATS, pages 489\u2013497, 2015.\n\nR. E. Schapire. A brief introduction to boosting. In IJCAI, volume 99, pages 1401\u20131406, 1999.\n\nR. E. Schapire and Y. Freund. Boosting: Foundations and algorithms. MIT Press, 2012.\n\nR. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee. Boosting the margin: A new explanation for the\neffectiveness of voting methods. In Proceedings of ICML, pages 322\u2013330, 1997.\n\nP. Sun, T. Zhang, and J. Zhou. A convergence rate analysis for LogitBoost, MART and their variant. In\nICML, pages 1251\u20131259, 2014.\n\nV. Vapnik. Principles of risk minimization for learning theory. In Advances in Neural Information\nProcessing Systems, pages 831\u2013838, 1992.\n", "award": [], "sourceid": 2920, "authors": [{"given_name": "Corinna", "family_name": "Cortes", "institution": "Google Research"}, {"given_name": "Mehryar", "family_name": "Mohri", "institution": "Courant Inst. of Math. Sciences & Google Research"}, {"given_name": "Dmitry", "family_name": "Storcheus", "institution": "Google Research"}]}