{"title": "Variance-based Regularization with Convex Objectives", "book": "Advances in Neural Information Processing Systems", "page_first": 2971, "page_last": 2980, "abstract": "We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.", "full_text": "Variance-based Regularization with Convex\n\nObjectives\n\nHongseok Namkoong\nStanford University\n\nhnamk@stanford.edu\n\nJohn C. Duchi\n\nStanford University\n\njduchi@stanford.edu\n\nAbstract\n\nWe develop an approach to risk minimization and stochastic optimization that pro-\nvides a convex surrogate for variance, allowing near-optimal and computationally\nef\ufb01cient trading between approximation and estimation error. Our approach builds\noff of techniques for distributionally robust optimization and Owen\u2019s empirical\nlikelihood, and we provide a number of \ufb01nite-sample and asymptotic results char-\nacterizing the theoretical performance of the estimator. 
In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

1 Introduction

Let $\mathcal{X}$ be a sample space, $P_0$ a distribution on $\mathcal{X}$, and $\Theta$ a parameter space. For a loss function $\ell : \Theta \times \mathcal{X} \to \mathbb{R}$, consider the problem of finding $\theta \in \Theta$ minimizing the risk

$$R(\theta) := \mathbb{E}[\ell(\theta, X)] = \int \ell(\theta, x) \, dP(x) \qquad (1)$$

given a sample $\{X_1, \ldots, X_n\}$ drawn i.i.d. according to the distribution $P$. Under appropriate conditions on the loss $\ell$, the parameter space $\Theta$, and the random variables $X$, a number of researchers [2, 6, 12, 7, 3] have shown results of the form that with high probability,

$$R(\theta) \le \frac{1}{n} \sum_{i=1}^n \ell(\theta, X_i) + C_1 \sqrt{\frac{\mathrm{Var}(\ell(\theta, X))}{n}} + \frac{C_2}{n} \quad \text{for all } \theta \in \Theta, \qquad (2)$$

where $C_1$ and $C_2$ depend on the parameters of problem (1) and the desired confidence guarantee. Such bounds justify empirical risk minimization, which chooses $\widehat{\theta}_n$ to minimize $\frac{1}{n} \sum_{i=1}^n \ell(\theta, X_i)$ over $\theta \in \Theta$. Further, these bounds showcase a tradeoff between bias and variance, where we identify the bias (or approximation error) with the empirical risk $\frac{1}{n} \sum_{i=1}^n \ell(\theta, X_i)$, while the variance arises from the second term in the bound.

Considering the bias-variance tradeoff (2) in statistical learning, it is natural to instead choose $\theta$ to directly minimize a quantity trading between approximation and estimation error:

$$\frac{1}{n} \sum_{i=1}^n \ell(\theta, X_i) + C \sqrt{\frac{\mathrm{Var}_{\widehat{P}_n}(\ell(\theta, X))}{n}}, \qquad (3)$$

where $\mathrm{Var}_{\widehat{P}_n}$ denotes the empirical variance.
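To make the criterion concrete, the penalized objective (3) can be sketched in a few lines of numpy (the function name and the penalty constant `C` here are ours, for illustration); as the surrounding text notes, this quantity is generally non-convex in the model parameters even when the loss itself is convex.

```python
import numpy as np

def variance_penalized_risk(losses, C):
    """Empirical risk plus a variance penalty, in the spirit of objective (3).

    losses: per-example losses l(theta, X_i) evaluated at a fixed theta.
    C:      weight on the variance penalty term.
    Returns mean(losses) + C * sqrt(Var_hat(losses) / n), where Var_hat is
    the (1/n)-normalized empirical variance, matching Var_{P_n} in the text.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.size
    return losses.mean() + C * np.sqrt(losses.var() / n)
```

Minimizing this quantity over the parameters (say, by grid search in one dimension) is exactly what the convex robust formulation developed below approximates.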
Maurer and Pontil [16] consider this idea, giving guarantees on the convergence and good performance of such a procedure. Unfortunately, even when the loss $\ell$ is convex in $\theta$, the formulation (3) is generally non-convex, which limits the applicability of procedures that minimize the variance-corrected empirical risk (3). In this paper, we develop an approach based on Owen's empirical likelihood [19] and ideas from distributionally robust optimization [4, 5, 10] that, whenever the loss $\ell$ is convex, provides a tractable convex formulation closely approximating the penalized risk (3). We give a number of theoretical guarantees and empirical evidence for its performance.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

To describe our approach, we require a few definitions. For a convex function $\phi : \mathbb{R}_+ \to \mathbb{R}$ with $\phi(1) = 0$, the quantity $D_\phi(P \| Q) = \int_{\mathcal{X}} \phi(\frac{dP}{dQ}) \, dQ$ is the $\phi$-divergence between distributions $P$ and $Q$ defined on $\mathcal{X}$. Throughout this paper, we use $\phi(t) = \frac{1}{2}(t-1)^2$, which gives the $\chi^2$-divergence. Given $\rho \ge 0$ and an i.i.d. sample $X_1, \ldots, X_n$, we define the $\rho$-neighborhood of the empirical distribution

$$\mathcal{P}_n := \Big\{ \text{distributions } P \text{ s.t. } D_\phi(P \| \widehat{P}_n) \le \frac{\rho}{n} \Big\},$$

where $\widehat{P}_n$ denotes the empirical distribution of the sample $\{X_i\}_{i=1}^n$, and our choice $\phi(t) = \frac{1}{2}(t-1)^2$ means that $\mathcal{P}_n$ has support $\{X_i\}_{i=1}^n$. We then define the robustly regularized risk

$$R_n(\theta, \mathcal{P}_n) := \sup_{P \in \mathcal{P}_n} \mathbb{E}_P[\ell(\theta, X)] = \sup_P \Big\{ \mathbb{E}_P[\ell(\theta, X)] : D_\phi(P \| \widehat{P}_n) \le \frac{\rho}{n} \Big\}. \qquad (4)$$

As it is the supremum of a family of convex functions, the robust risk $\theta \mapsto R_n(\theta, \mathcal{P}_n)$ is convex in $\theta$ regardless of the value of $\rho \ge 0$ whenever the original loss $\ell(\cdot; X)$ is convex and $\Theta$ is a convex set. Namkoong and Duchi [18] propose a stochastic procedure for minimizing (4) almost as fast as stochastic gradient descent.
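To see what the supremum in (4) computes, note that for a fixed loss vector it is a small quadratically constrained linear program over the simplex. When the losses have enough spread that the nonnegativity constraints are inactive (the situation analyzed in Section 2 below), it even has a closed form, which the following numpy sketch (ours, for illustration) implements:

```python
import numpy as np

def robust_mean_chi2(z, rho):
    """sup_P { E_P[Z] : D_phi(P || Pn_hat) <= rho/n } for the chi^2 divergence,
    via the closed form p_i = 1/n + sqrt(2*rho)*(z_i - zbar)/(n*||z - zbar||_2),
    which is exact whenever all resulting weights are nonnegative.
    Returns (value, worst_case_weights)."""
    z = np.asarray(z, dtype=float)
    n = z.size
    dev = z - z.mean()
    norm = np.linalg.norm(dev)
    if norm == 0.0:
        # constant losses: every distribution in the ball gives the same mean
        return z.mean(), np.full(n, 1.0 / n)
    p = 1.0 / n + np.sqrt(2.0 * rho) * dev / (n * norm)
    if p.min() < 0:
        # not enough variance: the closed form is invalid here, and the
        # constrained problem must be solved directly (e.g. by bisection)
        raise ValueError("nonnegativity constraints active; solve the QCLP")
    return float(p @ z), p
```

For example, `robust_mean_chi2([0, 1, 2, 3], rho=0.1)` returns the empirical mean 1.5 plus the standard-deviation-type term 0.25, together with worst-case weights that upweight the largest losses.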
See Appendix C for a detailed account of an alternative method.

We show that the robust risk (4) provides an excellent surrogate for the variance-regularized quantity (3) in a number of ways. Our first result (Thm. 1 in Sec. 2) is that for bounded loss functions,

$$R_n(\theta, \mathcal{P}_n) = \mathbb{E}_{\widehat{P}_n}[\ell(\theta, X)] + \sqrt{\frac{2\rho}{n} \mathrm{Var}_{\widehat{P}_n}(\ell(\theta, X))} + \varepsilon_n(\theta), \qquad (5)$$

where $\varepsilon_n(\theta) \le 0$ and is $O(1/n)$ uniformly in $\theta$. We show that when $\ell(\theta, X)$ has suitably large variance, we have $\varepsilon_n = 0$ with high probability. With the expansion (5) in hand, we can show a number of finite-sample convergence guarantees for the robustly regularized estimator

$$\widehat{\theta}_n^{\mathrm{rob}} \in \mathop{\mathrm{argmin}}_{\theta \in \Theta} \sup_P \Big\{ \mathbb{E}_P[\ell(\theta, X)] : D_\phi(P \| \widehat{P}_n) \le \frac{\rho}{n} \Big\}. \qquad (6)$$

Based on the expansion (5), solutions $\widehat{\theta}_n^{\mathrm{rob}}$ of problem (6) enjoy automatic finite sample optimality certificates: for $\rho \ge 0$, with probability at least $1 - C_1 \exp(-\rho)$ we have

$$\mathbb{E}[\ell(\widehat{\theta}_n^{\mathrm{rob}}; X)] \le R_n(\widehat{\theta}_n^{\mathrm{rob}}, \mathcal{P}_n) + \frac{C_2 \rho}{n} = \inf_{\theta \in \Theta} R_n(\theta, \mathcal{P}_n) + \frac{C_2 \rho}{n},$$

where $C_1, C_2$ are constants (which we specify) that depend on the loss $\ell$ and domain $\Theta$. That is, with high probability the robust solution has risk no worse than the optimal finite sample robust objective up to an $O(\rho/n)$ error term. To guarantee a desired level of risk performance with probability $1 - \delta$, we may specify the robustness penalty $\rho = O(\log \frac{1}{\delta})$.

Secondly, we show that the procedure (6) allows us to automatically and near-optimally trade between approximation and estimation error (bias and variance), so that

$$\mathbb{E}[\ell(\widehat{\theta}_n^{\mathrm{rob}}; X)] \le \inf_{\theta \in \Theta} \Big\{ \mathbb{E}[\ell(\theta; X)] + 2 \sqrt{\frac{2\rho}{n} \mathrm{Var}(\ell(\theta; X))} \Big\} + \frac{C\rho}{n}$$

with high probability. When there are parameters $\theta$ with small risk $R(\theta)$ (relative to the optimal parameter $\theta^\star$) and small variance $\mathrm{Var}(\ell(\theta, X))$, this guarantees that the excess risk $R(\widehat{\theta}_n^{\mathrm{rob}}) - R(\theta^\star)$ is essentially of order $O(\rho/n)$, where $\rho$ governs our desired confidence level. We give an explicit example in Section 3.2 where our robustly regularized procedure (6) converges at $O(\log n / n)$ compared to the $O(1/\sqrt{n})$ of empirical risk minimization.

Bounds that trade between risk and variance are known in a number of cases in the empirical risk minimization literature [15, 22, 2, 1, 6, 3, 7, 12], which is relevant when one wishes to achieve "fast rates" of convergence for statistical learning algorithms. In many cases, such tradeoffs require either conditions such as the Mammen-Tsybakov noise condition [15, 6] or localization results [3, 2, 17] made possible by curvature conditions that relate the risk and variance. The robust solutions (6) enjoy a variance-risk tradeoff that is different but holds essentially without conditions except compactness of $\Theta$. We show in Section 3.3 that the robust solutions enjoy fast rates of convergence under typical curvature conditions on the risk $R$.

We complement our theoretical results in Section 4, where we conclude by providing two experiments comparing empirical risk minimization (ERM) strategies to robustly-regularized risk minimization (6). These results validate our theoretical predictions, showing that the robust solutions are a practical alternative to empirical risk minimization. In particular, we observe that the robust solutions outperform their ERM counterparts on "harder" instances with higher variance.
In classification problems, for example, the robustly regularized estimators exhibit an interesting tradeoff, where they improve performance on rare classes (where ERM usually sacrifices performance to improve the common cases, increasing variance slightly) at minor cost in performance on common classes.

2 Variance Expansion

We begin our study of the robust regularized empirical risk $R_n(\theta, \mathcal{P}_n)$ by showing that it is a good approximation to the empirical risk plus a variance term (5). Although the variance of the loss is in general non-convex, the robust formulation (6) is a convex optimization problem for variance regularization whenever the loss function is convex [cf. 11, Prop. 2.1.2.].

To gain intuition for the variance expansion that follows, we consider the following equivalent formulation for the robust objective $\sup_{P \in \mathcal{P}_n} \mathbb{E}_P[Z]$:

$$\mathop{\mathrm{maximize}}_{p} \; \sum_{i=1}^n p_i z_i \quad \text{subject to} \quad p \in \mathcal{P}_n = \Big\{ p \in \mathbb{R}^n_+ : \frac{1}{2} \|np - 1\|_2^2 \le \rho, \; \langle 1, p \rangle = 1 \Big\}, \qquad (7)$$

where $z \in \mathbb{R}^n$ is a vector. For simplicity, let $s_n^2 = \frac{1}{n} \|z - \bar{z}\|_2^2 = \frac{1}{n} \|z\|_2^2 - (\bar{z})^2$ denote the empirical "variance" of the vector $z$, where $\bar{z} = \frac{1}{n} \langle 1, z \rangle$ is the mean value of $z$. Then by introducing the variable $u = p - \frac{1}{n} 1$, the objective in problem (7) satisfies $\langle p, z \rangle = \bar{z} + \langle u, z \rangle = \bar{z} + \langle u, z - \bar{z} \rangle$ because $\langle u, 1 \rangle = 0$. Thus problem (7) is equivalent to solving

$$\mathop{\mathrm{maximize}}_{u \in \mathbb{R}^n} \; \bar{z} + \langle u, z - \bar{z} \rangle \quad \text{subject to} \quad \|u\|_2^2 \le \frac{2\rho}{n^2}, \; \langle 1, u \rangle = 0, \; u \ge -\frac{1}{n}.$$

Notably, by the Cauchy-Schwarz inequality, we have $\langle u, z - \bar{z} \rangle \le \sqrt{2\rho} \, \|z - \bar{z}\|_2 / n = \sqrt{2\rho s_n^2 / n}$, and equality is attained if and only if

$$u_i = \frac{\sqrt{2\rho}\,(z_i - \bar{z})}{n \|z - \bar{z}\|_2} = \frac{\sqrt{2\rho}\,(z_i - \bar{z})}{n \sqrt{n s_n^2}}.$$

Of course, it is possible to choose such $u_i$ while satisfying the constraint $u_i \ge -1/n$ if and only if

$$\min_{i \in [n]} \frac{\sqrt{2\rho}\,(z_i - \bar{z})}{\sqrt{n s_n^2}} \ge -1. \qquad (8)$$

Thus, if inequality (8) holds for the vector $z$, that is, if there is enough variance in $z$, we have

$$\sup_{p \in \mathcal{P}_n} \langle p, z \rangle = \bar{z} + \sqrt{\frac{2\rho s_n^2}{n}}.$$

For losses $\ell(\theta, X)$ with enough variance relative to $\ell(\theta, X_i) - \mathbb{E}_{\widehat{P}_n}[\ell(\theta, X_i)]$, that is, those satisfying inequality (8), then, we have

$$R_n(\theta, \mathcal{P}_n) = \mathbb{E}_{\widehat{P}_n}[\ell(\theta, X)] + \sqrt{\frac{2\rho}{n} \mathrm{Var}_{\widehat{P}_n}(\ell(\theta, X))}.$$

A slight elaboration of this argument, coupled with the application of a few concentration inequalities, yields the next theorem. Recall that $\phi(t) = \frac{1}{2}(t-1)^2$ in our definition of the $\phi$-divergence.

Theorem 1. Let $Z$ be a random variable taking values in $[M_0, M_1]$, where $M = M_1 - M_0$, and fix $\rho \ge 0$. Then

$$\Big( \sqrt{\frac{2\rho}{n} \mathrm{Var}_{\widehat{P}_n}(Z)} - \frac{2M\rho}{n} \Big)_+ \le \sup_{P : D_\phi(P \| \widehat{P}_n) \le \frac{\rho}{n}} \mathbb{E}_P[Z] - \mathbb{E}_{\widehat{P}_n}[Z] \le \sqrt{\frac{2\rho}{n} \mathrm{Var}_{\widehat{P}_n}(Z)}. \qquad (9)$$

If $n \ge \max\{ \frac{24\rho}{\mathrm{Var}(Z)}, 1 \} M^2$, then with probability at least $1 - \exp(-\frac{n \mathrm{Var}(Z)}{36 M^2})$,

$$\sup_{P : D_\phi(P \| \widehat{P}_n) \le \frac{\rho}{n}} \mathbb{E}_P[Z] = \mathbb{E}_{\widehat{P}_n}[Z] + \sqrt{\frac{2\rho}{n} \mathrm{Var}_{\widehat{P}_n}(Z)}. \qquad (10)$$

See Appendix A.1 for the proof.
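As a quick numerical sanity check on the upper bound in inequality (9) (our own simulation, not part of the paper's argument), one can sample many feasible reweightings from the chi-square ball and confirm that none exceeds the mean-plus-standard-deviation bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho = 50, 2.0
z = rng.uniform(0.0, 1.0, size=n)                    # bounded "losses" (M = 1)
upper = z.mean() + np.sqrt(2 * rho * z.var() / n)    # RHS of inequality (9)

worst = -np.inf
for _ in range(2000):
    q = rng.dirichlet(np.ones(n))    # a random distribution on the sample
    # shrink q toward the uniform weights until sum_i (n*p_i - 1)^2 <= 2*rho,
    # i.e. until p lies in the chi-square ball of problem (7); the shrunken
    # point remains a probability vector since it is a convex combination
    t = min(1.0, np.sqrt(2 * rho) / np.linalg.norm(n * q - 1.0))
    p = 1.0 / n + t * (q - 1.0 / n)
    worst = max(worst, float(p @ z))
```

Every sampled value of $\mathbb{E}_P[Z]$ stays below `upper`, matching the Cauchy-Schwarz argument above.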
Inequality (9) and the exact expansion (10) show that, at least for bounded loss functions $\ell$, the robustly regularized risk (4) is a natural (and convex) surrogate for empirical risk plus standard deviation of the loss, and the robust formulation approximates exact variance regularization with a convex penalty.

We also provide a uniform variant of Theorem 1 based on the standard notion of the covering number, which we now define. Let $\mathcal{V}$ be a vector space with (semi)norm $\|\cdot\|$ on $\mathcal{V}$, and let $V \subset \mathcal{V}$. We say a collection $v_1, \ldots, v_N \subset \mathcal{V}$ is an $\epsilon$-cover of $V$ if for each $v \in V$, there exists $v_i$ such that $\|v - v_i\| \le \epsilon$. The covering number of $V$ with respect to $\|\cdot\|$ is then $N(V, \epsilon, \|\cdot\|) := \inf\{ N \in \mathbb{N} : \text{there is an } \epsilon\text{-cover of } V \text{ with respect to } \|\cdot\| \}$. Now, let $\mathcal{F}$ be a collection of functions $f : \mathcal{X} \to \mathbb{R}$, and define the $L^\infty(\mathcal{X})$-norm by $\|f - g\|_{L^\infty(\mathcal{X})} := \sup_{x \in \mathcal{X}} |f(x) - g(x)|$. Although we state our results abstractly, we typically take $\mathcal{F} := \{\ell(\theta, \cdot) \mid \theta \in \Theta\}$.

As a motivating example, we give the following standard bound on the covering number of Lipschitz losses [24].

Example 1: Let $\Theta \subset \mathbb{R}^d$ and assume that $\ell : \Theta \times \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz in $\theta$ with respect to the $\ell_2$-norm for all $x \in \mathcal{X}$, meaning that $|\ell(\theta, x) - \ell(\theta', x)| \le L \|\theta - \theta'\|_2$. Then taking $\mathcal{F} = \{\ell(\theta, \cdot) : \theta \in \Theta\}$, any $\epsilon$-covering $\{\theta_1, \ldots, \theta_N\}$ of $\Theta$ in $\ell_2$-norm guarantees that $\min_i |\ell(\theta, x) - \ell(\theta_i, x)| \le L\epsilon$ for all $\theta, x$. That is,

$$N(\mathcal{F}, \epsilon, \|\cdot\|_{L^\infty(\mathcal{X})}) \le N(\Theta, \epsilon/L, \|\cdot\|_2) \le \Big( 1 + \frac{\mathrm{diam}(\Theta) L}{\epsilon} \Big)^d,$$

where $\mathrm{diam}(\Theta) = \sup_{\theta, \theta' \in \Theta} \|\theta - \theta'\|_2$. Thus $\ell_2$-covering numbers of $\Theta$ control $L^\infty$-covering numbers of the family $\mathcal{F}$.
\u2325\nWith this de\ufb01nition, we provide a result showing that the variance expansion (5) holds uniformly for\nall functions with enough variance.\nTheorem 2. Let F be a collection of bounded functions f : X! [M0, M1] where M = M1  M0,\nand let \u2327  0 be a constant. De\ufb01ne F\u2327 :=f 2F : Var(f )  \u2327 2 and tn = \u2327 (p1  n1  1\n2 )\n2M 2\u2318, we have\nn . If \u2327 2  32\u21e2M 2\nfor all f 2F \u2327\n\n32 ,k\u00b7kL1(X )\u2318 exp\u21e3 nt2\n\n, then with probability at least 1 N\u21e3F, \u2327\n[f (X)] +r 2\u21e2\n\n(f (X)).\n\nVarbPn\n\n(11)\n\nEP [f (X)] = EbPn\n\nsup\n\nM 2\n\nn\n\nn\n\nn\n\nn\n\nWe prove the theorem in Section A.2. Theorem 2 shows that the variance expansion of Theorem 1\nholds uniformly for all functions f with suf\ufb01cient variance. See Duchi, Glynn, and Namkoong [10]\nfor an asymptotic analogue of the equality (11) for heavier tailed random variables.\n\nP :D(P||bPn)\uf8ff \u21e2\n\n3 Optimization by Minimizing the Robust Loss\nBased on the variance expansions in the preceding section, we show that the robust solution (6)\nautomatically trades between approximation and estimation error. In addition to k\u00b7kL1(X )-covering\n\n4\n\n\fThen for \u21e2  t, with probability at least 1  2(N (F,\u270f, k\u00b7kL1(X )) + 1)et,\n\nf2F \u21e2sup\n\nbf 2 argmin\n\nE[bf (X)] \uf8ff\n\nsup\n\nn\n\nP :D(P||bPn)\uf8ff \u21e2\nf2F(E[f ] + 2r 2\u21e2\n\uf8ff inf\n\nn\n\n\u21e2\n\nn\n\n7M\u21e2\n\nno .\nP nEP [f (X)] : D(P||bPn) \uf8ff\nn  1! \u270f\n+ 2 +r 2t\nEP [bf (X)] +\nn  1! \u270f.\n+ 2 +r 2t\n+ 2 + 4r 2t\nn! \u270f\n+ 2 + 4r 2t\nn! \u270f.\n\nEP [bf (X)] +\n\nVar(f )) +\n\nVar(f )) +\n\nM\u21e2\nn\n\n11M\u21e2\n\n19M\u21e2\n\n11\n3\n\n3n\n\nn\n\n(12a)\n\n(12b)\n\n(13b)\n\nnumbers de\ufb01ned in the previous section, we use the tighter notion of empirical `1-covering numbers.\nFor x 2X n, de\ufb01ne F(x) = {(f (x1), . . . 
, f (xn)) : f 2F} and the empirical `1-covering numbers\nN1(F,\u270f, n ) := supx2X n N (F(x),\u270f, k\u00b7k1), which bound the number of `1-balls of radius \u270f\nrequired to cover F(x). Note that we always have N1(F) \uf8ff N (F).\nTypically, we consider the function class F := {`(\u2713, \u00b7) : \u2713 2 \u21e5}, though we state our minimization\nresults abstractly. Although the below result is in terms of covering numbers for ease of exposition, a\nvariant holds depending on localized Rademacher averages [2] of the class F, which can yield tighter\nguarantees (we omit such results for lack of space). We prove the following theorem in Section A.3.\nTheorem 3. Let F be a collection of functions f : X! [M0, M1] with M = M1  M0. De\ufb01ne the\nempirical minimizer\n\n, t  log 12, and \u21e2  9t, with probability at least 12(3N1 (F,\u270f, 2n)+1)et,\n(13a)\n\nsup\n\nt\n\nFurther, for n  8M 2\nE[bf (X)] \uf8ff\n\nn\n\nP :D(P||bPn)\uf8ff \u21e2\nf2F(E[f ] + 2r 2\u21e2\n\uf8ff inf\n\nn\n\nUnlike analogous results for empirical risk minimization [6], Theorem 3 does not require the self-\nbounding type assumption Var(f ) \uf8ff BE[f ]. A consequence of this is that when v = Var(f\u21e4)\nis small, where f\u21e4 2 argminf2F E[f ], we achieve O(1/n +pv/n) (fast) rates of convergence.\nThis condition is different from the typical conditions required for empirical risk minimization to\nhave fast rates of convergence, highlighting the possibilities of variance-based regularization. It will\nbe interesting to understand appropriate low-noise conditions (e.g. the Mammen-Tsybakov noise\ncondition [15, 6]) guaranteeing good performance. 
Additionally, the robust objective Rn(\u2713, Pn) is\nan empirical likelihood con\ufb01dence bound on the population risk [10], and as empirical likelihood\ncon\ufb01dence bounds are self-normalizing [19], other fast-rate generalizations may exist.\n\n3.1 Consequences of Theorem 3\n\nWe now turn to a number of corollaries that expand on Theorem 3 to investigate its consequences.\nOur \ufb01rst corollary shows that Theorem 3 applies to standard Vapnik-Chervonenkis (VC) classes.\nAs VC dimension is preserved through composition, this result also extends to the procedure (6) in\ntypical empirical risk minimization scenarios. See Section A.4 for its proof.\nCorollary 3.1. In addition to the conditions of Theorem 3, let F have \ufb01nite VC-dimension\nVC(F). Then for a numerical constant c < 1, the bounds (13) hold with probability at least\n1 \u21e3c VC(F) 16M ne\nNext, we focus more explicitly on the estimatorb\u2713rob\nn de\ufb01ned by minimizing the robust regularized\nrisk (6). Let us assume that \u21e5 \u21e2 Rd, and that we have a typical linear modeling situation, where\na loss h is applied to an inner product, that is, `(\u2713, x) = h(\u2713>x). In this case, by making the\nsubstitution that the class F = {`(\u2713, \u00b7) : \u2713 2 \u21e5} in Corollary 3.1, we have VC(F) \uf8ff d, and we\nobtain the following corollary. Recall the de\ufb01nition (1) of the population risk R(\u2713) = E[`(\u2713, X )],\nand the uncertainty set Pn = {P : D(P||bPn) \uf8ff \u21e2\nn}, and that Rn(\u2713, Pn) = supP2Pn EP [`(\u2713, X )].\n\nBy setting \u270f = M/n in Corollary 3.1, we obtain the following result.\n\n+ 2\u2318 et.\n\nVC(F)1\n\n\u270f\n\n5\n\n\fn\n\nn\n\n+\n\n3n\n\n11M\u21e2\n\n11M\u21e2\n\nn ,Pn) +\n\nVar(`(\u2713; X))) +\n\n\u27132\u21e5(R(\u2713) + 2r 2\u21e2\n\nCorollary 3.2. Let the conditions of the previous paragraph hold and assume that `(\u2713, x) 2 [0, M ]\nfor all \u2713 2 \u21e5, x 2X . 
Then if n  \u21e2  9 log 12,\n4M\nR(b\u2713rob\nn ) \uf8ff Rn(b\u2713rob\nn \uf8ff inf\nwith probability at least 1  2 exp(c1d log n  c2\u21e2), where ci are universal constants with c2  1/9.\nUnpacking Theorem 3 and Corollary 3.2 a bit, the \ufb01rst result (13a) provides a high-probability\nguarantees that the true expectation E[bf ] cannot be more than O(1/n) worse than its robustly-\nregularized empirical counterpart, that is, R(b\u2713rob\nn ,Pn) + O(\u21e2/n), which is (roughly)\na consequence of uniform variants of Bernstein\u2019s inequality. The second result (13b) guarantee\nthe convergence of the empirical minimizer to a parameter with risk at most O(1/n) larger than\nthe best possible variance-corrected risk. In the case that the losses take values in [0, M ], then\nVar(`(\u2713, X )) \uf8ff M R(\u2713), and thus for \u270f = 1/n in Theorem 3, we obtain\nM\u21e2\nn\n\nn ) \uf8ff Rn(b\u2713rob\n\nn ) \uf8ff R(\u2713?) + Cr M \u21e2R(\u2713?)\nR(b\u2713rob\n\na type of result well-known and achieved by empirical risk minimization for bounded nonnegative\nlosses [6, 26, 25]. In some scenarios, however, the variance may satisfy Var(`(\u2713, X )) \u2327 M R(\u2713),\nyielding improvements.\nTo give an alternative variant of Corollary 3.2, let \u21e5 \u21e2 Rd and assume that for each x 2X ,\ninf \u27132\u21e5 `(\u2713, x) = 0 and that ` is L-Lipschitz in \u2713. If D := diam(\u21e5) = sup\u2713,\u271302\u21e5 k\u2713  \u27130k < 1,\nthen 0 \uf8ff `(\u2713, x) \uf8ff L diam(\u21e5) =: M.\nCorollary 3.3. Let the conditions of the preceeding paragraph hold. Set t = \u21e2 = log 2n +\nn in Theorem 3 and assume that D . nk and L . nk for a numerical\nd log(2nDL) and \u270f = 1\nconstant k. 
With probability at least $1 - 1/n$,

$$\mathbb{E}[\ell(\widehat{\theta}_n^{\mathrm{rob}}; X)] = R(\widehat{\theta}_n^{\mathrm{rob}}) \le \inf_{\theta \in \Theta} \Big\{ R(\theta) + C \sqrt{\frac{d \, \mathrm{Var}(\ell(\theta, X)) \log n}{n}} \Big\} + C \, \frac{d L D \log n}{n},$$

where $C$ is a numerical constant.

3.2 Beating empirical risk minimization

We now provide an example in which the robustly-regularized estimator (6) exhibits a substantial improvement over empirical risk minimization. We expect the robust approach to offer performance benefits in situations in which the empirical risk minimizer is highly sensitive to noise, say, because the losses are piecewise linear, and slight under- or over-estimates of slope may significantly degrade solution quality. With this in mind, we construct a toy 1-dimensional example, estimating the median of a distribution supported on $\mathcal{X} = \{-1, 0, 1\}$, in which the robust-regularized estimator has convergence rate $\log n / n$, while empirical risk minimization is at best $1/\sqrt{n}$.

Define the loss $\ell(\theta; x) = |\theta - x| - |x|$, and for $\delta \in (0, 1)$ let the distribution $P$ be defined by $P(X = 1) = \frac{1-\delta}{2}$, $P(X = -1) = \frac{1-\delta}{2}$, $P(X = 0) = \delta$. Then for $\theta \in \mathbb{R}$, the risk of the loss is

$$R(\theta) = \delta |\theta| + \frac{1-\delta}{2} |\theta - 1| + \frac{1-\delta}{2} |\theta + 1| - (1 - \delta).$$

By symmetry, it is clear that $\theta^\star := \mathop{\mathrm{argmin}}_\theta R(\theta) = 0$, which satisfies $R(\theta^\star) = 0$. (Note that $\ell(\theta, x) = \ell(\theta, x) - \ell(\theta^\star, x)$.) Without loss of generality, we assume that $\Theta = [-1, 1]$. Define the empirical risk minimizer and the robust solution

$$\widehat{\theta}^{\mathrm{erm}} := \mathop{\mathrm{argmin}}_{\theta \in \mathbb{R}} \mathbb{E}_{\widehat{P}_n}[\ell(\theta, X)] = \mathop{\mathrm{argmin}}_{\theta \in [-1,1]} \mathbb{E}_{\widehat{P}_n}[|\theta - X|], \qquad \widehat{\theta}_n^{\mathrm{rob}} \in \mathop{\mathrm{argmin}}_{\theta \in \Theta} R_n(\theta, \mathcal{P}_n).$$

Intuitively, if too many of the observations satisfy $X_i = 1$ or too many satisfy $X_i = -1$, then $\widehat{\theta}^{\mathrm{erm}}$ will be either $1$ or $-1$; for small $\delta$, such events become reasonably probable.
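This intuition is easy to check by simulation (ours, not from the paper). For tractability, the sketch below minimizes the variance-expansion surrogate, the empirical mean plus $\sqrt{2\rho \, \mathrm{Var}_{\widehat{P}_n}/n}$, over a grid in place of the exact robust objective, which expansion (5) justifies up to $O(1/n)$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, delta, trials = 51, 0.1, 200
rho = 3 * np.log(n)
grid = np.linspace(-1.0, 1.0, 201)          # includes theta = 0 exactly

erm_excess = rob_excess = 0.0
for _ in range(trials):
    x = rng.choice([-1.0, 0.0, 1.0], size=n,
                   p=[(1 - delta) / 2, delta, (1 - delta) / 2])
    # ERM solution is the empirical median; on [-1, 1] the excess risk
    # R(theta) - R(theta*) equals delta * |theta| for this distribution
    erm_excess += delta * abs(np.median(x))
    # variance-regularized surrogate, minimized by grid search over theta
    losses = np.abs(grid[:, None] - x[None, :]) - np.abs(x)[None, :]
    obj = losses.mean(axis=1) + np.sqrt(2 * rho * losses.var(axis=1) / n)
    rob_excess += delta * abs(grid[np.argmin(obj)])
erm_excess /= trials
rob_excess /= trials
```

With these settings the variance-regularized estimate essentially always selects $\theta = 0$, where the loss has zero variance, while the empirical median lands at $\pm 1$ in a nontrivial fraction of trials.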
On the other hand, we\nhave `(\u2713?; x) = 0 for all x 2X , so that Var(`(\u2713?; X)) = 0 and variance regularization achieves the\nrate O(log n/n) as opposed to empirical risk minimizer\u2019s O(1/pn). See Section A.6 for the proof.\n\n6\n\n\fProposition 1. Under the conditions of the previous paragraph, for n  \u21e2 = 3 log n, with probability\n. However, with probability at least 2(q n\nn, we have R(b\u2713rob\nat least 1 4\nn1 )\n2p2/p\u21e1en  2(q n\nFor n  20, the probability of the latter event is  .088. Hence, for this (specially constructed)\nexample, we see that there is a gap of nearly n 1\n\n2 , we have R(b\u2713erm)  R(\u2713?) + n 1\n\nn )R(\u2713?) \uf8ff 45 log n\nn1 )  n 1\n\n2 in order of convergence.\n\n2 .\n\nn\n\n3.3 Fast Rates\n\nIn cases in which the risk R has curvature, empirical risk minimization often enjoys faster\nrates of convergence [6, 21]. The robust solution b\u2713rob\nsimilarly attains faster rates of conver-\ngence in such cases, even with approximate minimizers of Rn(\u2713, Pn). For the risk R and\n\u270f  0, let S\u270f := {\u2713 2 \u21e5: R(\u2713) \uf8ff inf \u2713?2\u21e5 R(\u2713?) + \u270f} denote the \u270f-sub-optimal (solution) set,\nand similarly let bS\u270f := {\u2713 2 \u21e5: Rn(\u2713, Pn) \uf8ff inf \u271302\u21e5 Rn(\u27130,Pn) + \u270f}. For a vector \u2713 2 \u21e5, let\n\u21e1S(\u2713) = argmin\u2713?2S k\u2713?  \u2713k2 denote the Euclidean projection of \u2713 onto the set S.\nOur below result depends on a local notion of Rademacher complexity. For i.i.d. random signs\n\"i 2 {\u00b11}, the empirical Rademacher complexity of a function class F\u21e2{ f : X! R} is\n\nn\n\nRnF := E\uf8ff sup\n\nf2F\n\n\"if (Xi) | X.\n\n1\nn\n\nnXi=1\n\nAlthough we state our results abstractly, we typically take F := {`(\u2713, \u00b7) | \u2713 2 \u21e5}. For example,\nwhen F is a VC-class, we typically have E[RnF] .pVC(F)/n. Many other bounds on E[RnF]\nare possible [1, 24, Ch. 2]. 
For A \u21e2 \u21e5 let Rn(A) denote the Rademacher complexity of the localized\nprocess {x 7! `(\u2713; x)  `(\u21e1S(\u2713); x) : \u2713 2 A}. We then have the following result, whose proof we\nprovide in Section A.7.\nTheorem 4. Let \u21e5 \u21e2 Rd be convex and let `(\u00b7; x) be convex and L-Lipshitz for all x 2X . For\nconstants > 0, > 1, and r > 0, assume that R satis\ufb01es\n(14)\n\nR(\u2713)   dist(\u2713, S) for all \u2713 such that dist(\u2713, S) \uf8ff r.\n\nR(\u2713)  inf\n\u27132\u21e5\n2 r satis\ufb01es\n\nLet t > 0. If 0 \uf8ff \u270f \uf8ff 1\n\u270f \u2713 8L2\u21e2\n\nn \u25c6 \n\n2(1)\u2713 2\n\u25c6 1\n\nr 2t\n\u25c6 1\n2  2E[Rn(S2\u270f)] + L\u2713 2\u270f\nthen P(bS\u270f \u21e2 S2\u270f)  1  et, and inequality (15) holds for all \u270f & ( L2(t+\u21e2+d)\n\nand \u270f\n\n2(1) .\n\n2/ n\n\n1\n\nn\n\n\n\n)\n\n4 Experiments\n\n,\n\n(15)\n\nWe present two real classi\ufb01cation experiments to carefully compare standard empirical risk minimiza-\ntion (ERM) to the variance-regularized approach we present. Empirically, we show that the ERM\n\nestimatorb\u2713erm performs poorly on rare classes with (relatively) more variance, where the robust\n\nsolution achieves improved classi\ufb01cation performance on rare instances. In all our experiments, this\noccurs with little expense over the more common instances.\n\n4.1 Protease cleavage experiments\n\nFor our \ufb01rst experiment, we compare our robust regularization procedure to other regularizers using\nthe HIV-1 protease cleavage dataset from the UCI ML-repository [14]. In this binary classi\ufb01cation\ntask, one is given a string of amino acids (a protein) and a featurized representation of the string of\ndimension d = 50960, and the goal is to predict whether the HIV-1 virus will cleave the amino acid\nsequence in its central position. 
We have a sample of $n = 6590$ observations of this process, where the class labels are somewhat skewed: there are 1360 examples with label $Y = +1$ (HIV-1 cleaves) and 5230 examples with $Y = -1$ (does not cleave).

Figure 1: HIV-1 Protease Cleavage plots (2-standard error confidence bars). Comparison of misclassification test error rates among different regularizers: (a) test error, (b) rare class ($Y_i = +1$), (c) common class ($Y_i = -1$).

We use the logistic loss $\ell(\theta; (x, y)) = \log(1 + \exp(-y \theta^\top x))$. We compare the performance of different constraint sets $\Theta$ by taking $\Theta = \{\theta \in \mathbb{R}^d : a_1 \|\theta\|_1 + a_2 \|\theta\|_2 \le r\}$, which is equivalent to elastic net regularization [27], while varying $a_1$, $a_2$, and $r$. We experiment with $\ell_1$-constraints ($a_1 = 1$, $a_2 = 0$) with $r \in \{50, 100, 500, 1000, 5000\}$, $\ell_2$-constraints ($a_1 = 0$, $a_2 = 1$) with $r \in \{5, 10, 50, 100, 500\}$, elastic net ($a_1 = 1$, $a_2 = 10$) with $r \in \{10^2, 2 \cdot 10^2, 10^3, 2 \cdot 10^3, 10^4\}$, our robust regularizer with $\rho \in \{10^2, 10^3, 10^4, 5 \cdot 10^4, 10^5\}$, and our robust regularizer coupled with the $\ell_1$-constraint ($a_1 = 1$, $a_2 = 0$) with $r = 100$. Though we use a convex surrogate (logistic loss), we measure performance of the classifiers using the zero-one (misclassification) loss $1\{\mathrm{sign}(\theta^\top x) y \le 0\}$. For validation, we perform 50 experiments, where in each experiment we randomly select 9/10 of the data to train the model, evaluating its performance on the held out 1/10 fraction (test).

We plot results summarizing these experiments in Figure 1. The horizontal axis in each figure indexes our choice of regularization value (so "Regularizer = 1" for the $\ell_1$-constrained problem corresponds to $r = 50$).
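For readers who want to reproduce a scaled-down version of these experiments, the following is a heuristic full-batch sketch of robust logistic training (our own illustration; it is not the stochastic procedure of [18]). It alternates the closed-form worst-case reweighting from Section 2, clipped and renormalized so that it remains a distribution when condition (8) fails, with a gradient step on the reweighted logistic loss.

```python
import numpy as np

def robust_logistic_sketch(X, y, rho, steps=500, lr=0.5):
    """Alternating heuristic for min_theta sup_{P in P_n} E_P[logistic loss].

    X: (n, d) features; y: (n,) labels in {-1, +1}; rho: robustness level.
    The reweighting step is exact when all closed-form weights are
    nonnegative, and only approximate (clip + renormalize) otherwise."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(steps):
        margins = y * (X @ theta)
        losses = np.log1p(np.exp(-margins))
        dev = losses - losses.mean()
        norm = np.linalg.norm(dev)
        p = np.full(n, 1.0 / n)
        if norm > 0:
            # closed-form worst-case weights upweight high-loss examples
            p = np.clip(p + np.sqrt(2.0 * rho) * dev / (n * norm), 0.0, None)
            p /= p.sum()
        # gradient of the p-weighted logistic loss at the current theta
        grad = -X.T @ (p * y / (1.0 + np.exp(margins)))
        theta -= lr * grad
    return theta
```

On a toy separable problem this recovers a separating direction while placing most of the worst-case weight on the highest-loss examples.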
The figures show that the robustly regularized risk provides a different type of protection against overfitting than standard regularization or constraint techniques do: while other regularizers underperform in heavily constrained settings, the robustly regularized estimator $\widehat{\theta}_n^{\mathrm{rob}}$ achieves low classification error for all values of $\rho$. Notably, even when coupled with a fairly stringent $\ell_1$-constraint ($r = 100$), robust regularization has performance better than $\ell_1$ except for large values $r$, especially on the rare label $Y = +1$.

We investigate the effects of the robust regularizer from a slightly different perspective in Table 1, where we use $\Theta = \{\theta : \|\theta\|_1 \le 100\}$ for the constraint set for each experiment. We give error rates and logistic risk values for the different procedures, averaged over 50 independent runs. We note that all gaps are significant at the 3-standard error level. We see that the ERM solutions achieve good performance on the common class ($Y = -1$) but sacrifice performance on the uncommon class. As we increase $\rho$, performance of the robust solution $\widehat{\theta}_n^{\mathrm{rob}}$ on the rarer label $Y = +1$ improves, while the error rate on the common class degrades a small (insignificant) amount.

Table 1: HIV-1 Cleavage Error

                risk             error (%)       error (Y = +1)   error (Y = -1)
  rho        train    test      train   test     train   test     train   test
  erm       0.1587   0.1706     5.52    6.39     17.32   18.79    2.45    3.17
  100       0.1623   0.1763     4.99    5.92     15.01   17.04    2.38    3.02
  1000      0.1777   0.1944     4.5     5.92     13.35   16.33    2.19    3.2
  10000     0.283    0.3031     2.39    5.67     7.18    14.65    1.15    3.32

4.2 Document classification in the Reuters corpus

For our second experiment, we consider a multi-label classification problem with a reasonably large dataset.
The Reuters RCV1 Corpus [13] has 804,414 examples with d = 47,236 features, where\nfeature j is an indicator variable for whether word j appears in a given document. The goal is to\nclassify documents as a subset of the 4 categories where documents are labeled with a subset of\nthose. As documents can belong to multiple categories, we \ufb01t binary classi\ufb01ers on each of the four\n\n8\n\n\f(a)\n\n(b)\n\n(c)\n\nFigure 2: Reuters corpus experiment. (a) Logistic risks. (b) Recall. (c) Recall on Economics (rare).\n\ncategories. Each category has different number of documents (Corporate: 381, 327, Economics:\n119, 920, Government: 239, 267, Markets: 204, 820) In this experiment, we expect the robust solution\nto outperform ERM on the rarer category (Economics), as the robusti\ufb01cation (6) naturally upweights\nrarer (harder) instances, which disproportionally affect variance\u2014as in the previous experiment.\nFor each category k 2{ 1, 2, 3, 4}, we use the logistic loss `(\u2713k; (x, y)) = log(1 + exp(y\u2713>k x)).\nFor each binary classi\ufb01er, we use the `1 constraint set \u21e5= \u2713 2 Rd : k\u2713k1 \uf8ff 1000 . To evaluate\nperformance on this multi-label problem, we use precision (ratio of the number of correct positive\nlabels to the number classi\ufb01ed as positive) and recall (ratio of the number of correct positive labels\nto the number of actual positive labels). We partition the data into ten equally-sized sub-samples\nand perform ten validation experiments, where in each experiment we use one of the ten subsets for\n\ufb01tting the logistic models and the remaining nine partitions as a test set to evaluate performance.\nIn Figure 2, we summarize the results of our experiment averaged over the 10 runs, with 2-standard\nerror bars (computed across the folds). To facilitate comparison across the document categories,\n\nn andb\u2713erm have reasonably high\nwe give exact values of these averages in Tables 2 and 3. 
Both $\hat{\theta}^{\mathrm{rob}}_n$ and $\hat{\theta}^{\mathrm{erm}}$ have reasonably high precision across all categories, with increasing $\rho$ giving a mild improvement in precision (from $.93 \pm .005$ to $.94 \pm .005$). On the other hand, we observe in Figure 2(c) that ERM has low recall (.69 on test) for the Economics category, which contains about 15% of documents. As we increase $\rho$ from 0 (ERM) to $10^5$, we see a smooth and substantial improvement in recall for this rarer category (without significant degradation in precision). This improvement in recall amounts to reducing variance in predictions on the rare class. The precision and recall improve in spite of an increase in the average binary logistic risk for each of the 4 classes. In Figure 2(a), we plot the average binary logistic loss (on train and test sets) averaged over the 4 categories, as well as the upper confidence bound $R_n(\theta, P_n)$, as we vary $\rho$. The variance reduction effected by robust regularization appears to improve the performance of the binary logistic loss as a surrogate for the true error rate.

Table 2: Reuters Corpus Precision (%)

          Precision        Corporate        Economics        Government       Markets
  ρ       train    test    train    test    train    test    train    test    train    test
  erm     92.72    92.70   93.55    93.55   89.02    89.00   94.10    94.12   92.88    92.94
  1E3     92.97    92.95   93.31    93.33   87.81    87.84   93.76    93.73   92.56    92.62
  1E4     93.45    93.45   93.58    93.61   87.60    87.58   93.77    93.80   92.71    92.75
  1E5     94.17    94.16   94.18    94.19   86.55    86.56   94.07    94.09   93.16    93.24
  1E6     91.20    91.19   92.00    92.02   74.80    74.81   91.25    91.19   89.98    90.18

Table 3: Reuters Corpus Recall (%)

          Recall           Corporate        Economics        Government       Markets
  ρ       train    test    train    test    train    test    train    test    train    test
  erm     90.97    90.96   90.20    90.25   67.56    67.53   90.49    90.49   88.77    88.78
  1E3     91.72    91.69   90.83    90.86   70.42    70.39   91.26    91.23   89.62    89.58
  1E4     92.40    92.39   91.47    91.54   72.38    72.36   91.76    91.76   90.48    90.45
  1E5     93.46    93.44   92.65    92.71   76.79    76.78   92.26    92.21   91.46    91.47
  1E6     93.10    93.08   92.00    92.04   79.71    79.84   91.90    91.89   92.00    91.97

Code is available at https://github.com/hsnamkoong/robustopt.

Acknowledgments We thank Feng Ruan for pointing out a much simpler proof of Theorem 1 than in our original paper. JCD and HN were partially supported by the SAIL-Toyota Center for AI Research, and HN was partially supported by a Samsung Fellowship. JCD was also partially supported by the National Science Foundation award NSF-CAREER-1553086 and the Sloan Foundation.

References

[1] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463-482, 2002.
[2] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, 33(4):1497-1537, 2005.
[3] P. L. Bartlett, M. I. Jordan, and J. McAuliffe. Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101:138-156, 2006.
[4] A. Ben-Tal, D. den Hertog, A. De Waegenaere, B. Melenberg, and G. Rennen. Robust solutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):341-357, 2013.
[5] D. Bertsimas, V. Gupta, and N. Kallus. Robust SAA. arXiv:1408.4445 [math.OC], 2014. URL http://arxiv.org/abs/1408.4445.
[6] S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9:323-375, 2005.
[7] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[8] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[9] J. C. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra.
Efficient projections onto the $\ell_1$-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, 2008.
[10] J. C. Duchi, P. W. Glynn, and H. Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv:1610.03425 [stat.ML], 2016. URL https://arxiv.org/abs/1610.03425.
[11] J. Hiriart-Urruty and C. Lemaréchal. Convex Analysis and Minimization Algorithms I & II. Springer, New York, 1993.
[12] V. Koltchinskii. Local Rademacher complexities and oracle inequalities in risk minimization. Annals of Statistics, 34(6):2593-2656, 2006.
[13] D. Lewis, Y. Yang, T. Rose, and F. Li. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5:361-397, 2004.
[14] M. Lichman. UCI machine learning repository, 2013. URL http://archive.ics.uci.edu/ml.
[15] E. Mammen and A. B. Tsybakov. Smooth discrimination analysis. Annals of Statistics, 27:1808-1829, 1999.
[16] A. Maurer and M. Pontil. Empirical Bernstein bounds and sample variance penalization. In Proceedings of the Twenty Second Annual Conference on Computational Learning Theory, 2009.
[17] S. Mendelson. Learning without concentration. In Proceedings of the Twenty Seventh Annual Conference on Computational Learning Theory, 2014.
[18] H. Namkoong and J. C. Duchi. Stochastic gradient methods for distributionally robust optimization with f-divergences. In Advances in Neural Information Processing Systems 29, 2016.
[19] A. B. Owen. Empirical Likelihood. CRC Press, 2001.
[20] P. Samson. Concentration of measure inequalities for Markov chains and Φ-mixing processes. Annals of Probability, 28(1):416-461, 2000.
[21] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM and Mathematical Programming Society, 2009.
[22] A. B. Tsybakov. Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32(1):135-166, 2004.
[23] A. B. Tsybakov. Introduction to Nonparametric Estimation. Springer, 2009.
[24] A. W. van der Vaart and J. A. Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer, New York, 1996.
[25] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[26] V. N. Vapnik and A. Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, XVI(2):264-280, 1971.
[27] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67(2):301-320, 2005.
[28] A. Zubkov and A. Serov. A complete proof of universal inequalities for the distribution function of the binomial law. Theory of Probability & Its Applications, 57(3):539-544, 2013.