{"title": "Distributionally Robust Optimization and Generalization in Kernel Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 9134, "page_last": 9144, "abstract": "Distributionally robust optimization (DRO) has attracted attention in machine learning due to its connections to regularization, generalization, and robustness. Existing work has considered uncertainty sets based on phi-divergences and Wasserstein distances, each of which have drawbacks. In this paper, we study DRO with uncertainty sets measured via maximum mean discrepancy (MMD). We show that MMD DRO is roughly equivalent to regularization by the Hilbert norm and, as a byproduct, reveal deep connections to classic results in statistical learning. In particular, we obtain an alternative proof of a generalization bound for Gaussian kernel ridge regression via a DRO lense. The proof also suggests a new regularizer. Our results apply beyond kernel methods: we derive a generically applicable approximation of MMD DRO, and show that it generalizes recent work on variance-based regularization.", "full_text": "Distributionally Robust Optimization and\n\nGeneralization in Kernel Methods\n\nMatthew Staib\n\nMIT CSAIL\n\nmstaib@mit.edu\n\nStefanie Jegelka\n\nMIT CSAIL\n\nstefje@csail.mit.edu\n\nAbstract\n\nDistributionally robust optimization (DRO) has attracted attention in machine\nlearning due to its connections to regularization, generalization, and robustness.\nExisting work has considered uncertainty sets based on \u03c6-divergences and Wasser-\nstein distances, each of which have drawbacks. In this paper, we study DRO with\nuncertainty sets measured via maximum mean discrepancy (MMD). We show that\nMMD DRO is roughly equivalent to regularization by the Hilbert norm and, as a\nbyproduct, reveal deep connections to classic results in statistical learning. 
In particular, we obtain an alternative proof of a generalization bound for Gaussian kernel ridge regression via a DRO lens. The proof also suggests a new regularizer. Our results apply beyond kernel methods: we derive a generically applicable approximation of MMD DRO, and show that it generalizes recent work on variance-based regularization.

1 Introduction

Distributionally robust optimization (DRO) is an attractive tool for improving machine learning models. Instead of choosing a model f to minimize the empirical risk E_{x∼P̂_n}[ℓ_f(x)] = (1/n) ∑_i ℓ_f(x_i), an adversary is allowed to perturb the sample distribution within a set U centered around the empirical distribution P̂_n. DRO seeks a model that performs well regardless of the perturbation: inf_f sup_{Q∈U} E_{x∼Q}[ℓ_f(x)]. The induced robustness can directly imply generalization: if the data that forms P̂_n is drawn from a population distribution P, and U is large enough to contain P, then we implicitly optimize for P too, and the DRO objective value upper bounds out-of-sample performance. More broadly, robustness has gained attention due to adversarial examples [17, 39, 26]; indeed, DRO generalizes robustness to adversarial examples [35, 36].

In machine learning, the DRO uncertainty set U has so far always been defined as a φ-divergence ball or Wasserstein ball around the empirical distribution P̂_n. These choices are convenient, due to a number of structural results. For example, DRO with the χ²-divergence is roughly equivalent to regularizing by variance [18, 22, 31], and the worst-case distribution Q ∈ U can be computed exactly in O(n log n) time [37]. Moreover, DRO with Wasserstein distance is asymptotically equivalent to certain common norm penalties [15], and the worst-case Q ∈ U can be computed approximately in several cases [28, 14]. 
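To make the first of these equivalences concrete, here is a minimal sketch of variance-based regularization, the kind of objective that χ²-divergence DRO is (roughly) equivalent to. The specific form of the penalty and the toy loss vectors are illustrative assumptions, not details taken from the cited works.

```python
import numpy as np

def variance_regularized_risk(losses, rho):
    """Empirical risk plus a variance penalty; chi-square DRO with a small
    radius is (roughly) equivalent to an objective of this form (the exact
    constant here is an assumption for illustration)."""
    n = len(losses)
    return losses.mean() + np.sqrt(2 * rho * losses.var() / n)

# Two models with equal mean loss: the variance penalty prefers the stable one.
stable = np.array([1.0, 1.0, 1.0, 1.0])
risky = np.array([0.0, 2.0, 0.0, 2.0])
print(variance_regularized_risk(stable, rho=1.0))  # 1.0 (zero variance)
print(variance_regularized_risk(risky, rho=1.0))   # larger than 1.0
```

The point of the sketch is only the shape of the objective: mean loss plus a data-dependent penalty that vanishes when the per-point losses are constant.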
These structural results are key, because the most challenging part of DRO is solving (or bounding) the DRO objective.

However, there are substantial drawbacks to these two types of uncertainty sets. Any φ-divergence uncertainty set U around P̂_n contains only distributions with the same (finite) support as P̂_n. Hence, the population P is typically not in U, and so the DRO objective value cannot directly certify out-of-sample performance. Wasserstein uncertainty sets do not suffer from this problem. But they are more computationally expensive, and the above key results on equivalences and computation need nontrivial assumptions on the loss function and the specific ground distance metric used.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

In this paper, we introduce and develop a new class of DRO problems, where the uncertainty set U is defined with respect to the maximum mean discrepancy (MMD) [19], a kernel-based distance between distributions. MMD DRO complements existing approaches and avoids some of their drawbacks; e.g., unlike with φ-divergences, the uncertainty set U will contain P if the radius is large enough.

First, we show that MMD DRO is roughly equivalent to regularizing by the Hilbert norm ‖ℓ_f‖_H of the loss ℓ_f (not the model f). While, in general, ‖ℓ_f‖_H may be difficult to compute, we show settings in which it is tractable. Specifically, for kernel ridge regression with a Gaussian kernel, we prove a bound on ‖ℓ_f‖_H that, as a byproduct, yields generalization bounds that match (up to a small constant) the standard ones. Second, beyond kernel methods, we show how MMD DRO generalizes variance-based regularization. 
Finally, we show how MMD DRO can be efficiently approximated empirically, and that this approximation in fact generalizes variance-based regularization.

Overall, our results offer deeper insights into the landscape of regularization and robustness approaches, and a more complete picture of the effects of different divergences for defining robustness. In short, our contributions are:

1. We prove fundamental structural results for MMD DRO, and its rough equivalence to penalizing by the Hilbert norm of the loss.
2. We give a new generalization proof for Gaussian kernel ridge regression by way of DRO. Along the way, we prove bounds on the Hilbert norm of products of functions that may be of independent interest.
3. Our generalization proof suggests a new regularizer for Gaussian kernel ridge regression.
4. We derive a computationally tractable approximation of MMD DRO, with application to general learning problems, and we show how the aforementioned approximation generalizes variance regularization.

2 Background and Related Work

Distributionally robust optimization (DRO) [16, 2], introduced by Scarf [32], asks to not only perform well on a fixed problem instance (parameterized by a distribution), but simultaneously for a range of problems, each determined by a distribution in an uncertainty set U. This results in more robust solutions. The uncertainty set plays a key role: it implicitly defines the induced notion of robustness. The DRO problem we address asks to learn a model f that solves

inf_f sup_{Q∈U} E_{x∼Q}[ℓ_f(x)],    (1) (DRO)

where ℓ_f(x) is the loss incurred under prediction f(x).

In this work, we focus on data-driven DRO, where U is centered around an empirical sample P̂_n = (1/n) ∑_{i=1}^n δ_{x_i}, and its size is determined in a data-dependent way. Data-driven DRO yields a natural approach for certifying out-of-sample performance.

Principle 2.1 (DRO Generalization Principle). 
Fix any model f. Let U be a set of distributions containing P̂_n. Suppose U is large enough so that, with probability 1 − δ, U contains the population P. Then with probability 1 − δ, the population loss E_{x∼P}[ℓ_f(x)] is bounded by

E_{x∼P}[ℓ_f(x)] ≤ sup_{Q∈U} E_{x∼Q}[ℓ_f(x)].    (2)

Essentially, if the uncertainty set U is chosen appropriately, the corresponding DRO problem gives a high probability bound on population performance. The two key steps in using Principle 2.1 are: 1. arguing that U actually contains P with high probability (e.g. via concentration); 2. solving the DRO problem on the right hand side, or an upper bound thereof.

In practice, U is typically chosen as a ball of radius ε around the empirical sample P̂_n: U = {Q : d(Q, P̂_n) ≤ ε}. Here, d is a discrepancy between distributions, and is of utmost significance: the choice of d determines how large ε must be, and how tractable the DRO problem is.

In machine learning, two choices of the divergence d are prevalent: φ-divergences [1, 11, 22] and Wasserstein distance [28, 33, 6]. The first option, φ-divergences, have the form d_φ(P, Q) = ∫ φ(dP/dQ) dQ. In particular, they include the χ²-divergence, which makes the DRO problem equivalent to regularizing by variance [18, 22, 31]. Beyond better generalization, variance regularization has applications in fairness [20]. However, a major shortcoming of DRO with φ-divergences is that the ball U = {Q : d_φ(Q, P₀) ≤ ε} only contains distributions Q whose support is contained in the support of P₀. If P₀ = P̂_n is an empirical distribution on n points, the ball U only contains distributions with the same finite support. Hence, the population distribution P typically cannot belong to U, and it is not possible to certify out-of-sample performance by Principle 2.1. 
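The contrast between this support restriction and a kernel distance can be seen in a small numerical example. The direction of the χ²-divergence and the Gaussian kernel with bandwidth 1 are assumptions made for illustration here, not choices dictated by the works above.

```python
import numpy as np

def chi2_divergence(p0, q):
    """Chi-square divergence of q from the base distribution p0, for discrete
    distributions on a shared index set; infinite as soon as q puts mass
    anywhere p0 does not (q not absolutely continuous w.r.t. p0)."""
    if np.any((q > 0) & (p0 == 0)):
        return np.inf
    m = p0 > 0
    return np.sum((q[m] - p0[m]) ** 2 / p0[m])

def mmd_gaussian(x, y, sigma=1.0):
    """Biased empirical MMD between 1-D samples x, y with a Gaussian kernel."""
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))
    return np.sqrt(k(x, x).mean() - 2 * k(x, y).mean() + k(y, y).mean())

x = np.array([0.0, 1.0, 2.0])              # support of the empirical sample
y = x + 0.01                               # nearby points, disjoint support
p0 = np.array([1/3, 1/3, 1/3, 0, 0, 0])    # P_0 on the union of both supports
q = np.array([0, 0, 0, 1/3, 1/3, 1/3])     # Q concentrated on the new points
print(chi2_divergence(p0, q))  # inf: Q lies outside every chi-square ball
print(mmd_gaussian(x, y))      # small and finite: Q is close in MMD
```

Even though Q sits on points only 0.01 away from the sample, no φ-divergence ball of any radius contains it, while the MMD between the two samples is tiny.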
Though Principle 2.1 does not apply here, generalization bounds are still possible via other means [31].

The second option, Wasserstein distance, is based on a distance metric g on the data space. The p-Wasserstein distance W_p between measures µ, ν is given by W_p(µ, ν) = inf{ ∫ g(x, y)^p dγ(x, y) : γ ∈ Π(µ, ν) }^{1/p}, where Π(µ, ν) is the set of couplings of µ and ν [40]. Wasserstein DRO has a key benefit over φ-divergences: the set U = {Q : W_p(Q, P₀) ≤ ε} contains continuous distributions. Moreover, concentration results bounding W_p(P, P̂_n) with high probability are available for many settings, e.g. [13, 23, 34, 41]. However, Wasserstein distance is much harder to work with, and nontrivial assumptions are needed to derive the necessary structural and algorithmic results for solving the associated DRO problem. To our knowledge, in all Wasserstein DRO work so far, the ground metric g is limited to slight variations of either a Euclidean or Mahalanobis metric [7, 8]. Such metrics may be a poor fit for complex data such as images or distributions. These assumptions restrict the extent to which Wasserstein DRO can utilize complex, nonlinear structure in the data.

Maximum Mean Discrepancy (MMD). MMD is a distance metric between distributions that leverages kernel embeddings. Let H be a reproducing kernel Hilbert space (RKHS) with kernel k and norm ‖·‖_H. MMD is defined as follows:

Definition 2.1. The maximum mean discrepancy (MMD) between distributions P and Q is

d_MMD(P, Q) := sup_{f∈H : ‖f‖_H ≤ 1} E_{x∼P}[f(x)] − E_{x∼Q}[f(x)].    (3)

Fact 2.1. Define the mean embedding µ_P of the distribution P by µ_P = E_{x∼P}[k(x, ·)]. Then the MMD between distributions P and Q can be equivalently written

d_MMD(P, Q) = ‖µ_P − µ_Q‖_H.    (4)

MMD and (more generally) kernel mean embeddings have been used in many applications, particularly in two- and one-sample tests [19, 21, 25, 9] and in generative modeling [12, 24, 38, 5]. We refer the interested reader to the monograph by Muandet et al. [30]. MMD admits efficient estimation, as well as fast convergence properties, which are of chief importance in our work.

Further related work. Beyond φ-divergences and Wasserstein distances, work in operations research has considered DRO problems that capture uncertainty in moments of the distribution, e.g. [10]. These approaches typically focus on first- and second-order moments; in contrast, an MMD uncertainty set allows high-order moments to vary, depending on the choice of kernel.

Robust and adversarial machine learning have strong connections to our work and DRO more generally. Robustness to adversarial examples [39, 17], where individual inputs to the model are perturbed in a small ball, can be cast as a robust optimization problem [26]. When the ball is a norm ball, this robust formulation is a special case of Wasserstein DRO [35, 36]. Xu et al. [42] study the connection between robustness and regularization in SVMs, and perturbations within a (possibly Hilbert) norm ball. Unlike our work, their results are limited to SVMs instead of general loss minimization. Moreover, they consider only perturbation of individual data points instead of shifts in the entire distribution. Bietti et al. 
[4] show that many regularizers used for neural networks can also be interpreted in light of an appropriately chosen Hilbert norm [3].

3 Generalization bounds via MMD DRO

The main focus of this paper is distributionally robust optimization where the uncertainty set is defined via the MMD distance d_MMD:

inf_f sup_{Q : d_MMD(Q, P̂_n) ≤ ε} E_{x∼Q}[ℓ_f(x)].    (5)

One motivation for considering MMD in this setting is its possible implications for generalization. Recall that for the DRO Generalization Principle 2.1 to apply, the uncertainty set U must contain the population distribution with high probability. To ensure this, the radius of U must be large enough. But the larger the radius, the more pessimistic is the DRO minimax problem, which may lead to over-regularization. This radius depends on how quickly d_MMD(P, P̂_n) shrinks to zero, i.e., on the empirical accuracy of the divergence.

In contrast to Wasserstein distance, which converges at a rate of O(n^{−1/d}) [13], the MMD between the empirical sample P̂_n and the population P shrinks as O(n^{−1/2}):

Lemma 3.1 (Modified from [30], Theorem 3.4). Suppose that k(x, x) ≤ M for all x. Let P̂_n be an n-sample empirical approximation to P. Then with probability 1 − δ,

d_MMD(P, P̂_n) ≤ 2√(M/n) + √(2 log(1/δ)/n).    (6)

The constant M is dimension-independent for many common universal kernels, e.g. Gaussian, Laplace, and Matérn kernels. With Lemma 3.1 in hand, we conclude a simple high-probability bound on out-of-sample performance:

Corollary 3.1. Suppose that k(x, x) ≤ M for all x. Set ε = 2√(M/n) + √(2 log(1/δ)/n). Then with probability 1 − δ, we have the following bound on population risk:

E_{x∼P}[ℓ_f(x)] ≤ sup_{Q : d_MMD(Q, P̂_n) ≤ ε} E_{x∼Q}[ℓ_f(x)].    (7)

We refer to the right hand side as the DRO adversary's problem. In the next section we develop results that enable us to bound its value, and consequently bound the DRO problem (5).

3.1 Bounding the DRO adversary's problem

The DRO adversary's problem seeks the distribution Q in the MMD ball for which E_{x∼Q}[ℓ_f(x)] is as high as possible. Reasoning about the optimal worst-case Q is the main difficulty in DRO. With MMD, we take two steps to simplify it. First, instead of directly optimizing over distributions, we optimize over their mean embeddings in the Hilbert space (described in Fact 2.1). Second, while the adversary's problem (7) makes sense for general ℓ_f, we assume that the loss ℓ_f is in H. In case ℓ_f ∉ H, often k is a universal kernel, meaning that under mild conditions ℓ_f can be approximated arbitrarily well by a member of H [30, Definition 3.3].

With the additional assumption that ℓ_f ∈ H, the risk E_{x∼P}[ℓ_f(x)] can also be written as ⟨ℓ_f, µ_P⟩_H. Then we obtain

sup_{Q : d_MMD(Q, P) ≤ ε} E_{x∼Q}[ℓ_f(x)] ≤ sup_{µ_Q∈H : ‖µ_Q − µ_P‖_H ≤ ε} ⟨ℓ_f, µ_Q⟩_H,    (8)

where we have an inequality because not every function in H is the mean embedding of some probability distribution. If k is a characteristic kernel [30, Definition 3.2], the mapping P ↦ µ_P is injective. In this case, the only looseness in the bound is due to discarding the constraints that Q integrates to one and is nonnegative. 
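The identity E_{x∼P}[ℓ_f(x)] = ⟨ℓ_f, µ_P⟩_H behind this relaxation can be checked numerically for an empirical measure and a function in the span of the kernel. The centers, coefficients, and bandwidth below are arbitrary illustrative choices.

```python
import numpy as np

def k(a, b, sigma=1.0):
    """Pairwise Gaussian kernel matrix with entries k_sigma(a_i, b_j)."""
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * sigma**2))

rng = np.random.default_rng(0)
z = rng.normal(size=5)     # centers of an RKHS function f = sum_j c_j k(z_j, .)
c = rng.normal(size=5)     # its coefficients
x = rng.normal(size=200)   # sample defining the empirical measure P_hat_n

# E_{x ~ P_hat_n}[f(x)], computed directly ...
risk_direct = (k(x, z) @ c).mean()
# ... and as <f, mu_P>_H with mu_P = (1/n) sum_i k(x_i, .), using the
# reproducing property <k(z_j, .), k(x_i, .)>_H = k(z_j, x_i).
risk_embedding = c @ k(z, x).mean(axis=1)

print(np.isclose(risk_direct, risk_embedding))  # True
```

For an empirical measure the two computations are algebraically identical; the value of the embedding view is that it turns the adversary's search over distributions into a search over vectors in a Hilbert ball.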
However, it is difficult to constrain the mean embedding µ_Q in this way, as it is a function. The mean embedding form of the problem is simpler to work with, and leads to further interpretations.

Theorem 3.1. Let ℓ_f, µ_P ∈ H. We have the following equality:

sup_{µ_Q∈H : ‖µ_Q − µ_P‖_H ≤ ε} ⟨ℓ_f, µ_Q⟩_H = ⟨ℓ_f, µ_P⟩_H + ε‖ℓ_f‖_H = E_{x∼P}[ℓ_f(x)] + ε‖ℓ_f‖_H.    (9)

In particular, the adversary's optimal solution is µ*_Q = µ_P + (ε/‖ℓ_f‖_H) ℓ_f.

Combining Theorem 3.1 with equation (8) yields our main result for this section:

Corollary 3.2. Let ℓ_f ∈ H, let P be a probability distribution, and fix ε > 0. Then

sup_{Q : d_MMD(P, Q) ≤ ε} E_{x∼Q}[ℓ_f(x)] ≤ E_{x∼P}[ℓ_f(x)] + ε‖ℓ_f‖_H,    (10)

and therefore

inf_f sup_{Q : d_MMD(P, Q) ≤ ε} E_{x∼Q}[ℓ_f(x)] ≤ inf_f E_{x∼P}[ℓ_f(x)] + ε‖ℓ_f‖_H.    (11)

Combining Corollary 3.2 with Corollary 3.1 shows that minimizing the empirical risk plus a norm on ℓ_f leads to a high-probability bound on out-of-sample performance. This result is similar to results that equate Wasserstein DRO to norm regularization. For example, Gao et al. [15] show that, under appropriate assumptions on ℓ_f, DRO with a p-Wasserstein ball is asymptotically equivalent to E_{x∼P̂_n}[ℓ_f(x)] + ε‖∇_x ℓ_f‖_{P̂_n,q}, where ‖∇_x ℓ_f‖_{P̂_n,q} = ((1/n) ∑_{i=1}^n ‖∇_x ℓ_f(x_i)‖_*^q)^{1/q} measures a kind of q-norm average of ‖∇_x ℓ_f(x_i)‖_* at each data point x_i (here q is such that 1/p + 1/q = 1, and ‖·‖_* is the dual norm of the metric defining the Wasserstein distance).

There are a few key differences between our result and that of Gao et al. [15]. First, the norms are different. Second, their result penalizes only the gradient of ℓ_f, while ours penalizes ℓ_f directly. Third, except for certain special cases, the Wasserstein results cannot serve as a true upper bound; there are higher-order terms that only shrink to zero as ε → 0. These higher-order terms may not be so small: in high dimension d, the radius ε of the uncertainty set needed so that P ∈ U shrinks very slowly, as O(n^{−1/d}) [13].

Remark 3.1. Theorem 3.1 and Corollary 3.2 require that ℓ_f is in the RKHS H. Though this may seem restrictive, if the kernel k is universal, as is the case for many kernels used in practice such as Gaussian and Laplace kernels, we can readily extend our results to all bounded continuous functions. Suppose ℓ_f is a bounded continuous function on a compact metric space X. By definition (e.g. [30], Definition 3.3), if k is a universal kernel on X, then for any ε > 0, there is some ℓ′ ∈ H with sup_{x∈X} |ℓ_f(x) − ℓ′(x)| < ε. It follows that for any measure P, we can bound the expectation of ℓ_f(x) by that of ℓ′: E_{x∼P}[ℓ_f(x)] < E_{x∼P}[ℓ′(x)] + ε. 
Then, we can apply our results to ℓ′ ∈ H.

4 Connections to kernel ridge regression

After applying Corollary 3.2, we are interested in solving:

inf_f E_{x∼P̂_n}[ℓ_f(x)] + ε‖ℓ_f‖_H.    (12)

Here, we penalize our model f by ‖ℓ_f‖_H. This looks similar to, but is very different from, the usual penalty ‖f‖_H in kernel methods. In fact, Hilbert norms of function compositions such as ℓ_f pose several challenges. For example, f and ℓ_f may not belong to the same RKHS; it is not hard to construct counterexamples, even when ℓ is merely quadratic. So, the objective (12) is not yet computational.

Despite these challenges, we next develop tools that will allow us to bound ‖ℓ_f‖_H and use it as a regularizer. These tools may be of independent interest for bounding RKHS norms of composite functions (e.g., for settings as in [4]). Due to the difficulty of this task, we specialize to Gaussian kernels k_σ(x, y) = exp(−‖x − y‖²/(2σ²)). Since we will need to take care regarding the bandwidth σ, we explicitly write it out for the inner product ⟨·,·⟩_σ and norm ‖·‖_σ of the corresponding RKHS H_σ.

To make the setting concrete, consider kernel ridge regression with Gaussian kernel k_σ. As usual, we assume there is a simple target function h that fits our data: h(x_i) = y_i. Then the loss ℓ_f of f is ℓ_f(x) = (f(x) − h(x))², so we wish to solve

inf_f E_{x∼P̂_n}[(f(x) − h(x))²] + ε‖(f − h)²‖_σ.    (13)

4.1 Bounding norms of products

To bound ‖(f − h)²‖_σ, it will suffice to bound RKHS norms of products. The key result for this subsection is the following deceptively simple-looking bound:

Theorem 4.1. Let f, g ∈ H_σ, that is, the RKHS corresponding to the Gaussian kernel k_σ of bandwidth σ. Then ‖fg‖_{σ/√2} ≤ ‖f‖_σ ‖g‖_σ.

Indeed, there are already subtleties: if f, g ∈ H_σ, then, to discuss the norm of the product fg, we need to decrease the bandwidth from σ to σ/√2.

We prove Theorem 4.1 via two steps. First, we represent the functions f, g, and fg exactly in terms of traces of certain matrices. This step is highly dependent on the specific structure of the Gaussian kernel. Then, we can apply standard trace inequalities. Proofs of both results are given in Appendix B.

Proposition 4.1. Let f, g ∈ H_σ have expansions f = ∑_i a_i k_σ(x_i, ·) and g = ∑_j b_j k_σ(x_j, ·). For shorthand, denote by z_i = φ_{√2σ}(x_i) the (possibly infinite) feature expansion of x_i in H_{√2σ}. Then

‖f‖²_σ = tr(A²),  ‖g‖²_σ = tr(B²),  and  ‖fg‖²_{σ/√2} = tr(A²B²),

where A = ∑_i a_i z_i z_iᵀ and B = ∑_j b_j z_j z_jᵀ.

Lemma 4.1. Let X, Y be symmetric and positive semidefinite. Then tr(XY) ≤ tr(X) tr(Y).

With these intermediate results in hand, we can prove the main bound of interest:

Proof of Theorem 4.1. By Proposition 4.1, we may write

‖f‖²_σ = tr(A²),  ‖g‖²_σ = tr(B²),  and  ‖fg‖²_{σ/√2} = tr(A²B²),

where A = ∑_i a_i z_i z_iᵀ and B = ∑_j b_j z_j z_jᵀ are chosen as described in Proposition 4.1. Since A and B are each symmetric, it follows that A² and B² are each symmetric and positive semidefinite. 
Then we can apply Lemma 4.1 to conclude that

‖fg‖²_{σ/√2} = tr(A²B²) ≤ tr(A²) tr(B²) = ‖f‖²_σ ‖g‖²_σ.

4.2 Implications: kernel ridge regression

With the help of Theorem 4.1, we can develop DRO-based bounds for actual learning problems. In this section we develop such bounds for Gaussian kernel ridge regression, i.e. problem (13). For shorthand, we write R_Q(f) = E_{x∼Q}[ℓ_f(x)] = E_{x∼Q}[(f(x) − h(x))²] for the risk of f on a distribution Q. Generalization amounts to proving that the population risk R_P(f) is not too different from the empirical risk R_{P̂_n}(f).

Theorem 4.2. Assume the target function h satisfies ‖h²‖_{σ/√2} ≤ Λ_{h²} and ‖h‖_σ ≤ Λ_h. Then, for any δ > 0, with probability 1 − δ, the following holds for all functions f satisfying ‖f²‖_{σ/√2} ≤ Λ_{f²} and ‖f‖_σ ≤ Λ_f:

R_P(f) ≤ R_{P̂_n}(f) + (2/√n)(1 + √(log(1/δ)/2))(Λ_{f²} + Λ_{h²} + 2Λ_f Λ_h).    (14)

Proof. We utilize the DRO Generalization Principle 2.1. By Lemma 3.1 we know that with probability 1 − δ, d_MMD(P̂_n, P) ≤ ε for ε = (2 + √(2 log(1/δ)))/√n, since k_σ(x, x) ≤ M = 1. Note the bandwidth σ does not affect the convergence result. As a result of Lemma 3.1, with probability 1 − δ:

R_P(f) = E_{x∼P}[(f(x) − h(x))²]    (15)
  (a) ≤ E_{x∼P̂_n}[(f(x) − h(x))²] + ε‖(f − h)²‖_{σ/√2}    (16)
  (b) ≤ R_{P̂_n}(f) + ε(‖f²‖_{σ/√2} + ‖h²‖_{σ/√2} + 2‖fh‖_{σ/√2})    (17)
  (c) ≤ R_{P̂_n}(f) + ε(Λ_{f²} + Λ_{h²} + 2Λ_f Λ_h),    (18)

where (a) is by Corollary 3.2, (b) is by the triangle inequality, and (c) follows from Theorem 4.1 and our assumptions on f and h. Plugging in the bound on ε yields the result.

We placed different bounds on each of f, h, f², h² to emphasize the dependence on each. Since each is bounded separately, the DRO-based bound in Theorem 4.2 allows finer control of the complexity of the function class than is typical. Since, by Theorem 4.1, the norms of f², h², and fh are bounded by those of f and h, we may also state Theorem 4.2 just with ‖f‖_σ and ‖h‖_σ.

Corollary 4.1. Assume the target function h satisfies ‖h‖_σ ≤ Λ. Then, for any δ > 0, with probability 1 − δ, the following holds for all functions f satisfying ‖f‖_σ ≤ Λ:

R_P(f) ≤ R_{P̂_n}(f) + (8Λ²/√n)(1 + √(log(1/δ)/2)).    (19)

Proof. We reduce to Theorem 4.2. By Theorem 4.1, we know that ‖f²‖_{σ/√2} ≤ ‖f‖²_σ, which may be bounded above by Λ² (and similarly for h). Therefore we can take Λ_{f²} = Λ², Λ_f = Λ and Λ_{h²} = Λ², Λ_h = Λ in Theorem 4.2. The result follows by bounding

Λ_{f²} + Λ_{h²} + 2Λ_f Λ_h ≤ Λ² + Λ² + 2Λ·Λ = 4Λ².

Generalization bounds for kernel ridge regression are of course not new; we emphasize that the DRO viewpoint provides an intuitive approach that also grants finer control over the function complexity. Moreover, our results take essentially the same form as the typical generalization bounds for kernel ridge regression, reproduced below:

Theorem 4.3 (Specialized from [29], Theorem 10.7). Assume the target function h satisfies ‖h‖_σ ≤ Λ. Then, for any δ > 0, with probability 1 − δ, it holds for all functions f satisfying ‖f‖_σ ≤ Λ that

R_P(f) ≤ R_{P̂_n}(f) + (8Λ²/√n)(1 + (1/2)√(log(1/δ)/2)).    (20)

Hence, our DRO-based Theorem 4.2 evidently recovers standard results up to a universal constant.

4.3 Algorithmic implications

The generalization result in Theorem 4.3 is often used to justify penalizing by the norm ‖f‖_σ, since it is the only part of the bound (other than the risk R_{P̂_n}(f)) that depends on f. In contrast, our DRO-based generalization bound in Theorem 4.2 is of the form

R_P(f) − R_{P̂_n}(f) ≤ ε(‖f²‖_{σ/√2} + ‖h²‖_{σ/√2} + 2‖f‖_σ‖h‖_σ),    (21)

which depends on f through both norms ‖f‖_σ and ‖f²‖_{σ/√2}. This bound motivates the use of both norms as regularizers in kernel regression, i.e. we would instead solve

inf_{f∈H_σ} E_{(x,y)∼P̂_n}[(f(x) − y)²] + λ₁‖f‖_σ + λ₂‖f²‖_{σ/√2}.    (22)

Given data (x_i, y_i)_{i=1}^n, for kernel ridge regression, the Representer Theorem implies that it is sufficient to consider only f of the form f = ∑_{i=1}^n a_i k_σ(x_i, ·). Here this is not in general possible due to the norm of f². However, it is possible to evaluate and compute gradients of ‖f²‖²_{σ/√2}: let K be the matrix with K_ij = k_{√2σ}(x_i, x_j), and let D = diag(a). Using Proposition 4.1, we can prove ‖f²‖²_{σ/√2} = tr((DK)⁴). A complete proof is given in the appendix.

5 Approximation and connections to variance regularization

In the previous section we studied bounding the MMD DRO problem (5) via Hilbert norm penalization. Going beyond kernel methods, where we search over f ∈ H, it is even less clear how to evaluate the Hilbert norm ‖ℓ_f‖_H. To circumvent this issue, we next approach the DRO problem from a different angle: we directly search for the adversarial distribution Q. Along the way, we build connections to variance regularization [27, 18, 22, 31], where the empirical risk is regularized by the empirical variance of ℓ_f: Var_{P̂_n}(ℓ_f) = E_{x∼P̂_n}[ℓ_f(x)²] − E_{x∼P̂_n}[ℓ_f(x)]². In particular, we show in Theorem 5.1 that MMD DRO yields stronger regularization than variance.

Searching over all distributions Q in the MMD ball is intractable, so we restrict our attention to those with the same support {x_i}_{i=1}^n as the empirical sample P̂_n. All such distributions Q can be written as Q = ∑_{i=1}^n w_i δ_{x_i}, where w is in the n-dimensional simplex. 
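For such reweightings, the restricted adversary's value admits a closed form (Lemma 5.1 below), which specializes to a scaled standard deviation for structured kernel matrices (Lemma 5.2). Both are easy to sanity-check numerically; the losses and kernel matrix in this sketch are arbitrary illustrations.

```python
import numpy as np

def mmd_dro_penalty(ell, K):
    """The closed-form penalty of Lemma 5.1, without the leading epsilon:
    sqrt(ell' K^-1 ell - (ell' K^-1 1)^2 / (1' K^-1 1))."""
    one = np.ones_like(ell)
    Ki_ell = np.linalg.solve(K, ell)
    Ki_one = np.linalg.solve(K, one)
    return np.sqrt(ell @ Ki_ell - (ell @ Ki_one) ** 2 / (one @ Ki_one))

rng = np.random.default_rng(0)
n = 50
ell = rng.normal(size=n)                 # per-point losses ell_f(x_i)
a, b = 2.0, 0.5
K = a * np.eye(n) + b * np.ones((n, n))  # structured kernel of Lemma 5.2

lhs = mmd_dro_penalty(ell, K)
rhs = a ** -0.5 * np.sqrt(n) * ell.std()  # a^{-1/2} sqrt(n) sqrt(Var(ell_f))
print(np.isclose(lhs, rhs))  # True
```

With K = I (the limiting case discussed after Lemma 5.1), the same function returns √n times the sample standard deviation of the losses, i.e. exactly the variance regularizer.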
By restricting the set of candidate\ndistributions Q, we make the adversary weaker:\n\nas Q =(cid:80)n\n\n[(cid:96)f (x)2] \u2212 E\n\n((cid:96)f ) = E\n\nx\u223c\u02c6Pn\n\nx\u223c\u02c6Pn\n\nsupQ Ex\u223cQ[(cid:96)f (x)]\ns.t.\n\ndMMD(Q, \u02c6Pn) \u2264 \u0001\n\n\u2265\n\nsupw\ns.t.\n\n(cid:80)n\ndMMD((cid:80)n\n(cid:80)n\n\ni=1 wi = 1\n\ni=1 wi(cid:96)f (xi)\n\nwi \u2265 0 \u2200i = 1, . . . , n.\n\ni=1 wi\u03b4xi, \u02c6Pn) \u2264 \u0001\n\n(23)\n\nBy restricting the support of Q, it is no longer possible to guarantee out of sample performance, since\nit typically will have different support. Yet, as we will see, problem (23) has nice connections.\n\n7\n\n\fFigure 1: Comparison of the two regularizers (cid:107)f(cid:107)2\n(right) settings, across a parameter sweep of \u03bb. The x-axis is shifted to make comparison easier.\n\n\u03c3 and (cid:107)f 2(cid:107)\u03c3/\u221a2 in both the easy (left) and hard\n\nThe dMMD constraint is a quadratic penalty on v = w \u2212 1\nn(cid:88)\nde\ufb01nition of MMD:\n\n(cid:32) n(cid:88)\n\n(cid:33)2\n\nwi\u03b4xi, \u02c6Pn\n\n=\n\ndMMD\n\nwik(xi,\u00b7) \u2212\n\n1\nn\n\nk(xi,\u00b7)\n\ni=1\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n(cid:88)\n\ni=1\n\ni=1\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2\n\nH\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n(cid:88)\n\ni=1\n\n(cid:13)(cid:13)(cid:13)(cid:13)(cid:13)2\n\nH\n\nn 1, as one may see via the mean embedding\n\n=\n\nvik(xi,\u00b7)\n\n.\n\n(24)\n\nn 1)T K(w\u2212 1\n\nThe last term is vT Kv = (w\u2212 1\nn 1), where K is the kernel matrix with Kij = k(xi, xj).\nIf the radius \u0001 of the uncertainty set is small enough, the constraints wi \u2265 0 are inactive, and can be\nignored. By dropping these constraints, we can solve the adversary\u2019s problem in closed form:\nLemma 5.1. Let (cid:126)(cid:96) be the vector with i-th element (cid:96)f (xi). 
If $\epsilon$ is small enough that the constraints $w_i \ge 0$ are not active, then the optimal value of problem (23) is given by
$$\mathbb{E}_{x\sim\hat{P}_n}[\ell_f(x)] + \epsilon\sqrt{\vec{\ell}^{\,T}K^{-1}\vec{\ell} - \frac{(\vec{\ell}^{\,T}K^{-1}\mathbf{1})^2}{\mathbf{1}^T K^{-1}\mathbf{1}}}. \tag{25}$$

In other words, fitting a model to minimize the support-constrained approximation of MMD DRO is equivalent to penalizing by the nonconvex regularizer in Lemma 5.1. To better understand this regularizer, consider, for instance, the case where the kernel matrix $K$ equals the identity $I$. This happens, e.g., for a Gaussian kernel as the bandwidth $\sigma$ approaches zero. Then the regularizer equals
$$\epsilon\sqrt{\vec{\ell}^{\,T}K^{-1}\vec{\ell} - \frac{(\vec{\ell}^{\,T}K^{-1}\mathbf{1})^2}{\mathbf{1}^T K^{-1}\mathbf{1}}} = \epsilon\sqrt{\vec{\ell}^{\,T}\vec{\ell} - \frac{(\vec{\ell}^{\,T}\mathbf{1})^2}{\mathbf{1}^T\mathbf{1}}} = \epsilon\sqrt{n}\sqrt{\mathrm{Var}_{\hat{P}_n}(\ell_f)}. \tag{26}$$

In fact, this equivalence holds a bit more generally:

Lemma 5.2. Let $K = aI + b\mathbf{1}\mathbf{1}^T$. Then,
$$\sqrt{\vec{\ell}^{\,T}K^{-1}\vec{\ell} - \frac{(\vec{\ell}^{\,T}K^{-1}\mathbf{1})^2}{\mathbf{1}^T K^{-1}\mathbf{1}}} = a^{-1/2}\sqrt{n}\sqrt{\mathrm{Var}_{\hat{P}_n}(\ell_f)}.$$

As a consequence, we conclude that with the right choice of kernel $k$, MMD DRO is a stronger regularizer than variance:

Theorem 5.1. There exists a kernel $k$ so that MMD DRO bounds the variance-regularized problem:
$$\mathbb{E}_{x\sim\hat{P}_n}[\ell_f(x)] \;\le\; \mathbb{E}_{x\sim\hat{P}_n}[\ell_f(x)] + \epsilon\sqrt{n}\sqrt{\mathrm{Var}_{\hat{P}_n}(\ell_f)} \;\le \sup_{Q:\, d_{\mathrm{MMD}}(Q,\hat{P}_n)\le\epsilon} \mathbb{E}_{x\sim Q}[\ell_f(x)]. \tag{27}$$

6 Experiments

In subsection 4.3 we proposed an alternate regularizer for kernel ridge regression, specifically penalizing $\|f^2\|_{\sigma/\sqrt{2}}$ instead of $\|f\|_\sigma^2$.
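This penalty can be evaluated directly from the kernel matrix via the trace formula of subsection 4.3. A minimal numpy sketch follows; the Gaussian-kernel normalization below is our assumption, and the identity $\|f^2\|^2_{\sigma/\sqrt{2}} = \mathrm{tr}((DK)^4)$ is used as stated in the text:

```python
import numpy as np

def gaussian_kernel(X, sigma):
    """k_sigma(x, z) = exp(-||x - z||^2 / (2 sigma^2)); normalization is an assumption."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def f2_norm_sq(a, X, sigma):
    """||f^2||^2_{sigma/sqrt(2)} = tr((D K)^4) for f = sum_i a_i k_sigma(x_i, .),
    where K_ij = k_{sqrt(2) sigma}(x_i, x_j) and D = diag(a) (formula as in the text)."""
    K = gaussian_kernel(X, np.sqrt(2) * sigma)
    DK = np.diag(a) @ K
    return np.trace(np.linalg.matrix_power(DK, 4))
```

Since $\mathrm{tr}((DK)^4)$ is a polynomial in the coefficients $a$, the penalty and its gradient are cheap to evaluate, so fitting with it reduces to smooth unconstrained optimization over $a$ (e.g. under an autodiff framework).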
Here we probe the new regularizer on a synthetic problem where we can precisely compute the population risk $R_P(f)$. Consider the Gaussian kernel $k_\sigma$ with $\sigma = 1$. Fix the ground truth $h = k_\sigma(1,\cdot) - k_\sigma(-1,\cdot) \in \mathcal{H}_\sigma$. Sample $10^4$ points from a standard one-dimensional Gaussian, and set this as the population $P$. Then subsample $n$ points $x_i$ and set $y_i = h(x_i) + \epsilon_i$, where the $\epsilon_i$ are Gaussian. We consider both an easy regime, where $n = 10^3$ and $\mathrm{Var}(\epsilon_i) = 10^{-2}$, and a hard regime, where $n = 10^2$ and $\mathrm{Var}(\epsilon_i) = 1$. On the empirical data, we fit $f \in \mathcal{H}_\sigma$ by minimizing square loss plus either $\lambda\|f\|_\sigma^2$ (as is typical) or $\lambda\|f^2\|_{\sigma/\sqrt{2}}$ (our proposal). We average over $10^2$ resampling trials for the easy case and $10^3$ for the hard case, and report 95% confidence intervals. Figure 1 shows the result in each case for a parameter sweep over $\lambda$. If $\lambda$ is tuned properly, the tighter regularizer $\|f^2\|_{\sigma/\sqrt{2}}$ yields better performance in both cases. It also appears that the regularizer $\|f^2\|_{\sigma/\sqrt{2}}$ is less sensitive to the choice of $\lambda$: performance decays slowly when $\lambda$ is too low.

7 Conclusion

We introduce MMD DRO, distributionally robust optimization with maximum mean discrepancy uncertainty sets. We prove fundamental structural results and upper bounds for MMD DRO, and unearth deep connections, in particular to Gaussian kernel ridge regression and variance regularization. Several open questions remain.
In terms of theory, our MMD DRO approach to generalization bounds\nleaves much new ground to explore. In particular, we conjecture that our approach might also work\nfor ridge regression with non-Gaussian kernels. Practically, there is also much left to do to make\nMMD DRO a general purpose tool. We have presented two approximations of MMD DRO, each\nwith strengths and drawbacks: the upper bound in Corollary 3.2 enables our kernel ridge regression\ngeneralization bound, but is potentially loose, and is dif\ufb01cult to use more generally because the\nHilbert norm is tricky to compute; the discrete approximation in Section 5 is more practical but is\nnot an upper bound on the MMD DRO problem. Future work could address these drawbacks, or\npotentially develop a tractable exact reformulation of the DRO problem.\n\nAcknowledgements\n\nThis work was supported by The Defense Advanced Research Projects Agency (grant number YFA17\nN66001-17-1-4039). The views, opinions, and/or \ufb01ndings contained in this article are those of the\nauthor and should not be interpreted as representing the of\ufb01cial views or policies, either expressed\nor implied, of the Defense Advanced Research Projects Agency or the Department of Defense. We\nthank Cameron Musco and Joshua Robinson for helpful conversations, and Marwa El Halabi and\nSebastian Claici for comments on the draft.\n\nReferences\n[1] Aharon Ben-Tal, Dick den Hertog, Anja De Waegenaere, Bertrand Melenberg, and Gijs Rennen. Robust\nsolutions of optimization problems affected by uncertain probabilities. Management Science, 59(2):\n341\u2013357, 2013.\n\n[2] Dimitris Bertsimas, Vishal Gupta, and Nathan Kallus. Data-driven robust optimization. Mathematical\n\nProgramming, 167(2):235\u2013292, Feb 2018.\n\n[3] Alberto Bietti and Julien Mairal. Group invariance, stability to deformations, and complexity of deep\n\nconvolutional representations. 
The Journal of Machine Learning Research, 20(1):876–924, 2019.

[4] Alberto Bietti, Grégoire Mialon, Dexiong Chen, and Julien Mairal. A kernel perspective for regularizing deep neural networks. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.

[5] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In International Conference on Learning Representations, 2018.

[6] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust Wasserstein profile inference and applications to machine learning. arXiv preprint arXiv:1610.05627, 2016.

[7] Jose Blanchet, Yang Kang, Fan Zhang, and Karthyek Murthy. Data-driven optimal transport cost selection for distributionally robust optimization. arXiv preprint arXiv:1705.07152, 2017.

[8] Jose Blanchet, Karthyek Murthy, and Fan Zhang. Optimal transport based distributionally robust optimization: Structural properties and iterative schemes. arXiv preprint arXiv:1810.02403, 2018.

[9] Kacper Chwialkowski, Heiko Strathmann, and Arthur Gretton. A kernel test of goodness of fit. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2606–2615, New York, New York, USA, 20–22 Jun 2016. PMLR.

[10] Erick Delage and Yinyu Ye. Distributionally robust optimization under moment uncertainty with application to data-driven problems. Operations Research, 58(3):595–612, 2010.

[11] John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv preprint arXiv:1610.03425, 2016.

[12] Gintare Karolina Dziugaite, Daniel M. Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization.
In Proceedings of the Thirty-First Conference on Uncertainty\nin Arti\ufb01cial Intelligence, UAI\u201915, pages 258\u2013267, Arlington, Virginia, United States, 2015. AUAI Press.\nISBN 978-0-9966431-0-8.\n\n[13] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in Wasserstein distance of the empirical\n\nmeasure. Probability Theory and Related Fields, 162(3):707\u2013738, Aug 2015.\n\n[14] Rui Gao and Anton J Kleywegt. Distributionally robust stochastic optimization with Wasserstein distance.\n\narXiv preprint arXiv:1604.02199, 2016.\n\n[15] Rui Gao, Xi Chen, and Anton J Kleywegt. Wasserstein distributional robustness and regularization in\n\nstatistical learning. arXiv preprint arXiv:1712.06050, 2017.\n\n[16] Joel Goh and Melvyn Sim. Distributionally robust optimization and its tractable approximations. Operations\n\nResearch, 58(4-part-1):902\u2013917, 2010.\n\n[17] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples.\n\nIn International Conference on Learning Representations, 2015.\n\n[18] Jun-ya Gotoh, Michael Kim, and Andrew Lim. Robust Empirical Optimization is Almost the Same As\n\nMean-Variance Optimization. Available at SSRN 2827400, 2015.\n\n[19] Arthur Gretton, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Sch\u00f6lkopf, and Alexander Smola. A\n\nkernel two-sample test. Journal of Machine Learning Research, 13:723\u2013773, March 2012.\n\n[20] Tatsunori Hashimoto, Megha Srivastava, Hongseok Namkoong, and Percy Liang. Fairness without\ndemographics in repeated loss minimization. In Jennifer Dy and Andreas Krause, editors, Proceedings of\nthe 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning\nResearch, pages 1929\u20131938, Stockholmsm\u00e4ssan, Stockholm Sweden, 10\u201315 Jul 2018. PMLR.\n\n[21] Wittawat Jitkrittum, Wenkai Xu, Zoltan Szabo, Kenji Fukumizu, and Arthur Gretton. A linear-time kernel\ngoodness-of-\ufb01t test. In I. Guyon, U. 
V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan,\nand R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 262\u2013271. Curran\nAssociates, Inc., 2017.\n\n[22] Henry Lam. Robust Sensitivity Analysis for Stochastic Systems. Mathematics of Operations Research, 41\n\n(4):1248\u20131275, 2016.\n\n[23] Jing Lei. Convergence and concentration of empirical measures under wasserstein distance in unbounded\n\nfunctional spaces. arXiv preprint arXiv:1804.10556, 2018.\n\n[24] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In Francis Bach and\nDavid Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of\nProceedings of Machine Learning Research, pages 1718\u20131727, Lille, France, 07\u201309 Jul 2015. PMLR.\n\n[25] Qiang Liu, Jason Lee, and Michael Jordan. A kernelized Stein discrepancy for goodness-of-\ufb01t tests. In\nMaria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference\non Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 276\u2013284, New\nYork, New York, USA, 20\u201322 Jun 2016. PMLR.\n\n[26] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. To-\nwards deep learning models resistant to adversarial attacks. In International Conference on Learning\nRepresentations, 2018.\n\n[27] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization.\n\nIn Conference on Learning Theory, 2009.\n\n[28] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the\nWasserstein metric: performance guarantees and tractable reformulations. Mathematical Programming,\n171(1):115\u2013166, Sep 2018.\n\n[29] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. 
MIT Press, 2018.

[30] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends® in Machine Learning, 10(1-2):1–141, 2017.

[31] Hongseok Namkoong and John C. Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems 30, pages 2975–2984, 2017.

[32] Herbert Scarf. A min-max solution of an inventory problem. Studies in the Mathematical Theory of Inventory and Production, 1958.

[33] Soroosh Shafieezadeh Abadeh, Peyman Mohajerin Esfahani, and Daniel Kuhn. Distributionally robust logistic regression. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28, pages 1576–1584. Curran Associates, Inc., 2015.

[34] Shashank Singh and Barnabás Póczos. Minimax distribution estimation in Wasserstein distance. arXiv preprint arXiv:1802.08855, 2018.

[35] Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In International Conference on Learning Representations, 2018.

[36] Matthew Staib and Stefanie Jegelka. Distributionally robust deep learning as a generalization of adversarial training. In NIPS Machine Learning and Computer Security Workshop, 2017.

[37] Matthew Staib, Bryan Wilder, and Stefanie Jegelka. Distributionally robust submodular maximization. In Kamalika Chaudhuri and Masashi Sugiyama, editors, Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics, volume 89 of Proceedings of Machine Learning Research, pages 506–516. PMLR, 16–18 Apr 2019.

[38] Dougal J. Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton.
Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations, 2017.

[39] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

[40] Cédric Villani. Optimal Transport: Old and New (Grundlehren der mathematischen Wissenschaften). Springer, 2008. ISBN 9788793102132.

[41] Jonathan Weed and Francis Bach. Sharp asymptotic and finite-sample rates of convergence of empirical measures in Wasserstein distance. Bernoulli, 25(4A):2620–2648, 2019.

[42] Huan Xu, Constantine Caramanis, and Shie Mannor. Robustness and regularization of support vector machines. Journal of Machine Learning Research, 10(Jul):1485–1510, 2009.