{"title": "A General Projection Property for Distribution Families", "book": "Advances in Neural Information Processing Systems", "page_first": 2232, "page_last": 2240, "abstract": "We prove that linear projections between distribution families with fixed first and second moments are surjective, regardless of dimension. We further extend this result to families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in different fields. One discovery is that portfolio selection under the worst-case value-at-risk and conditional value-at-risk criteria yields identical portfolios.", "full_text": "A General Projection Property for Distribution Families

Yao-Liang Yu, Yuxi Li, Dale Schuurmans, Csaba Szepesvári
{yaoliang,yuxi,dale,szepesva}@cs.ualberta.ca
Department of Computing Science, University of Alberta
Edmonton, AB, T6G 2E8 Canada

Abstract

Surjectivity of linear projections between distribution families with fixed mean and covariance (regardless of dimension) is re-derived by a new proof. We further extend this property to distribution families that respect additional constraints, such as symmetry, unimodality and log-concavity. By combining our results with classic univariate inequalities, we provide new worst-case analyses for natural risk criteria arising in classification, optimization, portfolio selection and Markov decision processes.

1 Introduction

In real applications, the model of the problem at hand inevitably embodies some form of uncertainty: the parameters of the model are usually (roughly) estimated from data, which themselves can be uncertain due to various kinds of noise. For example, in finance, the return of a financial product can seldom be known exactly beforehand. 
Despite this uncertainty, one still usually has to take action in the underlying application. However, due to uncertainty, any attempt to behave "optimally" in the world must take into account plausible alternative models.

Focusing on problems where uncertain data/parameters are treated as random variables and the model consists of a joint distribution over these variables, we initially assume prior knowledge that the first and second moments of the underlying distribution are known, but the distribution is otherwise arbitrary. A parametric approach to handling uncertainty in such a setting would be to fit a specific parametric model to the known moments and then apply stochastic programming techniques to solve for an optimal decision. For example, fitting a Gaussian model to the constraints would be a popular choice. However, such a parametric strategy can be too bold, hard to justify, and might incur significant loss if the fitted distribution does not match the true underlying distribution very well. A conservative, but more robust, approach would be to take a decision that is "protected" in the worst-case sense; that is, one that behaves optimally assuming nature has the freedom to choose an adverse distribution. Such a minimax formulation has been studied in several fields [1; 2; 3; 4; 5; 6] and is also the focus of this paper. Although Bayesian optimal decision theory is a rightfully well-established approach for decision making under uncertainty, minimax has proved to be a useful alternative in many domains, such as finance, where it is difficult to formulate appropriate priors over models. In these fields, the minimax formulation combined with stochastic programming [7] has been extensively studied and successfully applied.

We make a contribution to minimax probability theory and apply the results to problems arising in four different areas. 
Specifically, we generalize a classic result on the linear projection property of distribution families: we show that any linear projection between distribution families with fixed mean and covariance, regardless of their dimensions, is surjective. That is, given any matrix X and any random vector r with mean Xᵀμ and covariance XᵀΣX, one can always find another random vector R with mean μ and covariance Σ such that XᵀR = r (almost surely). Our proof imposes no conditions on the deterministic matrix X, hence it extends the classic projection result in [6], which assumes X is a vector. We furthermore extend this surjectivity property to some restricted distribution families, which allows additional prior information to be incorporated and hence less conservative solutions to be obtained. In particular, we prove that surjectivity of linear projections continues to hold for distribution families that are additionally symmetric, log-concave, or symmetric linear unimodal. In each case, our proof strategy allows one to construct the worst-case distribution(s).

An immediate application of these results is to reduce worst-case analyses of multivariate expectations to univariate (or reduced multivariate) ones, which have long been studied and have produced many fruitful results. In this direction, we conduct worst-case analyses of some common restricted distribution families. We illustrate our results on problems that incorporate a classic worst-case value-at-risk constraint: minimax probability classification [2]; chance constrained linear programming (CCLP) [3]; portfolio selection [4]; and Markov decision processes (MDPs) with reward uncertainty [8]. Although some of the results we obtain have been established in the respective fields [2; 3; 4], we unify them through a much simpler proof strategy. 
Additionally, we provide extensions to other constrained distribution families, which makes the minimax formulation less conservative in each case. These results are then extended to the more recent conditional value-at-risk constraint, and new bounds are proved, including a new bound on the survival function for symmetric unimodal distributions.

2 A General Projection Property

First we establish a generalized linear projection property for distribution families. The key application will be to reduce worst-case multivariate stochastic programming problems to lower dimensional equivalents; see Corollary 1. Popescu [6] has proved the special case of reduction to one dimension; however, we provide a simpler proof that can be more easily extended to other distribution families.¹

Let (μ, Σ) denote the family of distributions sharing common mean μ and covariance Σ, and let μ_X = Xᵀμ and Σ_X = XᵀΣX. Below we denote random variables by boldface letters, and use I to denote the identity matrix. We use † to denote the pseudo-inverse.

Theorem 1 (General Projection Property (GPP)) For all μ, Σ ⪰ 0, and X ∈ R^{m×d}, the projection XᵀR = r from m-variate distributions R ∼ (μ, Σ) to d-variate distributions r ∼ (μ_X, Σ_X) is surjective and many-to-one. That is, every r ∼ (μ_X, Σ_X) can be obtained from some R ∼ (μ, Σ) via XᵀR = r (almost surely).

Proof: The proof is constructive. Given an r ∼ (μ_X, Σ_X), we can construct a pre-image R by letting R = ΣXΣ_X†r + (I_m − ΣXΣ_X†Xᵀ)M, where M ∼ (μ, Σ) is independent of r; for example, one can choose M as a Gaussian random vector. It is easy to verify that R ∼ (μ, Σ) and XᵀR = Σ_XΣ_X†r + (I_d − Σ_XΣ_X†)XᵀM = r. The last equality holds since (I_d − Σ_XΣ_X†)r = (I_d − Σ_XΣ_X†)XᵀM (the two random vectors on both sides have the same mean and zero covariance). Note that since M can be chosen arbitrarily in (μ, Σ), the projections are always many-to-one. ∎

Although this establishes the general result, we extend it to distribution families under additional constraints below. That is, one often has additional prior information about the underlying distribution, such as symmetry, unimodality, and/or support. In such cases, if a general linear projection property can still be shown to hold, the additional assumptions can be used to make the minimax approach less conservative in a simple, direct manner. We thus consider a number of additionally restricted distribution families.

Definition 1 A random vector X is called (centrally) symmetric about μ if, for all vectors x, Pr(X ≥ μ + x) = Pr(X ≤ μ − x). A univariate random variable is called unimodal about a if its cumulative distribution function (c.d.f.) is convex on (−∞, a] and concave on [a, ∞). A random vector X is called log-concave if its c.d.f. is log-concave. A random m-vector X is called linear unimodal about 0_m if for all a ∈ R^m, aᵀX is (univariate) unimodal about 0.

Let (μ, Σ)_S denote the family of distributions in (μ, Σ) that are additionally symmetric about μ; similarly, let (μ, Σ)_L denote the family of distributions that are additionally log-concave, and let (μ, Σ)_SU denote the family of distributions that are additionally symmetric and linear unimodal about μ.

¹In preparing the final version of this paper, we noticed that a very recent work [9] proved the one dimensional case by a similar technique as ours.
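The pre-image construction in the proof of Theorem 1 is easy to check numerically. The sketch below (our own illustration; the dimensions, seed, and all variable names are arbitrary, and M is taken Gaussian as the proof suggests) draws r ∼ (μ_X, Σ_X), builds R = ΣXΣ_X†r + (I_m − ΣXΣ_X†Xᵀ)M, and confirms that XᵀR reproduces r while the empirical moments of R match (μ, Σ):

```python
import numpy as np

rng = np.random.default_rng(0)
m, d, n = 5, 2, 200_000

# An arbitrary mean, a PSD covariance, and a projection matrix X.
mu = rng.standard_normal(m)
A = rng.standard_normal((m, m))
Sigma = A @ A.T
X = rng.standard_normal((m, d))

mu_X = X.T @ mu
Sigma_X = X.T @ Sigma @ X

# Any r ~ (mu_X, Sigma_X) works; Gaussians are convenient for the check.
r = rng.multivariate_normal(mu_X, Sigma_X, size=n).T   # d x n samples
M = rng.multivariate_normal(mu, Sigma, size=n).T       # m x n, independent of r

# Pre-image from the proof of Theorem 1:
#   R = Sigma X Sigma_X^dagger r + (I_m - Sigma X Sigma_X^dagger X^T) M
P = Sigma @ X @ np.linalg.pinv(Sigma_X)                # m x d
R = P @ r + (np.eye(m) - P @ X.T) @ M

# The projection recovers r (almost surely; here, to machine precision).
assert np.allclose(X.T @ R, r)

# R carries the prescribed moments, up to Monte Carlo error.
assert np.allclose(R.mean(axis=1), mu, atol=0.1)
assert np.allclose(np.cov(R), Sigma, atol=0.3)
```

Because M can be replaced by any other member of (μ, Σ), re-running the last three lines with a different M illustrates the many-to-one part of the theorem.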
For each of these restricted families, we require the following properties to establish our next main result.

Lemma 1 (a) If random vector X is symmetric about 0, then AX + μ is symmetric about μ. (b) If X, Y are independent and both symmetric about 0, then Z = X + Y is also symmetric about 0.

Although it was once believed otherwise, it is now clear that the convolution of two (univariate) unimodal distributions need not be unimodal. However, for symmetric, unimodal distributions we have

Lemma 2 ([10] Theorem 1.6) If two independent random variables x and y are both symmetric and unimodal about 0, then z = x + y is also unimodal about 0.

There are several non-equivalent extensions of unimodality to multivariate random variables. We consider two specific (multivariate) unimodalities in this paper: log-concave and linear unimodal.²

Lemma 3 ([10] Lemma 2.1, Theorem 2.4, Theorem 2.18)
1. Linearity: If random m-vector X is log-concave, then aᵀX is also log-concave for all a ∈ R^m.
2. Cartesian Product: If X and Y are log-concave, then Z = [X; Y] is also log-concave.
3. Convolution: If X and Y are independent and log-concave, then Z = X + Y is also log-concave.

Given the above properties, we can now extend Theorem 1 to (μ, Σ)_S, (μ, Σ)_L and (μ, Σ)_SU.

Theorem 2 (GPP for Symmetric, Log-concave, and Symmetric Linear Unimodal Distributions) For all μ, Σ ⪰ 0 and X ∈ R^{m×d}, the projection XᵀR = r from m-variate R ∼ (μ, Σ)_S to d-variate r ∼ (μ_X, Σ_X)_S is surjective and many-to-one. The same is true for (μ, Σ)_L and (μ, Σ)_SU.³

Proof: The proofs follow the same basic outline as Theorem 1, except that in the first step we now choose N ∼ (0_m, I_m)_S or (0_m, I_m)_L or (0_m, I_m)_SU. 
Then, respectively, symmetry of the constructed R follows from Lemma 1; log-concavity of R follows from Lemma 3; and linear unimodality of R follows from the definition and Lemma 2. The maps remain many-to-one. ∎

An immediate application of the general projection property is to reduce worst-case analyses of multivariate expectations to the univariate case. Note that in the following corollary, the optimal distribution of R can be easily constructed from the optimal distribution of r.

Corollary 1 For any matrix X and any function g(·) (including in particular when X is a vector)

sup_{R ∼ (μ,Σ)} E[g(XᵀR)] = sup_{r ∼ (Xᵀμ, XᵀΣX)} E[g(r)]. (1)

The equality continues to hold if we restrict (μ, Σ) to (μ, Σ)_S, (μ, Σ)_L, or (μ, Σ)_SU respectively.

Proof: It is obvious that the right hand side is an upper bound on the left hand side, since for every R ∼ (μ, Σ) there exists an r ∼ (Xᵀμ, XᵀΣX) given by r = XᵀR. Similarly for (μ, Σ)_S, (μ, Σ)_L, and (μ, Σ)_SU. However, given Theorems 1 and 2, one can then establish the converse.⁴ ∎

3 Application to Worst-case Value-at-risk

We now apply these projection properties to analyze the worst-case value-at-risk (VaR), a useful risk criterion in many application areas. Consider the following constraint on a distribution R

Pr(−xᵀR ≤ α) ≥ 1 − ε, (2)

²A sufficient but not necessary condition for log-concavity is having a log-concave density. This can be used to verify log-concavity of normal and uniform distributions. 
In the univariate case, log-concave distributions are called strongly unimodal, and they form only a proper subset of the univariate unimodal distributions [10].

³If X is a vector we can also extend this theorem to other multivariate unimodalities such as symmetric star/block/convex unimodal.

⁴The closure of (μ, Σ), (μ, Σ)_S, (μ, Σ)_L, and (μ, Σ)_SU under linear projection is critical for Corollary 1 to hold. Corollary 1 fails for other kinds of multivariate unimodalities, such as symmetric star/block/convex unimodal. It also fails for (μ, Σ)₊, a distribution family whose support is contained in the nonnegative orthant. This is not surprising since determining whether the set (μ, Σ)₊ is empty is already NP-hard [11].

for given x, α and ε ∈ (0, 1). In this case, the infimum over α such that (2) is satisfied is referred to as the ε-VaR of R. Within certain restricted distribution families, such as Q-radially symmetric distributions, (2) can be (equivalently) transformed to a deterministic second order cone constraint (depending on the range of ε) [3]. Unfortunately, determining whether (2) can be satisfied for given x, α and ε ∈ (0, 1) is NP-hard in general [8]. Suppose however that one knew the distribution of R belonged to a certain family, such as (μ, Σ).⁵ Given such knowledge, it is natural to consider whether (2) can be satisfied in a worst-case sense. That is, consider

inf_{R ∼ (μ,Σ)} [Pr(−xᵀR ≤ α)] ≥ 1 − ε. (3)

Here the infimum of α values satisfying (3) is referred to as the worst-case ε-VaR. If we have additional information about the underlying distribution, such as symmetry or unimodality, the worst-case ε-VaR can be reduced. 
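To see why constraint (3) is nontrivial, it helps to recall that at fixed mean and variance the tail probability can be as large as the one-sided Chebyshev bound (inequality (9) below), and that a two-point distribution attains it exactly. The construction below is classical, not specific to this paper; the variable names are ours:

```python
# Two-point distribution with mean mu, variance sigma^2 that attains
# Pr(x >= t) = sigma^2 / (sigma^2 + (t - mu)^2), the one-sided Chebyshev bound.
import numpy as np

mu, sigma, t = 0.0, 1.0, 2.0                 # mean, std, threshold t > mu

p = sigma**2 / (sigma**2 + (t - mu)**2)      # mass at t
lo = mu - sigma**2 / (t - mu)                # second support point
xs = np.array([t, lo])
ps = np.array([p, 1 - p])

mean = ps @ xs
var = ps @ (xs - mean) ** 2
tail = ps[xs >= t].sum()                     # Pr(x >= t)

assert abs(mean - mu) < 1e-12                # correct first moment
assert abs(var - sigma**2) < 1e-12           # correct second moment
assert abs(tail - p) < 1e-12                 # bound attained exactly
```

Since such an adversarial member of (μ, σ²) always exists, any α certifying (3) must be at least as large as the corresponding Chebyshev quantile, which is exactly what bound (4) below states.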
Importantly, using the results of the previous section, we can easily determine the worst-case ε-VaR for various distribution families. These can also be used to provide a tractable bound on the ε-VaR even when the distribution is known.

Proposition 1 For alternative distribution families, the worst-case ε-VaR constraint (3) is given by:

if R ∼ (μ, Σ) then α ≥ −μ_x + √((1−ε)/ε) σ_x, (4)

if R ∼ (μ, Σ)_S then α ≥ −μ_x + √(1/(2ε)) σ_x if ε ∈ (0, 1/2); α ≥ −μ_x if ε ∈ [1/2, 1), (5)

if R ∼ (μ, Σ)_SU then α ≥ −μ_x + (2/3)√(1/(2ε)) σ_x if ε ∈ (0, 1/2); α ≥ −μ_x if ε ∈ [1/2, 1), (6)

if R ∼ N(μ, Σ) then α ≥ −μ_x + Φ⁻¹(1−ε) σ_x, (7)

where μ_x = xᵀμ, σ_x = √(xᵀΣx) and Φ(·) is the c.d.f. of the standard normal distribution N(0, 1).

It turns out some results of Proposition 1 are known. In fact, the first bound (4) has been extensively studied. 
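The four bounds of Proposition 1 differ only in the coefficient multiplying σ_x. The snippet below (our own illustration; the helper name is arbitrary, and Φ⁻¹ comes from the standard library's `statistics.NormalDist`) tabulates the coefficients and checks the ordering that Figure 1 depicts:

```python
import math
from statistics import NormalDist

def var_coeff(eps):
    """Coefficient of sigma_x in each worst-case eps-VaR bound (Proposition 1)."""
    return {
        "general":       math.sqrt((1 - eps) / eps),              # (4)
        "symmetric":     math.sqrt(1 / (2 * eps)),                # (5)
        "sym. unimodal": (2 / 3) * math.sqrt(1 / (2 * eps)),      # (6)
        "Gaussian":      NormalDist().inv_cdf(1 - eps),           # (7)
    }

for eps in (0.01, 0.05, 0.25, 0.45):
    c = var_coeff(eps)
    # More structural knowledge => smaller coefficient => less conservative alpha.
    assert c["general"] >= c["symmetric"] >= c["sym. unimodal"] >= c["Gaussian"]
    print(f"eps={eps}: " + ", ".join(f"{k}={v:.3f}" for k, v in c.items()))
```

For instance, at ε = 0.05 the general coefficient is about 4.36 while the Gaussian one is about 1.64, which quantifies how much conservatism the moment-only assumption costs.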
However, given the results of the previous section, we can now provide a much simpler proof.⁶ (This simplicity will also allow us to achieve some useful new bounds in Section 4 below.)

Proof: From Corollary 1 it follows that

inf_{R ∼ (μ,Σ)} Pr(−xᵀR ≤ α) = inf_{r ∼ (−μ_x, σ_x²)} Pr(r ≤ α) = 1 − sup_{r ∼ (−μ_x, σ_x²)} Pr(r > α). (8)

Given that the problem is reduced to the univariate case, we simply exploit classical inequalities:

if x ∼ (μ, σ²) then Pr(x > t) ≤ σ²/(σ² + (μ − t)²), (9)
if x ∼ (μ, σ²)_S then Pr(x > t) ≤ (1/2) min(1, σ²/(μ − t)²), (10)
if x ∼ (μ, σ²)_SU then Pr(x > t) ≤ (1/2) min(1, (4/9) σ²/(μ − t)²), (11)

for t ≥ μ.⁷ Now to prove (4), simply plug (8) into (3) and notice that an application of (9) leads to

α ≥ −μ_x and 1 − σ_x²/(σ_x² + (−μ_x − α)²) ≥ 1 − ε.

(4) then follows by simple rearrangement. The same procedure can be used to prove (5), (6), (7). ∎

⁵We will return to the question of when such moment information is also subject to uncertainty in Section 5.

⁶[2] and [3] provide a proof of (4) based on the multivariate Chebyshev inequality in [12]; [4] proves (4) from dual optimality; and the proof in [6] utilizes the two-point support property of the general constraint (3).

⁷(9) is known as the (one-sided) Chebyshev inequality. The two-sided version of (11) is known as the Gauss inequality. These classical bounds are tight. Proofs can be found in [13], for example.

Figure 1: Comparison of the coefficients in front of σ_x for different distribution families in Proposition 1 (left) and Proposition 2 (right). 
Only the range ε ∈ (0, 1/2) is depicted.

Proposition 1 clearly illustrates the benefit of prior knowledge. Figure 1 compares the coefficients on σ_x among the different worst-case VaR bounds for different distribution families. The large gap between coefficients for general and symmetric (linear) unimodal distributions demonstrates how additional constraints can generate much less conservative solutions while still ensuring robustness.

Beyond simplifying existing proofs, Proposition 1 can be used to extend some of the uses of the VaR criterion in different application areas.

Minimax probability classification [2]: Lanckriet et al. [2] first studied the value-at-risk constraint in binary classification. In this scenario, one is given labeled data from two different sources and seeks a robust separating hyperplane. From the data, the distribution families (μ₁, Σ₁) and (μ₂, Σ₂) can be estimated. Then a robust hyperplane can be recovered by minimizing the worst-case error

min_{x≠0, α, ε} ε s.t. inf_{R₁ ∼ (μ₁,Σ₁)} [Pr(xᵀR₁ ≤ α)] ≥ 1 − ε and inf_{R₂ ∼ (μ₂,Σ₂)} [Pr(xᵀR₂ ≥ α)] ≥ 1 − ε, (12)

where x is the normal vector of the hyperplane, α is the offset and ε controls the error probability. Note that the results in [2] follow from using the bound (4). However, interesting additional facts arise when considering alternative distribution families. For example, consider symmetric distributions. In this case, suppose we knew in advance that the optimal ε lay in [1/2, 1), meaning that no hyperplane predicts better than random guessing. Then the constraints in (12) become linear, covariance information becomes useless in determining the optimal hyperplane, and the optimization concentrates solely on separating the means of the two classes. 
Although such a result might seem surprising, it is a direct consequence of symmetry: the worst-case distributions are forced to put probability mass arbitrarily far away on both sides of the mean, thereby eliminating any information brought by covariance. When the optimal ε lies in (0, 1/2), however, covariance information becomes meaningful, since the worst-case distributions can no longer put probability mass arbitrarily far away on both sides of the mean (owing to the existence of a hyperplane that predicts labels better than random guessing). In this case, the optimization problems involving (μ, Σ)_S and (μ, Σ)_SU are equivalent to that for (μ, Σ) except that the maximum error probability ε becomes smaller, which is to be expected since more information about the marginal distributions should make one more confident in predicting the labels of future data.

Chance Constrained Linear Programming (CCLP) [3]: Consider a linear program min_x aᵀx s.t. rᵀx ≥ 0. If the coefficient r is uncertain, it is clear that solving the linear program merely using the expected value of r could result in a solution x that was sub-optimal or even infeasible. Calafiore and El Ghaoui studied this problem in [3], and imposed the inequality constraint with high probability, leading to the so-called chance constrained linear program (CCLP):

min_x aᵀx s.t. inf_{R ∼ (μ,Σ)} [Pr(−xᵀR ≤ 0)] ≥ 1 − ε. (13)

In this case, α is simply 0 and ε is given by the user. Depending on the value of ε, the chance constraint can be equivalently transformed into a second order cone constraint or a linear constraint. The work in [3] concentrates on the general and symmetric distribution families. 
In the latter case, [3] uses the first part of inequality (5) as a sufficient condition for guaranteeing robust solutions. Note however that from Corollary 1 and Proposition 1 one can now see that (5) is also a necessary condition. Although the symmetric linear unimodal case is not discussed in [3], from Proposition 1 again one can see that incorporating bound (6) in (13) yields a looser constraint than does (5); hence the feasible region will be enlarged and the optimum value of the CCLP potentially reduced, corresponding to the intuition that increased prior knowledge leads to better optimized results.

Portfolio Selection [4]: In portfolio selection, let R represent the (uncertain) returns of a suite of financial assets, and x the weighting one would like to put on the various assets. Here α > −xᵀR represents an upper bound on the loss one might suffer with weighting x. The goal is to minimize an upper bound on the loss that holds with high probability,⁸ say 1 − ε, specified by the user:

min_{x,α} α s.t. inf_{R ∼ (μ,Σ)} [Pr(−xᵀR ≤ α)] ≥ 1 − ε. (14)

This criterion has been studied by El Ghaoui et al. [4] in the worst-case setting. Previous work has not addressed the case when additional symmetry or linear unimodality information is available. However, comparing the minimal values of α in Proposition 1, we see that such additional information, such as symmetry or unimodality, indeed decreases our potential loss, as shown clearly in Figure 1. 
This makes sense, since the more one knows about the uncertain returns, the less risk one should have to bear. Note also that when incorporating additional information, the optimal portfolio, represented by x, changes as well but remains mean-variance efficient when ε ∈ (0, 1/2).

Uncertain MDPs with reward uncertainty: The standard planning problem in Markov decision processes (MDPs) is to find a policy that maximizes the expected total discounted return. This nonlinear optimization problem can be efficiently solved by dynamic programming, provided that the model parameters (transition kernel and reward function) are exactly known. Unfortunately, this is rarely the case in practice. Delage and Mannor [8] extend this problem to the uncertain case by employing the value-at-risk type constraint (2) and assuming the unknown reward model and transition kernel are drawn from a known distribution (Gaussian and Dirichlet respectively). Unfortunately, [8] also proves that the constraint (2) is generally NP-hard to satisfy unless one assumes some very restricted form of distribution, such as Gaussian. Alternatively, note that one can use the worst-case value-at-risk formulation (3) to obtain a tractable approximation to (2):

min_{x,α} α s.t. inf_{R ∼ (μ,Σ)} [Pr(−xᵀR ≤ α)] ≥ 1 − ε, (15)

where R is the reward function (unknown but assumed to belong to (μ, Σ)) and x represents a discounted-stationary state-action visitation distribution (which can be used to recover an optimal behavior policy). Although this worst-case formulation (15) might appear to be conservative compared to working with a known distribution on R and using (2), when additional information about the distribution is available, such as symmetry or unimodality, (15) can be brought very close to using a Gaussian distribution, as shown in Figure 1. 
Thus, given reasonable constraints, the minimax approach does not have to be overly conservative, while providing robustness and tractability.

4 Application to Worst-case Conditional Value-at-risk

Finally, we investigate the more refined conditional value-at-risk (CVaR) criterion that bounds the conditional expectation of losses beyond the value-at-risk (VaR). This criterion has been of growing prominence in many areas recently. Consider the following quantity defined as the mean of a tail distribution:

f̂ = E[−xᵀR | −xᵀR ≥ α*], where α* = arg min_α { α s.t. Pr(−xᵀR ≤ α) ≥ 1 − ε }. (16)

Here, α* is the value-at-risk and f̂ is the conditional value-at-risk of R. It is well-known that the CVaR, f̂, is always an upper bound on the VaR, α*.

⁸Note that seeking to minimize the loss surely leads to a meaningless outcome. For example, if ε = 0, the optimization problem trivially says that the loss of any portfolio will be no larger than 1.

Although it might appear that dealing with the CVaR criterion entails greater complexity than the VaR, since VaR is directly involved in the definition of CVaR, it turns out that CVaR can be more directly expressed as

f̂ = min_α { α + (1/ε) E[(−xᵀR − α)₊] }, (17)

where (x)₊ = max(0, x) [14]. Unlike the VaR constraint (2), (17) is always (jointly) convex in x and α. Thus if R were discrete, f̂ could be easily computed by a linear program [14; 5]. However, the expectation in (17) involves a high dimensional integral in general, whose analytical solution is not always available; thus f̂ is still hard to compute in practice. 
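For a discrete (empirical) loss distribution, the variational form (17) can be evaluated directly, and its minimum coincides with the average of the worst ε-fraction of losses; this is the Rockafellar–Uryasev representation from [14]. A minimal sketch (our own variable names; we pick n so that εn is an integer, and evaluate (17) at its known minimizer, the empirical VaR):

```python
import numpy as np

rng = np.random.default_rng(1)
losses = rng.standard_normal(10_000)      # samples of the loss -x'R
eps = 0.05
k = int(eps * len(losses))                # eps * n, an integer here

srt = np.sort(losses)
alpha_star = srt[-k]                      # empirical VaR (the eps-quantile)

# (17) evaluated at alpha*: alpha + E[(loss - alpha)_+] / eps
cvar_17 = alpha_star + np.maximum(losses - alpha_star, 0).mean() / eps

# Direct tail average: mean of the worst eps-fraction of the losses.
cvar_tail = srt[-k:].mean()
assert abs(cvar_17 - cvar_tail) < 1e-10

# Sanity check that alpha* really minimizes the objective in (17).
for a in np.linspace(srt[0], srt[-1], 101):
    assert cvar_17 <= a + np.maximum(losses - a, 0).mean() / eps + 1e-9
```

The grid check at the end is only illustrative; convexity of (17) in α is what guarantees the minimum is global.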
Although one potential remedy might be to use Monte Carlo techniques to approximate the expectation, we instead take a robust approach: as before, suppose one knew the distribution of R belonged to a certain family, such as (μ, Σ). Given such knowledge, it is natural to consider the worst-case CVaR

f = sup_{R ∼ (μ,Σ)} min_α { α + (1/ε) E[(−xᵀR − α)₊] } = min_α sup_{R ∼ (μ,Σ)} { α + (1/ε) E[(−xᵀR − α)₊] }, (18)

where the interchangeability of the min and sup operators follows from the classic minimax theorem [15]. Importantly, as in the previous section, we can determine the worst-case CVaR for various distribution families. If one has additional information about the underlying distribution, such as symmetry or unimodality, the worst-case CVaR can be reduced. These can be used to provide a tractable bound on the CVaR even when the distribution is known.

Proposition 2 For alternative distribution families, the worst-case CVaR is given by:

if R ∼ (μ, Σ) then α = −μ_x + ((1 − 2ε)/(2√(ε(1−ε)))) σ_x, f = −μ_x + √((1−ε)/ε) σ_x, (19)

if R ∼ (μ, Σ)_S then
α = −μ_x + (1/√(8ε)) σ_x, f = −μ_x + (1/√(2ε)) σ_x, if ε ∈ (0, 1/2];
α = −μ_x − (1/√(8(1−ε))) σ_x, f = −μ_x + (√(1−ε)/(√2 ε)) σ_x, if ε ∈ [1/2, 1), (20)

if R ∼ (μ, Σ)_SU then
α = −μ_x + (1/(3√ε)) σ_x, f = −μ_x + (2/(3√ε)) σ_x, if ε ∈ (0, 1/3];
α = −μ_x + √3(1 − 2ε) σ_x, f = −μ_x + √3(1 − ε) σ_x, if ε ∈ [1/3, 2/3];
α = −μ_x − (1/(3√(1−ε))) σ_x, f = −μ_x + (2√(1−ε)/(3ε)) σ_x, if ε ∈ [2/3, 1), (21)

if R ∼ N(μ, Σ) then f = −μ_x + (e^{−(Φ⁻¹(1−ε))²/2} / (√(2π) ε)) σ_x, (22)

where μ_x = xᵀμ, σ_x = √(xᵀΣx) and Φ(·) is the c.d.f. of the standard normal distribution N(0, 1). The results of Proposition 2 are a novel contribution of this paper, with the exception of (22), which is a standard result in stochastic programming [7].

Proof: We know from Corollary 1 that

sup_{R ∼ (μ,Σ)} E[(−xᵀR − α)₊] = sup_{r ∼ (−μ_x, σ_x²)} E[(r − α)₊], (23)

which reduces the problem to the univariate case. To proceed, we will need to make use of the univariate results given in Proposition 3 below. Assuming Proposition 3 for now, we show how to prove (19): in this case, substitute (23) into (18) and apply (24) from Proposition 3 below to obtain

f = min_α { α + (1/(2ε)) [ (−μ_x − α) + √(σ_x² + (−μ_x − α)²) ] }.

This is a convex univariate optimization problem in α. Taking the derivative with respect to α and setting it to zero gives α = −μ_x + ((1 − 2ε)/(2√(ε(1−ε)))) σ_x. Substituting back, we obtain f = −μ_x + √((1−ε)/ε) σ_x. A similar strategy can be used to prove (20), (21), and (22). ∎

As with Proposition 1, Proposition 2 illustrates the benefit of prior knowledge. Figure 1 (right) compares the coefficients on σ_x among different worst-case CVaR quantities for different families. Comparing VaR and CVaR in Figure 1 shows that unimodality has less impact on improving CVaR. A key component of Proposition 2 is its reliance on the following important univariate results. 
The following proposition gives tight bounds on the quantity E[(x − t)₊], the integrated survival function, for the various families.

Proposition 3 For alternative distribution families, the worst-case expectations are:

sup_{x ∼ (μ,σ²)} E[(x − t)₊] = (1/2) [ (μ − t) + √(σ² + (μ − t)²) ], (24)

sup_{x ∼ (μ,σ²)_S} E[(x − t)₊] =
(σ − t + μ)/2, if μ − σ/2 ≤ t ≤ μ + σ/2;
σ²/(8(t − μ)), if t > μ + σ/2;
−(σ² + 8(t − μ)²)/(8(t − μ)), if t < μ − σ/2, (25)

sup_{x ∼ (μ,σ²)_SU} E[(x − t)₊] =
(√3 σ − t + μ)²/(4√3 σ), if μ − σ/√3 ≤ t ≤ μ + σ/√3;
σ²/(9(t − μ)), if t > μ + σ/√3;
−(σ² + 9(t − μ)²)/(9(t − μ)), if t < μ − σ/√3. (26)

Here (26) is a further novel contribution of this paper. Proofs of (24) and (25) can be found in [1].

Interestingly, to the best of our knowledge, the worst-case CVaR criterion has not yet been applied to any of the four problems mentioned in the previous section.⁹ Given the space constraints, we can only discuss the direct application of worst-case CVaR to the portfolio selection problem. 
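The closed forms in Proposition 2 can be cross-checked against Proposition 3 directly: plugging the univariate bound (24) into (18) leaves a one-dimensional convex problem in α, which can be minimized numerically and compared with (19). A small self-contained sketch (our own helper names; a ternary search over the convex objective stands in for the derivative computation in the proof):

```python
import math

def wc_expected_excess(t, mu, sigma):
    """sup over (mu, sigma^2) of E[(x - t)_+], Proposition 3, eq. (24)."""
    return 0.5 * ((mu - t) + math.sqrt(sigma ** 2 + (mu - t) ** 2))

def wc_cvar(eps, mu_x, sigma_x):
    """Worst-case CVaR via (18): min over alpha of
    alpha + sup E[(-x'R - alpha)_+] / eps, the loss having mean -mu_x, std sigma_x."""
    def h(a):
        return a + wc_expected_excess(a, -mu_x, sigma_x) / eps
    lo, hi = -mu_x - 100 * sigma_x, -mu_x + 100 * sigma_x
    for _ in range(200):                  # ternary search; h is convex in alpha
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if h(m1) <= h(m2):
            hi = m2
        else:
            lo = m1
    return h((lo + hi) / 2)

mu_x, sigma_x = 0.3, 1.5
for eps in (0.01, 0.05, 0.1, 0.25):
    closed_form = -mu_x + math.sqrt((1 - eps) / eps) * sigma_x   # (19)
    numeric = wc_cvar(eps, mu_x, sigma_x)
    assert abs(numeric - closed_form) < 1e-6
    print(f"eps={eps}: worst-case CVaR = {numeric:.4f}")
```

The same pattern with (25) or (26) in place of (24) reproduces the symmetric and symmetric-unimodal entries (20) and (21), which is how the smaller coefficients in Figure 1 (right) arise.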
We note that CVaR has recently been applied to ν-SVM learning in [16].

Implications for Portfolio Selection: By comparing Propositions 1 and 2, the first interesting conclusion one can reach about portfolio selection is that, without considering any additional information, the worst-case CVaR criterion yields the same optimal portfolio weighting x as the worst-case VaR criterion (recall that VaR minimizes α in Proposition 1 by adjusting x, while CVaR minimizes f by adjusting x in Proposition 2). However, the worst-case distributions for the two approaches are not the same, which can be seen from the relation (16) between VaR and CVaR and by observing that α in (4) is not the same as in (19). Next, when additional symmetry information is taken into account and ε ∈ (0, 1/2), CVaR and VaR again select the same portfolio but under different worst-case distributions. Only when unimodality is added does the CVaR criterion finally begin to select different portfolios than VaR.

5 Concluding Remarks

We have provided a simpler yet broader proof of the general linear projection property for distribution families with given mean and covariance. The proof strategy can be easily extended to more restricted distribution families. A direct implication of our results is that worst-case analyses of multivariate expectations can often be reduced to univariate ones. By combining this reduction with classic univariate inequalities, we were able to provide worst-case analyses of two widely adopted constraints (based on value-at-risk criteria). Our analysis recovers some existing results in a simpler way while also providing new insights on incorporating additional information.

Above, we assumed the first and second moments of the underlying distribution were precisely known, which of course is questionable in practice. Fortunately, there are standard techniques for handling such additional uncertainty.
One strategy, proposed in [2], is to construct a (bounded and convex) uncertainty set U over (μ, Σ), and then apply a similar minimax formulation but with respect to (μ, Σ) ∈ U. As shown in [2], appropriately chosen uncertainty sets amount to adding straightforward regularizations to the original problem. A second approach is simply to lower one's confidence in the constraints and rely on the fact that the moment estimates are close to their true values within some additional confidence bound [17]. That is, instead of enforcing the constraint (3) or (18) surely, one can plug in the estimated moments and argue that the constraints will be satisfied with some diminished probability. For an application of this strategy in CCLP, see [3].

Acknowledgement

We gratefully acknowledge support from the Alberta Ingenuity Centre for Machine Learning, the Alberta Ingenuity Fund, iCORE and NSERC. Csaba Szepesvári is on leave from MTA SZTAKI, Budapest, Hungary.

⁹Except the very recent work of [9] on portfolio selection.

References

[1] R. Jagannathan. "Minimax procedure for a class of linear programs under uncertainty". Operations Research, vol. 25(1):pp. 173–177, 1977.
[2] Gert R.G. Lanckriet, Laurent El Ghaoui, Chiranjib Bhattacharyya and Michael I. Jordan. "A robust minimax approach to classification". Journal of Machine Learning Research, vol. 3:pp. 555–582, 2002.
[3] G.C. Calafiore and Laurent El Ghaoui. "On distributionally robust chance-constrained linear programs". Journal of Optimization Theory and Applications, vol. 130(1):pp. 1–22, 2006.
[4] Laurent El Ghaoui, Maksim Oks and Francois Oustry. "Worst-case value-at-risk and robust portfolio optimization: a conic programming approach". Operations Research, vol. 51(4):pp. 542–556, 2003.
[5] Shu-Shang Zhu and Masao Fukushima. "Worst-case conditional value-at-risk with application to robust portfolio management". Operations Research, vol. 57(5):pp. 1155–1168, 2009.
[6] Ioana Popescu. "Robust mean-covariance solutions for stochastic optimization". Operations Research, vol. 55(1):pp. 98–112, 2007.
[7] András Prékopa. Stochastic Programming. Springer, 1995.
[8] Erick Delage and Shie Mannor. "Percentile optimization for Markov decision processes with parameter uncertainty". Operations Research, to appear 2009.
[9] Li Chen, Simai He and Shuzhong Zhang. "Tight bounds for some risk measures, with applications to robust portfolio selection". Tech. rep., Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, 2009.
[10] Sudhakar Dharmadhikari and Kumar Joag-Dev. Unimodality, Convexity, and Applications. Academic Press, 1988.
[11] Dimitris Bertsimas and Ioana Popescu. "Optimal inequalities in probability theory: a convex optimization approach". SIAM Journal on Optimization, vol. 15(3):pp. 780–804, 2005.
[12] Albert W. Marshall and Ingram Olkin. "Multivariate Chebyshev inequalities". Annals of Mathematical Statistics, vol. 31(4):pp. 1001–1014, 1960.
[13] Ioana Popescu. "A semidefinite programming approach to optimal moment bounds for convex classes of distributions". Mathematics of Operations Research, vol. 30(3):pp. 632–657, 2005.
[14] R. Tyrrell Rockafellar and Stanislav Uryasev. "Optimization of conditional value-at-risk". Journal of Risk, vol. 2(3):pp. 493–517, 2000.
[15] Ky Fan. "Minimax theorems". Proceedings of the National Academy of Sciences, vol. 39(1):pp. 42–47, 1953.
[16] Akiko Takeda and Masashi Sugiyama. "ν-support vector machine as conditional value-at-risk minimization". In Proceedings of the 25th International Conference on Machine Learning, pp. 1056–1063, 2008.
[17] John Shawe-Taylor and Nello Cristianini. "Estimating the moments of a random vector with applications". In Proceedings of GRETSI 2003 Conference, pp. 47–52, 2003.