{"title": "Risk Aversion in Markov Decision Processes via Near Optimal Chernoff Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 3131, "page_last": 3139, "abstract": null, "full_text": "Risk Aversion in Markov Decision Processes\n\nvia Near-Optimal Chernoff Bounds\n\nTeodor Mihai Moldovan\n\nDepartment of Computer Science\nUniversity of California at Berkeley\n\nBerkeley CA 94720, USA\n\nmoldovan@cs.berkeley.edu\n\nPieter Abbeel\n\nDepartment of Computer Science\nUniversity of California at Berkeley\n\nBerkeley CA 94720, USA\n\npabbeel@cs.berkeley.edu\n\nAbstract\n\nThe expected return is a widely used objective in decision making under uncer-\ntainty. Many algorithms, such as value iteration, have been proposed to optimize\nit. In risk-aware settings, however, the expected return is often not an appropriate\nobjective to optimize. We propose a new optimization objective for risk-aware\nplanning and show that it has desirable theoretical properties. We also draw con-\nnections to previously proposed objectives for risk-aware planing: minmax, ex-\nponential utility, percentile and mean minus variance. Our method applies to an\nextended class of Markov decision processes: we allow costs to be stochastic as\nlong as they are bounded. Additionally, we present an ef\ufb01cient algorithm for op-\ntimizing the proposed objective. Synthetic and real-world experiments illustrate\nthe effectiveness of our method, at scale.\n\n1\n\nIntroduction\n\nThe expected return is often the objective function of choice in planning problems where outcomes\nnot only depend on the actor\u2019s decisions but also on random events. Often expectations are the\nnatural choice, as the law of large numbers guarantees that the average return over many independent\nruns will converge to the expectation. 
Moreover, the linearity of expectations can often be leveraged to obtain efficient algorithms.

Some games, however, can only be played once, either because they take a very long time (investing for retirement), because we are not given a chance to try again if we lose (skydiving, crossing the road), or because i.i.d. versions of the game are not available (stock market). In this setting, we can no longer take advantage of the law of large numbers to ensure that the return is close to its expectation with high probability, so the expected return might not be the best objective to optimize. If we were pessimistic, we might assume that everything that can go wrong will go wrong and try to minimize the losses under this assumption. This is called minmax optimization and is sometimes useful, but, most often, the resulting policies are overly cautious. A more balanced and general approach would include minmax optimization and expectation optimization, corresponding respectively to absolute risk aversion and risk ignorance, but would also allow a spectrum of policies between these extremes.

As a motivating example, consider buying tickets to fly to a very important meeting. Shorter travel time is preferable, but, even more importantly, it would be disastrous if you arrived late. Some flights arrive on time more often than others, and the delays might be amplified if you miss connecting flights. With these risks in mind, would you rather take a route with an expected travel time of 12:21 and no further guarantees, or would you prefer a route that takes less than 16:19 with 99% probability? Our method produces these options when traveling from Shreveport Regional Airport (SHV) to Rafael Hernández Airport (BQN). According to historical flight data, if you chose the former alternative you could end up travelling for 22 hours with 8% probability. Another example comes from software quality assurance.
Amazon.com requires its sub-services to report and optimize performance at the 99.9th percentile, rather than in expectation, to make sure that all of its customers have a good experience, not just the majority [1]. In the economics literature, this percentile criterion is known as value at risk and has become a widely used measure of risk after the market crash of 1987 [2]. At the same time, the classical method for managing risk in investment is Markowitz portfolio optimization, where the objective is to optimize expectation minus weighted variance. These examples suggest that proper risk-aware planning should allow a trade-off between expectation and variance and, at the same time, should provide some guarantees about the probability of failure.

Risk-aware planning for Markov decision processes (MDPs) is difficult for two main reasons. First, optimizing many of the intuitive risk-aware objectives seems to be computationally intractable. Both mean minus variance optimization and percentile optimization for MDPs have been shown to be NP-hard in general [3, 4]. Consequently, we can only optimize relaxations of these objectives in practice. Second, it seems to be difficult to find an optimization objective which correctly models our intuition of risk awareness. Even though expectation, variance and percentile levels relate to risk awareness, optimizing them directly can lead to counterintuitive policies, as illustrated recently in [3] for the case of mean minus variance optimization, and in the appendix of this paper for percentile optimization.

Planning under uncertainty in MDPs is an old topic that has been addressed by many authors. The minmax objective has been proposed in [5, 6], which also propose a dynamic programming algorithm for optimizing it efficiently. Unfortunately, minmax policies tend to be overly cautious. A number of methods have been proposed for relaxations of mean minus variance optimization [3, 7].
Percentile optimization has been shown to be tractable when dealing with ambiguity in MDP parameters [8, 9], and it has also been discussed in the context of risk [10, 11]. Our approach is closest to the line of work on exponential utility optimization [12, 13]. This problem can be solved efficiently and the resulting policies conform to our intuition of risk awareness. However, previous methods give no guarantees about probability of failure or variance. For an overview of previously used objectives for risk-aware planning in MDPs, see [14, 15].

Our method arises from approaching the problem in the context of probability theory. We observe connections between exponential utility maximization, Chernoff bounds, and cumulant generating functions, which enables formulating a new optimization objective for risk-aware planning. This new objective is essentially a re-parametrization of exponential utility, and inherits both the efficient optimization algorithms and the concordance to intuition about risk awareness. We show that optimizing the proposed objective includes, as limiting cases, both minmax and expectation optimization and allows interpolation between them. Additionally, we provide guarantees at a certain percentile level, and show connections to mean minus variance optimization.

Two experiments, one synthetic and one based on real-world data, support our theoretical guarantees and showcase the proposed optimization algorithms. Our largest MDP has 124791 state-action pairs, significantly larger than experiments in most past work on risk-aware planning. Our experiments illustrate the ability of our approach to produce, out of the exponentially many policies available, a family of policies that agrees with the human intuition of varying risk.

2 Background and Notation

An MDP consists of a state space S, an action space A, state transition dynamics, and a cost function G.
Assume that, at time t, the system is in state s_t ∈ S. Once the player chooses an action a_t ∈ A, the system transitions stochastically to state s_{t+1} ∈ S, with probability p(s_{t+1}|s_t, a_t), and the player incurs a stochastic cost of G^t(s_t, a_t, s_{t+1}). The process continues for a number of time steps, h, called the horizon. We eventually care about the total cost obtained. We represent the player's strategy as a time dependent policy, which is a measure on the space of state-actions. Finally, we set the starting state to some fixed s_0 ∈ S. Then, the objective is to "optimize" the random variable J^h, defined by J^h := Σ_{t=0}^{h-1} G^t(S_t, A_t, S_{t+1}). Traditionally, "optimizing" J means minimizing its expected value, that is solving min_π E_{s,π}[J]. The classical solution to this problem is to run value iteration, summarized below:

  q^{t+1}(s, a) := Σ_{s'} p_{s'|s,a} (G^t_{s,a,s'} + j^t(s')),        j^t(s) := min_a q^t(s, a) = min_π E_{s,π}[J^t]

We will refer to policies obtained by standard value iteration as expectimin policies. We use upper case letters for random variables. We assume that the state-action space is finite and that sums with zero terms, for example J^0, are equal to zero. The notation E_{s,π} signifies taking the expectation starting from S_0 = s, and following policy π. We assume that costs are upper bounded, that is, there exists j_M such that J ≤ j_M almost surely for any start state and any policy, and that the expected costs are finite. Finally, in this paper we will not consider discounting explicitly. If necessary, discounting can be introduced in one of two ways: either by adding a transition from every state, for all actions, to an absorbing "end game" state, with probability γ, or by setting a time dependent cost as G^t_new = γ^t G^t_old.
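The expectimin recursion above can be sketched in a few lines. The following is our own minimal illustration (the array names are ours): p holds the transition probabilities and g the expected one-step costs, which suffices here because only E[G] enters the expectation objective.

```python
import numpy as np

def value_iteration(p, g, h):
    """Finite-horizon expectimin value iteration.

    p[s, a, s2] = transition probability, g[s, a, s2] = expected one-step
    cost; returns j[s] = min over policies of E[J^h | S_0 = s].
    """
    S, A, _ = p.shape
    j = np.zeros(S)  # j^0 = 0: the empty sum of costs
    for _ in range(h):
        # q^{t+1}(s,a) = sum_{s2} p(s2|s,a) * (g(s,a,s2) + j^t(s2))
        q = (p * g).sum(axis=2) + p @ j
        j = q.min(axis=1)
    return j

# Toy example: from state 0, action 0 surely reaches the absorbing
# state 1 at cost 2; action 1 costs 1 per step but succeeds only half
# the time. State 1 is absorbing with zero cost.
p = np.zeros((2, 2, 2)); g = np.zeros((2, 2, 2))
p[0, 0, 1] = 1.0; g[0, 0, 1] = 2.0
p[0, 1, 0] = 0.5; g[0, 1, 0] = 1.0
p[0, 1, 1] = 0.5; g[0, 1, 1] = 1.0
p[1, :, 1] = 1.0
j2 = value_iteration(p, g, 2)
```

For stochastic costs it suffices to store their conditional expectations, since only the mean of G appears in the expectimin backup; this is exactly what no longer holds in the risk-aware setting considered below.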
Note that these two ways of introducing discounting are equivalent when optimizing the expected cost, but they can differ in the risk-aware setting we are considering. We refer the reader to [16] and [17] for further background on MDPs.

3 The Chernoff Functional as Risk-Aware Objective

We propose optimizing the following functional of the cost, which we call the Chernoff functional since it often appears in proving Chernoff bounds:

  C^δ_{s,π}[J] = inf_{θ>0} ( θ log E_{s,π}[e^{J/θ}] − θ log(δ) ).        (1)

First, note that the total cost appears in the expression of the Chernoff functional as an exponential utility (E_{s,π}[e^{J/θ}]). This shows that there is a strong connection between our method and exponential utility optimization. Specifically, all policies proposed by our algorithm, including the final solution, are optimal policies with respect to the exponential utility for some parameter. These policies are known to show risk-awareness in practice [12, 13], and our method inherits this property. In some sense, our proposed objective is a re-parametrization of exponential utility, which was obtained through its connections to Chernoff bounds and cumulant generating functions. The theorem below, which is one of the main contributions of this paper, provides more reasons for optimizing the Chernoff functional in the risk-aware setting. We will state and discuss the theorem here, but leave the proof for the appendix.

Theorem 1. Let δ ∈ [0, 1], and let J be a random variable that has a cumulant generating function, that is, E exp(J/θ) < ∞ for all θ > 0.
Then, the Chernoff functional of this random variable, C^δ[J], is well defined, and has the following properties:

(i) P(J ≥ C^δ[J]) ≤ δ
(ii) C^1[J] = lim_{θ→∞} θ log E[e^{J/θ}] = E[J]
(iii) C^0[J] := lim_{δ→0} C^δ[J] = lim_{θ→0} θ log E[e^{J/θ}] = sup{j : P{J ≥ j} > 0} < ∞
(iv) C^δ[J] = E[J] + sqrt(2 log(1/δ) Var[J]) if J is Gaussian
(v) As δ → 1, C^δ[J] ≈ E[J] + sqrt(2 log(1/δ) Var[J])
(vi) C^δ[J] is a smooth, decreasing function of δ.

Proof sketch. Property (i) is simply a Chernoff bound and follows by applying Markov's inequality to the random variable e^{J/θ}. Property (iv) follows from the fact that all but the first two cumulants of Gaussian random variables are zero [18]. Properties (ii), (iii), (v) and (vi) follow from the following properties of the cumulant generating function, log E[e^{zJ}] [18]:

(a) log E[e^{zJ}] = Σ_{i=1}^{∞} z^i k_i / i!, where the k_i are the cumulants [18], e.g. k_1 = E[J], k_2 = Var[J].
(b) log E[e^{zJ}], as a function of z ∈ R, is strictly convex, analytic and infinitely differentiable in a neighborhood of zero, if it is finite in that neighborhood.

Figure 1: Plot showing the exact function f defined in Equation 2 and the approximation f̂ that our algorithm constructs, for the Grid World MDP described in Section 5.1. (Horizontal axis: θ; f decreases from the minimax cost toward the expectimin cost.)

Properties (ii) and (iii) show that we can use the δ parameter to interpolate between the nominal policy, which ignores risk, at δ = 1, and the minmax policy, which corresponds to extreme risk aversion, at δ = 0.
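This interpolation can be checked numerically on an empirical cost sample. The sketch below is our own (the function name and θ grid are arbitrary choices, not the paper's implementation); it evaluates Equation (1) by brute-force search over θ, shifting by the maximum cost for numerical stability, and recovers a value near the expectation at δ = 1 and near the maximum cost for small δ:

```python
import numpy as np

def chernoff_functional(costs, delta, thetas=np.logspace(-3, 3, 200)):
    """Evaluate Eq. (1) for an empirical sample of total costs J by a
    brute-force search over the scale parameter theta."""
    costs = np.asarray(costs, dtype=float)
    m = costs.max()
    best = np.inf
    for th in thetas:
        # theta * log E[exp(J/theta)], via a log-sum-exp style shift by m
        val = m + th * np.log(np.mean(np.exp((costs - m) / th))) - th * np.log(delta)
        best = min(best, val)
    return best

# Two equally likely outcomes: total cost 0 or 10.
risk_neutral = chernoff_functional([0.0, 10.0], delta=1.0)   # near E[J] = 5
risk_averse = chernoff_functional([0.0, 10.0], delta=0.01)   # near max J = 10
```

Note the brute-force grid is only for intuition; the paper's Algorithm 1 searches over θ adaptively, reusing one exponential utility optimization per evaluated θ.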
Property (i) shows that, with probability at least 1 − δ, the value of the Chernoff functional is an upper bound on the cost obtained by following the corresponding Chernoff policy. These two observations suggest that by tuning δ from 0 to 1 we can find a family of risk-aware policies, in order of risk aversion. Our experiments support this hypothesis (Section 5).

Property (i) shows a connection between our approach and percentile optimization. Although we are not optimizing the δ-percentile directly, our method provides guarantees about it. Properties (iv) and (v) show a connection between optimizing the Chernoff functional and mean minus variance optimization, which has been proposed before for risk-aware planning, but was found to be intractable in general [3]. Via property (v), we can optimize mean minus variance with a low weight on variance if we set δ close to 1. In the limit, this allows us to optimize the expectation, while breaking ties in favor of lower variance. Property (iv) shows that we can optimize mean minus scaled standard deviation exactly if the total cost is Gaussian. Typically, this will not be the case, but, if the MDP is ergodic and the time horizon is large enough, the total cost will be close to Gaussian, by the central limit theorem. To see why this is true, note that, by the Markov property, costs between successive returns to the same state are i.i.d. random variables [19].
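In the Gaussian case of property (iv), the infimum in Equation (1) is available in closed form: for J ~ N(μ, σ²), θ log E[e^{J/θ}] = μ + σ²/(2θ), which is minimized at θ = σ/sqrt(2 log(1/δ)). A small numeric check (the constants below are arbitrary, chosen only for illustration):

```python
import numpy as np

mu, sigma, delta = 3.0, 2.0, 0.05

# For Gaussian J, all cumulants beyond the first two vanish, so
# theta * log E[exp(J/theta)] = mu + sigma^2 / (2 * theta).
thetas = np.logspace(-3, 3, 20000)
objective = mu + sigma**2 / (2 * thetas) - thetas * np.log(delta)

# Property (iv): the infimum equals mu + sigma * sqrt(2 * log(1/delta)),
# attained at theta = sigma / sqrt(2 * log(1/delta)).
closed_form = mu + sigma * np.sqrt(2.0 * np.log(1.0 / delta))
assert abs(objective.min() - closed_form) < 1e-3
```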
Our formulation ties into mean minus standard deviation optimization, which is dimensionally consistent, unlike the classical mean minus variance objective.

4 Optimizing the Proposed Objective

Finding the policy that optimizes our proposed objective at a given risk level δ amounts to a joint optimization problem (Bellman optimality does not hold for our objective; see Appendix for discussion):

  min_π C^δ_{s,π}[J] = inf_{θ>0} ( min_π θ log E_{s,π}[e^{J/θ}] − θ log(δ) )
                     = inf_{θ>0} ( f(θ) − θ log(δ) ),   where   f(θ) := θ log ( min_π E_{s,π}[e^{J/θ}] ).        (2)

The inner optimization problem, the optimization over policies π, is simply exponential utility optimization, a classical problem that can be solved efficiently. For brevity, we will not discuss solutions to this problem and, instead, refer the readers to [12, 13]. The main difficulty is solving the outer optimization problem, over the scale variable θ. Unfortunately, this problem is not convex and may have a large number of local minima. Our main algorithmic contribution consists of an approach for solving the outer (non-convex) optimization problem efficiently to some specified precision ε.

Based on Theorems 1 and 2 (below), we propose a method for finding the policy that minimizes the Chernoff functional, to precision ε, with worst case time complexity O(h|S|^2|A|/ε). It is summarized in Algorithm 1. Our approach is to solve the optimization problem in (2) with an approximation of the function f (Figure 1 shows an example plot of this function). The algorithm maintains such an approximation and improves it as needed up to a precision of ε.
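The outer search of Algorithm 1 can be sketched as follows. This is our own illustration, not the paper's implementation: f is passed in as a callable (each evaluation stands for one exponential utility optimization), the θ → 0 and θ → ∞ limits are supplied precomputed, and we write the two stopping tests with absolute differences.

```python
import math

def chernoff_outer(f, f_minimax, f_expectimin, delta, eps):
    """Sketch of Algorithm 1's outer search over the scale theta.

    f(theta) = theta * log min_pi E[exp(J/theta)]. f_minimax and
    f_expectimin are the theta -> 0 and theta -> inf limits of f.
    Returns (theta*, approximate optimal value of Eq. (2)).
    """
    fhat = {1.0: f(1.0)}  # incremental approximation of f

    # Bracket the search by powers of ten until f(theta) is within eps
    # of each of its two limits.
    th = 1.0
    while abs(f_expectimin - fhat[th]) >= eps:
        th *= 10.0
        fhat[th] = f(th)
    th = 1.0
    while abs(fhat[th] - f_minimax) >= eps:
        th /= 10.0
        fhat[th] = f(th)

    # Refine: split the interval to the right of the current argmin at
    # the geometric mean, until the new value changes by less than eps.
    while True:
        keys = sorted(fhat)
        th_star = min(keys, key=lambda t: fhat[t] - t * math.log(delta))
        right = [t for t in keys if t > th_star]
        if not right:
            break
        th_new = math.sqrt(th_star * min(right))
        fhat[th_new] = f(th_new)
        if abs(fhat[th_star] - fhat[th_new]) < eps:
            break
    return th_star, fhat[th_star] - th_star * math.log(delta)

# Toy total cost: 0 or 2 with probability 1/2 each, so the expectimin
# limit of f is 1 and the minimax limit is 2.
toy_f = lambda th: th * math.log(0.5 * (1.0 + math.exp(2.0 / th)))
theta_star, value = chernoff_outer(toy_f, 2.0, 1.0, delta=0.5, eps=0.01)
```

With δ = 0.5 the toy objective is minimized as θ → 0, and the sketch returns a value near 2, consistent with P(J ≥ 2) = 0.5 ≤ δ in property (i).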
In practice we might want to run the algorithm for more than one setting of δ to find policies for the same planning task at different levels of risk aversion, say at n different levels. Naively, the time complexity of doing this would be O(nh|S|^2|A|/ε) but, fortunately, our function approximation can be reused between subsequent runs of the algorithm, saving computation time, so the total complexity will, in fact, be only O(h|S|^2|A|/ε + n).

Algorithm 1 Near-optimal Chernoff bound algorithm

  f̂ ← empty hash map                                          ▷ will store incremental approximation of f defined in Eq. 2
  f̂[0] ← f(0)                                                 ▷ minimax cost of the MDP
  f̂[∞] ← f(∞)                                                 ▷ expectimin cost of the MDP
  for θ ∈ {1, 10, 100, ...}, until f̂[∞] − f̂[θ] < ε, do        ▷ find upper bound
      f̂[θ] ← f(θ)                                             ▷ exponential utility optimization
  for θ ∈ {1, 0.1, 0.01, ...}, until f̂[θ] − f̂[0] < ε, do      ▷ find lower bound
      f̂[θ] ← f(θ)                                             ▷ exponential utility optimization
  repeat
      θ* ← argmin{θ ∈ keys(f̂) : f̂[θ] − θ log(δ)}              ▷ argmin over previously computed costs
      θ ← (θ* · min{θ > θ*, θ ∈ keys(f̂)})^{1/2}               ▷ split interval at geometric mean
      f̂[θ] ← f(θ)                                             ▷ exponential utility optimization
  until f̂[θ*] − f̂[θ] < ε                                      ▷ until f̂ is an ε-accurate approximation of f
  return optimal exponential utility policy(MDP, 1/θ*)

Properties (ii) and (iii) of Theorem 1 imply that f(0) can be computed by minimax optimization and f(∞) can be computed by value iteration (expectimin optimization),
which both have the same time complexity as exponential utility optimization: O(h|S|^2|A|). Once we have computed these limits, the next step in the algorithm is finding some appropriate bounding interval, [θ1, θ2], such that f(0) − f(θ1) < ε and f(θ2) − f(∞) < ε. We do this by first searching over θ = 1, 0.1, 10^{-2}, ..., and, then, over θ = 1, 10, 10^2, .... For a given machine architecture, the number of θ values is bounded by the number format used in the implementation. For example, working with double precision floating-point numbers limits the number of θ evaluations to 2 · 1023, implied by the fact that exponents are only assigned 11 bits. In our experiments, this step takes 10-15 function evaluations. Now, for any given risk level, δ, we will find the θ* that minimizes the objective, f(θ) − θ log(δ), among those θ where we have already evaluated f. We will, then, evaluate f at a new point: the geometric mean of θ* and its closest neighbor to the right. We stop iterating when the function value at the new point is less than ε away from the function value at θ*, and return the corresponding optimal exponential utility policy. Consequently, our algorithm evaluates f at a subset of the points {θ1(θ2/θ1)^{i/n} : i = 0, ..., n} where n is a power of 2. Theorem 2 guarantees that to get an ε guarantee for the accuracy of the optimization it suffices to perform n(ε) = O(1/ε) evaluations of f, where we are now treating log(θ2) − log(θ1) as a constant. Therefore, the number of function evaluations is O(1/ε), and, since the time complexity of every evaluation is O(h|S|^2|A|), the total time complexity of the algorithm is O(h|S|^2|A|/ε).

Theorem 2.
Consider the interval 0 < θ1 < θ2, split up into n sub-intervals by θ^n_i = θ1(θ2/θ1)^{i/n}, and let f̂_n(θ) := f(max{θ^n_i : θ^n_i < θ, i = 0, ..., n}) be our piecewise constant approximation to the function f(θ) defined in Equation (2). Then, for a given approximation error ε there exists n(ε) = O((log(θ2) − log(θ1))/ε) such that |f̂_{n(ε)}(θ) − f(θ)| ≤ ε for all θ ∈ [θ1, θ2].

Figure 2: Chernoff policies for the Grid World MDP. See text for complete description. The colored arrows indicate the most likely paths under Chernoff policies for different values of δ: one path for δ ∈ {0.75, 0.9, 0.99, 1.0 (expectimin)}, one for δ = 0.6, one for δ ∈ {0.1, 0.3}, one for δ ∈ {10^{-3}, 10^{-4}, 10^{-5}, 10^{-6}, 10^{-7}}, and one for δ ∈ {10^{-10}, 10^{-8}}. The minimax policy (δ = 0) acts randomly since it assumes that any action will lead to a trap.

Proof sketch. The key insight when proving this theorem is bounding the rate of change of f. We can immediately see that f_π(θ) := θ log E_{s,π}[e^{J/θ}] is a convex function since it is the perspective transformation of a convex function, namely, the cumulant generating function of the total cost J. Additionally, Theorem 1 shows that f_π is lower bounded by E_{s,π}[J], assumed to be finite, which implies that f_π is non-increasing. On the other hand, by directly differentiating the definition of f_π, we get that θ f'_π(θ) = f_π(θ) − E_{s,π}[J e^{J/θ}] / E_{s,π}[e^{J/θ}]. Since we assumed that the costs, J, are upper bounded, there exists a maximum cost j_M such that J ≤ j_M almost surely for any starting state s, and any policy π. We have also shown that f_π(θ) ≥ E_{s,π}[J] ≥ j_m := min_{π'} E_{s,π'}[J], so we conclude that −(j_M − j_m)/θ ≤ f'_π(θ) ≤ 0 for any policy π. Now that we have bounded the derivative of f_π, we can see that the value of f cannot change too much over an interval [θ^n_i, θ^n_{i+1}]. Let π_i := argmin_π f_π(θ^n_i) and π_{i+1} := argmin_π f_π(θ^n_{i+1}). Then:

  0 ≤ f(θ^n_i) − f(θ^n_{i+1}) = f_{π_i}(θ^n_i) − f_{π_{i+1}}(θ^n_{i+1}) ≤ f_{π_{i+1}}(θ^n_i) − f_{π_{i+1}}(θ^n_{i+1})
    ≤ max_{θ^n_i ≤ θ ≤ θ^n_{i+1}} |f'_{π_{i+1}}(θ)| · (θ^n_{i+1} − θ^n_i) = −f'_{π_{i+1}}(θ^n_i) · (θ^n_{i+1} − θ^n_i)
    ≤ (j_M − j_m) · (θ^n_{i+1} − θ^n_i)/θ^n_i = (j_M − j_m) ((θ2/θ1)^{1/n} − 1),        (3)

where we first used the fact that f_{π_i}(θ^n_i) = min_π f_π(θ^n_i) ≤ f_{π_{i+1}}(θ^n_i), then the convexity of f_{π_{i+1}}, which implies that f'_{π_{i+1}} is increasing, and, finally, our previous derivative bound. Our final goal is to find a value of n(ε) such that the last expression in Equation 3 is less than ε.
One can easily verify that the following n(ε) satisfies this requirement (the detailed derivation appears in the Appendix):

  n(ε) = ⌈ (j_M − j_m)/ε · log(θ2/θ1) + log(θ2/θ1) ⌉.

5 Experiments

We ran a number of experiments to test that our proposed objective indeed captures the intuitive meaning of risk-aware planning. The first experiment models a situation where it is immediately obvious what the family of risk-aware policies should be. We show that optimizing the Chernoff functional with increasing values of δ produces the intuitively correct family of policies. The second experiment shows that our method can be applied successfully to a large-scale, real-world problem, where it is difficult to immediately "see" the risk-aware family of policies.

Our experiments empirically confirm some of the properties of the Chernoff functional proven in Theorem 1: the probability that the return is lower than the value of the Chernoff policy at level δ is always less than δ, setting δ = 1 corresponds to optimizing the expected return with the added benefit of breaking ties in favor of lower variance, and setting δ = 0 leads to the minmax policy whenever it is defined. Additionally, we observed that policies at lower risk levels, δ, tend to have lower expectation but also lower variance, if the structure of the problem allows it. Generally, the probability of extremely bad outcomes decreases as we lower δ.

5.1 Grid world

We first tested our algorithm on the Grid-World MDP (Figure 2). It models an obstacle avoidance problem with stochastic dynamics. Each state corresponds to a square in the grid, and the actions, {N, NE, E, SE, S, SW, W, NW}, typically cause a move in the respective direction.
In unmarked squares, the actor's intention is executed with probability 0.93. Each of the seven remaining actions might be executed instead, each with probability 0.01. Squares marked with $ and # are absorbing states. The former gives a reward of 35 when entered, and the latter gives a penalty of 35. Any other state transition costs 1. The horizon is 35. To make the problem finite, we simply set the probability of all transitions outside the grid boundary to zero, and re-normalize. We set the precision to ε = 1.

Figure 3: Chernoff policies to travel from Shreveport Regional Airport (SHV) to Rafael Hernández Airport (BQN) at different risk levels. (a) Paths under Chernoff policies, assuming all flights arrive on time, shown using International Air Transport Association (IATA) airport codes:
δ ∈ {.99, .999, 1.0 (expectimin)}: 15:45 SHV - DFW 16:45; 18:25 DFW - MCO 21:50; 23:15 MCO - BQN 02:46
δ ∈ {.3, .4, .5, .6, .7, .8, .9}: 10:46 SHV - ATL 13:31; 14:10 ATL - EWR 16:30; 18:00 EWR - BQN 23:00
δ = 0.2: 12:35 SHV - DFW 13:30; 18:25 DFW - MCO 21:50; 23:15 MCO - BQN 02:46
δ ∈ {0 (minimax), .001, .01, .1}: 12:35 SHV - DFW 13:30; 14:25 DFW - MSY 15:50; 17:50 MSY - JFK 21:46; 23:40 JFK - BQN 04:20
(b) Cumulative distribution functions of rewards (equal to minus cost) under Chernoff policies at the same risk levels, estimated from 10000 samples. The asterisk (*) indicates the value of the policy, the big O indicates the expected reward, and the small o's correspond to the expectation plus or minus one standard deviation.
With this setting, our algorithm performed exponential utility optimization for 97 different parameters when planning for 14 values of the risk level δ. For low values of δ, the algorithm behaves cautiously, preferring longer, but safer routes. For higher values of δ, the algorithm is willing to take shorter routes, but also accepts increasing amounts of risk.

5.2 Air travel planning

The air travel planning MDP (Figure 3) illustrates that our method applies to real-world problems at a large scale. It models the problem of buying airplane tickets to travel between two cities, when you care only about reaching the destination in a reliable amount of time. We assume that, if you miss a connecting flight due to delays, the airline will re-issue a ticket for the route of your choice leading to the original destination. In this case, a cautious traveler will consider a number of aspects: choosing flights that usually arrive on time, choosing longer connection times, and making sure that, in case of a missed connection, there are good alternative routes.

In our implementation, the state space consists of pairs of all airports and times when flights depart from those airports. At every state there are two actions: either take the flight that departs at that time, or wait. The total number of state-action pairs is 124791. To keep the horizon low, we introduce enough wait transitions so that it takes no more than 10 transitions to wait a whole day in the busiest airport (about 1000 flights per day), and we set the horizon at 100. Costs are deterministic and correspond to the time difference between the scheduled departure time of the first flight and the arrival time.
We compute transition probabilities based on historical data, available from the Office of Airline Information, Bureau of Transportation Statistics, at http://www.transtats.bts.gov/. In particular, we have used on-time statistics for February 2011. Airlines often try to conceal statistics for flights with low on-time performance by slightly changing departure times and flight numbers. Sometimes, they do this every week. Consequently, we first clustered together all flights with the same origin and destination that were scheduled to depart within 15 minutes of each other, under the assumption that they would have the same on-time statistics. We then removed all clusters with fewer than 7 recorded flights, since these usually correspond to incidental flights.

Figure 4: Histograms demonstrating the efficiency and relevance of our algorithm on 500 randomly chosen origin - destination airport pairs, at 15 risk levels. (a) Number of exponential utility optimization runs needed to compute the Chernoff policies. (b) Number of distinct Chernoff policies found.

To test our algorithm on this problem, we randomly chose 500 origin - destination airport pairs and computed the Chernoff policies for risk levels δ ∈ {1.0, .999, .99, .9, .8, ..., .1, 0.01, 0.001, 0.0}, and precision ε = 10 minutes. Figure 3 shows the resulting policies and corresponding cost (travel time) histograms for one such randomly chosen route. To address the question of computational efficiency, Figure 4a shows a histogram of the total number of different parameters for which our algorithm ran exponential utility optimization. To address the question of relevance, Figure 4b shows the number of distinct Chernoff policies found among the risk levels.
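The distinctness count just mentioned compares the state-action occupation measures induced by each policy, which can be computed with a forward pass through the MDP. The sketch below is our own code and naming (the paper does not spell out its implementation); it propagates P{S_t = s, A_t = a} for a time-dependent stochastic policy and applies the elementwise 10^-6 threshold defined in the next paragraph:

```python
import numpy as np

def occupation_measures(p, policy, s0, h):
    """Forward-propagate the occupation probabilities P{S_t = s, A_t = a}.

    p: (S, A, S) transition probabilities; policy: (h, S, A) per-step
    action distributions; returns an (h, S, A) array.
    """
    S, A, _ = p.shape
    mu = np.zeros(S)
    mu[s0] = 1.0                                  # state marginal at time t
    occ = np.zeros((h, S, A))
    for t in range(h):
        occ[t] = mu[:, None] * policy[t]          # P{S_t = s, A_t = a}
        mu = np.einsum('sa,sax->x', occ[t], p)    # state marginal at t + 1
    return occ

def distinct(occ1, occ2, tol=1e-6):
    """Some occupation probability differs by at least tol."""
    return bool(np.abs(occ1 - occ2).max() >= tol)

# Deterministic 2-state example: action 0 moves 0 -> 1, action 1 stays.
p = np.zeros((2, 2, 2))
p[0, 0, 1] = 1.0; p[0, 1, 0] = 1.0; p[1, :, 1] = 1.0
pol_a = np.zeros((2, 2, 2)); pol_a[:, :, 0] = 1.0  # always action 0
pol_b = np.zeros((2, 2, 2)); pol_b[:, :, 1] = 1.0  # always action 1
o_a = occupation_measures(p, pol_a, 0, 2)
o_b = occupation_measures(p, pol_b, 0, 2)
```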
Two policies, π and π', are considered distinct if the total variation distance of the induced state-action occupation measures is more than 10^-6; that is, if there exist t, s, and a such that |P_π{S_t = s, A_t = a} − P_{π'}{S_t = s, A_t = a}| ≥ 10^-6. For most origin - destination pairs we found a rich spectrum of distinct policies, but there are also cases where all the Chernoff policies are identical or only the expectimin and minimax policies differ.

Many air travel routes exhibit only two phases, mainly because they connect small airports where only one or two flights of the type we consider land or take off per day. Consequently, there will be few policies to choose from in these cases. In our experiment, we chose 200 origin and destination pairs at random and, of these, 72 routes show only two phases. In 41 of these cases, either the origin or the destination airport serves only one or two flights per day total. Only 9 of the two-phase routes connect airports which both serve more than 10 flights per day total, and, of course, not all of these flights will help reach the destination. Thus, typically the reason we see only two phases is that the choice of policies is very limited. Additionally, airlines have an incentive to provide sufficient margin such that passengers can make connections and do not have to be re-ticketed. That is, they tend to set up routes such that, even in a worse-than-average scenario, the original route will tend to succeed.

6 Conclusion

We proposed a new optimization objective for risk-aware planning called the Chernoff functional. Our objective has a free parameter δ that can be used to interpolate between the nominal policy, which ignores risk, at δ = 1, and the minmax policy, which corresponds to extreme risk aversion, at δ = 0.
The value of the Chernoff functional is, with probability at least 1 − δ, an upper bound on the cost incurred by following the corresponding Chernoff policy. We established a close connection between optimizing the Chernoff functional and mean minus variance optimization, which had been proposed before for risk-aware planning but was found to be intractable in general. We also established a close connection with optimization of mean minus scaled standard deviation.
We proposed an efficient algorithm that optimizes the Chernoff functional to any desired accuracy ε, requiring O(1/ε) runs of exponential utility optimization. Our experiments illustrate the ability of our approach to recover a spread of policies in the spectrum from risk neutral to minmax, with a running time that was, on average, about ten times that of value iteration.

References

[1] G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramanian, P. Vosshall, and W. Vogels. Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review, 41(6):205–220, 2007.

[2] Philippe Jorion. Value at Risk: The New Benchmark for Managing Financial Risk, volume 1. McGraw-Hill Professional, 2007.

[3] Shie Mannor and John N. Tsitsiklis. Mean-Variance Optimization in Markov Decision Processes. In Proceedings of the 28th International Conference on Machine Learning, 2011.

[4] Erick Delage and Shie Mannor. Percentile optimization in uncertain Markov decision processes with application to efficient exploration. In Proceedings of the 24th International Conference on Machine Learning, 2007.

[5] Jay K. Satia and Roy E. Lave Jr. Markovian Decision Processes with Uncertain Transition Probabilities. Operations Research, 21(3):728–740, 1973.

[6] Matthias Heger. Consideration of risk in reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, pages 105–111.
Morgan Kaufmann, 1994.

[7] Steve Levitt and Adi Ben-Israel. On Modeling Risk in Markov Decision Processes. In Optimization and Related Topics, pages 27–41, 2001.

[8] Erick Delage and Shie Mannor. Percentile Optimization for Markov Decision Processes with Parameter Uncertainty. Operations Research, 58(1):203–213, 2010.

[9] Arnab Nilim and Laurent El Ghaoui. Robust Control of Markov Decision Processes with Uncertain Transition Matrices. Operations Research, 53(5):780–798, 2005.

[10] M. Bouakiz and Y. Kebir. Target-level criterion in Markov decision processes. Journal of Optimization Theory and Applications, 86(1):1–15, July 1995.

[11] Congbin Wu and Yuanlie Lin. Minimizing Risk Models in Markov Decision Processes with Policies Depending on Target Values. Journal of Mathematical Analysis and Applications, 231(1):47–67, 1999.

[12] S. I. Marcus, E. Fernández-Gaucherand, D. Hernández-Hernández, S. Coraluppi, and P. Fard. Risk sensitive Markov decision processes. Systems and Control in the Twenty-First Century, 29:263–281, 1997.

[13] V. S. Borkar and S. P. Meyn. Risk-sensitive optimal control for Markov decision processes with monotone cost. Mathematics of Operations Research, 27(1):192–209, 2002.

[14] B. Defourny, D. Ernst, and L. Wehenkel. Risk-aware decision making and dynamic programming. In NIPS 2008 Workshop on Model Uncertainty and Risk in RL, 2008.

[15] Yann Le Tallec. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, 2007.

[16] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[17] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, October 1996.

[18] J. F. Kenney and E. S. Keeping.
Cumulants and the cumulant-generating function, additive property of cumulants, and Sheppard's correction. In Mathematics of Statistics, chapters 4.10–4.12, pages 77–82. Van Nostrand, Princeton, NJ, 2nd edition, 1951.

[19] Richard Durrett. Probability: Theory and Examples. Cambridge University Press, 2010.