{"title": "Robustness and risk-sensitivity in Markov decision processes", "book": "Advances in Neural Information Processing Systems", "page_first": 233, "page_last": 241, "abstract": "We uncover relations between robust MDPs and risk-sensitive MDPs. The objective of a robust MDP is to minimize a function, such as the expectation of cumulative cost, for the worst case when the parameters have uncertainties. The objective of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost when the parameters are known. We show that a risk-sensitive MDP of minimizing the expected exponential utility is equivalent to a robust MDP of minimizing the worst-case expectation with a penalty for the deviation of the uncertain parameters from their nominal values, which is measured with the Kullback-Leibler divergence. We also show that a risk-sensitive MDP of minimizing an iterated risk measure that is composed of certain coherent risk measures is equivalent to a robust MDP of minimizing the worst-case expectation when the possible deviations of uncertain parameters from their nominal values are characterized with a concave function.", "full_text": "Robustness and risk-sensitivity in Markov decision\n\nprocesses\n\nTakayuki Osogami\nIBM Research - Tokyo\n\n5-6-52 Toyosu, Koto-ku, Tokyo, Japan\n\nosogami@jp.ibm.com\n\nAbstract\n\nWe uncover relations between robust MDPs and risk-sensitive MDPs. The objec-\ntive of a robust MDP is to minimize a function, such as the expectation of cumu-\nlative cost, for the worst case when the parameters have uncertainties. The objec-\ntive of a risk-sensitive MDP is to minimize a risk measure of the cumulative cost\nwhen the parameters are known. 
We show that a risk-sensitive MDP of minimiz-\ning the expected exponential utility is equivalent to a robust MDP of minimizing\nthe worst-case expectation with a penalty for the deviation of the uncertain pa-\nrameters from their nominal values, which is measured with the Kullback-Leibler\ndivergence. We also show that a risk-sensitive MDP of minimizing an iterated\nrisk measure that is composed of certain coherent risk measures is equivalent to\na robust MDP of minimizing the worst-case expectation when the possible devi-\nations of uncertain parameters from their nominal values are characterized with a\nconcave function.\n\n1\n\nIntroduction\n\nRobustness against uncertainties and sensitivity to risk are major issues that have been addressed\nin recent development of the Markov decision process (MDP). The robust MDP [3, 4, 10, 11, 12,\n20, 21] deals with uncertainties in parameters; that is, some of the parameters of the MDP are not\nknown exactly. The objective of a robust MDP is to minimize a function for the worst case when\nthe values of its parameters vary within a prede\ufb01ned set called an uncertainty set. The standard\nobjective function is the expected cumulative cost [11]. When the uncertainty set is trivial, the\nrobust MDP is reduced to the standard MDP [17]. The risk-sensitive MDP [5, 7, 13, 14, 19], on the\nother hand, assumes that the parameters are exactly known. The objective of a risk-sensitive MDP is\nto minimize the value of a risk measure, such as the expected exponential utility [5, 7, 8, 15, 18], of\nthe cumulative cost. When the risk measure is expectation, the risk-sensitive MDP is reduced to the\nstandard MDP. The robust MDP and the risk-sensitive MDP have been developed independently.\nThe goal of this paper is to reveal relations between these two seemingly unrelated models of MDPs.\nSuch unveiled relations will provide insights into the two models of MDPs. 
For example, it is not\nalways clear what it means to minimize the value of a risk measure or to minimize the worst case\nexpected cumulative cost under an uncertainty set. In particular, the iterated risk measure studied\nin [13, 14, 19] is defined recursively, which prevents an intuitive understanding of its meaning. The\nunveiled relation to a robust MDP can allow us to understand what it means to minimize the value of\nan iterated risk measure in terms of uncertainties. In addition, the optimal policy for a robust MDP\nis often found to be too conservative [3, 4, 10, 21], or it becomes intractable to find the optimal policy,\nparticularly when the transition probabilities have uncertainties [3, 4, 10]. The unveiled relations to\na risk-sensitive MDP, for which the optimal policy can be found efficiently, can allow us to find the\noptimal robust policy efficiently while keeping the policy from being too conservative. We will explore these\npossibilities.\n\nThe contributions of this paper can be summarized in two points. First, we prove that a risk-sensitive\nMDP with the objective of minimizing the value of an iterated risk measure is equivalent to a robust\nMDP with the objective of minimizing the expected cumulative cost for the worst case when\nthe probability mass functions for the transition and cost from a state have uncertainties. More\nspecifically, the iterated risk measure of the risk-sensitive MDP is defined recursively with a class of\ncoherent risk measures [9], and it evaluates the riskiness of the sum of the value of a coherent risk\nmeasure of immediate cost. The uncertainty set of the robust MDP is characterized by the use of a\nrepresentation of the coherent risk measure. 
See Section 2.\nSecond, we prove that a risk-sensitive MDP with the objective of minimizing an expected exponential\nutility is equivalent to a robust MDP whose objective is to minimize the expected cumulative\ncost minus a penalty function for the worst case when the probability mass functions for the transition\nand cost from a state have uncertainties. More specifically, the expected exponential utility\nevaluates the riskiness of the sum of the value of an entropic risk measure [6] of immediate cost.\nThe penalty function measures the deviation of the values of the probability mass functions from\ntheir nominal values using the Kullback-Leibler divergence. See Section 3.\n\n2 Robust representations of iterated coherent risk measures\n\nThroughout this paper, we consider Markov decision processes over a finite horizon, so that there\nare N decision epochs. Let Sn be the set of possible states at the n-th decision epoch for n =\n0, . . . , N \u2212 1. Let A(s) be the set of candidate actions from the state, s. We assume that a nominal\ntransition probability, p0(s\u2032|s, a), is associated with the transition from each state s \u2208 Sn to each\nstate s\u2032 \u2208 Sn+1 given that the action a \u2208 A(s) is taken at s for n = 0, . . . , N \u2212 1. For a robust MDP,\nthe corresponding true transition probability, p(s\u2032|s, a), has the uncertainty that will be specified in\nthe sequel. The random cost, C(s, a), is associated with each pair of a state, s, and an action,\na \u2208 A(s). We assume that C(s, a) has a nominal probability distribution, but the true probability\ndistribution for a robust MDP has the uncertainty that will be specified in the sequel. 
We assume\nthat Si and Sj are disjoint for any j \u2260 i (e.g., the state space is augmented with time).\n\n2.1 Special case of the iterated conditional tail expectation\n\nWe start by studying a robust MDP where the uncertainty is specified by the factor, \u03b1, such that\n0 < \u03b1 < 1, which determines the possible deviation from the nominal value. Specifically, for each\npair of s \u2208 Sn and a \u2208 A(s), the true transition probabilities are in the following uncertainty set:\n\n    0 \u2264 p(s\u2032|s, a) \u2264 (1/\u03b1) p0(s\u2032|s, a), \u2200s\u2032 \u2208 Sn+1, and \u2211_{s\u2032 \u2208 Sn+1} p(s\u2032|s, a) = 1.    (1)\n\nThroughout Section 2.1, we assume that the cost C(s, a) is deterministic and has no uncertainty.\nBecause the uncertainty set (1) is convex, the existing technique [11] can be used to efficiently find\nthe optimal policy that minimizes the expected cumulative cost for the worst case where the true\nprobability is chosen to maximize the expected cumulative cost within the uncertainty set:\n\n    min_\u03c0 max_{p \u2208 Up} Ep[C\u0303(\u03c0)],    (2)\n\nwhere C\u0303(\u03c0) is the cumulative cost with a policy \u03c0, and Ep is the expectation with respect to p, which\nis chosen from the uncertainty set, Up, defined with (1).\nOur key finding is that there is a risk-sensitive MDP that is equivalent to the robust MDP having\nthe objective (2). Specifically, consider the risk-sensitive MDP, where the transition probability is\ngiven by p0, and the cost C(s, a) is deterministic given s and a. 
This risk-sensitive MDP becomes\nequivalent to the robust MDP having the objective (2) when the objective of the risk-sensitive MDP\nis to find the optimal \u03c0 with respect to an iterated conditional tail expectation (ICTE) [13]:\n\n    min_\u03c0 ICTE^(N)_\u03b1[C\u0303(\u03c0)],    (3)\n\nwhere ICTE^(N)_\u03b1 denotes the ICTE defined for N decision epochs with parameter \u03b1. Specifically,\nICTE^(N)_\u03b1 is defined recursively with conditional tail expectation (CTE) as follows [13]:\n\n    ICTE^(N\u2212i+1)_\u03b1[C\u0303(\u03c0)] \u2261 CTE\u03b1[ICTE^(N\u2212i)_\u03b1[C\u0303(\u03c0)|Si]], for i = 1, . . . , N,    (4)\n\nwhere we define ICTE^(0)_\u03b1[C\u0303(\u03c0)] \u2261 C\u0303(\u03c0). In (4), ICTE^(N\u2212i)_\u03b1[C\u0303(\u03c0)|Si] denotes the ICTE of C\u0303(\u03c0)\nconditioned on the state at the i-th decision epoch. When Si is random, so is ICTE^(N\u2212i)_\u03b1[C\u0303(\u03c0)|Si].\nThe right-hand side of (4) evaluates the CTE of this random ICTE^(N\u2212i)_\u03b1[C\u0303(\u03c0)|Si]. CTE is also\nknown as conditional value at risk or average value at risk and is formally defined as follows for a\nrandom variable Y:\n\n    CTE\u03b1[Y] \u2261 ((1 \u2212 \u03b2) E[Y | Y > V\u03b1] + (\u03b2 \u2212 \u03b1) V\u03b1) / (1 \u2212 \u03b1),    (6)\n\nwhere V\u03b1 \u2261 min{y | FY(y) \u2265 \u03b1}, \u03b2 denotes FY(V\u03b1), and FY is the cumulative distribution function of Y. 
For a\ncontinuous Y, or whenever there is no probability mass at V\u03b1, we have CTE\u03b1[Y] = E[Y | Y > V\u03b1].\nThe equivalence between the robust MDP with the objective (2) and the risk-sensitive MDP with the\nobjective (3) can be shown by the use of the following alternative definition of CTE:\n\n    CTE\u03b1[Y] \u2261 max_{q \u2208 Q} Eq[Y],    (7)\n\nwhere Q is the set of probability mass functions, q, whose support is a subset of the support of the\nprobability mass function, q0, of Y such that q(y) \u2264 q0(y)/\u03b1 for every y in the support of q0. Specifically, let C^\u03c0_i be\nthe cost incurred at the i-th epoch with policy \u03c0 so that C\u0303(\u03c0) = C^\u03c0_0 + \u00b7\u00b7\u00b7 + C^\u03c0_{N\u22121}. Then, by the\nrecursive definition of ICTE and the translation invariance (footnote 1) of CTE, it can be shown that\n\n    ICTE^(N\u2212i+1)_\u03b1[C\u0303(\u03c0) | Si\u22121]\n        = \u2211_{j=0}^{i\u22121} C^\u03c0_j + CTE\u03b1[C^\u03c0_i + ICTE^(N\u2212i)_\u03b1[\u2211_{j=i+1}^{N\u22121} C^\u03c0_j | Si] | Si\u22121]    (8)\n        = \u2211_{j=0}^{i\u22121} C^\u03c0_j + max_{p \u2208 Up} Ep[C^\u03c0_i + ICTE^(N\u2212i)_\u03b1[\u2211_{j=i+1}^{N\u22121} C^\u03c0_j | Si] | Si\u22121],    (9)\n\nwhere the second equality follows from (7). What (9) suggests is that the ICTE of the cumulative\ncost given Si can be represented by the cost already accumulated, C^\u03c0_0 + \u00b7\u00b7\u00b7 + C^\u03c0_{i\u22121}, plus the maximum\npossible expected value of the sum of the cost incurred at Si and the ICTE of the cumulative cost to\nbe incurred from the (i + 1)st epoch. Induction can now be used to establish the following theorem,\nwhich will be proved formally for the general case in Section 2.3:\nTheorem 1. 
When the immediate cost from a state is deterministic given that state and the action\nfrom that state, the risk-sensitive MDP with the objective (3) is equivalent to the robust MDP with\nthe objective (2).\n\nThroughout, we say that a risk-sensitive MDP is equivalent to a robust MDP if the two MDPs have a\ncommon state space, and, regardless of the values of the parameters of the MDPs, the optimal action\nfor one MDP coincides with that for the other for every state.\n\n2.2 Relation between cost uncertainty and risk-sensitivity\n\nIn addition to the transition probabilities, we now assume that the probability distribution of cost has\nuncertainty. Specifically, for each pair of s \u2208 Sn and a \u2208 A(s), the true probability mass function (footnote 2),\nf(\u00b7|s, a), for the random cost, C(s, a), is in the following uncertainty set that is characterized with\nthe nominal probability mass function, f0(\u00b7|s, a):\n\n    0 \u2264 f(x|s, a) \u2264 (1/\u03b1) f0(x|s, a), \u2200x \u2208 X(s, a), and \u2211_{x \u2208 X(s,a)} f(x|s, a) = 1,    (10)\n\nwhere X(s, a) is the support of C(s, a). Because the uncertainty sets, (1) and (10), are both convex,\nthe existing technique [11] can still be used to efficiently find the optimal policy with respect to\n\n    min_\u03c0 max_{p \u2208 Up, f \u2208 Uf} Ep,f[C\u0303(\u03c0)],    (11)\n\nwhere Uf is defined analogously to Up.\n\nFootnote 1: CTE\u03b1[Y + b] = CTE\u03b1[Y] + b for a random Y and a deterministic b.\nFootnote 2: Continuous cost is discussed in the supplementary material.\n\nFigure 1: An illustration of the probabilities that give the worst case expectation. Panels: (a) g(t) = min{t/\u03b1, 1}; (b) concave g; (c) piecewise linear g.\n\nAgain, our key finding is that there is a risk-sensitive MDP that is equivalent to the robust MDP\nhaving the objective (11). 
To define the objective of the equivalent risk-sensitive MDP, let D(s, a) \u2261\nCTE\u03b1[C(s, a)] and let D\u0303(\u03c0) be the cumulative value of D(s, a) along the sequence of (s, a) with a\npolicy \u03c0. Then the objective of the equivalent risk-sensitive MDP is given by\n\n    min_\u03c0 ICTE^(N)_\u03b1[D\u0303(\u03c0)].    (12)\n\nBy first applying (7) to D(s, a) and following the arguments that have led to Theorem 1, we can\nestablish the following theorem, which will be proved formally for the general case in Section 2.3:\nTheorem 2. The risk-sensitive MDP with the objective (12) is equivalent to the robust MDP with\nthe objective (11).\n\n2.3 General case of coherent risk measures\n\nThe robust MDPs considered in Section 2.1 and Section 2.2 are not quite flexible, can lead to too\nconservative policies depending on the value of \u03b1, or might be too sensitive to the particular value\nof \u03b1. We now introduce a broader class of robust MDPs and equivalent risk-sensitive MDPs.\nTo define the broader class of robust MDPs, we study the uncertainty set of (1) and (10) in more\ndetail. Given a random variable that takes value vi with nominal probability pi for i = 1, . . . , m, a\nstep of finding the optimal robust policy calculates the maximum possible expected value:\n\n    max_q   q1 v1 + \u00b7\u00b7\u00b7 + qm vm\n    s.t.    0 \u2264 qi \u2264 (1/\u03b1) pi, \u2200i = 1, . . . , m,\n            q1 + \u00b7\u00b7\u00b7 + qm = 1.    (13)\n\nWithout loss of generality, let v1 > v2 > . . . > vm. Then the optimal solution to (13) can be\nillustrated with Figure 1(a): for i = 1, . . . , m, the optimal solution q \u2261 (q1, . . . , qm) satisfies\n\n    \u2211_{\u2113=1}^{i} q\u2113 = g(\u2211_{\u2113=1}^{i} p\u2113),    (14)\n\nwhere g(t) = min{t/\u03b1, 1}. 
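As a hedged illustration of (13) and (14), the following minimal Python sketch builds the worst-case q greedily and compares the resulting expectation with the tail formula (6) evaluated at level 1 - alpha, which is the substitution used later in the proof of Theorem 3. The function names worst_case_expectation and cte_tail_formula, the sample numbers, the assumption of distinct outcome values, and the reading of beta in (6) as FY(V) are all choices made for this example and are not taken from the paper.

```python
import math


def worst_case_expectation(values, probs, alpha):
    # Greedy solution of (13): visit outcomes from the largest value to the
    # smallest and give each one as much mass as the cap p_i / alpha allows,
    # so that the cumulative q equals g(cumulative p) with g(t) = min(t / alpha, 1).
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    q = [0.0] * len(values)
    remaining = 1.0
    for i in order:
        q[i] = min(probs[i] / alpha, remaining)
        remaining = max(remaining - q[i], 0.0)
    return sum(qi * vi for qi, vi in zip(q, values)), q


def cte_tail_formula(values, probs, level):
    # Formula (6) for distinct values, reading beta as F_Y(V_level) (an assumption).
    pairs = sorted(zip(values, probs))
    cumulative = 0.0
    for y, p in pairs:
        cumulative += p
        if cumulative >= level - 1e-12:
            v_level, beta = y, cumulative
            break
    # This sum equals (1 - beta) * E[Y | Y > V_level].
    upper_part = sum(p * y for y, p in pairs if y > v_level)
    return (upper_part + (beta - level) * v_level) / (1.0 - level)


# Hypothetical example data: four outcomes with nominal probabilities.
values = [10.0, 5.0, 2.0, 0.0]
probs = [0.1, 0.2, 0.3, 0.4]
alpha = 0.25

worst, q = worst_case_expectation(values, probs, alpha)
tail = cte_tail_formula(values, probs, 1.0 - alpha)
print(worst, q, tail)
```

For these numbers the greedy q puts mass 0.4 on the largest value and 0.6 on the second largest, and both quantities agree at 7 up to floating-point rounding, which is what the equivalence between (7) and (6) suggests.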
Relaxing the constraints in (13), we obtain the following optimization\nproblem, whose optimal solution is still given by (14):\n\n    max_q   q1 v1 + \u00b7\u00b7\u00b7 + qm vm\n    s.t.    \u2211_{\u2113=1}^{i} q\u2113 \u2264 g(\u2211_{\u2113=1}^{i} p\u2113), \u2200i = 1, . . . , m,\n            0 \u2264 qi, \u2200i = 1, . . . , m.    (15)\n\nThe inflexibility of (15) stems from the inflexibility of g(t) = min{t/\u03b1, 1}, which has only one\nadjustable parameter, \u03b1. When \u03b1 is small (specifically, 0 < \u03b1 \u2264 1 \u2212 pm), some of the qi become\nzero. This means that the corresponding optimistic cases (those resulting in small vi) are ignored.\nOtherwise (specifically, 1 \u2212 pm < \u03b1 \u2264 1), the uncertainty set can become too small as qi \u2264 pi/\u03b1, \u2200i.\nThis inflexibility motivates us to generalize g to a concave function such that g(0) = 0 and g(1) = 1\n(see Figure 1(b)). The optimal solution to (15) with the concave g is still given by (14). With an\nappropriate g, we can consider a sufficiently large uncertainty set for the pessimistic cases (e.g.,\nq1 \u226b p1) and at the same time consider the possibility of the optimistic cases (e.g., qm > 0).\nTo formally define the uncertainty set for p(s\u2032|s, a), s \u2208 Sn and a \u2208 A(s), with the concave g,\nlet Qp/p0(\u00b7) denote the quantile function of a random variable that takes value p(s\u2032|s, a)/p0(s\u2032|s, a)\nwith probability p0(s\u2032|s, a) for s\u2032 \u2208 Sn+1. 
Analogously, let Qf/f0(\u00b7) denote the quantile function of\na random variable that takes value f(x|s, a)/f0(x|s, a) with probability f0(x|s, a) for x \u2208 X(s, a).\nThen p(s\u2032|s, a) and f(x|s, a) are in the uncertainty set iff we have, for 0 < t < 1, that\n\n    \u222b_{1\u2212t}^{1} Qp/p0(u) du \u2264 g(t)  and  \u222b_{1\u2212t}^{1} Qf/f0(u) du \u2264 g(t).    (16)\n\nNow (7) suggests that expectation with respect to the q illustrated in Figure 1(a) is the CTE with\nparameter \u03b1 with respect to the corresponding p. It can be shown that the expectation with respect\nto the q illustrated in Figure 1(b) is a coherent risk measure, CRM, of the following form [9]:\n\n    CRMH[Y] = \u222b_{0}^{1} CTE\u03b1[Y] dH(\u03b1),    (17)\n\nfor a nondecreasing function H such that H(0) = 0 and H(1) = 1, where Y denotes a generic\nrandom variable. Notice that (17) is a weighted average of CTE\u03b1[Y] for varying \u03b1. One can\nbalance the weights on worse cases (higher \u03b1) and the weights on better cases (lower \u03b1).\nLet K(s, a) \u2261 CRMH[C(s, a)] and let K\u0303(\u03c0) be the cumulative value of K(s, a) along the sequence\nof (s, a) with a policy, \u03c0. We define an iterated coherent risk measure (ICRM) of K\u0303(\u03c0) as follows:\n\n    ICRM^(N\u2212i+1)_H[K\u0303(\u03c0)] \u2261 CRMH[ICRM^(N\u2212i)_H[K\u0303(\u03c0)|Si]], for i = 1, . . . , N,    (18)\n\nwhere ICRM^(0)_H[K\u0303(\u03c0)] \u2261 K\u0303(\u03c0). 
Now we are ready to prove the general results in this section.\nTheorem 3. Consider the risk-sensitive MDP with the following objective:\n\n    min_\u03c0 ICRM^(N)_H[K\u0303(\u03c0)].    (19)\n\nThis risk-sensitive MDP is equivalent to the robust MDP with the objective (11) if\n\n    dg(t)/dt = \u222b_{t}^{1} (1/s) dH(s)  for 0 < t < 1.    (20)\n\nTo gain an intuition behind (20), consider the g illustrated in Figure 1(c), where g1(t) =\nmin{t/\u03b11, r1}, g2(t) = min{t/\u03b12, r2}, and g(t) = g1(t) + g2(t) for 0 \u2264 t \u2264 1. The expectation\nwith respect to the q illustrated in Figure 1(c) can be represented by r1 CTE\u03b11[\u00b7] + r2 CTE\u03b12[\u00b7]\nwith respect to the corresponding p. The H is thus a piecewise constant function with a step of size\nri at \u03b1i for i = 1, 2. The slope dg(t)/dt is either 1/\u03b11 + 1/\u03b12, 1/\u03b12, or 0, depending on the particular\nvalue of t in such a way that (20) holds.\n\nProof of Theorem 3. Notice that Bellman\u2019s optimality equations are satisfied both for the robust\nMDP and for the risk-sensitive MDP under consideration. For the robust MDP, Bellman\u2019s optimality\nequations are established in [11]. For our risk-sensitive MDP, note that the coherent risk measure\nsatisfies strong monotonicity, translation-invariance, and positive homogeneity that are used to establish\nBellman\u2019s optimality equations in [13]. A difference between the risk-sensitive MDP in [13]\nand our risk-sensitive MDP is that the former minimizes the value of an iterated risk measure for C\u0303,\nwhile the latter minimizes the value of an iterated risk measure (specifically, ICRM^(0)_H) for K\u0303. 
This\ndifference does not affect whether Bellman\u2019s optimality equations are satisfied.\n\nThe equivalence between our risk-sensitive MDP and our robust MDP can thus be established by\nshowing that the two sets of Bellman\u2019s optimality equations are equivalent. For s \u2208 Sn, Bellman\u2019s\noptimality equation for our robust MDP is\n\n    v(s) = min_{a \u2208 A(s)} max_{p \u2208 Up, f \u2208 Uf} ( \u2211_{x \u2208 X(s,a)} x f(x|s, a) + \u2211_{s\u2032 \u2208 Sn+1} v(s\u2032) p(s\u2032|s, a) ),    (21)\n\nwhere v(s) denotes the value function representing the worst-case expected cumulative cost from s.\nFor s \u2208 Sn, Bellman\u2019s optimality equation for our risk-sensitive MDP is\n\n    w(s) = min_{a \u2208 A(s)} CRMH[CRMH[C0(s, a)] + W(s, a)],    (22)\n\nwhere w(s) denotes the value function representing the value of the iterated coherent risk measure\nfrom s, C0(s, a) is a random variable that takes value x with probability f0(x|s, a) for x \u2208 X(s, a),\nand W(s, a) is a random variable that takes value w(s\u2032) with probability p0(s\u2032|s, a) for s\u2032 \u2208 Sn+1.\nNote that the inner CRMH is calculated with respect to f0(\u00b7|s, a); the outer CRMH is with respect\nto p0(\u00b7|s, a). In the following, we will show the equivalence between (21) and (22).\nWe first show the following equality:\n\n    max_{f \u2208 Uf} \u2211_{x \u2208 X(s,a)} x f(x|s, a) = CRMH[C0(s, a)].    (23)\n\nLet {x1, x2, . . .} = X(s, a) such that x1 > x2 > . . .. As we have seen with Figure 1, the maximizer,\nf\u22c6, of the left-hand side of (23) satisfies\n\n    \u2211_{\u2113=1}^{i} f\u22c6(x\u2113|s, a) = g(\u2211_{\u2113=1}^{i} f0(x\u2113|s, a))    (24)\n\nfor i = 1, 2, . . .. 
For brevity, let Ri \u2261 \u2211_{\u2113=1}^{i} f0(x\u2113|s, a). Then we have\n\n    max_{f \u2208 Uf} \u2211_{x \u2208 X(s,a)} x f(x|s, a) = \u2211_i xi \u222b_{Ri\u22121}^{Ri} dg(t) = \u2211_i xi \u222b_{Ri\u22121}^{Ri} \u222b_{t}^{1} (1/u) dH(u) dt,    (25)\n\nwhere the first equality follows from (24), and the second equality follows from (20). Exchanging\nthe integrals in the last expression, we obtain\n\n    max_{f \u2208 Uf} \u2211_{x \u2208 X(s,a)} x f(x|s, a)\n        = \u2211_i xi \u222b_{Ri\u22121}^{1} (1/u) \u222b_{Ri\u22121}^{min{u,Ri}} dt dH(u)    (26)\n        = \u2211_i xi \u222b_{Ri\u22121}^{1} ((min{u, Ri} \u2212 Ri\u22121) / u) dH(u).    (27)\n\nExchanging the integral and the summation in the last expression, we obtain\n\n    max_{f \u2208 Uf} \u2211_{x \u2208 X(s,a)} x f(x|s, a)\n        = \u222b_{0}^{1} ( \u2211_{i: Ri\u22121 \u2264 u} xi (min{u, Ri} \u2212 Ri\u22121) / u ) dH(u)    (28)\n        = \u222b_{0}^{1} CTEu[C0(s, a)] dH(u),    (29)\n\nwhich establishes (23). To understand the last equality, plug in the following expressions in (6):\n\u03b1 = 1 \u2212 u, V\u03b1 = xi\u22c6, and \u03b2 = 1 \u2212 Ri\u22c6\u22121, where i\u22c6 \u2261 min{i | Ri > u}.\nFinally, we show the equivalence between (21) and (22). By (23), we have\n\n    max_{p \u2208 Up, f \u2208 Uf} ( \u2211_{x \u2208 X(s,a)} x f(x|s, a) + \u2211_{s\u2032 \u2208 Sn+1} v(s\u2032) p(s\u2032|s, a) ) = CRMH[C0(s, a)] + max_{p \u2208 Up} \u2211_{s\u2032 \u2208 Sn+1} v(s\u2032) p(s\u2032|s, a).    (30)\n\nAnalogously to (23), we can show\n\n    max_{p \u2208 Up} \u2211_{s\u2032 \u2208 Sn+1} v(s\u2032) p(s\u2032|s, a) = CRMH[V0(s, a)],    (31)\n\nwhere V0(s, a) denotes the random variable that takes value v(s\u2032) with probability p0(s\u2032|s, a) for\ns\u2032 \u2208 Sn+1. 
By (30) and (31), we have\n\n    v(s) = min_{a \u2208 A(s)} (CRMH[C0(s, a)] + CRMH[V0(s, a)]),    (32)\n\nwhere the first CRMH is with respect to f0(\u00b7|s, a); the second is with respect to p0(\u00b7|s, a). Because\nf0(\u00b7|s, a) is independent of the state at n + 1, the translation invariance (footnote 3) of CRMH implies\n\n    v(s) = min_{a \u2208 A(s)} CRMH[CRMH[C0(s, a)] + V0(s, a)],    (33)\n\nwhere the inner CRMH is with respect to f0(\u00b7|s, a); the outer is with respect to p0(\u00b7|s, a). Because\nv(s\u2032) = w(s\u2032) = 0, \u2200s\u2032 \u2208 SN, we can show, by induction, that (33) is equivalent to (22).\n\n3 Robust representations of expected exponential utilities\n\nIn this section, we study risk-sensitive MDPs whose objectives are defined with expected exponential\nutilities. We will see that there are robust MDPs that are equivalent to these risk-sensitive MDPs.\nWe start with the standard risk-sensitive MDP [5, 7, 8, 15, 18] whose objective is to minimize\nE[exp(\u03b3 C\u0303(\u03c0))] for \u03b3 > 0. Because \u03b3 > 0, minimizing E[exp(\u03b3 C\u0303(\u03c0))] is equivalent to minimizing\nan entropic risk measure (ERM) [6, 13]: ERM\u03b3[C\u0303(\u03c0)] \u2261 (1/\u03b3) ln E[exp(\u03b3 C\u0303(\u03c0))].\nThe key property of ERM that we exploit in this section is\n\n    ERM\u03b3[Y] = max_{q \u2208 P(q0)} {Eq[Y] \u2212 \u03b3 KL(q||q0)},    (34)\n\nwhere Y is a generic discrete random variable, q0 is the probability mass function for Y, P(q0) is\nthe set of probability mass functions whose support is contained in the support of q0 (i.e., q(y) = 0\nif q0(y) = 0 for q \u2208 P(q0)), Eq is the expectation with respect to q \u2208 P(q0), and KL(q||q0) is the\nKullback-Leibler divergence [2] from q to q0. The property (34) has been discussed in the context\nof optimal control [1, 16]. See [6] for a proof of (34). 
Observe that the maximizer of the right-hand\nside of (34) trades a large value of Eq[Y] for the closeness of q to q0.\nIt is now evident that the risk-sensitive MDP with the objective of minimizing E[exp(\u03b3 C\u0303(\u03c0))] is\nequivalent to a \u201crobust\u201d MDP with the objective of minimizing Eq[C\u0303(\u03c0)] \u2212 \u03b3 KL(q||q0) for the\nworst choice of q \u2208 P(q0), where q0 denotes the probability mass function for C\u0303(\u03c0). Here, the\nuncertainty is in the distribution of the cumulative cost, and it is nontrivial how this uncertainty is\nrelated to the uncertainty in the parameters, p and f, of the MDP.\nOur goal is to explicitly relate the risk-sensitive MDP of minimizing E[exp(\u03b3 C\u0303(\u03c0))] to uncertainties\nin the parameters of the MDP. For the moment, we assume that C(s, a) has no uncertainty and is\ndeterministic given s and a; this assumption will be relaxed later.\nTo see the relation, we study ERM\u03b3[C\u0303(\u03c0)] for a given \u03c0. Let p^\u03c0_0(si+1|si) be the nominal transition\nprobability from si \u2208 Si to si+1 \u2208 Si+1 for i = 0, . . . , N \u2212 1. By the translation invariance and the\nrecursiveness (footnote 4) of ERM [13], we have\n\n    ERM\u03b3[C\u0303(\u03c0)] = C^\u03c0_0 + ERM\u03b3[C^\u03c0_1 + ERM\u03b3[\u2211_{i=2}^{N\u22121} C^\u03c0_i | S1]],    (35)\n\nwhere the inner ERM is with respect to p^\u03c0_0(\u00b7|S1); the outer is with respect to p^\u03c0_0(\u00b7|s0). 
By (34), the\nsecond term of the right-hand side of (35) can be represented as follows:\n\n    max_{p^\u03c0(\u00b7|s0) \u2208 P(p^\u03c0_0(\u00b7|s0))} { E_{p^\u03c0(\u00b7|s0)}[C^\u03c0_1 + ERM\u03b3[\u2211_{i=2}^{N\u22121} C^\u03c0_i | S1]] \u2212 \u03b3 KL(p^\u03c0(\u00b7|s0)||p^\u03c0_0(\u00b7|s0)) }.    (36)\n\nThus, by induction, we can establish the following theorem (footnote 5):\nTheorem 4. When the immediate cost from a state is deterministic given that state and the action\nfrom that state, the risk-sensitive MDP with the objective of minimizing E[exp(\u03b3 C\u0303(\u03c0))] is equivalent\nto the robust MDP with the following objective:\n\n    min_\u03c0 max_{p^\u03c0 \u2208 P(p^\u03c0_0)} { E_{p^\u03c0}[\u2211_{i=0}^{N\u22121} C^\u03c0_i] \u2212 \u03b3 E_{p^\u03c0}[\u2211_{i=0}^{N\u22122} KL(p^\u03c0(\u00b7|Si)||p^\u03c0_0(\u00b7|Si))] },    (37)\n\nwhere p^\u03c0 \u2208 P(p^\u03c0_0) denotes that p^\u03c0(\u00b7|si) \u2208 P(p^\u03c0_0(\u00b7|si)), \u2200si \u2208 Si, i = 0, . . . , N \u2212 1.\n\nFootnote 3: CRMH[Y + c] = CRMH[Y] + c for a deterministic constant c.\nFootnote 4: ERM\u03b3[Y + c] = ERM\u03b3[Y] + c and ERM\u03b3[Y] = ERM\u03b3[ERM\u03b3[Y|Z]], where Y and Z are generic\nrandom variables, and c is a deterministic constant.\nFootnote 5: The proof is omitted, because this is a special case of Theorem 5.\n\nOur results in Section 2.2 motivate us to extend Theorem 4 to the case where C(s, a) has uncertainties.\nLet f^\u03c0_0(\u00b7|s) be the nominal probability mass function for the immediate cost from a state s\nunder a policy \u03c0. Consider the risk-sensitive MDP with the following objective:\n\n    min_\u03c0 ERM\u03b3[L\u0303(\u03c0)],    (38)\n\nwhere L\u0303(\u03c0) is the cumulative value of L(s, a) \u2261 ERM\u03b3[C(s, a)] along the sequence of (s, a) with\na policy \u03c0. 
Then we have the following theorem, which is proved in the supplementary material.\nTheorem 5. The risk-sensitive MDP with the objective (38) is equivalent to the robust MDP with\nthe following objective, where f^\u03c0 \u2208 P(f^\u03c0_0) is defined analogously to p^\u03c0 \u2208 P(p^\u03c0_0) in Theorem 4:\n\n    min_\u03c0 max_{p^\u03c0 \u2208 P(p^\u03c0_0), f^\u03c0 \u2208 P(f^\u03c0_0)} { E_{p^\u03c0,f^\u03c0}[\u2211_{i=0}^{N\u22121} C^\u03c0_i] \u2212 \u03b3 E_{p^\u03c0}[\u2211_{i=0}^{N\u22122} KL(p^\u03c0(\u00b7|Si)||p^\u03c0_0(\u00b7|Si)) + \u2211_{i=0}^{N\u22121} KL(f^\u03c0(\u00b7|Si)||f^\u03c0_0(\u00b7|Si))] }.    (39)\n\n4 Conclusion\n\nWe have shown relations between risk-sensitive MDPs and robust MDPs. Because ERM is also an\niterated risk measure [13], the objectives of the risk-sensitive MDPs studied in this paper are all\nwith respect to some iterated risk measures. The significance of iterated risk measures is discussed\nat length in [13]; in short, an iterated risk measure can represent preferences that cannot be represented\nby the standard expected exponential utility, and yet it allows efficient optimization and consistent decision making.\nWhile the prior work [13, 14, 19] minimizes the iterated risk measure of the cumulative cost (C\u0303(\u03c0)\nin Section 2), our study on the relation to a robust MDP suggests that one might want to minimize\nthe iterated risk measure of the sum of the values of risk measures for immediate costs (e.g., K\u0303(\u03c0) in\nSection 2.3 or L\u0303(\u03c0) in Section 3), because the latter is related to the robustness against uncertainty\nin cost. 
The optimal policy with respect to an iterated risk measure can be found ef\ufb01ciently with\ndynamic programming (speci\ufb01cally, the computational effort that is required in addition to that of the\ndynamic programming for minimizing the expected cumulative cost is in the time to calculate a risk\nmeasure such as CTE instead of expectation at each step of the dynamic programming) [13]. This\nmeans that the optimal policy for the robust MDP studied in this paper can be found quite ef\ufb01ciently.\nIn particular, the robust MDP in Theorem 5 might not seem to allow an ef\ufb01cient optimization without\nthe knowledge of the relation to the corresponding risk-sensitive MDP, whose optimal policy is\nreadily available with dynamic programming. Overall, the relation to a robust MDP can provide\nstrong motivation for the corresponding risk-sensitive MDP and vice versa.\nFor simplicity, the uncertainty sets in Section 2 are characterized by a single parameter, \u03b1, or a single\nfunction, g, but it is trivial to extend our results to the cases where the uncertainty sets are de\ufb01ned\ndifferently depending on the particular states, actions, and other elements of the MDP. In such cases,\nthe objective of the corresponding risk-sensitive MDP is composed of various risk measures. The\nuncertainty set in Section 3 depends only on the support of the nominal probability mass function.\nThe penalty for the deviation from the nominal value can be adjusted with a single parameter, \u03b3, but\nit is also trivial to extend our results to the cases, where this parameter varies depending on the partic-\nular elements of the MDP. In such cases, the objective of the corresponding risk-sensitive MDP is an\niterated risk measure composed of ERM having varying parameters. It would also be an interesting\ndirection to extend our results to convex risk measures, which allows robust representations.\n\n8\n\n\fReferences\n[1] C. D. Charalambous, F. Rezaei, and A. Kyprianou. 
Relations between information theory, robustness, and statistical mechanics of stochastic systems. In Proceedings of the 43rd IEEE Conference on Decision and Control, volume 4, pages 3479\u20133484, 2004.\n\n[2] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, Inc., Hoboken, New Jersey, 2nd edition, 2006.\n\n[3] E. Delage and S. Mannor. Percentile optimization in uncertain MDP with application to efficient exploration. In Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), pages 225\u2013232, June 2007.\n\n[4] E. Delage and S. Mannor. Percentile optimization for MDP with parameter uncertainty. Operations Research, 58(1):203\u2013213, 2010.\n\n[5] E. V. Denardo and U. G. Rothblum. Optimal stopping, exponential utility, and linear programming. Mathematical Programming, 16:228\u2013244, 1979.\n\n[6] H. F\u00f6llmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, Berlin, Germany, 3rd edition, 2010.\n\n[7] R. Howard and J. Matheson. Risk-sensitive Markov decision processes. Management Science, 18(7):356\u2013369, 1972.\n\n[8] S. C. Jaquette. A utility criterion for Markov decision processes. Management Science, 23(1):43\u201349, 1976.\n\n[9] S. Kusuoka. On law invariant coherent risk measures. In S. Kusuoka and T. Maruyama, editors, Advances in Mathematical Economics, volume 3, pages 83\u201395. Springer, Tokyo, 2001.\n\n[10] S. Mannor, O. Mebel, and H. Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In Proceedings of the International Conference on Machine Learning (ICML 2012), pages 385\u2013392, 2012.\n\n[11] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780\u2013798, 2005.\n\n[12] A. Nilim and L. El Ghaoui. Robustness in Markov decision problems with uncertain transition matrices. In S. Thrun, L. 
Saul, and B. Sch\u00f6lkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA, 2004.\n\n[13] T. Osogami. Iterated risk measures for risk-sensitive Markov decision processes with discounted cost. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI 2011), pages 567\u2013574, July 2011.\n\n[14] T. Osogami and T. Morimura. Time-consistency of optimization problems. In Proceedings of the 26th Conference on Artificial Intelligence (AAAI-12), July 2012.\n\n[15] S. D. Patek. On terminating Markov decision processes with a risk-averse objective function. Automatica, 37(9):1379\u20131386, 2001.\n\n[16] I. R. Petersen, M. R. James, and P. Dupuis. Minimax optimal control of stochastic uncertain systems with relative entropy constraints. IEEE Transactions on Automatic Control, 45(3):398\u2013412, 2000.\n\n[17] M. L. Puterman. Markov Decision Processes: Discrete Dynamic Programming. Wiley-Interscience, Hoboken, NJ, second edition, 2005.\n\n[18] U. G. Rothblum. Multiplicative Markov decision chains. Mathematics of Operations Research, 9(1):6\u201324, 1984.\n\n[19] A. Ruszczy\u0144ski. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125:235\u2013261, 2010.\n\n[20] H. Xu and S. Mannor. The robustness-performance tradeoff in Markov decision processes. In B. Sch\u00f6lkopf, J. Platt, and T. Hoffman, editors, Advances in Neural Information Processing Systems 19, pages 1537\u20131544. MIT Press, Cambridge, MA, 2007.\n\n[21] H. Xu and S. Mannor. Distributionally robust Markov decision processes. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2505\u20132513. MIT Press, Cambridge, MA, 2010.", "award": [], "sourceid": 129, "authors": [{"given_name": "Takayuki", "family_name": "Osogami", "institution": null}]}