{"title": "Beyond Confidence Regions: Tight Bayesian Ambiguity Sets for Robust MDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 7049, "page_last": 7058, "abstract": "Robust MDPs (RMDPs) can be used to compute policies with provable worst-case guarantees in reinforcement learning. The quality and robustness of an RMDP solution are determined by the ambiguity set---the set of plausible transition probabilities---which is usually constructed as a multi-dimensional confidence region. Existing methods construct ambiguity sets as confidence regions using concentration inequalities which leads to overly conservative solutions. This paper proposes a new paradigm that can achieve better solutions with the same robustness guarantees without using confidence regions as ambiguity sets. To incorporate prior knowledge, our algorithms optimize the size and position of ambiguity sets using Bayesian inference. Our theoretical analysis shows the safety of the proposed method, and the empirical results demonstrate its practical promise.", "full_text": "Beyond Con\ufb01dence Regions: Tight Bayesian\n\nAmbiguity Sets for Robust MDPs\n\nReazul Hasan Russel\n\nDepartment of Computer Science\n\nUniversity of New Hampshire\n\nrrussel@cs.unh.edu\n\nMarek Petrik\n\nDepartment of Computer Science\n\nUniversity of New Hampshire\n\nmpetrik@cs.unh.edu\n\nAbstract\n\nRobust MDPs (RMDPs) can be used to compute policies with provable worst-\ncase guarantees in reinforcement learning. The quality and robustness of an\nRMDP solution are determined by the ambiguity set\u2014the set of plausible transition\nprobabilities\u2014which is usually constructed as a multi-dimensional con\ufb01dence\nregion. Existing methods construct ambiguity sets as con\ufb01dence regions using\nconcentration inequalities which leads to overly conservative solutions. 
This paper proposes a new paradigm that can achieve better solutions with the same robustness guarantees without using confidence regions as ambiguity sets. To incorporate prior knowledge, our algorithms optimize the size and position of ambiguity sets using Bayesian inference. Our theoretical analysis shows the safety of the proposed method, and the empirical results demonstrate its practical promise.

1 Introduction

Markov decision processes (MDPs) provide a versatile framework for modeling reinforcement learning problems [4, 33, 38]. However, they assume that transition probabilities and rewards are known exactly, which is rarely the case. Limited data sets, modeling errors, value function approximation, and noisy data are common reasons for errors in transition probabilities [16, 30, 45]. This results in policies that are brittle and fail when implemented. This is particularly true in the case of batch reinforcement learning [18, 20, 23, 32, 42].

A promising framework for computing robust policies is based on Robust MDPs (RMDPs). RMDPs relax the need for precisely known transition probabilities. Instead, transition probabilities can take on any value from a so-called ambiguity set, which represents a set of plausible transition probabilities [9, 14, 24, 29, 32, 40, 46, 47]. RMDPs are also reminiscent of dynamic zero-sum games: the decision maker chooses the best actions, while the adversarial nature chooses the worst transition probabilities from the ambiguity set.

The practical utility of RMDPs has been hindered by the lack of good ways of constructing ambiguity sets that lead to solutions that are robust without being too conservative. The standard approach of constructing ambiguity sets from concentration inequalities [1, 32, 42, 44] leads to theoretical guarantees but provides solutions that are hopelessly conservative.
Many problem-specific methods have been proposed too, but they are hard to use and typically lack finite-sample guarantees [3, 5, 16, 28].

The main contribution of this work is to introduce a new method for constructing ambiguity sets that are significantly less conservative than existing ones [21, 32, 42] and also provide strong finite-sample guarantees. Similarly to some prior work on robust reinforcement learning and optimization, we use Bayesian assumptions to take advantage of domain knowledge, which is often available [7, 8, 13, 47]. Our main innovation is to realize that the natural approach of building ambiguity sets as confidence intervals is unnecessarily conservative. Surprisingly, in the Bayesian setting, using a 95% confidence region for the transition probabilities is unnecessarily conservative for achieving 95% confidence in the robustness of the solution. We also derive new L1 concentration inequalities of possible independent interest.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The remainder of the paper is organized as follows. Section 2 formally describes the framework and goals of the paper. Section 3 describes our main contribution, RSVF, a new method for constructing tight ambiguity sets from Bayesian models that are adapted to the optimal policy. We provide theoretical justification for the robustness of RSVF, but a detailed theoretical analysis of its performance guarantees is beyond the scope of this work. Then, Section 4 overviews related work and outlines methods that build ambiguity sets as frequentist confidence regions or Bayesian credible sets. Finally, Section 5 presents empirical results on several problem domains.

2 Problem Statement: Data-driven RMDPs

This section formalizes our goals and reviews relevant results for robust Markov decision processes (RMDPs).
Throughout the paper, we use the symbol ∆S to denote the probability simplex in R^S_+. The symbols 1 and 0 denote vectors of all ones and zeros, respectively, of an appropriate size. The symbol I represents the identity matrix.

2.1 Safe Return Estimate: VaR

The underlying reinforcement learning problem is a Markov decision process with states S = {1, ..., S} and actions A = {1, ..., A}. The rewards r : S × A → R are known, but the true transition probabilities P* : S × A → ∆S are unknown. The transition probability vector for a state s and an action a is denoted by p*_{s,a}. As this is a batch reinforcement learning setting, a fixed dataset D of transition samples is provided: D = (s_i ∈ S, a_i ∈ A, s'_i ∈ S)_{i=1,...,m}. The only assumption about D is that the state s' in each (s, a, s') ∈ D is distributed according to the true transition probabilities, s' ~ P*(s, a, ·); no assumptions are made on the sampling policy. Note that in the Bayesian approach, P* is a random variable and we assume that a prior distribution is available.

The objective is to maximize the standard γ-discounted infinite-horizon return [33]. Because this paper analyzes the impact of using different transition probabilities, we use a subscript to indicate which ones are used. The optimal value function for some transition probabilities P is, therefore, denoted as v*_P : S → R, and the value function for a deterministic policy π : S → A is denoted as v^π_P. The set of all deterministic stationary policies is denoted by Π.
The total return ρ(π, P) of a policy π under transition probabilities P is:

ρ(π, P) = p_0^T v^π_P,

where p_0 is the initial distribution.

Ideally, we could compute a policy π : S → A that maximizes the return ρ(π, P*), but P* is unknown. Ignoring the uncertainty in P* completely leads to brittle policies. Instead, a common objective in robust reinforcement learning is to maximize a plausible lower bound on the return. Having a safe return estimate is very important since it can inform the stakeholder that the policy may not be good enough when deployed. The objective of computing a policy π that maximizes a high-confidence lower bound on the return can be expressed as [8, 21, 31, 42]:

max_{π∈Π} V@R^δ_{P*}[ρ(π, P*)],   (1)

where V@R^δ is the popular value-at-risk measure at a risk level δ [35]. This objective is also sometimes known as percentile optimization [8]. It is important to note that the risk metric is applied over possible values of the uncertain parameter and not over the distribution of returns. For example, if V@R^{0.05}_{P*}[ρ(π, P*)] = −1, then for 5% of uncertain transition probabilities P*, the return is −1 or smaller.

Because solving the optimization problem in (1) is NP-hard [8], we instead maximize a lower bound ρ̃(π). We call this lower bound a safe return estimate and it is defined as follows.

Definition 2.1 (Safe Return Estimate).
The estimate ρ̃ : Π → R of the return is called safe for a policy π with probability 1 − δ if ρ̃(π) ≤ V@R^δ_{P*}[ρ(π, P*)], or in other words, if it satisfies:

P_{P*}[ρ̃(π) ≤ ρ(π, P*) | D] ≥ 1 − δ.

Recall that under Bayesian assumptions, P* is a random variable and the guarantees are conditional on the dataset D. This is different from the frequentist approach, in which the random variable is D and the guarantees are conditional on P*. The relative merits of Bayesian versus frequentist approaches to robust optimization have been discussed in earlier work [8, 47], but we emphasize that each approach presents a different set of advantages. An insightful discussion of the differences between the two approaches can be found, for example, in Sections 5.2.2 and 6.1.1 of Murphy (2012).

The following example will be used throughout the paper to demonstrate the proposed methods and visualize simple ambiguity sets.

Example 2.1. Consider an MDP with 3 states: s1, s2, s3 and a single action a1. Assume that the true, but unknown, transition probability is P*(s1, a1, ·) = [0.3, 0.2, 0.5]. The known prior distribution over p*_{s1,a1} is Dirichlet with concentration parameters α = (1, 1, 1). The dataset D comprises 3 occurrences of the transition (s1, a1, s1), 2 of the transition (s1, a1, s2), and 5 of the transition (s1, a1, s3). The posterior distribution over p*_{s1,a1} is also Dirichlet, with α = (4, 3, 6). Note that this is a probability distribution over transition probability distributions. Fig.
1 depicts the posterior distribution projected onto the probability simplex along with a 90% confidence region centered on the posterior mean.

2.2 Robust MDPs

Robust Markov Decision Processes (RMDPs) are a convenient and tractable model that generalizes MDPs. We will use RMDPs to maximize a tractable lower bound on the V@R objective in (1) and compute a safe return estimate. Our RMDP model has the same states S, actions A, and rewards r_{s,a} as the MDP. The transition probabilities for each state s and action a, denoted as p_{s,a} ∈ ∆S, are assumed to be chosen adversarially from an ambiguity set P_{s,a}. We use P to refer cumulatively to P_{s,a} for all states s and actions a.

We restrict our attention to sa-rectangular ambiguity sets, which allow the adversarial nature to choose the worst transition probability independently for each state and action [22, 45]. Limitations of rectangular ambiguity sets are well known [12, 25, 43], but they represent a simple, tractable, and practical model. A convenient way of defining ambiguity sets is to use a norm-distance from a given nominal transition probability p̄_{s,a}:

P_{s,a} = { p ∈ ∆S : ||p − p̄_{s,a}||_1 ≤ ψ_{s,a} }   (2)

for a given ψ_{s,a} ≥ 0 and a nominal point p̄_{s,a}. We focus on ambiguity sets defined by the L1 norm because they give rise to RMDPs that can be solved very efficiently [15].

RMDPs have properties that are similar to regular MDPs (see, for example, [2, 19, 22, 28, 45]).
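The efficient solvability referenced above rests on the fact that the adversary's inner problem over an L1 ball, min { p^T v : p ∈ ∆S, ||p − p̄_{s,a}||_1 ≤ ψ_{s,a} }, can be solved by sorting. The sketch below is our own illustration of this idea, not the authors' code; the function name and the example radius ψ = 0.2 are ours, while the nominal point reuses the true probabilities from Example 2.1.

```python
import numpy as np

def worst_case_l1(p_nominal, v, psi):
    """Solve min { p^T v : p in the simplex, ||p - p_nominal||_1 <= psi }.

    Sort-based idea: move up to psi/2 probability mass onto the
    lowest-value successor state, taking it away from the
    highest-value states first.  Runs in O(S log S).
    """
    order = np.argsort(v)                    # successor states, lowest value first
    p = np.array(p_nominal, dtype=float)
    eps = min(psi / 2.0, 1.0 - p[order[0]])  # mass we can add to the worst state
    p[order[0]] += eps
    i = len(order) - 1
    while eps > 1e-12:                       # remove the same mass, highest value first
        k = order[i]
        removed = min(eps, p[k])
        p[k] -= removed
        eps -= removed
        i -= 1
    return p

# Nominal point [0.3, 0.2, 0.5] as in Example 2.1; psi = 0.2 is an illustrative radius
p_worst = worst_case_l1([0.3, 0.2, 0.5], np.array([1.0, 0.0, 0.0]), 0.2)
print(p_worst @ np.array([1.0, 0.0, 0.0]))  # worst-case expectation drops from 0.3 to 0.2
```

The returned vector stays on the simplex and within L1 distance ψ of the nominal point, which is exactly the feasible set in (2).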
The robust Bellman operator T̂_P for an ambiguity set P computes, for a state s, the best action with respect to the worst-case realization of the transition probabilities:

(T̂_P v)(s) := max_{a∈A} min_{p∈P_{s,a}} (r_{s,a} + γ · p^T v).   (3)

The symbol T̂^π_P denotes a robust Bellman update for a given stationary policy π. The optimal robust value function v̂* and the robust value function v̂^π for a policy π are unique and must, similarly to MDPs, satisfy v̂* = T̂_P v̂* and v̂^π = T̂^π_P v̂^π. In general, we use a hat to denote quantities in the RMDP and omit it for the MDP. When the ambiguity set P is not obvious from the context, we use it as a subscript: v̂*_P. The robust return ρ̂ is defined as [16]:

ρ̂(π, P) = min_{P∈P} ρ(π, P) = p_0^T v̂^π_P,

where p_0 ∈ ∆S is the initial distribution. In the remainder of the paper, we describe methods that construct P from D in order to guarantee that ρ̂ is a tight lower bound on the V@R of the returns.

3 Optimized Bayesian Ambiguity Sets

In this section, we describe the new algorithm for constructing Bayesian ambiguity sets that can compute less conservative lower bounds on the return. RSVF (robustification with sensible value functions) is a Bayesian method that uses samples from the posterior distribution over P* to construct tight ambiguity sets.

Figure 1: Contours of the posterior distribution and the 90%-confidence region. Figure 2: Optimal Bayesian ambiguity set (red) for a value function v = (0, 0, 1). Figure 3: Sets K_{s1,a1}(v_i) (dashed red) for i = 1, 2 and L_{s1,a1}({v1, v2}) (black).

Before describing the algorithm, we use the setting of Example 2.1 to motivate our approach.
To minimize distractions by technicalities, assume that the goal is to compute the return for a single time step starting from state s1. Assume also that the value function v = (1, 0, 0) is known, all rewards from s1 are 0, and γ = 1. Recall that our goal is to construct a safe return estimate ρ̃(π) of V@R^{0.1}_{P*}[ρ(π, P*)] at the 90% level. When the value function is known, it is possible to construct the optimal ambiguity set P* such that ρ̂(π) = min_{p∈P*} p^T v = V@R^{0.1}_{P*}[ρ(π, P*)] as:

P* = { p ∈ ∆3 : p^T v ≥ V@R^{0.1}_{P*}[ρ(π, P*)] }.

It can be shown readily that this ambiguity set is optimal in the sense that any set for which ρ̃(π) is exact must be a subset of P* [13]. Fig. 2 depicts the optimal ambiguity set along with the arrow that indicates the direction along which v increases.

The optimal ambiguity set described above cannot be used directly, unfortunately, because the value function is unknown. It would be tempting to construct the ambiguity set as the intersection of optimal sets for all possible value functions; a polyhedral approximation of this set is shown in Fig. 2 using a blue color. Unfortunately, this approach is not (usually) correct and will not lead to a safe return estimate. This can be shown from the fact that support functions of convex sets are convex and V@R is not a convex (concave) function [6, 34]; see Gupta (2015) for a more detailed discussion.

Since it is not possible, in general, to simply consider the intersection of optimal ambiguity sets for all possible value functions, we approximate the optimal ambiguity set for a few reasonable value functions. For this purpose, we use a set K_{s,a}(v) defined as follows:

K_{s,a}(v) = { p ∈ ∆S : p^T v ≤ V@R^ζ_{P*}[(p*_{s,a})^T v] },

where ζ = 1 − δ/(SA). The bottom dashed set in Fig. 3 depicts this set K for v = (0, 0, 1) in Example 2.1. The intuition behind this construction is as follows. If an ambiguity set P_{s,a} intersects K_{s,a}(v̂^π_P) for each state s and action a, then the value function v̂^π_P is safe: min_{p∈P_{s,a}} p^T v ≤ max_{p∈K_{s,a}(v)} p^T v ≤ V@R^ζ_{P*}[(p*_{s,a})^T v] for v = v̂^π_P. See Lemma B.2 for the formal statement.

The set K_{s,a}(v) is sufficient when the value function is known, but we need to generalize the approach to unknown value functions. The set L_{s,a}(V) provides such a guarantee for a set of possible value functions (POV) V. Its center is chosen to minimize its size while intersecting K_{s,a}(v) for each v in V, and it is constructed as follows:

L_{s,a}(V) = { p ∈ ∆S : ||p − θ_{s,a}(V)||_1 ≤ ψ_{s,a}(V) },

where

ψ_{s,a}(V) = min_{p∈∆S} f(p),   θ_{s,a}(V) ∈ arg min_{p∈∆S} f(p),   f(p) = max_{v∈V} min_{q∈K_{s,a}(v)} ||q − p||_1.   (4)

The optimization in (4) can be represented and solved as a linear program. Fig. 3 shows the set L in black solid color. It is the smallest L1-constrained set that intersects the two K sets for value functions v1 = (0, 0, 1) and v2 = (2, 1, 0) in Example 2.1.

We are now ready to describe RSVF, which is outlined in Algorithm 1. RSVF takes an optimistic approach to approximating the optimal ambiguity set. It starts with a small set of potential optimal value functions (POV) and constructs an ambiguity set that is safe for these value functions. It keeps increasing the POV set until v̂* is in the set and the policy is safe.
Algorithm 1: RSVF: Adapted Ambiguity Sets
Input: Confidence 1 − δ and posterior P_{P*}[· | D]
Output: Policy π and lower bound ρ̃(π)
1 k ← 0;
2 Pick some initial value function v̂_0;
3 Initialize POV: V_0 ← ∅;
4 repeat
5   Augment POV: V_{k+1} ← V_k ∪ {v̂_k};
6   For all s, a update P^{k+1}_{s,a} ← L_{s,a}(V_{k+1});
7   Solve v̂_{k+1} ← v̂*_{P^{k+1}} and π̂_{k+1} ← π̂*_{P^{k+1}};
8   k ← k + 1;
9 until safe for all s, a: K_{s,a}(v̂_k) ∩ P^k_{s,a} ≠ ∅;
10 return (π̂_k, p_0^T v̂_k);

To simplify presentation, Algorithm 1 is not guaranteed to terminate in finite time; the actual implementation switches to BCI, described in Section 4.2, after 100 iterations, which guarantees its termination.

The following theorem states that Algorithm 1 produces a safe estimate of the true return.

Theorem 3.1. Suppose that Algorithm 1 terminates with a policy π̂_k and a value function v̂_k in iteration k. Then, the return estimate ρ̃(π̂_k) = p_0^T v̂_k is safe: P_{P*}[p_0^T v̂_k ≤ p_0^T v^{π̂_k}_{P*} | D] ≥ 1 − δ.

Before discussing the proof of Theorem 3.1, it is important to mention its limitations. This result shows only that the return estimate ρ̂ is safe; it does not show that it is good. There are, of course, naive safe estimates such as ρ̃(π) = (1 − γ)^{−1} min_{s,a} r_{s,a}. Since RSVF tightly approximates the optimal ambiguity sets, we expect it to perform significantly better.
The theoretical analysis of the approximation error of ρ̂ is beyond the scope of this work; we present empirical evidence in Section 5 instead.

All proofs can be found in Appendix B. The proof is technical but conceptually simple. It is based on two main properties. The first one is the construction of optimal ambiguity sets for a known value function, as outlined above. The second is the fact that the ambiguity set needs to be robust only with respect to the robust value function v̂ and not the optimal value function v*. This is subtle but crucial, since v̂ is a constant while v* is a random variable in the Bayesian setting. The RSVF approach, therefore, does not work when frequentist guarantees are required. Confidence regions, described in Section 4, are designed for situations when robustness is required with respect to a random variable, and are therefore overly conservative in our setting. See Appendix E for a more in-depth discussion.

4 Ambiguity Sets as Confidence Regions

In this section, we describe the standard approach to constructing ambiguity sets as multidimensional confidence regions and propose its extension to the Bayesian setting. Confidence regions derived from concentration inequalities have been used previously to compute bounds on the true return in off-policy policy evaluation [41, 42]. These methods, unfortunately, do not readily generalize to the policy optimization setting, which we target. Other work has focused on reducing variance rather than on high-probability bounds [18, 23, 26].
Methods for exploration in reinforcement learning, such as MBIE or UCRL2, also construct ambiguity sets using concentration inequalities [10, 17, 37, 39] and compute optimistic (upper) bounds to guide exploration.

4.1 Distribution-free (Frequentist) Confidence Interval

Distribution-free confidence regions are used widely in reinforcement learning to achieve robustness [32, 42] and to guide exploration [36, 39]. The confidence region is constructed around the mean transition probability by combining the Hoeffding inequality with the union bound [32, 44]. We refer to this set as a Hoeffding confidence region and define it as follows for each s and a:

P^H_{s,a} = { p ∈ ∆S : ||p − p̄_{s,a}||_1 ≤ sqrt( (2/n_{s,a}) log(S A 2^S / δ) ) },

where p̄_{s,a} is the mean transition probability computed from D and n_{s,a} is the number of transitions in D originating from state s and action a.

Theorem 4.1. The robust value function v̂_{P^H} for the ambiguity set P^H satisfies:

P_D[v̂^π_{P^H} ≤ v^π_{P*}, ∀π ∈ Π | P*] ≥ 1 − δ.   (5)

In addition, if π̂*_{P^H} is the optimal solution to the RMDP, then p_0^T v̂*_{P^H} is a safe return estimate of π̂*_{P^H}.

To better understand the limitations of using concentration inequalities, we compare with new, and significantly tighter, frequentist ambiguity sets. The size of P^H grows as a square root of the number of states because of the 2^S term. This means that the size of D must scale about quadratically with the number of states to achieve the same confidence.
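To make the effect of the 2^S term concrete, both radii can be computed directly. This is a small sketch of ours, not the paper's code: `hoeffding_psi` implements the P^H radius above, `monotone_psi` implements the tighter radius derived next under restrictive assumptions, and the specific S, A, n, and δ values are purely illustrative.

```python
import math

def hoeffding_psi(n, S, A, delta):
    """L1 radius of the Hoeffding region P^H: sqrt((2/n) log(S A 2^S / delta))."""
    return math.sqrt(2.0 / n * math.log(S * A * 2.0 ** S / delta))

def monotone_psi(n, S, A, delta):
    """Radius of the tighter region P^M: sqrt((2/n) log(S^2 A / delta)),
    valid only under restrictive assumptions."""
    return math.sqrt(2.0 / n * math.log(S ** 2 * A / delta))

# Illustrative values: the 2^S term makes the Hoeffding radius noticeably larger,
# and the gap widens as the number of states S grows.
print(hoeffding_psi(n=100, S=20, A=4, delta=0.05))  # ≈ 0.65
print(monotone_psi(n=100, S=20, A=4, delta=0.05))   # ≈ 0.46
```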
Under some restrictive assumptions, the ambiguity set can be shown to be:

P^M_{s,a} = { p ∈ ∆S : ||p − p̄_{s,a}||_1 ≤ sqrt( (2/n_{s,a}) log(S^2 A / δ) ) }.

This auxiliary result is proved in Appendix C.1. We emphasize that the aim of this bound is to understand the limitations of distribution-free bounds, and we use it even when the necessary assumptions are violated.

4.2 Bayesian Credible Region (BCI)

We now describe how to construct ambiguity sets from Bayesian credible (or confidence) regions. To the best of our knowledge, this approach has not been studied explicitly. The construction starts with a (hierarchical) Bayesian model that can be used to sample from the posterior probability of P* given data D. The implementation of the Bayesian model is irrelevant as long as it generates posterior samples efficiently. For example, one may use a Dirichlet posterior, or use MCMC sampling libraries like JAGS, Stan, or others [11].

The posterior distribution is used to optimize for the smallest ambiguity set around the mean transition probability. Smaller sets, for a fixed nominal point, are likely to result in less conservative robust estimates. The BCI ambiguity set is defined as follows:

P^B_{s,a} = { p ∈ ∆S : ||p − p̄_{s,a}||_1 ≤ ψ^B_{s,a} },   p̄_{s,a} = E_{P*}[p*_{s,a} | D].

There is no closed-form expression for the Bayesian ambiguity set size. It must be computed by solving the following optimization problem for each state s and action a:

ψ^B_{s,a} = min_{ψ∈R_+} { ψ : P[ ||p*_{s,a} − p̄_{s,a}||_1 > ψ | D ] < δ/(SA) }.

The nominal point p̄_{s,a} is fixed (not optimized) to preserve tractability. This optimization problem can be solved by the Sample Average Approximation (SAA) algorithm [35]. Algorithm 2, in the appendix, summarizes the sort-based method. The main idea is to sample from the posterior distribution and then choose the minimal size ψ_{s,a} that satisfies the constraint. We assume that it is possible to draw enough samples from P* that the sampling error becomes negligible. Because the finite-sample analysis of SAA is simple but tedious, we omit it.

Theorem 4.2. The robust value function v̂_{P^B} for the ambiguity set P^B satisfies:

P_{P*}[v̂^π_{P^B} ≤ v^π_{P*}, ∀π ∈ Π | D] ≥ 1 − δ.

In addition, if π̂*_{P^B} is the optimal solution to the RMDP, then p_0^T v̂*_{P^B} is a safe return estimate of π̂*_{P^B}.

The proof is provided in Appendix B. Similar to other results, this theorem only proves that the constructed lower bound on the return is safe. It does not address the tightness of the bound.

Figure 4: Expected regret of safe estimates with 95% confidence regions for the Bellman update with an uninformative prior. Figure 5: Expected regret of safe estimates with 95% confidence regions for the Bellman update with an informative prior.

5 Empirical Evaluation

In this section, we empirically evaluate the safe estimates computed using Hoeffding, BCI, and RSVF ambiguity sets. We start by assuming a true model and generate simulated datasets from it. Each dataset is then used to construct an ambiguity set and a safe estimate of policy return. The performance of the methods is measured using the average of the absolute errors of the estimates compared with the true returns of the optimal policies. All of our experiments use a 95% confidence for the safety of the estimates.

We compare ambiguity sets constructed using BCI and RSVF with the Hoeffding sets.
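The sort-based SAA computation behind the BCI construction of Section 4.2 can be sketched as follows. This is our own minimal version, not the paper's Algorithm 2; the function name is ours, and the Dirichlet(4, 3, 6) posterior reuses Example 2.1 for a single state-action pair.

```python
import numpy as np

def bci_set_size(posterior_samples, delta_sa):
    """SAA estimate of the BCI radius psi^B_{s,a}: the smallest psi such that
    the posterior probability of ||p* - p_bar||_1 > psi is below
    delta_sa = delta / (S * A).  Returns the nominal point and the radius."""
    p_bar = posterior_samples.mean(axis=0)               # posterior mean (nominal point)
    dists = np.sort(np.abs(posterior_samples - p_bar).sum(axis=1))
    k = int(np.ceil((1.0 - delta_sa) * len(dists))) - 1  # (1 - delta_sa) sample quantile
    return p_bar, dists[k]

# Example 2.1 posterior: Dirichlet(4, 3, 6); 1,000 samples as in the experiments
rng = np.random.default_rng(0)
samples = rng.dirichlet([4.0, 3.0, 6.0], size=1000)
p_bar, psi = bci_set_size(samples, delta_sa=0.05)
```

By construction, at least a 1 − δ/(SA) fraction of the sampled transition probabilities falls inside the resulting L1 ball.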
To reduce the conservativeness of Hoeffding sets when transition probabilities are sparse, we use a modification inspired by the Good-Turing bounds [39]: any transition from s, a to s' is treated as impossible if it does not appear in D. We also compare with the "Hoeffding Monotone" formulation P^M even when there is no guarantee that the value function is really monotone. Finally, we compare the results with the "Mean Transition" method, which solves the expected model p̄_{s,a} with no safety guarantees.

Next, in Section 5.1, we compare the methods in a simplified setting in which we consider the problem of estimating the value of a single state from a Bellman update. Then, Section 5.2 evaluates the approach on an MDP with an informative prior.

We do not evaluate the computational complexity of the methods since they target problems constrained by data and not computation. The Bayesian methods are generally more computationally demanding, but the scale depends significantly on the type of prior model used. All Bayesian methods draw 1,000 samples from the posterior for each state and action.

5.1 Bellman Update

In this section, we consider a transition from a single state s0 and action a0 to 5 states s1, ..., s5. The value function for the states s1, ..., s5 is fixed to be [1, 2, 3, 4, 5]. RSVF is run for a single iteration with the given value function. The single iteration of RSVF in this simplistic setting helps to quantify the possible benefit of using RSVF-style methods over BCI. The ground truth is generated from the corresponding prior for each one of the problems.

Uninformative Dirichlet Priors  This setting considers a uniform Dirichlet distribution with α = [1, 1, 1, 1, 1] as the prior. This prior provides little information. Figure 4 compares the computed robust return errors.
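The kind of gap that Figure 4 quantifies can be illustrated with a quick simulation. This is entirely our own sketch with assumed sample sizes and seeds, not the paper's experiment: the frequentist bound uses the standard relaxation min_p p^T v ≥ p̂^T v − (ψ/2)(max v − min v) over the Hoeffding L1 ball, while the Bayesian bound is the δ-quantile of p^T v under the Dirichlet posterior, which is what a single RSVF-style iteration with a known value function computes.

```python
import numpy as np

rng = np.random.default_rng(1)
v = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # fixed value function, as in Section 5.1
alpha = np.ones(5)                        # uniform Dirichlet prior
p_true = rng.dirichlet(alpha)             # ground truth drawn from the prior
n = 40
counts = rng.multinomial(n, p_true)       # simulated dataset D
p_hat = counts / n                        # empirical transition probabilities

# Frequentist bound: worst-case p^T v over the Hoeffding L1 ball
S, A, delta = 5, 1, 0.05
psi = np.sqrt(2.0 / n * np.log(S * A * 2.0 ** S / delta))
hoeffding_bound = p_hat @ v - psi / 2.0 * (v.max() - v.min())

# Bayesian bound: the delta-quantile of p^T v under the Dirichlet posterior
posterior = rng.dirichlet(alpha + counts, size=10_000)
bayes_bound = np.quantile(posterior @ v, delta)

print(hoeffding_bound, bayes_bound)  # the Bayesian bound is markedly tighter
```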
The value ξ represents the regret of predicted returns, which is the absolute difference between the true optimal value and the robust estimate: ξ = |ρ(π*_{P*}, P*) − ρ̃(π̂*)|. Here, ρ̃ is the robust estimate and π̂* is the optimal robust solution. The smaller the value, the tighter and less conservative the safe estimate is. The number of samples is the size of the dataset D. All results are computed by averaging over 200 simulated datasets of the given size generated from the true P*. The results show that BCI improves on both types of Hoeffding bounds, and RSVF further improves on BCI. The mean estimate provides the tightest bounds, but it does not provide any meaningful safety guarantees.

Figure 6: Expected regret of safe estimates with 95% confidence regions for the RiverSwim: an MDP with an uninformative prior. Figure 7: Expected regret of safe estimates with 90% confidence regions for the ExpPopulation: an MDP with an informative prior.

Informative Gaussian Priors  To evaluate the effect of using an informative prior, we use a problem inspired by inventory optimization. The states s1, ..., s5 represent inventory levels. The inventory level corresponds to the state index (1 in the state s1), except that the inventory in the current state s0 is 5. The demand is assumed to be Normally distributed with an unknown mean µ and a known standard deviation σ = 1. The prior over µ is Normal with the mean µ0 = 3 and, therefore, the posterior over µ is also Normal.
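The conjugate posterior update for the demand mean can be written out explicitly. This is a sketch of the standard Normal-Normal update, not code from the paper; the prior standard deviation σ0 = 1 and the demand observations are our assumptions, since the text does not specify them.

```python
def normal_posterior(mu0, sigma0, sigma, demands):
    """Conjugate update for a Normal mean with known observation noise sigma:
    prior mu ~ N(mu0, sigma0^2), observations d_i ~ N(mu, sigma^2)."""
    n = len(demands)
    precision = 1.0 / sigma0 ** 2 + n / sigma ** 2          # posterior precision
    mean = (mu0 / sigma0 ** 2 + sum(demands) / sigma ** 2) / precision
    return mean, precision ** -0.5                          # posterior mean and std

# mu0 = 3 and sigma = 1 as in the experiment; sigma0 and the demands are assumed
mean, std = normal_posterior(mu0=3.0, sigma0=1.0, sigma=1.0, demands=[2.0, 4.0, 3.0])
print(mean, std)  # -> 3.0 0.5
```

Each observation shrinks the posterior standard deviation, so the informative prior keeps tightening as data accumulates.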
The current action assumes that no product is ordered and, therefore, only the demand is subtracted from s0.

5.2 Full MDP

In this section, we evaluate the methods using MDPs with relatively small state spaces. They can be used with certain types of value function approximation, like aggregation [30], but we evaluate them only on tabular problems to prevent approximation errors from skewing the results. To prevent the sampling policy from influencing the results, each dataset D has the same number of samples from each state.

Uninformative Prior  We first use the standard RiverSwim domain for the evaluation [36]. The methods are evaluated identically to the Bellman update above. That is, we generate synthetic datasets from the ground truth and then compare the expected regret of the robust estimate with respect to the true return of the optimal policy for the ground truth. As the prior distribution, we use the uniform Dirichlet distribution over all states. Figure 6 shows the expected robust regret over 100 repetitions. The x-axis represents the number of samples in D for each state. It is apparent that BCI improves only slightly on the Hoeffding sets since the prior is not informative. RSVF, on the other hand, shows a significant improvement over BCI. All robust methods have safety violations of 0%, indicating that even RSVF is unnecessarily conservative here.

Informative Prior  Next, we evaluate RSVF on the MDP model of a simple exponential population model [43]. Robustness plays an important role in ecological models because they are often complex, stochastic, and data collection is expensive. Yet, it is important that the decisions are robust due to their long-term impacts. Figure 7 shows the average regret of safe predictions. BCI can leverage the prior information to compute tighter bounds, but RSVF further improves on BCI.
The rate of safety violations is again 0% for all robust methods.

6 Summary and Conclusion

This paper proposes new Bayesian algorithms for constructing ambiguity sets in RMDPs, improving over standard distribution-free methods. BCI makes it possible to flexibly incorporate prior domain knowledge and is easy to generalize to other shapes of ambiguity sets (like L2) without having to prove new concentration inequalities. Finally, RSVF improves on BCI by constructing tighter ambiguity sets that are not confidence regions. Our experimental results and theoretical analysis indicate that the new ambiguity sets provide much tighter safe return estimates. The only drawbacks of the Bayesian methods are that they require priors and may increase the computational complexity.

Acknowledgments

We would like to thank Vishal Gupta and the anonymous referees for their insightful comments and suggestions. This work was supported by NSF under grant numbers 1815275 and 1717368.

References

[1] Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(1):1563–1600, 2010.

[2] Bagnell, J. A., Ng, A. Y., and Schneider, J. G. Solving Uncertain Markov Decision Processes. Carnegie Mellon Research Showcase, pp. 948–957, 2001.

[3] Ben-Tal, A., El Ghaoui, L., and Nemirovski, A. Robust Optimization. Princeton University Press, 2009.

[4] Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic Programming. Athena Scientific, 1996.

[5] Bertsimas, D., Kallus, N., and Gupta, V. Data-driven robust optimization. Springer Berlin Heidelberg, 2017.

[6] Boyd, S. and Vandenberghe, L. Convex Optimization.
Cambridge University Press, Cambridge, 2004.

[7] Castro, P. S. and Precup, D. Smarter Sampling in Model-Based Bayesian Reinforcement Learning. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD 2010), LNAI 6321, pp. 200–214, 2010.

[8] Delage, E. and Mannor, S. Percentile Optimization for Markov Decision Processes with Parameter Uncertainty. Operations Research, 58(1):203–213, 2010.

[9] Delgado, K. V., De Barros, L. N., Dias, D. B., and Sanner, S. Real-time dynamic programming for Markov decision processes with imprecise probabilities. Artificial Intelligence, 230:192–223, 2016.

[10] Dietterich, T., Taleghan, M., and Crowley, M. PAC optimal planning for invasive species management: Improved exploration for reinforcement learning from simulator-defined MDPs. In Annual Conference of the AAAI, 2013.

[11] Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. Bayesian Data Analysis. Chapman and Hall/CRC, 3rd edition, 2014.

[12] Goyal, V. and Grand-Clement, J. Robust Markov Decision Process: Beyond Rectangularity. Technical report, 2018.

[13] Gupta, V. Near-Optimal Bayesian Ambiguity Sets for Distributionally Robust Optimization. Technical report, 2015.

[14] Hanasusanto, G. and Kuhn, D. Robust Data-Driven Dynamic Programming. In Advances in Neural Information Processing Systems (NIPS), 2013.

[15] Ho, C. P., Petrik, M., and Wiesemann, W. Fast Bellman Updates for Robust MDPs. In International Conference on Machine Learning (ICML), volume 80, pp. 1979–1988, 2018.

[16] Iyengar, G. N. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.

[17] Jaksch, T., Ortner, R., and Auer, P. Near-optimal Regret Bounds for Reinforcement Learning. Journal of Machine Learning Research, 11(1):1563–1600, 2010.

[18] Jiang, N. and Li, L.
Doubly Robust Off-policy Value Evaluation for Reinforcement Learning. In International Conference on Machine Learning (ICML), 2015.

[19] Kalyanasundaram, S., Chong, E. K. P., and Shroff, N. B. Markov decision processes with uncertain transition rates: Sensitivity and robust control. In IEEE Conference on Decision and Control, pp. 3799–3804, 2002.

[20] Lange, S., Gabel, T., and Riedmiller, M. Batch Reinforcement Learning. In Reinforcement Learning, pp. 45–73. Springer, 2012.

[21] Laroche, R. and Trichelair, P. Safe Policy Improvement with Baseline Bootstrapping, 2018.

[22] Le Tallec, Y. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.

[23] Li, L., Munos, R., and Szepesvári, C. Toward Minimax Off-policy Value Estimation. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2015.

[24] Mannor, S., Mebel, O., and Xu, H. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning (ICML), 2012.

[25] Mannor, S., Mebel, O., and Xu, H. Robust MDPs with k-rectangular uncertainty. Mathematics of Operations Research, 41(4):1484–1509, 2016.

[26] Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. G. Safe and Efficient Off-Policy Reinforcement Learning. In Advances in Neural Information Processing Systems (NIPS), 2016.

[27] Murphy, K. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

[28] Nilim, A. and El Ghaoui, L. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, 2005.

[29] Petrik, M. Approximate dynamic programming by minimizing distributionally robust bounds. In International Conference on Machine Learning (ICML), 2012.

[30] Petrik, M. and Subramanian, D. RAAM: The benefits of robustness in approximating aggregated MDPs in reinforcement learning.
In Advances in Neural Information Processing Systems (NIPS), 2014.

[31] Petrik, M., Chow, Y., and Ghavamzadeh, M. Safe Policy Improvement by Minimizing Robust Baseline Regret. In ICML Workshop on Reliable Machine Learning in the Wild, pp. 1–25, 2016.

[32] Petrik, M., Ghavamzadeh, M., and Chow, Y. Safe Policy Improvement by Minimizing Robust Baseline Regret. In Advances in Neural Information Processing Systems (NIPS), 2016.

[33] Puterman, M. L. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 2005.

[34] Shapiro, A., Dentcheva, D., and Ruszczynski, A. Lectures on Stochastic Programming. SIAM, 2009.

[35] Shapiro, A., Dentcheva, D., and Ruszczynski, A. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2nd edition, 2014.

[36] Strehl, A. and Littman, M. An analysis of model-based Interval Estimation for Markov Decision Processes. Journal of Computer and System Sciences, 74:1309–1331, 2008.

[37] Strehl, A. L. Probably Approximately Correct (PAC) Exploration in Reinforcement Learning. PhD thesis, Rutgers University, 2007.

[38] Sutton, R. S. and Barto, A. Reinforcement Learning: An Introduction. MIT Press, 1998.

[39] Taleghan, M. A., Dietterich, T. G., Crowley, M., Hall, K., and Albers, H. J. PAC Optimal MDP Planning with Application to Invasive Species Management. Journal of Machine Learning Research, 16:3877–3903, 2015.

[40] Tamar, A., Mannor, S., and Xu, H. Scaling up Robust MDPs Using Function Approximation. In International Conference on Machine Learning (ICML), 2014.

[41] Thomas, P. S. and Brunskill, E. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning (ICML), 2016.

[42] Thomas, P. S., Theocharous, G., and Ghavamzadeh, M. High Confidence Off-Policy Evaluation. In Annual Conference of the AAAI, 2015.

[43] Tirinzoni, A., Chen, X., and Ziebart, B. D. Policy-Conditioned Uncertainty Sets for Robust Markov Decision Processes.
In Advances in Neural Information Processing Systems (NIPS), 2018.

[44] Weissman, T., Ordentlich, E., Seroussi, G., Verdu, S., and Weinberger, M. J. Inequalities for the L1 deviation of the empirical distribution. Technical report, 2003.

[45] Wiesemann, W., Kuhn, D., and Rustem, B. Robust Markov decision processes. Mathematics of Operations Research, 38(1):153–183, 2013.

[46] Xu, H. and Mannor, S. The robustness-performance tradeoff in Markov decision processes. In Advances in Neural Information Processing Systems (NIPS), 2006.

[47] Xu, H. and Mannor, S. Parametric regret in uncertain Markov decision processes. In IEEE Conference on Decision and Control (CDC), pp. 3606–3613, 2009.