{"title": "Bounding Performance Loss in Approximate MDP Homomorphisms", "book": "Advances in Neural Information Processing Systems", "page_first": 1649, "page_last": 1656, "abstract": "We define a metric for measuring behavior similarity between states in a Markov decision process (MDP), in which action similarity is taken into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran \\& Barto, 2003). We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than that provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.", "full_text": "Bounding Performance Loss in Approximate MDP Homomorphisms\n\nJonathan J. Taylor\nDept. of Computer Science\nUniversity of Toronto\nToronto, Canada, M5S 3G4\njonathan.taylor@utoronto.ca\n\nDoina Precup\nSchool of Computer Science\nMcGill University\nMontreal, Canada, H3A 2A7\ndprecup@cs.mcgill.ca\n\nPrakash Panangaden\nSchool of Computer Science\nMcGill University\nMontreal, Canada, H3A 2A7\nprakash@cs.mcgill.ca\n\nAbstract\n\nWe define a metric for measuring behavior similarity between states in a Markov decision process (MDP), which takes action similarity into account. We show that the kernel of our metric corresponds exactly to the classes of states defined by MDP homomorphisms (Ravindran & Barto, 2003). 
We prove that the difference in the optimal value function of different states can be upper-bounded by the value of this metric, and that the bound is tighter than previous bounds provided by bisimulation metrics (Ferns et al. 2004, 2005). Our results hold both for discrete and for continuous actions. We provide an algorithm for constructing approximate homomorphisms, by using this metric to identify states that can be grouped together, as well as actions that can be matched. Previous research on this topic is based mainly on heuristics.\n\n1 Introduction\nMarkov Decision Processes (MDPs) are a very popular formalism for decision making under uncertainty (Puterman, 1994). A significant problem is computing the optimal strategy when the state and action space are very large and/or continuous. A popular approach is state abstraction, in which states are grouped together in partitions, or aggregates, and the optimal policy is computed over these. Li et al. (2006) provide a nice comparative survey of approaches to state abstraction. The work we present in this paper bridges two such methods: bisimulation-based approaches and methods based on MDP homomorphisms.\n\nBisimulation is a well-known, well-studied notion of behavioral equivalence between systems (Larsen & Skou, 1991; Milner, 1995) which has been specialized for MDPs by Givan et al. (2003). In recent work, Ferns et al. (2004, 2005, 2006) introduced (pseudo)metrics for measuring the similarity of states, which provide approximations to bisimulation. One of the disadvantages of bisimulation and the corresponding metrics is that they require that the behavior matches for exactly the same actions. However, in many cases of practical interest, actions with the exact same label may not match, but the environment may contain symmetries and other types of special structure, which may allow correspondences between states by matching their behavior with different actions. 
This idea was formalized by Ravindran & Barto (2003) with the concept of MDP homomorphisms. MDP homomorphisms specify a map matching equivalent states as well as equivalent actions in such states. This matching can then be used to transfer policies between different MDPs. However, like any equivalence relations in probabilistic systems, MDP homomorphisms are brittle: a small change in the transition probabilities or the rewards can cause two previously equivalent state-action pairs to become distinct. This implies that such approaches do not work well in situations in which the model of the system is estimated from data. As a solution to this problem, Ravindran & Barto (2004) proposed using approximate homomorphisms, which allow aggregating states that are not exactly equivalent. They define an MDP over these partitions and quantify the approximate loss resulting from using this MDP, compared to the original system. As expected, the bound depends on the quality of the partition. Subsequent work (e.g. Wolfe & Barto, 2006) constructs such partitions heuristically.\n\nIn this paper, we attempt to construct provably good, approximate MDP homomorphisms from first principles. First, we relate the notion of MDP homomorphisms to the concept of lax bisimulation, explored recently in the process algebra literature (Arun-Kumar, 2006). This allows us to define a metric on states, similarly to existing bisimulation metrics. Interestingly, this approach works both for discrete and for continuous actions. We show that the difference in the optimal value function of two states is bounded above by this metric. This allows us to provide a state aggregation algorithm with provable approximation guarantees. 
We illustrate empirically the fact that this approach can provide much better state space compression than the use of existing bisimulation metrics.\n\n2 Background\nA finite Markov decision process (MDP) is a tuple ⟨S, A, P, R⟩, where S is a finite set of states, A is a set of actions, P : S × A × S → [0,1] is the transition model, with P(s,a,s′) denoting the probability of transition from state s to s′ under action a, and R : S × A → R is the reward function, with R(s,a) being the reward for performing action a in state s. For the purpose of this paper, the state space S is assumed to be finite, but the action set A could be finite or infinite (as will be detailed later). We assume without loss of generality that rewards are bounded in [0,1].\nA deterministic policy π : S → A specifies which action should be taken in every state. By following policy π from state s, an agent can expect a value of V^π(s) = E( Σ_{t=1}^∞ γ^{t−1} r_t | s_0 = s, π ), where γ ∈ (0,1) is a discount factor and r_t is the sample reward received at time t. In a finite MDP, the optimal value function V* is unique and satisfies the following formulas, known as the Bellman optimality equations:\n\nV*(s) = max_{a∈A} ( R(s,a) + γ Σ_{s′} P(s,a,s′) V*(s′) ), ∀s ∈ S\n\nIf the action space is continuous, we will assume that it is compact, so the max can be taken and the above results still hold (Puterman, 1994). Given the optimal value function, an optimal policy is easily inferred by simply taking at every state the greedy action with respect to the one-step-lookahead value. 
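As a concrete illustration, the Bellman optimality equation can be iterated to convergence; a minimal sketch (the array shapes, names, and toy MDP are our own, not from the paper):

```python
import numpy as np

def value_iteration(P, R, gamma, iters=500):
    """Iterate the Bellman optimality update
    V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s,a,s') V(s') ].
    P has shape (S, A, S); R has shape (S, A); 0 < gamma < 1."""
    V = np.zeros(R.shape[0])
    for _ in range(iters):
        V = (R + gamma * (P @ V)).max(axis=1)  # P @ V contracts over s'
    return V

# Toy two-state MDP: state 0 pays reward 1 and stays put;
# state 1 pays 0 and moves to state 0.
P = np.array([[[1.0, 0.0]], [[1.0, 0.0]]])   # shape (2, 1, 2)
R = np.array([[1.0], [0.0]])                 # shape (2, 1)
V = value_iteration(P, R, gamma=0.5)          # V*(0) = 2, V*(1) = 1
```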
It is well known that the optimal value function can be computed by turning the above equation into an update rule, which can be applied iteratively.\n\nIdeally, if the state space is very large, “similar” states should be grouped together in order to speed up this type of computation. Bisimulation for MDPs (Givan et al., 2003) is a notion of behavioral equivalence between states. A relation E ⊆ S × S is a bisimulation relation if:\n\nsEu ⇔ ∀a. ( R(s,a) = R(u,a) and ∀X ∈ S/E. Pr(X|s,a) = Pr(X|u,a) )\n\nwhere S/E denotes the partition of S into E-equivalent subsets of states. The relation ∼ is the union of all bisimulation relations, and two states s and u in an MDP are said to be bisimilar if s ∼ u. From this definition, it follows that bisimilar states can match each others’ actions to achieve the same returns. Hence, bisimilar states have the same optimal value (Givan et al., 2003). However, bisimulation is not robust to small changes in the rewards or the transition probabilities.\n\nOne way to avoid this problem is to quantify the similarity between states using a (pseudo)metric. Ferns et al. (2004) proposed a bisimulation metric, defined as the least fixed point of the following operator on the lattice of 1-bounded metrics d : S × S → [0,1]:\n\nG(d)(s,u) = max_a ( c_r |R(s,a) − R(u,a)| + c_p K(d)(P(s,a,·), P(u,a,·)) )   (1)\n\nThe first term above measures reward similarity. The second term is the Kantorovich metric between the probability distributions of the two states. 
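The Kantorovich metric is the optimum of a small transportation linear program, so it can be evaluated with an off-the-shelf LP solver; a sketch using scipy's `linprog` (the setup and names are ours, and we assume scipy is available):

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    """K(d)(P, Q): minimum cost of transporting distribution p into q,
    where moving mass from state k to state j costs d[k, j].
    Solves: min sum_kj l_kj * d[k, j]
            s.t. sum_j l_kj = p[k], sum_k l_kj = q[j], l_kj >= 0."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for k in range(n):
        A_eq[k, k * n:(k + 1) * n] = 1.0  # row sums match p
    for j in range(n):
        A_eq[n + j, j::n] = 1.0           # column sums match q
    res = linprog(d.reshape(-1), A_eq=A_eq,
                  b_eq=np.concatenate([p, q]), bounds=(0, None))
    return res.fun

p = np.array([0.5, 0.5])
q = np.array([0.0, 1.0])
d = np.array([[0.0, 1.0], [1.0, 0.0]])
dist = kantorovich(p, q, d)  # moves 0.5 mass at cost 1 each -> 0.5
```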
Given probability distributions P and Q over the state space S, and a semimetric d on S, the Kantorovich metric K(d)(P,Q) is defined by the following linear program:\n\nmax_{v_i} Σ_{i=1}^{|S|} (P(s_i) − Q(s_i)) v_i subject to: ∀i, j. v_i − v_j ≤ d(s_i, s_j) and ∀i. 0 ≤ v_i ≤ 1\n\nwhich has the following equivalent dual program:\n\nmin_{λ_{kj}} Σ_{k,j=1}^{|S|} λ_{kj} d(s_k, s_j) subject to: ∀k. Σ_j λ_{kj} = P(s_k), ∀j. Σ_k λ_{kj} = Q(s_j) and ∀k, j. λ_{kj} ≥ 0\n\nFerns et al. (2004) showed that by applying (1) iteratively, the least fixed point e_fix can be obtained, and that s and u are bisimilar if and only if e_fix(s,u) = 0. In other words, bisimulation is the kernel of this metric.\n\n3 Lax bisimulation\nIn many cases of practical interest, actions with exactly the same label may not match, but the environment may contain symmetries and other types of special structure, which may allow correspondences between different actions at certain states. For example, consider the environment in Figure 1. Because of symmetry, going south in state N6 is “equivalent” to going north in state S6. However, no two states are bisimilar. Recent work in process algebra has rethought the definition of bisimulation to allow certain distinct actions to be essentially equivalent (Arun-Kumar, 2006). Here, we define lax bisimulation in the context of MDPs.\nDefinition 1. A relation B is a lax (probabilistic) bisimulation relation if whenever sBu we have that: ∀a ∃b such that R(s,a) = R(u,b) and for all B-closed sets X we have that Pr(X|s,a) = Pr(X|u,b), and vice versa. The lax bisimulation ∼ is the union of all the lax bisimulation relations.\n\nIt is easy to see that B is an equivalence relation and we denote the equivalence classes of S by S/B. 
Note that the definition above assumes that any action can be matched by any other action. However, the set of actions that can be used to match another action can be restricted based on prior knowledge.\n\nLax bisimulation is very closely related to the idea of MDP homomorphisms (Ravindran & Barto, 2003). We now formally establish this connection.\nDefinition 2. (Ravindran & Barto, 2003) An MDP homomorphism h from M = ⟨S,A,P,R⟩ to M′ = ⟨S′,A′,P′,R′⟩ is a tuple of surjections ⟨f, {g_s : s ∈ S}⟩ with h(s,a) = (f(s), g_s(a)), where f : S → S′ and g_s : A → A′, such that R(s,a) = R′(f(s), g_s(a)) and P(s,a, f^{−1}(f(s′))) = P′(f(s), g_s(a), f(s′)).\n\nHence, a homomorphism puts in correspondence states, and has a state-dependent mapping between actions as well. We now show that homomorphisms are identical to lax probabilistic bisimulation.\nTheorem 3. Two states s and u are lax bisimilar if and only if they are related by some MDP homomorphism ⟨f, {g_s : s ∈ S}⟩ in the sense that f(s) = f(u).\n\nProof: For the first direction, let h be an MDP homomorphism and define the relation B such that sBu iff f(s) = f(u). Fix any a ∈ A. Since g_u is a surjection onto A′, there must be some b ∈ A with g_u(b) = g_s(a). Hence,\n\nR(s,a) = R′(f(s), g_s(a)) = R′(f(u), g_u(b)) = R(u,b)\n\nLet X be a non-empty B-closed set, so that X = f^{−1}(f(s′)) for some s′. Then:\n\nP(s,a,X) = P′(f(s), g_s(a), f(s′)) = P′(f(u), g_u(b), f(s′)) = P(u,b,X)\n\nso B is a lax bisimulation relation.\n\nFor the other direction, let B be a lax bisimulation relation. We will construct an MDP homomorphism in which sBu ⟹ f(s) = f(u). Consider the partition S/B induced by the equivalence relation B on the set S. For each equivalence class X ∈ S/B, we choose a representative state s_X ∈ X and define f(s_X) = s_X and g_{s_X}(a) = a, ∀a ∈ A. 
Then, for any s ∼ s_X, we define f(s) = s_X. From Definition 1, we have that ∀a ∃b s.t. Pr(X′|s,a) = Pr(X′|s_X,b), ∀X′ ∈ S/B. Hence, we set g_s(a) = b. Then, we have:\n\nP′(f(s), g_s(a), f(s′)) = P′(f(s_X), b, f(s′)) = P(s_X, b, f^{−1}(f(s′))) = P(s,a, f^{−1}(f(s′)))\n\nAlso, R′(f(s), g_s(a)) = R′(f(s_X), b) = R(s_X, b) = R(s,a). Hence, we constructed a homomorphism. ⋄\n\n4 A metric for lax bisimulation\nWe will now define a lax bisimulation metric for measuring similarity between state-action pairs, following the approach used by Ferns et al. (2004) for defining the bisimulation metric between states. We want to say that states s and u are close exactly when every action of one state is close to some action available in the other state. In order to capture this meaning, we first define similarity between state-action pairs, then we lift this to states using the Hausdorff metric (Munkres, 1999).\n\nDefinition 4. Let c_r, c_p ≥ 0 be constants with c_r + c_p ≤ 1. Given a 1-bounded semimetric d on S, the metric δ(d) : (S × A) × (S × A) → [0,1] is defined as follows:\n\nδ(d)((s,a), (u,b)) = c_r |R(s,a) − R(u,b)| + c_p K(d)(P(s,a,·), P(u,b,·))\n\nWe now have to measure the distance between the set of actions at state s and the set of actions at state u. Given a metric between pairs of points, the Hausdorff metric can be used to measure the distance between sets of points. It is defined as follows.\nDefinition 5. Given a 1-bounded metric space (M, d), let P(M) be the set of compact subsets of M (e.g. closed and bounded subsets of R). The Hausdorff metric H(d) : P(M) × P(M) → [0,1] is defined as:\n\nH(d)(X,Y) = max( sup_{x∈X} inf_{y∈Y} d(x,y), sup_{y∈Y} inf_{x∈X} d(x,y) )\n\nDefinition 6. Denote X_s = {(s,a) | a ∈ A}. 
Let M be the set of all semimetrics on S. We define the operator F : M → M as F(d)(s,u) = H(δ(d))(X_s, X_u).\nWe note that the same definition can be applied both for discrete and for compact continuous action spaces. If the action set is compact then X_s = {s} × A is also compact, so the Hausdorff metric is still well defined. For simplicity, we consider the discrete case, so that max and min are defined.\nTheorem 7. F is monotonic and has a least fixed point d_fix in which d_fix(s,u) = 0 iff s ∼ u.\n\nThe proof is similar in flavor to (Ferns et al., 2004) and we omit it for lack of space.\nAs both e_fix and d_fix quantify the difference in behaviour between states, it is not surprising to see that they constrain the difference in optimal value. Indeed, the bound below has previously been shown in (Ferns et al., 2004) for e_fix, but we also show that our metric d_fix is tighter.\nTheorem 8. Let e_fix be the metric defined in (Ferns et al., 2004). Then we have:\n\nc_r |V*(s) − V*(u)| ≤ d_fix(s,u) ≤ e_fix(s,u)\n\nProof: We show via induction on n that for the sequence of iterates V_n encountered during value iteration, c_r |V_n(s) − V_n(u)| ≤ d_n(s,u) ≤ e_n(s,u), and then the result follows by merely taking limits.\nFor the base case, note that c_r |V_0(s) − V_0(u)| = d_0(s,u) = e_0(s,u) = 0.\nAssume this holds for n. By the monotonicity of F, we have that F(d_n)(s,u) ≤ F(e_n)(s,u). Now, for any a, δ(e_n)((s,a), (u,a)) ≤ G(e_n)(s,u), which implies:\n\nF(e_n)(s,u) ≤ max( max_a δ(e_n)((s,a), (u,a)), max_b δ(e_n)((s,b), (u,b)) ) ≤ max( G(e_n)(s,u), G(e_n)(s,u) ) = G(e_n)(s,u)\n\nso d_{n+1} ≤ e_{n+1}. Without loss of generality, assume that V_{n+1}(s) > V_{n+1}(u). Then, letting a′ and b′ denote the maximizing actions at s and u respectively:\n\nc_r |V_{n+1}(s) − V_{n+1}(u)| = c_r | max_a ( R(s,a) + γ Σ_{s′} P(s,a,s′) V_n(s′) ) − max_b ( R(u,b) + γ Σ_{s′} P(u,b,s′) V_n(s′) ) |\n= c_r | ( R(s,a′) + γ Σ_{s′} P(s,a′,s′) V_n(s′) ) − ( R(u,b′) + γ Σ_{s′} P(u,b′,s′) V_n(s′) ) |\n= c_r min_b | ( R(s,a′) + γ Σ_{s′} P(s,a′,s′) V_n(s′) ) − ( R(u,b) + γ Σ_{s′} P(u,b,s′) V_n(s′) ) |\n≤ c_r max_a min_b | ( R(s,a) + γ Σ_{s′} P(s,a,s′) V_n(s′) ) − ( R(u,b) + γ Σ_{s′} P(u,b,s′) V_n(s′) ) |\n≤ max_a min_b ( c_r |R(s,a) − R(u,b)| + c_p | Σ_{s′} (P(s,a,s′) − P(u,b,s′)) (c_r γ / c_p) V_n(s′) | )\n\nNow since γ ≤ c_p, we have 0 ≤ (c_r γ / c_p) V_n(s′) ≤ (1 − c_p) γ / (c_p (1 − γ)) ≤ 1, and by the induction hypothesis\n\n(c_r γ / c_p) V_n(s) − (c_r γ / c_p) V_n(u) ≤ c_r |V_n(s) − V_n(u)| ≤ d_n(s,u)\n\nSo { (c_r γ / c_p) V_n(s′) : s′ ∈ S } is a feasible solution to the LP for K(d_n)(P(s,a,·), P(u,b,·)). We then continue the inequality: c_r |V_{n+1}(s) − V_{n+1}(u)| ≤ max_a min_b ( c_r |R(s,a) − R(u,b)| + c_p K(d_n)(P(s,a,·), P(u,b,·)) ) = F(d_n)(s,u) = d_{n+1}(s,u). ⋄\n\n5 State aggregation\nWe now show how we can use this notion of lax bisimulation metrics to construct approximate MDP homomorphisms. First, if we have an MDP homomorphism, we can use it to provide a state space aggregation, as follows.\nDefinition 9. 
Given an MDP M and a homomorphism, an aggregated MDP M′ is given by (S′, A, {P(C,a,D) : a ∈ A; C,D ∈ S′}, {R(C,a) : a ∈ A, C ∈ S′}, ρ, {g_s : s ∈ S}), where S′ is a partition of S, ρ : S → S′ maps states to their aggregates, each g_s : A → A relabels the action set, and we have that ∀C,D ∈ S′ and a ∈ A,\n\nP(C,a,D) = (1/|C|) Σ_{s∈C} P(s, g_s(a), D) and R(C,a) = (1/|C|) Σ_{s∈C} R(s, g_s(a))\n\nNote that all the states in a partition have actions that are relabelled specifically so they can exactly match each other’s behaviour. Thus, a policy in the aggregate MDP can be lifted to the original MDP by using this relabeling.\nDefinition 10. If M′ is an aggregation of MDP M and π′ is a policy in M′, then the lifted policy is defined by π(s) = g_s(π′(ρ(s))).\n\nUsing a lax bisimulation metric, it is possible to choose appropriate re-labelings so that states within a partition can approximately match each other’s actions.\nDefinition 11. Given a lax bisimulation metric d and an MDP M, we say that an aggregated MDP M′ is d-consistent if each aggregated class C has a state s ∈ C, called the representative of C, such that:\n\n∀u ∈ C, ∀a ∈ A, δ(d)((s, g_s(a)), (u, g_u(a))) ≤ F(d)(s,u)\n\nWhen the re-labelings are chosen in this way, we can solve for the optimal value function of the aggregated MDP and be assured that for each state, its true optimal value is close to the optimal value of the partition in which it is contained.\nTheorem 12. If M′ is a d_ζ-consistent aggregation of an MDP M and n ≤ ζ, then ∀s ∈ S we have:\n\nc_r |V_n(ρ(s)) − V_n(s)| ≤ m(ρ(s)) + M Σ_{k=1}^{n−1} γ^{n−k}\n\nwhere m(C) = 2 max_{u∈C} d_ζ(s′,u), s′ denotes the representative state of C and M = max_C m(C). 
Furthermore, if π′ is a policy in M′ and π is the corresponding lifted policy in M, then:\n\nc_r |V^{π′}_n(ρ(s)) − V^π_n(s)| ≤ m(ρ(s)) + M Σ_{k=1}^{n−1} γ^{n−k}\n\nProof: We proceed by induction on n; the base case n = 0 is immediate since V_0 ≡ 0. For the inductive step, expanding the definitions of the aggregated reward and transition model:\n\n|V_{n+1}(ρ(s)) − V_{n+1}(s)| = | max_a ( R(ρ(s),a) + γ Σ_{D∈S′} P(ρ(s),a,D) V_n(D) ) − max_a ( R(s,g_s(a)) + γ Σ_{s′} P(s,g_s(a),s′) V_n(s′) ) |\n≤ (1/|ρ(s)|) Σ_{u∈ρ(s)} max_a ( |R(u,g_u(a)) − R(s,g_s(a))| + γ | Σ_{s′} P(u,g_u(a),s′) V_n(ρ(s′)) − Σ_{s′} P(s,g_s(a),s′) V_n(s′) | )\n≤ (1/|ρ(s)|) Σ_{u∈ρ(s)} max_a ( |R(u,g_u(a)) − R(s,g_s(a))| + γ | Σ_{s′} (P(u,g_u(a),s′) − P(s,g_s(a),s′)) V_n(s′) | + γ Σ_{s′} P(u,g_u(a),s′) |V_n(ρ(s′)) − V_n(s′)| )\n≤ (1/(c_r |ρ(s)|)) Σ_{u∈ρ(s)} max_a ( c_r |R(u,g_u(a)) − R(s,g_s(a))| + c_p | Σ_{s′} (P(u,g_u(a),s′) − P(s,g_s(a),s′)) (c_r γ / c_p) V_n(s′) | ) + γ max_{s′′} |V_n(ρ(s′′)) − V_n(s′′)|\n\nFrom the proof of Theorem 8, we know that { (c_r γ / c_p) V_n(s′) : s′ ∈ S } is a feasible solution to the primal LP for K(d_n)(P(s,g_s(a),·), P(u,g_u(a),·)). Let z be the representative used for ρ(s). Then we can continue as follows:\n\nc_r |R(s,g_s(a)) − R(u,g_u(a))| + c_p K(d_n)(P(s,g_s(a),·), P(u,g_u(a),·))\n≤ c_r |R(s,g_s(a)) − R(u,g_u(a))| + c_p K(d_ζ)(P(s,g_s(a),·), P(u,g_u(a),·))\n≤ c_r |R(s,g_s(a)) − R(z,g_z(a))| + c_p K(d_ζ)(P(s,g_s(a),·), P(z,g_z(a),·)) + c_r |R(z,g_z(a)) − R(u,g_u(a))| + c_p K(d_ζ)(P(z,g_z(a),·), P(u,g_u(a),·)) ≤ d_ζ(s,z) + d_ζ(z,u) ≤ m(ρ(s))\n\nCombining this with the original inequality and applying the induction hypothesis:\n\nc_r |V_{n+1}(ρ(s)) − V_{n+1}(s)| ≤ m(ρ(s)) + γ max_{s′} c_r |V_n(ρ(s′)) − V_n(s′)| ≤ m(ρ(s)) + γ ( M + M Σ_{k=1}^{n−1} γ^{n−k} ) = m(ρ(s)) + M Σ_{k=1}^{n} γ^{(n+1)−k}\n\nThe second claim is proved nearly identically, except that instead of maximizing over actions, the action selected by the policy, a = π′(ρ(s)), and the lifted policy, g_s(a) = π(s), are used. ⋄\nBy taking limits we get the following theorem:\nTheorem 13. If M′ is a d_fix-consistent aggregation of an MDP M, then ∀s ∈ S we have:\n\nc_r |V*(ρ(s)) − V*(s)| ≤ m(ρ(s)) + γ M / (1 − γ)\n\nFurthermore, if π′ is any policy in M′ and π is the lifted policy to M, then\n\nc_r |V^{π′}(ρ(s)) − V^π(s)| ≤ m(ρ(s)) + γ M / (1 − γ)\n\nwhere m(C) = 2 max_{u∈C} d_fix(s′,u), s′ is the representative state of C and M = max_C m(C).\nOne appropriate way to aggregate states is to choose some desired error bound ε > 0 and ensure that the states in each partition are within an ε-ball. 
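This ε-ball construction can be sketched as follows (a sketch only; `dist` is a precomputed table of metric values and all names are ours):

```python
import random

def epsilon_partition(states, dist, eps, seed=0):
    """Repeatedly pick an unassigned state at random as a representative
    and place every unassigned state within eps of it into that class."""
    rng = random.Random(seed)
    remaining = list(states)
    classes = []
    while remaining:
        rep = remaining[rng.randrange(len(remaining))]
        cls = [s for s in remaining if dist[rep][s] <= eps]
        classes.append(cls)
        remaining = [s for s in remaining if s not in cls]
    return classes

# Toy distance table: three states, every distinct pair at distance 1.
dist = {i: {j: 0.0 if i == j else 1.0 for j in range(3)} for i in range(3)}
coarse = epsilon_partition(range(3), dist, eps=2.0)   # one class
fine = epsilon_partition(range(3), dist, eps=0.5)     # three singletons
```

With `eps = 0` and `dist` set to the fixed-point metric, this would recover the exact equivalence classes, since the kernel of a pseudometric is transitive.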
A simple way to do this is to pick states at random and add to a partition each state within the ε-ball. Of course, better clustering heuristics can be used here as well.\n\nIt has been noted that when the above condition holds, then under the unlaxed bisimulation metric e_fix, we can be assured that for each state s, |V*(ρ(s)) − V*(s)| is bounded by 2ε/(c_r(1−γ)). The theorem above shows that under the lax bisimulation metric d_fix this difference is instead bounded by 4ε/(c_r(1−γ)). However, as we illustrate in the next section, a massive reduction in the size of the state space can be achieved by moving from e_fix to d_fix, even when using ε′ = ε/2.\n\nFor large systems, it might not be feasible to compute the metric e_fix in the original MDP. In this case, we might want to use some sort of heuristic or prior knowledge to create an aggregation. Ravindran & Barto (2003) provided, based on a result from Whitt (1978), a bound on the difference in values between the optimal policy in the aggregated MDP and the lifted policy in the original MDP. We now show that our metric can be used to tighten this bound.\nTheorem 14. If M′ is an aggregation of an MDP M, π′ is an optimal policy in M′, π is the policy lifted from π′ to M, and d′_fix corresponds to our metric computed on M′, then\n\n|V^π(s) − V^{π′}(ρ(s))| ≤ (2/(1−γ)) ( max_{s,a} |R(s,g_s(a)) − R(ρ(s),a)| + (1/c_r) max_{s,a} K(d′_fix)(P(s,g_s(a),·), P(ρ(s),a,·)) )\n\n[Figure 1: Example environment exhibiting symmetries (left): a cross-shaped grid with arms N1–N6, S1–S6, E1–E6, W1–W6 around a center state C. Aggregation performance (right): number of lumped states (0–30) plotted against epsilon (0.0–1.0) for the laxed and unlaxed metrics.]\n\nProof: We have:\n\n|V^π(s) − V^{π′}(ρ(s))| ≤ (2/(1−γ)) max_{s,a} | R(s,g_s(a)) − R(ρ(s),a) + γ Σ_C (P(s,g_s(a),C) − P(ρ(s),a,C)) V^{π′}(C) |\n≤ (2/(1−γ)) ( max_{s,a} |R(s,g_s(a)) − R(ρ(s),a)| + γ max_{s,a} | Σ_C (P(s,g_s(a),C) − P(ρ(s),a,C)) V^{π′}(C) | )\n≤ (2/(1−γ)) ( max_{s,a} |R(s,g_s(a)) − R(ρ(s),a)| + (1/c_r) max_{s,a} K(d′_fix)(P(s,g_s(a),·), P(ρ(s),a,·)) )\n\nThe first inequality originally comes from (Whitt, 1978) and is applied to MDPs in (Ravindran & Barto, 2003). The last inequality holds since π′ is an optimal policy and thus by Theorem 8 we know that { (c_r γ / c_p) V^{π′}(C) : C ∈ S′ } is a feasible solution to the LP for K(d′_fix). ⋄\nAs a corollary, we can get the same bound as in (Ravindran & Barto, 2003) by bounding the Kantorovich metric by the total variation metric.\nDefinition 15. 
Given two finite distributions P and Q, the total variation metric TV(P,Q) is defined as: TV(P,Q) = (1/2) Σ_s |P(s) − Q(s)|.\nCorollary 16. Let Δ = max_{C,a} R(C,a) − min_{C,a} R(C,a) be the maximum difference in rewards in the aggregated MDP. Then:\n\n|V^π(s) − V^{π′}(ρ(s))| ≤ (2/(1−γ)) ( max_{s,a} |R(s,g_s(a)) − R(ρ(s),a)| + (Δ/(1−γ)) · TV(P(s,g_s(a),·), P(ρ(s),a,·)) )\n\nProof: This follows from the fact that:\n\nd′_fix(C,D) ≤ c_r Δ + c_p max_{C,D} d′_fix(C,D), so that max_{C,D} d′_fix(C,D) ≤ c_r Δ / (1 − c_p) ≤ c_r Δ / (1 − γ)\n\nand, using the total variation as an approximation (Gibbs & Su, 2002), we have:\n\nK(d′_fix)(P(s,g_s(a),·), P(ρ(s),a,·)) ≤ max_{C,D} d′_fix(C,D) · TV(P(s,g_s(a),·), P(ρ(s),a,·)) ⋄\n\n6 Illustration\n\nConsider the cross-shaped MDP displayed in Figure 1. There is a reward of 1 in the center and the probability of the agent moving in the intended direction is 0.8. For a given ε, we used the random partitioning algorithm outlined earlier to create a state aggregation. The graph plots the size of the aggregated MDPs obtained against ε, using the lax and the non-lax bisimulation metrics. In the case of the lax metric, we used ε′ = ε/2 to compensate for the factor of 2 difference in the error bound. It is very revealing that the number of partitions drops very quickly and levels at around 6 or 7 for our algorithm. This is because the MDP is collapsing to a state space close to the natural choice of {{C}} ∪ {{Ni, Si, Wi, Ei} : i ∈ {1,2,3,4,5,6}}. 
Under the unlaxed metric, this is not likely to occur, and thus the first states to be partitioned together are the ones neighbouring each other (which can actually have quite different behaviours).\n\n7 Discussion and future work\nWe defined a metric for measuring the similarity of state-action pairs in a Markov Decision Process and used it in an algorithm for constructing approximate MDP homomorphisms. Our approach works significantly better than the bisimulation metrics of Ferns et al., as it allows capturing different regularities in the environment. The theoretical bound on the error in the value function presented in (Ravindran & Barto, 2004) can be derived using our metric.\n\nAlthough the metric is potentially expensive to compute, there are domains in which having an accurate aggregation is worth it. For example, in mobile device applications, one may have large computational resources initially to build an aggregation, but may then insist on a very coarse, good aggregation, to fit on a small device. The metric can also be used to find subtasks in a larger problem that can be solved using controllers from a pre-supplied library. For example, if a controller is available to navigate single rooms, the metric might be used to lump states in a building schematic into “rooms”. The aggregate MDP can then be used to solve the high-level navigational task, using the controller to navigate specific rooms.\n\nAn important avenue for future work is reducing the computational complexity of this approach. Two sources of complexity are the quadratic dependence on the number of actions and the evaluation of the Kantorovich metric. The first issue can be addressed by sampling pairs of actions, rather than considering all possibilities. 
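Both costs become visible when one iteration of the metric operator F is written out directly: every pair of actions is compared, and each comparison solves a Kantorovich transportation LP. A self-contained sketch (names and the toy example are ours; we assume scipy is available):

```python
import numpy as np
from scipy.optimize import linprog

def kantorovich(p, q, d):
    """Transportation LP for K(d)(p, q) over n states (dual form of
    the Kantorovich program from Section 2)."""
    n = len(p)
    A_eq = np.zeros((2 * n, n * n))
    for k in range(n):
        A_eq[k, k * n:(k + 1) * n] = 1.0  # row sums match p
    for j in range(n):
        A_eq[n + j, j::n] = 1.0           # column sums match q
    return linprog(d.reshape(-1), A_eq=A_eq,
                   b_eq=np.concatenate([p, q]), bounds=(0, None)).fun

def F_step(d, P, R, cr, cp):
    """One application of F: for each state pair, the Hausdorff distance
    (max of min over the A x A action pairs) of the state-action metric
    delta(d)((s,a),(u,b)) = cr*|R[s,a]-R[u,b]| + cp*K(d)(P[s,a], P[u,b])."""
    S, A = R.shape
    new_d = np.zeros((S, S))
    for s in range(S):
        for u in range(S):
            delta = np.array([[cr * abs(R[s, a] - R[u, b])
                               + cp * kantorovich(P[s, a], P[u, b], d)
                               for b in range(A)] for a in range(A)])
            new_d[s, u] = max(delta.min(axis=1).max(), delta.min(axis=0).max())
    return new_d

# Two self-looping states that mirror each other's actions:
# action 0 at state 0 behaves like action 1 at state 1, and vice versa.
P = np.array([[[1., 0.], [1., 0.]], [[0., 1.], [0., 1.]]])  # (S, A, S)
R = np.array([[1., 0.], [0., 1.]])
d1 = F_step(np.zeros((2, 2)), P, R, cr=0.5, cp=0.5)
```

Iterating `F_step` from the zero metric approaches d_fix from below; the A × A inner loop is the quadratic action dependence mentioned above, and the states in this toy example come out at lax distance 0 even though no single action label matches.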
We are also investigating the possibility of replacing the Kantorovich metric (which is very convenient from the theoretical point of view) with a more practical approximation. Finally, the extension to continuous states is very important. We currently have preliminary results on this issue, using an approach similar to (Ferns et al., 2005), which assumes lower semi-continuity of the reward function. However, the details are not yet fully worked out.\nAcknowledgements: This work was funded by NSERC and CFI.\n\nReferences\n\nArun-Kumar, S. (2006). On bisimilarities induced by relations on actions. SEFM ’06: Proceedings of the Fourth IEEE International Conference on Software Engineering and Formal Methods (pp. 41–49). Washington, DC, USA: IEEE Computer Society.\n\nFerns, N., Castro, P. S., Precup, D., & Panangaden, P. (2006). Methods for computing state similarity in Markov Decision Processes. Proceedings of the 22nd UAI.\n\nFerns, N., Panangaden, P., & Precup, D. (2004). Metrics for finite Markov decision processes. Proceedings of the 20th UAI (pp. 162–169).\n\nFerns, N., Panangaden, P., & Precup, D. (2005). Metrics for Markov decision processes with infinite state spaces. Proceedings of the 21st UAI (pp. 201–209).\n\nGibbs, A., & Su, F. (2002). On choosing and bounding probability metrics.\n\nGivan, R., Dean, T., & Greig, M. (2003). Equivalence notions and model minimization in Markov Decision Processes. Artificial Intelligence, 147, 163–223.\n\nLarsen, K. G., & Skou, A. (1991). Bisimulation through probabilistic testing. Inf. Comput., 94, 1–28.\n\nLi, L., Walsh, T. J., & Littman, M. L. (2006). Towards a unified theory of state abstraction for MDPs. Proceedings of the International Symposium on Artificial Intelligence and Mathematics.\n\nMilner, R. (1995). Communication and concurrency. Prentice Hall International (UK) Ltd.\n\nMunkres, J. (1999). Topology. Prentice Hall.\n\nPuterman, M. L. (1994). 
Markov decision processes: discrete stochastic dynamic programming. Wiley.\n\nRavindran, B., & Barto, A. G. (2003). Relativized options: Choosing the right transformation. Proceedings of the 20th ICML (pp. 608–615).\n\nRavindran, B., & Barto, A. G. (2004). Approximate homomorphisms: A framework for non-exact minimization in Markov Decision Processes. Proceedings of the Fifth International Conference on Knowledge Based Computer Systems.\n\nWhitt, W. (1978). Approximations of dynamic programs, I. Mathematics of Operations Research, 3, 231–243.\n\nWolfe, A. P., & Barto, A. G. (2006). Decision tree methods for finding reusable MDP homomorphisms. Proceedings of AAAI.\n", "award": [], "sourceid": 971, "authors": [{"given_name": "Jonathan", "family_name": "Taylor", "institution": null}, {"given_name": "Doina", "family_name": "Precup", "institution": null}, {"given_name": "Prakash", "family_name": "Panangaden", "institution": null}]}