{"title": "Planning for Markov Decision Processes with Sparse Stochasticity", "book": "Advances in Neural Information Processing Systems", "page_first": 785, "page_last": 792, "abstract": null, "full_text": " Planning for Markov Decision Processes with\n Sparse Stochasticity\n\n\n\n Maxim Likhachev Geoff Gordon Sebastian Thrun\nSchool of Computer Science School of Computer Science Dept. of Computer Science\nCarnegie Mellon University Carnegie Mellon University Stanford University\n Pittsburgh, PA 15213 Pittsburgh, PA 15213 Stanford CA 94305\n maxim+@cs.cmu.edu ggordon@cs.cmu.edu thrun@stanford.edu\n\n\n\n Abstract\n\n Planning algorithms designed for deterministic worlds, such as A*\n search, usually run much faster than algorithms designed for worlds with\n uncertain action outcomes, such as value iteration. Real-world planning\n problems often exhibit uncertainty, which forces us to use the slower\n algorithms to solve them. Many real-world planning problems exhibit\n sparse uncertainty: there are long sequences of deterministic actions\n which accomplish tasks like moving sensor platforms into place, inter-\n spersed with a small number of sensing actions which have uncertain out-\n comes. In this paper we describe a new planning algorithm, called MCP\n (short for MDP Compression Planning), which combines A* search with\n value iteration for solving Stochastic Shortest Path problem in MDPs\n with sparse stochasticity. We present experiments which show that MCP\n can run substantially faster than competing planners in domains with\n sparse uncertainty; these experiments are based on a simulation of a\n ground robot cooperating with a helicopter to fill in a partial map and\n move to a goal location.\n\n In deterministic planning problems, optimal paths are acyclic: no state is visited more\nthan once. Because of this property, algorithms like A* search can guarantee that they visit\neach state in the state space no more than once. 
By visiting the states in an appropriate\norder, it is possible to ensure that we know the exact value of all of a state's possible\nsuccessors before we visit that state; so, the first time we visit a state we can compute its\ncorrect value.\n By contrast, if actions have uncertain outcomes, optimal paths may contain cycles:\nsome states will be visited two or more times with positive probability. Because of these\ncycles, there is no way to order states so that we determine the values of a state's successors\nbefore we visit the state itself. Instead, the only way to compute state values is to solve a\nset of simultaneous equations.\n In problems with sparse stochasticity, only a small fraction of all states have uncertain\noutcomes. It is these few states that cause all of the cycles: while a deterministic state s\nmay participate in a cycle, the only way it can do so is if one of its successors has an action\nwith a stochastic outcome (and only if this stochastic action can lead to a predecessor of s).\n In such problems, we would like to build a smaller MDP which contains only states\nwhich are related to stochastic actions. We will call such an MDP a compressed MDP,\nand we will call its states distinguished states. We could then run fast algorithms like A*\nsearch to plan paths between distinguished states, and reserve slower algorithms like value\niteration for deciding how to deal with stochastic outcomes.\n\n\f\n (a) Segbot (b) Robotic helicopter\n\n\n\n\n\n (d) Planning map (e) Execution simulation (c) 3D Map\n Figure 1: Robot-Helicopter Coordination\n\n There are two problems with such a strategy. 
First, there can be a large number of states\nwhich are related to stochastic actions, and so it may be impractical to enumerate all of them\nand make them all distinguished states; we would prefer instead to distinguish only states\nwhich are likely to be encountered while executing some policy which we are considering.\nSecond, there can be a large number of ways to get from one distinguished state to another:\nedges in the compressed MDP correspond to sequences of actions in the original MDP. If\nwe knew the values of all of the distinguished states exactly, then we could use A* search\nto generate optimal paths between them, but since we do not we cannot.\n In this paper, we will describe an algorithm which incrementally builds a compressed\nMDP using a sequence of deterministic searches. It adds states and edges to the compressed\nMDP only by encountering them along trajectories; so, it never adds irrelevant states or\nedges to the compressed MDP. Trajectories are generated by deterministic search, and so\nundistinguished states are treated only with fast algorithms. Bellman errors in the values\nfor distinguished states show us where to try additional trajectories, and help us build the\nrelevant parts of the compressed MDP as quickly as possible.\n\n1 Robot-Helicopter Coordination Problem\n\nThe motivation for our research was the problem of coordinating a ground robot and a\nhelicopter. The ground robot needs to plan a path from its current location to a goal, but\nhas only partial knowledge of the surrounding terrain. The helicopter can aid the ground\nrobot by flying to and sensing places in the map.\n Figure 1(a) shows our ground robot, a converted Segway with a SICK laser rangefinder.\nFigure 1(b) shows the helicopter, also with a SICK. Figure 1(c) shows a 3D map of the\nenvironment in which the robot operates. The 3D map is post-processed to produce a\ndiscretized 2D environment (Figure 1(d)). 
Several places in the map are unknown, either because the robot has not visited them or because their status may have changed (e.g., a car may occupy a driveway). Such places are shown in Figure 1(d) as white squares. The elevation of each white square is proportional to the probability that there is an obstacle there; we assume independence between unknown squares.\n The robot must take the unknown locations into account when planning its route. It may plan a path through these locations, but it risks having to turn back if its way is blocked. Alternatively, the robot can ask the helicopter to fly to any of these places and sense them. We assign a cost to running the robot, and a somewhat higher cost to running the helicopter. The planning task is to minimize the expected overall cost of running the robot and the helicopter while getting the robot to its destination and the helicopter back to its home base. Figure 1(e) shows a snapshot of the robot and helicopter executing a policy.\n Designing a good policy for the robot and helicopter is a POMDP planning problem; unfortunately POMDPs are in general difficult to solve (PSPACE-complete [7]). In the POMDP representation, a state is the position of the robot, the current location of the helicopter (a point on a line segment from one of the unknown places to another unknown place or the home base), and the true status of each unknown location. The positions of the robot and the helicopter are observable, so the only hidden variables are whether each unknown place is occupied. The number of states is (# of robot locations) × (# of helicopter locations) × 2^(# of unknown places), which is exponential in the number of unknown places and therefore quickly becomes very large.\n We approach the problem by planning in the belief state space, that is, the space of probability distributions over world states. 
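To make the state-space blow-up just described concrete, here is a quick back-of-the-envelope computation; the grid size and counts below are invented for illustration and are not the paper's experimental settings:

```python
def num_world_states(n_robot_locs, n_heli_locs, n_unknown):
    # (# of robot locations) x (# of helicopter locations) x 2^(# of unknown places):
    # each unknown place is independently either free or blocked.
    return n_robot_locs * n_heli_locs * 2 ** n_unknown

# A 100 x 100 grid for the robot, 50 helicopter positions, 20 unknown cells:
print(num_world_states(100 * 100, 50, 20))  # 524288000000
```

Even 20 unknown cells push the count past 5 × 10^11 world states, which is why the problem is attacked in (discretized) belief space rather than by solving the POMDP directly.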
This problem is a continuous-state MDP; in this\nbelief MDP, our state consists of the ground robot's location, the helicopter's location, and\na probability of occupancy for each unknown location. We will discretize the continuous\nprobability variables by breaking the interval [0, 1] into several chunks; so, the number of\nbelief states is exponential in the number of unknown places, and classical algorithms such\nas value iteration are infeasible even on small problems.\n If sensors are perfect, this domain is acyclic: after we sense a square we know its true\nstate forever after. On the other hand, imperfect sensors can lead to cycles: new sensor data\ncan contradict older sensor data and lead to increased uncertainty. With or without sensor\nnoise, our belief state MDP differs from general MDPs because its stochastic transitions\nare sparse: large portions of the policy (while the robot and helicopter are traveling be-\ntween unknown locations) are deterministic. The algorithm we propose in this paper takes\nadvantage of this property of the problem, as we explain in the next section.\n\n2 The Algorithm\n\nOur algorithm can be broken into two levels. At a high level, it constructs a compressed\nMDP, denoted M c, which contains only the start, the goal, and some states which are out-\ncomes of stochastic actions. At a lower level, it repeatedly runs deterministic searches to\nfind new information to put into M c. This information includes newly-discovered stochas-\ntic actions and their outcomes; better deterministic paths from one place to another; and\nmore accurate value estimates similar to Bellman backups. The deterministic searches can\nuse an admissible heuristic h to focus their effort, so we can often avoid putting many\nirrelevant actions into M c.\n Because M c will often be much smaller than M , we can afford to run stochastic plan-\nning algorithms like value iteration on it. 
On the other hand, the information we get by planning in M c will improve the heuristic values that we use in our deterministic searches; so, the deterministic searches will tend to visit only relevant portions of the state space.\n\n2.1 Constructing and Solving a Compressed MDP\n\nEach action in the compressed MDP represents several consecutive actions in M : if we see a sequence of states and actions s1, a1, s2, a2, . . . , sk, ak where a1 through ak-1 are deterministic but ak is stochastic, then we can represent it in M c with a single action a, available at s1, whose outcome distribution is P(s' | sk, ak) and whose cost is\n\n c(s1, a, s') = Σ_{i=1}^{k-1} c(si, ai, si+1) + c(sk, ak, s')\n\n(See Figure 2(a) for an example of such a compressed action.) In addition, if we see a sequence of deterministic actions ending in sgoal, say s1, a1, s2, a2, . . . , sk, ak, sk+1 = sgoal, we can define a compressed action which goes from s1 to sgoal at cost Σ_{i=1}^{k} c(si, ai, si+1). We can label each compressed action that starts at s with (s, s', a) (where a = null if s' = sgoal).\n Among all compressed actions starting at s and ending at (s', a) there is (at least) one with lowest expected cost; we will call such an action an optimal compression of (s, s', a). Write Astoch for the set of all pairs (s, a) such that action a when taken from state s has more than one possible outcome, and include as well (sgoal, null). Write Sstoch for the states which are possible outcomes of the actions in Astoch, and include sstart as well. If we include in our compressed MDP an optimal compression of (s, s', a) for every s ∈ Sstoch and every (s', a) ∈ Astoch, the result is what we call the full compressed MDP; an example is in Figure 2(b).\n If we solve the full compressed MDP, the value of each state will be the same as the value of the corresponding state in M. 
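The action-compression construction is easy to state in code. The sketch below uses an invented data layout (the paper does not prescribe one) and, for simplicity, assumes the final stochastic action has the same cost for every outcome; it collapses a deterministic run ending in one stochastic action into a single compressed action with the summed cost:

```python
def compress(det_steps, stoch_cost, outcomes):
    """det_steps: (s_i, a_i, s_{i+1}, cost) tuples for the deterministic prefix;
    stoch_cost: cost c(sk, ak, s') of the final stochastic action (assumed equal
    across outcomes here, for simplicity);
    outcomes: dict mapping outcome state s' -> probability P(s' | sk, ak)."""
    prefix = sum(cost for _, _, _, cost in det_steps)
    # c(s1, a, s') = sum_{i=1}^{k-1} c(si, ai, si+1) + c(sk, ak, s')
    return {s2: (p, prefix + stoch_cost) for s2, p in outcomes.items()}

compressed = compress(
    [("s1", "a1", "s2", 1.0), ("s2", "a2", "s3", 2.0)],  # deterministic moves
    0.5,                                                  # sensing-action cost
    {"free": 0.7, "blocked": 0.3},                        # its two outcomes
)
print(compressed)  # {'free': (0.7, 3.5), 'blocked': (0.3, 3.5)}
```

Both outcomes share the same cost here because the deterministic prefix (cost 3.0) and the sensing action (cost 0.5) are identical along every branch; in general the cost may differ per outcome s'.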
However, we do not need to do that much work:\n\n (a) action compression\n (b) full MDP compression\n (c) incremental MDP compression\n Figure 2: MDP compression\n\nMain()\n01 initialize M c with sstart and sgoal and set their v-values to 0;\n02 while (∃ s ∈ M c s.t. RHS(s) − v(s) > ε and s belongs to the current greedy policy)\n03 select spivot to be any such state s;\n04 [v; vlim] = Search(spivot);\n05 v(spivot) = v;\n06 set the cost c(spivot, ā, sgoal) of the limit action ā from spivot to vlim;\n07 optionally run some algorithm satisfying req. A for a bounded amount of time to improve the value function in M c;\n\n Figure 3: MCP main loop\n\nmany states and actions in the full compressed MDP are irrelevant since they do not appear in the optimal policy from sstart to sgoal. So, the goal of the MCP algorithm will be to construct only the relevant part of the compressed MDP by building M c incrementally. Figure 2(c) shows the incremental construction of a compressed MDP which contains all of the stochastic states and actions along an optimal policy in M.\n The pseudocode for MCP is given in Figure 3. It begins by initializing M c to contain only sstart and sgoal, and it sets v(sstart) = v(sgoal) = 0. It maintains the invariant that 0 ≤ v(s) ≤ v*(s) for all s. On each iteration, MCP looks at the Bellman error of each of the states in M c. The Bellman error is v(s) − RHS(s), where\n\n RHS(s) = min_{a ∈ A(s)} RHS(s, a) and RHS(s, a) = E_{s' ∈ succ(s,a)}[c(s, a, s') + v(s')]\n\nBy convention the min of an empty set is ∞, so an s which does not have any compressed actions yet is considered to have infinite RHS.\n MCP selects a state with negative Bellman error, spivot, and starts a search at that state. (We note that there exist many possible ways to select spivot; for example, we can choose the state with the largest negative Bellman error, or the largest error when weighted by state visitation probabilities in the best policy in M c.) 
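As a concrete illustration of the backup and pivot-selection rule above, here is a toy sketch (not the authors' implementation; the state names, data layout, and costs are invented):

```python
import math

def rhs(s, actions, v):
    """RHS(s) = min over a in A(s) of E_{s'}[c(s,a,s') + v(s')].
    actions[s] maps each action name to a list of (prob, cost, s') outcomes;
    by convention the min over an empty action set is infinity."""
    outs = actions.get(s, {})
    if not outs:
        return math.inf
    return min(
        sum(p * (c + v[s2]) for p, c, s2 in outcomes)
        for outcomes in outs.values()
    )

def select_pivot(states, actions, v, eps):
    """Return the state whose Bellman error RHS(s) - v(s) most exceeds eps,
    or None if every error is within eps."""
    worst, pivot = eps, None
    for s in states:
        err = rhs(s, actions, v) - v[s]
        if err > worst:
            worst, pivot = err, s
    return pivot

# Toy compressed MDP: a stochastic sensing action from s0, plus limit actions
# from its outcomes straight to the goal (all names illustrative).
actions = {
    "s0": {"sense": [(0.5, 2.0, "free"), (0.5, 2.0, "blocked")]},
    "free": {"limit": [(1.0, 1.0, "goal")]},
    "blocked": {"limit": [(1.0, 5.0, "goal")]},
}
v = {"s0": 0.0, "free": 1.0, "blocked": 5.0, "goal": 0.0}
# RHS(s0) = 2 + 0.5*1.0 + 0.5*5.0 = 5.0, so s0's Bellman error is 5.0
print(select_pivot(["s0", "free", "blocked"], actions, v, eps=0.1))  # s0
```

Here "free" and "blocked" already satisfy v(s) = RHS(s), so only s0 has error above ε and is chosen as the pivot.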
The goal of this search is to find a new compressed action a such that its RHS-value can provide a new lower bound on v*(spivot). This action can either decrease the current RHS(spivot) (if a seems to be a better action in terms of the current v-values of action outcomes) or prove that the current RHS(spivot) is valid. Since v(s') ≤ v*(s'), one way to guarantee that RHS(spivot, a) ≤ v*(spivot) is to compute an optimal compression of (spivot, s, a) for all s, a, then choose the one with the smallest RHS. A more sophisticated strategy is to use an A* search with appropriate safeguards to make sure we never overestimate the value of a stochastic action. MCP, however, uses a modified A* search which we will describe in the next section.\n As the search finds new compressed actions, it adds them and their outcomes to M c. It is allowed to initialize newly-added states to any admissible values. When the search returns, MCP sets v(spivot) to the returned value. This value is at least as large as RHS(spivot). Consequently, the Bellman error for spivot becomes non-negative.\n In addition to the compressed action and the updated value, the search algorithm returns a \"limit value\" vlim(spivot). These limit values allow MCP to run a standard MDP planning algorithm on M c to improve its v(s) estimates. MCP can use any planning algorithm which guarantees that, for any s, it will not lower v(s) and will not increase v(s) beyond the smaller of vlim(s) and RHS(s) (Requirement A). For example, we could insert a fake \"limit action\" into M c which goes directly from spivot to sgoal at cost vlim(spivot) (as we do on line 06 in Figure 3), then run value iteration for a fixed amount of time, selecting for each backup a state with negative Bellman error.\n After updating M c from the result of the search and any optional planning, MCP begins again by looking for another state with a negative Bellman error. 
It repeats this process until there are no negative Bellman errors larger than ε. For small enough ε, this property guarantees that we will be able to find a good policy (see section 2.3).\n\n2.2 Searching the MDP Efficiently\n\nThe top level algorithm (Figure 3) repeatedly invokes a search method for finding trajectories from spivot to sgoal. In order for the overall algorithm to work correctly, there are several properties that the search must satisfy. First, the estimate v that search returns for the expected cost of spivot should always be admissible. That is, 0 ≤ v ≤ v*(spivot) (Property 1). Second, the estimate v should be no less than the one-step lookahead value of spivot in M c. That is, v ≥ RHS(spivot) (Property 2). This property ensures that search either increases the value of spivot or finds additional (or improved) compressed actions. The third and final property is for the vlim value, and it is only important if MCP uses its optional planning step (line 07). The property is that v ≤ vlim ≤ v**(spivot) (Property 3). Here v**(spivot) denotes the minimum expected cost of starting at spivot, picking a compressed action not in M c, and acting optimally from then on. (Note that v** can be larger than v* if the optimal compressed action is already part of M c.) Property 3 uses v** rather than v* since the latter is not known while it is possible to compute a lower bound on the former efficiently (see below).\n One could adapt A* search to satisfy at least Properties 1 and 2 by assuming that we can control the outcome of stochastic actions. However, this sort of search is highly optimistic and can bias the search towards improbable trajectories. Also, it can only use heuristics which are even more optimistic than it is: that is, h must be admissible with respect to the optimistic assumption of controlled outcomes.\n We therefore present a version of A*, called MCP-search (Figure 4), that is more efficient for our purposes. MCP-search finds the correct expected value for the first stochastic action it encounters on any given trajectory, and is therefore far less optimistic. And, MCP-search only requires heuristic values to be admissible with respect to the v* values, h(s) ≤ v*(s). Finally, MCP-search speeds up repetitive searches by improving heuristic values based on previous searches.\n A* maintains a priority queue, OPEN, of states which it plans to expand. The OPEN queue is sorted by f(s) = g(s) + h(s), so that A* always expands next a state which appears to be on the shortest path from start to goal. During each expansion a state s is removed from OPEN and all the g-values of s's successors are updated; if g(s') is decreased for some state s', A* inserts s' into OPEN. A* terminates as soon as the goal state is expanded. We use the variant of A* with pathmax [5] to efficiently use heuristics that do not satisfy the triangle inequality.\n MCP-search is similar to A*, but the OPEN list can also contain state-action pairs {s, a} where a is a stochastic action (line 31). Plain states are represented in OPEN as {s, null}. 
Just\n\nImproveHeuristic(s)\n01 if s ∈ M c then h(s) = max(h(s), v(s));\n02 improve heuristic h(s) further if possible using f best and g(s) from previous iterations;\n\nprocedure fvalue({s, a})\n03 if s = null return ∞;\n04 else if a = null return g(s) + h(s);\n05 else return g(s) + max(h(s), E_{s' ∈ Succ(s,a)}[c(s, a, s') + h(s')]);\n\nCheckInitialize(s)\n06 if s was accessed last in some previous search iteration\n07 ImproveHeuristic(s);\n08 if s was not yet initialized in the current search iteration\n09 g(s) = ∞;\n\nInsertUpdateCompAction(spivot, s, a)\n10 reconstruct the path from spivot to s;\n11 insert compressed action (spivot, s, a) into A(spivot) (or update the cost if a cheaper path was found);\n12 for each outcome u of a that was not in M c previously\n13 set v(u) to h(u) or any other value less than or equal to v*(u);\n14 set the cost c(u, ā, sgoal) of the limit action ā from u to v(u);\n\nprocedure Search(spivot)\n15 CheckInitialize(sgoal), CheckInitialize(spivot);\n16 g(spivot) = 0;\n17 OPEN = {{spivot, null}};\n18 {sbest, abest} = {null, null}, f best = ∞;\n19 while (g(sgoal) > min_{{s,a} ∈ OPEN}(fvalue({s, a})) AND f best + ε > min_{{s,a} ∈ OPEN}(fvalue({s, a})))\n20 remove {s, a} with the smallest fvalue({s, a}) from OPEN, breaking ties towards the pairs with a = null;\n21 if a = null //expand state s\n22 for each s' ∈ Succ(s)\n23 CheckInitialize(s');\n24 for each deterministic a' ∈ A(s)\n25 s' = Succ(s, a');\n26 h(s') = max(h(s'), h(s) − c(s, a', s'));\n27 if g(s') > g(s) + c(s, a', s')\n28 g(s') = g(s) + c(s, a', s');\n29 insert/update {s', null} into OPEN with fvalue({s', null});\n30 for each stochastic a' ∈ A(s)\n31 insert/update {s, a'} into OPEN with fvalue({s, a'});\n32 else //encode stochastic action a from state s as a compressed action from spivot\n33 InsertUpdateCompAction(spivot, s, a);\n34 if f best > fvalue({s, a}) then {sbest, abest} = {s, a}, f best = fvalue({s, a});\n35 if (g(sgoal) ≤ min_{{s,a} ∈ OPEN}(fvalue({s, a})) AND OPEN ≠ ∅)\n36 reconstruct the path from
spivot to sgoal;\n37 update/insert into A(spivot) a deterministic action a leading to sgoal;\n38 if f best ≥ g(sgoal) then {sbest, abest} = {sgoal, null}, f best = g(sgoal);\n39 return [f best; min_{{s,a} ∈ OPEN}(fvalue({s, a}))];\n\n Figure 4: MCP-search Algorithm\n\nlike A*, MCP-search expands elements in the order of increasing f-values, but it breaks ties towards elements encoding plain states (line 20). The f-value of {s, a} is defined as g(s) + max[h(s), E_{s' ∈ Succ(s,a)}(c(s, a, s') + h(s'))] (line 05). This f-value is a lower bound on the cost of a policy that goes from spivot to sgoal by first executing a series of deterministic actions until action a is executed from state s. This bound is as tight as possible given our heuristic values.\n State expansion (lines 21-31) is very similar to A*. When the search removes from OPEN a state-action pair {s, a} with a ≠ null, it adds a compressed action to M c (line 33). It also adds a compressed action if there is an optimal deterministic path to sgoal (line 37). f best tracks the minimum f-value of all the compressed actions found. As a result, f best ≤ v*(spivot) and is used as a new estimate for v(spivot). The limit value vlim(spivot) is obtained by continuing the search until the minimum f-value of elements in OPEN approaches f best + ε for some ε ≥ 0 (line 19). This minimum f-value then provides a lower bound on v**(spivot).\n To speed up repetitive searches, MCP-search improves the heuristic of every state that it encounters for the first time in the current search iteration (lines 01 and 02). Line 01 uses the fact that v(s) from M c is a lower bound on v*(s). Line 02 uses the fact that f best − g(s) is a lower bound on v*(s) at the end of each previous call to Search; for more details see [4].\n\n2.3 Theoretical Properties of the Algorithm\n\nWe now present several theorems about our algorithm. The proofs of these and other theorems can be found in [4]. 
The first theorem states the main properties of MCP-search.\nTheorem 1 The search function terminates, and the following holds for the values it returns:\n (a) if sbest ≠ null then v*(spivot) ≥ f best ≥ E{c(spivot, abest, s') + v(s')}\n (b) if sbest = null then v*(spivot) = f best = ∞\n (c) f best ≤ min_{{s,a} ∈ OPEN}(fvalue({s, a})) ≤ v**(spivot) (where v**(spivot) is the quantity from Property 3: the minimum expected cost of starting at spivot with a compressed action not yet in M c).\n If neither sgoal nor any state-action pairs were expanded, then sbest = null and (b) says that there is no policy from spivot that has a finite expected cost. Using the above theorem it is easy to show that MCP-search satisfies Properties 1, 2 and 3, considering that f best is returned as variable v and min_{{s,a} ∈ OPEN}(fvalue({s, a})) is returned as variable vlim in the main loop of the MCP algorithm (Figure 3). Property 1 follows directly from (a) and (b) and the fact that costs are strictly positive and v-values are non-negative. Property 2 also follows trivially from (a) and (b). Property 3 follows from (c). Given these properties the next theorem states the correctness of the outer MCP algorithm (in the theorem π^c_greedy denotes a greedy policy that always chooses an action that looks best based on its cost and the v-values of its immediate successors).\nTheorem 2 Given a deterministic search algorithm which satisfies Properties 1-3, the MCP algorithm will terminate. Upon termination, for every state s ∈ M c that lies on π^c_greedy we have RHS(s) − v(s) ≤ ε.\n Given the above theorem one can show that for 0 < ε < cmin (where cmin is the smallest expected action cost in our MDP) the expected cost of executing π^c_greedy from sstart is at most (cmin / (cmin − ε)) · v(sstart). Picking ε ≥ cmin is not guaranteed to result in a proper policy, even though Theorem 2 continues to hold.\n\n3 Experimental Study\n\nWe have evaluated the MCP algorithm on the robot-helicopter coordination problem described in section 1. To obtain an admissible heuristic, we first compute a value function for every possible configuration of obstacles. 
Then we weight the value functions by the\nprobabilities of their obstacle configurations, sum them, and add the cost of moving the\nhelicopter back to its base if it is not already there. This procedure results in optimistic cost\nestimates because it pretends that the robot will find out the obstacle locations immediately\ninstead of having to wait to observe them.\n The results of our experiments are shown in Figure 5. We have compared MCP against\nthree algorithms: RTDP [1], LAO* [2] and value iteration on reachable states (VI). RTDP\ncan cope with large size MDPs by focussing its planning efforts along simulated execu-\ntion trajectories. LAO* uses heuristics to prune away irrelevant states, then repeatedly\nperforms dynamic programming on the states in its current partial policy. We have im-\nplemented LAO* so that it reduces to AO* [6] when environments are acyclic (e.g., the\nrobot-helicopter problem with perfect sensing). VI was only able to run on the problems\nwith perfect sensing since the number of reachable states was too large for the others.\n The results support the claim that MCP can solve large problems with sparse stochas-\nticity. For the problem with perfect sensing, on average MCP was able to plan 9.5 times\nfaster than LAO*, 7.5 times faster than RTDP, and 8.5 times faster than VI. On average for\nthese problems, MCP computed values for 58633 states while M c grew to 396 states, and\nMCP encountered 3740 stochastic transitions (to give a sense of the degree of stochastic-\nity). The main cost of MCP was in its deterministic search subroutine; this fact suggests\nthat we might benefit from anytime search techniques such as ARA* [3].\n The results for the problems with imperfect sensing show that, as the number and den-\nsity of uncertain outcomes increases, the advantage of MCP decreases. For these problems\nMCP was able to solve environments 10.2 times faster than LAO* but only 2.2 times faster\nthan RTDP. 
On average MCP computed values for 127,442 states, while the size of M c was 3,713 states, and 24,052 stochastic transitions were encountered.\n\nFigure 5: Experimental results. The top row: the robot-helicopter coordination problem with perfect sensors. The bottom row: the robot-helicopter coordination problem with sensor noise. Left column: running times (in secs) for each algorithm grouped by environments. Middle column: the number of backups for each algorithm grouped by environments. Right column: an estimate of the expected cost of an optimal policy (v(sstart)) vs. running time (in secs) for experiment (k) in the top row and experiment (e) in the bottom row. Algorithms in the bar plots (left to right): MCP, LAO*, RTDP and VI (VI is only shown for problems with perfect sensing). The characteristics of the environments are given in the second and third rows under each of the bar plots. The second row indicates how many cells the 2D plane is discretized into, and the third row indicates the number of initially unknown cells in the environment.\n\n4 Discussion\n\nThe MCP algorithm incrementally builds a compressed MDP using a sequence of deterministic searches. Our experimental results suggest that MCP is advantageous for problems with sparse stochasticity. In particular, MCP has allowed us to scale to larger environments than were otherwise possible for the robot-helicopter coordination problem.\n\nAcknowledgements\nThis research was supported by DARPA's MARS program. All conclusions are our own.\n\nReferences\n\n[1] A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81-138, 1995.\n\n[2] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35-62, 2001.\n\n[3] M. Likhachev, G. Gordon, and S. Thrun. ARA*: Anytime A* with provable bounds on sub-optimality. 
In Advances in Neural Information Processing Systems (NIPS) 16. Cambridge, MA: MIT Press, 2003.\n\n[4] M. Likhachev, G. Gordon, and S. Thrun. MCP: Formal analysis. Technical report, Carnegie Mellon University, Pittsburgh, PA, 2004.\n\n[5] L. Mero. A heuristic search algorithm with modifiable estimate. Artificial Intelligence, 23:13-27, 1984.\n\n[6] N. Nilsson. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing, 1980.\n\n[7] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450, 1987.\n", "award": [], "sourceid": 2727, "authors": [{"given_name": "Maxim", "family_name": "Likhachev", "institution": null}, {"given_name": "Sebastian", "family_name": "Thrun", "institution": null}, {"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}