{"title": "Approximate Planning in POMDPs with Macro-Actions", "book": "Advances in Neural Information Processing Systems", "page_first": 775, "page_last": 782, "abstract": "", "full_text": "Approximate Planning in POMDPs with Macro-Actions\n\nGeorgios Theocharous\nMIT AI Lab\n200 Technology Square\nCambridge, MA 02139\ntheochar@ai.mit.edu\n\nLeslie Pack Kaelbling\nMIT AI Lab\n200 Technology Square\nCambridge, MA 02139\nlpk@ai.mit.edu\n\nAbstract\n\nRecent research has demonstrated that useful POMDP solutions do not require consideration of the entire belief space. We extend this idea with the notion of temporal abstraction. We present and explore a new reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply the algorithm to a large scale robot navigation task and demonstrate that with temporal abstraction we can consider an even smaller part of the belief space, we can learn POMDP policies faster, and we can do information gathering more efficiently.\n\n1 Introduction\n\nA popular approach to artificial intelligence is to model an agent and its interaction with its environment through actions, perceptions, and rewards [10]. Intelligent agents should choose actions after every perception, such that their long-term reward is maximized. A well defined framework for this interaction is the partially observable Markov decision process (POMDP) model. Unfortunately, solving POMDPs is an intractable problem, mainly because exact solutions rely on computing a policy over the entire belief-space [6, 3], which is a simplex of dimension equal to the number of states in the underlying Markov decision process (MDP). 
Recently researchers have proposed algorithms that take advantage of the fact that for most POMDP problems, a large proportion of the belief space is not experienced [7, 9].\n\nIn this paper we explore the same idea, but in combination with the notion of temporally extended actions (macro-actions). We propose and investigate a new model-based reinforcement learning algorithm over grid-points in belief space, which uses macro-actions and Monte Carlo updates of the Q-values. We apply our algorithm to large scale robot navigation and demonstrate the various advantages of macro-actions in POMDPs. Our experimental results show that with macro-actions an agent experiences a significantly smaller part of the belief space than with simple primitive actions. In addition, learning is faster because an agent can look further into the future and propagate values of belief points faster. And finally, well designed macros, such as macros that can easily take an agent from a high entropy belief state to a low entropy belief state (e.g., go down the corridor), enable agents to perform information gathering.\n\n2 POMDP Planning with Macros\n\nWe now describe our algorithm for finding an approximately optimal plan for a known POMDP with macro actions. It works by using a dynamically-created finite-grid approximation to the belief space, and then using model-based reinforcement learning to compute a value function at the grid points. Our algorithm takes as input a POMDP model, a resolution r, and a set of macro-actions (described as policies or finite state automata). 
The output is a set of grid-points (in belief space) and their associated action-values, which via interpolation specify an action-value function over the entire belief space, and therefore a complete policy for the POMDP.\n\nDynamic Grid Approximation A standard method of finding approximate solutions to POMDPs is to discretize the belief space by covering it with a uniformly-spaced grid (otherwise called a regular grid, as shown in Figure 1), then solve an MDP that takes those grid points as states [1]. Unfortunately, the number of grid points required rises exponentially in the number of dimensions in the belief space, which corresponds to the number of states in the original space.\n\nRecent studies have shown that in many cases, an agent actually travels through a very small subpart of its entire belief space. Roy and Gordon [9] find a low-dimensional subspace of the original belief space, then discretize that uniformly to get an MDP approximation to the original POMDP. This is an effective strategy, but it might be that the final uniform discretization is unnecessarily fine.\n\n[Figure 1 graphic omitted. Panels: RESOLUTION 1, RESOLUTION 2, RESOLUTION 4; each shows the belief simplex over states S1, S2, S3, with labeled points (1,0,0), (0,1,0), (0,0,1), (0.5,0.5,0), and (0.25,0.75,0).]\n\nFigure 1: The figure depicts various regular discretizations of a 3 dimensional belief simplex. The belief-space is the surface of the triangle, while grid points are the intersection of the lines drawn within the triangles. Using resolution of powers of 2 allows finer discretizations to include the points of coarser discretizations.\n\nIn our work, we allocate grid points from a uniformly-spaced grid dynamically by simulating trajectories of the agent through the belief space. At each belief state experienced, we find the grid point that is closest to that belief state and add it to the set of grid points that we explicitly consider. 
In this way, we develop a set of grid points that is typically a very small subset of the entire possible grid, which is adapted to the parts of the belief space typically inhabited by the agent.\n\nIn particular, given a grid resolution r and a belief state b we can compute the coordinates (grid points $g_i$) of the belief simplex that contains b using an efficient method called Freudenthal triangulation [2]. In addition to the vertices of a sub-simplex, Freudenthal triangulation also produces barycentric coordinates $\\lambda_i$, with respect to $g_i$, which enable effective interpolation for the value of the belief state b from the values of the grid points $g_i$ [1]. Using the barycentric coordinates we can also decide which is the closest grid-point to be added in the state space.\n\nMacro Actions The semi-Markov decision process (SMDP) model has become the preferred method for modeling temporally extended actions. An SMDP is defined as a five-tuple (S, A, P, R, F), where S is a finite set of states, A is the set of actions, P is the state and action transition probability function, R is the reward function, and F is a function giving probability of transition times for each state-action pair. The transitions are at decision epochs only. The SMDP represents snapshots of the system at decision points, whereas the so-called natural process [8] describes the evolution of the system over all times. Discrete-time SMDPs represent transition distributions as $F(s', N \\mid s, a)$, which specifies the probability that action a, started in state s, terminates in state s' after N steps. Q-learning generalizes nicely to discrete SMDPs. 
The Q-learning rule for discrete-time discounted SMDPs is\n\n$$Q_{t+1}(s, a) \\leftarrow (1 - \\beta) Q_t(s, a) + \\beta \\left( R + \\gamma^k \\max_{a' \\in A(s')} Q_t(s', a') \\right),$$\n\nwhere $\\beta \\in (0, 1)$, and action a was initiated in state s, lasted for k steps, and terminated in state s', while generating a total discounted sum of rewards of R. Several frameworks for hierarchical reinforcement learning have been proposed, all of which are variants of SMDPs, such as the \u201coptions\u201d framework [11].\n\nMacro actions have been shown to be useful in a variety of MDP situations, but they have a special utility in POMDPs. For example, in a robot navigation task modeled as a POMDP, macro actions can consist of small state machines, such as a simple policy for driving down a corridor without hitting the walls until the end is reached. Such actions may have the useful property of reducing the entropy of the belief state, by helping a robot to localize its position. In addition, they relieve us of the burden of having to choose another primitive action based on the new belief state. Using macro actions tends to reduce the number of belief states that are visited by the agent. If a robot navigates largely by using macro-actions to move to important landmarks, it will never be necessary to model the belief states that are concerned with where the robot is within a corridor, for example.\n\nAlgorithm Our algorithm works by building a grid-based approximation of the belief space while executing a policy made up of macro actions. The policy is determined by \u201csolving\u201d the finite MDP over the grid points. 
Computing a policy over grid points equally spaced in the belief simplex, otherwise called regular discretization, is computationally intractable since the number of grid-points grows exponentially with the resolution [2]. Nonetheless, the value of a belief point in a regular discretization can be interpolated efficiently from the values of the neighboring grid-points [2]. On the other hand, in variable resolution non-regular grids, interpolation can be computationally expensive [1]. A better approach is variable resolution with regular discretization, which takes advantage of fast interpolation and increases resolution only in the necessary areas [12]. Our approach falls in this last category with the addition of macro-actions, which exhibit various advantages over approaches using primitive actions only. Specifically, we use a reinforcement-learning algorithm (rather than dynamic programming) to compute a value function over the MDP states. It works by generating trajectories through the belief space according to the current policy, with some added exploration. Reinforcement learning using a model, otherwise called real time dynamic programming (RTDP), is not only better suited for huge spaces but in our case is also convenient in estimating the necessary models of our macro-actions over the experienced grid points.\n\nWhile Figure 2 gives a graphical explanation of the algorithm, below we sketch the entire algorithm in detail:\n\n1. Assume a current true state s. This is the physical true location of the agent, and it should have support in the current belief state b (that is, $b(s) \\neq 0$).\n\n2. Discretize the current belief state $b \\to g_i$, where $g_i$ is the closest grid-point (with the maximum barycentric coordinate) in a regular discretization of the belief space. If $g_i$ is missing add it to the table. 
If the resolution is 1, initialize its value to zero; otherwise interpolate its initial value from coarser resolutions.\n\n[Figure 2 graphic omitted: belief states b, b', b'' and grid points g, g1, g2, g3.]\n\nFigure 2: The agent finds itself at a belief state b. It maps b to the grid point g, which has the largest barycentric coordinate among the sub-simplex coordinates that contain b. Now, it needs to do a value backup for that grid point. It chooses a macro action and executes it starting from the chosen grid-point, using the primitive actions it takes and the observations it receives along the way to update its belief state. It needs to get a value estimate for the resulting belief state b''. It does so by using the barycentric coordinates from the grid to interpolate a value from nearby grid points g1, g2, and g3. In case the nearest grid-point gi is missing, it is interpolated from coarser resolutions and added to the representation. If the resolution is 1, the value of gi is initialized to zero. The agent executes the macro-action from the same grid point g multiple times so that it can approximate the probability distribution over the resulting belief-states b''. Finally, it can update the estimated value of the grid point g and execute the macro-action chosen from the true belief state b. The process repeats from the next true belief state b'.\n\n3. Choose a random action $\\epsilon$% of the time. The rest of the time choose the best macro-action $\\mu$ by interpolating over the Q values of the vertices of the sub-simplex that contains b: $\\mu = \\arg\\max_{\\mu \\in M} \\sum_{i=1}^{|S|+1} \\lambda_i Q(g_i, \\mu)$.\n\n4. Estimate $E[R(g_i, \\mu) + \\gamma^t V(b')]$ by sampling:\n\n(a) Sample a state s from the current grid-belief state $g_i$ (which like all belief states represents a probability distribution over world states).\ni. Set t = 0.\nii. Choose the appropriate primitive action a according to macro-action $\\mu$.\niii. 
Sample the next state s' from the transition model $T(s, a, \\cdot)$.\niv. Sample an observation z from observation model $O(a, s', \\cdot)$.\nv. Store the reward $R(g_i, \\mu) := R(g_i, \\mu) + \\gamma^t R(s, a)$. For faster learning we use reward-shaping: $R(g_i, \\mu) := R(g_i, \\mu) + \\gamma^{t+1} V(s') - \\gamma^t V(s)$, where V(s) are the values of the underlying MDP [5].\nvi. Update the belief state: $b'(j) := \\frac{1}{\\alpha} O(a, j, z) \\sum_{i \\in S} T(i, a, j) b(i)$, for all states j, where $\\alpha$ is a normalizing factor.\nvii. Set t = t + 1, b = b', s = s' and repeat from step 4(a)ii until $\\mu$ terminates.\n(b) Compute the value of the resulting belief state b' by interpolating over the vertices in the resulting belief sub-simplex: $V(b') = \\sum_{i=1}^{|S|+1} \\lambda_i V(g_i)$. If the closest grid-point (with the maximum barycentric coordinate) is missing, interpolate it from coarser resolutions, and add it to the hash-table.\n(c) Repeat steps 4a and 4b multiple times, and average the estimate $[R(g_i, \\mu) + \\gamma^t V(b')]$.\n\n5. Update the state action value: $Q(g_i, \\mu) = (1 - \\beta) Q(g_i, \\mu) + \\beta [R + \\gamma^t V(b')]$.\n\n6. Update the state value: $V(g_i) = \\max_{\\mu \\in M} Q(g_i, \\mu)$.\n\n7. Execute the macro-action $\\mu$ starting from belief state b until termination. During execution, generate observations by sampling the POMDP model, starting from the true state s. Set b = b' and s = s' and go to step 2.\n\n8. Repeat this learning epoch multiple times starting from the same b.\n\n3 Experimental Results\n\nWe tested this algorithm by applying it to the problem of robot navigation, which is a classic sequential decision-making problem under uncertainty. We performed experiments in a corridor environment, shown in Figure 3. 
Such a topological map can be compiled into POMDPs, in which the discrete states stand for regions in the robot\u2019s pose space (for example 2 square meters in position and 90 degrees in orientation). In such a representation, the robot can move through the different environment states by taking actions such as \u201cgo-forward\u201d, \u201cturn-left\u201d, and \u201cturn-right\u201d. A macro-action is implemented as a behavior (could be a POMDP policy) that takes observations as inputs and outputs actions. In our experiments we have a macro-action for going down the corridor until the end. In this navigation domain, our robot can only perceive sixteen possible observations, which indicate the presence of a wall or an opening on each of the four sides of the robot. The observations are extracted from trained neural nets where the inputs are local occupancy grids constructed from sonar sensors and outputs are probabilities of walls and openings [4]. The POMDP model of the corridor environment has a reward function with value -1 in every state, except for -100 for going forward into a wall and +100 for taking any action from the four-way junction.\n\n[Figure 3 graphic omitted: floor plan (left) and topological map with edge distances in meters (right).]\n\nFigure 3: The figure on the left shows the floor plan of our experimental environment. The figure on the right is a topological map representation of the floor, which compiles into a POMDP with 1068 world states. The numbers next to the edges are the distances between the nodes in meters.\n\nWe ran the algorithm starting with resolution 1. When the average number of training steps stabilized we increased the resolution by multiplying it by 2. The maximum resolution we considered was 4. Each training episode started from the uniform initial belief state and was terminated when the four-way junction was reached or when more than 200 steps were taken. 
We ran the algorithm with and without the macro-action go-down-the-corridor. We compared the results with the QMDP heuristic, which first solves the underlying MDP and then, given any belief state, chooses the action that maximizes the dot product of the belief and Q values of state action pairs: $QMDP_a = \\arg\\max_a \\sum_{s=1}^{|S|} b(s) Q(s, a)$.\n\nWith Reward Shaping The learning results in Figure 4 demonstrate that learning with macro-actions requires fewer training steps, which means the agent is getting to the goal faster. An exception is when the resolution is 1, where training with only primitive actions requires a small number of steps too. Nonetheless as we increase the resolution, training with primitive actions only does not scale well, because the number of states increases dramatically.\n\nIn general, the number of grid points used with or without macro-actions is significantly smaller than the total number of points allowed for regular discretization. For example, for a regular discretization the number of grid points can be computed by the formula given in [2], $\\frac{(r + |S| - 1)!}{r! (|S| - 1)!}$, which is approximately $5.4 \\times 10^{10}$ for r = 4 and |S| = 1068. Our algorithm with macro actions uses only about 3000 and with primitive actions only about 6500 grid points.\n\n[Figure 4 graphics omitted. Left panel, Training Steps: average # of training steps (0 to 180) versus number of training episodes (0 to 900), for primitive and macro. Right panel, Number of States: number of states (0 to 7000) versus number of training episodes (0 to 900), for primitive and macro.]\n\nFigure 4: The graph on the left shows the average number of training-steps per episode as a function of the number of episodes. The graph on the right shows the number of grid-points added during learning. 
The sharp changes in the graph are due to the resolution increase.\n\nWe tested the policies that resulted from each algorithm by starting from a uniform initial belief state and a uniformly randomly chosen world state and simulating the greedy policy derived by interpolating the grid value function. We tested our plans over 200 different sampling sequences and report the results in Figure 5. A run was considered a success if the robot was able to reach the goal in fewer than 200 steps.\n\n[Figure 5 graphics omitted. Left panel, Success: success % (0 to 100) versus resolution (0 to 4), for qmdp, primitive, and macro. Right panel, Testing Steps: steps to goal (0 to 140) versus resolution (0 to 4), for primitive and macro.]\n\nFigure 5: The figure on the left shows the success percentage for the different methods during testing. The results are reported after training for each resolution. The graph on the right shows the number of steps during testing. For the primitive-actions only algorithm we report the result for resolution 1 only, since it was as successful as the macro-action algorithm.\n\nFrom Figure 5 we can conclude that the QMDP approach can never be 100% successful, while the primitive-actions algorithm can perform quite well with resolution 1 in this environment. It is also evident from Figure 5 that as we increase the resolution, the macro-action algorithm maintains its robustness while the primitive-action algorithm performs considerably worse, mainly due to the fact that it requires more grid-points. In addition, when we compared the average number of testing steps for resolution 1 the macro-action algorithm seems to have learned a better policy. The macro-action policy seems to get worse for resolution 4 due to the increasing number of grid-points added in the representation. 
This means that more training is required.\n\nWithout Reward Shaping We also performed experiments to investigate the effect of reward-shaping. Figure 6 shows that with primitive actions only, the algorithm fails completely. However, with macro-actions the algorithm still converges and is more successful than the QMDP-heuristic.\n\n[Figure 6 graphics omitted. Left panel, Training Steps: average # of training steps (110 to 200) versus number of training episodes (0 to 900), for primitive and macro. Right panel, Success: success % (0 to 100) versus resolution (0 to 1), for primitive and macro.]\n\nFigure 6: The graph on the left shows the average number of training-steps (without reward shaping). The figure on the right shows the success percentage.\n\nInformation Gathering Apart from simulated experiments we also wanted to compare the performance of QMDP with the macro-action algorithm on a platform more closely related to a real robot. We used the Nomad 200 simulator and describe a test in Figure 7 to demonstrate how our algorithm is able to perform information gathering, as compared to QMDP.\n\n4 Conclusions\n\nIn this paper we have presented an approximate planning algorithm for POMDPs that uses macro-actions. Our algorithm is able to solve a difficult planning problem, namely the task of navigating to a goal in a huge space POMDP starting from a uniform initial belief, which is more difficult than many of the tasks that similar algorithms are tested on. In addition, we have presented an effective reward-shaping approach to POMDPs that results in faster training (even without macro-actions).\n\nIn general macro-actions in POMDPs allow us to experience a smaller part of the state space, backup values faster, and do information gathering. 
As a result we can afford to allow for higher grid resolution which results in better performance. We cannot do this with only primitive actions (unless we use reward shaping) and it is completely out of the question for exact solution over the entire regular grid. In our current research we are investigating methods for dynamic discovery of \u201cgood\u201d macro-actions given a POMDP.\n\n[Figure 7 graphic omitted: floor layout with labeled locations START, J1, J2, J3, J4, J5, and GOAL.]\n\nFigure 7: The figure shows the actual floor as it was designed in the Nomad 200 simulator. For the QMDP approach the robot starts from START with uniform initial belief. After reaching J2 the belief becomes bimodal, concentrating on J1 and J2. The robot then keeps turning left and right. On the other hand, with our planning algorithm, the robot again starts from START and a uniform initial belief. Upon reaching J2 the belief becomes bimodal over J1 and J2. The agent resolves its uncertainty by deciding that the best action to take is the go-down-the-corridor macro, at which point it encounters J3 and localizes. The robot then is able to reach its goal by traveling from J3 to J2, J1, J4, and J5.\n\nReferences\n\n[1] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33\u201394, 2000.\n\n[2] W. S. Lovejoy. Computationally feasible bounds for partially observed Markov decision processes. Operations Research, 39(1):162\u2013175, January-February 1991.\n\n[3] O. Madani, S. Hanks, and A. Gordon. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision processes. In Proceedings of the Sixteenth National Conference in Artificial Intelligence, pages 409\u2013416, 1999.\n\n[4] S. Mahadevan, G. Theocharous, and N. Khaleeli. 
Fast concept learning for mobile robots. Machine Learning and Autonomous Robots Journal (joint issue), 31/5:239\u2013251, 1998.\n\n[5] A. Y. Ng, D. Harada, and S. Russell. Theory and application to reward shaping. In Proceedings of the Sixteenth International Conference on Machine Learning, 1999.\n\n[6] C. Papadimitriou and J. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), 1987.\n\n[7] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, 2003.\n\n[8] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley, 1994.\n\n[9] N. Roy and G. Gordon. Exponential family PCA for belief compression in POMDPs. In Advances in Neural Information Processing Systems, 2003.\n\n[10] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2003.\n\n[11] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181\u2013211, 1999.\n\n[12] R. Zhou and E. A. Hansen. An improved grid-based approximation algorithm for POMDPs. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, WA, August 2001.", "award": [], "sourceid": 2485, "authors": [{"given_name": "Georgios", "family_name": "Theocharous", "institution": null}, {"given_name": "Leslie", "family_name": "Kaelbling", "institution": null}]}