{"title": "Bayesian Hierarchical Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 73, "page_last": 81, "abstract": "We describe an approach to incorporating Bayesian priors in the maxq framework for hierarchical reinforcement learning (HRL). We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, given sensible priors, (ii) task hierarchies and Bayesian priors can be complementary sources of information, and using both sources is better than either alone, (iii) taking advantage of the structural decomposition induced by the task hierarchy significantly reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to automatic learning of hierarchically optimal rather than recursively optimal policies.", "full_text": "Bayesian Hierarchical Reinforcement Learning\n\nFeng Cao\n\nDepartment of EECS\n\nCase Western Reserve University\n\nCleveland, OH 44106\nfxc100@case.edu\n\nSoumya Ray\n\nDepartment of EECS\n\nCase Western Reserve University\n\nCleveland, OH 44106\nsray@case.edu\n\nAbstract\n\nWe describe an approach to incorporating Bayesian priors in the MAXQ framework\nfor hierarchical reinforcement learning (HRL). We de\ufb01ne priors on the primitive\nenvironment model and on task pseudo-rewards. Since models for composite tasks\ncan be complex, we use a mixed model-based/model-free learning approach to\n\ufb01nd an optimal hierarchical policy. 
We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies.

1 Introduction

Reinforcement learning (RL) is a well known framework that formalizes decision making in unknown, uncertain environments. RL agents learn policies that map environment states to available actions while optimizing some measure of long-term utility. While various algorithms have been developed for RL [1], and applied successfully to a variety of tasks [2], the standard RL setting suffers from at least two drawbacks. First, it is difficult to scale standard RL approaches to large state spaces with many factors (the well-known "curse of dimensionality"). Second, vanilla RL approaches do not incorporate prior knowledge about the environment and good policies.

Hierarchical reinforcement learning (HRL) [3] attempts to address the scaling problem by simplifying the overall decision making problem in different ways. For example, one approach introduces macro-operators for sequences of primitive actions. Planning at the level of these operators may result in simpler policies [4]. Another idea is to decompose the task's overall value function, for example by defining task hierarchies [5] or partial programs with choice points [6]. The structure of the decomposition provides several benefits: first, for the "higher level" subtasks, policies are defined by calling "lower level" subtasks (which may themselves be quite complex); as a result, policies for higher level subtasks may be expressed compactly. Second, a task hierarchy or partial program can impose constraints on the space of policies by encoding knowledge about the structure of good policies, and thereby reduce the search space. Third, learning within subtasks allows state abstraction; that is, some state variables can be ignored because they do not affect the policy within that subtask. This also simplifies the learning problem.

While HRL attempts to address the scalability issue, it does not take into account probabilistic prior knowledge the agent may have about the task. For example, the agent may have some idea about where high/low utility states may be located and what their utilities may be, or some idea about the approximate shape of the value function or policy. Bayesian reinforcement learning addresses this issue by incorporating priors on models [7], value functions [8, 9] or policies [10]. Specifying good priors leads to many benefits, including initial good policies, directed exploration towards regions of uncertainty, and faster convergence to the optimal policy.

In this paper, we propose an approach that incorporates Bayesian priors in hierarchical reinforcement learning. We use the MAXQ framework [5], which decomposes the overall task into subtasks so that value functions of the individual subtasks can be combined to recover the value function of the overall task. We extend this framework by incorporating priors on the primitive environment model and on task pseudo-rewards. In order to avoid building models for composite tasks (which can be very complex), we adopt a mixed model-based/model-free learning approach. We empirically evaluate our algorithm to understand the effect of the priors in addition to the task hierarchy. Our experiments indicate that: (i) taking advantage of probabilistic prior knowledge can lead to faster convergence, even for HRL, (ii) task hierarchies and Bayesian priors can be complementary sources of information, and using both sources is better than either alone, (iii) taking advantage of the task hierarchy can reduce the computational cost of Bayesian RL, which generally tends to be very high, and (iv) task pseudo-rewards can be learned instead of being manually specified, leading to automatic learning of hierarchically optimal rather than recursively optimal policies. In this way Bayesian RL and HRL are synergistic: Bayesian RL improves convergence of HRL and can learn hierarchy parameters, while HRL can reduce the significant computational cost of Bayesian RL.

Our work assumes the probabilistic priors to be given in advance and focuses on learning with them. Other work has addressed the issue of obtaining these priors. For example, one source of prior information is multi-task reinforcement learning [11, 12], where an agent uses the solutions of previous RL tasks to build priors over models or policies for future tasks. We also assume the task hierarchy is given. Other work has explored learning MAXQ hierarchies in different settings [13].

2 Background and Related Work

In the MAXQ framework, each composite subtask Ti defines a semi-Markov decision process with parameters ⟨Si, Xi, Ci, Gi⟩. Si defines the set of "non-terminal" states for Ti, where Ti may be called by its parent. Gi defines a set of "goal" states for Ti. The actions available within Ti are described by the set of "child tasks" Ci. Finally, Xi denotes the set of "relevant state variables" for Ti. Often, we unify the non-Si states and Gi into a single "termination" predicate, Pi.
An (s, a, s′) triple where Pi(s) is false, Pi(s′) is true, a ∈ Ci, and the transition probability P(s′|s, a) > 0 is called an exit of the subtask Ti. A pseudo-reward function R̃(s, a) can be defined over exits to express preferences over the possible exits of a subtask.

A hierarchical policy π for the overall task is an assignment of a local policy to each SMDP Ti. A hierarchically optimal policy is a hierarchical policy that has the maximum expected reward. A hierarchical policy is said to be recursively optimal if the local policy for each subtask is optimal given that all its subtask policies are optimal. Given a task graph, model-free [5] or model-based [14] methods can be used to learn value functions for each task-subtask pair. In the model-free method, a policy is produced by maintaining a value and a completion function for each subtask. For a task i, the value V(a, s) denotes the expected value of calling child task a in state s. This is (recursively) estimated as the expected reward obtained while executing a. The completion function C(i, s, a) denotes the expected reward obtained while completing i after having called a in s. The central idea behind MAXQ is that the value of i, V(i, s), can be (recursively) decomposed in terms of V(a, s) and C(i, s, a). The model-based RMAXQ [14] algorithm extends RMAX [15] to MAXQ by learning models for all primitive and composite tasks. Value iteration is used with these models to learn a policy for each subtask. An optimistic exploration strategy is used together with a parameter m that determines how often a transition or reward needs to be seen to be usable in the planning step.

In the MAXQ framework, pseudo-rewards must be manually specified to learn hierarchically optimal policies.
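As context for what follows, the recursive MAXQ decomposition just described, V(i, s) = max_a [V(a, s) + C(i, s, a)] with V equal to the expected immediate reward at primitive actions, can be sketched as follows. This is a toy illustration with hypothetical task names, rewards and completion values, not the authors' implementation:

```python
# Toy sketch of the MAXQ value decomposition. The task names, rewards and
# completion values below are hypothetical; a real agent learns these tables.
children = {"Root": ["Get", "Put"], "Get": ["pickup", "navigate"]}
primitive_value = {("pickup", "s0"): -1.0, ("navigate", "s0"): -1.0,
                   ("Put", "s0"): -1.0}   # V(a, s) for primitive actions
completion = {}                           # C(i, s, a), defaults to 0.0

def value(task, state):
    """V(i, s): immediate reward for primitives, max over children otherwise."""
    if task not in children:              # primitive action
        return primitive_value.get((task, state), 0.0)
    return max(value(a, state) + completion.get((task, state, a), 0.0)
               for a in children[task])

completion[("Root", "s0", "Get")] = 5.0   # pretend "Get" completes well here
print(value("Root", "s0"))                # -1.0 + 5.0 = 4.0
```

Note that only the completion tables and primitive rewards are stored; composite values are always recomputed recursively, which is exactly the property the algorithm below exploits.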
Recent work has attempted to directly learn hierarchically optimal policies for ALisp partial programs, which generalize MAXQ task hierarchies [6, 16], using a model-free approach. Here, along with task value and completion functions, an "external" Q function QE is maintained for each subtask. This function stores the reward obtained after the parent of a subtask exits. A problem here is that this hurts state abstraction, since QE is no longer "local" to a subtask. In later work [16], this is addressed by recursively representing QE in terms of task value and completion functions, linked by conditional probabilities of parent exits given child exits. The conditional probabilities and recursive decomposition are used to compute QE as needed to select actions.

Bayesian reinforcement learning methods incorporate probabilistic prior knowledge on models [7], value functions [8, 9], policies [10] or combinations [17]. One Bayesian model-based RL algorithm proceeds as follows. At each step, a distribution over model parameters is maintained, and a model is sampled from this distribution (Thompson sampling [18, 19]). This model is then solved and actions are taken according to the policy obtained. This yields observations that are used to update the parameters of the current distribution to create a posterior distribution over models. This procedure is then iterated to convergence. Variations of this idea have been investigated; for example, some work converts the distribution over models to an empirical distribution over Q-functions, and produces policies by sampling from this distribution instead [7].

Relatively little work exists that attempts to incorporate probabilistic priors into HRL. We have found one preliminary attempt [20] that builds on the RMAX+MAXQ [14] method. This approach adds priors to each subtask model and performs (separate) Bayesian model-based learning for each subtask.
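The generic Bayesian model-based loop described above (maintain a posterior over model parameters, Thompson-sample a model, solve it, act, update the posterior) can be sketched on a small made-up MDP. The sizes, rewards and "true" dynamics below are illustrative only, and rewards are assumed known for brevity:

```python
# Sketch of Bayesian model-based RL with Thompson sampling on a toy MDP.
# Dirichlet posteriors over transitions; all numbers here are made up.
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
alpha = np.ones((S, A, S))                          # uninformed Dirichlet prior
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])  # made-up reward table

def sample_model():
    """Thompson sampling: draw T[s, a, :] ~ Dirichlet(alpha[s, a, :])."""
    return np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                     for s in range(S)])

def solve(T, iters=200):
    """Value iteration on the sampled model; returns the Q function."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * T @ Q.max(axis=1)
    return Q

# Hypothetical true dynamics, used only to simulate the environment.
true_T = np.array([[[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
                   [[0.0, 0.8, 0.2], [0.0, 0.1, 0.9]],
                   [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])
s = 0
for _ in range(100):
    a = int(solve(sample_model())[s].argmax())  # act greedily in sampled model
    s_next = int(rng.choice(S, p=true_T[s, a]))
    alpha[s, a, s_next] += 1.0                  # conjugate posterior update
    s = s_next
```

As the Dirichlet counts grow, sampled models concentrate around the true dynamics, so exploration happens through posterior uncertainty rather than through an explicit exploration bonus.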
In our approach, we do not construct models for subtasks, which can be very complex in general. Instead, we only maintain distributions over primitive actions, and use a mixed model-based/model-free learning algorithm that is naturally integrated with the standard MAXQ learning algorithm. Further, we show how to learn pseudo-rewards for MAXQ in the Bayesian framework.

3 Bayesian MAXQ Algorithm

In this section, we describe our approach to incorporating probabilistic priors into MAXQ. We use priors over primitive models and pseudo-rewards. As we explain below, pseudo-rewards are value functions; thus our approach uses priors both on models and on value functions. While such an integration may not be needed for standard Bayesian RL, it appears naturally in our setting.

We first describe our approach to incorporating priors on environment models alone (assuming pseudo-rewards are fixed). We do this following the Bayesian model-based RL framework. At each step we have a distribution over environment models (initially the prior). The algorithm has two main subroutines: the main BAYESIAN MAXQ routine (Algorithm 1) and an auxiliary RECOMPUTE VALUE routine (Algorithm 2). In this description, the value V and completion C functions are assumed to be global. At the start of each episode, the BAYESIAN MAXQ routine is called with the Root task and the initial state for the current episode. The MAXQ execution protocol is then followed, where each task chooses an action based on its current value function (initially random). When a primitive action is reached and executed, it updates the posterior over model parameters (Line 3) and its own value estimate (which is just the reward function for primitive actions). When a task exits and returns to its parent, the parent subsequently updates its completion function based on the current estimates of the value of the exit state (Lines 14 and 15).
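The completion-function update just described (choose the best next child under the pseudo-reward-aware estimates, then move C and its pseudo-reward-aware variant C̃ toward the discounted targets) might look like the following sketch. The tables, learning rate and the stubbed child-value lookup are hypothetical, not the actual implementation:

```python
# Hedged sketch of the completion-function updates performed when a child
# task a returns after n_steps, leaving the parent i in state s_next.
gamma, lr = 0.9, 0.1
C, C_tilde, V = {}, {}, {}   # C(i,s,a), pseudo-reward-aware variant, V(a,s)

def completion_update(i, s, a, s_next, n_steps, actions, pr):
    """TD-style update of C and C_tilde after child a finishes."""
    # best next child according to the pseudo-reward-aware estimates
    a_star = max(actions, key=lambda ap: C_tilde.get((i, s_next, ap), 0.0)
                                         + V.get((ap, s_next), 0.0))
    disc = gamma ** n_steps
    target = C.get((i, s_next, a_star), 0.0) + V.get((a_star, s_next), 0.0)
    C[(i, s, a)] = (1 - lr) * C.get((i, s, a), 0.0) + lr * disc * target
    pr_target = (pr + C_tilde.get((i, s_next, a_star), 0.0)
                 + V.get((a_star, s_next), 0.0))
    C_tilde[(i, s, a)] = ((1 - lr) * C_tilde.get((i, s, a), 0.0)
                          + lr * disc * pr_target)

V[("nav", "s1")] = 2.0                        # stubbed child value estimate
completion_update("Root", "s0", "get", "s1", 2, ["nav"], pr=0.0)
```

Only C̃ sees the pseudo-reward, so the "uncontaminated" C can still be used to recover the true value of the overall policy.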
Note that in MAXQ, the value function of a composite task can be (recursively) computed using the completion functions of subtasks and the rewards obtained by executing primitive actions, so we do not need to separately store or update the value functions (except for the primitive actions, where the value function is the reward). Finally, each primitive action maintains a count of how many times it has been executed and each composite task maintains a count of how many child actions have been taken.

When k (an algorithm parameter) steps have been executed in a composite task, BAYESIAN MAXQ calls RECOMPUTE VALUE to re-estimate the value and completion functions (the check on k is shown in RECOMPUTE VALUE, Line 2). When activated, this function recursively re-estimates the value/completion functions for all subtasks of the current task. At the level of a primitive action, this simply involves resampling the reward and transition parameters from the current posterior over models. For a composite task, we use the MAXQ-Q algorithm (Table 4 in [5]). We run this algorithm for Sim episodes, starting with the current subtask as the root, with the current pseudo-reward estimates (we explain below how these are obtained). This algorithm recursively updates the completion function of the task graph below the current task. Note that in this step, the subtasks with primitive actions use model-based updates. That is, when a primitive action is "executed" in such tasks, the currently sampled transition function (part of Θ in Line 5) is used to find the next state, and then the associated reward is used to update the completion function.
This is similar to Lines 12, 14 and 15 in BAYESIAN MAXQ, except that it uses the sampled model Θ instead of the real environment. After RECOMPUTE VALUE terminates, a new set of value/completion functions is available for BAYESIAN MAXQ to use to select actions.

Algorithm 1 BAYESIAN MAXQ
Input: Task i, State s, Update Interval k, Simulation Episodes Sim
Output: Next state s′, steps taken N, cumulative reward CR
1: if i is primitive then
2:   Execute i, observe r, s′
3:   Update current posterior parameters Ψ using (s, i, r, s′)
4:   Update current value estimate: V(i, s) ← (1 − α) · V(i, s) + α · r
5:   Count(i) ← Count(i) + 1
6:   return (s′, 1, r)
7: else
8:   N ← 0, CR ← 0, taskStack ← Stack() {i is composite}
9:   while i is not terminated do
10:    RECOMPUTE VALUE(i, k, Sim)
11:    a ← ε-greedy action from V(i, s)
12:    ⟨s′, Na, cr⟩ ← BAYESIAN MAXQ(a, s)
13:    taskStack.push(⟨a, s′, Na, cr⟩)
14:    a*_{s′} ← arg max_{a′} [C̃(i, s′, a′) + V(a′, s′)]
15:    C(i, s, a) ← (1 − α) · C(i, s, a) + α · γ^{Na} · [C(i, s′, a*_{s′}) + V(a*_{s′}, s′)]
16:    C̃(i, s, a) ← (1 − α) · C̃(i, s, a) + α · γ^{Na} · [R̃(i, s′) + C̃(i, s′, a*_{s′}) + V(a*_{s′}, s′)]
17:    s ← s′, CR ← CR + γ^N · cr, N ← N + Na, Count(i) ← Count(i) + 1
18:  end while
19:  UPDATE PSEUDO REWARD(taskStack, R̃(i, s′))
20:  return (s′, N, CR)
21: end if

Algorithm 2 RECOMPUTE VALUE
Input: Task i, Update Interval k, Simulation Episodes Sim
Output: Recomputed value and completion functions for the task graph below and including i
1: if Count(i) < k then
2:   return
3: end if
4: if i is primitive then
5:   Sample new transition and reward parameters Θ from the current posterior Ψ
6: else
7:   for all child tasks a of i do
8:     RECOMPUTE VALUE(a, k, Sim)
9:   end for
10:  for Sim episodes do
11:    s ← random nonterminal state of i
12:    Run MAXQ-Q(i, s, Θ, R̃)
13:  end for
14: end if
15: Count(i) ← 0

¹While we believe this description is accurate, unfortunately, due to language issues and some missing technical and experimental details in the cited article, we have been unable to replicate this work.

Next we discuss task pseudo-rewards (PRs). A PR is a value associated with a subtask exit that defines how "good" that exit is for that subtask. The ideal PR for an exit is the expected reward under the hierarchically optimal policy after exiting the subtask, until the global task (Root) ends; thus the PR is a value function. This PR would enable the subtask to choose the "right" exit in the context of what the rest of the task hierarchy is doing. In standard MAXQ, these have to be set manually. This is problematic because it presupposes (quite detailed) knowledge of the hierarchically optimal policy. Further, setting the wrong PRs can result in non-convergence or highly suboptimal policies. Sometimes this problem is sidestepped simply by setting all PRs to zero, resulting in recursively optimal policies. However, it is easy to construct examples where a recursively optimal policy is arbitrarily worse than the hierarchically optimal policy. For all these reasons, PRs are major "nuisance parameters" in the MAXQ framework.

Algorithm 3 UPDATE PSEUDO REWARD
Input: taskStack, Parent's pseudo-reward R̃p
1: tempCR ← R̃p, Na′ ← 0, cr′ ← 0
2: while taskStack is not empty do
3:   ⟨a, s, Na, cr⟩ ← taskStack.pop()
4:   tempCR ← γ^{Na′} · tempCR + cr′
5:   Update pseudo-reward posterior Φ for R̃(a, s) using (a, s, tempCR)
6:   Resample R̃(a, s) from Φ
7:   Na′ ← Na, cr′ ← cr
8: end while

What makes learning PRs tricky is that they are not only value functions, but also function as parameters of MAXQ. That is, setting different PRs essentially results in a new learning problem. For this reason, simply trying to learn PRs in a standard temporal difference (TD) way fails (as we show in our experiments). Fortunately, Bayesian RL allows us to address both these issues. First, we can treat value functions as probabilistic unknown parameters. Second, and more importantly, a key idea in Bayesian RL is the "lifting" of exploration to the space of task parameters. That is, instead of exploration through action selection, Bayesian RL can perform exploration by sampling task parameters. Thus treating a PR as an unknown Bayesian parameter also leads to exploration over the value of this parameter, until an optimal value is found. In this way, hierarchically optimal policies can be learned from scratch, a major advantage over the standard MAXQ setting.

To learn PRs, we again maintain a distribution over all such parameters, Φ, initially a prior. For simplicity, we only focus on tasks with multiple exits, since otherwise, a PR has no effect on the policy (though the value function changes).
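One way to realize such a per-exit distribution is a conjugate Normal-Gamma posterior over a window of observed returns, updated in closed form and then Thompson-sampled for a fresh PR estimate. The following sketch illustrates this for a single exit; the hyperparameters and window size are made up, not the paper's settings:

```python
# Illustrative Normal-Gamma posterior for one exit's pseudo-reward, kept
# over a window of recent observed returns. Hyperparameters are made up.
import random
from collections import deque

mu0, kappa0, alpha0, beta0 = 0.0, 1.0, 2.0, 2.0   # prior hyperparameters
window = deque(maxlen=20)                         # most recent observations

def posterior():
    """Closed-form Normal-Gamma update from the windowed observations."""
    n = len(window)
    if n == 0:
        return mu0, kappa0, alpha0, beta0
    xbar = sum(window) / n
    ss = sum((x - xbar) ** 2 for x in window)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + ss / 2.0 + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n

def sample_pr():
    """Thompson sample: draw a precision, then a mean, and use it as the PR."""
    mu_n, kappa_n, alpha_n, beta_n = posterior()
    tau = random.gammavariate(alpha_n, 1.0 / beta_n)   # precision ~ Gamma
    return random.gauss(mu_n, (1.0 / (kappa_n * tau)) ** 0.5)

window.append(48.0)        # a new observed return for this exit
pr_estimate = sample_pr()  # resampled PR used until the next update
```

Because the sampled PR fluctuates while the posterior is uncertain, the subtask keeps trying different exits; as the window fills with consistent returns, the samples concentrate and the preferred exit stabilizes.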
When a composite task executes, we keep track of each child task's execution in a stack. When the parent itself exits, we obtain a new observation of the PRs of each child by computing the discounted cumulative reward received after it exited, added to the current estimate of the parent's PR (Algorithm 3). This observation is used to update the current posterior over the child's PR. Since this is a value function estimate, early in the learning process the estimates are noisy. Following prior work [8], we use a window containing the most recent observations. When a new observation arrives, the oldest observation is removed, the new one is added and a new posterior estimate is computed. After updating the posterior, it is sampled to obtain a new PR estimate for the associated exit. This estimate is used where needed (in Algorithms 1 and 2) until the next posterior update. Combined with the model-based priors above, we hypothesize that this procedure, iterated till convergence, will produce a hierarchically optimal policy.

4 Empirical Evaluation

In this section, we evaluate our approach and test four hypotheses: First, does incorporating model-based priors help speed up the convergence of MAXQ to the optimal policy? Second, does the task hierarchy still matter if very good priors are available for primitive actions? Third, how does Bayesian MAXQ compare to standard (flat) Bayesian RL? Does Bayesian RL perform better (in terms of computational time) if a task hierarchy is available? Finally, can our approach effectively learn PRs and policies that are hierarchically optimal?

We first focus on evaluating the first three hypotheses using domains where a zero PR results in hierarchical optimality. To evaluate these hypotheses, we use two domains: the fickle version of Taxi-World [5] (625 states) and Resource-collection [13] (8265 states).
In the Taxi-World, the agent controls a taxi in a grid-world and has to pick up a passenger from a source location and drop them off at their destination. The state variables consist of the location of the taxi and the source and destination of the passenger. The actions available to the agent consist of navigation actions and actions to pickup and putdown the passenger. The agent gets a reward of +20 upon completing the task, a constant −1 reward for every action and a −10 penalty for an erroneous action. Further, each navigation action has a 15% chance of moving in each direction orthogonal to the intended move. In the Resource-collection domain, the agent collects resources (gold and wood) from a grid world map. Here the state variables consist of the location of the agent, what the agent is carrying, whether a goldmine or forest is adjacent to its current location and whether a desired gold or wood quota has been met. The actions available to the agent are to move to a specific location, chop gold or harvest wood, and to deposit the item it is carrying (if any). For each navigation action, the agent has a 30% chance of moving to a random location. In our experiments, the map contains two goldmines and two forests, each goldmine containing two units of gold and each forest two units of wood, and the gold and wood quota is set to three each. The agent gets a +50 reward when it meets the gold/wood quota, a constant −1 reward for every action and an additional −1 for erroneous actions (such as trying to deposit when it is not carrying anything).

²Task hierarchies for all domains are available in the supplementary material.

Figure 1: Performance on Taxi-World (top row) and Resource-collection (bottom). The x-axis shows episodes. The prefix "B-" denotes Bayesian, "Uninformed/Good" denotes the prior and "MB" denotes model-based. Left column: Bayesian methods, right: non-Bayesian methods, with Bayesian MAXQ for reference.

For the Bayesian methods, we use Dirichlet priors for the transition function parameters and Normal-Gamma priors for the reward function parameters. We use two priors: an uninformed prior, set to approximate a uniform distribution, and a "good" prior, where a previously computed model posterior is used as the "prior." The prior distributions we use are conjugate to the likelihood, so we can compute the posterior distributions in closed form. In general, this is not necessary; more complex priors could be used as long as we can sample from the posterior distribution.

The methods we evaluate are: (i) Flat Q, the standard Q-learning algorithm, (ii) MAXQ-0, the standard Q-learning algorithm for MAXQ with no PR, (iii) Bayesian model-based Q-learning with an uninformed prior and (iv) with a "good" prior, (v) Bayesian MAXQ (our proposed approach) with an uninformed prior and (vi) with a "good" prior, and (vii) RMAXQ [14]. In our implementation, the Bayesian model-based Q-learning uses the same code as the Bayesian MAXQ algorithm, with a "trivial" hierarchy consisting of the Root task with only the primitive actions as children. For the Bayesian methods, the update frequency k was set to 50 for Taxi-World and 25 for Resource-collection. Sim was set to 200 for Bayesian MAXQ for Taxi-World and 1000 for Bayesian model-based Q, and to 1000 for both for Resource-collection. For RMAXQ, the threshold sample size m was set to 5 following prior work [14]. The value iteration was terminated either after 300 loops or when the successive difference between iterations was less than 0.001. The theoretical version of RMAXQ requires updating and re-solving the model every step.
In practice for the larger problems, this is too time-consuming, so we re-solve the models every 10 steps. This is similar to the update frequency k for Bayesian MAXQ. The results are shown in Figure 1 (episodes on the x-axis).

From these results, comparing the Bayesian versions of MAXQ to standard MAXQ, we observe that for Taxi-World, the Bayesian version converges faster to the optimal policy even with the uninformed prior, while for Resource-collection, the convergence rates are similar. When a good prior is available, convergence is very fast (almost immediate) in both domains. Thus, the availability of model priors can help speed up convergence in many cases for HRL. We further observe that RMAXQ converges more slowly than MAXQ or Bayesian MAXQ, though it is much better than Flat Q. This is different from prior work [14]. This may be because our domains are more stochastic than the Taxi-world on which prior results [14] were obtained. We conjecture that, as the environment becomes more stochastic, errors in primitive model estimates may propagate into subtask models and hurt the performance of this algorithm. In their analysis [14], the authors noted that the error in the transition function for a composite task is a function of the total number of terminal states in the subtask. The error is also compounded as we move up the task hierarchy. This could be countered by increasing m, the sample size used to estimate model parameters.
This would improve the accuracy of the primitive model, but would further hurt the convergence rate of the algorithm.

Next, we compare the Bayesian MAXQ approach to "flat" Bayesian model-based Q learning. We note that in Taxi-World, with uninformed priors, though the "flat" method initially does worse, it soon catches up to standard MAXQ and then to Bayesian MAXQ. This is probably because in this domain, the primitive models are relatively easy to acquire, and the task hierarchy provides no additional leverage. For Resource-collection, however, even with a good prior, "flat" Bayesian model-based Q does not converge. The difference is that in this case, the task hierarchy encodes extra information that cannot be deduced just from the models. In particular, the task hierarchy tells the agent that good policies consist of gold/wood collection moves followed by deposit moves. Since the reward structure in this domain is very sparse, it is difficult to deduce this even if very good models are available. Taken together, these results show that task hierarchies and model priors can be complementary: in general, Bayesian MAXQ outperforms both flat Bayesian RL and MAXQ (in speed of convergence, since here MAXQ can learn the hierarchically optimal policy).

Table 1: Time for 500 episodes, Taxi-World.
Method | Time (s)
Bayesian MaxQ, Uninformed Prior | 205
Bayesian Model-based Q, Uninformed Prior | 4684
Bayesian MaxQ, Good Prior | 229
Bayesian Model-based Q, Good Prior | 3089
Bayesian Model-based Q, Good Prior & Comparable Simulations | 4006
RMAXQ | 96
MAXQ | 2.06
Flat Q | 1.77

Next, we compare the time taken by the different approaches in our experiments in Taxi-World (Table 1). As expected, the Bayesian RL approaches are significantly slower than the non-Bayesian approaches. Further, among non-Bayesian approaches, the hierarchical approaches (MAXQ and RMAXQ) are slower than the non-hierarchical flat Q. Out of the Bayesian methods, however, the Bayesian MAXQ approaches are significantly faster than the flat Bayesian model-based approaches. This is because for the flat case, during the simulation in RECOMPUTE VALUE, a much larger task needs to be solved, while the Bayesian MAXQ approach is able to take into account the structure of the hierarchy to only simulate subtasks as needed, which ends up being much more efficient. However, we note that we allowed the flat Bayesian model-based approach 1000 episodes of simulation as opposed to 200 for Bayesian MAXQ. Clearly this increases the time taken for the flat cases. But at the same time, this is necessary: the "Comparable Simulations" row (and curve in Figure 1 top left) shows that, if the simulations are reduced to 250 episodes for this approach, the resulting values are no longer reliable and the performance of the Bayesian flat approach drops sharply. Notice that while Flat Q runs faster than MAXQ (because of the additional "bookkeeping" overhead due to the task hierarchy), Bayesian MAXQ runs much faster than Bayesian model-based Q. Thus, taking advantage of the hierarchical task decomposition helps reduce the computational cost of Bayesian RL.

Finally, we evaluate how well our approach estimates PRs. Here we use two domains: a Modified-Taxi-World and a Hallway domain [5, 21] (4320 states). In Modified-Taxi-World, we allow dropoffs at any one of the four locations and do not provide a reward for task termination. Thus the Navigate subtask needs a PR (corresponding to the correct dropoff location) to learn a good policy. The Hallway domain consists of a maze with a large scale structure of hallways and intersections. The agent has stochastic movement actions. For these experiments, we use uninformed priors on the environment model.
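The two kinds of model prior used in these experiments (an uninformed, approximately uniform Dirichlet, and a "good" prior built from a previously computed posterior) could be instantiated as in the following sketch; the shapes, counts and rescaling knob are hypothetical, not the actual domain sizes:

```python
# Sketch of uninformed vs. "good" Dirichlet priors over transitions.
# Shapes and counts are illustrative, not the paper's actual domains.
import numpy as np

S, A = 5, 3
uninformed = np.ones((S, A, S))              # Dirichlet(1, ..., 1): ~uniform

def good_prior(previous_counts, strength=1.0):
    """Reuse an earlier run's posterior counts, optionally rescaled."""
    return 1.0 + strength * previous_counts  # still valid Dirichlet parameters

old_counts = np.zeros((S, A, S))
old_counts[:, :, 0] = 40.0                   # pretend s'=0 dominated earlier runs
prior = good_prior(old_counts, strength=0.5)
```

Sampling transition distributions from either prior then proceeds exactly as with any Dirichlet posterior, so the learning code is unchanged between the two settings.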
The PR Gaussian-Gamma priors are set to prefer each exit from\n\n7\n\n\fFigure 2: Performance on Modi\ufb01ed-Taxi-World (top row) and Hallway (bottom). \u201cB-\u201d: Bayesian,\n\u201cPR\u201d: Pseudo Reward. Left: Bayesian methods, right: non-Bayesian methods, with Bayesian MAXQ\nas reference. The x-axis is episodes. The bottom right \ufb01gure has the same legend as the top right.\n\na subtask equally. The baselines we use are: (i) Bayesian MAXQ and MAXQ with \ufb01xed zero PR, (ii)\nBayesian MAXQ and MAXQ with \ufb01xed manually set PR, (iii) \ufb02at Q, (iv) ALISPQ [6] and (v) MAXQ\nwith a non-Bayesian PR update. This last method tracks PR just as our approach; however, instead\nof a Bayesian update, it updates the PR using a temporal difference update, treating it as a simple\nvalue function. The results are shown in Figure 2 (episodes on x-axis).\nFrom these results, we \ufb01rst observe that the methods with zero PR always do worse than those with\n\u201cproper\u201d PR, indicating that in these cases the recursively optimal policy is not the hierarchically\noptimal policy. When a PR is manually set, in both domain, MAXQ converges to better policies. We\nobserve that in each case, the Bayesian MAXQ approach is able to learn a policy that is as good, start-\ning with no pseudo rewards; further, its convergence rates are often better. Further, we observe that\nthe simple TD update strategy (MAXQ Non-Bayes PR in Figure 2) fails in both cases\u2014in Modi\ufb01ed-\nTaxi-World, it is able to learn a policy that is approximately as good as a recursively optimal policy,\nbut in the Hallway domain, it fails to converge completely, indicating that this strategy cannot gen-\nerally learn PRs. Finally, we observe that the tripartite Q-decomposition of ALISPQ is also able to\ncorrectly learn hierarchically optimal policies, however, it converges slowly compared to Bayesian\nMAXQ or MAXQ with manual PRs. 
This is especially visible in the Hallway domain, where there are not many opportunities for state abstraction. We believe this is likely because ALISPQ estimates entire Q-functions rather than just the PRs. In a sense, it is doing more work than is needed to capture the hierarchically optimal policy: an exact Q-function may not be needed to capture the preference for the best exit; rather, a value that assigns it a sufficiently high reward compared to the other exits would suffice. Taken together, these results indicate that incorporating Bayesian priors into MAXQ can successfully learn PRs from scratch and produce hierarchically optimal policies.

5 Conclusion

In this paper, we have proposed an approach to incorporating probabilistic priors on environment models and task pseudo-rewards into HRL by extending the MAXQ framework. Our experiments indicate that several synergies exist between HRL and Bayesian RL, and that combining them is fruitful. In future work, we plan to investigate approximate model and value representations, as well as multi-task RL to learn the priors.

References

[1] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[2] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[3] Andrew G. Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning.
Discrete Event Dynamic Systems, 13(4):341-379, 2003.

[4] Martin Stolle and Doina Precup. Learning options in reinforcement learning. Volume 2371 of Lecture Notes in Computer Science, pages 212-223. Springer, 2002.

[5] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.

[6] D. Andre and S. Russell. State abstraction for programmable reinforcement learning agents. In Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI), 2002.

[7] R. Dearden, N. Friedman, and D. Andre. Model-based Bayesian exploration. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 1999.

[8] R. Dearden, N. Friedman, and S. Russell. Bayesian Q-learning. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 1998.

[9] Y. Engel, S. Mannor, and R. Meir. Bayes meets Bellman: the Gaussian process approach to temporal difference learning. In Proceedings of the Twentieth International Conference on Machine Learning, 2003.

[10] Mohammad Ghavamzadeh and Yaakov Engel. Bayesian policy gradient algorithms. In Advances in Neural Information Processing Systems 19. MIT Press, 2007.

[11] Alessandro Lazaric and Mohammad Ghavamzadeh. Bayesian multi-task reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning, 2010.

[12] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: a hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015-1022, New York, NY, USA, 2007. ACM.

[13] N. Mehta, S. Ray, P. Tadepalli, and T. Dietterich.
Automatic discovery and transfer of MAXQ hierarchies. In Andrew McCallum and Sam Roweis, editors, Proceedings of the 25th International Conference on Machine Learning, pages 648-655. Omnipress, 2008.

[14] Nicholas K. Jong and Peter Stone. Hierarchical model-based reinforcement learning: R-MAX + MAXQ. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[15] Ronen I. Brafman and Moshe Tennenholtz. R-MAX: a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2001.

[16] B. Marthi, S. Russell, and D. Andre. A compact, hierarchically optimal Q-function decomposition. In Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence, 2006.

[17] M. Ghavamzadeh and Y. Engel. Bayesian actor-critic algorithms. In Zoubin Ghahramani, editor, Proceedings of the 24th Annual International Conference on Machine Learning, pages 297-304. Omnipress, 2007.

[18] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25:285-294, 1933.

[19] M. J. A. Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, 2000.

[20] Zhaohui Dai, Xin Chen, Weihua Cao, and Min Wu. Model-based learning with Bayesian and MAXQ value function decomposition for hierarchical task. In Proceedings of the 8th World Congress on Intelligent Control and Automation, 2010.

[21] Ronald Edward Parr. Hierarchical Control and Learning for Markov Decision Processes. PhD thesis, 1998.