{"title": "Nonparametric Bayesian Policy Priors for Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 532, "page_last": 540, "abstract": "We consider reinforcement learning in partially observable domains where the agent can query an expert for demonstrations. Our nonparametric Bayesian approach combines model knowledge, inferred from expert information and independent exploration, with policy knowledge inferred from expert trajectories. We introduce priors that bias the agent towards models with both simple representations and simple policies, resulting in improved policy and model learning.", "full_text": "Nonparametric Bayesian Policy Priors for\n\nReinforcement Learning\n\nFinale Doshi-Velez, David Wingate, Nicholas Roy and Joshua Tenenbaum\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\n{finale,wingated,nickroy,jbt}@csail.mit.edu\n\nAbstract\n\nWe consider reinforcement learning in partially observable domains where the\nagent can query an expert for demonstrations. Our nonparametric Bayesian ap-\nproach combines model knowledge, inferred from expert information and inde-\npendent exploration, with policy knowledge inferred from expert trajectories. We\nintroduce priors that bias the agent towards models with both simple representa-\ntions and simple policies, resulting in improved policy and model learning.\n\n1\n\nIntroduction\n\nWe address the reinforcement learning (RL) problem of \ufb01nding a good policy in an unknown,\nstochastic, and partially observable domain, given both data from independent exploration and ex-\npert demonstrations. The \ufb01rst type of data, from independent exploration, is typically used by\nmodel-based RL algorithms [1, 2, 3, 4] to learn the world\u2019s dynamics. 
These approaches build models to predict observation and reward data given an agent's actions; the action choices themselves, since they are made by the agent, convey no statistical information about the world. In contrast, imitation and inverse reinforcement learning [5, 6] use expert trajectories to learn reward models. These approaches typically assume that the world's dynamics are known.

We consider cases where we have data from both independent exploration and expert trajectories. Data from independent exploration gives direct information about the dynamics, while expert demonstrations show outputs of good policies and thus provide indirect information about the underlying model. Similarly, rewards observed during independent exploration provide indirect information about good policies. Because dynamics and policies are linked through a complex, nonlinear function, leveraging information about both these aspects at once is challenging. However, we show that using both kinds of data improves model-building and control performance.

We use a Bayesian model-based RL approach to take advantage of both forms of data, applying Bayes rule to write a posterior over models M given data D as p(M|D) ∝ p(D|M)p(M). In previous work [7, 8, 9, 10], the model prior p(M) was defined as a distribution directly on the dynamics and reward models, making it difficult to incorporate expert trajectories. Our main contribution is a new approach to defining this prior: our prior uses the assumption that the expert knew something about the world model when computing his optimal policy.
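As a concrete toy illustration of the posterior p(M|D) ∝ p(D|M)p(M), the sketch below scores a small, discrete set of candidate world models against exploration data; the two models, their observation probabilities, and the data are all hypothetical stand-ins for the nonparametric machinery used later in the paper:

```python
import math

# Two hypothetical candidate world models: each gives p(observation | action).
models = {
    "sticky":   {"left": {"bump": 0.8, "clear": 0.2}, "right": {"bump": 0.2, "clear": 0.8}},
    "slippery": {"left": {"bump": 0.5, "clear": 0.5}, "right": {"bump": 0.5, "clear": 0.5}},
}
prior = {"sticky": 0.5, "slippery": 0.5}

# Exploration data: (action, observation) pairs.  Actions are chosen by the
# agent, so the likelihood conditions on them rather than modeling them.
data = [("left", "bump"), ("left", "bump"), ("right", "clear"), ("left", "bump")]

def log_likelihood(model, data):
    # p(D|M): product over the conditional observation probabilities.
    return sum(math.log(model[a][o]) for a, o in data)

# Bayes rule: p(M|D) proportional to p(D|M) p(M), then normalize.
unnorm = {m: math.exp(log_likelihood(spec, data)) * prior[m] for m, spec in models.items()}
z = sum(unnorm.values())
posterior = {m: w / z for m, w in unnorm.items()}
print(posterior)  # the "sticky" model explains the repeated bumps better
```

The same update applies unchanged when the candidate set is replaced by samples from a nonparametric prior.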
Different forms of these priors lead us to three different learning algorithms: (1) if we know the expert's planning algorithm, we can sample models from p(M|D), invoke the planner, and weight each model by how likely it is that the planner's policy generated the expert's data; (2) if, instead of a planning algorithm, we have a policy prior, we can similarly weight world models according to how likely it is that probable policies produced the expert's data; and (3) we can search directly in the policy space, guided by probable models.

We focus on reinforcement learning in discrete action and observation spaces. In this domain, one of our key technical contributions is the insight that the Bayesian approach used for building models of transition dynamics can also be used to define policy priors, if we exchange the typical roles of actions and observations. For example, algorithms for learning partially observable Markov decision processes (POMDPs) build models that output observations and take in actions as exogenous variables. If we reverse their roles, the observations become the exogenous variables, and the model-learning algorithm is exactly equivalent to learning a finite-state controller [11]. By using nonparametric priors [12], our agent can scale the sophistication of its policies and world models based on the data.

Our framework has several appealing properties. First, our choices for the policy prior and the world model prior can be viewed as a joint prior which introduces a bias for world models that are both simple and easy to control. This bias is especially beneficial in the case of direct policy search, where it is easier to search directly for good controllers than it is to first construct a complete POMDP model and then plan with it.
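The action/observation role swap can be made concrete with a single input-output sequence learner used twice. The counting "learner" below is a deliberately minimal stand-in for the HDP-HMM machinery, not the paper's actual algorithm; the point is only that one routine serves both roles:

```python
from collections import defaultdict

def fit_io_model(pairs):
    """Estimate p(output | input) from (input, output) pairs by counting.
    A toy stand-in for a full sequence learner."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, y in pairs:
        counts[x][y] += 1
    return {x: {y: c / sum(ys.values()) for y, c in ys.items()}
            for x, ys in counts.items()}

# Hypothetical history of (action, observation) pairs.
history = [("north", "wall"), ("north", "wall"), ("east", "open"), ("east", "open")]

# Model learning: actions are exogenous inputs, observations are outputs.
world_model = fit_io_model(history)

# Policy learning: swap the roles -- observations in, actions out, which is
# exactly the structure of a (memoryless) finite-state controller.
policy = fit_io_model([(o, a) for a, o in history])
print(world_model["north"], policy["wall"])
```

The full method applies the same symmetry to the iPOMDP learner, where the swapped model carries latent controller nodes rather than being memoryless.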
Our method can also be used with approximately optimal expert data; in these cases the expert data bias which models are likely, but do not place hard constraints on the model. For example, in Sec. 4 we present an application where we extract the essence of a good controller from good, but not optimal, trajectories generated by a randomized planning algorithm.

2 Background

A partially observable Markov decision process (POMDP) model M is an n-tuple {S, A, O, T, Ω, R, γ}. S, A, and O are sets of states, actions, and observations. The state transition function T(s′|s, a) defines the distribution over next states s′ to which the agent may transition after taking action a from state s. The observation function Ω(o|s′, a) is a distribution over observations o that may occur in state s′ after taking action a. The reward function R(s, a) specifies the immediate reward for each state-action pair, while γ ∈ [0, 1) is the discount factor. We focus on learning in discrete state, observation, and action spaces.

Bayesian RL In Bayesian RL, the agent starts with a prior distribution P(M) over possible POMDP models. Given data D from an unknown world, the agent can compute a posterior over possible worlds P(M|D) ∝ P(D|M)P(M). The model prior can encode both vague notions, such as "favor simpler models," and strong structural assumptions, such as topological constraints among states. Bayesian nonparametric approaches are well-suited for partially observable environments because they can also infer the dimensionality of the underlying state space.
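The tuple above implies the standard POMDP belief update, b′(s′) ∝ Ω(o|s′, a) Σ_s T(s′|s, a) b(s), which the planners in this paper rely on; a minimal discrete sketch, where the two-state world and its probabilities are hypothetical:

```python
# Hypothetical two-state POMDP fragment: T[a][s][s'] and Omega[a][s'][o].
T = {"go": {"A": {"A": 0.3, "B": 0.7},
            "B": {"A": 0.1, "B": 0.9}}}
Omega = {"go": {"A": {"ping": 0.9, "silence": 0.1},
                "B": {"ping": 0.2, "silence": 0.8}}}

def belief_update(b, a, o):
    """b'(s') proportional to Omega(o|s',a) * sum_s T(s'|s,a) b(s)."""
    bp = {}
    for s2 in b:
        pred = sum(T[a][s][s2] * b[s] for s in b)  # one-step prediction
        bp[s2] = Omega[a][s2][o] * pred            # weight by the observation
    z = sum(bp.values())
    return {s: p / z for s, p in bp.items()}

b = belief_update({"A": 0.5, "B": 0.5}, "go", "ping")
print(b)  # mass shifts toward state A, which makes "ping" more likely
```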
For example, the recent infinite POMDP (iPOMDP) [12] model, built from HDP-HMMs [13, 14], places a prior over POMDPs with infinitely many states but introduces a strong locality bias towards exploring only a few.

The decision-theoretic approach to acting in the Bayesian RL setting is to treat the model M as additional hidden state in a larger "model-uncertainty" POMDP and plan in the joint space of models and states. Here, P(M) represents a belief over models. Computing a Bayes-optimal policy is computationally intractable; methods approximate the optimal policy by sampling a single model and following that model's optimal policy for a fixed period of time [8]; by sampling multiple models and choosing actions based on a vote or stochastic forward search [1, 4, 12, 2]; and by trying to approximate the value function for the full model-uncertainty POMDP analytically [7]. Other approaches [15, 16, 9] try to balance the off-line computation of a good policy (the computational complexity) and the cost of getting data online (the sample complexity).

Finite State Controllers Another possibility for choosing actions, including in our partially observable reinforcement learning setting, is to consider a parametric family of policies and attempt to estimate the optimal policy parameters from data. This is the approach underlying, for example, much work on policy gradients. In this work, we focus on the popular case of a finite-state controller, or FSC [11]. An FSC consists of the n-tuple {N, A, O, π, β}. N, A, and O are sets of nodes, actions, and observations. The node transition function β(n′|n, o) defines the distribution over next nodes n′ to which the agent may transition after receiving observation o in node n.
The policy function π(a|n) is a distribution over the actions that the finite state controller may output in node n. Nodes are discrete; we again focus on discrete observation and action spaces.

3 Nonparametric Bayesian Policy Priors

We now describe our framework for combining world models and expert data. Recall that our key assumption is that the expert used knowledge about the underlying world to derive his policy. Fig. 1 shows the two graphical models that summarize our approaches.

Figure 1: Two graphical models of expert data generation. Left: the prior only addresses world dynamics and rewards. Right: the prior addresses both world dynamics and controllable policies.

Let M denote the (unknown) world model. Combined with the world model M, the expert's policy π_e and the agent's policy π_a produce the expert's and agent's data D_e and D_a. The data consist of a sequence of histories, where a history h_t is a sequence of actions a_1, ..., a_t, observations o_1, ..., o_t, and rewards r_1, ..., r_t. The agent has access to all histories, but the true world model and optimal policy are hidden.

Both graphical models assume that a particular world M is sampled from a prior over POMDPs, g_M(M). In what would be the standard application of Bayesian RL with expert data (Fig. 1(a)), the prior g_M(M) fully encapsulates our initial belief over world models. An expert, who knows the true world model M, executes a planning algorithm plan(M) to construct an optimal policy π_e. The expert then executes the policy to generate expert data D_e, distributed according to p(D_e|M, π_e), where π_e = plan(M).

However, the graphical model in Fig. 1(a) does not easily allow us to encode a prior bias toward more controllable world models. In Fig. 1(b), we introduce a new graphical model in which we allow additional parameters in the distribution p(π_e).
In particular, we choose a distribution of the form

p(π_e|M) ∝ f_M(π_e) g_π(π_e),     (1)

where we interpret g_π(π_e) as a prior over policies and f_M(π_e) as a likelihood of a policy given a model. We can then write the distribution over world models as

p(M) ∝ ∫_{π_e} f_M(π_e) g_π(π_e) g_M(M) dπ_e.     (2)

If f_M(π_e) is a delta function on plan(M), then the integral in Eq. 2 reduces to

p(M) ∝ g_π(π_e^M) g_M(M),     (3)

where π_e^M = plan(M), and we see that we have a prior that provides input on both the world's dynamics and the world's controllability. For example, if the policy class is the set of finite state controllers discussed in Sec. 2, the policy prior g_π(π_e) might encode preferences for a smaller number of nodes used in the policy, while g_M(M) might encode preferences for a smaller number of visited states in the world. The function f_M(π_e) can also be made more general, to encode how likely it is that the expert uses the policy π_e given world model M.

Finally, we note that p(D_e|M, π) factors as p(D_e^a|π) p(D_e^{o,r}|M), where D_e^a are the actions in the histories D_e and D_e^{o,r} are the observations and rewards. Therefore, the conditional distribution over world models given data D_e and D_a is

p(M|D_e, D_a) ∝ p(D_e^{o,r}, D_a|M) g_M(M) ∫_{π_e} p(D_e^a|π_e) g_π(π_e) f_M(π_e) dπ_e.     (4)

The model in Fig. 1(a) corresponds to setting g_π(π_e) to a uniform prior. Similarly, the conditional distribution over policies given data D_e and D_a is

p(π_e|D_e, D_a) ∝ g_π(π_e) p(D_e^a|π_e) ∫_M f_M(π_e) p(D_e^{o,r}, D_a|M) g_M(M) dM.     (5)

We next describe three inference approaches for using Eqs.
4 and 5 to learn.

#1: Uniform Policy Priors (Bayesian RL with Expert Data). If f_M(π_e) = δ(plan(M)) and we believe that all policies are equally likely (graphical model 1(a)), then we can leverage the expert's data simply by considering, for a particular world model M, how well that model's policy plan(M) matches the expert's actions. Eq. 4 allows us to compute a posterior over world models that accounts for the quality of this match. We can then use that posterior as part of a planner by using it to evaluate candidate actions. The expected value of an action¹ q(a) with respect to this posterior is given by:

E[q(a)] = ∫_M q(a|M) p(M|D_e^{o,r}, D_a) = ∫_M q(a|M) p(D_e^{o,r}, D_a|M) g_M(M) p(D_e^a|plan(M)) dM.     (6)

We assume that we can draw samples from p(M|D_e^{o,r}, D_a) ∝ p(D_e^{o,r}, D_a|M) g_M(M), a common assumption in Bayesian RL [12, 9]; for our iPOMDP-based case, we can draw these samples using the beam sampler of [17]. We then weight those samples by p(D_e^a|π_e), where π_e = plan(M), to yield the importance-weighted estimator

E[q(a)] ≈ Σ_i q(a|M_i) p(D_e^a|M_i, π_e),   M_i ∼ p(M|D_e^{o,r}, D_a).

Finally, we can also sample values for q(a) by first sampling a world model from the importance-weighted distribution above and recording the q(a) value associated with that model.

#2: Policy Priors with Model-based Inference. The uniform policy prior implied by standard Bayesian RL does not allow us to encode prior biases about the policy. With a more general prior (graphical model 1(b) in Fig. 1), the expectation in Eq. 6 becomes

E[q(a)] = ∫_M q(a|M) p(D_e^{o,r}, D_a|M) g_M(M) g_π(plan(M)) p(D_e^a|plan(M)) dM,     (7)

where we still assume that the expert uses an optimal policy, that is, f_M(π_e) = δ(plan(M)). Using Eq.
7 can result in somewhat brittle and computationally intensive inference, however, as we must compute π_e for each sampled world model M. It also assumes that the expert used the optimal policy, whereas a more realistic assumption might be that the expert uses a near-optimal policy. We now discuss an alternative that relaxes f_M(π_e) = δ(plan(M)): let f_M(π_e) be a function that prefers policies that achieve higher rewards in world model M, f_M(π_e) ∝ exp{V(π_e|M)}, where V(π_e|M) is the value of the policy π_e on world M, indicating a belief that the expert tends to sample policies that yield high value. Substituting this f_M(π_e) into Eq. 4, the expected value of an action is

E[q(a)] = ∫_{M,π_e} q(a|M) p(D_e^a|π_e) exp{V(π_e|M)} g_π(π_e) p(D_e^{o,r}, D_a|M) g_M(M).

We again assume that we can draw samples from p(M|D_e^{o,r}, D_a) ∝ p(D_e^{o,r}, D_a|M) g_M(M), and additionally assume that we can draw samples from p(π_e|D_e^a) ∝ p(D_e^a|π_e) g_π(π_e), yielding:

E[q(a)] ≈ Σ_i q(a|M_i) Σ_j exp{V(π_ej|M_i)},   M_i ∼ p(M|D_e^{o,r}, D_a),  π_ej ∼ p(π_e|D_e^a).     (8)

As in the case of standard Bayesian RL, we can also use our weighted world models to draw samples from q(a).

#3: Policy Priors with Joint Model-Policy Inference. While the model-based inference for policy priors is correct, importance weighting often suffers when the proposal distribution is not near the true posterior. In particular, sampling world models and policies (both very high dimensional objects) from distributions that ignore large parts of the evidence means that large numbers of samples may be needed to get accurate estimates.
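The doubly importance-weighted estimator of Eq. 8 can be sketched as follows; the sampled "models," "policies," value function V, and action-value function q below are all toy stand-ins for the real sampled objects:

```python
import math

# Toy stand-ins for samples M_i ~ p(M | D_e^{o,r}, D_a) and
# pi_j ~ p(pi_e | D_e^a): here each sample is just a scalar parameter.
models = [0.2, 0.5, 0.9]   # hypothetical sampled world models
policies = [0.3, 0.6]      # hypothetical sampled expert policies

def V(pi, M):
    # Hypothetical value of policy pi in model M (higher when they "match").
    return 1.0 - abs(pi - M)

def q(a, M):
    # Hypothetical action value under model M.
    return M if a == "stay" else 1.0 - M

def expected_q(a):
    """E[q(a)] ~ sum_i q(a|M_i) * sum_j exp{V(pi_j|M_i)}, normalized."""
    total, norm = 0.0, 0.0
    for M in models:
        w = sum(math.exp(V(pi, M)) for pi in policies)  # policy-based weight
        total += q(a, M) * w
        norm += w
    return total / norm

print(expected_q("stay"), expected_q("go"))
```

Models that make the sampled expert policies look valuable receive more weight, which is exactly the bias Eq. 8 encodes.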
We now describe an inference approach that alternates sampling models and policies, which both avoids importance sampling and can be used even in cases where f_M(π_e) = δ(plan(M)).¹ Once we have a set of sampled models, we can compute the expectation E[q(a)] simply as the average of the action values q(a|M_i) over the sampled models. The inference proceeds in two alternating stages. First, we sample a new policy given a sampled model. Given a world model, Eq. 5 becomes

p(π_e|D_e, D_a, M) ∝ g_π(π_e) p(D_e^a|π_e) f_M(π_e),     (9)

where making g_π(π_e) and p(D_e^a|π_e) conjugate is generally an easy design choice; for example, in Sec. 3.1, we use the iPOMDP [12] as a conjugate prior over policies encoded as finite state controllers. We then approximate f_M(π_e) with a function in the same conjugate family: in the case of the iPOMDP prior and count data D_e^a, we also approximate f_M with a set of Dirichlet counts scaled by some temperature parameter a. As a is increased, we recover the desired f_M(π_e) = δ(plan(M)); the initial approximation speeds up the inference and does not affect its correctness. Next, we sample a new world model given the policy. Given a policy, Eq. 4 reduces to

p(M|D_e, D_a) ∝ p(D_e^{o,r}, D_a|M) g_M(M) f_M(π_e).     (10)

We apply a Metropolis-Hastings (MH) step to sample new world models, drawing a new model M′ from p(D_e^{o,r}, D_a|M) g_M(M) and accepting it with ratio f_{M′}(π_e) / f_M(π_e). If f_M(π_e) is highly peaked, then this ratio is likely to be ill-defined; as when sampling policies, we apply a tempering scheme in the inference to smooth f_M(π_e).

¹We omit the belief over world states b(s) from the equations that follow for clarity; all references to q(a|M) are q(a|b_M(s), M).
For example, if we desired f_M(π_e) = δ(plan(M)), then we could use the smoothed version f̂_M(π_e) ∝ exp(−b · (V(π_e|M) − V(π_e^M|M))²), where b is a temperature parameter for the inference. While applying MH can suffer from the same issues as the importance sampling in the model-based approach, Gibbs sampling new policies removes one set of proposal distributions from the inference, resulting in better estimates with fewer samples.

3.1 Priors over State Controller Policies

We now turn to the definition of the policy prior p(π_e). In theory, any policy prior can be used, but there are some practical considerations. Mathematically, the policy prior serves as a regularizer to avoid overfitting the expert data, so it should encode a preference toward simple policies. It should also allow computationally tractable sampling from the posterior p(π_e|D_e) ∝ p(D_e|π_e) p(π_e).

In discrete domains, one choice for the policy prior (as well as the model prior) is the iPOMDP [12]. To use the iPOMDP as a model prior (its intended use), we treat actions as inputs and observations as outputs. The iPOMDP posits that there are an infinite number of states s but that a few popular states are visited most of the time; the beam sampler [17] can efficiently draw samples of state transition, observation, and reward models for the visited states. Joint inference over the model parameters T, Ω, R and the state sequence s allows us to infer the number of visited states from the data.

To use the iPOMDP as a policy prior, we simply reverse the roles of actions and observations, treating the observations as inputs and the actions as outputs. Now, the iPOMDP posits that there is a state controller with an infinite number of nodes n, but probable policies use only a small subset of the nodes a majority of the time.
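A finite-state controller with the structure of Sec. 2 (nodes, π(a|n), β(n′|n, o)) can be sketched directly; the two-node controller below, encoding the gridworld-style rule "go east until a wall, then go south," is a hypothetical example, not a learned policy:

```python
import random

class FiniteStateController:
    """Minimal FSC: pi[n][a] is p(a|n); beta[n][o][n'] is p(n'|n,o)."""
    def __init__(self, pi, beta, start):
        self.pi, self.beta, self.node = pi, beta, start

    def act(self, rng):
        actions, probs = zip(*self.pi[self.node].items())
        return rng.choices(actions, probs)[0]

    def observe(self, o, rng):
        nodes, probs = zip(*self.beta[self.node][o].items())
        self.node = rng.choices(nodes, probs)[0]

# Two policy nodes can suffice even when the world has many states.
pi = {0: {"east": 1.0}, 1: {"south": 1.0}}
beta = {0: {"clear": {0: 1.0}, "wall": {1: 1.0}},
        1: {"clear": {1: 1.0}, "wall": {1: 1.0}}}
fsc = FiniteStateController(pi, beta, start=0)

rng = random.Random(0)
for o in ["clear", "clear", "wall"]:
    fsc.act(rng)
    fsc.observe(o, rng)
print(fsc.act(rng))  # after seeing a wall, the controller outputs "south"
```

The node here plays exactly the role of the "policy state" discussed below: a summary of past observations sufficient to predict actions.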
We perform joint inference over the node transition and policy parameters β and π as well as the visited nodes n. The 'policy state' representation learned is not the world state; rather, it is a summary of previous observations that is sufficient to predict actions. Assuming that the training action sequences are drawn from the optimal policy, the learner will learn just enough "policy state" to control the system optimally. As in the model prior application, using the iPOMDP as a policy prior biases the agent towards simpler policies (those that visit fewer nodes) but allows the number of nodes to grow with new expert experience.

3.2 Consistency and Correctness

In all three inference approaches, the sampled models and policies are an unbiased representation of the true posterior and are consistent: in the limit of infinite samples, we will recover the true model and policy posteriors conditioned on their respective data D_a, D_e^{o,r}, and D_e^a. There are some mild conditions on the world and policy priors to ensure consistency: since the policy prior and model prior are specified independently, we require that there exist models for which both the policy prior and model prior are non-zero in the limit of data.
Formally, we also require that the expert provide optimal trajectories; in practice, we see that this assumption can be relaxed.

Figure 2: Learning curves for the multicolored gridworld (left) and snake (right). Error bars are 95% confidence intervals of the mean. On the far right is the snake robot.

3.3 Planning with Distributions over Policies and Models

All the approaches in Sec. 3 output samples of models or policies to be used for planning. As noted in Section 2, computing the Bayes-optimal action is typically intractable. Following similar work [4, 1, 2, 12], we interpret these samples as beliefs. In the model-based approaches, we first solve each model (all of which are generally small) using standard POMDP planners. During the testing phase, the internal belief state of the models (in the model-based approaches) or the internal node state of the policies (in the policy-based approaches) is updated after each action-observation pair. Models are also reweighted using standard importance weights so that they continue to be an unbiased approximation of the true belief.
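This test-time reweighting of sampled models is a standard importance-weight update, w_i ∝ w_i · p(o|M_i, a); a minimal sketch, where the two single-distribution "models" are toy stand-ins:

```python
# Two hypothetical sampled models, each predicting observation probabilities.
models = [
    {"name": "M1", "p_obs": {"ping": 0.9, "silence": 0.1}},
    {"name": "M2", "p_obs": {"ping": 0.3, "silence": 0.7}},
]
weights = [0.5, 0.5]

def reweight(weights, models, obs):
    """w_i proportional to w_i * p(obs | M_i), renormalized, so the sample
    set remains an (approximately) unbiased representation of the belief."""
    w = [wi * m["p_obs"][obs] for wi, m in zip(weights, models)]
    z = sum(w)
    return [wi / z for wi in w]

for obs in ["ping", "ping"]:
    weights = reweight(weights, models, obs)
print(weights)  # mass shifts toward M1, which predicted the pings
```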
Actions are chosen by first selecting, depending on the approach, a model or policy based on their weights, and then performing its most preferred action. While this approach is clearly approximate (it considers state uncertainty but not model uncertainty), we found empirically that this simple, fast approach to action selection produced nearly identical results to the much slower (but asymptotically Bayes-optimal) stochastic forward search in [12].²

4 Experiments

We first describe a pair of demonstrations that show two important properties of using policy priors: (1) policy priors can be useful even in the absence of expert data, and (2) our approach works even when the expert trajectories are not optimal. We then compare policy priors with the basic iPOMDP [12] and a finite-state model learner trained with EM on several standard problems. In all cases, the tasks were episodic. Since episodes could be of variable length (specifically, experts generally completed the task in fewer iterations), we allowed each approach N = 2500 iterations, or interactions with the world, during each learning trial. The agent was provided with an expert trajectory with probability 0.5·n/N, where n was the current amount of experience. No expert trajectories were provided in the last quarter of the iterations. We ran each approach for 10 learning trials.

Models and policies were updated every 100 iterations, and each episode was capped at 50 iterations (though it could be shorter, if the task was achieved in fewer iterations). Following each update, we ran 50 test episodes (not included in the agent's experience) with the new models and policies to empirically evaluate the current value of the agent's policy. For all of the nonparametric approaches, 50 samples were collected, 10 iterations apart, after a burn-in of 500 iterations. Sampled models were solved using 25 backups of PBVI [18] with 500 sampled beliefs.
One iteration of bounded policy iteration [19] was performed per sampled model. The finite-state learner was trained with min(25, |S|) states, where |S| was the true number of underlying states. Both the nonparametric and finite learners were trained from scratch during each update; we found empirically that restarting from random points made the learners more robust than starting them at potentially poor local optima.

Policy Priors with No Expert Data The combined policy and model prior can be used to encode a prior bias towards models with simpler control policies. This interpretation of policy priors can be useful even without expert data: the left pane of Fig. 2 shows the performance of the policy prior-biased approaches and the standard iPOMDP on a gridworld problem in which observations correspond to both the adjacent walls (relevant for planning) and the color of the square (not relevant for planning). This domain has 26 states, 4 colors, standard NSEW actions, and an 80% chance of a successful action. The optimal policy for this gridworld was simple: go east until the agent hits a wall, then go south. However, the varied observations made the iPOMDP infer many underlying states, none of which it could train well, and these models also confused the policy inference in Approach 3. Without expert data, Approach 1 cannot do better than the iPOMDP.

²We suspect that the reason the two planning approaches yield similar results is that the stochastic forward search never goes deep enough to discover the value of learning the model and thus acts equivalently to our sampling-based approach, which only considers the value of learning more about the underlying state.
By biasing the agent towards worlds that admit simpler policies, the model-based inference with policy priors (Approach 2) creates a faster learner.

Policy Priors with Imperfect Experts While we focused on optimal expert data, in practice policy priors can be applied even if the expert is imperfect. Fig. 2(b) shows learning curves for a simulated snake manipulation problem with a 40-dimensional continuous state space, corresponding to the (x, y) positions and velocities of 10 body segments. Actions are 9-dimensional continuous vectors, corresponding to desired joint angles between segments. The snake is rewarded based on the distance it travels along a twisty linear "maze," encouraging it to wiggle forward and turn corners.

We generated expert data by first deriving 16 motor primitives for the action space using a clustering technique on a near-optimal trajectory produced by a rapidly-exploring random tree (RRT). A reasonable, but not optimal, controller was then designed using alternative policy-learning techniques on the action space of motor primitives. Trajectories from this controller were treated as expert data for our policy prior model. Although the trajectories and primitives are suboptimal, Fig. 2(b) shows that knowledge of feasible solutions boosts performance when using the policy-based technique.

Tests on Standard Problems We also tested the approaches on ten problems: tiger [20] (2 states), network [20] (7 states), shuttle [21] (8 states), an adapted version of gridworld [20] (26 states), an adapted version of follow [2] (26 states), hallway [20] (57 states), beach (100 states), rocksample(4,4) [22] (257 states), tag [18] (870 states), and image-search (16321 states). In the beach problem, the agent needed to track a beach ball on a 2D grid.
The image-search problem involved identifying a unique pixel in an 8x8 grid using three types of filters with varying costs and scales. We compared our inference approaches with two approaches that did not leverage the expert data: expectation-maximization (EM), used to learn a finite world model of the correct size, and the infinite POMDP [12], which placed the same nonparametric prior over world models as we did.

Figure 3: Performance on several standard problems, with 95% 
confidence intervals of the mean.

Fig. 3 shows the learning curves for our policy priors approaches (problems ordered by state space size); the cumulative rewards and final values are shown in Table 1. As expected, approaches that leverage expert trajectories generally perform better than those that ignore the near-optimality of the expert data. The policy-based approach is successful even among the larger problems. Here, even though the inferred state spaces could grow large, policies remained relatively simple. The optimization used in the policy-based approach (recall that we use stochastic search to find a probable policy) was also key to producing reasonable policies with limited computation.

Table 1: Cumulative and final rewards on several problems. Bold values highlight best performers.

             Cumulative Reward                                    Final Reward
             iPOMDP   App. 1   App. 2   App. 3   EM        iPOMDP   App. 1    App. 2    App. 3    EM
tiger        -2.2e3   -1.4e3   -5.3e2   -2.2e2   -3.0e3    -2.0e1   -1.0e1    -2.3      1.6       -2.0e1
network      -1.5e4   -6.3e3   -2.1e3   1.9e4    -2.6e3    -1.1e1   -1.2e1    -4.0e-1   1.1e1     -4.7
shuttle      -5.3e1   7.9e1    1.5e2    5.1e1    0.0       1.7e-1   3.3e-1    6.5e-1    8.6e-1    0.0
follow       -6.3e3   -2.3e3   -1.9e3   -1.6e3   -5.0e3    -5.9     -3.1      -1.4      -1.1      -5.0
gridworld    -2.0e3   -6.2e2   -7.0e2   4.6e2    -3.7e3    -1.3     5.3e-1    1.8       2.3       -2.1
hallway      2.0e-1   1.4      1.6      6.6      0.0       8.6e-4   7.4e-3    1.4e-2    1.9e-2    0.0
beach        1.9e2    1.4e2    1.8e2    1.9e2    3.5e2     2.0e-1   1.1e-1    1.4e-1    2.7e-1    3.4e-1
rocksample   -3.2e3   -1.7e3   -1.8e3   -1.0e3   -3.5e3    -1.6     -5.3e-1   -1.3      1.2       -2.0
tag          -1.6e4   -6.9e3   -7.4e3   -3.5e3   -         -9.4     -2.8      -4.1      -1.7      -9.1
image        -7.8e3   -5.3e3   -6.1e3   -3.9e3   -         -5.0     -3.6      -4.2      1.3e1     -5.0

5 Discussion and Related Work

Several Bayesian approaches have been developed for RL in partially observable domains.
These include [7], which uses a set of Gaussian approximations to allow for analytic value function updates in the POMDP space; [2], which jointly reasons over the space of Dirichlet parameters and states when planning in discrete POMDPs; and [12], which samples models from a nonparametric prior.

Both [1] and [4] describe how expert data can augment learning. The first [1] allows the agent to query a state oracle during the learning process. The computational benefit of a state oracle is that the information can be used to directly update a prior over models. However, in large or complex domains, the agent's state might be difficult to define. In contrast, [4] lets the agent query an expert for optimal actions. While policy information may be much easier to specify, incorporating the result of a single query into the prior over models is challenging, and the particle-filtering approach of [4] can be brittle as model spaces grow large. Our policy priors approach uses entire trajectories; by learning policies rather than single actions, we can generalize better and evaluate models more holistically. By working with models and policies, rather than just models as in [4], we can also consider larger problems which still have simple policies. Targeted criteria for requesting expert trajectories, especially ones with performance guarantees such as those of [4], would be an interesting extension to our approach.

6 Conclusion

We addressed a key gap in the learning-by-demonstration literature: learning from both expert and agent data in a partially observable setting. Prior work used expert data in MDP and imitation-learning cases, but less work exists for the general POMDP case. Our Bayesian approach combined priors over world models and policies, connecting information about world dynamics and expert trajectories.
Taken together, these priors are a new way to think about specifying priors over models: instead of simply putting a prior over the dynamics, our prior provides a bias towards models with both simple dynamics and simple optimal policies. We showed that, with our approach, expert data never reduce performance, and that our extra bias towards controllability improves performance even without expert data. Our policy priors over nonparametric finite state controllers were relatively simple; developing classes of priors that address a broader range of problems is an interesting direction for future work.

References

[1] R. Jaulmes, J. Pineau, and D. Precup. Learning in non-stationary partially observable Markov decision processes. ECML Workshop, 2005.

[2] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive POMDPs. In Neural Information Processing Systems (NIPS), 2008.

[3] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In ICRA, 2008.

[4] Finale Doshi, Joelle Pineau, and Nicholas Roy. Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. In International Conference on Machine Learning, volume 25, 2008.

[5] Pieter Abbeel, Morgan Quigley, and Andrew Y. Ng. Using inaccurate models in reinforcement learning. In International Conference on Machine Learning (ICML), pages 1–8. ACM Press, 2006.

[6] Nathan Ratliff, Brian Ziebart, Kevin Peterson, J. Andrew Bagnell, Martial Hebert, Anind K. Dey, and Siddhartha Srinivasa. Inverse optimal heuristic control for imitation learning. In Proc. AISTATS, pages 424–431, 2009.

[7] P. Poupart and N. Vlassis. Model-based Bayesian reinforcement learning in partially observable domains. In ISAIM, 2008.

[8] M. Strens. A Bayesian framework for reinforcement learning.
In ICML, 2000.

[9] John Asmuth, Lihong Li, Michael Littman, Ali Nouri, and David Wingate. A Bayesian sampling approach to exploration in reinforcement learning. In Uncertainty in Artificial Intelligence (UAI), 2009.

[10] R. Dearden, N. Friedman, and D. Andre. Model based Bayesian exploration. In UAI, pages 150–159, 1999.

[11] E. J. Sondik. The Optimal Control of Partially Observable Markov Processes. PhD thesis, Stanford University, 1971.

[12] Finale Doshi-Velez. The infinite partially observable Markov decision process. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 477–485. 2009.

[13] Matthew J. Beal, Zoubin Ghahramani, and Carl E. Rasmussen. The infinite hidden Markov model. In Machine Learning, pages 29–245. MIT Press, 2002.

[14] Yee Whye Teh, Michael I. Jordan, Matthew J. Beal, and David M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101:1566–1581, 2006.

[15] Tao Wang, Daniel Lizotte, Michael Bowling, and Dale Schuurmans. Bayesian sparse sampling for on-line reward optimization. In International Conference on Machine Learning (ICML), 2005.

[16] J. Zico Kolter and Andrew Ng. Near-Bayesian exploration in polynomial time. In International Conference on Machine Learning (ICML), 2009.

[17] J. van Gael, Y. Saatci, Y. W. Teh, and Z. Ghahramani. Beam sampling for the infinite hidden Markov model. In ICML, volume 25, 2008.

[18] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. IJCAI, 2003.

[19] Pascal Poupart and Craig Boutilier. Bounded finite state controllers. In Neural Information Processing Systems, 2003.

[20] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: scaling up. ICML, 1995.

[21] Lonnie Chrisman.
Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 183–188. AAAI Press, 1992.

[22] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. of UAI 2004, Banff, Alberta, 2004.