{"title": "Learning to Explore and Exploit in POMDPs", "book": "Advances in Neural Information Processing Systems", "page_first": 198, "page_last": 206, "abstract": "A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.", "full_text": "Learning to Explore and Exploit in POMDPs\n\nChenghui Cai, Xuejun Liao, and Lawrence Carin\nDepartment of Electrical and Computer Engineering\nDuke University\nDurham, NC 27708-0291, USA\n\nAbstract\n\nA fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. 
The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.\n\n1 Introduction\n\nA fundamental challenge facing reinforcement learning (RL) algorithms is to maintain a proper balance between exploration and exploitation. The policy designed from previous experience is by construction constrained, and may not be optimal as a result of inexperience. Therefore, it is desirable to take actions with the goal of enhancing experience. Although these actions may not necessarily yield optimal near-term reward toward the ultimate goal, they could, over a long horizon, yield improved long-term reward. The fundamental challenge is to achieve an optimal balance between exploration and exploitation; the former is performed with the goal of enhancing experience and preventing premature convergence to suboptimal behavior, and the latter is performed with the goal of employing available experience to define perceived optimal actions.\n\nFor a Markov decision process (MDP), the problem of balancing exploration and exploitation has been addressed successfully by the E3 [4, 5] and R-max [2] algorithms. Many important applications, however, have environments whose states are not completely observed, leading to partially observable MDPs (POMDPs). Reinforcement learning in POMDPs is challenging, particularly in the context of balancing exploration and exploitation. Recent work targeting the exploration vs. exploitation problem is based on an augmented POMDP, with a product state space over the environment states and the unknown POMDP parameters [9]. This, however, entails solving a complicated planning problem, which has a state space that grows exponentially with the number of unknown parameters, making the problem quickly intractable in practice. To mitigate this complexity, active learning methods have been proposed for POMDPs, which borrow ideas from supervised learning and apply them to selectively query an oracle (domain expert) for the optimal action [3]. Active learning has found success in many collaborative human-machine tasks where expert advice is available.\n\nIn this paper we propose a dual-policy approach to balance exploration and exploitation in POMDPs, by simultaneously learning two policies with partially shared internal structure. The first policy, termed the primary policy, defines actions based on previous experience; the second policy, termed the auxiliary policy, is a meta-level policy maintaining a proper balance between exploration and exploitation. We employ the regionalized policy representation (RPR) [6] to parameterize both policies, and perform Bayesian learning to update the policy posteriors. The approach applies in either of two cases: (i) the agent explores by randomly taking actions that have been insufficiently tried before (traditional exploration), or (ii) the agent explores by querying an oracle for the optimal action (active learning). In the latter case, the agent is assessed a query cost from the oracle, in addition to the reward received from the environment. Either (i) or (ii) is employed as an exploration vehicle, depending upon the application.\n\nThe dual-policy approach possesses interesting convergence properties, similar to those of E3 [5] and R-max [2]. However, our approach assumes the environment is a POMDP, while E3 and R-max both assume an MDP environment. 
Another distinction is that our approach learns the agent policy directly from episodes, without estimating the POMDP model. This is in contrast to E3 and R-max (both learn MDP models) and the active-learning method in [3] (which learns POMDP models).\n\n2 Regionalized Policy Representation\n\nWe first provide a brief review of the regionalized policy representation, which is used to parameterize the primary policy and the auxiliary policy as discussed above. The material in this section is taken from [6], with the proofs omitted here.\n\nDefinition 2.1 A regionalized policy representation is a tuple (A, O, Z, W, μ, π). The A and O are respectively a finite set of actions and observations. The Z is a finite set of belief regions. The W is the belief-region transition function, with W(z, a, o', z') denoting the probability of transiting from z to z' when taking action a in z results in observing o'. The μ is the initial distribution of belief regions, with μ(z) denoting the probability of initially being in z. The π are the region-dependent stochastic policies, with π(z, a) denoting the probability of taking action a in z.\n\nWe denote A = {1, 2, . . . , |A|}, where |A| is the cardinality of A. Similarly, O = {1, 2, . . . , |O|} and Z = {1, 2, . . . , |Z|}. We abbreviate (a_0, a_1, . . . , a_T) as a_{0:T} and, similarly, (o_1, o_2, . . . , o_T) as o_{1:T} and (z_0, z_1, . . . , z_T) as z_{0:T}, where the subscripts index discrete time steps. The history h_t = {a_{0:t-1}, o_{1:t}} is defined as the sequence of actions performed and observations received up to time t. Let Θ = {π, μ, W} denote the RPR parameters. Given h_t, the RPR yields a joint probability distribution of z_{0:t} and a_{0:t} as follows:\n\np(a_{0:t}, z_{0:t} | o_{1:t}, Θ) = μ(z_0) π(z_0, a_0) ∏_{τ=1}^{t} W(z_{τ-1}, a_{τ-1}, o_τ, z_τ) π(z_τ, a_τ)   (1)\n\nBy marginalizing z_{0:t} out in (1), we obtain p(a_{0:t} | o_{1:t}, Θ). Furthermore, the history-dependent distribution of action choices is obtained as\n\np(a_τ | h_τ, Θ) = p(a_{0:τ} | o_{1:τ}, Θ) [p(a_{0:τ-1} | o_{1:τ-1}, Θ)]^{-1}\n\nwhich gives a stochastic policy for choosing the action a_τ. The action choice depends solely on the historical actions and observations, with the unobservable belief regions marginalized out.\n\n2.1 Learning Criterion\n\nBayesian learning of the RPR is based on the experiences collected from the agent-environment interaction. Assuming the interaction is episodic, i.e., it breaks into subsequences called episodes [10], we represent the experiences by a set of episodes.\n\nDefinition 2.2 An episode is a sequence of agent-environment interactions terminated in an absorbing state that transits to itself with zero reward. An episode is denoted by (a^k_0 r^k_0 o^k_1 a^k_1 r^k_1 · · · o^k_{T_k} a^k_{T_k} r^k_{T_k}), where the subscripts are discrete times, k indexes the episodes, and o, a, and r are respectively observations, actions, and immediate rewards.\n\nDefinition 2.3 (The RPR Optimality Criterion) Let D^(K) = {(a^k_0 r^k_0 o^k_1 a^k_1 r^k_1 · · · o^k_{T_k} a^k_{T_k} r^k_{T_k})}_{k=1}^{K} be a set of episodes obtained by an agent interacting with the environment by following policy Π to select actions, where Π is an arbitrary stochastic policy with action-selecting distributions p_Π(a_t | h_t) > 0, ∀ action a_t, ∀ history h_t. 
The RPR optimality criterion is defined as\n\nV̂(D^(K); Θ) := (1/K) ∑_{k=1}^{K} ∑_{t=0}^{T_k} γ^t r^k_t [∏_{τ=0}^{t} p(a^k_τ | h^k_τ, Θ)] [∏_{τ=0}^{t} p_Π(a^k_τ | h^k_τ)]^{-1}   (2)\n\nwhere h^k_t = a^k_0 o^k_1 a^k_1 · · · o^k_t is the history of actions and observations up to time t in the k-th episode, 0 < γ < 1 is the discount, and Θ denotes the RPR parameters.\n\nThroughout the paper, we call V̂(D^(K); Θ) the empirical value function of Θ. It is proven in [6] that lim_{K→∞} V̂(D^(K); Θ) is the expected sum of discounted rewards obtained by following the RPR policy parameterized by Θ for an infinite number of steps. Therefore, the RPR resulting from maximization of V̂(D^(K); Θ) approaches the optimal as K is large (assuming |Z| is appropriate). In the Bayesian setting discussed below, we use a noninformative prior for Θ, leading to a posterior of Θ peaked at the optimal RPR; therefore the agent is guaranteed to sample the optimal or a near-optimal policy with overwhelming probability.\n\n2.2 Bayesian Learning\n\nLet G_0(Θ) represent the prior distribution of the RPR parameters. We define the posterior of Θ as\n\np(Θ | D^(K), G_0) := V̂(D^(K); Θ) G_0(Θ) [V̂(D^(K))]^{-1}   (3)\n\nwhere V̂(D^(K)) = ∫ V̂(D^(K); Θ) G_0(Θ) dΘ is the marginal empirical value. Note that V̂(D^(K); Θ) is an empirical value function, thus (3) is a non-standard use of Bayes rule. However, (3) indeed gives a distribution whose shape incorporates both the prior and the empirical information.\n\nSince each term in V̂(D^(K); Θ) is a product of multinomial distributions, it is natural to choose the prior as a product of Dirichlet distributions,\n\nG_0(Θ) = p(μ|υ) p(π|ρ) p(W|ω)   (4)\n\nwhere p(μ|υ) = Dir(μ(1), · · · , μ(|Z|) | υ), p(π|ρ) = ∏_{i=1}^{|Z|} Dir(π(i, 1), · · · , π(i, |A|) | ρ_i), p(W|ω) = ∏_{a=1}^{|A|} ∏_{o=1}^{|O|} ∏_{i=1}^{|Z|} Dir(W(i, a, o, 1), · · · , W(i, a, o, |Z|) | ω_{i,a,o}); ρ_i = {ρ_{i,m}}_{m=1}^{|A|}, υ = {υ_i}_{i=1}^{|Z|}, and ω_{i,a,o} = {ω_{i,a,o,j}}_{j=1}^{|Z|} are hyper-parameters. With the prior thus chosen, the posterior in (3) is a large mixture of Dirichlet products, and therefore posterior analysis by Gibbs sampling is inefficient. To overcome this, we employ the variational Bayesian technique [1] to obtain a variational posterior by maximizing a lower bound to ln ∫ V̂(D^(K); Θ) G_0(Θ) dΘ,\n\nLB({q^k_t}, g(Θ)) = ln ∫ V̂(D^(K); Θ) G_0(Θ) dΘ − KL({q^k_t(z^k_{0:t}) g(Θ)} || {ν^k_t p(z^k_{0:t}, Θ | a^k_{0:t}, o^k_{1:t})})\n\nwhere {q^k_t}, g(Θ) are variational distributions satisfying q^k_t(z^k_{0:t}) ≥ 0, g(Θ) ≥ 0, ∫ g(Θ) dΘ = 1, and (1/K) ∑_{k=1}^{K} ∑_{t=0}^{T_k} ∑_{z^k_0, ··· , z^k_t} q^k_t(z^k_{0:t}) = 1; ν^k_t = γ^t r^k_t [∏_{τ=0}^{t} p_Π(a^k_τ | h^k_τ) V̂(D^(K))]^{-1}, and KL(q||p) denotes the Kullback-Leibler (KL) distance between probability measures q and p. The factorized form {q^k_t(z^k_{0:t}) g(Θ)} represents an approximation of the weighted joint posterior of Θ and the z's when the lower bound reaches the maximum, and the corresponding g(Θ) is called the variational approximate posterior of Θ. The lower bound maximization is accomplished by solving for {q^k_t(z^k_{0:t})} and g(Θ) alternately, keeping one fixed while solving for the other. The solutions are summarized in Theorem 2.4; the proof is in [6].\n\nTheorem 2.4 Given the initialization ρ̂ = ρ, υ̂ = υ, ω̂ = ω, iterative application of the following updates produces a sequence of monotonically increasing lower bounds LB({q^k_t}, g(Θ)), which converges to a maximum. The update of {q^k_t} is\n\nq^k_t(z^k_{0:t}) = σ^k_t p(z^k_{0:t} | a^k_{0:t}, o^k_{1:t}, Θ̃)\n\nwhere Θ̃ = {π̃, μ̃, W̃} is a set of under-normalized probability mass functions, with π̃(i, m) = exp(ψ(ρ̂_{i,m}) − ψ(∑_{m=1}^{|A|} ρ̂_{i,m})), μ̃(i) = exp(ψ(υ̂_i) − ψ(∑_{i=1}^{|Z|} υ̂_i)), W̃(i, a, o, j) = exp(ψ(ω̂_{i,a,o,j}) − ψ(∑_{j=1}^{|Z|} ω̂_{i,a,o,j})), and ψ is the digamma function. The g(Θ) has the same form as the prior G_0 in (4), except that the hyper-parameters are updated as\n\nυ̂_i = υ_i + ∑_{k=1}^{K} ∑_{t=0}^{T_k} σ^k_t φ^k_{t,0}(i)\n\nρ̂_{i,a} = ρ_{i,a} + ∑_{k=1}^{K} ∑_{t=0}^{T_k} ∑_{τ=0}^{t} σ^k_t φ^k_{t,τ}(i) δ(a^k_τ, a)\n\nω̂_{i,a,o,j} = ω_{i,a,o,j} + ∑_{k=1}^{K} ∑_{t=0}^{T_k} ∑_{τ=1}^{t} σ^k_t ξ^k_{t,τ−1}(i, j) δ(a^k_{τ−1}, a) δ(o^k_τ, o)\n\nwhere ξ^k_{t,τ}(i, j) = p(z^k_τ = i, z^k_{τ+1} = j | a^k_{0:t}, o^k_{1:t}, Θ̃), φ^k_{t,τ}(i) = p(z^k_τ = i | a^k_{0:t}, o^k_{1:t}, Θ̃), and\n\nσ^k_t = [γ^t r^k_t p(a^k_{0:t} | o^k_{1:t}, Θ̃)] [∏_{τ=0}^{t} p_Π(a^k_τ | h^k_τ) V̂(D^(K) | Θ̃)]^{-1}   (5)\n\n3 Dual-RPR: Joint Policy for 
the Agent Behavior and the Trade-Off Between Exploration and Exploitation\n\nAssume that the agent uses the RPR described in Section 2 to govern its behavior in the unknown POMDP environment (the primary policy). Bayesian learning employs the empirical value function V̂(D^(K); Θ) in (2) in place of a likelihood function, to obtain the posterior of the RPR parameters Θ. The episodes D^(K) may be obtained from the environment by following an arbitrary stochastic policy Π with p_Π(a|h) > 0, ∀ a, ∀ h. Although any such Π guarantees optimality of the resulting RPR, the choice of Π affects the convergence speed. A good choice of Π avoids episodes that do not bring new information to improve the RPR, and thus the agent does not have to see all possible episodes before the RPR becomes optimal.\n\nIn batch learning, all episodes are collected before the learning begins, and thus Π is pre-chosen and does not change during the learning [6]. In online learning, however, the episodes are collected during the learning, and the RPR is updated upon completion of each episode. Therefore there is a chance to exploit the RPR to avoid repeated learning in the same part of the environment. The agent should recognize belief regions it is familiar with, and exploit the existing RPR policy there; in belief regions inferred as new, the agent should explore. This balance between exploration and exploitation is performed with the goal of accumulating a large long-run reward.\n\nWe consider online learning of the RPR (as the primary policy) and choose Π as a mixture of two policies: one is the current RPR Θ (exploitation) and the other is an exploration policy Π_e. This gives the action-choosing probability\n\np_Π(a|h) = p(y = 0|h) p(a|h, Θ, y = 0) + p(y = 1|h) p(a|h, Π_e, y = 1)\n\nwhere y = 0 (y = 1) indicates exploitation (exploration). 
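As a concrete illustration, one draw from this mixture can be sketched as follows. The callables `primary`, `auxiliary`, and `explore_policy` are hypothetical stand-ins for the learned primary RPR, the auxiliary policy p(y|h), and Π_e; they are not part of the paper's implementation.

```python
import random

def select_action(h, primary, auxiliary, explore_policy, rng=random):
    """One draw from p_Pi(a|h) = p(y=0|h) p(a|h,Theta,y=0) + p(y=1|h) p(a|h,Pi_e,y=1).

    `primary(h)` and `explore_policy(h)` return a distribution over actions
    as {action: probability}; `auxiliary(h)` returns p(y = 1 | h), the
    exploration probability. All three are hypothetical callables.
    """
    p_explore = auxiliary(h)                    # p(y = 1 | h)
    y = 1 if rng.random() < p_explore else 0    # sample explore vs. exploit
    dist = explore_policy(h) if y == 1 else primary(h)
    actions, probs = zip(*dist.items())
    a = rng.choices(actions, weights=probs)[0]  # sample a ~ p(a | h, y)
    return a, y                                 # y is kept for the lambda update
```

The returned indicator y is what the updates of the auxiliary policy later consume.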
The problem of choosing a good Π then reduces to a proper balance between exploitation and exploration: the agent should exploit Θ when doing so is highly rewarding, while following Π_e to enhance experience and improve Θ.\n\nAn auxiliary RPR is employed to represent the policy for balancing exploration and exploitation, i.e., the history-dependent distribution p(y|h). The auxiliary RPR shares the parameters {μ, W} with the primary RPR, but with π = {π(z, a) : a ∈ A, z ∈ Z} replaced by λ = {λ(z, y) : y = 0 or 1, z ∈ Z}, where λ(z, y) is the probability of choosing exploitation (y = 0) or exploration (y = 1) in belief region z. Let λ have the prior\n\np(λ|u) = ∏_{i=1}^{|Z|} Beta(λ(i, 0), λ(i, 1) | u_0, u_1).   (6)\n\nIn order to encourage exploration when the agent has little experience, we choose u_0 = 1 and u_1 > 1 so that, at the beginning of learning, the auxiliary RPR always suggests exploration. As the agent accumulates episodes of experience, it comes to know a certain part of the environment in which the episodes have been collected. This knowledge is reflected in the auxiliary RPR, which, along with the primary RPR, is updated upon completion of each new episode.\n\nSince the environment is a POMDP, the agent's knowledge should be represented in the space of belief states. However, the agent cannot directly access the belief states, because computation of belief states requires knowing the true POMDP model, which is not available. Fortunately, the RPR formulation provides a compact representation of H = {h}, the space of histories, where each history h corresponds to a belief state in the POMDP. Within the RPR formulation, H is represented internally as the set of distributions over belief regions z ∈ Z, which allows the agent to access H based on a subset of samples from H. 
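For illustration, the internal distribution over belief regions given a history h_t = (a_{0:t-1}, o_{1:t}) can be maintained by a forward recursion built from the quantities of Definition 2.1. This is a minimal sketch; the data layout (lists and dicts for μ, π, W) is an assumption made for readability, not the paper's representation.

```python
def belief_region_distribution(mu, pi, W, actions, observations):
    """Forward filter for p(z_t | a_{0:t-1}, o_{1:t}) under an RPR (Definition 2.1).

    mu[z]          = p(z_0 = z)
    pi[z][a]       = pi(z, a), the region-dependent action probability
    W[z][a][o][z2] = W(z, a, o', z'), the belief-region transition function
    actions = (a_0, ..., a_{t-1}); observations = (o_1, ..., o_t).
    """
    alpha = list(mu)  # distribution over z_0
    for a, o in zip(actions, observations):
        # weight by the probability of the action actually taken, then transition
        alpha = [sum(alpha[z] * pi[z][a] * W[z][a][o][z2]
                     for z in range(len(alpha)))
                 for z2 in range(len(alpha))]
        total = sum(alpha)
        alpha = [x / total for x in alpha]  # renormalize after conditioning
    return alpha
```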
Let H_known be the part of H that has become known to the agent, i.e., the primary RPR is optimal in H_known and thus the agent should begin to exploit upon entering H_known. As will be clear below, H_known can be identified by H_known = {h : p(y = 0 | h, Θ, λ) ≈ 1}, if the posterior of λ is updated by\n\nû_{i,0} = u_0 + ∑_{k=1}^{K} ∑_{t=0}^{T_k} ∑_{τ=0}^{t} σ^k_t φ^k_{t,τ}(i),   (7)\n\nû_{i,1} = max(η, u_1 − ∑_{k=1}^{K} ∑_{t=0}^{T_k} ∑_{τ=0}^{t} y^k_t γ^t c φ^k_{t,τ}(i)),   (8)\n\nwhere η is a small positive number, and σ^k_t is the same as in (5) except that r^k_t is replaced by m^k_t, the meta-reward received at t in episode k. We have m^k_t = r_meta if the goal is reached at time t in episode k, and m^k_t = 0 otherwise, where r_meta > 0 is a constant. When Π_e is provided by an oracle (active learning), a query cost c > 0 is taken into account in (8), by subtracting c from u_1. Thus, the probability of exploration is reduced each time the agent makes a query to the oracle (i.e., y^k_t = 1). After a certain number of queries, û_{i,1} becomes the small positive number η (it never becomes zero, due to the max operator), at which point the agent stops querying in belief region z = i.\n\nIn (7) and (8), exploitation always receives a “credit”, while exploration never receives credit (exploration is actually discredited when Π_e is an oracle). This update makes sure that the chance of exploitation monotonically increases as the episodes accumulate. Exploration receives no credit because it has been pre-assigned a credit (u_1) in the prior, and the chance of exploration should monotonically decrease with the accumulation of episodes. The parameter u_1 represents the agent's prior for the amount of needed exploration. When c > 0, u_1 is discredited by the cost and the agent needs a larger u_1 (than when c = 0) to obtain the same amount of exploration. 
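A much-simplified sketch of this crediting scheme follows. The per-visit unit credit is a stand-in for the variational weight σ^k_t φ^k_{t,τ}(i) of (7)-(8), and the function signature is hypothetical; only the qualitative behavior (exploitation credited, queries discredited, floor at η) is taken from the paper.

```python
def update_lambda_hyperparams(u, visits, eta=0.001, gamma=0.95, cost=0.0):
    """Simplified stand-in for updates (7)-(8) on the Beta hyper-parameters of lambda.

    u:      dict mapping belief region z -> [u_z0, u_z1]
    visits: list of (z, t, y) tuples from one episode; y = 1 marks a query step.
    A unit credit replaces the paper's variational weight sigma * phi.
    """
    for z, t, y in visits:
        u[z][0] += 1.0  # (7): exploitation side always accrues credit
        if y == 1 and cost > 0:
            # (8): each oracle query discredits exploration, floored at eta
            u[z][1] = max(eta, u[z][1] - (gamma ** t) * cost)
    return u
```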
The fact that the amount of exploration monotonically increases with u_1 implies that one can always find a large enough u_1 to ensure that the primary RPR is optimal in H_known = {h : p(y = 0 | h, Θ, λ) ≈ 1}. However, an unnecessarily large u_1 makes the agent over-explore and leads to slow convergence. Let u_1^min denote the minimum u_1 that ensures optimality in H_known. We assume u_1^min exists in the analysis below. The possible range of u_1^min is examined in the experiments.\n\n4 Optimality and Convergence Analysis\n\nLet M be the true POMDP model. We first introduce an equivalent expression for the empirical value function in (2),\n\nV̂(E^(K)_T; Θ) = ∑_{E^(K)_T} ∑_{t=0}^{T} γ^t r_t p(a_{0:t}, o_{1:t}, r_t | y_{0:t} = 0, Θ, M),   (9)\n\nwhere the first summation is over all elements in E^(K)_T ⊆ E_T, and E_T = {(a_{0:T}, o_{1:T}, r_{0:T}) : a_t ∈ A, o_t ∈ O, t = 0, 1, · · · , T} is the complete set of episodes of length T in the POMDP, with no repeated elements. The condition y_{0:t} = 0, which is an abbreviation for y_τ = 0 ∀ τ = 0, 1, · · · , t, indicates that the agent always follows the RPR (Θ) here. Note V̂(E^(K)_T; Θ) is the empirical value function of Θ defined on E^(K)_T, as is V̂(D^(K); Θ) on D^(K). When T = ∞ (see footnote 1), the two are identical up to a difference in acquiring the episodes: E^(K)_T is a simple enumeration of distinct episodes, while D^(K) may contain identical episodes. The multiplicity of an episode in D^(K) results from the sampling process (by following a policy to interact with the environment). Note that the empirical value function defined using E^(K)_T is interesting only for theoretical analysis, because the evaluation requires knowing the true POMDP model, which is not available in practice. 
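By contrast, the data-based empirical value function in (2) is computable from sampled episodes alone. A minimal sketch, with `p_theta` and `p_pi` as hypothetical callables returning the action probabilities under the RPR and the behavior policy:

```python
def empirical_value(episodes, p_theta, p_pi, gamma=0.95):
    """Empirical value function of Eq. (2):
    (1/K) sum_k sum_t gamma^t r^k_t prod_{tau<=t} p(a_tau|h_tau,Theta) / p_Pi(a_tau|h_tau).

    episodes: list of episodes, each a list of (a, o, r) steps;
    p_theta(h, a), p_pi(h, a): hypothetical callables giving action probabilities.
    Computable from data alone, without the POMDP model M.
    """
    total = 0.0
    for ep in episodes:
        w, h = 1.0, []
        for t, (a, o, r) in enumerate(ep):
            w *= p_theta(h, a) / p_pi(h, a)  # running importance weight
            total += (gamma ** t) * r * w
            h.append((a, o))                  # extend the history
    return total / len(episodes)
```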
We define the optimistic value function\n\nV̂_f(E^(K)_T; Θ, λ, Π_e) = ∑_{E^(K)_T} ∑_{t=0}^{T} γ^t ∑_{y_0,···,y_t=0}^{1} (r_t + (R_max − r_t) ∨_{τ=0}^{t} y_τ) p(a_{0:t}, o_{1:t}, r_t, y_{0:t} | Θ, λ, M, Π_e)   (10)\n\nwhere ∨_{τ=0}^{t} y_τ indicates that the agent receives r_t if and only if y_τ = 0 at all time steps τ = 0, 1, · · · , t; otherwise, it receives R_max at t, which is an upper bound of the rewards in the environment. Similarly we can define V̂(D^(K); Θ, λ, Π_e), the equivalent expression for V̂_f(E^(K)_T; Θ, λ, Π_e). The following lemma is proven in the Appendix.\n\nLemma 4.1 Let V̂(E^(K)_T; Θ), V̂_f(E^(K)_T; Θ, λ, Π_e), and R_max be defined as above. Let P_explore(E^(K)_T, Θ, λ, Π_e) be the probability of executing the exploration policy Π_e at least once in some episode in E^(K)_T, under the auxiliary RPR (Θ, λ) and the exploration policy Π_e. Then\n\nP_explore(E^(K)_T, Θ, λ, Π_e) ≥ [(1 − γ)/R_max] |V̂(E^(K)_T; Θ) − V̂_f(E^(K)_T; Θ, λ, Π_e)|.\n\nFootnote 1: An episode almost always terminates in finite time steps in practice, and the agent stays in the absorbing state with zero reward for the remaining infinite steps after an episode is terminated [10]. The infinite horizon is only to ensure theoretically that all episodes have the same horizon length.\n\nProposition 4.2 Let Θ be the optimal RPR on E^(K)_∞ and Θ* be the optimal RPR in the complete POMDP environment. Let the auxiliary RPR hyper-parameters (λ) be updated according to (7) and (8), with u_1 ≥ u_1^min. Let Π_e be the exploration policy and ε ≥ 0. Then either (a) V̂(E_∞; Θ) ≥ V̂(E_∞; Θ*) − ε, or (b) the probability that the auxiliary RPR suggests executing Π_e in some episode unseen in E^(K)_∞ is at least ε(1 − γ)/R_max.\n\nProof: It is sufficient to show that if (a) does not hold, then (b) must hold. Let us assume V̂(E_∞; Θ) < V̂(E_∞; Θ*) − ε. Because Θ is optimal in E^(K)_∞, V̂(E^(K)_∞; Θ) ≥ V̂(E^(K)_∞; Θ*), which implies V̂(E^(\\K)_∞; Θ) < V̂(E^(\\K)_∞; Θ*) − ε, where E^(\\K)_∞ = E_∞ \\ E^(K)_∞. We show below that V̂_f(E^(\\K)_∞; Θ, λ, Π_e) ≥ V̂(E^(\\K)_∞; Θ*), which, together with Lemma 4.1, implies\n\nP_explore(E^(\\K)_∞, Θ, λ, Π_e) ≥ [(1 − γ)/R_max] [V̂_f(E^(\\K)_∞; Θ, λ, Π_e) − V̂(E^(\\K)_∞; Θ)] ≥ [(1 − γ)/R_max] [V̂(E^(\\K)_∞; Θ*) − V̂(E^(\\K)_∞; Θ)] ≥ ε(1 − γ)/R_max.\n\nWe now show V̂_f(E^(\\K)_∞; Θ, λ, Π_e) ≥ V̂(E^(\\K)_∞; Θ*). By construction, V̂_f(E^(\\K)_∞; Θ, λ, Π_e) is an optimistic value function, in which the agent receives R_max at any time t unless y_τ = 0 at τ = 0, 1, · · · , t. However, y_τ = 0 at τ = 0, 1, · · · , t implies that {h_τ : τ = 0, 1, · · · , t} ⊂ H_known. By the premise, λ is updated according to (7) and (8) and u_1 ≥ u_1^min, therefore Θ is optimal in H_known (see the discussion following (7) and (8)), which implies Θ is optimal in {h_τ : τ = 0, 1, · · · , t}. Thus, the inequality holds. Q.E.D.\n\nProposition 4.2 shows that whenever the primary RPR achieves less accumulative reward than the optimal RPR by ε, the auxiliary RPR suggests exploration with a probability exceeding ε(1 − γ)R_max^{-1}. Conversely, whenever the auxiliary RPR suggests exploration with a probability smaller than ε(1 − γ)R_max^{-1}, the primary RPR achieves ε-near optimality. This ensures that the agent is either receiving sufficient rewards or it is performing sufficient exploration.\n\n5 Experimental Results\n\nOur experiments are based on Shuttle, a benchmark POMDP problem [7], with the following setup. The primary policy is an RPR with |Z| = 10 and a prior in (4), with all hyper-parameters initially set to one (which makes the initial prior non-informative). The auxiliary policy is an RPR sharing {μ, W} with the primary RPR and having a prior for λ as in (6). The prior of λ is initially biased towards exploration by using u_0 = 1 and u_1 > 1. We consider various values of u_1 to examine the different effects. The agent performs online learning: upon termination of each new episode, the primary and auxiliary RPR posteriors are updated by using the previous posteriors as the current priors. The primary RPR update follows Theorem 2.4 with K = 1, while the auxiliary RPR update follows (7) and (8) for λ (it shares the same update with the primary RPR for μ and W). 
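Schematically, the online procedure just described amounts to the following loop; every argument here is a hypothetical stub standing in for the paper's machinery (episode collection under the mixed policy, the Theorem 2.4 update with K = 1, and the (7)-(8) update), not the actual implementation.

```python
def online_dual_rpr(env, K_total, primary, auxiliary, explore_policy,
                    update_primary, update_auxiliary, run_episode):
    """Online dual-RPR learning loop (a schematic sketch).

    run_episode(env, primary, auxiliary, explore_policy) -> one episode trace,
    including the exploit/explore labels y_t. update_primary stands in for the
    Theorem 2.4 update with K = 1; update_auxiliary stands in for (7)-(8).
    Each episode's posterior becomes the prior for the next episode.
    """
    for _ in range(K_total):
        trace = run_episode(env, primary, auxiliary, explore_policy)
        primary = update_primary(primary, trace)      # posterior -> next prior
        auxiliary = update_auxiliary(auxiliary, trace)
    return primary, auxiliary
```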
We perform 100 independent Monte Carlo runs. In each run, the agent starts learning from a random position in the environment and stops learning when K_total episodes are completed. We compare various methods that the agent uses to balance exploration and exploitation: (i) following the auxiliary RPR, with various values of u_1, to adaptively switch between exploration and exploitation; (ii) randomly switching between exploration and exploitation with a fixed exploration rate P_explore (various values of P_explore are examined). When performing exploitation, the agent follows the current primary RPR (using the Θ that maximizes the posterior); when performing exploration, it follows an exploration policy Π_e. We consider two types of Π_e: (i) taking random actions and (ii) following the policy obtained by solving the true POMDP using PBVI [8] with 2000 belief samples. In either case, r_meta = 1 and η = 0.001. In case (ii), the PBVI policy is the oracle and incurs a query cost c.\n\nWe report: (i) the sum of discounted rewards accrued within each episode during learning; these rewards result from both exploitation and exploration; (ii) the quality of the primary RPR upon termination of each learning episode, represented by the sum of discounted rewards averaged over 251 episodes of following the primary RPR (using the standard testing procedure for Shuttle: each episode is terminated when either the goal is reached or a maximum of 251 steps is taken); these rewards result from exploitation alone; (iii) the exploration rate P_explore in each learning episode, which is the number of time steps at which exploration is performed divided by the total time steps in a given episode. 
In order to examine optimality, the rewards in (i)-(ii) have the corresponding optimal rewards subtracted, where the optimal rewards are obtained by following the PBVI policy; the differences are reported, with zero difference indicating optimality and negative difference indicating sub-optimality. All results are averaged over the 100 Monte Carlo runs. The results are summarized in Figure 1 when Πe takes random actions and in Figure 2 when Πe is an oracle (the PBVI policy).

[Figure 1 omitted: three panels (accrued learning reward minus optimal reward; accrued testing reward minus optimal reward; exploration rate), each plotted against the number of episodes used in the learning phase, for Dual-RPR with u1 = 2, 20, 200 and RPR with Pexplore = 0, 0.1, 0.3, 1.0.]

Figure 1: Results on Shuttle with a random exploration policy, with Ktotal = 3000. Left: accumulative discounted reward accrued within each learning episode, with the corresponding optimal reward subtracted. Middle: accumulative discounted rewards averaged over 251 episodes of following the primary RPR obtained after each learning episode, again with the corresponding optimal reward subtracted. Right: the rate of exploration in each learning episode. All results are averaged over 100 independent Monte Carlo runs.

[Figure 2 omitted: two rows of three panels mirroring Figure 1, for Dual-RPR with u1 = 2, 10, 20 and RPR with various values of Pexplore.]

Figure 2: Results on Shuttle with an oracle exploration policy incurring cost c = 1 (top row) and c = 3 (bottom row), and Ktotal = 100. Each figure in a row is a counterpart of the corresponding figure in Figure 1, with the random Πe replaced by the oracle Πe. See the captions there for details.

It is seen from Figure 1 that, with random exploration and u1 = 2, the primary policy converges to optimality and, accordingly, Pexplore drops to zero, after about 1500 learning episodes. When u1 increases to 20, the convergence is slower: it does not occur (and Pexplore > 0) until after about 2500 learning episodes. With u1 increased to 200, the convergence does not happen and Pexplore > 0.2 within the first 3000 learning episodes.
These results verify our analysis in Sections 3 and 4: (i) the primary policy improves as Pexplore decreases; (ii) the agent explores when it is not acting optimally, and it acts optimally when it stops exploring; (iii) there exists a finite u1 such that the primary policy is optimal if Pexplore = 0. Although u1 = 2 may still be larger than u1^min, it is small enough to ensure convergence within 1500 episodes. We also observe from Figure 1 that: (i) the agent explores more efficiently when it is adaptively switched between exploration and exploitation by the auxiliary policy than when the switch is random; (ii) the primary policy cannot converge to optimality when the agent never explores; (iii) the primary policy may converge to optimality when the agent always takes random actions, but it may require infinitely many learning episodes to converge.

The results in Figure 2, with Πe being an oracle, lead to conclusions similar to those in Figure 1, where Πe is random. However, two observations are specific to Figure 2: (i) Pexplore is affected by the query cost c: with a larger c, the agent performs less exploration; (ii) the convergence rate of the primary policy is not significantly affected by the query cost. The reason for (ii) is that the oracle always provides optimal actions, so over-exploration does not harm optimality; as long as the agent takes optimal actions, the primary policy continually improves if it is not yet optimal, or remains optimal if it already is.

6 Conclusions

We have presented a dual-policy approach for jointly learning the agent behavior and the optimal balance between exploitation and exploration, assuming the unknown environment is a POMDP. By identifying a known part of the environment in terms of histories (parameterized by the RPR), the approach adaptively switches between exploration and exploitation depending on whether the agent is in the known part.
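The known-part-gated switching can be sketched as follows; this is a minimal illustration with hypothetical interfaces (`in_known_part`, `primary_action`, and `explore_action` are stand-ins, not the paper's RPR/auxiliary-policy parameterization):

```python
import random

def choose_action(history, in_known_part, primary_action, explore_action,
                  p_explore_unknown=1.0):
    """Dual-policy action selection sketch: exploit the primary policy when
    the current history lies in the known part of the environment; otherwise
    explore (here, with probability p_explore_unknown)."""
    if in_known_part(history):
        return primary_action(history)      # exploit: perceived-optimal action
    if random.random() < p_explore_unknown:
        return explore_action(history)      # explore: random action or oracle query
    return primary_action(history)

# Toy usage: histories shorter than 3 steps count as "unknown", so the agent
# explores there (with the default p_explore_unknown = 1.0).
action = choose_action(
    history=[0, 1],
    in_known_part=lambda h: len(h) >= 3,
    primary_action=lambda h: "exploit",
    explore_action=lambda h: "explore",
)
```

As the known part grows with experience, the exploration rate falls toward zero, matching the behavior of Pexplore in the experiments.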
We have provided theoretical guarantees for the agent to either explore efficiently or exploit efficiently. Experimental results show good agreement with our theoretical analysis and that our approach finds the optimal policy efficiently. Although we empirically demonstrated the existence of a small u1 that ensures efficient convergence to optimality, further theoretical analysis is needed to find u1^min, the tight lower bound of u1, which ensures convergence to optimality with just the right amount of exploration (without over-exploration). Finding the exact u1^min is difficult because of the partial observability. However, it is hopeful that a good approximation to u1^min can be found. In the worst case, the agent can always choose to be optimistic, as in E3 and R-max. An optimistic agent uses a large u1, which usually leads to over-exploration but ensures convergence to optimality.

7 Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work is supported by AFOSR.

Appendix

Proof of Lemma 4.1: We expand (10) as,

\hat{V}_f(\mathcal{E}_T^{(K)}; \Theta, \lambda, \Pi_e)
= \sum_{\mathcal{E}_T^{(K)}} \Big[ \sum_{t=0}^{T} \gamma^t r_t \, p(a_{0:t}, o_{1:t}, r_t \,|\, y_{0:t} = 0, \Theta, M) \, p(y_{0:t} = 0 \,|\, \Theta, \lambda)
\;+\; \sum_{t=0}^{T} \gamma^t R_{\max} \sum_{y_{0:t} \neq 0} p(a_{0:t}, o_{1:t}, r_t \,|\, y_{0:t}, \Theta, M, \Pi_e) \, p(y_{0:t} \,|\, \Theta, \lambda) \Big]

where y_{0:t} = 0 is an abbreviation for y_τ = 0 ∀ τ = 0, ..., t, and y_{0:t} ≠ 0 is an abbreviation for ∃ 0 ≤ τ ≤ t satisfying y_τ ≠ 0. The sum \sum_{\mathcal{E}_T^{(K)}} is over all episodes in \mathcal{E}_T^{(K)}. The difference between (9) and (11) is

|\hat{V}(\mathcal{E}_T^{(K)}; \Theta) - \hat{V}(\mathcal{E}_T^{(K)}; \Theta, \lambda)|
= \Big| \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \gamma^t r_t \, p(a_{0:t}, o_{1:t}, r_t \,|\, y_{0:t} = 0, \Theta, M) \big(1 - p(y_{0:t} = 0 \,|\, \Theta, \lambda)\big)
\;-\; \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \gamma^t R_{\max} \sum_{y_{0:t} \neq 0} p(a_{0:t}, o_{1:t}, r_t \,|\, y_{0:t}, \Theta, M, \Pi_e) \, p(y_{0:t} \,|\, \Theta, \lambda) \Big|
= \Big| \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \gamma^t \sum_{y_{0:t} \neq 0} \big[ r_t \, p(a_{0:t}, o_{1:t}, r_t \,|\, y_{0:t} = 0, \Theta, M) - R_{\max} \, p(a_{0:t}, o_{1:t}, r_t \,|\, y_{0:t}, \Theta, M, \Pi_e) \big] p(y_{0:t} \,|\, \Theta, \lambda) \Big|
\leq \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \gamma^t R_{\max} \sum_{y_{0:t} \neq 0} p(y_{0:t} \,|\, \Theta, \lambda)
= \sum_{\mathcal{E}_T^{(K)}} \sum_{t=0}^{T} \gamma^t R_{\max} \big(1 - p(y_{0:t} = 0 \,|\, \Theta, \lambda)\big)
\leq \sum_{\mathcal{E}_T^{(K)}} \big(1 - p(y_{0:T} = 0 \,|\, \Theta, \lambda)\big) \sum_{t=0}^{T} \gamma^t R_{\max}
\leq \frac{R_{\max}}{1 - \gamma} \sum_{\mathcal{E}_T^{(K)}} \big(1 - p(y_{0:T} = 0 \,|\, \Theta, \lambda)\big)

where \sum_{y_{0:t} \neq 0} is a sum over all sequences {y_{0:t} : ∃ 0 ≤ τ ≤ t satisfying y_τ ≠ 0}; the inequalities use |r_t| ≤ R_{\max}, the fact that the probability terms are bounded by one, and p(y_{0:t} = 0 | Θ, λ) ≥ p(y_{0:T} = 0 | Θ, λ) for t ≤ T. Q.E.D.

References

[1] M. J. Beal. Variational Algorithms for Approximate Bayesian Inference. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.

[2] R. I. Brafman and M. Tennenholtz. 
R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.

[3] F. Doshi, J. Pineau, and N. Roy. Reinforcement learning with limited reinforcement: Using Bayes risk for active learning in POMDPs. In Proceedings of the 25th International Conference on Machine Learning, pages 256–263. ACM, 2008.

[4] M. Kearns and D. Koller. Efficient reinforcement learning in factored MDPs. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 740–747, 1999.

[5] M. Kearns and S. P. Singh. Near-optimal performance for reinforcement learning in polynomial time. In Proc. ICML, pages 260–268, 1998.

[6] H. Li, X. Liao, and L. Carin. Multi-task reinforcement learning in partially observable stochastic environments. Journal of Machine Learning Research, 10:1131–1186, 2009.

[7] M. L. Littman, A. R. Cassandra, and L. P. Kaelbling. Learning policies for partially observable environments: scaling up. In ICML, 1995.

[8] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), pages 1025–1032, August 2003.

[9] P. Poupart and N. Vlassis. Model-based Bayesian reinforcement learning in partially observable domains. In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2008.

[10] R. Sutton and A. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.