{"title": "Explicit Planning for Efficient Exploration in Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 7488, "page_last": 7497, "abstract": "Efficient exploration is crucial to achieving good performance in reinforcement learning. Existing systematic exploration strategies (R-MAX, MBIE, UCRL, etc.), despite being promising theoretically, are essentially greedy strategies that follow some predefined heuristics. When the heuristics do not match the dynamics of Markov decision processes (MDPs) well, an excessive amount of time can be wasted in travelling through already-explored states, lowering the overall efficiency. We argue that explicit planning for exploration can help alleviate such a problem, and propose a Value Iteration for Exploration Cost (VIEC) algorithm which computes the optimal exploration scheme by solving an augmented MDP. We then present a detailed analysis of the exploration behaviour of some popular strategies, showing how these strategies can fail and spend O(n^2 md) or O(n^2 m + nmd) steps to collect sufficient data in some tower-shaped MDPs, while the optimal exploration scheme, which can be obtained by VIEC, only needs O(nmd), where n, m are the numbers of states and actions and d is the data demand. 
The analysis not only points out the weakness of existing heuristic-based strategies, but also suggests a remarkable potential in explicit planning for exploration.", "full_text": "Explicit Planning for Efficient Exploration in Reinforcement Learning\n\nLiangpeng Zhang1, Ke Tang2, and Xin Yao2,1*\n\n1CERCIA, School of Computer Science, University of Birmingham, U.K.\n2Shenzhen Key Laboratory of Computational Intelligence, University Key Laboratory of Evolving Intelligent Systems of Guangdong Province, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China\nL.Zhang.7@pgr.bham.ac.uk, tangk3@sustc.edu.cn, xiny@sustc.edu.cn\n\nAbstract\n\nEfficient exploration is crucial to achieving good performance in reinforcement learning. Existing systematic exploration strategies (R-MAX, MBIE, UCRL, etc.), despite being promising theoretically, are essentially greedy strategies that follow some predefined heuristics. When the heuristics do not match the dynamics of Markov decision processes (MDPs) well, an excessive amount of time can be wasted in travelling through already-explored states, lowering the overall efficiency. We argue that explicit planning for exploration can help alleviate such a problem, and propose a Value Iteration for Exploration Cost (VIEC) algorithm which computes the optimal exploration scheme by solving an augmented MDP. We then present a detailed analysis of the exploration behaviour of some popular strategies, showing how these strategies can fail and spend O(n^2 md) or O(n^2 m + nmd) steps to collect sufficient data in some tower-shaped MDPs, while the optimal exploration scheme, which can be obtained by VIEC, only needs O(nmd), where n, m are the numbers of states and actions and d is the data demand. 
The analysis not only points out the weakness of existing heuristic-based strategies, but also suggests a remarkable potential in explicit planning for exploration.\n\n1 Introduction\n\nIn reinforcement learning (RL), exploration plays a key role in deciding the quality of data and thus has a direct impact on the overall performance. Simple exploration strategies such as ε-greedy may need exponentially many steps to find a (near-)optimal policy [1]. On the other hand, more systematic exploration strategies (R-MAX, UCRL, MBIE and their variants) have far more promising theoretical performance guarantees (see e.g. [2, 3, 4, 5]). Recently, some of these systematic strategies have been successfully generalised and applied to deep reinforcement learning, achieving good performance in domains that are known to be hard to explore, such as Montezuma's Revenge [6, 7].\n\nSystematic exploration strategies are carefully designed to ensure that sufficient data is collected for every unknown state, so that the chance of converging to undesirable policies due to ignorance is controlled. Unfortunately, the actual data collection process is less carefully executed, in the sense that these strategies choose actions simply by maximising some predefined heuristics. When the design of such heuristics does not match the properties of the learning problem well, an excessive amount of less useful data will be collected due to revisiting well-explored states/actions.\n\n*The corresponding author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nA straightforward example is as follows. Suppose both a nearby Area1 and a distant Area2 need to be explored. The transition dynamics makes it easy to travel from Area2 to Area1, but trying to move from Area1 to Area2 sends the agent back to Area1 with high probability. 
Clearly, exploring in the order of Area2 → Area1 is better than Area1 → Area2, since the latter wastes additional time in trying to travel to Area2 from Area1, which leads to excessive data being collected in Area1. However, most systematic strategies choose to explore Area1 first because it is nearer than Area2 and thus has a higher heuristic score. We call this a distance trap.\n\nOur analysis in this paper points out that there exist cases where these heuristic-based strategies need either O(n^2 md) or O(n^2 m + nmd) steps to collect sufficient data, while an optimal exploration scheme only needs O(nmd), where n, m, and d denote the number of states, the number of actions, and the minimum amount of data to be obtained at each state-action pair, respectively. Since n is usually very large in real-world problems, this result indicates that a significant number of steps can be wasted by the heuristic-based strategies due to their careless execution of data collection. It also suggests that explicit planning for exploration can be highly beneficial for improving learning efficiency.\n\nThe contributions of this paper are as follows.\n\n1. Formulate the planning for exploration problem as an augmented undiscounted Markov decision process and show that the optimal exploration scheme can be discovered by solving the Bellman optimality equations for exploration costs.\n\n2. Propose a Value Iteration for Exploration Cost (VIEC) algorithm for finding the optimal exploration scheme.\n\n3. 
Point out two weaknesses of existing systematic exploration strategies, (a) distance traps and (b) reward traps, and use tower MDPs as examples to give a concrete explanation of how existing strategies can fail and need O(n^2 md) or O(n^2 m + nmd) steps while the optimal exploration scheme needs only O(nmd) steps to fulfil the same exploration demand.\n\n2 Preliminaries\n\nIn this paper we follow the common formulation of reinforcement learning [8] in which M = (S, A, P, R, γ) represents a finite discounted Markov decision process (MDP) with set of states S, set of actions A, transition probability function P, reward function R, and discount factor γ. Unless otherwise stated, we use n and m to denote the number of states and actions of an MDP. A policy is denoted π and its value functions are denoted V^π(s) and Q^π(s, a), while for the optimal policy we write π*, V* and Q*, which by definition satisfy V*(s) = max_π V^π(s) and Q*(s, a) = max_π Q^π(s, a) for all s ∈ S and a ∈ A. If exact information about M is available, then π* can be obtained by solving the Bellman equations V*(s) = max_a (E[R(s, a)] + γ Σ_{s'} P(s'|s, a) V*(s')) or Q*(s, a) = E[R(s, a)] + γ Σ_{s'} P(s'|s, a) max_{a'} Q*(s', a') using the Value Iteration algorithm [9].\n\nIn reality M is often unknown and needs to be estimated from the data collected during learning. A straightforward way is to use P̂(s'|s, a) = N(s, a, s')/N(s, a) and R̂(s, a) = C(s, a)/N(s, a) as estimates of P(s'|s, a) and E[R(s, a)], where N(s, a) and N(s, a, s') indicate the occurrences of choice (s, a) and transition (s, a, s') and C(s, a) is the sum of the rewards collected at (s, a). 
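These count-based estimates can be sketched directly; a minimal illustration assuming NumPy count arrays, where the array names and shapes are our own choices rather than notation from the paper:

```python
import numpy as np

def estimate_model(N_sas, C_sa):
    """Maximum-likelihood model estimates from visit counts.

    N_sas[s, a, s2] -- number of observed transitions (s, a) -> s2
    C_sa[s, a]      -- sum of rewards collected at (s, a)
    Returns (P_hat, R_hat) with P_hat[s, a, s2] estimating P(s2|s, a)
    and R_hat[s, a] estimating E[R(s, a)]; unvisited pairs stay zero.
    """
    N_sa = N_sas.sum(axis=2)          # visits to each (s, a)
    visited = N_sa > 0
    P_hat = np.zeros_like(N_sas, dtype=float)
    P_hat[visited] = N_sas[visited] / N_sa[visited, None]
    R_hat = np.zeros_like(C_sa, dtype=float)
    R_hat[visited] = C_sa[visited] / N_sa[visited]
    return P_hat, R_hat
```

Leaving unvisited pairs at zero is one arbitrary choice; optimistic initialisation, as used by the strategies analysed later, is another.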
As N(s, a) → ∞ at all (s, a), this model M̂ of M converges in probability to the true M, and thus we can eventually obtain π* of M from M̂. Such a process is called model-based RL.\n\nResearch on systematic exploration is often based on model-based RL, so that the quality of learning is mostly decided by the exploration strategy. This paper follows this idea and limits its scope to the model-based case, but its general suggestion (explicit planning for exploration can be beneficial) is also applicable to model-free RL.\n\n3 Formulation of the Planning for Exploration Problem\n\n3.1 Data demands\n\nSince the goal of learning is to find a sufficiently good policy rather than to have an extremely accurate estimate of V or Q, a finite amount of data is often sufficient for the purpose. Various works have shown that by applying Hoeffding's or Chernoff's inequalities, the minimum amount of data needed at each state-action pair for guaranteeing certain learning quality can be derived. For example, [2, 10] proved that some O((1/(ε^2 (1−γ)^4)) (n + ln(nm/δ))) data for each state-action pair is sufficient for R-MAX to be (ε, δ)-PAC, while [4] proved that for MBIE it is O((1/(ε^2 (1−γ)^4)) (n + ln(nm/(ε(1−γ)δ)))), where n and m are the number of states and actions.\n\nIn practice, the theoretical demands of this kind are still likely to be excessive (see e.g. [11, 12, 13]), and users usually have to specify how much data to collect based on their domain knowledge or trial-and-error. Whichever the case, the main idea is that such data demands are given (either directly or indirectly) by the parameter settings prior to the actual learning process, and thus can be used to make plans for more efficient exploration. 
The formal definition of data demands is as follows.\n\nDefinition 3.1 In an MDP with n states and m actions, a demand matrix D is an n × m matrix in which entry D[s, a] = k ≥ 0 indicates that at least k more data should be collected for state-action pair (s, a) during learning.\n\nWe write Dt to indicate the demand matrix at time t during learning. After some action At is executed at some state St, the corresponding entry in the demand matrix should be decreased by 1 unless it is already 0, while other entries remain unchanged, that is,\n\nD_{t+1}[s, a] = max{0, D_t[s, a] − 1} if (s, a) = (S_t, A_t), and D_{t+1}[s, a] = D_t[s, a] otherwise.\n\nFor convenience, we define the demand reduction function H as follows:\n\nH(D; s, a) := D − e_{s,a} if D[s, a] > 0, and H(D; s, a) := D if D[s, a] = 0,\n\nwhere e_{s,a} is an n × m matrix filled with 0 except for the only nonzero entry e_{s,a}[s, a] = 1. Then we can express the change of Dt after (St, At) simply as D_{t+1} = H(D_t; S_t, A_t).\n\nThe demand space (the set of all possible demand matrices) of an MDP is denoted D. It is reasonable to assume that the demand at every state-action pair never exceeds some sufficiently large positive integer d, thus the size of the demand space is at most (d + 1)^{nm}.\n\nRemark. Readers may wonder how to find the "optimal" demand matrix (that has e.g. the least total demand) for a given learning task. Such a matrix can only be obtained with full knowledge of the MDP, and thus is impractical to obtain in reality. Our point is that given any demand matrix, the exploration efficiency can be improved via planning. It is achieved by minimising the amount of data collected beyond the specified demand (i.e. 
optimal exploration scheme, see the next section) rather than choosing a better demand matrix, and thus the optimality of demand matrices is not the main concern of this paper.\n\n3.2 Planning for exploration\n\nThe demand matrix D indicates how much data is sufficient for obtaining a good policy, and we are interested in collecting all of this required data in as few steps as possible, since this means that the least amount of unnecessary data is collected beyond D. The exploration behaviour of a learning agent can be described as an exploration scheme, while its exploration cost is the expected number of steps needed to fulfil all the demands, defined formally as follows.\n\nDefinition 3.2 An exploration scheme ψ is a mapping D × S → A, where ψ(D; s) = a indicates that action a should be taken at state s when the demand matrix is D.\n\nDefinition 3.3 The exploration cost C^ψ(D; s, a) is the expected time t at which the current demand Dt first becomes the all-zero matrix 0 by starting from (s, a) and following ψ, i.e. C^ψ(D; s, a) := E[inf{t : D_{t+1} = 0 | D_1 = D, S_1 = s, A_1 = a, A_k = ψ(D_k; S_k) ∀k > 1}].\n\nGiven MDP M and exploration scheme ψ, the interaction process becomes a Markov process with augmented state space D × S and transition probability Pr(D', s'|D, s) = P(s'|s, ψ(D; s)) for D' = H(D; s, ψ(D; s)) and 0 otherwise. As for the exploration cost, by definition when D = 0 we have C^ψ(D; s, a) = 0 for any (s, a). Any step after Dt = 0 will not result in any exploration cost, while each step before reaching Dt = 0 will increase the cost by 1 uniformly. 
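The demand update of Section 3.1 can be written as a small helper; a minimal sketch assuming a NumPy demand matrix, a representation of our own choosing:

```python
import numpy as np

def H(D, s, a):
    """Demand reduction H(D; s, a): decrease D[s, a] by one, never below zero.

    Returns a new demand matrix (D - e_{s,a} when D[s, a] > 0); D itself
    is left unmodified, so demand matrices can be used as planning states.
    """
    D_next = D.copy()
    if D_next[s, a] > 0:
        D_next[s, a] -= 1
    return D_next
```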
Therefore, the planning for exploration problem is an augmented undiscounted MDP, and the following Bellman equation holds for the exploration cost:\n\nC^ψ(D; s, a) = 1 + Σ_{s'∈S} P(s'|s, a) C^ψ(H(D; s, a); s', ψ(H(D; s, a); s')) if D ≠ 0, and C^ψ(D; s, a) = 0 if D = 0. (1)\n\nLet Ψ be the set of all possible exploration schemes for a given MDP. Since a lower exploration cost is more desirable, the definition of the optimal scheme is as follows.\n\nDefinition 3.4 An optimal exploration scheme ψ* ∈ Ψ is one that satisfies C^{ψ*}(D; s, a) = min_{ψ∈Ψ} C^ψ(D; s, a) for any D ∈ D, s ∈ S and a ∈ A.\n\nFor convenience we write the optimal exploration cost C^{ψ*} simply as C*. In strongly connected MDPs, it can be shown that, similar to the optimal value functions Q* and V*, the optimal exploration cost C* exists and is unique. In MDPs that are not strongly connected, on the other hand, there exist cases where some demands are not satisfiable. For example, in an MDP with two states {s1, s2} and one action a which transits the agent to s2 with probability 1 from both s1 and s2, a demand D[s1, a] > 1 can never be satisfied and will lead to an infinite exploration cost. 
However, as discussed in Section 3.1, since users more or less have control over the exploration demands, in the rest of this paper we assume that they do not assign unsatisfiable demands and thus C* exists.\n\n3.3 Computing ψ*\n\nBy combining Equation 1 with Definition 3.4 we get the Bellman optimality equation for C*:\n\nC*(D; s, a) = 1 + Σ_{s'∈S} P(s'|s, a) min_{a'∈A} C*(H(D; s, a); s', a') if D ≠ 0, and C*(D; s, a) = 0 if D = 0. (2)\n\nSince this equation has a structure similar to the original Bellman optimality equation for Q* and V*, we can modify Value Iteration to compute C*. Note that H(D; s, a) ≤ D for any D, s and a. Given an input demand matrix D_in, we can easily arrange all k = Π_{s,a} (D_in[s, a] + 1) demand matrices satisfying D ≤ D_in in a topological ordering, i.e. D^(0) = (0...0; ...; 0...0), D^(1) = (0...0; ...; 0...1), ..., D^(k−1) = D_in, and compute C* from D^(0) to D^(k−1) to avoid extra iterations over D.\n\nThe pseudocode of Value Iteration for Exploration Cost (VIEC) is presented in Algorithm 1, where U(D; s) := min_a C(D; s, a), which plays a role similar to V(s) in computing Q(s, a).\n\nAlgorithm 1 Value Iteration for Exploration Cost (VIEC)\nInput: Demand matrix D_in, transition P\nOutput: Exploration scheme ψ\n1: Initialise all C(D; s, a) = 0, U(D; s) = 0\n2: for i = 1 to Π_{s,a} (D_in[s, a] + 1) − 1 do\n3:   repeat\n4:     Δ = 0\n5:     for s ∈ S do\n6:       for a ∈ A do\n7:         c = 1 + Σ_{s'} P(s'|s, a) U(H(D^(i); s, a); s')\n8:         Δ = max{Δ, |C(D^(i); s, a) − c|}\n9:         C(D^(i); s, a) = c\n10:      U(D^(i); s) = min_a C(D^(i); s, a)\n11:   until Δ < threshold\n12: Output ψ such that ψ(D; s) = argmin_a C(D; s, a)\n\nSimilar to the original Value Iteration, with a sufficiently small stopping threshold in 
Line 11, C converges to C* and thus the output ψ → ψ*. The proof can be obtained straightforwardly from the convergence proof of the original Value Iteration and will not be elaborated here.\n\nVIEC needs to iterate over Π_{s,a} (D_in[s, a] + 1) = O(d^{nm}) demand matrices and is not computationally efficient in practice. Unfortunately, this is unavoidable for computing ψ* because even in the simplest case with deterministic transitions and demands no more than 1, it is a Rural Postman Problem [14], which is NP-hard, thus solving it in polynomial time is impossible unless P = NP. Nevertheless, an approximation to ψ* might be sufficient for the purpose, and we leave this to future work.\n\nIn most RL settings, the transition function P is not known to the learning algorithm a priori. In this case, one possible choice is to use the estimated transition P̂ instead, following the iterative process shown in Algorithm 2. In this case, the output ψ of VIEC is an optimal exploration scheme for the environment model M̂ rather than the true environment M. With more data being collected, M̂ gets closer to M and thus ψ becomes closer to the true ψ*. 
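For very small MDPs, Algorithm 1 can be sketched directly from Equation 2. This is an illustrative implementation under our own representation choices (demand matrices as flattened tuples keyed into dictionaries); the inner loop mirrors the repeat-until sweep of Lines 3-11:

```python
import itertools
import numpy as np

def viec(D_in, P, tol=1e-9, max_sweeps=10_000):
    """Value Iteration for Exploration Cost (after Algorithm 1), for tiny MDPs.

    D_in : (n, m) integer array of remaining data demands.
    P    : (n, m, n) array with P[s, a, s2] = P(s2|s, a).
    Returns a dict C mapping (demand_tuple, s, a) -> expected exploration cost.
    """
    n, m = D_in.shape

    def H(D, s, a):  # demand reduction on a flattened demand tuple
        D = list(D)
        if D[s * m + a] > 0:
            D[s * m + a] -= 1
        return tuple(D)

    # Topological ordering of all demand matrices D <= D_in (Section 3.3):
    # sorting by total remaining demand is one valid linear extension.
    ranges = [range(int(k) + 1) for k in D_in.flatten()]
    demands = sorted(itertools.product(*ranges), key=sum)

    C, U = {}, {}
    for D in demands:
        for s in range(n):
            U[D, s] = 0.0
        if all(v == 0 for v in D):
            for s in range(n):
                for a in range(m):
                    C[D, s, a] = 0.0       # no demand left: zero cost
            continue
        for _ in range(max_sweeps):        # repeat ... until delta < tol
            delta = 0.0
            for s in range(n):
                for a in range(m):
                    D2 = H(D, s, a)
                    c = 1.0 + sum(P[s, a, s2] * U[D2, s2] for s2 in range(n))
                    delta = max(delta, abs(C.get((D, s, a), 0.0) - c))
                    C[D, s, a] = c
                U[D, s] = min(C[D, s, a] for a in range(m))
            if delta < tol:
                break
    return C
```

On a deterministic two-state cycle with a single action and one datum demanded at the first state-action pair, the computed costs are 1 starting from that pair and 2 starting from the other state, as Equation 2 prescribes.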
While a ψ that improves over time is surely not as good as ψ*, it is still better than never conducting planning, and thus Algorithm 2 should provide a relatively efficient way of exploration in general.\n\nAlgorithm 2 Model-based RL with Planning for Exploration\nInput: Initial demand D1\nOutput: Policy π\n1: Initialise P̂, R̂ randomly or based on prior knowledge\n2: ψ = VIEC(D1, P̂)\n3: repeat\n4:   Collect data by following ψ\n5:   Update P̂, R̂, Dt using the collected data\n6:   Update ψ using VIEC(Dt, P̂)\n7:   Update π using Value Iteration(P̂, R̂)\n8: until Dt = 0 or π is sufficiently good\n\n4 When and How Heuristics Fail and Explicit Planning Helps\n\nSystematic exploration strategies choose actions by maximising some predefined heuristic Q̃(s, a), which makes them prone to the following traps. Suppose at the current state St and demand Dt there are actions a1, a2 satisfying C(Dt; St, a1) < C(Dt; St, a2), so one should choose a1 over a2.\n\nDistance traps. Let the nearest (in terms of the expected number of steps to arrive) to-be-explored states after taking a1 and a2 be s' and s'' respectively. If s'' is closer to St than s', then the uncertainty of s'' is less discounted than that of s' in Q̃, resulting in Q̃(St, a1) < Q̃(St, a2), and thus a2 is picked.\n\nReward traps. 
Let the rewards (or expected returns) of taking a1 and a2 be r' and r'', respectively. Then r' < r'' can lead to Q̃(St, a1) < Q̃(St, a2), and thus a2 is picked.\n\nThese traps can appear in any MDP and significantly reduce the efficiency of heuristic-based strategies. To present this more clearly and intuitively, we introduce a class of MDPs called tower MDPs, analyse the behaviours and exploration costs of several typical exploration strategies and the optimal exploration scheme in tower MDPs, and then discuss the implications of the results.\n\n4.1 Tower MDPs\n\nA tower MDP of height h has two groups of states, namely upward states s1, ..., sh and downward states s'1, ..., s'h. The total number of states is n = 2h. The agent always starts interaction from s1. An example with height h = 5 is shown in Figure 1.\n\nThe transitions are deterministic in tower MDPs. Each upward state sk has an action a' that transits the agent to s'k (dashed arrows in Figure 1), and also an action a that transits to sk+1 if k < h (solid arrows). Each downward state s'k is an m-armed bandit, which has m actions a1, ..., am that yield rewards following some predefined distributions and transit the agent to s'_{k−1} (k > 1) or s1 (k = 1) (collectively drawn as the double arrows in Figure 1).\n\nTo find the optimal policy, the agent has to collect data in these bandits for information about their reward distributions. For simplicity we assume that the initial demands at each of these m-armed\n\n[Figure 1 graphic: upward states 1-5 connected by solid arrows (action a), with dashed arrows (action a') to downward states 5'-1', and double arrows for the bandit actions a1, ..., am]\n\nFigure 1: A tower MDP of height h = 5. 
Each double arrow represents an m-armed bandit.\n\nbandits are uniformly set to some d > 0. As for a and a' in the upward states, since there is no uncertainty at all, their initial demands are set to 0.\n\n4.2 Optimal exploration scheme\n\nIn a tower MDP of height h with m-armed bandits in the downward states, it is easy to see that the optimal scheme to collect d data at each arm is to repeatedly take the closed path [s1 s2 ... sh s'h ... s'1 s1]. Each time this path is taken, the demand of one arm at every downward state is reduced by 1, and thus it needs to be repeated md times to collect all the data required. Since the length of this path is 2h, the optimal exploration scheme needs 2hmd = O(nmd) steps to fully satisfy the initial demands.\n\n4.3 ε-greedy\n\nAlthough ε-greedy is already well-known for its lack of efficiency, it is nevertheless interesting to see how it performs in tower MDPs. Let the bandit in s'1 give a reward of 1 with probability 1 on all of its m arms, let am in s'h give reward 10^10 with probability 0.01 and reward 0 otherwise, and let all other bandits/arms give zero reward. At the beginning of learning, ε-greedy does not know any of these rewards, and thus has a 50-50 chance to choose between going to state s2 and s'1. If it chooses s2, then it has another 50-50 chance between s3 and s'2, and so on. Therefore, the probability that it arrives at sh without visiting any of s'1, ..., s'_{h−1} is 0.5^{h−1}. If it ever goes to any of the states s'1, ..., s'_{h−1} before arriving at sh, which happens with probability 1 − 0.5^{h−1}, it will become aware of the reward at s'1, and thereafter be trapped into going to s'1 as often as possible. 
Whenever it gets back to s1, it only has probability (0.5ε)^{h−1} to randomly wander into s'h.\n\nTherefore, the average number of steps ε-greedy spends to visit s'h once is 0.5^{h−1} · 2h + (1 − 0.5^{h−1}) O(h/(0.5ε)^{h−1}) = O(n 2^n) if ε is seen as a constant. Since it needs to visit s'h md times to fully fulfil the demands, the exploration cost of ε-greedy is O(nmd 2^n).\n\n4.4 R-MAX\n\nR-MAX [15] is one of the first systematic strategies proved to have polynomial sample complexity upper bounds [2, 10]. Many exploration strategies are designed based on R-MAX and have similar performance guarantees, including Delayed Q-learning [16], MoR-MAX [17], V-MAX [18], and ICR [13], just to name a few.\n\nR-MAX works as follows. When a state-action pair has a positive demand to fulfil, it is labelled "unknown" and its estimated value Q̃(s, a) is set to Vmax := Rmax/(1 − γ), where Rmax is the maximum possible reward. If its demand is already 0, then it is labelled "known" and the algorithm uses the Bellman equation to estimate its Q̃(s, a). R-MAX always chooses the action with the maximum Q̃(s, a).\n\nIn tower MDPs, all actions in the downward states are initially "unknown" and thus their Q̃ = Vmax at the beginning of learning. Let the bandits at all states except s'h give zero reward, while the bandit at s'h gives reward Rmax = 1 with probability 0.1 and reward 0 otherwise for all arms. Under such a setting, R-MAX will not become aware of any positive rewards until s'h is explored. It can be shown recursively that at this stage of learning, at any upward state sk, R-MAX will choose a' to go to s'k rather than a that goes to sk+1. 
Concretely, at sh, the only choice is a', which leads to the "unknown" actions in s'h and thus has Q̃(sh, a') = γVmax. At state s_{h−1}, going to sh has value Q̃(s_{h−1}, a) = γ Q̃(sh, a') = γ^2 Vmax, while going to s'_{h−1} has Q̃(s_{h−1}, a') = γVmax > Q̃(s_{h−1}, a), thus R-MAX will choose a' at s_{h−1} as well. The same happens at every state from s_{h−1} down to s1. Since the agent starts from state s1, R-MAX will stick to [s1 s'1 s1] until all of a1, ..., am at s'1 are tried d times and become "known". After collecting sufficient data at state s'1, Q̃(s1, a') drops greatly from γVmax to γ^4 Vmax and R-MAX starts choosing a at s1. Since the Q̃ at states other than s1 and s'1 remain unchanged, s'2 is the next target of exploration due to having the least discount in Q̃. This leads to a behaviour of taking [s1 s2 s'2 s'1 s1] to collect at s'2, then [s1 s2 s3 s'3 s'2 s'1 s1] for s'3, and so on, and finally s'h. The exploration cost of such a process is 2md + 4md + ... + (2h)md = h(h + 1)md = O(n^2 md).\n\n4.5 Interval estimation\n\nInterval estimation (IE) based exploration strategies utilise statistical methods to create confidence intervals (CIs) for the estimated models or state/action values. CIs computed by this type of strategy usually take the form of X(s, a) ± β/√N(s, a), where X(s, a) is the variable being estimated, β is a parameter, and N(s, a) is the amount of data collected at (s, a). Clearly, state-action pairs with less data have longer CIs, and vice versa. 
The estimated variable X(s, a) can be a transition probability, a reward, or a state/action value. When choosing actions, the action with the highest estimated value among all possible MDP models that lie within the CIs is selected.\n\nIn this section we take MBIE-EB as an example to show how IE-based strategies can be tricked into making inferior decisions. In MBIE-EB, action values are estimated using Q̃(s, a) = R̃(s, a) + γ Σ_{s'} P̂(s'|s, a) max_{a'} Q̃(s', a'), where R̃(s, a) := R̂(s, a) + β/√N(s, a). Since N(s, a) = 0 leads to division by zero, in the following analysis we assume that all N(s, a) start at 1. At each step the action with the highest Q̃(St, a) is executed, thus Q̃ is the heuristic used in MBIE-EB.\n\nWe start our analysis with the simplest case m = 1, where all bandits in the tower MDP are one-armed. The expression of Q̃ can be obtained by solving the Bellman equation. Note that although the max operator is involved on all state-action pairs, the algorithm is essentially choosing between the paths [s1 ... sj s'j ... s'1 s1] with different j. Let Q̃j be the Q̃ for the j-th path, R̃j be the R̃ for the bandit at s'j, and Nj be N(s, a) at that bandit; then we have Q̃j = (γ^j/(1 − γ^{2j})) Σ_{i=1}^{j} γ^{j−i} R̃i.\n\nLet the actual rewards of the bandits be the same as in the settings used in the analysis of R-MAX. At the beginning of learning R̃j = β/√Nj = β, thus Q̃j = (β/(1 − γ))(1 − 1/(1 + γ^j)). Clearly Q̃1 > Q̃2 > ... > Q̃h, thus MBIE-EB starts with the path [s1 s'1 s1], which increases N1 and reduces R̃1.\n\nThe expression Q̃j = (γ^j/(1 − γ^{2j})) Σ_{i=1}^{j} γ^{j−i} R̃i shows that Q̃j with larger j has a greater discount γ^{j−i} on R̃i, and thus exploring s'1 reduces Q̃j less for larger j. 
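The closed form Q̃j = (β/(1 − γ))(1 − 1/(1 + γ^j)) and the initial ordering Q̃1 > Q̃2 > ... > Q̃h can be checked numerically; the values of β, γ and h below are arbitrary choices for illustration:

```python
import math

def q_tilde(j, R_tilde, gamma):
    """Path value Q~_j = (gamma^j / (1 - gamma^(2j))) * sum_i gamma^(j-i) R~_i."""
    s = sum(gamma ** (j - i) * R_tilde[i - 1] for i in range(1, j + 1))
    return gamma ** j / (1 - gamma ** (2 * j)) * s

beta, gamma, h = 1.0, 0.5, 5
R0 = [beta] * h                       # before any data: R~_i = beta / sqrt(1)

for j in range(1, h + 1):             # closed form matches the series
    closed = beta / (1 - gamma) * (1 - 1 / (1 + gamma ** j))
    assert math.isclose(q_tilde(j, R0, gamma), closed)

qs = [q_tilde(j, R0, gamma) for j in range(1, h + 1)]
assert all(qs[k] > qs[k + 1] for k in range(h - 1))   # Q~_1 > ... > Q~_h
```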
Therefore, Q̃2 will eventually surpass Q̃1 and MBIE-EB moves to exploring s'2, then s'3, and so on, leading to an exploration behaviour similar to that of R-MAX, but it lingers less at the same state than R-MAX does. A smaller discount factor γ leads to a larger gap between the different Q̃j, which then leads to a slower pace for MBIE-EB to move upward. In the case where MBIE-EB lingers exactly once at each level of the tower, it will take the path [(s1 s'1)(s1 s2 s'2 s'1)(s1 s2 s3 s'3 s'2 s'1)...] until sh is reached. Thereafter Q̃h will always be the largest, and thus the remaining demand will be fulfilled by repeating [s1 ... sh s'h ... s'1 s1]. Such behaviour has exploration cost (2 + 4 + ... + 2h) + 2h(d − 1) = h(h + 1) + 2h(d − 1) = O(n^2 + nd) steps. For the sake of space we skip the full derivation here², but a γ < (√5 − 1)/2 ≈ 0.618 is sufficient to make sure that MBIE-EB performs as badly as this³.\n\n²The proof will be given in the online supplementary material.\n³Note that γ < 0.618 can effectively be achieved by inserting additional dummy states into all transitions, e.g. if γ = 0.9, by inserting 4 states between all transitions the discount becomes 0.9^5 ≈ 0.59 < 0.618.\n\nStrategy            | Exploration cost | Weakness\nOptimal scheme      | O(nmd)           | -\nε-greedy            | O(nmd 2^n)       | Distance, reward\nR-MAX               | O(n^2 md)        | Distance\nInterval estimation | O(n^2 m + nmd)   | Distance, reward\n\nTable 1: Summary of results on tower MDPs.\n\nIn the case of m ≥ 2, where there are 2 or more arms in each bandit, R̃j in the expression of Q̃j becomes the maximum of β/√N_{j,k} over the arms k, where N_{j,k} is the number of data at the k-th arm at state s'j. 
As a result, the same pattern as in m = 1 is repeated m times in the case of m ≥ 2, and thus the total exploration cost is O(n^2 m + nmd).\n\nNote that MBIE-EB and other IE-based exploration strategies also take R̂(s, a) into consideration when choosing actions, and thus can be further tricked by a deceptive setting of the true rewards R(s, a). For example, if the setting of rewards in Section 4.3 is used, then more weight will be put on R̃1, which gives the Q̃j with smaller j more advantage due to having a smaller discount on R̃1. As a result, MBIE-EB will stay at the lower levels more often and thus will have a worse exploration cost than above.\n\n4.6 Discussion\n\nTable 1 sums up the results of the analysis. As can be seen from the exploration cost column, ε-greedy is clearly inferior to the rest, being exponential in the number of states n. MBIE-EB is seemingly better than R-MAX, but since in reality it often happens that n ≫ d, the difference between the two can be small, and both are far worse than the optimal scheme, which is only O(nmd). Such results suggest that explicit planning for exploration can be highly beneficial when the state space is large.\n\nIt is interesting to compare the exploration costs with sample complexity bounds, a well-studied exploration efficiency metric. R-MAX and MBIE-EB have sample complexity upper bounds O(n^2 m) (ignoring other factors and logarithms) [2, 4], which is similar to the exploration costs O(n^2 md) in n and m. However, a variant of R-MAX called MoR-MAX is known to have sample complexity O(nm) [17], yet its exploration cost in tower MDPs is still O(n^2 md) due to having exactly the same behaviour as R-MAX. 
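For concreteness, the closed-form tower-MDP costs derived above can be compared for one illustrative configuration; the sizes h, m, d are our own arbitrary choices, and the m ≥ 2 interval-estimation cost is taken as m repetitions of the m = 1 pattern, as in Section 4.5:

```python
# Closed-form tower-MDP exploration costs from Sections 4.2-4.5,
# evaluated for one illustrative configuration (n = 2h states).
h, m, d = 50, 4, 10
optimal = 2 * h * m * d                          # full loop repeated md times
rmax = h * (h + 1) * m * d                       # 2md + 4md + ... + 2hmd
mbie_eb = (h * (h + 1) + 2 * h * (d - 1)) * m    # m repeats of the m=1 pattern
assert optimal < mbie_eb < rmax
print(optimal, mbie_eb, rmax)                    # prints: 4000 13800 102000
```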
This might explain why sample complexity is usually not a good indicator of practical exploration efficiency.

The "distance" and "reward" entries in the weakness column of Table 1 refer to the distance traps and reward traps mentioned at the beginning of Section 4. A longer distance makes ε-greedy visit the higher-level states less often via random walk, while for R-MAX and IE algorithms a longer distance leads to more discounting and thus a lower heuristic score Q̃ for the higher-level states. Reward traps lure both ε-greedy and IE to the lower-level states, while R-MAX is more resistant to them due to using Vmax when computing Q̃. The optimal scheme is the result of minimising the undiscounted exploration cost and is affected by neither trap.

The tower MDPs in the above analysis use only deterministic transitions, for simplicity. In non-deterministic cases, the negative impact of distance traps can be even more severe, because transition probabilities amplify the gaps in average distances. For example, if taking a' at state s1 transits to s'1 with probability 1, while taking a at s1 goes to s2 with probability 0.5 and stays in s1 with probability 0.5, then the gap between Q̃1 and Q̃2 becomes larger and IE algorithms will take the path [s1 s'1 s1] more often, increasing the exploration cost.

MDPs in reality may not have the same structure as tower MDPs, but the distance traps and reward traps discussed above can occur in any type of MDP. It is possible that in some easier cases the difference between the optimal scheme and heuristic-based strategies is not as large as O(nmd) vs. O(n^2 md), but in domains where millions of data points are required to obtain an acceptable policy, even a difference in the constant factor can be practically significant.

Remark on the reward trap.
One may argue that whether or not the reward function actually acts as a trap is problem-dependent, and that there are cases where being trapped by the rewards is actually desirable, since it leads to early convergence to good policies.

This is partly true, but one should also consider the fact that it is very difficult, if not impossible, to design a reward function that simultaneously leads to both good policies and good exploration behaviours. If the way the agent is trapped does not coincide with the (near-)optimal policies, algorithms like MBIE eventually deviate from the current behaviour and restart exploration. In the early stages of learning, when data is still lacking, such situations can occur frequently, prolonging the whole learning process and reducing the total reward.

Therefore, even when the total reward during learning is of concern, ignoring it in the early stages of the learning process and seeking a more efficient exploration behaviour can be beneficial in the long run.

5 Conclusion and Future Work

In this paper we have formulated the planning-for-exploration problem as solving augmented MDPs, and provided the Bellman optimality equation for exploration costs. We have proposed a Value Iteration for Exploration Cost (VIEC) algorithm which computes the optimal exploration scheme given full knowledge of the MDP, together with a model-based RL method that integrates a planning-for-exploration component. We have presented a detailed study of the exploration behaviours of several popular exploration strategies. The analysis exposes the weaknesses of these heuristic-based strategies and suggests a remarkable potential in planning for exploration.

A possible direction for future work is to find a fast and sufficiently good approximation to VIEC.
As we pointed out in Section 3.3, since the demand space is exponential in the number of states, applying VIEC directly can be computationally expensive in practice. Techniques such as Prioritized Sweeping [19] may help reduce the computation involved, thus making VIEC more practically useful.

Another direction is to design better heuristic-based exploration strategies that can better handle the distance and reward traps discussed in Section 4. Although by the No Free Lunch theorem [20] no heuristic can perform universally better than others, it is nevertheless useful to have a larger toolbox of easy-to-compute heuristics that can cope with different types of MDPs.

Acknowledgments

This work was supported by EPSRC (Grant Nos. EP/J017515/1 and EP/P005578/1), the Royal Society (through a Newton Advanced Fellowship to Ke Tang, hosted by Xin Yao), the Program for Guangdong Introducing Innovative and Entrepreneurial Teams (Grant No. 2017ZT07X386), the Shenzhen Peacock Plan (Grant No. KQTD2016112514355531) and the Program for University Key Laboratory of Guangdong Province (Grant No. 2017KSYS008).

References

[1] Steven D. Whitehead. A complexity analysis of cooperative mechanisms in reinforcement learning. In Proceedings of AAAI 1991, pages 607–613, Palo Alto, CA, USA, 1991. AAAI Press.

[2] Sham M. Kakade. On the sample complexity of reinforcement learning. PhD thesis, University College London, London, U.K., 2003.

[3] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems 19, pages 49–56. MIT Press, Cambridge, MA, USA, 2006.

[4] Alexander L. Strehl and Michael L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.

[5] Tor Lattimore and Marcus Hutter. Near-optimal PAC bounds for discounted MDPs.
Theoretical Computer Science, 558:125–143, 2014.

[6] Marc Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Remi Munos. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pages 1471–1479, New York, USA, 2016. Curran Associates, Inc.

[7] Haoran Tang, Rein Houthooft, Davis Foote, Adam Stooke, Xi Chen, Yan Duan, John Schulman, Filip DeTurck, and Pieter Abbeel. #Exploration: A study of count-based exploration for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 2753–2762, 2017.

[8] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[9] Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, Hoboken, NJ, USA, 1994.

[10] Alexander L. Strehl, Lihong Li, and Michael L. Littman. Reinforcement learning in finite MDPs: PAC analysis. Journal of Machine Learning Research, 10:2413–2444, 2009.

[11] Alexander L. Strehl and Michael L. Littman. An empirical evaluation of interval estimation for Markov decision processes. In Tools with Artificial Intelligence (ICTAI), pages 128–135, New York, USA, 2004. IEEE.

[12] J. Zico Kolter and Andrew Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning, pages 513–520, New York, USA, 2009. ACM.

[13] Liangpeng Zhang, Ke Tang, and Xin Yao. Increasingly cautious optimism for practical PAC-MDP exploration. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 4033–4040, Palo Alto, CA, USA, 2015. AAAI Press.

[14] Horst A. Eiselt, Michel Gendreau, and Gilbert Laporte. Arc routing problems, part II: The rural postman problem. Operations Research, 43(3):399–414, 1995.

[15] Ronen I. Brafman and Moshe Tennenholtz.
R-max – a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

[16] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888, New York, USA, 2006. ACM.

[17] István Szita and Csaba Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of the 27th International Conference on Machine Learning, pages 1031–1038, New York, USA, 2010. ACM.

[18] Karun Rao and Shimon Whiteson. V-max: Tempered optimism for better PAC reinforcement learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, volume 1, pages 375–382, 2012.

[19] Harm van Seijen and Rich Sutton. Planning by prioritized sweeping with small backups. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 361–369, Atlanta, Georgia, USA, 17–19 Jun 2013. PMLR.

[20] David H. Wolpert and William G. Macready. No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation, 1(1):67–82, 1997.