{"title": "Multi-resolution Exploration in Continuous Spaces", "book": "Advances in Neural Information Processing Systems", "page_first": 1209, "page_last": 1216, "abstract": "The essence of exploration is acting to try to decrease uncertainty. We propose a new methodology for representing uncertainty in continuous-state control problems. Our approach, multi-resolution exploration (MRE), uses a hierarchical mapping to identify regions of the state space that would benefit from additional samples. We demonstrate MRE's broad utility by using it to speed up learning in a prototypical model-based and value-based reinforcement-learning method. Empirical results show that MRE improves upon state-of-the-art exploration approaches.", "full_text": "Multi-resolution Exploration in Continuous Spaces\n\nAli Nouri\n\nMichael L. Littman\n\nDepartment of Computer Science\n\nRutgers University\n\nPiscataway , NJ 08854\n\nDepartment of Computer Science\n\nRutgers University\n\nPiscataway , NJ 08854\n\nnouri@cs.rutgers.edu\n\nmlittman@cs.rutgers.edu\n\nAbstract\n\nThe essence of exploration is acting to try to decrease uncertainty. We propose\na new methodology for representing uncertainty in continuous-state control prob-\nlems. Our approach, multi-resolution exploration (MRE), uses a hierarchical map-\nping to identify regions of the state space that would bene\ufb01t from additional sam-\nples. We demonstrate MRE\u2019s broad utility by using it to speed up learning in a pro-\ntotypical model-based and value-based reinforcement-learning method. Empirical\nresults show that MRE improves upon state-of-the-art exploration approaches.\n\n1 Introduction\n\nExploration, in reinforcement learning, refers to the strategy an agent uses to discover new informa-\ntion about the environment. A rich set of exploration techniques, some ad hoc and some not, have\nbeen developed in the RL literature for \ufb01nite MDPs (Kaelbling et al., 1996). 
Using optimism in the\nface of uncertainty in combination with explicit model representation, some of these methods have\nled to the derivation of polynomial sample bounds on convergence to near-optimal policies (Kearns\n& Singh, 2002; Brafman & Tennenholtz, 2002). But, because they treat each state independently,\nthese techniques are not directly applicable to continuous-space problems, where some form of gen-\neralization must be used.\nSome attempts have been made to improve the exploration effectiveness of algorithms in continuous-\nstate spaces. Kakade et al. (2003) extended previous work of Kearns and Singh (2002) to metric\nspaces and provided a conceptual approach for creating general provably convergent model-based\nlearning methods. Jong and Stone (2007) proposed a method that can be interpreted as a practical\nimplementation of this work, and Strehl and Littman (2007) improved its complexity in the case that\nthe model can be captured by a linear function.\nThe performance metric used in these works demands near-optimal behavior after a polynomial\nnumber of timesteps with high probability, but does not insist on performance improvements be-\nfore or after convergence. Such \u201canytime\u201d behavior is encouraged by algorithms with regret\nbounds (Auer & Ortner, 2006), although regret-type algorithms have not yet been explored in\ncontinuous-state space problems to our knowledge.\nAs a motivating example for the work we present here, consider how a discrete state-space algorithm\nmight be adapted to work for a continuous state-space problem. The practitioner must decide how\nto discretize the state space. While \ufb01ner discretizations allow the learning algorithm to learn more\naccurate policies, they require much more experience to learn well. 
The dilemma of picking \ufb01ne\nor coarse resolution has to be resolved in advance using estimates of the available resources, the\ndynamics and reward structure of the environment, and a desired level of optimality. Performance\ndepends critically on these a priori choices instead of responding dynamically to the available re-\nsources.\n\n\fWe propose using multi-resolution exploration (MRE) to create algorithms that explore continuous\nstate spaces in an anytime manner without the need for a priori discretization. The key to this ideal\nis to be able to dynamically adjust the level of generalization the agent uses during the learning\nprocess. MRE sports a knownness criterion for states that allows the agent to reliably apply function\napproximation with different degrees of generalization to different regions of the state space.\nOne of the main contributions of this work is to provide a general exploration framework that can be\nused in both model-based and value-based algorithms. While model-based techniques are known for\ntheir small sample complexities, thanks to their smart exploration, they haven\u2019t been as successful\nas value-based methods in continuous spaces because of their expensive planning part. Value-based\nmethods, on the other hand, have been less fortunate in terms of intelligent exploration, and some\nof the very powerful RL techniques in continuous spaces, such as LSPI (Lagoudakis & Parr, 2003)\nand \ufb01tted Q-iteration (Ernst et al., 2005) are in the form of of\ufb02ine batch learning and completely\nignore the problem of exploration. In practice, an exploration strategy is usually incorporated with\nthese algorithms to create online versions. 
Here, we examine \ufb01tted Q-iteration and show how MRE\ncan be used to improve its performance over conventional exploration schemes by systematically\ncollecting better samples.\n\n2 Background\n\nWe consider environments that are modeled as Markov decision processes (MDPs) with continuous\nstate spaces (Puterman, 1994). An MDP M in our setting can be described as a tuple \u27e8S, A, T, R, \u03b3\u27e9,\nwhere S is a bounded measurable subspace of \u211d^k; we say the problem is k-dimensional as one can\nrepresent a state by a vector of size k and we use s(i) to denote the i-th component of this vector.\nA = {a1, ..., am} is the discrete set of actions. T is the transition function that determines the next\nstate given the current state and action. It can be written in the form st+1 = T(st, at) + \u03c9t,\nwhere st and at are the state and action at time t and \u03c9t is white noise drawn i.i.d. from a known\ndistribution. R : S \u2192 \u211d is the bounded reward function, whose maximum we denote by Rmax, and\n\u03b3 is the discount factor.\nOther concepts are similar to those of a general \ufb01nite MDP (Puterman, 1994). In particular, a policy\n\u03c0 is a mapping from states to actions that prescribes what action to take from each state. Given a\npolicy \u03c0 and a starting state s, the value of s under \u03c0, denoted by V \u03c0(s), is the expected discounted\nsum of rewards the agent will collect by starting from s and following policy \u03c0. Under mild conditions (Puterman, 1994), at least one policy exists that maximizes this value function over all states,\nwhich we refer to as the optimal policy or \u03c0\u2217. 
The value of states under this policy is called the\noptimal value function V \u2217(\u00b7) = V \u03c0\u2217(\u00b7).\nThe learning agent has prior knowledge of S, \u03b3, \u03c9 and Rmax, but not T and R, and has to \ufb01nd a\nnear-optimal policy solely through direct interaction with the environment.\n\n3 Multi-resolution Exploration\n\nWe\u2019d like to build upon the previous work of Kakade et al. (2003). One of the key concepts in this\nmethod and many other similar algorithms is the notion of a known state. Conceptually, it refers to the\nportion of the state space in which the agent can reliably predict the behavior of the environment.\nConsider how the agent decides whether a state is known or unknown, as described by Kakade\net al. (2003). Based on prior information about the smoothness of the environment and the level\nof desired optimality, we can form a hyper-sphere around each query point and check whether enough data\npoints exist inside it to support the prediction.\nIn this method, we use the same hyper-sphere size across the entire space, no matter how the sample\npoints are distributed, and we keep this size \ufb01xed during the entire learning process. In other\nwords, the degree of generalization is \ufb01xed both in time and space.\nTo support \u201canytime\u201d behavior, we need to make the degree of generalization variable both in time\nand space. MRE partitions the state space into a variable-resolution discretization that dynamically\nforms smaller cells for regions with denser sample sets. Generalization happens inside the cells (similar to the hyper-sphere example); it therefore allows for wider but less accurate generalization in\n\n\fparts of the state space that have fewer sample points, and narrow but more accurate ones for denser\nparts.\nTo effectively use this mechanism, we need to change the notion of known states, as its common\nde\ufb01nition is no longer applicable. 
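A minimal sketch of this fixed-size known-state test may make the rigidity concrete. The function name, radius, and count threshold below are illustrative stand-ins of ours, not quantities from the paper; in the approach discussed above they would be derived from the smoothness and optimality requirements.

```python
import math

def is_known(query, samples, radius=0.1, min_count=5):
    """Fixed-resolution 'known' test: a state is known when at least
    min_count sample points fall inside the hyper-sphere of a fixed
    radius centered at the query point."""
    def dist(a, b):
        # Euclidean distance between two equal-length state vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return sum(1 for s in samples if dist(query, s) <= radius) >= min_count
```

Note that both the radius and the threshold stay fixed over the whole state space and the whole run, which is exactly the rigidity MRE is designed to remove.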
Let\u2019s de\ufb01ne a new knownness criterion that maps S into [0, 1] and\nquanti\ufb01es how much we should trust the function approximation. The two extreme values, 0 and\n1, are the two degenerate cases equal to the unknown and known conditions of the previous de\ufb01nitions.\nIn the remainder of this section, we \ufb01rst show how to form the variable-resolution structure and\ncompute the knownness, and then we demonstrate how to use this structure in a prototypical model-based and value-based algorithm.\n\n3.1 Regression Trees and Knownness\n\nRegression trees are function approximators that partition the input space into non-overlapping regions and use the training samples of each region for prediction of query points inside it. Their ability\nto maintain a non-uniform discretization of high-dimensional spaces with relatively fast query time\nhas proven to be very useful in various RL algorithms (Ernst et al., 2005; Munos & Moore, 2002).\nFor the purpose of our discussion, we use a variation of the kd-tree structure (Preparata & Shamos,\n1985) to maintain our variable-resolution partitioning and produce knownness values. We call this\nstructure the knownness-tree. As this structure is not used in a conventional supervised-learning\nsetting, we next describe some of the details.\nA knownness-tree \u03c4 with dimension k accepts points s \u2208 \u211d^k satisfying ||s||\u221e \u2264 1\u00b9, and answers\nqueries of the form 0 \u2264 knownness(s) \u2264 1. Each node \u03c2 of the tree covers a bounded region and\nkeeps track of the points inside that region, with the root covering the whole space. Let R\u03c2 be the\nregion of \u03c2.\nEach internal node splits its region into two half-regions along one of the dimensions to create two\nchild nodes. Parameter \u03bd determines the maximum allowed number of points in each leaf. For a\nnode l, l.size is the inf-norm of the size of the region it covers and l.count is the number of points\ninside it. 
Given n points, the normalizing size of the resulting tree, denoted by \u00b5, is the region size of\na hypothetical uniform discretization of the space that puts \u03bd/k points inside each cell, if the points\nwere uniformly distributed in the space; that is, \u00b5 = 1/\u230a(nk/\u03bd)^{1/k}\u230b.\n\nUpon receiving a new point, the traversal algorithm starts at the root and travels down the tree,\nguided by the splitting dimension and value of each internal node. Once inside a leaf l, it adds the\npoint to its list of points; if l.count is more than \u03bd, the node splits and creates two new half-regions\u00b2.\nSplitting is performed by selecting a dimension j \u2208 [1..k] and splitting the region into two equal\nhalf-regions along the j-th dimension.\nThe points inside the list are added to each of the children according to what half-region they fall\ninto. Similar to regular regression trees, several different criteria could be used to select j. Here, we\nassume a round-robin method just like the kd-tree.\nTo answer a query knownness(s), the lookup algorithm \ufb01rst \ufb01nds the corresponding leaf that contains s, denoted l(s), then computes knownness based on l(s).size, l(s).count and \u00b5:\n\nknownness(s) = min(1, (l(s).count/\u03bd) \u00b7 (\u00b5/l(s).size))   (1)\n\nThe normalizing size of the tree is bigger when the total number of data points is small. This creates\nhigher knownness values for a \ufb01xed cell at the beginning of learning. As more experience is\ncollected, \u00b5 becomes smaller and encourages \ufb01ner discretization. 
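The knownness-tree and Equation (1) can be sketched as follows. This is a minimal sketch: class, method, and variable names are ours, and the paper's maximum-depth cap and per-(action, dimension) bookkeeping are omitted.

```python
import math

class KnownnessTree:
    """Sketch of a knownness-tree over [0, 1]^k. A leaf splits in half,
    round-robin over dimensions, once it holds more than nu points;
    knownness follows Equation (1)."""

    def __init__(self, k, nu):
        self.k, self.nu, self.n = k, nu, 0
        self.root = _Node([0.0] * k, [1.0] * k, depth=0)

    def add(self, s):
        self.n += 1
        self.root.add(s, self.nu)

    def knownness(self, s):
        # normalizing size mu = 1 / floor((n * k / nu)^(1/k))
        cells = math.floor((self.n * self.k / self.nu) ** (1.0 / self.k))
        mu = 1.0 / max(1, cells)
        leaf = self.root.leaf_for(s)
        ratio = (len(leaf.points) / self.nu) * (mu / leaf.size())
        return min(1.0, ratio)

class _Node:
    def __init__(self, lo, hi, depth):
        self.lo, self.hi, self.depth = lo, hi, depth
        self.points, self.children = [], None

    def size(self):
        # inf-norm side length of the region this node covers
        return max(h - l for h, l in zip(self.hi, self.lo))

    def leaf_for(self, s):
        node = self
        while node.children is not None:
            d = node.depth % len(node.lo)
            mid = (node.lo[d] + node.hi[d]) / 2.0
            node = node.children[0] if s[d] < mid else node.children[1]
        return node

    def add(self, s, nu):
        leaf = self.leaf_for(s)
        leaf.points.append(s)
        if len(leaf.points) > nu:  # split into two equal half-regions
            d = leaf.depth % len(leaf.lo)
            mid = (leaf.lo[d] + leaf.hi[d]) / 2.0
            left_hi, right_lo = list(leaf.hi), list(leaf.lo)
            left_hi[d], right_lo[d] = mid, mid
            left = _Node(list(leaf.lo), left_hi, leaf.depth + 1)
            right = _Node(right_lo, list(leaf.hi), leaf.depth + 1)
            for p in leaf.points:
                (left if p[d] < mid else right).points.append(p)
            leaf.points, leaf.children = [], (left, right)
```

Early in learning n is small, so µ (and hence knownness) is large even for coarse leaves; as n grows, µ shrinks and the same leaf must be denser to count as known.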
This process creates a variable\ndegree of generalization over time.\n\n\u00b9In practice, scaling can be used to satisfy this property.\n\u00b2For the sake of practicality, we can assign a maximum depth to avoid inde\ufb01nite growth of the tree.\n\n\f3.2 Application to Model-based RL\n\nThe model-based algorithm we describe here uses function approximation to estimate T and R,\nwhich are the two unknown parameters of the environment. Let \u0398 be the set of function approximators for estimating the transition function, with each \u03b8^j_i \u2208 \u0398 : \u211d^k \u2192 \u211d predicting the i-th\ncomponent of T(\u00b7, aj). Accordingly, let \u03c4^j_i be a knownness-tree for \u03b8^j_i. Let \u03c6 : \u211d^k \u2192 \u211d be the\nfunction approximator for the reward function. The estimated transition function, \u02c6T(s, a), is therefore formed by concatenating all the \u03b8^a_i(s). Let knownness(s, a) = min_i {\u03c4^a_i.knownness(s)}.\nConstruct the augmented MDP M\u2032 = \u27e8S + sf, A, \u02c6T\u2032, \u03c6, \u03b3\u27e9 by adding a new state, sf, with a\nreward of Rmax and only self-loop transitions. The augmented transition function \u02c6T\u2032 is a stochastic\nfunction de\ufb01ned as:\n\n\u02c6T\u2032(s, a) = sf with probability 1 \u2212 knownness(s, a), and \u02c6T(s, a) + \u03c9 otherwise.   (2)\n\nAlgorithm 1 constructs and solves M\u2032 and always acts greedily with respect to this internal model.\nDPlan is a continuous MDP planner that supports two operations: solveModel, which solves a given\nMDP, and getBestAction, which returns the greedy action for a given state.\n\nAlgorithm 1 A model-based algorithm using MRE for exploration\n1: Variables: DPlan, \u0398, \u03c6 and solving period planFreq\n2: Observe a transition of the form (st, at, rt, st+1)\n3: Add (st, rt) as a training sample to \u03c6.\n4: Add (st, st+1(i)) as a training sample to \u03b8^at_i.\n5: Add (st) to \u03c4^at_i.\n6: if t mod planFreq = 0 then\n7: Construct the augmented MDP M\u2032 as de\ufb01ned earlier.\n8: DPlan.solveModel(M\u2032)\n9: end if\n10: Execute action DPlan.getBestAction(st+1)\n\nWhile we leave a rigorous theoretical analysis of Algorithm 1 to another paper, we\u2019d like to discuss\nsome of its properties. The core of the algorithm is the way knownness is computed and how it\u2019s\nused to make the estimated transition function optimistic. In particular, if we use a uniform \ufb01xed\ngrid instead of the knownness-tree, the algorithm starts to act similar to MBIE (Strehl & Littman,\n2005). That is, like MBIE, the value of a state becomes gradually less optimistic as more data is\navailable. Because of their similarity, we hypothesize that similar PAC-bounds could be proved for\nMRE in this con\ufb01guration.\nIf we further change knownness(s, a) to be \u230aknownness(s, a)\u230b, the algorithm reduces to an instance of metric E3 (Kakade et al., 2003), which can also be used to derive \ufb01nite sample bounds.\nBut Algorithm 1 also has \u201canytime\u201d behavior. 
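As a concrete illustration of Equation (2), one draw from the augmented model might look like the sketch below. The function names and the string stand-in for sf are ours, and the noise term ω is assumed to be folded into the learned model here.

```python
import random

S_F = "s_f"  # stand-in label (ours) for the absorbing optimistic state

def sample_augmented_next_state(s, a, t_hat, knownness, rng=random.random):
    """One draw from the augmented transition of Equation (2): with
    probability 1 - knownness(s, a) jump to the absorbing state s_f;
    otherwise follow the learned model t_hat."""
    if s == S_F:                       # s_f has only self-loop transitions
        return S_F
    if rng() < 1.0 - knownness(s, a):
        return S_F
    return t_hat(s, a)

def augmented_reward(s, phi, r_max):
    """Reward in the augmented MDP: R_max at s_f, the learned phi elsewhere."""
    return r_max if s == S_F else phi(s)
```

Fully known state-actions (knownness 1) always follow the learned model, fully unknown ones always jump to the maximally rewarding sf, and intermediate knownness interpolates stochastically between the two.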
Let\u2019s assume the transition and reward functions are\nLipschitz smooth with Lipschitz constants CT and CR respectively. Let \u03c1t be the maximum size of\nthe cells and \u2113t be the minimum knownness of all of the trees \u03c4^j_i at time t. The following establishes a\nperformance guarantee for the algorithm at time t.\n\nTheorem 1 If learning is frozen at time t, Algorithm 1 achieves \u03b5-optimal behavior, with \u03b5 being:\n\n\u03b5 = O( (\u03c1t(CR + CT\u221ak) + 2(1 \u2212 \u2113t)) / (1 \u2212 \u03b3)\u00b2 )\n\nProof 1 (sketch) This follows as an application of the simulation lemma (Kearns & Singh, 2002).\nWe can use the smoothness assumptions to compute the closeness of \u02c6T\u2032 to the original transition\nfunction based on the shape of the trees and the knownness they output. \u25a1\n\n\fOf course, this theorem doesn\u2019t provide a bound for \u03c1t and \u2113t based on t, as used in common\n\u201canytime\u201d analyses, but it gives us some insight into how the algorithm would behave. For example,\nthe incremental re\ufb01nement of model estimation assures a certain global accuracy before forcing the\nalgorithm to collect denser sampling locally. As a result, MRE encourages more versatile sampling\nat the early stages of learning. As time goes by and the size of the cells gets smaller, the algorithm\ngets closer to the optimal policy. In fact, we hypothesize that with some caveats concerning the\ncomputation of \u00b5, it can be proved that Algorithm 1 converges to the optimal policy in the limit,\ngiven that an oracle planner is available.\nThe bound in Theorem 1 is loose because it involves only the biggest cell size, as opposed to individual cell sizes. 
Alternatively, one might be able to achieve better bounds, similar to those in the\nwork of Munos and Moore (2000), by taking the variable resolution of the tree into account.\n\n3.3 Application to Value-based RL\n\nHere, we show how to use MRE in \ufb01tted Q-iteration, which is a value-based batch learner for\ncontinuous spaces. A similar approach can be used to apply MRE to other types of value-based\nmethods, such as LSPI, as an alternative to random sampling or \u03b5-greedy exploration, which are\nwidely used in practice.\nThe \ufb01tted Q-iteration algorithm accepts a set of four-tuple samples S = {(sl, al, rl, s\u2032l), l = 1 . . . n}\nand uses regression trees to iteratively compute more accurate \u02c6Q-functions. In particular, let \u02c6Q^j_i be\nthe regression tree used to approximate Q(\u00b7, j) in the i-th iteration. Let S^j \u2282 S be the set of samples\nwith action equal to j. The training samples for \u02c6Q^j_0 are S^j_0 = {(sl, rl) | (sl, al, rl, s\u2032l) \u2208 S^j}. \u02c6Q^j_{i+1}\nis constructed based on \u02c6Q_i in the following way:\n\nxl = {sl | (sl, al, rl, s\u2032l) \u2208 S^j}   (3)\nyl = {rl + \u03b3 max_{a\u2208A} \u02c6Q^a_i(s\u2032l) | (sl, al, rl, s\u2032l) \u2208 S^j}   (4)\nS^j_{i+1} = {(xl, yl)}.   (5)\n\nRandom sampling is usually used to collect S for \ufb01tted Q-iteration when used as an of\ufb02ine algorithm.\nIn online settings, \u03b5-greedy can be used as the exploration scheme to collect samples. The batch\nportion of the algorithm is applied periodically to incorporate the newly collected samples.\nCombining MRE with \ufb01tted Q-iteration is very simple. Let \u03c4^j correspond to \u02c6Q^j_i for all i\u2019s, and be\ntrained on the same samples. The only change in the algorithm is the computation of Equation 4. 
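For reference, the unmodified target computation of Equations (3)-(5) can be sketched as follows; the function and variable names are ours, and `q_hat_i` stands in for querying the current iterate's regression trees.

```python
def fitted_q_targets(samples_j, q_hat_i, actions, gamma):
    """Build the regression set S^j_{i+1} of Equations (3)-(5) for one action
    j, from four-tuples (s, a, r, s') that all share a = j. q_hat_i(b, s')
    returns the i-th iterate's value estimate for action b at state s'."""
    pairs = []
    for (s, a, r, s_next) in samples_j:
        x = s                                                      # Eq. (3)
        y = r + gamma * max(q_hat_i(b, s_next) for b in actions)   # Eq. (4)
        pairs.append((x, y))                                       # Eq. (5)
    return pairs
```

The MRE variant described next changes only how each y is computed, mixing this target with the optimistic value Rmax/(1 − γ) according to knownness.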
In\norder to use optimistic values, we elevate \u02c6Q-functions according to their knownness:\n\nyl = \u03c4^j.knownness(sl) (rl + \u03b3 max_{a\u2208A} \u02c6Q^a_i(s\u2032l)) + (1 \u2212 \u03c4^j.knownness(sl)) (Rmax / (1 \u2212 \u03b3)).\n\n4 Experimental Results\n\nTo empirically evaluate the performance of MRE, we consider a well-studied environment called\n\u201cMountain Car\u201d (Sutton & Barto, 1998). In this domain, an underpowered car tries to climb up\nto the right of a valley, but has to increase its velocity via several back and forth trips across the\nvalley. The state space is 2-dimensional and consists of the horizontal position of the car in the\nrange of [\u22121.2, 0.6], and its velocity in [\u22120.07, 0.07]. The action set is forward, backward, and\nneutral, which correspond to accelerating in the intended direction. The agent receives a \u22121 penalty in\neach timestep except for when it escapes the valley and receives a reward of 0 that ends the episode.\nEach episode has a cap of 300 steps, and \u03b3 = 0.95 is used for all the experiments. A small amount\nof Gaussian noise \u03c9 \u223c N(0, 0.01) is added to the position component of the deterministic transition\nfunction used in the original de\ufb01nition, and the starting position of the car is chosen very close to\n\n\fthe bottom of the hill with a random velocity very close to 0 (achieved by drawing samples from a\nnormal distribution with the mean on the bottom of the hill and variance of 1/15 of the state space).\nThis set of parameters makes this environment especially interesting for the purpose of comparing\nexploration strategies, because it is unlikely for random exploration to guide the car to the top of the\nhill. 
Similar scenarios occur in almost all complex real-life domains, where a long trajectory\nis needed to reach the goal.\nThree versions of Algorithm 1 are compared in Figure 1(a): the \ufb01rst two implementations use \ufb01xed\ndiscretizations instead of the knownness-tree, with different normalized resolutions of 0.05 and 0.3.\nThe third one uses a variable discretization via the knownness-tree as de\ufb01ned in Section 3.1. All\nthe instances use the same \u0398 and \u03c6, which are regular kd-tree structures (Ernst et al., 2005) with\na maximum of 10 allowed points in each cell. All of the algorithms use \ufb01tted value-iteration (Gordon,\n1999) as their DPlan, and their planFreq is set to 100. Furthermore, the known threshold parameter\nof the \ufb01rst two instances was hand-tuned to 4 and 30 respectively.\nThe learning curve in Figure 1(a) is averaged over 20 runs with different random seeds and smoothed\nover a window of size 5 to avoid a cluttered graph. The \ufb01ner \ufb01xed discretization converges to a\nvery good policy, but takes a long time to do so, because it trusts only very accurate estimations\nthroughout the learning. The coarse discretization, on the other hand, converges very fast, but not\nto a very good policy; it constructs rough estimations and doesn\u2019t compensate as more samples are\ngathered. MRE re\ufb01nes the notion of knownness to make use of rough estimations at the beginning\nand accurate ones later, and therefore converges to a good policy fast.\n\n(a)\n\n(b)\n\nFigure 1: (a) The learning curve of Algorithm 1 in Mountain Car with three different exploration\nstrategies. (b) Average performance of Algorithm 1 in Mountain Car with three exploration strategies. 
Performance is evaluated at three different stages of learning.\n\nA more detailed comparison of this result is shown in Figure 1(b), where the average time-per-episode is provided for three different phases: at the early stages of learning (episodes 1-100), at\nthe middle of learning (episodes 100-200), and during the late stages (episodes 200-300). Standard\ndeviation is used as the error bar.\nTo have a better look at why MRE provides better results than the \ufb01xed 0.05 at the early stages of\nlearning (note that both of them achieve the same performance level at the end), value functions of\nthe two algorithms at timestep = 1500 are shown in Figure 2. Most of the samples at this stage\nhave very small knownness in the \ufb01xed version, due to the very \ufb01ne discretization, and therefore have\nvery little effect on the estimation of the transition function. This situation results in an overly optimistic\nvalue function (the \ufb02at part of the function). The variable discretization, however, achieves a more\nrealistic and smooth value function by allowing coarser generalizations in parts of the state space\nwith fewer samples.\nThe same type of learning curve is shown for the \ufb01tted Q-iteration algorithm in Figure 3. Here,\nwe compare \u03b5-greedy to two versions of variable-resolution MRE; in the \ufb01rst version, although a\nknownness-tree is chosen for partitioning the state space, knownness is computed as a Boolean value\nusing the \u230a\u00b7\u230b operator. The second version uses continuous knownness. For \u03b5-greedy, \u03b5 is set to 0.3\nat the beginning, is decayed linearly to 0.03 by t = 10000, and is kept constant afterward. This\nparameter setting is the result of a rough optimization through a few trials. As expected,\n\u03b5-greedy performs poorly, because it cannot collect good samples to feed the batch learner. 
Both of\nthe versions of MRE converge to the same policy, although the one that uses continuous knownness\ndoes so faster.\n\n\f(a)\n\n(b)\n\nFigure 2: Snapshot of the value function at timestep 1500 in Algorithm 1 with two con\ufb01gurations:\n(a) \ufb01xed discretization with resolution = 0.05, and (b) variable resolution.\n\nFigure 3: The learning curve for \ufb01tted Q-iteration in Mountain Car. \u03b5-greedy is compared to two\nversions of MRE: one that uses Boolean knownness, and one that uses continuous knownness.\n\nTo have a better understanding of why the continuous knownness helps \ufb01tted Q-iteration during the\nearly stages of learning, snapshots of knownness from the two versions are depicted in Figure 4,\nalong with the set of visited states at timestep 1500. Black indicates a completely unknown region,\nwhile white means completely known; gray is used for intermediate values. The continuous notion\nof knownness helps \ufb01tted Q-iteration in this case to collect better-covering samples at the beginning\nof learning.\n\n5 Conclusion\n\nIn this paper, we introduced multi-resolution exploration for reinforcement learning in continuous\nspaces and demonstrated how to use it in two algorithms from the model-based and value-based\nparadigms. The combination of two key features distinguishes MRE from previous smart exploration\nschemes in continuous spaces: The \ufb01rst is that MRE uses a variable-resolution structure to identify\nknown vs. unknown regions, and the second is that it successively re\ufb01nes the notion of knownness\nduring learning, which allows it to assign continuous, instead of Boolean, knownness. 
The applicability of MRE to value-based methods allows us to bene\ufb01t from smart exploration ideas from the\nmodel-based setting in powerful value-based batch learners that usually use naive approaches like\nrandom sampling to collect data. Experimental results con\ufb01rm that MRE holds a signi\ufb01cant advantage\nover some other exploration techniques widely used in practice.\n\n\f(a)\n\n(b)\n\nFigure 4: Knownness computed in two versions of MRE for \ufb01tted Q-iteration: one that has Boolean\nvalues, and one that uses continuous ones. Black indicates completely unknown and white means\ncompletely known. Collected samples are also shown for the same two versions at timestep 1500.\n\nReferences\nAuer, P., & Ortner, R. (2006). Logarithmic online regret bounds for undiscounted reinforcement\nlearning. Advances in Neural Information Processing Systems 20 (NIPS-06).\nBrafman, R. I., & Tennenholtz, M. (2002). R-max, a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3, 213\u2013231.\nErnst, D., Geurts, P., & Wehenkel, L. (2005). Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6, 503\u2013556.\nGordon, G. J. (1999). Approximate solutions to Markov decision processes. Doctoral dissertation,\nSchool of Computer Science, Carnegie Mellon University, Pittsburgh, PA.\nJong, N. K., & Stone, P. (2007). Model-based function approximation for reinforcement learning.\nThe Sixth International Joint Conference on Autonomous Agents and Multiagent Systems.\nKaelbling, L. P., Littman, M. L., & Moore, A. P. (1996). Reinforcement learning: A survey. Journal\nof Arti\ufb01cial Intelligence Research, 4, 237\u2013285.\nKakade, S., Kearns, M., & Langford, J. (2003). 
Exploration in metric state spaces. In Proc. of the\n20th International Conference on Machine Learning, 2003.\nKearns, M. J., & Singh, S. P. (2002). Near-optimal reinforcement learning in polynomial time.\nMachine Learning, 49, 209\u2013232.\nLagoudakis, M. G., & Parr, R. (2003). Least-squares policy iteration. Journal of Machine Learning\nResearch, 4, 1107\u20131149.\nMunos, R., & Moore, A. (2002). Variable resolution discretization in optimal control. Machine\nLearning, 49, 291\u2013323.\nMunos, R., & Moore, A. W. (2000). Rates of convergence for variable resolution schemes in optimal\ncontrol. Proceedings of the Seventeenth International Conference on Machine Learning (ICML-00) (pp. 647\u2013654).\nPreparata, F. P., & Shamos, M. I. (1985). Computational geometry - an introduction. Springer.\nPuterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming.\nNew York: Wiley.\nStrehl, A., & Littman, M. (2007). Online linear regression and its application to model-based reinforcement learning. Advances in Neural Information Processing Systems 21 (NIPS-07).\nStrehl, A. L., & Littman, M. L. (2005). A theoretical analysis of model-based interval estimation.\nICML-05 (pp. 857\u2013864).\nSutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. Cambridge, MA:\nMIT Press.\n\f", "award": [], "sourceid": 730, "authors": [{"given_name": "Ali", "family_name": "Nouri", "institution": null}, {"given_name": "Michael", "family_name": "Littman", "institution": null}]}