{"title": "Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains", "book": "Advances in Neural Information Processing Systems", "page_first": 2433, "page_last": 2441, "abstract": "The reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent's confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agent-environment interaction rarely satisfies traditional assumptions. Samples of value estimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions.
We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking.", "full_text": "Interval Estimation for Reinforcement-Learning Algorithms in Continuous-State Domains\n\nMartha White\nDepartment of Computing Science\nUniversity of Alberta\nwhitem@cs.ualberta.ca\n\nAdam White\nDepartment of Computing Science\nUniversity of Alberta\nawhite@cs.ualberta.ca\n\nAbstract\n\nThe reinforcement learning community has explored many approaches to obtaining value estimates and models to guide decision making; these approaches, however, do not usually provide a measure of confidence in the estimate. Accurate estimates of an agent's confidence are useful for many applications, such as biasing exploration and automatically adjusting parameters to reduce dependence on parameter-tuning. Computing confidence intervals on reinforcement learning value estimates, however, is challenging because data generated by the agent-environment interaction rarely satisfies traditional assumptions. Samples of value estimates are dependent, likely non-normally distributed and often limited, particularly in early learning when confidence estimates are pivotal. In this work, we investigate how to compute robust confidences for value estimates in continuous Markov decision processes. We illustrate how to use bootstrapping to compute confidence intervals online under a changing policy (previously not possible) and prove validity under a few reasonable assumptions. We demonstrate the applicability of our confidence estimation algorithms with experiments on exploration, parameter estimation and tracking.\n\n1 Introduction\n\nIn reinforcement learning, an agent interacts with the environment, learning through trial-and-error based on scalar reward signals.
Many reinforcement learning algorithms estimate values for states to enable selection of maximally rewarding actions. Obtaining confidence intervals on these estimates has been shown to be useful in practice, including directing exploration [17, 19] and deciding when to exploit learned models of the environment [3]. Moreover, there are several potential applications using confidence estimates, such as teaching interactive agents (using confidence estimates as feedback), adjusting behaviour in non-stationary environments and controlling behaviour in a parallel multi-task reinforcement learning setting.\nComputing confidence intervals was first studied by Kaelbling for finite-state Markov decision processes (MDPs) [11]. Since this preliminary work, many model-based algorithms have been proposed for evaluating confidences for discrete-state MDPs. The extension to continuous-state spaces with model-free learning algorithms, however, has yet to be undertaken. In this work we focus on constructing confidence intervals for online model-free reinforcement learning agents.\nThe agent-environment interaction in reinforcement learning does not satisfy classical assumptions typically used for computing confidence intervals, making accurate confidence estimation challenging. In the discrete case, certain simplifying assumptions make classical normal intervals more appropriate; in the continuous setting, we will need a different approach.\nThe main contribution of this work is a method to robustly construct confidence intervals for approximated value functions in the continuous-state reinforcement learning setting. We first describe bootstrapping, a non-parametric approach to estimating confidence intervals from data. We then prove that bootstrapping can be applied to our setting, addressing challenges due to sample dependencies, changing policies and non-stationarity (because of learning). Then, we discuss how to address complications in computing confidence intervals for sparse or local linear representations common in reinforcement learning, such as tile coding, radial basis functions, tree-based representations and sparse distributed memories. Finally, we propose several potential applications of confidence intervals in reinforcement learning and conclude with an empirical investigation of the practicality of our confidence estimation algorithm for exploration, tuning the temporal credit parameter and tracking.\n\n2 Related Work\n\nKaelbling was the first to employ a confidence interval estimation method for exploration in finite-state MDPs [11]. The agent estimates the probability of receiving a reward of 1.0 for a given state-action pair and constructs an upper confidence bound on this estimate using a Bernoulli confidence interval. Exploration is directed by selecting the action with the highest upper confidence bound, which corresponds to actions for which it has high uncertainty or high value estimates [11].\nInterval estimation for model-based reinforcement learning with discrete state spaces has been quite extensively studied. Mannor et al. (2004) investigated confidence estimates for the parameters of the learned transition and reward models, assuming Gaussian rewards [5, 16]. The Model Based Interval Estimation algorithm (MBIE) uses upper confidence bounds on the model transition probabilities to select the model that gives the maximal reward [22].
The Rmax algorithm uses a heuristic notion of confidence (state visitation counts) to determine when to explore or exploit the learned model [3]. Both Rmax and MBIE are guaranteed to converge to the optimal policy in polynomially many steps. These guarantees, however, become difficult to obtain for continuous state spaces.\nA recently proposed framework, KWIK ("Knows What It Knows"), is a formal framework for algorithms that explore efficiently by minimizing the number of times an agent must return the response "I do not know" [23]. For example, for reinforcement learning domains, KWIK-RMAX biases exploration toward states for which the algorithm does not currently "know" an accurate estimate of the value [23]. KWIK-RMAX provides an uncertainty estimate (not a confidence interval) on a linear model by evaluating if the current feature vector is contained in the span of previously observed feature vectors. Though quite general, the algorithm remains theoretical due to the requirement of a solution to the model.\nBayesian methods (e.g., GPTD [6]) provide a natural measure of confidence: one can use the posterior distribution to form credible intervals for the mean value of a state-action pair. However, if one wants to use non-Gaussian priors and likelihoods, then the Bayesian approach is intractable without appropriate approximations. Although this approach is promising, we are interested in computing classical frequentist confidence intervals for agents, while not restricting the underlying learning algorithm to use a model or particular update mechanism.\nSeveral papers have demonstrated the empirical benefits of using heuristic confidence estimates to bias exploration [14, 17, 19] and guide data collection in model learning [9, 18]. For example, Nouri et al.
[19] discretize the state space with a KD-tree and mark the state as "known" after reaching a visitation count threshold.\nIn the remainder of this work, we provide the first study of estimating confidence intervals for model-free, online reinforcement learning value estimates in the continuous-state setting.\n\n3 Background\n\nIn this section, we will introduce the reinforcement learning model of sequential decision making and bootstrapping, a family of techniques used to compute confidence intervals for means of dependent data from an unknown (likely non-normal) underlying distribution.\n\n3.1 Reinforcement Learning\n\nIn reinforcement learning, an agent interacts with its environment, receiving observations and selecting actions to maximize a scalar reward signal provided by the environment. This interaction is usually modeled by a Markov decision process (MDP). An MDP consists of (S, A, P, R) where S is the set of states; A is a finite set of actions; P is the transition function, which describes the probability of reaching a state s' from a given state and action (s, a); and finally the reward function R(s, a, s') returns a scalar value for transitioning from state-action (s, a) to state s'. The state of the environment is said to be Markov if $Pr(s_{t+1}, r_{t+1} | s_t, a_t) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, \\ldots, s_0, a_0)$. The agent's objective is to learn a policy, $\\pi : S \\to A$, such that R is maximized for all $s \\in S$.\nMany reinforcement learning algorithms maintain a state-action value function, $Q^\\pi(s, a)$, equal to the expected discounted sum of future rewards for a given state-action pair: $Q^\\pi(s, a) = E_\\pi[\\sum_{k=0}^{\\infty} \\gamma^k r_{t+k+1} \\mid s_t = s, a_t = a]$, where $\\gamma \\in [0, 1]$ discounts the contribution of future rewards. The optimal state-action value function, $Q^*(s, a)$, is the maximum achievable value given the agent starts in state s and selects action a. The optimal policy, $\\pi^*$, is greedy with respect to the optimal value function: $\\pi^*(s) = \\mathrm{argmax}_{a \\in A} Q^*(s, a)$ for all $s \\in S$. During learning the agent must balance selecting actions to achieve high reward (according to $\\hat{Q}(s, a)$) or selecting actions to gain more information about the environment. This is called the exploration-exploitation trade-off.\nIn many practical applications, the state space is too large to store in a table. In this case, a function approximator is used to estimate the value of a state-action pair. A linear function approximator produces a value prediction using a linear combination of basis units: $\\hat{Q}(s, a) = \\theta^T \\phi(s, a)$. We refer the reader to the introductory text [25] for a more detailed discussion on reinforcement learning.\n\n3.2 Bootstrapping a confidence interval for dependent data\n\nBootstrapping is a statistical procedure for estimating the distribution of a statistic (such as the sample mean), particularly when the underlying distribution is complicated or unknown, samples are dependent and power calculations (e.g. variance) are estimated with limited sample sizes [21]. This estimate can then be used to approximate a $1 - \\alpha$ confidence interval around the statistic: an interval for which the probability of seeing the statistic outside of the interval is low (probability $\\alpha$). For example, for potentially dependent data sampled from an unknown distribution $P(X_1, X_2, \\ldots)$, we can use bootstrapping to compute a confidence interval around the mean, $T_n = n^{-1} \\sum_{i=1}^{n} x_i$.\nThe key idea behind bootstrapping is that the data is an appropriate approximation, $P_n$, of the true distribution: resampling from the data represents sampling from $P_n$. Samples are "drawn" from $P_n$ to produce a bootstrap sample, $x^*_1, \\ldots, x^*_n \\in \\{x_1, \\ldots, x_n\\}$, and an estimate, $T^*_n$, of the statistic. This process is repeated B times, giving B samples of the statistic, $T^*_{n,1}, \\ldots, T^*_{n,B}$. These, for example, can be used to estimate $Var_P(T_n) \\approx Var_{P_n}(T_n) = \\sum_b (T^*_{n,b} - \\bar{T}^*_n)^2 / (B - 1)$.\nBootstrapped intervals have been shown to have a lower coverage error than normal intervals for dependent, non-normal data. A normal interval has a coverage error of $O(1/\\sqrt{n})$, whereas bootstrapping has a coverage error of $O(n^{-3/2})$ [29]. The coverage error represents how quickly the estimated interval converges to the true interval: a higher-order coverage error indicates faster convergence(1). Though the theoretical conditions for these guarantees are somewhat restrictive [29], bootstrapping has nevertheless proved very useful in practice for more general data [4, 21].\nWith the bootstrapped samples, a percentile-t (studentized) interval is constructed by\n\n$P(T \\in (2T_n - T^*_{1-\\alpha/2},\\; 2T_n - T^*_{\\alpha/2})) \\geq 1 - \\alpha$\n\nwhere $T^*_\\beta$ is the $\\beta$ sample quantile of $T^*_{n,1}, \\ldots, T^*_{n,B}$. Usually, the $\\beta$-quantile of an ordered population of size n is the continuous sample quantile:\n\n$(1 - r) T^*_{n,j} + r T^*_{n,j+1}$, where $j = \\lfloor n\\beta + m \\rfloor$, $r = n\\beta + m - j$\n\nwhere m depends on the quantile type, with $m = (\\beta + 1)/3$ common for non-normal distributions.\nThe remaining question is how to bootstrap from the sequence of samples. In the next section, we describe the block bootstrap, applicable to Markov processes, which we will show represents the structure of data for value estimates in reinforcement learning.\n(1) More theoretically, coverage error is the approximation error in the Edgeworth expansions used to approximate the distribution in bootstrap proofs.\n\n3.2.1 Moving Block Bootstrap\n\nIn the moving block bootstrap method, blocks of consecutive samples are drawn with replacement from a set of overlapping blocks, making the k-th block $\\{x_{k-1+t} : t = 1, \\ldots, l\\}$. The bootstrap resample is the concatenation of n/l blocks chosen randomly with replacement, making a time series of length n; B of these concatenated resamples are used in the bootstrap estimate. The block bootstrap is appropriate for sequential processes because the blocks implicitly maintain a time-dependent structure. A common heuristic for the block length, l, is $n^{1/3}$ [8].\nThe moving block bootstrap was designed for stationary, dependent data; however, our scenario involves nonstationary data. Lahiri [12] proved a coverage error of $o(n^{-1/2})$ when applying the moving block bootstrap to nonstationary, dependent data, better than the normal coverage error. Fortunately, the conditions are not restrictive for our scenario, described further in the next section.\nNote that there are other bootstrapping techniques applicable to sequential, dependent data with lower coverage error, such as the double bootstrap [13], block-block bootstrap [1] and Markov or Sieve bootstrap [28].
In particular, the Markov bootstrap has been shown to have a lower coverage error for Markov data than the block bootstrap under certain restricted conditions [10]. These techniques, however, have not been shown to be valid for nonstationary data.\n\n4 Confidence intervals for continuous-state Markov decision processes\n\nIn this section, we present a theoretically sound approach to constructing confidence intervals for parametrized Q(s, a) using bootstrapping for dependent data. We then discuss how to address sparse representations, such as tile coding, which make confidence estimation more complicated.\n\n4.1 Bootstrapped Confidence Intervals for Global Representations\n\nThe goal is to compute a confidence estimate for $Q(s_t, a_t)$ on time step t. Assume that we are learning a parametrized value function $Q(s, a) = f(\\theta, s, a)$, with $\\theta \\in R^d$ and a smooth function $f : R^d \\times S \\times A \\to R$. A common example is a linear value function $Q(s, a) = \\theta^T \\phi(s, a)$, with $\\phi : S \\times A \\to R^d$. During learning, we have a sequence of changing weights, $\\{\\theta_1, \\theta_2, \\ldots, \\theta_n\\}$ up to time step n, corresponding to the random process $\\{\\Theta_1, \\ldots, \\Theta_n\\}$. If this process were stationary, then we could compute an interval around the mean of the process. In almost all cases, however, the process will be nonstationary with means $\\{\\mu_1, \\ldots, \\mu_n\\}$. Instead, our goal is to estimate\n\n$\\bar{f}_n(s, a) = n^{-1} \\sum_{t=1}^{n} E[f(\\Theta_t, s, a)]$\n\nwhich represents the variability in the current estimation of the function $\\hat{Q}$ for any given state-action pair, $(s, a) \\in S \\times A$. Because Q is parametrized, the sequence of weights, $\\{\\Theta_t\\}$, represents the variability for the uncountably many state-action pairs.\nAssume that the weight vector on time step t + 1 is drawn from the unknown distribution $P_a[(\\Theta_{t+1}, s_{t+1}) \\mid (\\theta_t, s_t), \\ldots, (\\theta_{t-k}, s_{t-k})]$, giving a k-order Markov dependence on previous states and weight vectors. Notice that $P_a$ incorporates P and R, using $s_t$, $\\theta_t$ (giving the policy $\\pi$) and R to determine the reward passed to the algorithm to then obtain $\\theta_{t+1}$. This allows the learning algorithm to select actions using confidence estimates based on the history of the k most recent $\\theta$, without invalidating the assumption that the sequence of weights is drawn from $P_a$. In practice, the length of the dependence, k, can be estimated using auto-correlation [2].\nApplying the moving block bootstrap method to a nonstationary sequence of $\\theta$'s requires several assumptions on the underlying MDP and the learning algorithm. We require two assumptions on the underlying MDP: a bounded density function and a strong mixing requirement. The assumptions on the algorithm are less strict, only requiring that the algorithm be non-divergent and produce a sequence of $\\{Q_t(s, a)\\}$ that 1) satisfies a smoothness condition (a dependent Cramer condition), 2) has a bounded twelfth moment and 3) satisfies an m-dependence relation where sufficiently separated $Q_i(s, a)$, $Q_j(s, a)$ are independent. Based on these assumptions (stated formally in the supplement), we can prove that the moving block bootstrap produces an interval with a coverage error of $o(n^{-1/2})$ for the studentized interval on $\\bar{f}_n(s, a)$.\n\nTheorem 1. Given that Assumptions 1-7 are satisfied and there exist constants $C_1, C_2 > 0$, $0 < \\alpha \\leq \\beta < 1/4$ such that $C_1 n^\\alpha < l < C_2 n^\\beta$ (i.e.
l increases with n), then the moving block bootstrap produces a one-sided confidence interval that is consistent and has a coverage error of $o(n^{-1/2})$ for the studentization of the mean of the process $\\{f(\\theta_t, s, a)\\}$, where $Q_t(s, a) = f(\\theta_t, s, a)$.\nThe proof for the above theorem follows Lahiri's proof [12] for the coverage error of the moving block bootstrap for nonstationary data. The general approach for coverage error proofs involves approximating the unknown distribution with an Edgeworth expansion (see [7]), with the coverage error dependent on the order of the expansion, similar to the idea of a Taylor series expansion.\nAssuming $P_a$ is k-order Markov results in two important practical implications for the learning algorithm: 1) inability to use eligibility traces and 2) restrictions on updates to parameters (such as the learning rate). These potential issues, however, are actually not restrictive. First, the tail of eligibility traces has little effect, particularly for larger k; the most recent k weights incorporate the most important information for the eligibility traces. Second, the learning rate, for example, cannot be updated based on time. The learning rate, however, can still be adapted based on changes between weight vectors, a more principled approach taken by the meta-learning algorithm IDBD [24].\nThe final algorithm is summarized in the pseudocode below. In practice, a window of data of length w is stored due to memory restrictions; other data selection techniques are possible. Corresponding to the notation in Section 3.2, $Q_i$ represents the data samples (of $\\hat{Q}(s, a)$), $(Q^*_{i,1}, \\ldots, Q^*_{i,M})$ the dependently sampled blocks for the ith resample and $T^*_i$ the mean of the ith resample.\n\nAlgorithm 1 GetUpperConfidence($f(\\cdot, s, a)$, $\\{\\theta_{n-w}, \\ldots, \\theta_n\\}$, $\\alpha$)\nInput: the last w weights and confidence level $\\alpha$ (= 0.05); l = block length, B = number of bootstrap resamples\n1: $Q_N \\leftarrow \\{f(\\theta_{n-w}, s, a), \\ldots, f(\\theta_n, s, a)\\}$\n2: Blocks $\\leftarrow \\{[Q_{n-w}, \\ldots, Q_{n-w+l-1}], [Q_{n-w+1}, \\ldots, Q_{n-w+l}], \\ldots, [Q_{n-l+1}, \\ldots, Q_n]\\}$\n3: $M \\leftarrow \\lfloor w/l \\rfloor$ (the number of length-l blocks to sample with replacement and concatenate)\n4: for all i = 1 to B do\n5: $(Q^*_1, Q^*_2, \\ldots, Q^*_{M \\cdot l}) \\leftarrow$ concatMRandomBlocks(Blocks, M)\n6: $T^*_i \\leftarrow \\frac{1}{M \\cdot l} \\sum_j Q^*_j$\n7: end for\n8: sort($\\{T^*_1, \\ldots, T^*_B\\}$)\n9: $j \\leftarrow \\lfloor B\\alpha/2 + (\\alpha+2)/6 \\rfloor$, $r \\leftarrow B\\alpha/2 + (\\alpha+2)/6 - j$\n10: $T^*_{\\alpha/2} \\leftarrow (1 - r) T^*_j + r T^*_{j+1}$\n11: Return $2\\,\\mathrm{mean}(Q_N) - T^*_{\\alpha/2}$\n\n4.2 Bootstrapped Confidence Intervals for Sparse Representations\n\nWe have shown that bootstrapping is a principled approach for computing intervals for global representations; sparse representations, however, complicate the solution. In an extreme case, for example, for linear representations, features active on time step t may have never been active before. Samples $Q_1(s_t, a_t), \\ldots, Q_t(s_t, a_t)$ would therefore all equal $Q_0(s_t, a_t)$, because the weights would never have been updated for those features. Consequently, the samples erroneously indicate low variance for $Q(s_t, a_t)$.\nWe propose that, for sparse linear representations, the samples for the weights can be treated independently and still produce a reasonable, though currently unproven, bootstrap interval. Notice that, for $\\theta(i)$ the ith feature,\n\n$P_a[(\\theta_t, s_t) \\mid (\\theta_{t-1}, s_{t-1}), \\ldots, (\\theta_{t-k}, s_{t-k})] = \\prod_{i=1}^{d} P_a[(\\theta_t(i), s_t) \\mid (\\theta_{t-1}, s_{t-1}), \\ldots, (\\theta_{t-k}, s_{t-k})]$\n\nbecause updates to weights $\\theta(i)$, $\\theta(j)$ are independent given the previous states and weight vectors for all $i, j \\in \\{1, \\ldots, d\\}$. We could, therefore, estimate upper confidence bounds on the individual weights, $ucb_i(s, a)$, and then combine them, via $ucb(s, a) = \\sum_{i=1}^{d} ucb_i(s, a) \\cdot \\phi_i(s, a)$, to produce an upper confidence bound on $Q(s_t, a_t)$. To approximate the variance of $\\theta(i)$ on time step t, we can use the last w samples of $\\theta(i)$ where $\\theta(i)$ changed.\nProving coverage error results for sparse representations will require analyzing the covariance between components of $\\theta$ over time. The above approach for sparse representations does not capture this covariance; due to sparsity, however, the dependence between many of the samples for $\\theta(i)$ and $\\theta(j)$ will likely be weak. We could potentially extend the theoretical results by bounding the covariance between the samples and exploiting independencies. The means for individual weights could likely be estimated separately, therefore, and still enable a valid confidence interval. In future work, a potential extension is to estimate the covariances between the individual weights to improve the interval estimate.\n\n5 Applications of confidence intervals for reinforcement learning\n\nThe most obvious application of interval estimation is to bias exploration to select actions with high uncertainty. Confidence-based exploration should be comparable to optimistic initialization in domains where exhaustive search is required and find better policies in domains where noisy rewards and noisy dynamics can cause the optimistic initialization to be prematurely decreased and inhibit exploration.
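As a concrete illustration of producing the per-feature upper bounds used in Section 4.2's combination $ucb(s, a) = \sum_i ucb_i(s, a) \cdot \phi_i(s, a)$, the sketch below treats each weight's recent samples independently. It is our own minimal sketch, not the paper's implementation: a simple i.i.d. percentile bootstrap per weight stands in for the moving block bootstrap, and all function names are ours.

```python
import numpy as np

def per_feature_upper_bounds(weight_history, alpha=0.05, B=200, rng=None):
    """Upper confidence bound for each weight from its recent samples.

    weight_history: (w, d) array holding the last w weight vectors theta.
    Each column is bootstrapped independently, a simplification that
    ignores cross-weight covariance (as discussed in Section 4.2).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    w, d = weight_history.shape
    ucb = np.empty(d)
    for i in range(d):
        samples = weight_history[:, i]
        # i.i.d. percentile bootstrap of the mean of theta(i).
        boot_means = np.array(
            [rng.choice(samples, size=w).mean() for _ in range(B)])
        ucb[i] = np.quantile(boot_means, 1.0 - alpha / 2.0)
    return ucb

def ucb_value(weight_history, phi, alpha=0.05):
    """Combine per-weight bounds: ucb(s, a) = sum_i ucb_i(s, a) * phi_i(s, a)."""
    ucb = per_feature_upper_bounds(weight_history, alpha)
    return float(ucb @ phi)
```

For a sparse feature vector such as tile coding, only the active features contribute to the sum, so in practice the loop would be run only over indices where phi is nonzero.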
Furthermore, confidence-based exploration reduces parameter tuning because the policy does not require knowledge of the reward range, as in softmax and optimistic initialization.\nConfidence-based exploration could be beneficial in domains where the problem dynamics and reward function change over time. In an extreme case, the agent may converge to a near-optimal policy before the goal is teleported to another portion of the space. If the agent continues to act greedily with respect to its action-value estimates without re-exploring, it may act sub-optimally indefinitely. These tracking domains require that the agent "notice" that its predictions are incorrect and begin searching for a better policy. An example of a changing reward signal arises in interactive teaching. In this scenario, a human teacher shapes the agent by providing a drifting reward signal. Even in stationary domains, tracking the optimal policy may be more effective than converging due to the non-stationarity introduced by imperfect function approximation [26].\nAnother potential application of confidence estimation is to automate parameter tuning online. For example, many TD-based reinforcement learning algorithms use an eligibility parameter ($\\lambda$) to address the credit assignment problem. Learning performance can be sensitive to $\\lambda$. There has been little work, however, exploring the effects of different decay functions for $\\lambda$; using different $\\lambda$ values for each state/feature; or meta-learning $\\lambda$. Confidence estimates could be used to increase $\\lambda$ when the agent is uncertain and decrease $\\lambda$ for confident value estimates [25].\nConfidence estimates could also be used to guide the behaviour policy for a parallel multi-task reinforcement learning system.
Due to recent theoretical developments [15], several target value functions can be learned in parallel, off-policy, based on a single stream of data from a behaviour policy. The behaviour policy should explore to provide samples that generalize well between the various target policies, speeding overall convergence. For example, if one-sided intervals are maintained for each target value function, the behaviour policy could select an action corresponding to the maximal sum of those intervals. Exploration is then biased to highly uncertain areas where more samples are required.\nFinally, confidence estimates could be used to determine when features should be evaluated in a feature construction algorithm. Many feature construction algorithms, such as cascade correlation networks, interleave proposing candidate features and evaluation. In an online reinforcement learning setting, these methods freeze the representation for a fixed window of time to accurately evaluate the candidate [20]. Instead of using a fixed window, a more principled approach is to evaluate the features after the confidence on the weights of the candidate features reaches some threshold.\n\n6 Experimental Results\n\nIn this section, we provide a preliminary experimental investigation into the practicality of confidence estimation in continuous-state MDPs. We evaluate a naive implementation of the block bootstrap method for (1) exploration in a noisy reward domain, (2) automatically tuning $\\lambda$ in the Cartpole domain and (3) tracking a moving goal in a navigation task. In all tests we used the Sarsa($\\lambda$) learning algorithm with tile coding function approximation (see Sutton and Barto [25]).
All experiments were evaluated using RL-Glue [27] and averaged over 30 independent runs.\n\nFigure 1: Results showing (a) convergence of various exploration techniques in the navigation task and (b) average cumulative reward of various exploration techniques on the navigation task.\n\n6.1 Exploration\n\nTo evaluate the effectiveness of confidence-based exploration, we use a simple two-goal continuous navigation task. The small goal yields a reward of 1.0 on every visit. The flashing goal yields a reward selected uniformly from {100, -100, 5, -5, 50}. The reward on all other steps is zero and $\\gamma = 0.99$ (similar results for -1 per step and $\\gamma = 1.0$). The agent's observation is a continuous (x, y) position and actions move the agent {N, S, E, W}, perturbed by uniform noise 10% of the time. We present only the first 200 episodes to highlight early learning performance.\nSimilar to Kaelbling, we select the action with the highest upper confidence in each state. We compare our confidence exploration algorithm to three baselines commonly used in continuous-state MDPs: (1) $\\epsilon$-greedy (selecting the highest-value action with probability $1 - \\epsilon$, random otherwise), (2) optimistic initialization (initializing all weights to a high fixed value to encourage exploration) and (3) softmax (choosing actions probabilistically according to their values). We also compare our algorithm to an exploration policy using normal (instead of bootstrapped) intervals to investigate the effectiveness of making simplifying assumptions on the data distribution. We present the results for the best parameter setting for each exploration policy for clarity. Figure 1 summarizes the results.\nThe $\\epsilon$-greedy policy converges slowly to the small goal.
The optimistic policy slowly converges to the small goal for lower initializations and does not favour either goal for higher initializations. The softmax policy navigates to the small goal on most runs and also converges slowly. The normal-interval exploration policy does prefer the flashing goal, but not as quickly as the bootstrap policy. Finally, the bootstrap-interval exploration policy achieves the highest cumulative reward and is the only policy that converges to the flashing goal, despite the large variance in the reward signal.\n\n6.2 Adjusting Lambda\n\nTo illustrate the effect of adjusting $\\lambda$ based on confidence intervals, we study the Cartpole problem. We selected Cartpole because the performance of Sarsa is particularly sensitive to $\\lambda$ in this domain. The objective in Cartpole is to apply forces to a cart on a track to keep a pole from falling over. An episode ends when the pole falls past a given angle or the cart reaches the end of the track. The reward is +1 for each step of the episode. The agent's observations are the cart position and velocity and the pole's angle and angular velocity. The Cartpole environment is based on Sutton and Barto's [25] pole-balancing task and is available in RL-library [27].\nTo adjust the $\\lambda$ value, we reset $\\lambda$ on every time step: $\\lambda = \\mathrm{normalized}(ucb)$, where $ucb = 0.9 \\cdot ucb + 0.1 \\cdot$ getUpperConfidence($\\phi(s, a)$, $\\theta$, $\\alpha$). The confidence estimates were only used to adjust $\\lambda$ for clarity: exploration was performed using optimistic initialization. Figure 2 presents the average balancing time on the last episode for various values of $\\lambda$. The flat line depicts the average balancing time for Sarsa with $\\lambda$ tuned via confidence estimates. Setting $\\lambda$ via confidence estimates achieves performance near the best value of $\\lambda$.
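The per-step update used above can be sketched as follows. This is our own minimal sketch of the smoothing rule $ucb \leftarrow 0.9 \cdot ucb + 0.1 \cdot$ getUpperConfidence$(\cdot)$ with $\lambda = \mathrm{normalized}(ucb)$; the class name and the `max_width` normalization scale are our assumptions, since the paper does not specify how the bound is normalized into [0, 1].

```python
import numpy as np

class LambdaAdjuster:
    """Exponentially smoothed lambda from a stream of confidence bounds.

    Mirrors the Cartpole update: ucb <- 0.9 * ucb + 0.1 * new_bound,
    lambda = normalized(ucb). max_width is a hypothetical scale used to
    map the smoothed bound into [0, 1] (not specified in the paper).
    """

    def __init__(self, max_width=1.0):
        self.ucb = 0.0
        self.max_width = max_width

    def update(self, new_bound):
        """Fold in the latest upper-confidence value; return the new lambda."""
        self.ucb = 0.9 * self.ucb + 0.1 * new_bound
        return float(np.clip(self.ucb / self.max_width, 0.0, 1.0))
```

In use, `new_bound` would come from a call such as getUpperConfidence(phi(s, a), theta, alpha) on each step, so $\lambda$ rises while the value estimate is uncertain and decays toward the confident regime.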
We also tested adjusting λ using normal confidence intervals; however, the normal confidence intervals resulted in worse performance than any fixed value of λ.

Figure 2: Performance of Sarsa(λ) on Cartpole for various values of λ. The straight line depicts the performance of Sarsa with λ adjusted using the confidence estimation algorithm.

6.3 Non-stationary Navigation Task

One natural source of non-stationarity is introduced by shaping a robot through successive approximations to a goal task (e.g., changing the reward function). We studied the effects of this form of non-stationarity, where the agent learns to go to one goal and then another, better goal becomes available (near the first goal, to better guide the agent to the next goal). In our domain, the agent receives −1 reward per step and +10 at termination in a goal region. After 150 episodes, the goal region is teleported to a new location within 50 steps of the previous goal. The agent receives +10 in the new goal and now 0 in the old goal. We used ε = 0 to enable exploration only with optimistic initialization.

We recorded the number of times the agent converged to the new goal after the change, following an initial learning period of 150 episodes. The bootstrap-based explorer found the new goal 70% of the time. It did not always find the new goal because the −1-per-step reward biased it to stay with the safe 0 goal.
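The reward structure just described can be summarized in a short sketch; the function name and boolean goal indicators are ours, and the episode threshold is written out explicitly:

```python
def navigation_reward(at_old_goal, at_new_goal, episode):
    """Non-stationary navigation reward (sketch of the setup above):
    -1 per step, +10 at the goal region; after episode 150 the goal
    moves, the new goal pays +10 and the old region pays only 0."""
    if episode < 150:                 # before the change
        return 10.0 if at_old_goal else -1.0
    if at_new_goal:                   # after the change: new goal pays
        return 10.0
    return 0.0 if at_old_goal else -1.0
```

The tension the explorer faces is visible here: reaching the old region still stops the −1-per-step penalty, so without renewed exploration the agent can settle for the 0 outcome.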
Interestingly, optimistic initialization was unable to find the new goal because of this bias, illustrating that the confidence-based explorer detected the increase in variance and promoted re-exploration automatically.

7 Conclusion

In this work, we investigated constructing confidence intervals on value estimates in the continuous-state reinforcement learning setting. We presented a robust approach to computing confidence estimates for function approximation using bootstrapping, a nonparametric estimation technique. We proved that our confidence estimate has low coverage error under mild assumptions on the learning algorithm. In particular, we did so even for a changing policy that uses the confidence estimates. We illustrated the usefulness of our estimates for three applications: exploration, tuning λ and tracking.

We are currently exploring several directions for future work. We have begun testing the confidence-based exploration on a mobile robot platform. Despite the results presented in this work, many traditional deterministic, negative cost-to-goal problems (e.g., Mountain Car, Acrobot and Puddle World) are efficiently solved using optimistic exploration. Robotic tasks, however, are often more naturally formulated as continual learning tasks with a sparse reward signal, such as a negative reward for bumping into objects or a positive reward for reaching some goal. We expect confidence-based techniques to perform better in these settings, where the reward range may be truly unknown (e.g., generated dynamically by a human teacher) and under natural variability in the environment (noisy sensors and imperfect motion control). We have also begun evaluating confidence-interval-driven behaviour for large-scale, parallel off-policy learning on the same robot platform.

There are several potential algorithmic directions, in addition to those mentioned throughout this work.
We could potentially improve coverage error by extending other bootstrapping techniques, such as the Markov bootstrap, to non-stationary data. We could also explore the theoretical work on exponential bounds, such as the Azuma-Hoeffding inequality, to obtain different confidence estimates with low coverage error. Finally, it would be interesting to extend the theoretical results in the paper to sparse representations.

Acknowledgements: We would like to thank Csaba Szepesvári, Narasimha Prasad and Daniel Lizotte for their helpful comments and NSERC, Alberta Innovates and the University of Alberta for funding the research.

References

[1] D.W.K. Andrews. The block-block bootstrap: Improved asymptotic refinements. Econometrica, 72(3):673–700, 2004.

[2] G.E.P. Box, G.M. Jenkins, and G.C. Reinsel. Time series analysis: Forecasting and control. Holden-Day, San Francisco, 1976.

[3] R.I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213–231, 2002.

[4] A.C. Davison and D.V. Hinkley. Bootstrap methods and their application. Cambridge University Press, 1997.

[5] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203, 2010.

[6] Y. Engel, S. Mannor, and R. Meir. Reinforcement learning with Gaussian processes. In Proceedings of the 22nd International Conference on Machine Learning, page 208. ACM, 2005.

[7] P. Hall. The bootstrap and Edgeworth expansion. Springer Series in Statistics, 1997.

[8] Peter Hall, Joel L. Horowitz, and Bing-Yi Jing. On blocking rules for the bootstrap with dependent data. Biometrika, 82(3):561–574, 1995.

[9] Todd Hester and Peter Stone.
Generalized model learning for reinforcement learning in factored domains. In The Eighth International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2009.

[10] J.L. Horowitz. Bootstrap methods for Markov processes. Econometrica, 71(4):1049–1082, 2003.

[11] Leslie P. Kaelbling. Learning in Embedded Systems. The MIT Press, May 1993.

[12] S.N. Lahiri. Edgeworth correction by moving block bootstrap for stationary and nonstationary data. In Exploring the Limits of Bootstrap, pages 183–214, 1992.

[13] S. Lee and P.Y. Lai. Double block bootstrap confidence intervals for dependent data. Biometrika, 2009.

[14] L. Li, M.L. Littman, and C.R. Mansley. Online exploration in least-squares policy iteration. In Proc. of the 8th Int. Conf. on Autonomous Agents and Multiagent Systems, volume 2, pages 733–739, 2009.

[15] H.R. Maei, C. Szepesvári, S. Bhatnagar, and R.S. Sutton. Toward off-policy learning control with function approximation. In ICML, 2010.

[16] S. Mannor, D. Simester, P. Sun, and J.N. Tsitsiklis. Bias and variance in value function estimation. In Proceedings of the Twenty-First International Conference on Machine Learning, page 72. ACM, 2004.

[17] Lilyana Mihalkova and Raymond J. Mooney. Using active relocation to aid reinforcement learning. In FLAIRS Conference, pages 580–585, 2006.

[18] Nicholas K. Jong and Peter Stone. Model-based exploration in continuous state spaces. In The 7th Symposium on Abstraction, Reformulation, and Approximation, July 2007.

[19] A. Nouri and M.L. Littman. Multi-resolution exploration in continuous spaces. In NIPS, pages 1209–1216, 2008.

[20] François Rivest and Doina Precup. Combining TD-learning with cascade-correlation networks. In ICML, pages 632–639, 2003.

[21] J. Shao and D. Tu. The jackknife and bootstrap. Springer, 1995.

[22] A.L. Strehl and M.L. Littman.
An empirical evaluation of interval estimation for Markov decision processes. In Proc. of the 16th Int. Conf. on Tools with Artificial Intelligence (ICTAI 2004), 2004.

[23] Alexander L. Strehl and Michael L. Littman. Online linear regression and its application to model-based reinforcement learning. In NIPS, 2007.

[24] R.S. Sutton. Adapting bias by gradient descent: An incremental version of delta-bar-delta. In Proceedings of the National Conference on Artificial Intelligence, pages 171–176, 1992.

[25] R.S. Sutton and A.G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, USA, 1998.

[26] R.S. Sutton, A. Koop, and D. Silver. On the role of tracking in stationary environments. In Proceedings of the 24th International Conference on Machine Learning, page 878. ACM, 2007.

[27] Brian Tanner and Adam White. RL-Glue: Language-independent software for reinforcement-learning experiments. JMLR, 10:2133–2136, September 2009.

[28] B.A. Turlach. Bandwidth selection in kernel density estimation: A review. CORE and Institut de Statistique, 1993.

[29] J. Zvingelis. On bootstrap coverage probability with dependent data. Computer-Aided Econ., 2001.