{"title": "Context-dependent upper-confidence bounds for directed exploration", "book": "Advances in Neural Information Processing Systems", "page_first": 4779, "page_last": 4789, "abstract": "Directed exploration strategies for reinforcement learning are critical for learning an optimal policy in a minimal number of interactions with the environment. Many algorithms use optimism to direct exploration, either through visitation estimates or upper confidence bounds, as opposed to data-inefficient strategies like e-greedy that use random, undirected exploration. Most data-efficient exploration methods require significant computation, typically relying on a learned model to guide exploration. Least-squares methods have the potential to provide some of the data-efficiency benefits of model-based approaches\u2014because they summarize past interactions\u2014with the computation closer to that of model-free approaches. In this work, we provide a novel, computationally efficient, incremental exploration strategy, leveraging this property of least-squares temporal difference learning (LSTD). We derive upper confidence bounds on the action-values learned by LSTD, with context-dependent (or state-dependent) noise variance. Such context-dependent noise focuses exploration on a subset of variable states, and allows for reduced exploration in other states. 
We empirically demonstrate that our algorithm can converge more quickly than other incremental exploration strategies using confidence estimates on action-values.", "full_text": "Context-Dependent Upper-Con\ufb01dence Bounds for\n\nDirected Exploration\n\nRaksha Kumaraswamy1, Matthew Schlegel1, Adam White1,2, Martha White1\n\n1Department of Computing Science, University of Alberta; 2DeepMind\n\n{kumarasw, mkschleg}@ualberta.ca, adamwhite@google.com, whitem@ualberta.ca\n\nAbstract\n\nDirected exploration strategies for reinforcement learning are critical for learning\nan optimal policy in a minimal number of interactions with the environment. Many\nalgorithms use optimism to direct exploration, either through visitation estimates\nor upper-con\ufb01dence bounds, as opposed to data-inef\ufb01cient strategies like \u270f-greedy\nthat use random, undirected exploration. Most data-ef\ufb01cient exploration methods\nrequire signi\ufb01cant computation, typically relying on a learned model to guide\nexploration. Least-squares methods have the potential to provide some of the\ndata-ef\ufb01ciency bene\ufb01ts of model-based approaches\u2014because they summarize past\ninteractions\u2014with the computation closer to that of model-free approaches. In\nthis work, we provide a novel, computationally ef\ufb01cient, incremental exploration\nstrategy, leveraging this property of least-squares temporal difference learning\n(LSTD). We derive upper-con\ufb01dence bounds on the action-values learned by\nLSTD, with context-dependent (or state-dependent) noise variance. Such context-\ndependent noise focuses exploration on a subset of variable states, and allows for\nreduced exploration in other states. 
We empirically demonstrate that our algorithm\ncan converge more quickly than other incremental exploration strategies using\ncon\ufb01dence estimates on action-values.\n\n1\n\nIntroduction\n\nExploration is crucial in reinforcement learning, as the data gathering process signi\ufb01cantly impacts\nthe optimality of the learned policies and values. The agent needs to balance the amount of time\ntaking exploratory actions to learn about the world, versus taking actions to maximize cumulative\nrewards. If the agent explores insuf\ufb01ciently, it could converge to a suboptimal policy; exploring too\nconservatively, however, results in many suboptimal decisions. The goal of the agent is data-ef\ufb01cient\nexploration: to minimize how many samples are wasted in exploration, particularly exploring parts of\nthe world that are known, while still ensuring convergence to the optimal policy.\nTo achieve such a goal, directed exploration strategies are key. Undirected strategies, where random\nactions are taken such as in \u270f-greedy, are a common default. In small domains these methods are\nguaranteed to \ufb01nd an optimal policy [35], because the agent is guaranteed to visit the entire space\u2014\nbut may take many many steps to do so, as undirected exploration can interfere with improving\npolicies in incremental control. In this paper we explore the idea of constructing con\ufb01dence intervals\naround the agent\u2019s value estimates. The agent can use these learned con\ufb01dence intervals to select\nactions with the highest upper-con\ufb01dence bound ensuring actions selected are of high value or whose\nvalues are highly uncertain. This optimistic approach is promising for directed exploration, but as yet\nthere are few such methods that are model-free, incremental and computationally ef\ufb01cient.\nDirected exploration strategies have largely been explored under the framework of \u201coptimism in\nthe face of uncertainty\u201d [13]. 
These can generally be categorized into count-based approaches\nand con\ufb01dence-based approaches. Count-based approaches estimate the \u201cknown-ness\u201d of a state,\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\ftypically by maintaining counts for \ufb01nite state-spaces [16, 6, 36, 37, 43] and extensions on counting\nfor continuous states [14, 10, 26, 19, 33, 15, 32, 21]. Con\ufb01dence interval estimates, on the other\nhand, depend on variance of the target, not just on visitation frequency for states. Con\ufb01dence-based\napproaches can be more data-ef\ufb01cient for exploration, because the agent can better direct exploration\nwhere the estimates are less accurate. The majority of con\ufb01dence-based approaches compute\ncon\ufb01dence intervals on model parameters, both for \ufb01nite state-spaces [12, 47, 16, 6, 2, 3, 9, 43, 29]\nand continuous state-spaces [11, 27, 8, 1, 28]. There is early work quantifying uncertainty for value\nestimates directly for \ufb01nite state-spaces [22], describing the dif\ufb01culties with extending the local\nmeasures of uncertainty from the bandit literature to RL, since there are long-term dependencies.\nThese dif\ufb01culties suggest why using con\ufb01dence intervals directly on value estimates for exploration\nin RL has been less explored, until recently. More approaches are now being developed that maintain\ncon\ufb01dence intervals on the value function for continuous state-spaces, by maintaining a distribution\nover value functions [8, 31], or by maintaining a randomized set of value functions from which to\nsample [46, 31, 30, 34, 25]. Though signi\ufb01cant steps forward, these approaches have limitations\nparticularly in terms of computational ef\ufb01ciency. Delayed Gaussian Process Q-learning (DGPQ)\n[8] requires updating two Gaussian processes, which is cubic in the number of basis vectors for\nthe Gaussian process. 
RLSVI [31] is relatively ef\ufb01cient, maintaining a Gaussian distribution over\nparameters with Thompson sampling to get randomized values. Their staged approach for \ufb01nite-\nhorizon problems, however, does not allow for value estimates to be updated online, as the value\nfunction is \ufb01xed per episode to gather an entire trajectory of data. Moerland et al. [25], on the\nother hand, sample a new parameter vector from the posterior distribution each time an action is\nconsidered, which is expensive. The bootstrapping approaches can be ef\ufb01cient, as they simply have\nto store several value functions, either for training on a bootstrapped subset of samples\u2014such as in\nBootstrapped DQN [30]\u2014or for maintaining a moving bootstrap around the changing parameters\nthemselves, for UCBootstrap [46]. For both of these approaches, however, it is unclear how many\nvalue functions would be required, which could be large depending on the problem.\nIn this paper, we provide an incremental, model-free exploration algorithm with fast converging upper-\ncon\ufb01dence bounds, called UCLS: Upper-Con\ufb01dence Least-Squares. We derive the upper-con\ufb01dence\nbounds for Least-Squares Temporal Difference learning (LSTD), taking advantage of the fact that\nLSTD has an ef\ufb01cient summary of past interaction to facilitate computation of con\ufb01dence intervals.\nImportantly, these upper-con\ufb01dence bounds have context-dependent variance, where variance is\ndependent on state rather than a global estimate, focusing exploration on states with higher-variance.\nComputing con\ufb01dence intervals for action-values in RL has remained an open problem, and we\nprovide the \ufb01rst theoretically sound result for obtaining upper-con\ufb01dence bounds for policy evaluation\nunder function approximation, without making strong assumptions on the noise. We demonstrate\nin several simulated domains that UCLS outperforms DGPQ, UCBootstrap, and RLSVI. 
We also empirically show the benefit of UCLS over a simplified version that uses a global variance estimate, rather than context-dependent variance.

2 Background

We focus on the problem of learning an optimal policy for a Markov decision process, from on-policy interaction. A Markov decision process consists of (S, A, Pr, r, γ) where S is the set of states; A is the set of actions; Pr : S × A × S → [0, 1] provides the transition probabilities; r : S × A × S → R is the reward function; and γ : S × A × S → [0, 1] is the transition-based discount function which enables either continuing or episodic problems to be specified [45]. On each step, the agent selects action At in state St, and transitions to St+1, according to Pr, receiving reward R_{t+1} := r(St, At, St+1) and discount γ_{t+1} := γ(St, At, St+1). For a policy π : S × A → [0, 1], where Σ_{a∈A} π(s, a) = 1 for all s ∈ S, the value at a given state s, taking action a, is the expected discounted sum of future rewards, with actions selected according to π into the future,

Q^π(s, a) = E[ R_{t+1} + γ_{t+1} Σ_{a'∈A} π(S_{t+1}, a') Q^π(S_{t+1}, a') | S_t = s, A_t = a ].

For problems in which Q^π can be stored in a table, a fixed point for the action-values Q^π exists for a given π. In most domains, Q^π must be approximated by Q^π_w, parametrized by w ∈ W ⊂ R^d. In the case of linear function approximation, state-action features x(s_t, a_t) are used to approximate action-values Q^π_w(s_t, a_t) = x(s_t, a_t)^⊤ w. The weights w can be learned with a stochastic approximation algorithm, called temporal difference (TD) learning [39]. The TD update [39] processes samples one at a time, w ← w + α δ_t z_t, with δ_t := R_{t+1} + γ_{t+1} x_{t+1}^⊤ w − x_t^⊤ w for x_t := x(S_t, A_t). The eligibility trace z_t = x_t + γ_{t+1} λ z_{t−1} facilitates multi-step updates via an exponentially weighted memory of previous feature activations decayed by λ ∈ [0, 1], with z_0 = 0.
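The TD(λ) update just described can be sketched as follows; this is a minimal sketch with our own variable names, not code from the paper.

```python
import numpy as np

def td_lambda_update(w, z, x_t, x_tp1, r_tp1, gamma_tp1, lam, alpha):
    """One TD(lambda) update for linear action-values (Section 2 sketch).

    w:      weight vector, z: eligibility trace
    x_t, x_tp1: state-action features at times t and t+1
    """
    z = x_t + gamma_tp1 * lam * z                       # trace z_t
    delta = r_tp1 + gamma_tp1 * x_tp1 @ w - x_t @ w     # TD error delta_t
    w = w + alpha * delta * z
    return w, z
```

With a single feature, reward 1, and discount 0, repeated updates drive the estimate to the fixed point 1.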
Alternatively, we can directly compute the weight vector found by TD using least-squares temporal difference learning (LSTD) [5]. The LSTD solution is more data-efficient, and can avoid the need to tune TD's stepsize parameter α > 0. The LSTD update can be efficiently computed incrementally without approximation or storing the data [5, 4], by maintaining a matrix A_T and vector b_T,

A_T := (1/T) Σ_{t=0}^{T−1} z_t (x_t − γ_{t+1} x_{t+1})^⊤        b_T := (1/T) Σ_{t=0}^{T−1} z_t R_{t+1}        (1)

The value function approximation at time step T is the weights that satisfy the linear system A_T w = b_T. In practice, the inverse of the matrix, A⁻¹, is maintained using a Sherman-Morrison update, with a small regularizer η added to the matrix A to guarantee invertibility [41].

One approach to ensure systematic exploration is to initialize the agent's value estimates optimistically. The action-value function must be initialized to predict the maximum possible return (or greater) from each state and action. For example, for cost-to-goal problems, with −1 per step, the values can be initialized to zero. For continuing problems, with constant discount γ_c < 1, the values can be initialized to G_max = R_max/(1 − γ_c), if the maximum reward R_max is known. For fixed features that are non-negative and encode locality—such as tile coding or radial basis functions—the weights w can be simply set to G_max, to make Q_w optimistic.

More generally, however, it can be problematic to use optimistic initialization. Optimistic initialization assumes the beginning of time is special—a period when systematic exploration should be performed, after which the agent should more or less exploit its current knowledge. Many problems are non-stationary—or at least benefit from a tracking approach due to aliasing caused by function approximation—and benefit from continual exploration.
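The incremental LSTD update in Equation (1), with the Sherman-Morrison inverse mentioned above, might be implemented as in the sketch below. Names are ours; we accumulate the unnormalized sums (the 1/T factor cancels when solving A w = b), and fold the regularizer η into the initial inverse.

```python
import numpy as np

class IncrementalLSTD:
    """Sketch of incremental LSTD(lambda) with a Sherman-Morrison inverse.

    Ainv tracks (eta*I + sum_t z_t (x_t - gamma*x_{t+1})^T)^{-1},
    where eta is a small regularizer ensuring invertibility.
    """
    def __init__(self, d, eta=1e-2, lam=0.0):
        self.Ainv = np.eye(d) / eta
        self.b = np.zeros(d)
        self.z = np.zeros(d)
        self.lam = lam

    def update(self, x_t, x_tp1, r_tp1, gamma_tp1):
        self.z = x_t + gamma_tp1 * self.lam * self.z
        d_vec = x_t - gamma_tp1 * x_tp1       # A gets rank-one term z d^T
        # Sherman-Morrison: (A + z d^T)^{-1} from A^{-1}
        Az = self.Ainv @ self.z
        dA = d_vec @ self.Ainv
        self.Ainv -= np.outer(Az, dA) / (1.0 + dA @ self.z)
        self.b += self.z * r_tp1

    @property
    def w(self):
        return self.Ainv @ self.b
```

On a one-state chain with reward 1 and constant discount 0.9, the weight converges to the true value 1/(1 − 0.9) = 10.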
Further, unlike for fixed features, it is unclear how to set and maintain initial values at G_max for learned features, such as with neural networks. Optimistic initialization is also not straightforward for algorithms like LSTD, which completely overwrite the estimate w on each step with a closed-form solution. In fact, we have found that this issue with LSTD has been obfuscated, because the regularizer η has inadvertently played a role in providing optimism (see Appendix A). Rather, to use optimism in LSTD for control, we need to explicitly compute upper-confidence bounds.

Confidence intervals around action-values, then, provide another mechanism for exploration in reinforcement learning. Consider action selection with explicit confidence intervals around mean estimates Q̂_w(S_t, A_t), with estimated radius Û(S_t, A_t). The action selection is greedy w.r.t. these optimistic values, argmax_a Q̂_w(S_t, a) + Û(S_t, a), which provides a high-confidence upper bound on the best possible value for that action. The use of upper-confidence bounds on value estimates for exploration has been well-studied and motivated theoretically in online learning [7]. In reinforcement learning, there have only been a few specialized proofs for particular algorithms using optimistic estimates [8, 31], but the result can be expressed more generally by using the idea of stochastic optimism. We extract the central argument by Osband et al. [31] to provide a general Optimistic Values Theorem in Appendix B.
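The greedy-optimistic action selection just described can be sketched as follows; `radius_fn` is a stand-in for whatever confidence-radius estimate is available, not a function from the paper.

```python
import numpy as np

def optimistic_action(x_per_action, w, radius_fn):
    """Greedy selection w.r.t. upper-confidence values:
    argmax_a  Qhat(s, a) + Uhat(s, a).   (Sketch; names are ours.)

    x_per_action: list of state-action feature vectors, one per action.
    """
    ucb = [x @ w + radius_fn(x) for x in x_per_action]
    return int(np.argmax(ucb))
```

An action with a lower mean estimate but a larger confidence radius can win the argmax, which is exactly what directs exploration toward uncertain actions.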
In particular, similar to online learning, we can guarantee that greedy action selection according to upper-confidence values will converge to the optimal policy, if the confidence interval radius shrinks to zero, if the algorithm that estimates action-values for a policy converges to the corresponding action-values, and if the upper-confidence estimates are stochastically optimistic—they remain above the optimal action-values in expectation.

Motivated by this result, we pursue principled ways to compute upper-confidence bounds for the general, online reinforcement learning setting. We make a step towards computing such values incrementally, under function approximation, by providing upper-confidence bounds for value estimates made by LSTD, for a fixed policy. We approximate these bounds to create a new algorithm for control—called Upper-Confidence-Least-Squares (UCLS).

3 Estimating Upper-Confidence Bounds for Policy Evaluation using LSTD

Consider the goal of obtaining a confidence interval around value estimates learned incrementally by LSTD for a fixed policy π. The value estimate is x^⊤w for state-action features x for the current state and action. We would like to guarantee, with probability 1 − p for a small p > 0, that the confidence interval around this estimate contains the value x^⊤w* given by the optimal w* ∈ W. To estimate such an interval without parametric assumptions, we use Chebyshev's inequality, which—unlike other concentration inequalities like Hoeffding or Bernstein—does not require independent samples.

To use this inequality, we need to determine the variance of the estimate x^⊤w; the variance of the estimate, given x, is due to the variance of the weights. Let w* be the fixed-point solution for the projected Bellman operator for the λ-return—the TD fixed point, for a fixed policy π.
To characterize the noise for this optimal estimator, let ν_t be the TD-error for the optimal weights w*, where

r_{t+1} = (x_t − γ_{t+1} x_{t+1})^⊤ w* + ν_t,    with E[ν_t z_t] = 0.        (2)

The expectation is taken across all states weighted by the sampling distribution, typically the stationary distribution d^π : S → [0, 1] or, in the off-policy case, the stationary distribution of the behaviour policy. We know that E[ν_t z_t] = 0, by the definition of the Projected Bellman Error fixed point. This noise ν_t is incurred from the variability in the reward, the variability in the transition dynamics, and potentially the capabilities of the function approximator. A common assumption—when using linear regression for contextual bandits [20] and for reinforcement learning [31]—is that the variance of the target is a constant value σ² for all contexts x. Such an assumption, however, is likely to produce larger confidence intervals than necessary. For example, consider a one-state world with two actions, where one action has a high-variance reward and the other has a lower-variance reward (see Appendix A, Figure 4). A global sample variance will encourage both actions to be taken many times. For data-efficient exploration, however, the agent should take the high-variance action more, and only needs a few samples from the low-variance action.

We derive a confidence interval for LSTD, in Theorem 1. We also derive the confidence interval assuming a global variance, in Corollary 1, to provide a comparison. We compare to using this global-variance upper-confidence bound in our experiments, and show that it results in significantly worse performance than using a context-dependent variance. Note that we do not assume A_T is invertible; if we did, the big-O term in (3) below would disappear.
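The one-state, two-action example above can be made concrete numerically. The sketch below (our own illustrative reward distributions, with a Chebyshev-style radius) shows that a global variance gives both actions the same interval width, whereas per-action variances shrink the interval for the low-variance action.

```python
import numpy as np

rng = np.random.default_rng(0)
# One state, two actions: action 0 has high-variance rewards,
# action 1 has low-variance rewards (illustrative numbers).
r0 = rng.normal(0.0, 2.0, size=200)    # high variance
r1 = rng.normal(0.0, 0.1, size=200)    # low variance

def radius(var, n, p=0.1):
    # Chebyshev-style radius for a sample mean: sqrt(Var[mean] / p)
    return np.sqrt(var / (n * p))

global_var = np.var(np.concatenate([r0, r1]))
print("global:     ", radius(global_var, 200))
print("per-action: ", radius(np.var(r0), 200), radius(np.var(r1), 200))
```

Under the global estimate both actions look equally uncertain, so both are over-explored; the context-dependent radii direct samples to the genuinely noisy action.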
We include this term for preciseness of the result—even though we will not estimate it—because for smaller T, A_T is unlikely to be invertible. However, we expect this big-O term to get small quickly, and be dominated by the other terms. In our algorithm, therefore, we ignore the big-O term.

Theorem 1. Let ν̄_T := (1/T) Σ_{t=0}^{T−1} z_t ν_t and w_T = A⁺_T b_T, where A⁺_T is the pseudoinverse of A_T. Let ε*_T := (A⁺_T A_T − I) w* reflect the degree to which A_T is not invertible; it is zero when A_T is invertible. Assume that the following are all finite: E[A⁺_T ν̄_T + ε*_T], V[A⁺_T ν̄_T + ε*_T], E[A⁺_T ν̄_T ν̄_T^⊤ A⁺_T^⊤] and all state-action features x. With probability at least 1 − p, given state-action features x,

x^⊤w* ≤ x^⊤w_T + √((p+1)/p) √( x^⊤ E[A⁺_T ν̄_T ν̄_T^⊤ A⁺_T^⊤] x ) + O( E[(x^⊤ ε*_T)²] )        (3)

Proof: First we compute the mean and variance for our learned parameters. Because r_{t+1} = (x_t − γ_{t+1} x_{t+1})^⊤ w* + ν_t,

w_T = A⁺_T ( (1/T) Σ_{t=0}^{T−1} z_t r_{t+1} )
    = A⁺_T ( (1/T) Σ_{t=0}^{T−1} z_t ( (x_t − γ_{t+1} x_{t+1})^⊤ w* + ν_t ) )
    = A⁺_T A_T w* + A⁺_T ( (1/T) Σ_{t=0}^{T−1} z_t ν_t )
    = w* + A⁺_T ν̄_T + ε*_T

This estimate has a small amount of bias, which vanishes asymptotically.
But, for a finite sample,

E[ A⁺_T ( (1/T) Σ_{t=0}^{T−1} z_t ν_t ) ] ≠ E[A⁺_T] E[ (1/T) Σ_{t=0}^{T−1} z_t ν_t ] = 0.

Further, because A_T may not be invertible, there is an additional error term ε*_T, which will vanish with enough samples, i.e., once A_T can be guaranteed to be invertible.

For the covariance, because

w_T − E[w_T] = w* + A⁺_T ν̄_T + ε*_T − E[ w* + A⁺_T ν̄_T + ε*_T ]
             = A⁺_T ν̄_T + ε*_T − E[ A⁺_T ν̄_T + ε*_T ],

the covariance of the weights is

V[w_T] = V[ A⁺_T ν̄_T + ε*_T ].

The goal for computing variances is to use a concentration inequality. Chebyshev's inequality¹ states that for a random variable X, if E[X] and V[X] are bounded, then for any ε ≥ 0:

Pr( |X − E[X]| < ε √V[X] ) ≥ 1 − 1/ε².

If we set ε = √(1/p), then this gives

Pr( |X − E[X]| < √(1/p) √V[X] ) ≥ 1 − p.

Now we have characterized the variance of the weights, but what we really want is to characterize the variance of the value estimates. Notice that the variance of the value-estimate, for state-action features x, is

V[x^⊤w_T | x] = E[x^⊤ w_T w_T^⊤ x | x] − E[x^⊤ w_T | x]²
             = x^⊤( E[w_T w_T^⊤] − E[w_T] E[w_T]^⊤ ) x
             = x^⊤ V[w_T] x.        (4)

Therefore, the variance of the estimate is characterized by the variance of the weights.
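Chebyshev's inequality holds without distributional assumptions, which is why it is usable here despite dependent samples. A quick numerical check (the skewed distribution is our own choice, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# A skewed, non-Gaussian variable: Chebyshev needs only mean and variance.
x = rng.exponential(scale=1.0, size=100_000)    # E[X] = 1, V[X] = 1
eps = 2.0
deviation_freq = np.mean(np.abs(x - x.mean()) >= eps * x.std())
print(deviation_freq, "<=", 1 / eps**2)
```

The empirical deviation frequency (about 0.05 here) sits well under the Chebyshev bound 1/ε² = 0.25, illustrating the paper's remark that the bound is loose but safe.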
With high probability,

x^⊤w_T − x^⊤w* = x^⊤(w_T − E[w_T]) + x^⊤(E[w_T] − w*)
  ≤ |x^⊤(w_T − E[w_T])| + |x^⊤(E[w_T] − w*)|
  ≤ √(1/p) √( x^⊤ V[A⁺_T ν̄_T + ε*_T] x ) + |x^⊤ E[A⁺_T ν̄_T + ε*_T]|
  = √(1/p) √( x^⊤ ( E[A⁺_T ν̄_T ν̄_T^⊤ A⁺_T^⊤ + Σ*_T] − µ*_T µ*_T^⊤ ) x ) + √( x^⊤ µ*_T µ*_T^⊤ x )        (5)

where the second inequality uses Chebyshev's inequality together with Equation (4), and the last step is a rewriting using the definitions µ*_T := E[A⁺_T ν̄_T + ε*_T] and Σ*_T := A⁺_T ν̄_T ε*_T^⊤ + ε*_T (A⁺_T ν̄_T)^⊤ + ε*_T ε*_T^⊤.

To simplify (5), we need to determine an upper bound for the general formula c√(a² − b²) + b, where a ≥ b ≥ 0. Because p < 1, we know that c = √(1/p) ≥ 1. Therefore, the extremal points for b, b = a and b = 0, both result in an upper bound of ca. Taking the derivative of the objective gives a single stationary point in [0, a], at b = a/√(c² + 1). The value at this point evaluates to a√(c² + 1). Therefore, this objective is upper-bounded by a√(c² + 1).

Now for a² = x^⊤ E[A⁺_T ν̄_T ν̄_T^⊤ A⁺_T^⊤ + Σ*_T] x, the term involving x^⊤ E[Σ*_T] x should quickly disappear, since it is only due to the potential lack of invertibility of A_T. This term is equal to E[ 2 (x^⊤ A⁺_T ν̄_T)(x^⊤ ε*_T) + (x^⊤ ε*_T)² ], which results in the additional O(E[(x^⊤ ε*_T)²]) term in the bound.  □

Corollary 1. Assume that the ν_t are i.i.d., with mean zero and bounded variance σ². Let z̄_T := (1/T) Σ_{t=0}^{T−1} z_t, and assume that the following are finite: E[ε*_T], V[ε*_T], E[A⁺_T z̄_T z̄_T^⊤ A⁺_T^⊤] and all state-action features x. With probability at least 1 − p, given state-action features x,

x^⊤w* ≤ x^⊤w_T + σ √((p+1)/p) √( x^⊤ E[A⁺_T z̄_T z̄_T^⊤ A⁺_T^⊤] x ) + O( E[(x^⊤ ε*_T)²] )        (6)

¹Bernstein's inequality cannot be used here because we do not have independent samples. Rather, we characterize the behaviour of the random variable w using the variance of w, but cannot use bounds that assume w is the sum of independent random variables. The bound with Chebyshev will be loose, but we can better control the looseness of the bound with the selection of p and the constant in front of the square root.

Proof: The result follows similarly to above, with some simplifications due to the global variance:

E[A⁺_T ν̄_T] = E[ E[ A⁺_T ν̄_T | S_0, ..., S_T ] ] = E[ A⁺_T (1/T) Σ_{t=0}^{T−1} z_t E[ν_t | S_0, ..., S_T] ] = 0

E[A⁺_T ν̄_T ν̄_T^⊤ A⁺_T^⊤] = σ² E[A⁺_T z̄_T z̄_T^⊤ A⁺_T^⊤]  □

4 UCLS: Estimating upper-confidence bounds for LSTD in control

In this section, we present Upper-Confidence-Least-Squares (UCLS)², a control algorithm which incrementally estimates the upper-confidence bounds provided in Theorem 1, for guiding on-policy exploration. The upper-confidence bounds are sound without requiring i.i.d. assumptions; however, they are derived for a fixed policy. In control, the policy is slowly changing, and so instead we will be slowly tracking this upper bound. The general strategy, like policy iteration, is to slowly estimate both the value estimates and the upper-confidence bounds, under a changing policy that acts greedily with respect to the upper-confidence bounds. Tracking these upper bounds incurs some approximations; we identify and address potential issues here.
The complete pseudocode for UCLS is given in the Appendix (Algorithm 2).

First, we are not evaluating one fixed policy; rather, the policy is changing. The estimates A_T and b_T will therefore be out-of-date. As is common for LSTD with control, we use an exponential moving average, rather than a sample average, to estimate A_T, b_T and the upper-confidence bound. The exponential moving average uses A_t = (1 − β) A_{t−1} + β z_t (x_t − γ_{t+1} x_{t+1})^⊤, for some β ∈ [0, 1]. If β = 1/t, then this reduces to the standard sample average; otherwise, for a fixed β, such as β = 0.01, more recent samples have a higher weight in the average. Because an exponential average is unbiased, the result in Theorem 1 would still hold, and in practice the update will be more effective for the control setting.

Second, we cannot obtain samples of the noise ν_t = r_{t+1} + γ_{t+1} x_{t+1}^⊤ w* − x_t^⊤ w*, which is the TD-error for the optimal value function parameters w* (see Equation (2)). Instead, we use δ_t as a proxy. This proxy results in an upper bound that is too conservative—too loose—because δ_t is likely to be larger than ν_t. This is likely to ensure sufficient exploration, but may cause more exploration than is needed. The moving average update

ν̄_t = ν̄_{t−1} + β_t ( δ_t z_t − ν̄_{t−1} )        (7)

should also help mitigate this issue, as older δ_t are likely larger than more recent ones.

Third, the covariance matrix C estimating E[A⁻¹_T ν̄_T ν̄_T^⊤ A_T^{−⊤}] could underestimate covariances, depending on a skewed distribution over states and depending on the initialization. This is particularly true in early learning, where the distribution over states is skewed to be higher near the start state; a sample average can result in underestimates in as-yet unvisited parts of the space. To see why, let a = A⁻¹_T ν̄_T. The covariance estimate C_ij = E[a_i a_j] corresponds to features i and j.
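The exponential-moving-average estimates above (the update for A and Equation (7)) can be sketched as follows; β = 0.01 is only an illustrative choice, and names are ours.

```python
import numpy as np

def ema_updates(A, nu_bar, z, x_t, x_tp1, gamma_tp1, delta, beta=0.01):
    """Exponential-moving-average estimates used by UCLS (sketch).

    `delta` is the TD error, used as a proxy for the noise of the
    optimal weights; beta weights recent samples more heavily.
    """
    A = (1 - beta) * A + beta * np.outer(z, x_t - gamma_tp1 * x_tp1)
    nu_bar = nu_bar + beta * (delta * z - nu_bar)    # Eq. (7)
    return A, nu_bar
```

Unlike a sample average, a fixed β lets the estimates track the slowly changing policy in control.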
The agent begins in a certain region of the space, and so features that only become active outside of this region will be zero, providing samples a_i a_j = 0. As a result, the covariance is artificially driven down in unvisited regions of the space, because the covariance accumulates updates of 0. Further, if the initialization of the covariance C_ii is an underestimate, a visited state with high variance will artificially look more optimistic than an unvisited state.

We propose two simple approaches to this issue: updating C based on locality, and adaptively adjusting the initialization of C_ii. Each covariance estimate C_ij for features i and j should only be updated if the sampled outer-product is relevant, with the agent in the region where i and j are active. To reflect this locality, each C_ij is updated with a_i a_j only if the eligibility trace is non-zero for both i and j. To adaptively update the initialization, the maximum observed a_i² is stored, as c_max, and the initialization c_0 of each C_ii is retroactively updated using

C_ii ← C_ii − (1 − β)^{c_i} c_0 + (1 − β)^{c_i} c_max

where c_i is the number of times C_ii has been updated. This update is equivalent to having initialized C_ii = c_max.

²We do not characterize the regret of UCLS, and instead, similarly to policy iteration, rely on a sound update under a fixed policy to motivate incrementally estimating these values as if the policy is fixed and then acting according to them. The only model-free algorithm that achieves a regret bound is RLSVI, but that bound is restricted to the finite-horizon, batch, tabular setting. It would be a substantial breakthrough to provide such a regret bound, and it is beyond the scope of this work.
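The equivalence claimed for the retroactive update can be checked numerically: after c_i exponential-moving-average updates with rate β, the initialization retains weight (1 − β)^{c_i}, so swapping c_0 for c_max retroactively reproduces an EMA that started at c_max. A small check with our own illustrative numbers:

```python
import numpy as np

def ema(init, samples, beta):
    """Scalar exponential moving average started at `init`."""
    c = init
    for s in samples:
        c = (1 - beta) * c + beta * s
    return c

beta, c0, cmax = 0.1, 0.5, 3.0
samples = [1.0, 2.0, 1.5, 0.5]       # illustrative a_i^2 samples
c_i = len(samples)

c_from_c0 = ema(c0, samples, beta)
# Retroactive correction: remove c0's residual weight, add cmax's.
corrected = c_from_c0 - (1 - beta)**c_i * c0 + (1 - beta)**c_i * cmax
print(np.isclose(corrected, ema(cmax, samples, beta)))   # True
```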
We provide a more stable retroactive update for C_ii, in the pseudocode in Algorithm 2, that is equivalent to this update.

Fourth, to improve the computational complexity of the algorithm, we propose an alternative, incremental strategy for estimating w, that takes advantage of the fact that we already need to estimate the inverse of A for the upper bound. In order to do so, we make use of the summarized information in A to improve the update, but avoid directly computing A⁻¹, as it may be poorly conditioned. Instead, we maintain an approximation B ≈ (A^⊤)⁻¹ with a simple gradient descent update, to minimize ‖A^⊤ B x_t − x_t‖₂². If B is the inverse of A^⊤, then this loss is zero; otherwise, minimizing it provides an approximate inverse. This estimate B is useful for two purposes in the algorithm. First, it is clearly needed to estimate the upper-confidence bound. Second, it also provides a pre-conditioner for the iterative update w ← w + G(b − Aw), for preconditioner G. The optimal preconditioner is in fact the inverse of A, if it exists. We use G = B^⊤ + ηI for a small η > 0 to ensure that the preconditioner is full rank. Developing this stable update for LSTD required significant empirical investigation into alternatives; in addition to providing a more practical UCLS algorithm, we hope it can improve the use of LSTD in other applications.

5 Experiments

We conducted several experiments to investigate the benefits of UCLS' directed exploration against other methods that use confidence intervals for action selection, to evaluate the sensitivity of UCLS's performance with respect to its key parameter p, and to contrast the advantage contextual variance estimates offer over global variance estimates in control.
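The gradient-descent approximate inverse described in Section 4 can be sketched as below. The step-size and the use of standard basis vectors as probe features are our own choices for the sketch; UCLS would use the observed feature vectors x_t.

```python
import numpy as np

def approx_inverse_transpose(A, steps=2000, alpha=0.1):
    """Maintain B ~= (A^T)^{-1} by gradient descent on ||A^T B x - x||^2.

    The gradient of 0.5 * ||A^T B x - x||^2 w.r.t. B is
    A (A^T B x - x) x^T, giving the rank-one update below.
    """
    d = A.shape[0]
    B = np.zeros((d, d))
    for t in range(steps):
        x = np.eye(d)[:, t % d]          # probe direction (our choice)
        err = A.T @ B @ x - x
        B -= alpha * np.outer(A @ err, x)
    return B

A = np.array([[2.0, 0.3], [0.1, 1.5]])
B = approx_inverse_transpose(A)
print(np.max(np.abs(A.T @ B - np.eye(2))))   # close to 0
```

When the loss reaches zero, A^⊤B = I exactly; otherwise B is the best approximate inverse along the probed directions, which is what makes it usable both for the confidence bound and as a preconditioner.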
Our experiments were intentionally conducted in small—though carefully selected—simulation domains so that we could conduct extensive parameter sweeps, hundreds of runs for averaging, and compare numerous state-of-the-art exploration algorithms (many of which are computationally expensive on larger domains). We believe that such experiments constitute a significant contribution, because effectively using confidence bounds for model-free exploration in RL is still in its infancy—not yet at the large-scale demonstration stage—with much work to be done. This point is highlighted nicely below, as we demonstrate that several recently proposed exploration methods fail on these simple domains.

5.1 Algorithms

We compare UCLS to DGPQ [8], UCBootstrap [46], our extension of LSPI-Rmax to an incremental setting [19], and RLSVI [31]. In-depth descriptions of each algorithm and implementation details can be found in the Appendix. These algorithms are chosen because they either keep confidence intervals explicitly, as in UCBootstrap, or implicitly, as in DGPQ and RLSVI. In addition, we included LSPI-Rmax as a natural alternative approach to using LSTD to maintain optimistic value estimates. We also include Sarsa with ε-greedy, with ε optimized over an extensive parameter sweep. Though ε-greedy is not a generally practical algorithm, particularly in larger worlds, we include it as a baseline. We do not include Sarsa with optimistic initialization, because even though it has been a common heuristic, it is not a general strategy for exploration. Optimistic initialization can converge to suboptimal solutions if the initial optimism fades too quickly [46]. Further, initialization only happens once, at the beginning of learning. If the world changes, then an agent relying on systematic exploration due to its initialization may not react, because it no longer explores.
For completeness in comparing to previous work that uses optimistic initialization, we include such results in Appendix G.

5.2 Environments

Sparse Mountain Car is a version of the classic mountain car problem of Sutton and Barto [40], differing only in the reward structure. The agent receives a reward of +1 at the goal and 0 otherwise, with a discounted, episodic γ of 0.998. The start state is sampled from the position range [−0.6, −0.4] with velocity zero. This domain is used to highlight how exploration techniques perform when the reward signal is sparse, and thus initializing the value function to zero is not optimistic.
Puddle World is a continuous-state 2-dimensional world with (x, y) ∈ [0, 1]² and 2 puddles: (1) from [0.45, 0.4] to [0.45, 0.8], and (2) from [0.1, 0.75] to [0.45, 0.75], each with radius 0.1. The goal is the region (x, y) ∈ ([0.95, 1.0], [0.95, 1.0]). The agent receives a reward of −1 − 400 · d on each time step, where d denotes the distance between the agent's position and the center of the puddle, with an undiscounted, episodic γ of 1.0. The agent can select an action to move 0.05 + ζ, with ζ ∼ N(μ = 0, σ² = 0.01).

Figure 1: A comparison of speed of learning in Sparse Mountain Car, Puddle World and River Swim. In plots (a) and (b) lower on the y-axis is better, whereas in (c) curves higher along the y-axis are better. Sparse Mountain Car and Puddle World are episodic problems with a fixed experience budget.
Thus the length of the lines in plots (a) and (b) indicates how many episodes each algorithm completed over 50,000 steps, and the height on the y-axis indicates the quality of the learned policy; lower indicates better performance. Note that RLSVI did not show significant learning after 50,000 steps. The RLSVI result in Puddle World uses a budget of 1 million steps.
The agent's initial state is uniformly sampled from (x, y) ∈ ([0.1, 0.3], [0.45, 0.65]). This domain highlights a common difficulty for traditional exploration methods: high-magnitude negative rewards, which often cause the agent to erroneously decrease its value estimates too quickly.
River Swim is a standard continuing exploration benchmark [42], inspired by a fish trying to swim upriver, with a high reward (+1) upstream, which is difficult to reach, and a lower but still positive reward (+0.005) downstream, which is easily reachable. We extended this domain to continuous states in [0, 1], with a stochastic displacement of 0.1 when taking an action up or down, with a low probability of success for up. The starting position is sampled uniformly in [0, 0.1], and γ = 0.99.

5.3 Experimental Setup

We investigate a learning regime where the agents are allowed a fixed budget of interaction steps with the environment, rather than a finite number of episodes of unlimited length. Our primary concern is early learning performance, so each experiment is restricted to 50,000 steps, with an episode cutoff (in Sparse Mountain Car and Puddle World) at 10,000 steps. In this regime, an agent that spends significant time exploring the world during the first episode may not be able to complete many episodes; the cutoff makes exploration easier given the strict budget on experience.
In contrast, in the more common framework of allowing a fixed number of episodes, an agent can consume many steps during the first few episodes exploring, which is difficult to detect in the final performance results. We average over 100 runs in River Swim and 200 runs for the other domains. For all the algorithms that utilize eligibility traces, we set the trace parameter λ to 0.9. For algorithms which use exponential averaging, the averaging parameter is set to 0.001, and the regularizer η is set to 0.0001. The parameters for UCLS are fixed. RLSVI's weights are recalculated using all experienced transitions at the beginning of an episode in Puddle World and Sparse Mountain Car, and every 5,000 steps in River Swim. The parameters of competitors, where necessary, are selected as the best from a large parameter sweep.
All the algorithms except DGPQ use the same representation: (1) Sparse Mountain Car: 8 tilings of 8x8, hashed to a memory space of 512; (2) River Swim: 4 tilings of granularity 32, hashed to a memory space of 128; and (3) Puddle World: 5 tilings of granularity 5x5, hashed to a memory space of 128. DGPQ uses its own kernel-based representation with normalized state information.

5.4 Results & Analysis

Our first experiment simply compares UCLS against the other control algorithms in all the domains. Figure 1 shows the early learning results across all three domains. In all three domains UCLS achieves the best final performance. In Sparse Mountain Car, UCLS learns faster than the other methods, while in River Swim DGPQ learns faster initially. UCBootstrap and UCLS learn at a similar rate in Puddle World, which is a cost-to-goal domain. UCBootstrap, and bootstrapping approaches generally, can suffer from insufficient optimism, as they rely on sufficiently optimistic or diverse initialization strategies [46, 30]. LSPI-Rmax and RLSVI do not perform well in any of the domains.
DGPQ does not perform as well as UCLS in Puddle World, and exhibits high variance compared with the other methods. In Puddle World, UCLS goes on to finish 1200 episodes in the allotted budget of steps, whereas in River Swim both UCLS and DGPQ get close to the optimal policy by the end of the experiment.

Figure 2: The effect of the confidence parameter p on the policy, in River Swim, using context-dependent variance (UCLS) and global variance (GV-UCB). The values for p are {10⁻⁵, [1, 2, . . . , 9] × 10⁻³, 10⁻², 10⁻¹}.

The DGPQ algorithm uses the maximum reward (Rmax) to initialize the Gaussian processes. In Sparse Mountain Car this effectively converts the problem back into the traditional −1 per-step formulation. In this traditional variant of Mountain Car, UCLS significantly outperforms DGPQ (Appendix G). Sarsa with ε-greedy learns well in Puddle World, because it is a cost-to-goal problem in which Sarsa by default uses optimistic initialization; these results are therefore reported in the Appendix.
Next we investigated the impact of the confidence level 1 − p on the performance of UCLS in River Swim. The confidence interval radius is proportional to √(1 + 1/p); smaller p should correspond to a higher rate of exploration. In Figure 2, smaller p resulted in a slower convergence rate, but all values eventually reached the optimal policy.
Finally, we investigate the benefit of using contextual variance estimates over global variance estimates within UCLS. In Figure 2, we also show the effect of various p values on the performance of the algorithm resulting from Corollary 1, which we call Global Variance-UCB (GV-UCB) (see Appendix E.1 for more details about this algorithm).
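To make the role of p concrete, the √(1 + 1/p) scaling of the confidence-interval radius discussed above can be computed directly for the values swept in Figure 2. This is a small illustrative calculation only; the bound's proportionality constant is omitted:

```python
import math

# Confidence-radius scaling sqrt(1 + 1/p) for the swept values of p;
# the bound's proportionality constant is omitted.
ps = [1e-5] + [k * 1e-3 for k in range(1, 10)] + [1e-2, 1e-1]
scale = {p: math.sqrt(1.0 + 1.0 / p) for p in ps}

# Smaller p (confidence level 1 - p closer to 1) widens the interval,
# giving a higher rate of exploration:
assert scale[1e-5] > scale[1e-3] > scale[1e-1]
```

For instance, the radius at p = 10⁻⁵ is nearly a hundred times that at p = 10⁻¹, consistent with the slower but still convergent behaviour observed for small p in Figure 2.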
For this range of p, UCLS still converges to the optimal policy, albeit at different rates. Using global variance estimates (GV-UCB), on the other hand, results in significant over-estimates of the variance, and consequently poor performance.

6 Conclusion and Discussion

This paper develops a sound upper-confidence bound on the value estimates for least-squares temporal difference learning (LSTD), without making i.i.d. assumptions about noise distributions. In particular, we allow for context-dependent noise, where variability could be due to noise in rewards, transition dynamics, or even limitations of the function approximator. We then introduce an algorithm, called UCLS, that estimates these upper-confidence bounds incrementally, for policy iteration. We demonstrate empirically that UCLS requires far fewer exploration steps to find high-quality policies compared to several baselines, across domains chosen to highlight different exploration difficulties.
The goal of this paper is to provide an incremental, model-free, data-efficient, directed exploration strategy. The upper-confidence bounds for action-values for fixed policies are among the few available under function approximation, and so are a step towards exploration with optimistic values in the general case. A next step is to theoretically show that using these upper bounds for exploration ensures stochastic optimism, and so convergence to optimal policies.
One promising aspect of UCLS is that it uses least-squares to efficiently summarize past experience, but is not tied to a specific state representation. Though we considered a fixed representation for UCLS, it is feasible that an analysis for the non-stationary case could be used for the setting where the representation is adapted over time. If the representation drifts slowly, then UCLS may be able to similarly track the upper-confidence bounds.
Recent work has shown that combining deep Q-learning with least-squares methods can result in significant performance gains over vanilla DQN [18]. We expect that combining deep networks and UCLS could result in even larger gains, and this is a natural direction for future work.

7 Acknowledgements

We would like to thank Bernardo Ávila Pires and Jian Qian for their helpful comments, along with Calcul Québec (www.calculquebec.ca) and Compute Canada (www.computecanada.ca) for the computing resources used in this work.

References

[1] Y. Abbasi-Yadkori and C. Szepesvari. Bayesian Optimal Control of Smoothly Parameterized Systems: The Lazy Posterior Sampling Algorithm. In Uncertainty in Artificial Intelligence, 2014.

[2] P. Auer and R. Ortner. Logarithmic Online Regret Bounds for Undiscounted Reinforcement Learning. In Advances in Neural Information Processing Systems, 2006.

[3] P. L. Bartlett and A. Tewari. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs. In Conference on Uncertainty in Artificial Intelligence, 2009.

[4] J. A. Boyan. Technical update: Least-squares temporal difference learning. Machine Learning, 49(2-3):233–246, 2002.

[5] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

[6] R. Brafman and M. Tennenholtz. R-max: a general polynomial time algorithm for near-optimal reinforcement learning. The Journal of Machine Learning Research, 2003.

[7] W. Chu, L. Li, L. Reyzin, and R. E. Schapire. Contextual Bandits with Linear Payoff Functions. In International Conference on Artificial Intelligence and Statistics, 2011.

[8] R. Grande, T. Walsh, and J. How. Sample Efficient Reinforcement Learning with Gaussian Processes. In International Conference on Machine Learning, 2014.

[9] T. Jaksch, R. Ortner, and P. Auer.
Near-optimal Regret Bounds for Reinforcement Learning. The Journal of Machine Learning Research, 2010.

[10] N. Jong and P. Stone. Model-based exploration in continuous state spaces. In Abstraction, Reformulation, and Approximation, 2007.

[11] T. Jung and P. Stone. Gaussian processes for sample efficient reinforcement learning with RMAX-like exploration. In Machine Learning: ECML PKDD, 2010.

[12] L. P. Kaelbling. Learning in embedded systems. MIT Press, 1993.

[13] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 1996.

[14] S. Kakade, M. Kearns, and J. Langford. Exploration in metric state spaces. In International Conference on Machine Learning, 2003.

[15] K. Kawaguchi. Bounded Optimal Exploration in MDP. In AAAI Conference on Artificial Intelligence, 2016.

[16] M. J. Kearns and S. P. Singh. Near-Optimal Reinforcement Learning in Polynomial Time. Machine Learning, 2002.

[17] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. The Journal of Machine Learning Research, 2003.

[18] N. Levine, T. Zahavy, D. J. Mankowitz, A. Tamar, and S. Mannor. Shallow updates for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3138–3148, 2017.

[19] L. Li, M. Littman, and C. Mansley. Online exploration in least-squares policy iteration. In International Conference on Autonomous Agents and Multiagent Systems, 2009.

[20] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In World Wide Web Conference, 2010.

[21] J. Martin, S. N. Sasikumar, T. Everitt, and M. Hutter. Count-Based Exploration in Feature Space for Reinforcement Learning. In International Joint Conference on Artificial Intelligence, 2017.

[22] N. Meuleau and P. Bourgine.
Exploration of Multi-State Environments: Local Measures and Back-Propagation of Uncertainty. Machine Learning, 1999.

[23] C. D. Meyer, Jr. Generalized inversion of modified matrices. SIAM Journal on Applied Mathematics, 24(3):315–323, 1973.

[24] K. S. Miller. On the inverse of the sum of matrices. Mathematics Magazine, 54(2):67–72, 1981.

[25] T. M. Moerland, J. Broekens, and C. M. Jonker. Efficient exploration with Double Uncertain Value Networks. In Advances in Neural Information Processing Systems, 2017.

[26] A. Nouri and M. L. Littman. Multi-resolution Exploration in Continuous Spaces. In Advances in Neural Information Processing Systems, 2009.

[27] R. Ortner and D. Ryabko. Online Regret Bounds for Undiscounted Continuous Reinforcement Learning. In Advances in Neural Information Processing Systems, 2012.

[28] I. Osband and B. Van Roy. Why is Posterior Sampling Better than Optimism for Reinforcement Learning? In International Conference on Machine Learning, 2017.

[29] I. Osband, D. Russo, and B. Van Roy. (More) Efficient Reinforcement Learning via Posterior Sampling. In Advances in Neural Information Processing Systems, 2013.

[30] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems, 2016.

[31] I. Osband, B. Van Roy, and Z. Wen. Generalization and Exploration via Randomized Value Functions. In International Conference on Machine Learning, 2016.

[32] G. Ostrovski, M. G. Bellemare, A. van den Oord, and R. Munos. Count-Based Exploration with Neural Density Models. In International Conference on Machine Learning, 2017.

[33] J. Pazis and R. Parr. PAC optimal exploration in continuous space Markov decision processes. In AAAI Conference on Artificial Intelligence, 2013.

[34] M. Plappert, R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M.
Andrychowicz. Parameter Space Noise for Exploration. arXiv.org, 2017.

[35] S. P. Singh, T. S. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence Results for Single-Step On-Policy Reinforcement-Learning Algorithms. Machine Learning, 2000.

[36] A. Strehl and M. Littman. Exploration via model based interval estimation. In International Conference on Machine Learning, 2004.

[37] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning, 2006.

[38] R. Sutton, C. Szepesvári, A. Geramifard, and M. Bowling. Dyna-style planning with linear function approximation and prioritized sweeping. In Conference on Uncertainty in Artificial Intelligence, 2008.

[39] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 1988.

[40] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[41] C. Szepesvari. Algorithms for Reinforcement Learning. Morgan & Claypool Publishers, 2010.

[42] I. Szita and A. Lorincz. The many faces of optimism. In International Conference on Machine Learning, 2008.

[43] I. Szita and C. Szepesvari. Model-based reinforcement learning with nearly tight exploration complexity bounds. In International Conference on Machine Learning, 2010.

[44] H. van Seijen and R. Sutton. A deeper look at planning as learning from replay. In International Conference on Machine Learning, 2015.

[45] M. White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, 2017.

[46] M. White and A. White. Interval estimation for reinforcement-learning algorithms in continuous-state domains. In Advances in Neural Information Processing Systems, 2010.

[47] M. A. Wiering and J. Schmidhuber. Efficient Model-Based Exploration.
In Simulation of Adaptive Behavior: From Animals to Animats, 1998.