{"title": "Adaptive Temporal-Difference Learning for Policy Evaluation with Per-State Uncertainty Estimates", "book": "Advances in Neural Information Processing Systems", "page_first": 11895, "page_last": 11905, "abstract": "We consider the core reinforcement-learning problem of on-policy value function approximation from a batch of trajectory data, and focus on various issues of Temporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.", "full_text": "Adaptive Temporal-Difference Learning for Policy\nEvaluation with Per-State Uncertainty Estimates\n\nHugo Penedones \u21e4\n\nDeepMind\n\nCarlos Riquelme \u21e4\n\nGoogle Brain\n\nDamien Vincent\n\nGoogle Brain\n\nHartmut Maennel\n\nGoogle Brain\n\nTimothy Mann\n\nDeepMind\n\nAndr\u00e9 Barreto\n\nDeepMind\n\nSylvain Gelly\nGoogle Brain\n\nGergely Neu\n\nUniversitat Pompeu Fabra\n\nAbstract\n\nWe consider the core reinforcement-learning problem of on-policy value function\napproximation from a batch of trajectory data, and focus on various issues of Tem-\nporal Difference (TD) learning and Monte Carlo (MC) policy evaluation. 
The two methods are known to achieve complementary bias-variance trade-off properties, with TD tending to achieve lower variance but potentially higher bias. In this paper, we argue that the larger bias of TD can be a result of the amplification of local approximation errors. We address this by proposing an algorithm that adaptively switches between TD and MC in each state, thus mitigating the propagation of errors. Our method is based on learned confidence intervals that detect biases of TD estimates. We demonstrate in a variety of policy evaluation tasks that this simple adaptive algorithm performs competitively with the best approach in hindsight, suggesting that learned confidence intervals are a powerful technique for adapting policy evaluation to use TD or MC returns in a data-driven way.

1 Introduction

In reinforcement learning (RL) an agent must learn how to behave while interacting with an environment. This challenging problem is usually formalized as the search for a decision policy (i.e., a mapping from states to actions) that maximizes the amount of reward received in the long run [24]. Clearly, in order to carry out such a search we must be able to assess the quality of a given policy. This process, known as policy evaluation, is the focus of the current paper.
A common way to evaluate a policy is to resort to the concept of value function. Simply put, the value function of a policy associates with each state the expected sum of rewards, possibly discounted over time, that an agent following the policy from that state onwards would obtain. Thus, in this context the policy evaluation problem comes down to computing a policy's value function.
Perhaps the simplest way to estimate the value of a policy in a given state is to use Monte Carlo (MC) returns: the policy is executed multiple times from the state of interest and the resulting outcomes are averaged [23]. 
Despite their apparent naivety, MC estimates enjoy some nice properties and have been advocated as an effective solution to the policy evaluation problem [1]. Another way to address the policy evaluation problem is to resort to temporal-difference (TD) learning [22]. TD is based on the insight that the value of a state can be recursively defined based on other states' values [4]. Roughly speaking, this means that, when estimating the value of a state, instead of using an entire trajectory one uses the immediate reward plus the value of the next state. This idea of updating an estimate from another estimate allows the agent to learn online and incrementally.

*These two authors contributed equally. Correspondence to rikel@google.com

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Both MC and TD have advantages and disadvantages. From a statistical point of view, the estimates provided by MC are unbiased but may have high variance, while TD estimates show the opposite properties [24]. As a consequence, the relative performance of the two methods depends on the amount of data available: while TD tends to give better estimates in small data regimes, MC often performs better with a large amount of data. Since the amount of data that leads to MC outperforming TD varies from problem to problem, it is difficult to make an informed decision on which method to use in advance. It is also unlikely that the best choice will be the same for all states, not only because the number of samples associated with each state may vary but also because the characteristics of the value function itself may change across the state space.
Ideally, we would have a method that adjusts the balance between bias and variance per state based on the progress of the learning process. 
In this paper we propose an algorithm that accomplishes that by dynamically choosing between TD and MC before each value-function update. Adaptive TD is based on a simple idea: if we have confidence intervals associated with states' values, we can decide whether or not to apply TD updates by checking if the resulting targets fall within these intervals. If the targets are outside of the confidence intervals we assume that the bias in the TD update is too high, and just apply an MC update instead.
Although this idea certainly allows for many possible instantiations, in this work we focus on simple design choices. Our experimental results cover a wide range of scenarios, from toy problems to Atari games, and they highlight the robustness of the method, whose performance is competitive with the best of both worlds in most cases. We hope this work opens the door to further developments in policy evaluation with function approximation in complex environments.

2 The Problem

This section formally introduces the problem of policy evaluation in Markov decision processes, as well as the two most basic approaches for tackling this fundamental problem: Monte-Carlo and Temporal-Difference learning. After the main definitions, we discuss the key advantages and disadvantages of these methods, which will enable us to state the main goals of our work.

2.1 Policy Evaluation in Reinforcement Learning

Let M = ⟨S, A, P, r, γ, µ0⟩ denote a Markov decision process (MDP) where S is the set of states, A is the set of actions, and P is the transition function so that, for all s, s′ ∈ S and a ∈ A, P(s′|s, a) denotes the probability of transitioning to s′ from state s after taking action a. Also, r : S × A → R maps each (state, action) pair to its expected reward, γ ∈ (0, 1] is the discount factor, and µ0 is the probability distribution over initial states.
Let π : S → D(A) be a policy, where D(·) is the set of distributions over its argument set. Assume at each state s we sample a ∼ π(s). The value function of M at s ∈ S under policy π is defined by

    v_π(s) = E[ Σ_{t=0}^∞ γ^t r(S_t, π(S_t)) | S_0 = s ],    (1)

where S_{t+1} ∼ P(·|S_t, A_t) and A_t ∼ π(S_t). We will drop the dependence of r on a for simplicity. Our goal is to recover the value function v_π from samples collected by running policy π in the MDP. As π is fixed, we will simply use the notation v = v_π below. We consider trajectories collected on M by applying π: τ = ⟨(s_0, a_0, r_0), (s_1, a_1, r_1), …⟩. Given a collection of n such trajectories D_n = {τ_i}_{i=1}^n, a policy evaluation algorithm outputs a function V̂ : S → R. We are interested in designing algorithms that minimize the Mean Squared Value Error of V̂, defined as

    MSVE(V̂) = E_{S_0 ∼ µ0}[ (v(S_0) − V̂(S_0))² ].    (2)

More specifically, we will search for an appropriate value estimate V̂ within a fixed hypothesis set of functions H = {h : S → R}, attempting to find an element with error comparable to min_{h∈H} MSVE(h). We mainly consider the set H of neural networks with a fixed architecture.
The key challenge posed by the policy evaluation problem is that the regression target v(s) in (2) is not directly observable. The algorithms we consider deal with this challenge by computing an appropriate regression target T(s) and, instead, attempt to minimize (T(s) − V̂(s))² as a function of V̂, usually via stochastic optimization.

Figure 1: Left. Bootstrapping approximation errors: a schematic map with states s1, s2, s3, s4 and a wall separating a high-reward region from a low-reward region. Center and Right. MC (center) versus TD (right) on a simple environment with 2 rooms completely separated by a wall (see Map 2 in Figure 11). For each state s, the heatmaps show V̂(s) − V(s). The true value on the upper half of the plane is zero. MC overestimates the values of a narrow region right above the wall, due to function approximation limitations. With TD, these unavoidable approximation errors also occur, but things get worse when bootstrap updates propagate them to much larger regions (see right).

2.2 Monte Carlo for Policy Evaluation

The Monte-Carlo approach is based on the intuitive observation that the infinite discounted sum of rewards realized by running the policy from a state s is an unbiased estimator of v(s). This suggests that a reasonably good regression target can be constructed for all s_t ∈ τ_i as

    T_MC(s_t^(i)) := Σ_{k=0}^{n_i − t − 1} γ^k r(s_{t+k}^(i)),    (3)

where n_i is the length of trajectory τ_i. Thus, one viable approach for policy evaluation is to compute the minimizer within h ∈ H of Σ_{i=1}^n Σ_{t=1}^{n_i} (T_MC(s_t^(i)) − h(s_t^(i)))². Risking some minor inconsistency² with the literature, we refer to this method as Monte Carlo policy evaluation.

2.3 Temporal-Difference for Policy Evaluation

Temporal-Difference algorithms are based on the fact that the value function should satisfy the Bellman equations: v(s) = r(s) + γ Σ_{s′} P(s′|s, a) v(s′) for all s, which suggests that a good estimate of the value function should minimize the squared error between the two sides of the above equation. In our framework, this can be formulated as using the regression target

    T_TD(0)(s_t^(i)) := r(s_t^(i)) + γ V̂(s_{t+1}^(i))    (4)

to replace v(s) in the objective of Equation (2). The practice of using the estimate V̂ as part of the target is commonly referred to as "bootstrapping" within the RL literature. Again with a slight abuse of common terminology³, we will refer to this algorithm as TD(0), or just TD.
TD and Monte Carlo provide different target functions and, depending on the problem instance, each of them may offer some benefits. In the tabular case, it is easy to see that Monte Carlo converges to the optimal solution with an infinite amount of data, since the targets concentrate around their true mean, the value function. However, the Monte Carlo targets can also suffer from large variance due to the excessive randomness of the cumulative rewards. On the other hand, TD can be shown to converge to the true value function in the same setting too [22], with the additional potential to converge faster due to the potential variance reduction in the updates. Indeed, when considering a fixed value function, the only randomness in the TD target is due to the immediate next state, whereas the MC target is impacted by the randomness of the entire trajectory. Thus, the advantage of TD is more pronounced in low data regimes, or when the return distribution from a state has large variance.

²[23, 24] exclusively refer to the tabular version of the above method as Monte Carlo; this method is a natural generalization to general value-function classes.
³This algorithm would be more appropriately called "least squares TD" or LSTD, following [6], with the understanding that our method considers general (rather than linear) value-function classes.

The story may be different with function approximation: even in the limit of infinite data, both MC and TD are going to lead to biased estimates of the value function, due to the approximation error introduced by the function class H. 
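To make the two regression targets concrete, here is a minimal Python sketch of Equations (3) and (4) for a single finite trajectory; the function names `mc_targets` and `td0_targets` are ours, not the paper's, and the value estimate is represented simply by the values already predicted at the next states.

```python
def mc_targets(rewards, gamma):
    """Monte Carlo regression targets (Eq. 3): the discounted return
    observed from each step of one trajectory, computed backwards."""
    targets = [0.0] * len(rewards)
    future = 0.0
    for t in reversed(range(len(rewards))):
        future = rewards[t] + gamma * future  # G_t = r_t + gamma * G_{t+1}
        targets[t] = future
    return targets

def td0_targets(rewards, next_state_values, gamma):
    """TD(0) regression targets (Eq. 4): immediate reward plus the
    discounted current value estimate at the next state (bootstrapping)."""
    return [r + gamma * v for r, v in zip(rewards, next_state_values)]
```

For example, with rewards [1, 0, 2] and γ = 0.5, `mc_targets` returns the discounted returns [1.5, 1.0, 2.0], while the TD targets depend on the current (possibly biased) value estimates, which is exactly the source of the error propagation discussed below.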
In the case of linear function classes, the biases of the two algorithms are well understood; the errors in estimating the value functions can be upper-bounded in terms of the distance between the value function and the span of the features, with MC enjoying tighter upper bounds than TD [26, 18]. These results, however, do not provide a full characterization of the errors: even when considering linear function approximation, there are several known examples in the literature where TD provably outperforms MC and vice versa [18]. Thus, the winner between the two algorithms will generally be determined by the particular problem instance we are tackling.
To see the intuitive difference between the behavior of the two algorithms, consider a situation where the true underlying value function V has some sharp discontinuities that the class of functions at hand is not flexible enough to capture. In these cases, both methods struggle to fit the value function in some regions of the state space, even when we have lots of data. The errors, however, behave differently for the two methods: while the errors of Monte Carlo are localized to the regions with discontinuities due to directly fitting the data, TD bootstraps values from this problematic region, and thus propagates the errors even further. We refer to such errors arising due to discontinuities as leakage.
To illustrate the leakage phenomenon, consider the toy example in the left side of Figure 1. There is a wall separating the green region (high reward) and the red region (low reward). Assume that a function approximator f will need to make a compromise to fit both s1 and s3, as they are close in the state space, even though no trajectory goes through both of them due to the wall. For example, we can assume f will over-estimate the true value of s3. Let us now examine a third state, s2, that is in the red region, and such that there is a trajectory that goes from s2 to s3. The distance in state space between s2 and s3 could be, in principle, arbitrarily large. If we apply Monte Carlo updates, then the impact of the s1 → s3 leakage on s2 will be minimal. Instead, the TD update for the value of s2 will explicitly depend on the estimate for s3, which is overestimated due to function approximation near a wall. In the center and right side of Figure 1, we show the actual outcome of running TD and MC in such a setting. As expected, TD dramatically propagates estimation errors far into the low-reward region, while MC is much more robust and its errors stay located very close to the wall. In a smoother scenario, however, bootstrapping function approximation estimates can still be beneficial and speed up learning. We present and analyze a toy MDP in the appendix, Section A.
The key observation of our work is that different regions of the state space may be best suited for either TD or MC updates, amending the existing folk wisdom that TD and MC may "globally" outperform each other in different MDPs. Accordingly, we set our goal as designing a method that adaptively chooses a target depending on the properties of the value function around the specific state.

3 The Adaptive-TD Algorithm

In the previous sections, we saw that whether MC or TD is a better choice heavily depends on the specific scenario we are tackling, and on the family of functions we choose to fit the value function. While in hindsight we may be able to declare a winner, in practice the algorithmic decision needs to be made at the beginning of the process. Also, running both and picking the best-performing one at training time can be challenging, as the compound variance of the return distribution over long trajectories may require extremely large validation sets to test the methods and choose, while this then limits the amount of available training data. 
We aim to design a robust algorithm which does not require any knowledge of the environment, and that dynamically adapts to both the geometry of the true value and transition functions, and to the function approximator it has at its disposal.
In this section, we propose an algorithmic approach that aims to achieve the best of both the MC and TD worlds. The core idea is to limit the bootstrapping power of TD to respect some hard limits imposed by an MC confidence interval. The main driver of the algorithm is TD, since it can significantly speed up learning in the absence of obstacles. However, in regions of the state space where we somehow suspect that our estimates may be poor (e.g., near walls, close to big rewards, non-Markovianity, partial observability, or after irrecoverable actions) we would like to be more conservative, and rely mostly on targets purely based on data.

Input: Confidence level α ∈ (0, 1). Trajectories τ_1, …, τ_n generated by policy π.
Let S be the set of visited states in τ_1, …, τ_n. Initialize V̂(s) = 0, for all s.
Compute the Monte-Carlo returns dataset as in (3): D_MC = {(s, T_MC(s)) : s ∈ S}.
Fit a confidence function to D_MC: CI^α_MC(s) := (L^α_MC(s), U^α_MC(s)).
repeat
    for i = 1 to n do
        for t = 1 to |τ_i| − 1 do
            s_t^(i) is the t-th state of τ_i.
            T_TD(0) ← r(s_t^(i)) + γ V̂(s_{t+1}^(i))
            if T_TD(0) ∈ (L^α_MC(s_t^(i)), U^α_MC(s_t^(i))) then
                T_{i,t} ← T_TD(0)
            else
                T_{i,t} ← (L_MC(s_t^(i)) + U_MC(s_t^(i)))/2
            end if
            Use target T_{i,t} to fit V̂(s_t^(i)).
        end for
    end for
until epochs exceeded

Algorithm 1: Adaptive TD

We explicitly control how conservative we would like to be by tuning the confidence level α of the intervals: by letting α go to 1, the intervals become trivially wide, and we recover TD. 
Similarly, if we let α go to 0, we end up with MC.
It is easy to design toy environments where none of the obstacles described above apply, and where TD is the optimal choice (see Map 0 in Figure 11). As a consequence of the switching and testing cost, it is not reasonable to expect our algorithm to dominate both MC and TD in all environments. Our goal is to design an algorithm that is not worse than the worst of MC and TD in any scenario, and is close to the best one in most cases, or actually better. When data goes to infinity and the true value function falls in our family of models, the MC intervals will converge to the true values and Adaptive TD will be forced to converge too. However, these are asymptotic results.

3.1 Formal Description

Adaptive TD is presented as Algorithm 1. It has two components: confidence computation, and value function fitting. First, we compute the MC target dataset D_MC = {(s, T_MC(s)) : s ∈ S} for all states s in any input episode (we refer to S as the union of those). Then, we need to solve the regression problem with a method that provides confidence intervals: with probability α the expected return from s under π is in (L^α_MC(s), U^α_MC(s)). There are a variety of approximate methods we can use to compute such confidence bounds; we discuss this in detail in the next subsection. Note, however, that this can be regarded as an additional input or hyper-parameter to Adaptive TD.
In the second stage, after fixing a confidence function CI^α_MC(s), we apply a constrained version of TD. We loop over all states s ∈ S (possibly in a randomized way, as the main loop over data can be replaced by a stochastic mini-batch), and we compute their TD target T_TD(0)(s). If the TD target falls within the confidence interval CI^α_MC(s), we simply use it to update V̂(s). If it does not, i.e. when T_TD(0)(s) ∉ (L^α_MC(s), U^α_MC(s)), then we no longer trust the TD target, and use the mid-point of the MC interval, (L + U)/2. 
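The per-state target selection of Algorithm 1 can be sketched in Python as follows; `v_hat`, `ci_mc`, and `fit_step` are hypothetical callables standing in for the value network, the fitted confidence function, and one regression step, and trajectories are simplified to (state, reward) pairs.

```python
def adaptive_td_epoch(trajectories, v_hat, ci_mc, gamma, fit_step):
    """One epoch of Adaptive TD (sketch of Algorithm 1).

    trajectories: list of episodes, each a list of (state, reward) pairs.
    v_hat:        current value estimate, a callable state -> float.
    ci_mc:        confidence function state -> (lower, upper), fit on MC returns.
    fit_step:     one regression step, called as fit_step(state, target).
    """
    for traj in trajectories:
        for t in range(len(traj) - 1):
            s, r = traj[t]
            s_next, _ = traj[t + 1]
            td_target = r + gamma * v_hat(s_next)   # bootstrapped TD(0) target
            lower, upper = ci_mc(s)
            if lower <= td_target <= upper:
                target = td_target                  # TD target looks trustworthy
            else:
                target = 0.5 * (lower + upper)      # fall back to MC mid-point
            fit_step(s, target)
```

The only algorithm-specific logic is the interval check: everything else is a standard fitted-value iteration loop, which is what makes the method easy to bolt onto an existing TD implementation.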
We can also use the closest endpoint to T_TD(0)(s), either L or U.

Figure 2: Average normalized MSVE for Lab2D and Atari environments in each data regime. For each number of train rollouts and scenario, we normalize the MSVE of each algorithm A by (MSVE(A) − min_{A′} MSVE(A′))/(max_{A′} MSVE(A′) − min_{A′} MSVE(A′)). Equivalently, the worst algorithm is assigned relative MSVE 1, and the best one is assigned relative MSVE 0. Then, for each number of rollouts, we take the average across scenarios (i.e., all the 10 environments are worth the same). This allows for a reasonably fair comparison of performance in different domains.

3.2 Uncertainty Estimates

Poor quality uncertainty estimates may severely affect the performance of Adaptive TD. In particular, in those states where the ground truth is not captured by the MC confidence intervals, the TD target will be forced to bootstrap wrong values without any possibility of recovery. In the case of neural networks, uncertainty estimation has been an active area of research in the last few years. In some decision-making tasks, like exploration, all we need are samples from the output distribution, while actual intervals are required for Adaptive TD. A simple fix for models that provide samples is to take a number of them, and then construct an approximate interval.
Common approaches include variational inference [5], dropout [9], Monte Carlo methods [27, 12], bootstrapped estimates [8, 15], Gaussian Processes [16], or Bayesian linear regression on learned features [19, 17, 2]. While all the methods above could be used in combination with Adaptive TD, for simplicity, we decided to use an ensemble of m MC networks [11]. The algorithm works as follows. We fit m networks on the dataset D_MC = {(s, T_MC(s)) : s ∈ S} (we may or may not bootstrap the data at the episode level). 
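One simple way to turn the m ensemble estimates at a state into an interval, anticipating the Student-t construction made precise next, is sketched below; the function name and the small table of hard-coded critical values are illustrative (in practice something like scipy.stats.t.ppf((1 + alpha) / 2, m - 1) would supply them).

```python
import math

# Two-sided 95% Student-t critical values for small ensembles (keyed by
# degrees of freedom, df = m - 1); illustrative subset, rounded to 3 decimals.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}

def mc_predictive_interval(values, t_crit):
    """Predictive interval for a fresh draw from the same distribution as the
    m ensemble estimates v_1, ..., v_m, assuming they are i.i.d. Gaussian."""
    m = len(values)
    mean = sum(values) / m
    var = sum((v - mean) ** 2 for v in values) / (m - 1)  # unbiased variance
    half = t_crit * math.sqrt(var) * math.sqrt(1.0 + 1.0 / m)
    return mean - half, mean + half
```

Note the √(1 + 1/m) factor: this is a predictive interval for an (m+1)-th sample, not a confidence interval for the mean, which is what lets it be compared against a single TD target.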
Given a new state s, the networks provide value estimates v_1, …, v_m at s. We then compute a predictive confidence interval, under the assumption that the v_i for i = 1, …, m are i.i.d. samples from some distribution F. Now, if v_{m+1} were sampled from the same distribution, then we could expect v_{m+1} to fall in the predictive interval with probability α. The idea is that the TD estimate should approximately correspond to another sample from the MC distribution. If the deviation is too large, we will rely on the MC estimates instead.
In particular, we do assume F is Gaussian: v_1, …, v_m ∼ N(µ, σ²) for unknown µ, σ². Let us define v̄ = Σ_i v_i/m, and σ̂_m² = Σ_i (v_i − v̄)²/(m − 1). Finally, if the assumptions hold, we expect that

    v̄ − z_α σ̂_m √(1 + 1/m) ≤ v_{m+1} ≤ v̄ + z_α σ̂_m √(1 + 1/m)    (5)

with probability α, where z_α is the two-sided α-level critical value (the 100(1 + α)/2 percentile) of Student's t-distribution with m − 1 degrees of freedom. Then, we set L^α_MC(s) and U^α_MC(s) to the left and right-hand sides of (5) (note each v_i depends on s). Of course, in practice the assumptions may not hold (for example, v_i and v_j for i ≠ j will not be independent unless we condition on the data), but we still hope to get a reasonable estimate.

4 Experimental Results

In this section we test the performance of Adaptive TD in a number of scenarios that we describe below. The scenarios (for which we fix a specific policy) capture a diverse set of aspects that are relevant to policy evaluation: low and high-dimensional state spaces, sharp value jumps or smoother epsilon-greedy behaviors, near-optimal and uniformly random policies. We present here the results for Labyrinth-2D and Atari environments, and Mountain Car is presented in the appendix, Section C.
We compare Adaptive TD with a few baselines: a single MC network, raw TD, and TD(λ). 
TD(λ) is a temporal-difference algorithm which computes an average of all n-step TD returns (an extension of the 1-step target in (4)) [23]. For a clean comparison across algorithms in each scenario, we normalize the MSVE of all algorithms (y-axis) by the worst performing one, and we do this independently for each number of data rollouts (x-axis). In addition, the appendix contains the absolute values with empirical confidence intervals for all cases. Our implementation details are presented in Section B of the appendix. In general, we did not make any effort to optimize hyper-parameters, as the goal was to come up with an algorithm that is robust and easy to tune across different scenarios. Accordingly, for Adaptive TD, we use an ensemble of 3 networks trained with the MC target, and confidence intervals at the 95% level. The data for each network in the ensemble is bootstrapped at the rollout level (i.e., we randomly pick rollouts with replacement). Plots also show the performance of the MC ensemble with 3 networks, to illustrate the benefits of Adaptive TD compared to its auxiliary networks.
Labyrinth-2D. We first evaluate the performance of the algorithms in a toy scenario which represents a 2-d map with some target regions we would like to reach. The state s = (x, y) are the coordinates in the map, and the policy takes a uniformly random angle and then applies a fixed-size step. The initial state s_0 for each episode is selected uniformly at random inside the map, and the episode ends after each step with probability p = 0.0005. Reward is r = 30 inside the green regions, r = 0 elsewhere. The map layouts and their value functions are shown in Figure 11 in the appendix. 
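A rollout in this environment could be simulated roughly as follows; this is a simplified sketch, not the authors' implementation: it ignores walls and map boundaries, and the start state and the `in_green` reward-region predicate are left as hypothetical parameters (the paper samples the start uniformly inside the map).

```python
import math
import random

def labyrinth_rollout(in_green, step_size=0.1, p_end=0.0005, reward=30.0,
                      start=(0.5, 0.5), rng=random):
    """One Labyrinth-2D episode (sketch): the policy picks a uniformly random
    angle and moves one fixed-size step; after each step the episode terminates
    with probability p_end. Returns a list of ((x, y), reward) pairs."""
    x, y = start
    trajectory = []
    while True:
        r = reward if in_green(x, y) else 0.0
        trajectory.append(((x, y), r))
        if rng.random() < p_end:          # geometric episode termination
            return trajectory
        angle = rng.uniform(0.0, 2.0 * math.pi)
        x += step_size * math.cos(angle)  # fixed-size step in a random direction
        y += step_size * math.sin(angle)
```

With p = 0.0005 the episode length is geometric with mean 1/p = 2000 steps, which is what makes the MC returns from a single state so high-variance in this domain.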
The simple but varied layouts cover a number of challenging features for policy evaluation: sharp jumps in value near targets, several kinds of walls, and locked areas with no reward (see maps 2 and 3). Due to the randomized policy and initial state, we tend to uniformly cover the state space. We run experiments with n = 5, 10, 20, 50, 75, 100 training episodes. We approximate the ground truth in a grid by sampling and averaging a large number of test episodes from each state.
The results are shown in Figure 3. As expected, in most of the maps we observe that MC outperforms TD in high-data regimes, while MC consistently suffers when the number of available data rollouts is limited. Adaptive TD shows a remarkably robust performance in all cases, being able to strongly benefit from TD steps in the low-data regime (see, basically, all maps) while remaining very close to MC's performance when a large number of rollouts are available. In that regime, the improvement with respect to TD is dramatic in maps that are prone to strong leakage effects, like maps 1, 2, 3, and 5. In Figure 14 in the appendix, we can also see the results for TD(λ). In this particular case, it seems λ = 0.75 is a good choice, and it is competitive with Adaptive TD in the challenging maps 1, 2, 3, and 5. However, the remaining values of λ are mostly outperformed in these scenarios. Figure 17 shows the regions of the maps' state space where the TD target falls outside the MC interval for Adaptive TD.
Atari. The previous examples illustrate many of the practical issues that arise in policy evaluation. In order to model those issues in a clean, disentangled way, and provide some intuition, we focused so far on lower-dimensional state spaces. In this section we evaluate all the methods in a few Atari environments [3]: namely, Breakout, Space Invaders, Pong, and MsPacman. 
The state consists of four stacked frames, each with (84, 84) pixels, and the initial one is fixed for each game. We would like to focus on competitive policies for the games, while still offering some stochasticity to create a diverse set of trajectories (as Atari environments are deterministic). We use soft Q policies that sample from the action distribution to generate the training and test trajectories. The temperature of the softmax layer was adjusted to keep a good trade-off between trajectory diversity and performance of the policy. Directly computing the ground-truth value is not feasible this time, so we rely on a large number of test rollouts to evaluate our predictions. This increases the variance of our MSVE results.
The results are shown in Figure 4. TD does a good job for all numbers of rollouts. This suggests that in high-dimensional state spaces (like Atari frames) the required number of samples for MC to dominate may be extremely large. In addition, a single MC network seems to struggle in all games, while the prediction of the MC ensemble proves significantly more robust. Adaptive TD outperforms MC and its auxiliary MC ensemble. Moreover, it offers a performance close to that of TD, possibly due to wide MC confidence intervals in high-dimensional states, which reduce Adaptive TD to simply TD in most of the states. We show the results for TD(λ) in Figure 18 in the appendix. In this case, λ = 0.75 (which did a good job in the Labyrinth-2D scenarios) is always the worst. In particular, the improvements of Adaptive TD compared to TD(λ = 0.75) range from 30% in the low-data regimes of Pong, to consistent 20% improvements across all data regimes of Space Invaders.
Summary. Figure 2 displays the overall results normalized and averaged over the 10 scenarios. Adaptive TD strongly outperforms TD and MC, and offers significant benefits with respect to its auxiliary ensemble and TD(λ). 
This highlights the main feature of Adaptive TD: its robustness. While TD and MC outperform each other, often by a huge margin, depending on the scenario and data size, Adaptive TD tends to automatically mimic the behavior of the best-performing one. TD(λ) methods offer a way to interpolate between TD and MC, but they require knowing a good value of λ in advance, and we have seen that this value can significantly change across problems. In most cases, Adaptive TD was able to perform (at least) comparably to TD(λ) for the best problem-dependent λ.

Figure 3: Labyrinth-2D. For each number of train rollouts, we normalize the MSVE of each algorithm A by MSVE(A)/max_{A′} MSVE(A′). Absolute numbers and confidence intervals are in the appendix.

Figure 4: Atari. For each number of train rollouts, we normalize the MSVE of each algorithm A by MSVE(A)/max_{A′} MSVE(A′). Absolute numbers and confidence intervals are in the appendix.

5 Related Work

Both n-step TD and TD(λ) offer practical mechanisms to control the balance between bias and variance by tuning n and λ. However, these parameters are usually set a priori, without taking into account the progress of the learning process, and are not state-dependent.
A number of works have addressed on-policy evaluation. In [13] the authors introduce an algorithm for batch on-policy evaluation, which is capable of selecting the best λ parameter for LSTD by doing efficient cross-validation. However, it only works for the case of linear function approximation, and does not take per-state decisions. In the TD-BMA algorithm [7], decisions are taken by state, as in ours, while TD-BMA is restricted to the tabular setting. We cover function approximation, including deep neural networks, and large-dimensional input spaces, where uncertainty estimates require different techniques.
A per-state λ was used for the off-policy case in [20]. 
(For an overview of methods with fixed λ, including the off-policy case, see [10].) This line of work is continued in [21] with methods that selectively (de-)emphasize states. The motivation is that function approximation has to produce a compromise function (e.g., forcing similar values at nearby states even when the observed values are not similar), and this compromise should be guided by emphasizing the more important states. In off-policy evaluation (the focus of their paper), there is more interest in states that occur more frequently under the actual policy. Similarly, when we switch to MC for some states in our approach, we may be less interested in modeling their value correctly in the TD function approximation. That paper also hints that λ(s) may be modified depending on the variance of the returns after s; this idea is then developed in [28].
The algorithm of [28] estimates the variance of the λ-returns arising from the data by establishing a Bellman operator for the squared return, for which they seek a fixed point. Then they "greedily" optimize λ(s_t) to obtain the optimal bias/variance trade-off for that state. However, their variance of the returns is restricted to the uncertainty coming from the actions and returns, and does not take into account the model uncertainty arising from the function approximation (which we include here by evaluating an ensemble of networks). A different approach to computing an optimized, state-dependent combination of n-step returns was introduced in [25]. They ask what the optimal combination of n-step returns would be if the estimates were unbiased and their variances and covariances were known. This differs from our approach, as we also try to minimize the bias that is introduced by function approximation and amplified by TD's bootstrapping.

6 Future Work

There are a number of avenues for future work.
Adaptive TD can be easily extended to n-step TD or TD(λ): at each transition, the agent checks whether the potential n-step or TD(λ) target is within the associated confidence interval. If so, the agent applies the TD update; otherwise, it proceeds to the next transition, carries over the unused target to compute the next TD target, and repeats the process. In addition, the proposed policy evaluation algorithm can be implicitly or explicitly incorporated into a policy improvement one for control. Finally, we expect that constructing more sophisticated and accurate confidence intervals based on MC returns will improve the performance of Adaptive TD.

Acknowledgments

The authors thank Matthieu Geist for his comments and suggestions, and the anonymous reviewers for their valuable feedback. G. Neu was supported by "La Caixa" Banking Foundation through the Junior Leader Postdoctoral Fellowship Programme and a Google Faculty Research Award.

References

[1] Artemij Amiranashvili, Alexey Dosovitskiy, Vladlen Koltun, and Thomas Brox. Analyzing the role of temporal differencing in deep reinforcement learning. In International Conference on Learning Representations, 2018.

[2] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar. Efficient exploration through Bayesian deep Q-networks. In 2018 Information Theory and Applications Workshop (ITA), pages 1–9. IEEE, 2018.

[3] Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

[4] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.

[5] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424, 2015.

[6] Steven J Bradtke and Andrew G Barto.
Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33–57, 1996.

[7] Carlton Downey, Scott Sanner, et al. Temporal difference Bayesian model averaging: A Bayesian perspective on adapting lambda. In ICML, pages 311–318. Citeseer, 2010.

[8] Bradley Efron. The Jackknife, the Bootstrap, and Other Resampling Plans, volume 38. SIAM, 1982.

[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[10] Matthieu Geist and Bruno Scherrer. Off-policy learning with eligibility traces: A survey. Journal of Machine Learning Research, 15(1):289–333, 2014.

[11] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

[12] Stephan Mandt, Matthew Hoffman, and David Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016.

[13] Timothy A Mann, Hugo Penedones, Shie Mannor, and Todd Hester. Adaptive lambda least-squares temporal difference learning. arXiv preprint arXiv:1612.09465, 2016.

[14] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

[15] Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. In Advances in Neural Information Processing Systems, pages 4026–4034, 2016.

[16] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71.
Springer, 2003.

[17] Carlos Riquelme, George Tucker, and Jasper Snoek. Deep Bayesian bandits showdown: An empirical comparison of Bayesian deep networks for Thompson sampling. arXiv preprint arXiv:1802.09127, 2018.

[18] Bruno Scherrer. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In 27th International Conference on Machine Learning (ICML 2010), 2010.

[19] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International Conference on Machine Learning, pages 2171–2180, 2015.

[20] Rich Sutton, Ashique Rupam Mahmood, Doina Precup, and Hado Hasselt. A new Q(λ) with interim forward view and Monte Carlo equivalence. In Eric P. Xing and Tony Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pages 568–576, Beijing, China, 22–24 Jun 2014. PMLR.

[21] Richard Sutton, Ashique Rupam Mahmood, and Martha White. An emphatic approach to the problem of off-policy temporal-difference learning. Journal of Machine Learning Research, 17, 2015.

[22] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.

[23] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, 1998.

[24] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction, second edition. MIT Press, Cambridge, 2018.

[25] Philip S. Thomas, Scott Niekum, Georgios Theocharous, and George Konidaris. Policy evaluation using the Ω-return. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R.
Garnett, editors, Advances in Neural Information Processing Systems 28, pages 334–342. Curran Associates, Inc., 2015.

[26] John N Tsitsiklis and Benjamin Van Roy. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.

[27] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.

[28] Martha White and Adam M. White. A greedy approach to adapting the trace parameter for temporal difference learning. In AAMAS, 2016.