{"title": "Is the Bellman residual a bad proxy?", "book": "Advances in Neural Information Processing Systems", "page_first": 3205, "page_last": 3214, "abstract": "This paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, that are usually designed to maximize the mean value, and derive a method that minimizes the residual $\\|T_* v_\\pi - v_\\pi\\|_{1,\\nu}$ over policies. A theoretical analysis shows how good this proxy is to policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed for studying the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy to policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe that this question is worth to be considered.", "full_text": "Is the Bellman residual a bad proxy?\n\nMatthieu Geist1, Bilal Piot2,3 and Olivier Pietquin 2,3\n\n1 Universit\u00e9 de Lorraine & CNRS, LIEC, UMR 7360, Metz, F-57070 France\n\n2 Univ. 
Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL, F-59000 Lille, France\n\n3 Now with Google DeepMind, London, United Kingdom\n\nmatthieu.geist@univ-lorraine.fr\n\nbilal.piot@univ-lille1.fr, olivier.pietquin@univ-lille1.fr\n\nAbstract\n\nThis paper aims at theoretically and empirically comparing two standard optimization criteria for Reinforcement Learning: i) maximization of the mean value and ii) minimization of the Bellman residual. For that purpose, we place ourselves in the framework of policy search algorithms, which are usually designed to maximize the mean value, and derive a method that minimizes the residual \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,\u03bd} over policies. A theoretical analysis shows how good a proxy this is for policy optimization, and notably that it is better than its value-based counterpart. We also propose experiments on randomly generated generic Markov decision processes, specifically designed for studying the influence of the involved concentrability coefficient. They show that the Bellman residual is generally a bad proxy to policy optimization and that directly maximizing the mean value is much better, despite the current lack of deep theoretical analysis. This might seem obvious, as directly addressing the problem of interest is usually better, but given the prevalence of (projected) Bellman residual minimization in value-based reinforcement learning, we believe that this question is worth considering.\n\n1 Introduction\n\nReinforcement Learning (RL) aims at estimating a policy \u03c0 close to the optimal one, in the sense that its value, v\u03c0 (the expected discounted return), is close to maximal, i.e. \u2016v\u2217 \u2212 v\u03c0\u2016 is small (v\u2217 being the optimal value), for some norm. 
Controlling the residual \u2016T\u2217v\u03b8 \u2212 v\u03b8\u2016 (where T\u2217 is the optimal Bellman operator and v\u03b8 a value function parameterized by \u03b8) over a class of parameterized value functions is a classical approach in value-based RL, and especially in Approximate Dynamic Programming (ADP). Indeed, controlling this residual allows controlling the distance to the optimal value function: generally speaking, we have that\n\n\u2016v\u2217 \u2212 v_{\u03c0_{v\u03b8}}\u2016 \u2264 C/(1 \u2212 \u03b3) \u2016T\u2217v\u03b8 \u2212 v\u03b8\u2016,   (1)\n\nwith the policy \u03c0_{v\u03b8} being greedy with respect to v\u03b8 [17, 19].\nSome classical ADP approaches actually minimize a projected Bellman residual, \u2016\u03a0(T\u2217v\u03b8 \u2212 v\u03b8)\u2016, where \u03a0 is the operator projecting onto the hypothesis space to which v\u03b8 belongs: Approximate Value Iteration (AVI) [11, 9] tries to minimize this using a fixed-point approach, v_{\u03b8_{k+1}} = \u03a0T\u2217v_{\u03b8_k}, and it has been shown recently [18] that Least-Squares Policy Iteration (LSPI) [13] tries to minimize it using a Newton approach1. Notice that in this case (projected residual), there is no general performance bound2 for controlling \u2016v\u2217 \u2212 v_{\u03c0_{v\u03b8}}\u2016.\n\n1(Exact) policy iteration actually minimizes \u2016T\u2217v \u2212 v\u2016 using a Newton descent [10].\n2With a single action, this approach reduces to LSTD (Least-Squares Temporal Differences) [5], that can be arbitrarily bad in an off-policy setting [20].\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nDespite the fact that (unprojected) residual approaches come easily with performance guarantees, they are not extensively studied in the (value-based) literature (one can mention [3] that considers a subgradient descent or [19] that frames the norm of the residual as a delta-convex function). 
A reason for this is that they lead to biased estimates when the Markovian transition kernel is stochastic and unknown [1], which is a rather standard case. Projected Bellman residual approaches are more common, even if not introduced as such originally (notable exceptions are [16, 18]).\nAn alternative approach consists in directly maximizing the mean value E\u03bd[v\u03c0(S)] for a user-defined state distribution \u03bd, this being equivalent to directly minimizing \u2016v\u2217 \u2212 v\u03c0\u2016_{1,\u03bd}, see Sec. 2. This suggests defining a class of parameterized policies and optimizing over them, which is the predominant approach in policy search3 [7].\nThis paper aims at theoretically and experimentally studying these two approaches: maximizing the mean value (related algorithms operate on policies) and minimizing the residual (related algorithms operate on value functions). For that purpose, we place ourselves in the context of policy search algorithms. We adopt this position because we can derive a method that minimizes the residual \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,\u03bd} over policies and compare it to other methods that usually maximize the mean value. On the other hand, adapting ADP methods so that they maximize the mean value is much harder4. This new approach is presented in Sec. 3, and we show theoretically how good this proxy is.\nIn Sec. 4, we conduct experiments on randomly generated generic Markov decision processes to compare both approaches empirically. The experiments are specifically designed to study the influence of the involved concentrability coefficient. Despite the good theoretical properties of the Bellman residual approach, it turns out that it only works well if there is a good match between the sampling distribution and the discounted state occupancy distribution induced by the optimal policy, which is a very limiting requirement. 
In comparison, maximizing the mean value is rather insensitive to this issue and works well whatever the sampling distribution is, contrary to what the sole related theoretical bound suggests. This study thus suggests that maximizing the mean value, although it does not provide easy theoretical analysis, is a better approach to build efficient and robust RL algorithms.\n\n2 Background\n\n2.1 Notations\n\nLet \u2206X be the set of probability distributions over a finite set X and Y^X the set of applications from X to the set Y. By convention, all vectors are column vectors, except distributions (for left multiplication). A Markov Decision Process (MDP) is a tuple {S, A, P, R, \u03b3}, where S is the finite state space5, A is the finite action space, P \u2208 (\u2206S)^{S\u00d7A} is the Markovian transition kernel (P(s\u2032|s, a) denotes the probability of transiting to s\u2032 when action a is applied in state s), R \u2208 R^{S\u00d7A} is the bounded reward function (R(s, a) represents the local benefit of doing action a in state s) and \u03b3 \u2208 (0, 1) is the discount factor. For v \u2208 R^S, we write \u2016v\u2016_{1,\u03bd} = \u2211_{s\u2208S} \u03bd(s)|v(s)| the \u03bd-weighted \u2113_1-norm of v.\nNotice that when the function v \u2208 R^S is componentwise positive, that is v \u2265 0, the \u03bd-weighted \u2113_1-norm of v is actually its expectation with respect to \u03bd: if v \u2265 0, then \u2016v\u2016_{1,\u03bd} = E\u03bd[v(S)] = \u03bdv. We will make intensive use of this basic property in the following.\nA stochastic policy \u03c0 \u2208 (\u2206A)^S associates a distribution over actions to each state. 
The policy-induced reward and transition kernels, R\u03c0 \u2208 R^S and P\u03c0 \u2208 (\u2206S)^S, are defined as\n\nR\u03c0(s) = E_{\u03c0(.|s)}[R(s, A)] and P\u03c0(s\u2032|s) = E_{\u03c0(.|s)}[P(s\u2032|s, A)].\n\nThe quality of a policy is quantified by the associated value function v\u03c0 \u2208 R^S:\n\nv\u03c0(s) = E[\u2211_{t\u22650} \u03b3^t R\u03c0(S_t) | S_0 = s, S_{t+1} \u223c P\u03c0(.|S_t)].\n\n3A remarkable aspect of policy search is that it does not necessarily rely on the Markovian assumption, but this is out of the scope of this paper (residual approaches rely on it, through the Bellman equation). Some recent and effective approaches build on policy search, such as deep deterministic policy gradient [15] or trust region policy optimization [23]. Here, we focus on the canonical mean value maximization approach.\n4Approximate linear programming could be considered as such but is often computationally intractable [8, 6].\n5This choice is done for ease and clarity of exposition; the following results could be extended to continuous state and action spaces.\n\nThe value v\u03c0 is the unique fixed point of the Bellman operator T\u03c0, defined as T\u03c0v = R\u03c0 + \u03b3P\u03c0v for any v \u2208 R^S. Let us define the second Bellman operator T\u2217 as, for any v \u2208 R^S, T\u2217v = max_{\u03c0\u2208(\u2206A)^S} T\u03c0v. A policy \u03c0 is greedy with respect to v \u2208 R^S, denoted \u03c0 \u2208 G(v), if T\u03c0v = T\u2217v. There exists an optimal policy \u03c0\u2217 that satisfies componentwise v_{\u03c0\u2217} \u2265 v\u03c0, for all \u03c0 \u2208 (\u2206A)^S. 
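These definitions can be instantiated numerically. The following sketch (ours, with illustrative names; not code from the paper) builds a small random MDP, computes v\u03c0 in closed form as the fixed point of T\u03c0, and obtains the optimal value by iterating T\u2217:

```python
import numpy as np

# A small numerical sketch of the objects defined above (our own construction;
# all names are illustrative): v_pi as the fixed point of T_pi, and v_star
# obtained by iterating the optimal Bellman operator T_*.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] in Delta_S
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))             # bounded rewards

def kernels(pi):
    """Policy-induced reward R_pi(s) and transition kernel P_pi(s'|s)."""
    return (pi * R).sum(axis=1), np.einsum("sa,sat->st", pi, P)

def value(pi):
    """v_pi: the unique fixed point of T_pi v = R_pi + gamma P_pi v."""
    R_pi, P_pi = kernels(pi)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def T_star(v):
    """(T_* v)(s) = max_a [R(s, a) + gamma sum_{s'} P(s'|s, a) v(s')]."""
    return (R + gamma * P @ v).max(axis=1)

pi_unif = np.full((n_states, n_actions), 1.0 / n_actions)
v = value(pi_unif)
R_pi, P_pi = kernels(pi_unif)
assert np.allclose(R_pi + gamma * P_pi @ v, v)       # v_pi = T_pi v_pi

v_star = np.zeros(n_states)
for _ in range(1000):                                # value iteration contracts
    v_star = T_star(v_star)
assert np.allclose(T_star(v_star), v_star, atol=1e-8)
assert np.all(v_star >= v - 1e-8)                    # v_star dominates v_pi
```

Solving the linear system (I \u2212 \u03b3P\u03c0)v = R\u03c0 is exact here; the fixed-point iteration for v\u2217 converges because T\u2217 is a \u03b3-contraction.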
Moreover, we have that \u03c0\u2217 \u2208 G(v\u2217), with v\u2217 being the unique fixed point of T\u2217.\nFinally, for any distribution \u00b5 \u2208 \u2206S, the \u03b3-weighted occupancy measure induced by the policy \u03c0 when the initial state is sampled from \u00b5 is defined as\n\nd_{\u00b5,\u03c0} = (1 \u2212 \u03b3)\u00b5 \u2211_{t\u22650} \u03b3^t P\u03c0^t = (1 \u2212 \u03b3)\u00b5(I \u2212 \u03b3P\u03c0)^{\u22121} \u2208 \u2206S.\n\nFor two distributions \u00b5 and \u03bd, we write \u2016\u00b5/\u03bd\u2016\u221e for the smallest constant C satisfying, for all s \u2208 S, \u00b5(s) \u2264 C\u03bd(s). This quantity measures the mismatch between the two distributions.\n\n2.2 Maximizing the mean value\n\nLet P be a space of parameterized stochastic policies and let \u00b5 be a distribution of interest. The optimal policy has a higher value than any other policy, for any state. If the MDP is too large, satisfying this condition is not a reasonable requirement. Therefore, a natural idea consists in searching for a policy such that the associated value function is as close as possible to the optimal one, in expectation, according to a distribution of interest \u00b5. More formally, this means minimizing \u2016v\u2217 \u2212 v\u03c0\u2016_{1,\u00b5} = E\u00b5[v\u2217(S) \u2212 v\u03c0(S)] \u2265 0. The optimal value function being unknown, one cannot address this problem directly, but it is equivalent to maximizing E\u00b5[v\u03c0(S)].\nThis is the basic principle of many policy search approaches:\n\nmax_{\u03c0\u2208P} J\u03bd(\u03c0) with J\u03bd(\u03c0) = E\u03bd[v\u03c0(S)] = \u03bdv\u03c0.\n\nNotice that we used a sampling distribution \u03bd here, possibly different from the distribution of interest \u00b5. Related algorithms differ notably by the considered criterion (e.g., it can be the mean reward rather than the \u03b3-discounted cumulative reward considered here) and by how the corresponding optimization problem is solved. 
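The quantities just defined (the occupancy measure d_{\u00b5,\u03c0}, the mean value J\u03bd and the mismatch coefficient) can be checked numerically; the sketch below (our construction, with illustrative names) verifies in particular that (1 \u2212 \u03b3)\u00b5v\u03c0 equals the d_{\u00b5,\u03c0}-weighted mean reward:

```python
import numpy as np

# A sketch (ours, not the paper's code) of the gamma-weighted occupancy measure
# d_{mu,pi} = (1 - gamma) mu (I - gamma P_pi)^{-1}, the mean value
# J_mu(pi) = mu v_pi, and the concentrability coefficient ||mu / nu||_inf.
rng = np.random.default_rng(1)
n_states, gamma = 6, 0.95

P_pi = rng.dirichlet(np.ones(n_states), size=n_states)  # policy-induced kernel
R_pi = rng.uniform(0.0, 1.0, size=n_states)
mu = rng.dirichlet(np.ones(n_states))                   # distribution of interest

# Left multiplication by mu amounts to solving the transposed linear system.
d_mu_pi = (1 - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P_pi.T, mu)
assert np.isclose(d_mu_pi.sum(), 1.0) and np.all(d_mu_pi >= 0)  # lies in Delta_S

v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
J_mu = mu @ v_pi                                        # mean value E_mu[v_pi(S)]

# Consistency check: (1 - gamma) mu v_pi is the d_{mu,pi}-weighted mean reward.
assert np.isclose((1 - gamma) * J_mu, d_mu_pi @ R_pi)

# Concentrability ||mu / nu||_inf for a uniform sampling distribution nu:
nu = np.full(n_states, 1.0 / n_states)
C = np.max(mu / nu)   # smallest C with mu(s) <= C nu(s) for all s
```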
We refer to [7] for a survey on that topic.\nContrary to ADP, the theoretical efficiency of this family of approaches has not been studied much. Indeed, as far as we know, there is a sole performance bound for maximizing the mean value.\nTheorem 1 (Scherrer and Geist [22]). Assume that the policy space P is stable by stochastic mixture, that is \u2200\u03c0, \u03c0\u2032 \u2208 P, \u2200\u03b1 \u2208 (0, 1), (1 \u2212 \u03b1)\u03c0 + \u03b1\u03c0\u2032 \u2208 P. Define the \u03bd-greedy-complexity of the policy space P as\n\nE\u03bd(P) = max_{\u03c0\u2208P} min_{\u03c0\u2032\u2208P} d_{\u03bd,\u03c0}(T\u2217v\u03c0 \u2212 T_{\u03c0\u2032}v\u03c0).\n\nThen, any policy \u03c0 that is an \u03b5-local optimum of J\u03bd, in the sense that\n\n\u2200\u03c0\u2032 \u2208 P,  lim_{\u03b1\u21920} (\u03bdv_{(1\u2212\u03b1)\u03c0+\u03b1\u03c0\u2032} \u2212 \u03bdv\u03c0)/\u03b1 \u2264 \u03b5,\n\nenjoys the following global performance guarantee:\n\n\u00b5(v\u2217 \u2212 v\u03c0) \u2264 1/(1 \u2212 \u03b3)^2 \u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u2016\u221e (E\u03bd(P) + \u03b5).\n\nThis bound (as all bounds of this kind) has three terms: a horizon term, a concentrability term and an error term. The term 1/(1 \u2212 \u03b3) is the average optimization horizon. The concentrability coefficient (\u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u2016\u221e) measures the mismatch between the used distribution \u03bd and the \u03b3-weighted occupancy measure induced by the optimal policy \u03c0\u2217 when the initial state is sampled from the distribution of interest \u00b5. This tells that if \u00b5 is the distribution of interest, one should optimize J_{d_{\u00b5,\u03c0\u2217}}, which is not feasible, \u03c0\u2217 being unknown (in this case, the coefficient is equal to 1, its lower bound). 
This coefficient can be arbitrarily large: consider the case where \u00b5 concentrates on a single starting state (that is, \u00b5(s_0) = 1 for a given state s_0) and such that the optimal policy leads to other states (that is, d_{\u00b5,\u03c0\u2217}(s_0) < 1); the coefficient is then infinite. However, it is also the best concentrability coefficient according to [21], which provides a theoretical and empirical comparison of Approximate Policy Iteration (API) schemes. The error term is E\u03bd(P) + \u03b5, where E\u03bd(P) measures the capacity of the policy space to represent the policies being greedy with respect to the value of any policy in P, and \u03b5 tells how close the computed policy \u03c0 is to a local optimum of J\u03bd.\nThere exist other policy search approaches, based on ADP rather than on maximizing the mean value, such as Conservative Policy Iteration (CPI) [12] or Direct Policy Iteration (DPI) [14]. The bound of Thm. 1 matches the bounds of DPI or CPI. Actually, CPI can be shown to be a boosting approach maximizing the mean value. See the discussion in [22] for more details. However, this bound is also based on a very strong assumption (stability by stochastic mixture of the policy space) which is not satisfied by all commonly used policy parameterizations.\n\n3 Minimizing the Bellman residual\n\nDirect maximization of the mean value operates on policies, while residual approaches operate on value functions. To study these two optimization criteria together, we introduce a policy search method that minimizes a residual. As noted before, we do so because it is much simpler than introducing a value-based approach that maximizes the mean value. We also show how good a proxy this is for policy optimization. Although this algorithm is new, it is not claimed to be a core contribution of the paper. 
Yet it is clearly a mandatory step to support the comparison between optimization criteria.\n\n3.1 Optimization problem\n\nWe propose to search for a policy in P that minimizes the following Bellman residual:\n\nmin_{\u03c0\u2208P} J\u03bd(\u03c0) with J\u03bd(\u03c0) = \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,\u03bd}.\n\nNotice that, as for the maximization of the mean value, we used a sampling distribution \u03bd, possibly different from the distribution of interest \u00b5.\nFrom the basic properties of the Bellman operator, for any policy \u03c0 we have that T\u2217v\u03c0 \u2265 v\u03c0. Consequently, the \u03bd-weighted \u2113_1-norm of the residual is indeed the expected Bellman residual:\n\nJ\u03bd(\u03c0) = E\u03bd[[T\u2217v\u03c0](S) \u2212 v\u03c0(S)] = \u03bd(T\u2217v\u03c0 \u2212 v\u03c0).\n\nTherefore, there is naturally no bias problem for minimizing a residual here, contrary to other residual approaches [1]. This is an interesting result on its own, as removing the bias in value-based residual approaches is far from being straightforward. This results from the optimization being done over policies and not over values, and thus from v\u03c0 being an actual value (the one of the current policy) obeying the Bellman equation6.\nAny optimization method can be envisioned to minimize J\u03bd. Here, we simply propose to apply a subgradient descent (despite the lack of convexity).\nTheorem 2 (Subgradient of J\u03bd). Recall that given the considered notations, the distribution \u03bdP_{G(v\u03c0)} is the state distribution obtained by sampling the initial state according to \u03bd, applying the action being greedy with respect to v\u03c0 and following the dynamics to the next state. 
This being said, the subgradient of J\u03bd is given by\n\n\u2212\u2207J\u03bd(\u03c0) = 1/(1 \u2212 \u03b3) \u2211_{s,a} (d_{\u03bd,\u03c0}(s) \u2212 \u03b3d_{\u03bdP_{G(v\u03c0)},\u03c0}(s)) \u03c0(a|s) \u2207ln \u03c0(a|s) q\u03c0(s, a),\n\nwith q\u03c0(s, a) = R(s, a) + \u03b3 \u2211_{s\u2032\u2208S} P(s\u2032|s, a)v\u03c0(s\u2032) the state-action value function.\nProof. The proof relies on basic (sub)gradient calculus; it is given in the appendix.\n\n6The property T\u2217v \u2265 v does not hold if v is not the value function of a given policy, as in value-based approaches.\n\nThere are two terms in the negative subgradient \u2212\u2207J\u03bd: the first one corresponds to the gradient of J\u03bd, the second one (up to the multiplication by \u2212\u03b3) is the gradient of J_{\u03bdP_{G(v\u03c0)}} and acts as a kind of correction. This subgradient can be estimated using Monte Carlo rollouts, but doing so is harder than for classic policy search (as it requires additionally sampling from \u03bdP_{G(v\u03c0)}, which requires estimating the state-action value function). Also, this gradient involves computing the maximum over actions (as it requires sampling from \u03bdP_{G(v\u03c0)}, which comes from explicitly considering the Bellman optimality operator), which prevents easily extending this approach to continuous actions, contrary to classic policy search.\nThus, from an algorithmic point of view, this approach has drawbacks. Yet, we do not discuss further how to efficiently estimate this subgradient since we introduced this approach for the sake of comparison to standard policy search methods only. For this reason, we will consider an ideal algorithm in the experimental section where an analytical computation of the subgradient is possible, see Sec. 4. This will place us in an unrealistically good setting, which will help focus on the main conclusions. 
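To make Thm. 2 concrete, the following tabular sketch (our own instantiation, not the paper's implementation) assembles the subgradient for a Gibbs policy with one parameter per state-action pair, and checks it against finite differences of J\u03bd:

```python
import numpy as np

# A tabular sketch of the subgradient in Thm. 2 (our own instantiation, not the
# paper's code), for a Gibbs policy pi_w(a|s) proportional to exp(w[s, a]).
rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is P(.|s, a)
R = rng.uniform(0.0, 1.0, size=(nS, nA))
nu = rng.dirichlet(np.ones(nS))                 # sampling distribution

def pi_of(w):
    e = np.exp(w - w.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def value(pi):
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, (pi * R).sum(axis=1))

def occupancy(start, pi):
    """d_{start,pi} = (1 - gamma) start (I - gamma P_pi)^{-1}."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return (1 - gamma) * np.linalg.solve(np.eye(nS) - gamma * P_pi.T, start)

def J(w):
    """J_nu(pi_w) = nu (T_* v_pi - v_pi), which is nonnegative."""
    v = value(pi_of(w))
    return nu @ ((R + gamma * P @ v).max(axis=1) - v)

def neg_subgradient(w):
    pi = pi_of(w)
    v = value(pi)
    q = R + gamma * P @ v                        # q_pi(s, a)
    greedy = np.argmax(q, axis=1)                # the greedy policy G(v_pi)
    nu_PG = nu @ P[np.arange(nS), greedy]        # distribution nu P_{G(v_pi)}
    coef = occupancy(nu, pi) - gamma * occupancy(nu_PG, pi)
    # For this tabular Gibbs policy, grad_{w[s,b]} ln pi(a|s) = 1{a=b} - pi(b|s),
    # so the double sum of Thm. 2 collapses to an advantage-like expression.
    adv = q - (pi * q).sum(axis=1, keepdims=True)
    return coef[:, None] * pi * adv / (1 - gamma)

# Finite-difference check of -grad J at a random parameter (generically, the
# greedy action is unique there and J is differentiable).
w0 = rng.normal(size=(nS, nA))
g = neg_subgradient(w0)
eps = 1e-6
fd = np.zeros_like(w0)
for s in range(nS):
    for a in range(nA):
        wp, wm = w0.copy(), w0.copy()
        wp[s, a] += eps
        wm[s, a] -= eps
        fd[s, a] = (J(wp) - J(wm)) / (2 * eps)
assert np.allclose(-g, fd, atol=1e-5)
```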
Before this, we study how good a proxy this is for policy optimization.\n\n3.2 Analysis\n\nTheorem 3 (Proxy bound for residual policy search). We have that\n\n\u2016v\u2217 \u2212 v\u03c0\u2016_{1,\u00b5} \u2264 1/(1 \u2212 \u03b3) \u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u2016\u221e J\u03bd(\u03c0) = 1/(1 \u2212 \u03b3) \u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u2016\u221e \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,\u03bd}.\n\nProof. The proof can be easily derived from the analyses of [12], [17] or [22]. We detail it for completeness in the appendix.\n\nThis bound shows how controlling the residual helps in controlling the error. It has a linear dependency on the horizon, and the concentrability coefficient is the best one can expect (according to [21]). It has the same form as the bounds for value-based residual minimization [17, 19] (see also Eq. (1)). It is even better due to the involved concentrability coefficient (the ones for value-based bounds are worse, see [21] for a comparison).\nUnfortunately, this bound is hardly comparable to the one of Th. 1, due to the error terms. In Th. 3, the error term (the residual) is a global error (how good the residual is as a proxy), whereas in Th. 1 the error term is mainly a local error (how small the gradient is after maximizing the mean value). Notice also that Th. 3 is roughly an intermediate step for proving Th. 1, and that it applies to any policy (suggesting that searching for a policy that minimizes the residual makes sense). One could argue that a similar bound for mean value maximization would be something like: if J\u00b5(\u03c0) \u2265 \u03b1, then \u2016v\u2217 \u2212 v\u03c0\u2016_{1,\u00b5} \u2264 \u00b5v\u2217 \u2212 \u03b1. However, this is an oracle bound, as it depends on the unknown solution v\u2217. 
It is thus hardly exploitable.\nThe aim of this paper is to compare these two optimization approaches to RL. At first sight, directly maximizing the mean value should be better (as a more direct approach). If the bounds of Th. 1 and 3 are hardly comparable, we can still discuss the involved terms. The horizon term is better (linear instead of quadratic) for the residual approach. Yet, a horizon term can possibly be hidden in the residual itself. Both bounds imply the same concentrability coefficient, the best one can expect. This is a very important term in RL bounds, often underestimated: as these coefficients can easily explode, minimizing an error makes sense only if it is not multiplied by infinity. This coefficient suggests that one should use d_{\u00b5,\u03c0\u2217} as the sampling distribution. This is rarely reasonable, while using instead directly the distribution of interest is more natural. Therefore, the experiments we propose in the next section focus on the influence of this concentrability coefficient.\n\n4 Experiments\n\nWe consider Garnet problems [2, 4]. They are a class of randomly built MDPs meant to be totally abstract while remaining representative of the problems that might be encountered in practice. Here, a Garnet G(|S|, |A|, b) is specified by the number of states, the number of actions and the branching factor. For each (s, a) couple, b different next states are chosen randomly and the associated probabilities are set by randomly partitioning the unit interval. The reward is null, except for 10% of states where it is set to a random value, uniform in (1, 2). We set \u03b3 = 0.99.\nFor the policy space, we consider a Gibbs parameterization: P = {\u03c0_w : \u03c0_w(a|s) \u221d e^{w\u22a4\u03c6(s,a)}}. The features are also randomly generated, F(d, l). First, we generate binary state-features \u03d5(s) of dimension d, such that l components are set to 1 (the others are thus 0). 
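The Garnet construction described above can be sketched as follows (our reading of it; the helper names are ours, and we take the reward to be state-dependent only):

```python
import numpy as np

# Sketch of the Garnet construction G(|S|, |A|, b) described above (our reading
# of it, not the paper's code). For each (s, a), b next states are drawn without
# replacement and their probabilities come from a random partition of [0, 1].
def garnet(n_states, n_actions, b, rng):
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            nexts = rng.choice(n_states, size=b, replace=False)
            cuts = np.sort(rng.uniform(0.0, 1.0, size=b - 1))
            P[s, a, nexts] = np.diff(np.concatenate(([0.0], cuts, [1.0])))
    # Null reward except on 10% of the states, uniform in (1, 2); we read the
    # reward as state-dependent only (an assumption on our part).
    R = np.zeros((n_states, n_actions))
    rewarded = rng.choice(n_states, size=max(1, n_states // 10), replace=False)
    R[rewarded, :] = rng.uniform(1.0, 2.0, size=(len(rewarded), 1))
    return P, R

rng = np.random.default_rng(3)
P, R = garnet(30, 4, 2, rng)
assert np.allclose(P.sum(axis=2), 1.0)    # each P(.|s, a) is a distribution
assert np.all((P > 0).sum(axis=2) == 2)   # branching factor b = 2
```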
The positions of the 1\u2019s are selected randomly, such that no two states have the same feature. Then, the state-action features, of dimension d|A|, are classically defined as \u03c6(s, a) = (0 . . . 0 \u03d5(s)\u22a4 0 . . . 0)\u22a4, the position of the zeros depending on the action. Notice that in general this policy space is not stable by stochastic mixture, so the bound for policy search does not formally apply.\nWe compare classic policy search (denoted as PS(\u03bd)), which maximizes the mean value, and residual policy search (denoted as RPS(\u03bd)), which minimizes the mean residual. We optimize the respective objective functions with a normalized gradient ascent (resp. normalized subgradient descent) with a constant learning rate \u03b1 = 0.1. The gradients are computed analytically (as we have access to the model), so the following results represent an ideal case, in which one can do an infinite number of rollouts. Unless said otherwise, the distribution \u00b5 \u2208 \u2206S of interest is the uniform distribution.\n\n4.1 Using the distribution of interest\n\nFirst, we consider \u03bd = \u00b5. We generate randomly 100 Garnets G(30, 4, 2) and 100 features F(8, 3). For each Garnet-feature couple, we run both algorithms for T = 1000 iterations. For each algorithm, we measure two quantities: the normalized error \u2016v\u2217 \u2212 v\u03c0\u2016_{1,\u00b5}/\u2016v\u2217\u2016_{1,\u00b5} (notice that as rewards are positive, we have \u2016v\u2217\u2016_{1,\u00b5} = \u00b5v\u2217) and the Bellman residual \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,\u00b5}, where \u03c0 depends on the algorithm and on the iteration. We show the results (mean \u00b1 standard deviation) on Fig. 1.\n\na. Error for PS(\u00b5). b. Error for RPS(\u00b5). c. Residual for PS(\u00b5). d. Residual for RPS(\u00b5).\n\nFigure 1: Results on the Garnet problems, when \u03bd = \u00b5.\n\nFig. 1.a shows that PS(\u00b5) succeeds in decreasing the error. This was to be expected, as it is the criterion it optimizes. Fig. 1.c shows how the residual of the policies computed by PS(\u00b5) evolves. By comparing this to Fig. 1.a, it can be observed that the residual and the error are not necessarily correlated: the error can decrease while the residual increases, and a low error does not necessarily involve a low residual.\nFig. 1.d shows that RPS(\u00b5) succeeds in decreasing the residual. Again, this is not surprising, as it is the optimized criterion. Fig. 1.b shows how the error of the policies computed by RPS(\u00b5) evolves. Comparing this to Fig. 1.d, it can be observed that decreasing the residual lowers the error: this is consistent with the bound of Thm. 3.\nComparing Figs. 1.a and 1.b, it appears clearly that RPS(\u00b5) is less efficient than PS(\u00b5) for decreasing the error. This might seem obvious, as PS(\u00b5) directly optimizes the criterion of interest. However, when comparing the errors and the residuals for each method, it can be observed that they are not necessarily correlated. Decreasing the residual lowers the error, but one can have a low error with a high residual and vice versa.\nAs explained in Sec. 1, (projected) residual-based methods are prevalent in many reinforcement learning approaches. We consider a policy-based residual rather than a value-based one to ease the comparison, but it is worth studying the reason for such a different behavior.\n\n4.2 Using the ideal distribution\n\nThe lower the concentrability coefficient \u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u2016\u221e is, the better the bounds in Thm. 1 and 3 are. This coefficient is minimized for \u03bd = d_{\u00b5,\u03c0\u2217}. This is an unrealistic case (\u03c0\u2217 is unknown), but since we work with known MDPs we can compute this quantity, for the sake of a complete empirical analysis. 
Therefore, PS(d_{\u00b5,\u03c0\u2217}) and RPS(d_{\u00b5,\u03c0\u2217}) are compared in Fig. 2. We highlight the fact that the errors and the residuals shown in this figure are measured with respect to the distribution of interest \u00b5, and not the distribution d_{\u00b5,\u03c0\u2217} used for the optimization.\n\na. Error for PS(d_{\u00b5,\u03c0\u2217}). b. Error for RPS(d_{\u00b5,\u03c0\u2217}). c. Residual for PS(d_{\u00b5,\u03c0\u2217}). d. Residual for RPS(d_{\u00b5,\u03c0\u2217}).\n\nFigure 2: Results on the Garnet problems, when \u03bd = d_{\u00b5,\u03c0\u2217}.\n\nFig. 2.a shows that PS(d_{\u00b5,\u03c0\u2217}) succeeds in decreasing the error \u2016v\u2217 \u2212 v\u03c0\u2016_{1,\u00b5}. However, comparing Fig. 2.a to Fig. 1.a, there is no significant gain in using \u03bd = d_{\u00b5,\u03c0\u2217} instead of \u03bd = \u00b5. This suggests that the dependency of the bound in Thm. 1 on the concentrability coefficient is not tight. Fig. 2.c shows how the corresponding residual evolves. Again, there is no strong correlation between the residual and the error.\nFig. 2.d shows how the residual \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,\u00b5} evolves for RPS(d_{\u00b5,\u03c0\u2217}). It is not decreasing, but it is not what is optimized (the residual \u2016T\u2217v\u03c0 \u2212 v\u03c0\u2016_{1,d_{\u00b5,\u03c0\u2217}}, not shown, does decrease, in a similar fashion to Fig. 1.d). Fig. 2.b shows how the related error evolves. Compared to Fig. 2.a, there is no significant difference. 
The behavior of the residual is similar for both methods (Figs. 2.c and 2.d).\nOverall, this suggests that controlling the residual (RPS) allows controlling the error, but that this requires a wise choice for the distribution \u03bd. On the other hand, directly controlling the error (PS) is much less sensitive to this. In other words, this suggests a stronger dependency of the residual approach on the mismatch between the sampling distribution and the discounted state occupancy measure induced by the optimal policy.\n\n4.3 Varying the sampling distribution\n\nThis experiment is designed to study the effect of the mismatch between the distributions. We sample 100 Garnets G(30, 4, 2), as well as associated feature sets F(8, 3). The distribution of interest is no longer the uniform distribution, but a measure that concentrates on a single starting state of interest s_0: \u00b5(s_0) = 1. This is an adversarial case, as it implies that \u2016d_{\u00b5,\u03c0\u2217}/\u00b5\u2016\u221e = \u221e: the branching factor being equal to 2, the optimal policy \u03c0\u2217 cannot concentrate on s_0.\nThe sampling distribution is defined as a mixture between the distribution of interest and the ideal distribution. For \u03b1 \u2208 [0, 1], \u03bd\u03b1 is defined as \u03bd\u03b1 = (1 \u2212 \u03b1)\u00b5 + \u03b1d_{\u00b5,\u03c0\u2217}. It is straightforward to show that in this case the concentrability coefficient is indeed 1/\u03b1 (with the convention that 1/0 = \u221e):\n\n\u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u03b1\u2016\u221e = max( d_{\u00b5,\u03c0\u2217}(s_0) / ((1 \u2212 \u03b1) + \u03b1d_{\u00b5,\u03c0\u2217}(s_0)) ; 1/\u03b1 ) = 1/\u03b1.\n\nFor each MDP, the learning (for PS(\u03bd\u03b1) and RPS(\u03bd\u03b1)) is repeated, from the same initial policy, by setting \u03b1 = 1/k, for k \u2208 [1; 25]. 
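This identity is easy to check numerically; the sketch below (ours) uses an arbitrary positive distribution standing in for the unknown d_{\u00b5,\u03c0\u2217}:

```python
import numpy as np

# A numerical check (our sketch) of the identity above: with mu a Dirac at s0
# and nu_alpha = (1 - alpha) mu + alpha d_{mu,pi*}, the concentrability
# coefficient ||d_{mu,pi*} / nu_alpha||_inf equals 1 / alpha whenever
# d_{mu,pi*} puts mass outside s0.
rng = np.random.default_rng(4)
n_states, s0 = 10, 0

mu = np.zeros(n_states)
mu[s0] = 1.0                            # distribution of interest: Dirac at s0
d = rng.dirichlet(np.ones(n_states))    # stands in for d_{mu,pi*}; d[s0] < 1

for k in range(1, 26):                  # alpha = 1/k, as in the experiment
    alpha = 1.0 / k
    nu_alpha = (1 - alpha) * mu + alpha * d
    C = np.max(d / nu_alpha)            # nu_alpha > 0 wherever d > 0
    assert np.isclose(C, 1.0 / alpha)   # the coefficient is exactly 1 / alpha
```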
Let \u03c0_{t,x} be the policy learnt by algorithm x (PS or RPS) at iteration t; the integrated error (resp. integrated residual) is defined as\n\n(1/T) \u2211_{t=1}^{T} \u2016v\u2217 \u2212 v_{\u03c0_{t,x}}\u2016_{1,\u00b5} / \u2016v\u2217\u2016_{1,\u00b5}  (resp. (1/T) \u2211_{t=1}^{T} \u2016T\u2217v_{\u03c0_{t,x}} \u2212 v_{\u03c0_{t,x}}\u2016_{1,\u00b5}).\n\nNotice that here again, the integrated error and residual are defined with respect to \u00b5, the distribution of interest, and not \u03bd\u03b1, the sampling distribution used for optimization. We get an integrated error (resp. residual) for each value of \u03b1 = 1/k, and represent it as a function of k = \u2016d_{\u00b5,\u03c0\u2217}/\u03bd\u03b1\u2016\u221e, the concentrability coefficient. Results are presented in Fig. 3, which shows these functions averaged across the 100 randomly generated MDPs (mean \u00b1 standard deviation as before; minimum and maximum values are shown in dashed lines).\nFig. 3.a shows the integrated error for PS(\u03bd\u03b1). It can be observed that the mismatch between measures has no influence on the efficiency of the algorithm. Fig. 3.b shows the same thing for RPS(\u03bd\u03b1). The integrated error increases greatly as the mismatch between the sampling measure and the ideal one\n\na. Integrated error for PS(\u03bd\u03b1). b. Integrated error for RPS(\u03bd\u03b1). c. Integrated residual for PS(\u03bd\u03b1). d. 
Integrated residual\n\nfor RPS(\u03bd\u03b1).\n\nFigure 3: Results for the sampling distribution \u03bd\u03b1.\n\nincreases (the value to which the error saturates correspond to no improvement over the initial policy).\nComparing both \ufb01gures, it can be observed that RPS performs as well as PS only when the ideal\ndistribution is used (this corresponds to a concentrability coef\ufb01cient of 1). Fig. 3.c and 3.d show the\nintegrated residual for each algorithm. It can be observed that RPS consistently achieves a lower\nresidual than PS.\nOverall, this suggests that using the Bellman residual as a proxy is ef\ufb01cient only if the sampling\ndistribution is close to the ideal one, which is dif\ufb01cult to achieve in general (the ideal distribution\nd\u00b5,\u03c0\u2217 being unknown). On the other hand, the more direct approach consisting in maximizing the\nmean value is much more robust to this issue (and can, as a consequence, be considered directly with\nthe distribution of interest).\nOne could argue that the way we optimize the considered objective function is rather naive (for\nexample, considering a constant learning rate). But this does not change the conclusions of this\nexperimental study, that deals with how the error and the Bellman residual are related and with how\nthe concentrability in\ufb02uences each optimization approach. This point is developed in the appendix.\n\n5 Conclusion\n\nThe aim of this article was to compare two optimization approaches to reinforcement learning:\nminimizing a Bellman residual and maximizing the mean value. As said in Sec. 1, Bellman residuals\nare prevalent in ADP. Notably, value iteration minimizes such a residual using a \ufb01xed-point approach\nand policy iteration minimizes it with a Newton descent. On another hand, maximizing the mean\nvalue (Sec. 
2) is prevalent in policy search approaches.
As Bellman residual minimization methods are naturally value-based, and mean value maximization approaches naturally policy-based, we introduced a policy-based residual minimization algorithm in order to study both optimization problems together. For the introduced residual method, we proved a proxy bound better than the one available for value-based residual minimization. The different nature of the bounds of Th. 1 and 3 makes the comparison difficult, but both involve the same concentrability coefficient, a term often underestimated in RL bounds.
Therefore, we compared both approaches empirically on a set of randomly generated Garnets, the study being designed to quantify the influence of this concentrability coefficient. From these experiments, it appears that the Bellman residual is a good proxy for the error (the distance to the optimal value function) only if, luckily, the concentrability coefficient is small for the considered MDP and the distribution of interest, or if one can afford a change of measure for the optimization problem such that the sampling distribution is close to the ideal one. Regarding this second point, one can change to a measure different from the ideal one, d_{μ,π∗} (for example, using for ν a uniform distribution when the distribution of interest concentrates on a single state would help), but this is difficult in general (one should know roughly where the optimal policy will lead). Conversely, maximizing the mean value appears to be insensitive to this problem.
This suggests that the Bellman residual is generally a bad proxy to policy optimization, and that maximizing the mean value is more likely to result in efficient and robust reinforcement learning algorithms, despite the current lack of deep theoretical analysis.
This conclusion might seem obvious, as maximizing the mean value is a more direct approach, but this discussion has never been addressed in the literature, as far as we know, and we think it important, given the prevalence of (projected) residual minimization in value-based RL.