{"title": "Non-delusional Q-learning and value-iteration", "book": "Advances in Neural Information Processing Systems", "page_first": 9949, "page_last": 9959, "abstract": "We identify a fundamental source of error in Q-learning and other forms of dynamic programming with function approximation. Delusional bias arises when the approximation architecture limits the class of expressible greedy policies. Since standard Q-updates make globally uncoordinated action choices with respect to the expressible policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological behaviour such as over/under-estimation, instability and even divergence. To solve this problem, we introduce a new notion of policy consistency and define a local backup process that ensures global consistency through the use of information sets---sets that record constraints on policies consistent with backed-up Q-values. We prove that both the model-based and model-free algorithms using this backup remove delusional bias, yielding the first known algorithms that guarantee optimal results under general conditions. These algorithms furthermore only require polynomially many information sets (from a potentially exponential support). Finally, we suggest other practical heuristics for value-iteration and Q-learning that attempt to reduce delusional bias.", "full_text": "Non-delusional Q-learning and Value Iteration\n\nTyler Lu\nGoogle AI\n\ntylerlu@google.com\n\nDale Schuurmans\n\nGoogle AI\n\nschuurmans@google.com\n\nCraig Boutilier\n\nGoogle AI\n\ncboutilier@google.com\n\nAbstract\n\nWe identify a fundamental source of error in Q-learning and other forms of dy-\nnamic programming with function approximation. Delusional bias arises when the\napproximation architecture limits the class of expressible greedy policies. 
Since standard Q-updates make globally uncoordinated action choices with respect to the expressible policy class, inconsistent or even conflicting Q-value estimates can result, leading to pathological behaviour such as over/under-estimation, instability and even divergence. To solve this problem, we introduce a new notion of policy consistency and define a local backup process that ensures global consistency through the use of information sets—sets that record constraints on policies consistent with backed-up Q-values. We prove that both the model-based and model-free algorithms using this backup remove delusional bias, yielding the first known algorithms that guarantee optimal results under general conditions. These algorithms furthermore only require polynomially many information sets (from a potentially exponential support). Finally, we suggest other practical heuristics for value-iteration and Q-learning that attempt to reduce delusional bias.

1 Introduction

Q-learning is a foundational algorithm in reinforcement learning (RL) [34, 26]. Although Q-learning is guaranteed to converge to an optimal state-action value function (or Q-function) when state-action pairs are explicitly enumerated [34], it is potentially unstable when combined with function approximation (even simple linear approximation) [1, 8, 29, 26]. Numerous modifications of the basic update, restrictions on approximators, and training regimes have been proposed to ensure convergence or improve approximation error [12, 13, 27, 18, 17, 21]. Unfortunately, simple modifications are unlikely to ensure near-optimal performance, since it is NP-complete to determine whether even a linear approximator can achieve small worst-case Bellman error [23]. Developing variants of Q-learning with good worst-case behaviour for standard function approximators has remained elusive.
Despite these challenges, Q-learning remains a workhorse of applied RL.
The recent success of deep Q-learning, and its role in high-profile achievements [19], seems to obviate concerns about the algorithm's performance: the use of deep neural networks (DNNs), together with various augmentations (such as experience replay, hyperparameter tuning, etc.) can reduce instability and poor approximation. However, deep Q-learning is far from robust, and can rarely be applied successfully by inexperienced users. Modifications to mitigate systematic risks in Q-learning include double Q-learning [30], distributional Q-learning [4], and dueling network architectures [32]. A study of these and other variations reveals surprising results regarding the relative benefits of each under ablation [14]. Still, the full range of risks of approximation in Q-learning has yet to be delineated.
In this paper, we identify a fundamental problem with Q-learning (and other forms of dynamic programming) with function approximation, distinct from those previously discussed in the literature. Specifically, we show that approximate Q-learning suffers from delusional bias, in which updates are based on mutually inconsistent values. This inconsistency arises because the Q-update for a state-action pair, (s, a), is based on the maximum value estimate over all actions at the next state, which ignores the fact that the actions so-considered (including the choice of a at s) might not be jointly realizable given the set of admissible policies derived from the approximator.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
These "unconstrained" updates induce errors in the target values, and cause a distinct source of value estimation error: Q-learning readily backs up values based on action choices that the greedy policy class cannot realize.
Our first contribution is the identification and precise definition of delusional bias, and a demonstration of its detrimental consequences. From this new perspective, we are able to identify anomalies in the behaviour of Q-learning and value iteration (VI) under function approximation, and provide new explanations for previously puzzling phenomena. We emphasize that delusion is an inherent problem affecting the interaction of Q-updates with constrained policy classes—more expressive approximators, larger training sets and increased computation do not resolve the issue.
Our second contribution is the development of a new policy-consistent backup operator that fully resolves the problem of delusion. Our notion of consistency is in the same spirit as, but extends, other recent notions of temporal consistency [5, 22]. This new operator does not simply back up a single future value at each state-action pair, but instead backs up a set of candidate values, each with the associated set of policy commitments that justify it. We develop a model-based value iteration algorithm and a model-free Q-learning algorithm using this backup that carefully integrate value- and policy-based reasoning. These methods complement the value-based nature of value iteration and Q-learning with explicit constraints on the policies consistent with generated values, and use the values to select policies from the admissible policy class. We show that in the tabular case with policy constraints—isolating delusion-error from approximation error—the algorithms converge to an optimal policy in the admissible policy class.
We also show that the number of information sets is bounded polynomially when the greedy policy class has finite VC-dimension; hence, the algorithms have polynomial-time iteration complexity in the tabular case.
Finally, we suggest several heuristic methods for imposing policy consistency in batch Q-learning for larger problems. Since consistent backups can cause information sets to proliferate, we suggest search heuristics that focus attention on promising information sets, as well as methods that impose (or approximate) policy consistency within batches of training data, in an effort to drive the approximator toward better estimates.

2 Preliminaries

A Markov decision process (MDP) is defined by a tuple M = (S, A, p, p0, R, γ) specifying a set of states S and actions A; a transition kernel p; an initial state distribution p0; a reward function R; and a discount factor γ ∈ [0, 1]. A (stationary, deterministic) policy π : S → A specifies the agent's action at every state s. The state-value function for π is given by V π(s) = E[Σ_{t≥0} γ^t R(st, π(st))], while the state-action value (or Q-function) is Qπ(s, a) = R(s, a) + γ E_{p(s′|s,a)} V π(s′), where expectations are taken over random transitions and rewards. Given any Q-function, the policy "Greedy" is defined by selecting an action a at state s that maximizes Q(s, a). If Q = Q∗, then Greedy is optimal.
When p is unknown, Q-learning can be used to acquire the optimal Q∗ by observing trajectories generated by some (sufficiently exploratory) behavior policy. In domains where tabular Q-learning is impractical, function approximation is typically used [33, 28, 26]. With function approximation, Q-values are approximated by some function from a class parameterized by Θ (e.g., the weights of a linear function or neural network).
We let F = {fθ : S × A → R | θ ∈ Θ} denote the set of expressible value function approximators, and denote the class of admissible greedy policies by

G(Θ) = {πθ | πθ(s) = argmax_{a∈A} fθ(s, a), θ ∈ Θ}.   (1)

In such cases, online Q-learning at transition s, a, r, s′ (action a is taken at state s, leading to reward r and next state s′) uses the following update given a previously estimated Q-function Qθ ∈ F:

θ ← θ + α (r + γ max_{a′∈A} Qθ(s′, a′) − Qθ(s, a)) ∇θ Qθ(s, a).   (2)

Batch versions of Q-learning (e.g., fitted Q-iteration, batch experience replay) are similar, but fit a regressor repeatedly to batches of training examples (and are usually more data efficient and stable).

Figure 1: A simple MDP that illustrates delusional bias (see text for details).

3 Delusional bias and its consequences

The problem of delusion can be given a precise statement (which is articulated mathematically in Section 4): delusional bias occurs whenever a backed-up value estimate is derived from action choices that are not realizable in the underlying policy class. A Q-update backs up values for each state-action pair (s, a) by independently choosing actions at the corresponding next states s′ via the max operator; this process implicitly assumes that max_{a′∈A} Qθ(s′, a′) is achievable. However, the update can become inconsistent under function approximation: if no policy in the admissible class can jointly express all past (implicit) action selections, backed-up values do not correspond to Q-values that can be achieved by any expressible policy.
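To make the update concrete, here is a minimal sketch of Eq. 2 for a linear approximator Qθ(s, a) = θ · φ(s, a); the particular feature map, learning rate, and integer action encoding used below are illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

def q_update(theta, phi, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One online Q-learning step (Eq. 2) for a linear approximator
    Q_theta(s, a) = theta . phi(s, a). For linear Q, the gradient
    grad_theta Q_theta(s, a) is simply the feature vector phi(s, a)."""
    # Bootstrapped target: max over next-state actions (the step that
    # implicitly assumes all such action choices are jointly realizable).
    target = r + gamma * max(theta @ phi(s_next, ap) for ap in actions)
    td_error = target - theta @ phi(s, a)
    return theta + alpha * td_error * phi(s, a)
```

For linear Qθ the gradient term reduces to the feature vector, which is why the update is just a TD-error-scaled feature step.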
(We note that the source of this potential estimation error is quite different from the optimistic bias of maximizing over noisy Q-values addressed by double Q-learning; see Appendix A.5.) Although the consequences of such delusional bias might appear subtle, we demonstrate how delusion can profoundly affect both Q-learning and value iteration. Moreover, these detrimental effects manifest themselves in diverse ways that appear disconnected, but are symptoms of the same underlying cause. To make these points, we provide a series of concrete counter-examples. Although we use linear approximation for clarity, the conclusions apply to any approximator class with finite capacity (e.g., DNNs with fixed architectures), since there will always be a set of d + 1 state-action choices that are jointly infeasible given a function approximation architecture with VC-dimension d < ∞ [31] (see Theorem 1 for the precise statement).

3.1 A concrete demonstration

We begin with a simple illustration. Consider the undiscounted MDP in Fig. 1, where episodes start at s1, and there are two actions: a1 causes termination, except at s1 where it can move to s4 with probability q; a2 moves deterministically to the next state in the sequence s1 to s4 (with termination when a2 is taken at s4). All rewards are 0 except for R(s1, a1) and R(s4, a2). For concreteness, let q = 0.1, R(s1, a1) = 0.3 and R(s4, a2) = 2. Now consider a linear approximator fθ(φ(s, a)) with two state-action features: φ(s1, a1) = φ(s4, a1) = (0, 1); φ(s1, a2) = φ(s2, a2) = (0.8, 0); φ(s3, a2) = φ(s4, a2) = (−1, 0); and φ(s2, a1) = φ(s3, a1) = (0, 0). Observe that no π ∈ G(Θ) can satisfy both π(s2) = a2 and π(s3) = a2, hence the optimal unconstrained policy (take a2 everywhere, with expected value 2) is not realizable. Q-updating can therefore never converge to the unconstrained optimal policy.
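The claimed joint infeasibility can be checked numerically from these features: π(s2) = a2 forces 0.8·θ1 > 0, while π(s3) = a2 forces −θ1 > 0. A brute-force sketch that samples parameter vectors (strict-inequality tie-breaking is an assumption on our part):

```python
import numpy as np

# State-action features from the example of Fig. 1.
phi = {("s1", "a1"): (0.0, 1.0), ("s4", "a1"): (0.0, 1.0),
       ("s1", "a2"): (0.8, 0.0), ("s2", "a2"): (0.8, 0.0),
       ("s3", "a2"): (-1.0, 0.0), ("s4", "a2"): (-1.0, 0.0),
       ("s2", "a1"): (0.0, 0.0), ("s3", "a1"): (0.0, 0.0)}

def greedy_takes_a2(theta, s):
    """True iff the greedy policy for theta strictly prefers a2 at s."""
    f = lambda a: theta @ np.array(phi[(s, a)])
    return f("a2") > f("a1")

rng = np.random.default_rng(0)
# No sampled theta prefers a2 at both s2 and s3 simultaneously.
assert not any(greedy_takes_a2(th, "s2") and greedy_takes_a2(th, "s3")
               for th in rng.normal(size=(10000, 2)))
```

Sampling is only a sanity check here; the contradiction (θ1 > 0 versus θ1 < 0) is immediate from the feature definitions.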
Instead, the optimal achievable policy in G(Θ) takes a1 at s1 and a2 at s4 (achieving a value of 0.5, realizable with θ∗ = (−2, 0.5)).
Unfortunately, Q-updating is unable to find the optimal admissible policy πθ∗ in this example. How this inability materializes depends on the update regime, so consider online Q-learning (Eq. 2) with data generated using an εGreedy behavior policy (ε = 0.5). In this case, it is not hard to show that Q-learning must converge to a fixed point θ̂ = (θ̂1, θ̂2) where −θ̂1 ≤ θ̂2, implying that πθ̂(s4) ≠ a2, i.e., πθ̂ ≠ πθ∗ (we also show this for any ε ∈ [0, 1/2] when R(s1, a1) = R(s4, a2) = 1; see derivations in Appendix A.1). Instead, Q-learning converges to a fixed point that gives a "compromised" admissible policy which takes a1 at both s1 and s4 (with a value of 0.3; θ̂ ≈ (−0.235, 0.279)).
This example shows how delusional bias prevents Q-learning from reaching a reasonable fixed point. Consider the backups at (s2, a2) and (s3, a2). Suppose θ̂ assigns a "high" value to (s3, a2) (i.e., so that Qθ̂(s3, a2) > Qθ̂(s3, a1)) as required by πθ∗; intuitively, this requires that θ̂1 < 0, and generates a "high" bootstrapped value for (s2, a2). But any update to θ̂ that tries to fit this value (i.e., makes Qθ̂(s2, a2) > Qθ̂(s2, a1)) forces θ̂1 > 0, which is inconsistent with the assumption, θ̂1 < 0, needed to generate the high bootstrapped value. In other words, any update that moves (s2, a2) higher undercuts the justification for it to be higher.
The result is that the Q-updates compete with each other, with Qθ̂(s2, a2) converging to a compromise value that is not realizable by any policy in G(Θ).
This induces an inferior policy with lower expected value than πθ∗. We show in Appendix A.1 that avoiding any backup of these inconsistent edges results in Q-learning converging to the optimal expressible policy. Critically, this outcome is not due to approximation error itself, but the inability of Q-learning to find the value of the optimal representable policy.

3.2 Consequences of delusion

There are several additional manifestations of delusional bias that cause detrimental outcomes under Q-updating. Concrete examples are provided to illustrate each, but we relegate details to the appendix.
Divergence: Delusional bias can cause Q-updating to diverge. We provide a detailed example of divergence in Appendix A.2 using a simple linear approximator. While divergence is typically attributed to the interaction of the approximator with Bellman or Q-backups, the example shows that if we correct for delusional bias, convergent behavior is restored. Lack of convergence due to cyclic behavior (with a lower-bound on learning rates) can also be caused by delusion: see Appendix A.3.
The Discounting Paradox: Another phenomenon induced by delusional bias is the discounting paradox: given an MDP with a specific discount factor γeval, Q-learning with a different discount γtrain results in a Q-function whose greedy policy has better performance, relative to the target γeval, than when trained with γeval. In Appendix A.4, we provide an example where the paradox is extreme: a policy trained with γ = 1 is provably worse than one trained myopically with γ = 0, even when evaluated using γ = 1. We also provide an example where the gap can be made arbitrarily large.
These results suggest that treating the discount as a hyperparameter might yield systematic training benefits; we demonstrate that this is indeed the case on some benchmark (Atari) tasks in Appendix A.10.
Approximate Dynamic Programming: Delusional bias arises not only in Q-learning, but also in approximate dynamic programming (ADP) (e.g., [6, 9]), such as approximate value iteration (VI). With value function approximation, VI performs full state Bellman backups (as opposed to sampled backups as in Q-learning), but, like Q-learning, applies the max operator independently at successor states when computing expected next-state values. When these choices fall outside the greedy policy class admitted by the function approximator, delusional bias can arise. Delusion can also occur with other forms of policy constraints (without requiring the value function itself to be approximated).
Batch Q-learning: In the example above, we saw that delusional bias can cause convergence to Q-functions that induce poor (greedy) policies in standard online Q-learning. The precise behavior depends on the training regime, but poor behavior can emerge in batch methods as well. For instance, batch Q-learning with experience replay and replay buffer shuffling will induce the same tension between the conflicting updates. Specific (nonrandom) batching schemes can cause even greater degrees of delusion; for example, training in a sequence of batches that run through a batch of transitions at s4, followed by batches at s3, then s2, then s1 will induce a Q-function that deludes itself into estimating the value of (s1, a2) to be that of the optimal unconstrained policy.

4 Non-delusional Q-learning and dynamic programming

We now develop a provably correct solution that directly tackles the source of the problem: the potential inconsistency of the set of Q-values used to generate a Bellman or Q-backup.
Our approach avoids delusion by using information sets to track the "dependencies" contained in all Q-values, i.e., the policy assumptions required to justify any such Q-value. Backups then prune infeasible values whose information sets are not policy-class consistent. Since backed-up values might be designated inconsistent when new dependencies are added, this policy-consistent backup must maintain alternative information sets and their corresponding Q-values, allowing the (implicit) backtracking of prior decisions (i.e., max Q-value choices). Such a policy-consistent backup can be viewed as unifying both value- and policy-based RL methods, a perspective we detail in Sec. 4.3.
We develop policy-consistent backups in the tabular case while allowing for an arbitrary policy class (or arbitrary policy constraints)—the case of greedy policies with respect to some approximation architecture fθ is simply a special case. This allows the method to focus on delusion, without making any assumptions about the specific value approximation. Because delusion is a general phenomenon, we first develop a model-based consistent backup, which gives rise to non-delusional policy-class value iteration, and then describe the sample-backup version, policy-class Q-learning. Our main theorem establishes the convergence, correctness and optimality of the algorithm (including the complete removal of delusional bias), and computational tractability (subject to a tractable consistency oracle).

Algorithm 1 Policy-Class Value Iteration (PCVI)
Input: S, A, p(s′ | s, a), R, γ, Θ, initial state s0
1: Q[sa] ← initialize to mapping Θ ↦ 0 for all s, a
2: ConQ[sa] ← initialize to mapping [s ↦ a] ↦ 0 for all s, a
3: Update ConQ[s] for all s (i.e., combine all table entries in ConQ[sa1], . . . , ConQ[sam])
4: repeat
5:   for all s, a do
6:     Q[sa] ← Rsa + γ ⊕_{s′} p(s′ | s, a) ConQ[s′]
7:     ConQ[sa](Z) ← Q[sa](X) for all X such that Z = X ∩ [s ↦ a] is non-empty
8:     Update ConQ[s] by combining table entries of ConQ[sa′] for all a′
9:   end for
10: until Q converges: dom(Q[sa]) and Q[sa](X) does not change for all s, a, X
11: /* Then recover an optimal policy */
12: X∗ ← argmax_X ConQ[s0](X)
13: q∗ ← ConQ[s0](X∗)
14: θ∗ ← Witness(X∗)
15: return πθ∗ and q∗.

4.1 Policy-class value iteration

We begin by defining policy-class value iteration (PCVI), a new VI method that operates on collections of information sets to guarantee discovery of the optimal policy in a given class. For concreteness, we specify a policy class using Q-function parameters, which determines the class of realizable greedy policies (just as in classical VI). Proofs and more formal definitions can be found in Appendix A.6. We provide a detailed illustration of the PCVI algorithm in Appendix A.7, walking through the steps of PCVI on the example MDP in Fig. 1.
Assume an MDP with n states S = {s1, . . . , sn} and m actions A = {a1, . . . , am}. Let Θ be the parameter class defining Q-functions. Let F and G(Θ), as above, denote the class of expressible value functions and admissible greedy policies respectively. (We assume ties are broken in some canonical fashion.) Define [s ↦ a] = {θ ∈ Θ | πθ(s) = a}. An information set X ⊆ Θ is a set of parameters (more generally, policy constraints) that justify assigning a particular Q-value q to some (s, a) pair.
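Before the formal machinery, the intended outcome on the MDP of Fig. 1 can be previewed by brute force: sample parameters, collect the realizable greedy policies, and evaluate each policy as a whole rather than mixing action choices across policies. This is only an illustrative stand-in for the partition-based computation (sampling replaces exact cell enumeration), not the algorithm itself:

```python
import numpy as np

# Features and rewards from the example of Fig. 1 (q = 0.1, undiscounted).
phi = {("s1", "a1"): (0.0, 1.0), ("s1", "a2"): (0.8, 0.0),
       ("s2", "a1"): (0.0, 0.0), ("s2", "a2"): (0.8, 0.0),
       ("s3", "a1"): (0.0, 0.0), ("s3", "a2"): (-1.0, 0.0),
       ("s4", "a1"): (0.0, 1.0), ("s4", "a2"): (-1.0, 0.0)}

def greedy(theta):
    """Greedy policy induced by theta (ties broken toward a1)."""
    return {s: max(("a1", "a2"), key=lambda a: np.dot(theta, phi[(s, a)]))
            for s in ("s1", "s2", "s3", "s4")}

def value(pi, q=0.1, r11=0.3, r42=2.0):
    """Exact expected return of a deterministic policy from s1."""
    v4 = r42 if pi["s4"] == "a2" else 0.0   # a1 at s4 terminates with 0
    v3 = v4 if pi["s3"] == "a2" else 0.0
    v2 = v3 if pi["s2"] == "a2" else 0.0
    return v2 if pi["s1"] == "a2" else r11 + q * v4

rng = np.random.default_rng(1)
admissible = {tuple(sorted(greedy(th).items()))
              for th in rng.normal(size=(20000, 2))}
best = max(admissible, key=lambda pi: value(dict(pi)))
```

Evaluating whole policies recovers the optimal expressible policy (a1 at s1, a2 at s4, value 0.5), exactly the answer that uncoordinated Q-updates miss.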
Below we use the term "information set" to refer both to X and (X, q) as needed.
Information sets will be organized into finite partitions of Θ, i.e., a set of non-empty subsets P = {X1, . . . , Xk} such that X1 ∪ · · · ∪ Xk = Θ and Xi ∩ Xj = ∅ for all i ≠ j. We abstractly refer to the elements of P as cells. A partition P′ is a refinement of P if for all X′ ∈ P′ there exists an X ∈ P such that X′ ⊆ X. Let P(Θ) be the set of all finite partitions of Θ. A partition function h : P → R associates values (e.g., Q-values) with all cells (e.g., information sets). Let H = {h : P → R | P ∈ P(Θ)} denote the set of all such partition functions. Define the intersection sum for h1, h2 ∈ H to be:

(h1 ⊕ h2)(X1 ∩ X2) = h1(X1) + h2(X2),   ∀X1 ∈ dom(h1), X2 ∈ dom(h2), X1 ∩ X2 ≠ ∅.

Note that the intersection sum incurs at most a quadratic blowup: |dom(h)| ≤ |dom(h1)| · |dom(h2)|.
The methods below require an oracle to check whether a policy πθ is consistent with a set of state-to-action constraints: i.e., given {(s, a)} ⊆ S × A, whether there exists a θ ∈ Θ such that πθ(s) = a for all pairs. We assume access to such an oracle, "Witness". For linear Q-function parameterizations, Witness can be implemented in polynomial time by checking the consistency of a system of linear inequalities.
PCVI, shown in Alg. 1, computes the optimal policy πθ∗ ∈ G(Θ) by using information sets and their associated Q-values organized into partitions (i.e., partition functions over Θ). We represent Q-functions using a table Q with one entry Q[sa] for each (s, a) pair. Each such Q[sa] is a partition function over dom(Q[sa]) ∈ P(Θ). For each Xi ∈ dom(Q[sa]) (i.e., for each information set Xi ⊆ Θ associated with (s, a)), we assign a unique Q-value Q[sa](Xi). Intuitively, the Q-value Q[sa](Xi) is justified only if we limit attention to policies {πθ : θ ∈ Xi}. Since dom(Q[sa]) is a partition, we have a Q-value for any realizable policy. (The partitions dom(Q[sa]) for each (s, a) generally differ.)
ConQ[sa] is a restriction of Q[sa] obtained by intersecting each cell in its domain, dom(Q[sa]), with [s ↦ a]. In other words, ConQ[sa] is a partition function defined on some partition of [s ↦ a] (rather than all of Θ), and represents Q-values of cells that are consistent with [s ↦ a]. Thus, if Xi ∩ [s ↦ a] = ∅ for some Xi ∈ dom(Q[sa]), the corresponding Q-value disappears in ConQ[sa]. Finally, ConQ[s] = ∪a ConQ[sa] is the partition function over Θ obtained by collecting all the "restricted" action value functions. Since ∪a [s ↦ a] is a partition of Θ, so is ConQ[s].
The key update in Alg. 1 is Line 6, which jointly updates all Q-values of the relevant sets of policies in G(Θ). Notice that the maximization typically found in VI is not present—this is because the operation computes and records Q-values for all choices of actions at the successor state s′. This is the key to allowing VI to maintain consistency: if a future Bellman backup is inconsistent with some previous max-choice at a reachable state, the corresponding cell will be pruned and an alternative maximum will take its place.
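For linear Q-function parameterizations, the Witness test behind this pruning reduces to linear feasibility. A sketch via scipy.optimize.linprog, where encoding the strict inequalities with a unit margin is our assumption (harmless here, since the linear class is homogeneous and any strictly feasible θ can be rescaled):

```python
import numpy as np
from scipy.optimize import linprog

def witness(constraints, phi, actions, dim):
    """Witness oracle for linear greedy policies: return a theta whose
    greedy policy satisfies pi_theta(s) = a for every (s, a) in
    `constraints`, or None if no such theta exists. Each strict
    preference theta.phi(s, a) > theta.phi(s, a') is encoded with
    margin 1, i.e. theta.(phi(s, a') - phi(s, a)) <= -1."""
    A_ub, b_ub = [], []
    for s, a in constraints:
        for ap in actions:
            if ap != a:
                A_ub.append(np.array(phi[(s, ap)]) - np.array(phi[(s, a)]))
                b_ub.append(-1.0)
    res = linprog(np.zeros(dim), A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(None, None)] * dim)  # feasibility check only
    return res.x if res.success else None
```

On the features of Fig. 1, such a check rejects the constraint set {(s2, a2), (s3, a2)} and accepts, e.g., {(s3, a2), (s4, a2)}.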
Pruning of cells, using the oracle Witness, is implicit in Line 6 (pruning of ⊕) and Line 7 (where non-emptiness is tested).¹ Convergence of PCVI requires that each Q[sa] table—both its partition and associated Q-value—converge to a fixed point.

Theorem 1 (PCVI Theorem). PCVI (Alg. 1) has the following guarantees:

(a) (Convergence and correctness) The function Q converges and, for each s ∈ S, a ∈ A, and any θ ∈ Θ: there is a unique X ∈ dom(Q[sa]) such that θ ∈ X and Qπθ(s, a) = Q[sa](X).
(b) (Optimality and non-delusion) πθ∗ is an optimal policy within G(Θ) and q∗ is its value.
(c) (Runtime bound) Assume ⊕ and non-emptiness checks (lines 6 and 7) have access to Witness. Let G = {gθ(s, a, a′) := 1[fθ(s, a) − fθ(s, a′) > 0], ∀s, a ≠ a′ | θ ∈ Θ}. Each iteration of Alg. 1 runs in time O(nm · [(m choose 2) n]^(2·VCDim(G)) (m − 1) w), where VCDim(·) denotes the VC-dimension of a set of boolean-valued functions, and w is the worst-case running time of Witness (with at most nm state-action constraints). Combined with Part (a), if VCDim(G) is finite then Q converges in time polynomial in n, m and w.

Corollary 2. Alg. 1 runs in polynomial time for linear greedy policies. It runs in polynomial time in the presence of a polynomial-time Witness for deep Q-network (DQN) greedy policies.

(A more complete statement of Cor. 2 is found in Appendix A.6.) The number of cells in a partition may be significantly less than suggested by the bounds, as it depends on the reachability structure of the MDP. For example, in an MDP with only self-transitions, the partition for each state has a single cell. We note that Witness is tractable for linear approximators, but is NP-hard for DNNs [7]. The poly-time result in Cor.
2 does not contradict the NP-hardness of finding a linear approximator with small worst-case Bellman error [23], since nothing is asserted about the Bellman error and we are treating the approximator's VC-dimension as a constant.

Demonstrating PCVI: We illustrate PCVI with a simple example that shows how poorly classical approaches can perform with function approximation, even in "easy" environments. Consider a simple deterministic grid world with the 4 standard actions and rewards of 0, except 1 at the top-right, 2 at the bottom-left, and 10 at the bottom-right corners; the discount is γ = 0.95. The agent starts at the top-left. The optimal policy is to move down the left side to the left-bottom corner, then along the bottom to the right-bottom corner, then staying. To illustrate the effects of function approximation, we considered linear approximators defined over random feature representations: feature vectors were produced for each state-action pair by drawing independent standard normal values.
Fig. 2 shows the estimated maximum value achievable from the start state produced by each method (dark lines), along with the actual expected value achieved by the greedy policies produced by each method (light lines). The left figure shows results for a 4 × 4 grid with 4 random features, and the right for a 5 × 5 grid with 5 random features. Results are averaged over 10 runs with different random feature sets (shared by the algorithms).
Surprisingly, even when the linear approximator can support near-optimal policies, classical methods can utterly fail to realize this possibility: in 9 of 10 trials (4 × 4) and 10 of 10 trials (5 × 5) the classical methods produce greedy policies with an expected value of zero, while PCVI produces policies with value comparable to the global optimum.

¹If arbitrary policy constraints are allowed, there may be no feasible policies, in which case Witness will prune each cell immediately, leaving no Q-functions, as desired.

Figure 2: Planning and learning in a grid world with random feature representations. (Left: 4 × 4 grid using 4 features; Right: 5 × 5 grid using 5 features.) Here "iterations" means a full sweep over state-action pairs, except for Q-learning and PCQL, where an iteration is an episode of length 3/(1 − γ) = 60 using εGreedy exploration with ε = 0.7. Dark lines: estimated maximum achievable expected value. Light lines: actual expected value achieved by greedy policy.

4.2 Policy-class Q-learning

A tabular version of Q-learning using the same partition-function representation of Q-values as in PCVI yields policy-class Q-learning PCQL, shown in Alg.
2.² The key difference with PCVI is simply that we use sample backups in Line 4 instead of full Bellman backups as in PCVI.

Algorithm 2 Policy-Class Q-learning (PCQL)
Input: Batch B = {(s_t, a_t, r_t, s′_t)}, t = 1, . . . , T; γ, Θ, scalars α_t^sa.
1: for (s, a, r, s′) ∈ B, t is iteration counter do
2:   For all a′, if s′a′ ∉ ConQ then initialize ConQ[s′a′] ← ([s′ ↦ a′] ↦ 0).
3:   Update ConQ[s′] by combining ConQ[s′a′](X), for all a′, X ∈ dom(ConQ[s′a′])
4:   Q[sa] ← (1 − α_t^sa) Q[sa] ⊕ α_t^sa (r + γ ConQ[s′])
5:   ConQ[sa](Z) ← Q[sa](X) for all X such that Z = X ∩ [s ↦ a] is non-empty
6: end for
7: Return ConQ, Q

The method converges under the usual assumptions for Q-learning: a straightforward extension of the proof for PCVI, replacing full VI backups with Q-learning-style sample backups, yields the following:

Theorem 3. The (a) convergence and correctness properties and (b) optimality and non-delusion properties associated with the PCVI Theorem 1 hold for PCQL, assuming the usual sampling requirements, the Robbins-Monro stochastic convergence conditions on learning rates α_t^sa and access to the Witness oracle.

Demonstrating PCQL: We illustrate PCQL in the same grid world tasks as before, again using random features. Figure 2 shows that PCQL achieves comparable performance to PCVI, but with lighter time and space requirements, and is still significantly better than classical methods.
We also applied PCQL to the initial illustrative example in Fig. 1 with R(s1, a1) = 0.3, R(s4, a2) = 2 and uniform random exploration as the behaviour policy, adding the use of a value approximator (a linear regressor).
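The intersection-sum backup in Line 4 of Alg. 2 can be sketched with cells represented as frozensets over a finite discretization of Θ; this toy representation and the dict-based partition functions are our assumptions, standing in for parameter regions pruned by Witness:

```python
def isum(h1, h2):
    """Intersection sum (⊕) of two partition functions: keep every
    non-empty pairwise cell intersection and add the cell values.
    (For genuine partitions the intersections are distinct cells.)"""
    return {c1 & c2: v1 + v2
            for c1, v1 in h1.items() for c2, v2 in h2.items() if c1 & c2}

def pcql_backup(Q_sa, conQ_snext, r, alpha=0.5, gamma=0.9):
    """One PCQL sample backup (Line 4 of Alg. 2):
       Q[sa] <- (1 - alpha) * Q[sa]  ⊕  alpha * (r + gamma * ConQ[s'])."""
    scaled_old = {c: (1 - alpha) * v for c, v in Q_sa.items()}
    scaled_tgt = {c: alpha * (r + gamma * v) for c, v in conQ_snext.items()}
    return isum(scaled_old, scaled_tgt)
```

Each surviving cell keeps its own blended Q-value, so conflicting bootstrapped targets never get averaged into a single delusional estimate.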
We use a heuristic that maintains a global partition of Θ, with each cell X holding a regressor ConQ(s, a; w_X), for w_X ∈ Θ, predicting the consistent Q-value at s, a (see details in Sec. 5 and Appendix A.8). The method converges with eight cells corresponding to the realizable policies. The policy (equivalence class) is ConQ(s1, π_X(s1); w_X), where π_X(s1) is the cell's action at s1; the value is w_X · φ(s, π_X(s1)). The cell X* with the largest such value at s1 is indeed the optimal realizable policy: it takes a1 at s1 and s2, and a2 elsewhere. The regressor w_{X*} ≈ (−2, 0.5) fits the consistent Q-values perfectly, yielding optimal (policy-consistent) Q-values, because ConQ need not make tradeoffs to fit inconsistent values.

² PCQL uses the same type of initialization and optimal policy extraction as PCVI; details are omitted.

4.3 Unification of value- and policy-based RL

We can in some sense interpret PCQL (and, from the perspective of model-based approximate dynamic programming, PCVI) as unifying value- and policy-based RL. One prevalent view of value-based RL methods with function approximation, such as Q-learning, is to find an approximate value function or Q-function (VF/QF) with low Bellman error (BE), i.e., where the (pointwise) difference between the approximate VF/QF and its Bellman backup is small.
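As a point of reference, the pointwise Bellman error of an approximate Q-function can be written as follows; this is the standard textbook definition, not notation introduced in this paper:

```latex
\mathrm{BE}(\hat{Q})(s,a) \;=\; \hat{Q}(s,a) \;-\;
\Big( R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s,a)\, \max_{a'} \hat{Q}(s',a') \Big)
```

"Low BE" then means this residual is small at every state-action pair, e.g., in max norm \(\|\mathrm{BE}(\hat{Q})\|_\infty\).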
In approximate dynamic programming one often tries to minimize this directly, while in Q-learning one usually fits a regression model to minimize the mean-squared temporal difference (a sampled form of Bellman error minimization) over a training data set. One reason for this emphasis on small BE is that the max norm of BE can be used to directly bound the (max-norm) difference between the value of the greedy policy induced by the approximate VF/QF and the value of the optimal policy. It is this difference in performance that is of primary interest.

Unfortunately, the bounds on induced policy quality using the BE approximation are quite loose, typically 2‖BE‖∞/(1 − γ) (see [6]; bounds in ℓp norm are similar [20]). As such, minimizing BE does not generally provide policy guarantees of practical import (see, e.g., [11]). As the cases above (and in the appendix) involving delusional bias show, a small BE can in fact be rather misleading with respect to induced policy quality. For example, Q-learning, using least squares to minimize the TD error as a proxy for BE, often produces policies of poor quality.

PCQL and PCVI take a different perspective, embracing the fact that the VF/QF approximator strictly limits the class of greedy policies that can be realized. In these algorithms, no Bellman backup or Q-update ever involves values that cannot be realized by an admissible policy. This will often result in VFs/QFs with greater BE than their classical counterparts. But, in the exact tabular case, we derive the true value of the induced (approximator-constrained) policy and guarantee that it is optimal. In the regression case (see Sec.
5), we might view this as attempting to minimize BE within the class of\nadmissible policies, since we only regress toward policy-consistent values.\nThe use of information sets and consistent cells effectively means that PCQL and PCVI are engaging\nin policy search\u2014indeed, in the algorithms presented here, they can be viewed as enumerating all\nconsistent policies (in the case of Q-learning, distinguishing only those that might differ on sampled\ndata). In contrast to other policy-search methods (e.g., policy gradient), both PCQL and PCVI use\n(sampled or full) Bellman backups to direct the search through policy space, while simultaneously\nusing policy constraints to limit the Bellman backups that are actually realized. They also use these\nvalues to select an optimal policy from the feasible policies generated within each cell.\n\n5 Toward practical non-delusional Q-learning\n\nThe PCVI and PCQL algorithms can be viewed as constructs that demonstrate how delusion arises\nand how it can be eliminated in Q-learning and approximate dynamic programming by preventing\ninadmissible policy choices from in\ufb02uencing Q-values. However, the algorithms maintain information\nsets and partition functions, which is impractical with massive state and action sets. In this section,\nwe suggest several heuristic methods that allow the propagation of some dependency information in\npractical Q-learning to mitigate the effects of delusional bias.\nMultiple regressors: With multiple information sets (or cells), we no longer have a unique set of\nlabels with which to \ufb01t an approximate Q-function regressor (e.g., DNN or linear approximator).\nInstead, each cell has its own set of labels. 
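The per-cell fitting step just described can be sketched as follows; this is a hypothetical illustration (the names `fit_cell_regressors` and the data layout are ours), assuming a linear approximator fit to each cell's own policy-consistent labels by least squares, not the paper's implementation.

```python
# Hypothetical sketch of the "multiple regressors" heuristic: one linear
# Q-regressor per information-set cell, each fit only to that cell's own
# policy-consistent labels.
import numpy as np

def fit_cell_regressors(cell_labels, phi):
    """cell_labels: {cell_id: [((s, a), q_label), ...]}.
    phi(s, a): feature vector for state-action pair (s, a).
    Returns {cell_id: least-squares weight vector w} so that the cell's
    Q-estimate is w . phi(s, a)."""
    weights = {}
    for cell_id, examples in cell_labels.items():
        X = np.array([phi(s, a) for (s, a), _ in examples])
        y = np.array([q for _, q in examples])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        weights[cell_id] = w
    return weights
```

Because each regressor only ever sees labels that are consistent with its cell's policy constraints, it never has to trade off between values justified by incompatible action choices.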
Thus, if we maintain a global collection of cells, each with its own Q-regressor, we have a set of approximate Q-functions that give both a compact representation and the ability to generalize across state-action pairs for any set of policy-consistent assumptions. This works in both batch and pure online Q-learning (see Appendix A.8 for details).

The main challenge is the proliferation of information sets. One obvious way to address this is to simply limit the total number of cells and regressors: given the current set of regressors, at any update, we first create the (larger number of) new cells needed for the new examples, fit the regressor for each new consistent cell, then prune cells according to some criterion to keep the total number of regressors manageable. This is effectively a search through the space of information sets and can be managed using a variety of methods (branch-and-bound, beam search, etc.). Criteria for generating, sampling and/or pruning cells can involve: (a) the magnitude of the Q-labels (higher expected values are better); (b) the constraints imposed by the cell (less restrictive is better, since it minimizes future inconsistency); and (c) the diversity of the cell assignments (since the search frontier is used to manage "backtracking").

If cell search maintains a restricted frontier, our cells may no longer cover all of policy space (i.e., Q is no longer a partition of Θ). This runs the risk that some future Q-updates may not be consistent with any cell. If we simply ignore such updates, the approach is hyper-vigilant, guaranteeing policy-class consistency at the expense of losing training data. An alternative relaxed approach is to merge cells to maintain a full partition of policy space (or prune cells and in some other fashion relax the constraints of the remaining cells to recover a partition).
This relaxed approach ensures that all training data is used, but risks allowing some delusion to creep into values by not strictly enforcing all Q-value dependencies.

Q-learning with locally consistent data: An alternative approach is to simply maintain a single regressor, but ensure that any batch of Q-labels is self-consistent before updating the regressor. Specifically, given a batch of training data and the current regressor, we first create a single set of consistent labels for each example (see below), then update the regressor using these labels. With no information sets, the dependencies that justified the previous regressor are not accounted for when constructing the new labels. This may allow delusion to creep in; but the aim is that this heuristic approach may mitigate its effects, since each new regressor is at least "locally" consistent with respect to its own updates. Ideally, this will keep the sequence of approximations in a region of θ-space where delusional bias is less severe. Apart from the use of a consistent labeling procedure, this approach incurs no extra overhead relative to Q-learning.

Oracles and consistent labeling: The first approach above requires an oracle, Witness, to test consistency of policy choices, which is tractable for linear approximators (a linear feasibility test), but requires solving an integer-quadratic program when using DQN (e.g., a ReLU network). The second approach needs some means for generating consistent labels. Given a batch of examples B = {(s_t, a_t, r_t, s′_t)}_{t=1}^T and a current regressor Q̃, labels are generated by selecting an a′_t for each s′_t as the max. The selection should satisfy: (a) ∩_t [s′_t ↦ a′_t] ≠ ∅ (i.e., the selected max actions are mutually consistent); and (b) [s_t ↦ a_t] ∩ [s′_t ↦ a′_t] ≠ ∅ for all t (i.e., the choice at s′_t is consistent with taking a_t at s_t). We can find a consistent labeling maximizing some objective (e.g., the sum of the resulting labels), subject to these constraints. For a linear approximator, the problem can be formulated as a (linear) mixed integer program (MIP), and is amenable to several heuristics (see Appendix A.9).

6 Conclusion

We have identified delusional bias, a fundamental problem in Q-learning and approximate dynamic programming with function approximation or other policy constraints. Delusion manifests itself in different ways that lead to poor approximation quality or divergence for reasons quite independent of approximation error itself. Delusional bias thus becomes an important entry in the catalog of risks that emerge in the deployment of Q-learning. We have developed and analyzed a new policy-class consistent backup operator, and the corresponding model-based PCVI and model-free PCQL algorithms, that fully remove delusional bias. We also suggested several practical heuristics for large-scale RL problems to mitigate the effect of delusional bias.

A number of important directions remain. The further development and testing of practical heuristics for policy-class consistent updates, as well as large-scale experiments on well-known benchmarks, are critical. This is also important for identifying the prevalence of delusional bias in practice. Further development of practical consistency oracles for DNNs and consistent label generation is also of interest. We are also engaged in a more systematic study of the discounting paradox and the use of the discount factor as a hyper-parameter.

References

[1] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the International Conference on Machine Learning (ICML-95), pages 30–37, 1995.

[2] Peter L. Bartlett, Nick Harvey, Chris Liaw, and Abbas Mehrabian.
Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv:1703.02930, 2017.

[3] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013.

[4] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML-17), 2017.

[5] Marc G. Bellemare, Georg Ostrovski, Arthur Guez, Philip S. Thomas, and Rémi Munos. Increasing the action gap: New operators for reinforcement learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16), pages 1476–1483, Québec City, QC, 2016.

[6] Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-dynamic Programming. Athena, Belmont, MA, 1996.

[7] Avrim Blum and Ronald L. Rivest. Training a 3-node neural network is NP-complete. In COLT, pages 9–18, 1988.

[8] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In G. Tesauro, D. S. Touretzky, and T. K. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS-94), pages 369–376. MIT Press, Cambridge, 1995.

[9] Daniela Pucci de Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 51(6):850–865, 2003.

[10] Rick Durrett. Probability: Theory and Examples. Cambridge University Press, 2013.

[11] Matthieu Geist, Bilal Piot, and Olivier Pietquin. Is the Bellman residual a bad proxy? In Advances in Neural Information Processing Systems 30 (NIPS-17), pages 3208–3217, Long Beach, CA, 2017.

[12] Geoffrey J. Gordon. Stable function approximation in dynamic programming.
In Proceedings of the Twelfth International Conference on Machine Learning (ICML-95), pages 261–268, Lake Tahoe, 1995.

[13] Geoffrey J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.

[14] Matteo Hessel, Joseph Modayil, Hado van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. arXiv:1710.02298, 2017.

[15] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems (AAMAS-15), pages 1181–1189, Istanbul, Turkey, 2015.

[16] Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Proceedings of the Thirty-second AAAI Conference on Artificial Intelligence (AAAI-18), 2018.

[17] Hamid Maei, Csaba Szepesvári, Shalabh Bhatnagar, and Richard S. Sutton. Toward off-policy learning control with function approximation. In International Conference on Machine Learning, pages 719–726, Haifa, Israel, 2010.

[18] Francisco Melo and M. Isabel Ribeiro. Q-learning with linear function approximation. In Proceedings of the International Conference on Computational Learning Theory (COLT), pages 308–322, 2007.

[19] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei Rusu, Joel Veness, Marc Bellemare, Alex Graves, Martin Riedmiller, Andreas Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[20] Rémi Munos. Performance bounds in lp-norm for approximate value iteration.
SIAM Journal on Control and Optimization, 46(2):541–561, 2007.

[21] Rémi Munos, Thomas Stepleton, Anna Harutyunyan, and Marc G. Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems 29 (NIPS-16), pages 1046–1054, Barcelona, 2016.

[22] Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 30 (NIPS-17), pages 1476–1483, Long Beach, CA, 2017.

[23] Marek Petrik. Optimization-based Approximate Dynamic Programming. PhD thesis, University of Massachusetts, 2010.

[24] Norbert Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, July 1972.

[25] Saharon Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41:247–261, April 1972.

[26] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 2018.

[27] Csaba Szepesvári and William Smart. Interpolation-based Q-learning. In Proceedings of the International Conference on Machine Learning (ICML-04), 2004.

[28] Gerald Tesauro. Practical issues in temporal difference learning. Machine Learning, 8(3):257–277, May 1992.

[29] John N. Tsitsiklis and Benjamin Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42:674–690, 1997.

[30] Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems 23 (NIPS-10), pages 2613–2621, Vancouver, BC, 2010.

[31] Vladimir N. Vapnik. Statistical Learning Theory. Wiley-Interscience, September 1998.

[32] Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, and Nando de Freitas.
Dueling network architectures for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning (ICML-16), 2016.

[33] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.

[34] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8:279–292, 1992.