{"title": "Double Q-learning", "book": "Advances in Neural Information Processing Systems", "page_first": 2613, "page_last": 2621, "abstract": "In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values. These overestimations result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.", "full_text": "Double Q-learning\n\nHado van Hasselt\n\nMulti-agent and Adaptive Computation Group\n\nCentrum Wiskunde & Informatica\n\nAbstract\n\nIn some stochastic environments the well-known reinforcement learning algo-\nrithm Q-learning performs very poorly. This poor performance is caused by large\noverestimations of action values. These overestimations result from a positive\nbias that is introduced because Q-learning uses the maximum action value as an\napproximation for the maximum expected action value. We introduce an alter-\nnative way to approximate the maximum expected value for any set of random\nvariables. The obtained double estimator method is shown to sometimes under-\nestimate rather than overestimate the maximum expected value. We apply the\ndouble estimator to Q-learning to construct Double Q-learning, a new off-policy\nreinforcement learning algorithm. 
We show the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimation.

1 Introduction

Q-learning is a popular reinforcement learning algorithm that was proposed by Watkins [1] and can be used to optimally solve Markov Decision Processes (MDPs) [2]. We show that Q-learning's performance can be poor in stochastic MDPs because of large overestimations of the action values. We discuss why this occurs and propose an algorithm called Double Q-learning to avoid this overestimation. The update of Q-learning is

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left(r_t + \gamma \max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)\right) . \quad (1)$$

In this equation, $Q_t(s, a)$ gives the value of the action $a$ in state $s$ at time $t$. The reward $r_t$ is drawn from a fixed reward distribution $R : S \times A \times S \to \mathbb{R}$, where $E\{r_t \mid (s, a, s') = (s_t, a_t, s_{t+1})\} = R^{s'}_{sa}$. The next state $s_{t+1}$ is determined by a fixed state transition distribution $P : S \times A \times S \to [0, 1]$, where $P^{s'}_{sa}$ gives the probability of ending up in state $s'$ after performing $a$ in $s$, and $\sum_{s'} P^{s'}_{sa} = 1$. The learning rate $\alpha_t(s, a) \in [0, 1]$ ensures that the update averages over possible randomness in the rewards and transitions in order to converge in the limit to the optimal action value function. This optimal value function is the solution to the following set of equations [3]:

$$\forall s, a : Q^*(s, a) = \sum_{s'} P^{s'}_{sa}\left(R^{s'}_{sa} + \gamma \max_a Q^*(s', a)\right) . \quad (2)$$

The discount factor $\gamma \in [0, 1)$ has two interpretations. First, it can be seen as a property of the problem that is to be solved, weighing immediate rewards more heavily than later rewards. 
Second, in non-episodic tasks, the discount factor makes sure that every action value is finite and therefore well-defined. It has been proven that Q-learning reaches the optimal value function $Q^*$ with probability one in the limit under some mild conditions on the learning rates and exploration policy [4-6].

Q-learning has been used to find solutions to many problems [7-9] and was an inspiration to similar algorithms, such as Delayed Q-learning [10], Phased Q-learning [11] and Fitted Q-iteration [12], to name some. These variations have mostly been proposed in order to speed up convergence rates compared to the original algorithm. The convergence rate of Q-learning can be exponential in the number of experiences [13], although this depends on the learning rates; with a proper choice of learning rates, convergence in polynomial time can be obtained [14]. The variants named above can also claim polynomial time convergence.

Contributions An important aspect of the Q-learning algorithm has been overlooked in previous work: the use of the max operator to determine the value of the next state can cause large overestimations of the action values. We show that Q-learning can suffer a large performance penalty because of a positive bias that results from using the maximum value as approximation for the maximum expected value. We propose an alternative double estimator method to find an estimate for the maximum value of a set of stochastic values and we show that this sometimes underestimates rather than overestimates the maximum expected value. We use this to construct the new Double Q-learning algorithm.

The paper is organized as follows. In the second section, we analyze two methods to approximate the maximum expected value of a set of random variables. In Section 3 we present the Double Q-learning algorithm, which extends our analysis in Section 2 and avoids overestimations. 
The new algorithm is proven to converge to the optimal solution in the limit. In Section 4 we show the results of some experiments that compare these algorithms. Some general discussion is presented in Section 5 and Section 6 concludes the paper with some pointers to future work.

2 Estimating the Maximum Expected Value

In this section, we analyze two methods to find an approximation for the maximum expected value of a set of random variables. The single estimator method uses the maximum of a set of estimators as an approximation. This approach to approximating the maximum expected value is positively biased, as discussed in previous work in economics [15] and decision making [16]. It is a bias related to the Winner's Curse in auctions [17, 18] and it can be shown to follow from Jensen's inequality [19]. The double estimator method uses two estimates for each variable and uncouples the selection of an estimator from its value. We are unaware of previous work that discusses it. We analyze this method and show that it can have a negative bias.

Consider a set of $M$ random variables $X = \{X_1, \ldots, X_M\}$. In many problems, one is interested in the maximum expected value of the variables in such a set:

$$\max_i E\{X_i\} . \quad (3)$$

Without knowledge of the functional form and parameters of the underlying distributions of the variables in $X$, it is impossible to determine (3) exactly. Most often, this value is approximated by constructing approximations for $E\{X_i\}$ for all $i$. Let $S = \bigcup_{i=1}^{M} S_i$ denote a set of samples, where $S_i$ is the subset containing samples for the variable $X_i$. We assume that the samples in $S_i$ are independent and identically distributed (iid). Unbiased estimates for the expected values can be obtained by computing the sample average for each variable: $E\{X_i\} = E\{\mu_i\} \approx \mu_i(S) \stackrel{\text{def}}{=} \frac{1}{|S_i|}\sum_{s \in S_i} s$, where $\mu_i$ is an estimator for variable $X_i$. This approximation is unbiased since every sample $s \in S_i$ is an unbiased estimate for the value of $E\{X_i\}$. The error in the approximation thus consists solely of the variance in the estimator and decreases when we obtain more samples.

We use the following notations: $f_i$ denotes the probability density function (PDF) of the $i$th variable $X_i$ and $F_i(x) = \int_{-\infty}^{x} f_i(x)\,dx$ is the cumulative distribution function (CDF) of this PDF. Similarly, the PDF and CDF of the $i$th estimator are denoted $f^{\mu}_i$ and $F^{\mu}_i$. The maximum expected value can be expressed in terms of the underlying PDFs as $\max_i E\{X_i\} = \max_i \int_{-\infty}^{\infty} x\, f_i(x)\,dx$.

2.1 The Single Estimator

An obvious way to approximate the value in (3) is to use the value of the maximal estimator:

$$\max_i E\{X_i\} = \max_i E\{\mu_i\} \approx \max_i \mu_i(S) . \quad (4)$$

Because we contrast this method later with a method that uses two estimators for each variable, we call this method the single estimator. Q-learning uses this method to approximate the value of the next state by maximizing over the estimated action values in that state.

The maximal estimator $\max_i \mu_i$ is distributed according to some PDF $f^{\mu}_{\max}$ that is dependent on the PDFs of the estimators $f^{\mu}_i$. To determine this PDF, consider the CDF $F^{\mu}_{\max}(x)$, which gives the probability that the maximum estimate is lower or equal to $x$. This probability is equal to the probability that all the estimates are lower or equal to $x$: $F^{\mu}_{\max}(x) \stackrel{\text{def}}{=} P(\max_i \mu_i \leq x) = \prod_{i=1}^{M} P(\mu_i \leq x) \stackrel{\text{def}}{=} \prod_{i=1}^{M} F^{\mu}_i(x)$. The value $\max_i \mu_i(S)$ is an unbiased estimate for $E\{\max_j \mu_j\} = \int_{-\infty}^{\infty} x\, f^{\mu}_{\max}(x)\,dx$, which can thus be given by

$$E\{\max_j \mu_j\} = \int_{-\infty}^{\infty} x\, \frac{d}{dx} \prod_{i=1}^{M} F^{\mu}_i(x)\,dx = \sum_{j}^{M} \int_{-\infty}^{\infty} x\, f^{\mu}_j(x) \prod_{i \neq j}^{M} F^{\mu}_i(x)\,dx . \quad (5)$$

However, in (3) the order of the max operator and the expectation operator is the other way around. This makes the maximal estimator $\max_i \mu_i(S)$ a biased estimate for $\max_i E\{X_i\}$. This result has been proven in previous work [16]. A generalization of this proof is included in the supplementary material accompanying this paper.

2.2 The Double Estimator

The overestimation that results from the single estimator approach can have a large negative impact on algorithms that use this method, such as Q-learning. Therefore, we look at an alternative method to approximate $\max_i E\{X_i\}$. We refer to this method as the double estimator, since it uses two sets of estimators: $\mu^A = \{\mu^A_1, \ldots, \mu^A_M\}$ and $\mu^B = \{\mu^B_1, \ldots, \mu^B_M\}$.

Both sets of estimators are updated with a subset of the samples we draw, such that $S = S^A \cup S^B$ and $S^A \cap S^B = \emptyset$, and $\mu^A_i(S) = \frac{1}{|S^A_i|}\sum_{s \in S^A_i} s$ and $\mu^B_i(S) = \frac{1}{|S^B_i|}\sum_{s \in S^B_i} s$. Like the single estimator $\mu_i$, both $\mu^A_i$ and $\mu^B_i$ are unbiased if we assume that the samples are split in a proper manner, for instance randomly, over the two sets of estimators. Let $Max^A(S) \stackrel{\text{def}}{=} \{j \mid \mu^A_j(S) = \max_i \mu^A_i(S)\}$ be the set of maximal estimates in $\mu^A(S)$. Since $\mu^B$ is an independent, unbiased set of estimators, we have $E\{\mu^B_j\} = E\{X_j\}$ for all $j$, including all $j \in Max^A$. Let $a^*$ be an estimator that maximizes $\mu^A$: $\mu^A_{a^*}(S) \stackrel{\text{def}}{=} \max_i \mu^A_i(S)$. If there are multiple estimators that maximize $\mu^A$, we can for instance pick one at random. Then we can use $\mu^B_{a^*}$ as an estimate for $\max_i E\{\mu^B_i\}$ and therefore also for $\max_i E\{X_i\}$ and we obtain the approximation

$$\max_i E\{X_i\} = \max_i E\{\mu^B_i\} \approx \mu^B_{a^*} . \quad (6)$$

As we gain more samples the variance of the estimators decreases. In the limit, $\mu^A_i(S) = \mu^B_i(S) = E\{X_i\}$ for all $i$ and the approximation in (6) converges to the correct result.

Assume that the underlying PDFs are continuous. The probability $P(j = a^*)$ for any $j$ is then equal to the probability that all $i \neq j$ give lower estimates. Thus $\mu^A_j(S) = x$ is maximal for some value $x$ with probability $\prod_{i \neq j}^{M} P(\mu^A_i < x)$. Integrating out $x$ gives $P(j = a^*) = \int_{-\infty}^{\infty} P(\mu^A_j = x) \prod_{i \neq j}^{M} P(\mu^A_i < x)\,dx \stackrel{\text{def}}{=} \int_{-\infty}^{\infty} f^A_j(x) \prod_{i \neq j}^{M} F^A_i(x)\,dx$, where $f^A_i$ and $F^A_i$ are the PDF and CDF of $\mu^A_i$. The expected value of the approximation by the double estimator can thus be given by

$$E\{\mu^B_{a^*}\} \stackrel{\text{def}}{=} \sum_{j}^{M} P(j = a^*)\,E\{\mu^B_j\} = \sum_{j}^{M} E\{\mu^B_j\} \int_{-\infty}^{\infty} f^A_j(x) \prod_{i \neq j}^{M} F^A_i(x)\,dx . \quad (7)$$

For discrete PDFs the probability that two or more estimators are equal should be taken into account and the integrals should be replaced with sums. These changes are straightforward.

Comparing (7) to (5), we see the difference is that the double estimator uses $E\{\mu^B_j\}$ in place of $x$. The single estimator overestimates, because $x$ is within the integral and therefore correlates with the monotonically increasing product $\prod_{i \neq j} F^{\mu}_i(x)$. 
The double estimator underestimates because the probabilities $P(j = a^*)$ sum to one and therefore the approximation is a weighted estimate of unbiased expected values, which must be lower or equal to the maximum expected value. In the following lemma, which holds in both the discrete and the continuous case, we prove in general that the estimate $E\{\mu^B_{a^*}\}$ is not an unbiased estimate of $\max_i E\{X_i\}$.

Lemma 1. Let $X = \{X_1, \ldots, X_M\}$ be a set of random variables and let $\mu^A = \{\mu^A_1, \ldots, \mu^A_M\}$ and $\mu^B = \{\mu^B_1, \ldots, \mu^B_M\}$ be two sets of unbiased estimators such that $E\{\mu^A_i\} = E\{\mu^B_i\} = E\{X_i\}$, for all $i$. Let $\mathcal{M} \stackrel{\text{def}}{=} \{j \mid E\{X_j\} = \max_i E\{X_i\}\}$ be the set of elements that maximize the expected values. Let $a^*$ be an element that maximizes $\mu^A$: $\mu^A_{a^*} = \max_i \mu^A_i$. Then $E\{\mu^B_{a^*}\} = E\{X_{a^*}\} \leq \max_i E\{X_i\}$. Furthermore, the inequality is strict if and only if $P(a^* \notin \mathcal{M}) > 0$.

Proof. Assume $a^* \in \mathcal{M}$. Then $E\{\mu^B_{a^*}\} = E\{X_{a^*}\} \stackrel{\text{def}}{=} \max_i E\{X_i\}$. Now assume $a^* \notin \mathcal{M}$ and choose $j \in \mathcal{M}$. Then $E\{\mu^B_{a^*}\} = E\{X_{a^*}\} < E\{X_j\} \stackrel{\text{def}}{=} \max_i E\{X_i\}$. These two possibilities are mutually exclusive, so the combined expectation can be expressed as

$$\begin{aligned} E\{\mu^B_{a^*}\} &= P(a^* \in \mathcal{M})\,E\{\mu^B_{a^*} \mid a^* \in \mathcal{M}\} + P(a^* \notin \mathcal{M})\,E\{\mu^B_{a^*} \mid a^* \notin \mathcal{M}\} \\ &= P(a^* \in \mathcal{M}) \max_i E\{X_i\} + P(a^* \notin \mathcal{M})\,E\{\mu^B_{a^*} \mid a^* \notin \mathcal{M}\} \\ &\leq P(a^* \in \mathcal{M}) \max_i E\{X_i\} + P(a^* \notin \mathcal{M}) \max_i E\{X_i\} \\ &= \max_i E\{X_i\} , \end{aligned}$$

where the inequality is strict if and only if $P(a^* \notin \mathcal{M}) > 0$. This happens when the variables have different expected values, but their distributions overlap. 
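Lemma 1 and the single estimator's positive bias can be checked numerically. The sketch below is our own illustration, not from the paper: it uses $M = 10$ Gaussian variables with different but overlapping means spaced 0.1 apart, so $\max_i E\{X_i\} = 0.9$ exactly; the sample sizes and trial count are arbitrary choices.

```python
import random

def single_estimate(samples):
    # single estimator: max over the sample means (positively biased)
    return max(sum(s) / len(s) for s in samples)

def double_estimate(samples_a, samples_b):
    # double estimator: pick the argmax on set A, read its value off set B
    means_a = [sum(s) / len(s) for s in samples_a]
    a_star = means_a.index(max(means_a))
    return sum(samples_b[a_star]) / len(samples_b[a_star])

def bias_demo(m=10, n=20, trials=2000, seed=0):
    # Variables X_i ~ N(i/10, 1), so max_i E{X_i} = 0.9 exactly.
    rng = random.Random(seed)
    means = [i / 10 for i in range(m)]
    single_sum = double_sum = 0.0
    for _ in range(trials):
        sa = [[rng.gauss(mu, 1) for _ in range(n)] for mu in means]
        sb = [[rng.gauss(mu, 1) for _ in range(n)] for mu in means]
        single_sum += single_estimate([a + b for a, b in zip(sa, sb)])  # all of S
        double_sum += double_estimate(sa, sb)                           # split S
    return single_sum / trials, double_sum / trials
```

Averaged over many runs, the single estimate lands above the true value 0.9 while the double estimate lands below it, as Lemma 1 predicts: the distributions overlap, so $P(a^* \notin \mathcal{M}) > 0$.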
In contrast with the single estimator, the\ndouble estimator is unbiased when the variables are iid, since then all expected values are equal and\nP (a\u2217 \u2208 M) = 1.\n\n3 Double Q-learning\n\nWe can interpret Q-learning as using the single estimator to estimate the value of the next\nstate: maxa Qt(st+1, a) is an estimate for E{maxa Qt(st+1, a)}, which in turn approximates\nmaxa E{Qt(st+1, a)}. The expectation should be understood as averaging over all possible runs\nof the same experiment and not\u2014as it is often used in a reinforcement learning context\u2014as\nthe expectation over the next state, which we will encounter in the next subsection as E{\u00b7|Pt}.\nTherefore, maxa Qt(st+1, a) is an unbiased sample, drawn from an iid distribution with mean\nE{maxa Qt(st+1, a)}.\nIn the next section we show empirically that because of this Q-learning\ncan indeed suffer from large overestimations. In this section we present an algorithm to avoid these\noverestimation issues. The algorithm is called Double Q-learning and is shown in Algorithm 1.\n\nDouble Q-learning stores two Q functions: QA and QB. Each Q function is updated with a value\nfrom the other Q function for the next state. The action a\u2217 in line 6 is the maximal valued action\nin state s\u2032, according to the value function QA. However, instead of using the value QA(s\u2032, a\u2217) =\nmaxa QA(s\u2032, a) to update QA, as Q-learning would do, we use the value QB(s\u2032, a\u2217). Since QB was\nupdated on the same problem, but with a different set of experience samples, this can be considered\nan unbiased estimate for the value of this action. A similar update is used for QB, using b\u2217 and QA.\nIt is important that both Q functions learn from separate sets of experiences, but to select an action\nto perform one can use both value functions. Therefore, this algorithm is not less data-ef\ufb01cient than\nQ-learning. 
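The two cross-evaluated updates just described can be sketched in a few lines of tabular code. This is a minimal illustration of the idea behind Algorithm 1, with hypothetical state/action keys and a fixed learning rate, not the authors' implementation; terminal transitions (zero bootstrap target) are omitted for brevity.

```python
import random
from collections import defaultdict

def double_q_update(QA, QB, s, a, r, s_next, actions,
                    alpha=0.1, gamma=0.95, rng=random):
    """One Double Q-learning step: randomly pick a table to update,
    take the argmax action there, but evaluate it with the other table."""
    if rng.random() < 0.5:                                    # UPDATE(A)
        a_star = max(actions, key=lambda x: QA[(s_next, x)])  # argmax under QA
        QA[(s, a)] += alpha * (r + gamma * QB[(s_next, a_star)] - QA[(s, a)])
    else:                                                     # UPDATE(B)
        b_star = max(actions, key=lambda x: QB[(s_next, x)])  # argmax under QB
        QB[(s, a)] += alpha * (r + gamma * QA[(s_next, b_star)] - QB[(s, a)])

def epsilon_greedy(QA, QB, s, actions, epsilon, rng=random):
    # Acting may use both tables; here their sum (equivalent to the average).
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda x: QA[(s, x)] + QB[(s, x)])
```

`QA` and `QB` can be `defaultdict(float)` tables, so unseen state-action pairs start at zero.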
In our experiments, we calculated the average of the two Q values for each action and then performed ε-greedy exploration with the resulting average Q values.

Double Q-learning is not a full solution to the problem of finding the maximum of the expected values of the actions. Similar to the double estimator in Section 2, action $a^*$ may not be the action that maximizes the expected Q function $\max_a E\{Q^A(s', a)\}$. In general $E\{Q^B(s', a^*)\} \leq \max_a E\{Q^A(s', a)\}$, and underestimations of the action values can occur.

3.1 Convergence in the Limit

In this subsection we show that in the limit Double Q-learning converges to the optimal policy. Intuitively, this is what one would expect: Q-learning is based on the single estimator and Double Q-learning is based on the double estimator, and in Section 2 we argued that the estimates by the single and double estimator both converge to the same answer in the limit. However, this argument does not transfer immediately to bootstrapping action values, so we prove this result making use of the following lemma, which was also used to prove convergence of Sarsa [20].

Algorithm 1 Double Q-learning
1: Initialize $Q^A$, $Q^B$, $s$
2: repeat
3:   Choose $a$, based on $Q^A(s, \cdot)$ and $Q^B(s, \cdot)$, observe $r$, $s'$
4:   Choose (e.g. random) either UPDATE(A) or UPDATE(B)
5:   if UPDATE(A) then
6:     Define $a^* = \arg\max_a Q^A(s', a)$
7:     $Q^A(s, a) \leftarrow Q^A(s, a) + \alpha(s, a)\left(r + \gamma Q^B(s', a^*) - Q^A(s, a)\right)$
8:   else if UPDATE(B) then
9:     Define $b^* = \arg\max_a Q^B(s', a)$
10:    $Q^B(s, a) \leftarrow Q^B(s, a) + \alpha(s, a)\left(r + \gamma Q^A(s', b^*) - Q^B(s, a)\right)$
11:  end if
12:  $s \leftarrow s'$
13: until end

Lemma 2. Consider a stochastic process $(\zeta_t, \Delta_t, F_t)$, $t \geq 0$, where $\zeta_t, \Delta_t, F_t : X \to \mathbb{R}$ satisfy the equations:

$$\Delta_{t+1}(x_t) = (1 - \zeta_t(x_t))\Delta_t(x_t) + \zeta_t(x_t)F_t(x_t) , \quad (8)$$

where $x_t \in X$ and $t = 0, 1, 2, \ldots$. Let $P_t$ be a sequence of increasing $\sigma$-fields such that $\zeta_0$ and $\Delta_0$ are $P_0$-measurable and $\zeta_t$, $\Delta_t$ and $F_{t-1}$ are $P_t$-measurable, $t = 1, 2, \ldots$. Assume that the following hold: 1) The set $X$ is finite. 2) $\zeta_t(x_t) \in [0, 1]$, $\sum_t \zeta_t(x_t) = \infty$, $\sum_t (\zeta_t(x_t))^2 < \infty$ w.p.1 and $\forall x \neq x_t : \zeta_t(x) = 0$. 3) $\|E\{F_t \mid P_t\}\| \leq \kappa \|\Delta_t\| + c_t$, where $\kappa \in [0, 1)$ and $c_t$ converges to zero w.p.1. 4) $\mathrm{Var}\{F_t(x_t) \mid P_t\} \leq K(1 + \kappa \|\Delta_t\|)^2$, where $K$ is some constant. Here $\|\cdot\|$ denotes a maximum norm. Then $\Delta_t$ converges to zero with probability one.

We use this lemma to prove convergence of Double Q-learning under similar conditions as Q-learning. Our theorem is as follows:

Theorem 1. Assume the conditions below are fulfilled. Then, in a given ergodic MDP, both $Q^A$ and $Q^B$ as updated by Double Q-learning as described in Algorithm 1 will converge to the optimal value function $Q^*$ as given in the Bellman optimality equation (2) with probability one if an infinite number of experiences in the form of rewards and state transitions for each state action pair are given by a proper learning policy. The additional conditions are: 1) The MDP is finite, i.e. $|S \times A| < \infty$. 2) $\gamma \in [0, 1)$. 3) The Q values are stored in a lookup table. 4) Both $Q^A$ and $Q^B$ receive an infinite number of updates. 5) $\alpha_t(s, a) \in [0, 1]$, $\sum_t \alpha_t(s, a) = \infty$, $\sum_t (\alpha_t(s, a))^2 < \infty$ w.p.1, and $\forall (s, a) \neq (s_t, a_t) : \alpha_t(s, a) = 0$. 6) $\forall s, a, s' : \mathrm{Var}\{R^{s'}_{sa}\} < \infty$.

A 'proper' learning policy ensures that each state action pair is visited an infinite number of times. For instance, in a communicating MDP proper policies include a random policy.

Sketch of the proof. 
We sketch how to apply Lemma 2 to prove Theorem 1 without going into full technical detail. Because of the symmetry in the updates on the functions $Q^A$ and $Q^B$ it suffices to show convergence for either of these. We will apply Lemma 2 with $P_t = \{Q^A_0, Q^B_0, s_0, a_0, \alpha_0, r_1, s_1, \ldots, s_t, a_t\}$, $X = S \times A$, $\Delta_t = Q^A_t - Q^*$, $\zeta = \alpha$ and $F_t(s_t, a_t) = r_t + \gamma Q^B_t(s_{t+1}, a^*) - Q^*(s_t, a_t)$, where $a^* = \arg\max_a Q^A(s_{t+1}, a)$. It is straightforward to show the first two conditions of the lemma hold. The fourth condition of the lemma holds as a consequence of the boundedness condition on the variance of the rewards in the theorem.

This leaves to show that the third condition on the expected contraction of $F_t$ holds. We can write

$$F_t(s_t, a_t) = F^Q_t(s_t, a_t) + \gamma\left(Q^B_t(s_{t+1}, a^*) - Q^A_t(s_{t+1}, a^*)\right) ,$$

where $F^Q_t = r_t + \gamma Q^A_t(s_{t+1}, a^*) - Q^*(s_t, a_t)$ is the value of $F_t$ if normal Q-learning would be under consideration. It is well-known that $E\{F^Q_t \mid P_t\} \leq \gamma \|\Delta_t\|$, so to apply the lemma we identify $c_t = \gamma Q^B_t(s_{t+1}, a^*) - \gamma Q^A_t(s_{t+1}, a^*)$ and it suffices to show that $\Delta^{BA}_t = Q^B_t - Q^A_t$ converges to zero. Depending on whether $Q^B$ or $Q^A$ is updated, the update of $\Delta^{BA}_t$ at time $t$ is either

$$\Delta^{BA}_{t+1}(s_t, a_t) = \Delta^{BA}_t(s_t, a_t) + \alpha_t(s_t, a_t) F^B_t(s_t, a_t) , \text{ or}$$
$$\Delta^{BA}_{t+1}(s_t, a_t) = \Delta^{BA}_t(s_t, a_t) - \alpha_t(s_t, a_t) F^A_t(s_t, a_t) ,$$

where $F^A_t(s_t, a_t) = r_t + \gamma Q^B_t(s_{t+1}, a^*) - Q^A_t(s_t, a_t)$ and $F^B_t(s_t, a_t) = r_t + \gamma Q^A_t(s_{t+1}, b^*) - Q^B_t(s_t, a_t)$. We define $\zeta^{BA}_t = \frac{1}{2}\alpha_t$. 
Then

$$\begin{aligned} E\{\Delta^{BA}_{t+1}(s_t, a_t) \mid P_t\} &= \Delta^{BA}_t(s_t, a_t) + E\{\alpha_t(s_t, a_t) F^B_t(s_t, a_t) - \alpha_t(s_t, a_t) F^A_t(s_t, a_t) \mid P_t\} \\ &= (1 - \zeta^{BA}_t(s_t, a_t))\Delta^{BA}_t(s_t, a_t) + \zeta^{BA}_t(s_t, a_t)\,E\{F^{BA}_t(s_t, a_t) \mid P_t\} , \end{aligned}$$

where $E\{F^{BA}_t(s_t, a_t) \mid P_t\} = \gamma E\{Q^A_t(s_{t+1}, b^*) - Q^B_t(s_{t+1}, a^*) \mid P_t\}$. For this step it is important that the selection of whether to update $Q^A$ or $Q^B$ is independent of the sample (e.g. random).

Assume $E\{Q^A_t(s_{t+1}, b^*) \mid P_t\} \geq E\{Q^B_t(s_{t+1}, a^*) \mid P_t\}$. By definition of $a^*$ as given in line 6 of Algorithm 1 we have $Q^A_t(s_{t+1}, a^*) = \max_a Q^A_t(s_{t+1}, a) \geq Q^A_t(s_{t+1}, b^*)$ and therefore

$$\left|E\{F^{BA}_t(s_t, a_t) \mid P_t\}\right| = \gamma E\{Q^A_t(s_{t+1}, b^*) - Q^B_t(s_{t+1}, a^*) \mid P_t\} \leq \gamma E\{Q^A_t(s_{t+1}, a^*) - Q^B_t(s_{t+1}, a^*) \mid P_t\} \leq \gamma \left\|\Delta^{BA}_t\right\| .$$

Now assume $E\{Q^B_t(s_{t+1}, a^*) \mid P_t\} > E\{Q^A_t(s_{t+1}, b^*) \mid P_t\}$ and note that by definition of $b^*$ we have $Q^B_t(s_{t+1}, b^*) \geq Q^B_t(s_{t+1}, a^*)$. Then

$$\left|E\{F^{BA}_t(s_t, a_t) \mid P_t\}\right| = \gamma E\{Q^B_t(s_{t+1}, a^*) - Q^A_t(s_{t+1}, b^*) \mid P_t\} \leq \gamma E\{Q^B_t(s_{t+1}, b^*) - Q^A_t(s_{t+1}, b^*) \mid P_t\} \leq \gamma \left\|\Delta^{BA}_t\right\| .$$

Clearly, one of the two assumptions must hold at each time step and in both cases we obtain the desired result that $|E\{F^{BA}_t \mid P_t\}| \leq \gamma \|\Delta^{BA}_t\|$. Applying the lemma yields convergence of $\Delta^{BA}_t$ to zero, which in turn ensures that the original process also converges in the limit.

4 Experiments

This section contains results on two problems, as an illustration of the bias of Q-learning and as a first practical comparison with Double Q-learning. The settings are simple to allow an easy interpretation of what is happening. Double Q-learning scales to larger problems and continuous spaces in the same way as Q-learning, so our focus here is explicitly on the bias of the algorithms.

The settings are the gambling game of roulette and a small grid world. There is considerable randomness in the rewards, and as a result we will see that indeed Q-learning performs poorly. The discount factor was 0.95 in all experiments. We conducted two experiments on each problem. The learning rate was either linear: $\alpha_t(s, a) = 1/n_t(s, a)$, or polynomial: $\alpha_t(s, a) = 1/n_t(s, a)^{0.8}$. For Double Q-learning, $n_t(s, a) = n^A_t(s, a)$ if $Q^A$ is updated and $n_t(s, a) = n^B_t(s, a)$ if $Q^B$ is updated, where $n^A_t$ and $n^B_t$ store the number of updates for each action for the corresponding value function. The polynomial learning rate was shown in previous work to be better in theory and in practice [14].

4.1 Roulette

In roulette, a player chooses between 170 betting actions, including betting on a number, on either of the colors black or red, and so on. The payoff for each of these bets is chosen such that almost all bets have an expected payout of $\frac{1}{38} \cdot \$36 = \$0.947$ per dollar, resulting in an expected loss of $-\$0.053$ per play if we assume the player bets $1 every time.¹ We assume all betting actions transition back to the same state and there is one action that stops playing, yielding $0. We ignore the available funds of the player as a factor and assume he bets $1 each turn.

Figure 1 shows the mean action values over all actions, as found by Q-learning and Double Q-learning. Each trial consisted of a synchronous update of all 171 actions. 
After 100,000 trials, Q-learning with a linear learning rate values all betting actions at more than $20 and there is little progress. With polynomial learning rates the performance improves, but Double Q-learning converges much more quickly. The average estimates of Q-learning are not poor because of a few poorly estimated outliers: after 100,000 trials Q-learning valued all non-terminating actions between $22.63 and $22.67 for linear learning rates and between $9.58 and $9.64 for polynomial rates. In this setting Double Q-learning does not suffer from significant underestimations.

¹Only the so-called 'top line', which pays $6 per dollar when 00, 0, 1, 2 or 3 is hit, has a slightly lower expected value of $-\$0.079$ per dollar.

Figure 1: The average action values according to Q-learning and Double Q-learning when playing roulette. The 'walk-away' action is worth $0. Averaged over 10 experiments.

Figure 2: Results in the grid world for Q-learning and Double Q-learning. The first row shows average rewards per time step. The second row shows the maximal action value in the starting state S. Averaged over 10,000 experiments.

4.2 Grid World

Consider the small grid world MDP as shown in Figure 2. Each state has 4 actions, corresponding to the directions the agent can go. The starting state is in the lower left position and the goal state is in the upper right. Each time the agent selects an action that walks off the grid, the agent stays in the same state. Each non-terminating step, the agent receives a random reward of $-12$ or $+10$ with equal probability. In the goal state every action yields $+5$ and ends an episode. The optimal policy ends an episode after five actions, so the optimal average reward per step is $+0.2$. 
The exploration was ε-greedy with $\epsilon(s) = 1/\sqrt{n(s)}$, where $n(s)$ is the number of times state $s$ has been visited, assuring infinite exploration in the limit, which is a theoretical requirement for the convergence of both Q-learning and Double Q-learning. Such an ε-greedy setting is beneficial for Q-learning, since this implies that actions with large overestimations are selected more often than realistically valued actions. This can reduce the overestimation.

Figure 2 shows the average rewards in the first row and the maximum action value in the starting state in the second row. Double Q-learning performs much better in terms of its average rewards, but this does not imply that the estimations of the action values are accurate. The optimal value of the maximally valued action in the starting state is $5\gamma^4 - \sum_{k=0}^{3} \gamma^k \approx 0.36$, which is depicted in the second row of Figure 2 with a horizontal dotted line. We see Double Q-learning does not get much closer to this value in 10,000 learning steps than Q-learning. However, even if the error of the action values is comparable, the policies found by Double Q-learning are clearly much better.

5 Discussion

We note an important difference between the well-known heuristic exploration technique of optimism in the face of uncertainty [21, 22] and the overestimation bias. Optimism about uncertain events can be beneficial, but Q-learning can overestimate actions that have been tried often and the estimations can be higher than any realistic optimistic estimate. For instance, in roulette our initial action value estimate of $0 can be considered optimistic, since no action has an actual expected value higher than this. However, even after trying 100,000 actions Q-learning on average estimated each gambling action to be worth almost $10. 
In contrast, although Double Q-learning can underestimate the values of some actions, it is easy to set the initial action values high enough to ensure optimism for actions that have experienced limited updates. Therefore, the use of the technique of optimism in the face of uncertainty can be thought of as an orthogonal concept to the over- and underestimation that is the topic of this paper.

The analysis in this paper is not only applicable to Q-learning. For instance, in a recent paper on multi-armed bandit problems, methods were proposed to exploit structure in the form of the presence of clusters of correlated arms in order to speed up convergence and reduce total regret [23]. The value of such a cluster in itself is an estimation task and the proposed methods included taking the mean value, which would result in an underestimation of the actual value, and taking the maximum value, which is a case of the single estimator and results in an overestimation. It would be interesting to see how the double estimator approach fares in such a setting.

Although the settings in our experiments used stochastic rewards, our analysis is not limited to MDPs with stochastic reward functions. When the rewards are deterministic but the state transitions are stochastic, the same pattern of overestimations due to this noise can occur and the same conclusions continue to hold.

6 Conclusion

We have presented a new algorithm called Double Q-learning that uses a double estimator approach to determine the value of the next state. To our knowledge, this is the first off-policy value based reinforcement learning algorithm that does not have a positive bias in estimating the action values in stochastic environments. According to our analysis, Double Q-learning sometimes underestimates the action values, but does not suffer from the overestimation bias that Q-learning does. 
In a roulette game and a maze problem, Double Q-learning was shown to reach good performance levels much more quickly.

Future work  Interesting future work would include research to obtain more insight into the merits of the Double Q-learning algorithm. For instance, some preliminary experiments in the grid world showed that Q-learning performs even worse with higher discount factors, but Double Q-learning is virtually unaffected. Additionally, the fact that we can construct positively biased and negatively biased off-policy algorithms raises the question of whether it is also possible to construct an unbiased off-policy reinforcement-learning algorithm, without the high variance of unbiased on-policy Monte-Carlo methods [24]. Possibly, this can be done by estimating the size of the overestimation and deducting it from the estimate. Unfortunately, the size of the overestimation depends on the number of actions and the unknown distributions of the rewards and transitions, making this a non-trivial extension.

More analysis of the performance of Q-learning and related algorithms such as Fitted Q-iteration [12] and Delayed Q-learning [10] is desirable. For instance, Delayed Q-learning can suffer from similar overestimations, although it does have polynomial convergence guarantees. This is similar to the case of polynomial learning rates: although performance is improved from an exponential to a polynomial rate [14], the algorithm still suffers from the inherent overestimation bias due to the single estimator approach. Furthermore, it would be interesting to see how Fitted Double Q-iteration, Delayed Double Q-learning, and other double estimator extensions of Q-learning perform in practice.

Acknowledgments

The authors wish to thank Marco Wiering and Gerard Vreeswijk for helpful comments.
This research was made possible thanks to grant 612.066.514 of the Dutch organization for scientific research (Nederlandse Organisatie voor Wetenschappelijk Onderzoek, NWO).

References

[1] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, England, 1989.

[2] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.

[3] R. Bellman. Dynamic Programming. Princeton University Press, 1957.

[4] T. Jaakkola, M. I. Jordan, and S. P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6:1185–1201, 1994.

[5] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.

[6] M. L. Littman and C. Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning (ICML-96), pages 310–318, Bari, Italy, 1996. Morgan Kaufmann.

[7] R. H. Crites and A. G. Barto. Improving elevator performance using reinforcement learning. In D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1017–1023, Cambridge MA, 1996. MIT Press.

[8] W. D. Smart and L. P. Kaelbling. Effective reinforcement learning for mobile robots. In Proceedings of the 2002 IEEE International Conference on Robotics and Automation (ICRA 2002), pages 3404–3410, Washington, DC, USA, 2002.

[9] M. A. Wiering and H. P. van Hasselt. Ensemble algorithms in reinforcement learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 38(4):930–936, 2008.

[10] A. L. Strehl, L. Li, E. Wiewiora, J. Langford, and M. L. Littman. PAC model-free reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 881–888. ACM, 2006.

[11] M. J.
Kearns and S. P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Neural Information Processing Systems 12, pages 996–1002. MIT Press, 1999.

[12] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(1):503–556, 2005.

[13] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In NIPS '97: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, pages 1064–1070, Cambridge, MA, USA, 1998. MIT Press.

[14] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5:1–25, 2003.

[15] E. Van den Steen. Rational overoptimism (and other biases). American Economic Review, 94(4):1141–1151, September 2004.

[16] J. E. Smith and R. L. Winkler. The optimizer's curse: Skepticism and postdecision surprise in decision analysis. Management Science, 52(3):311–322, 2006.

[17] E. Capen, R. Clapp, and T. Campbell. Bidding in high risk situations. Journal of Petroleum Technology, 23:641–653, 1971.

[18] R. H. Thaler. Anomalies: The winner's curse. Journal of Economic Perspectives, 2(1):191–202, Winter 1988.

[19] J. L. W. V. Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30(1):175–193, 1906.

[20] S. P. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 38(3):287–308, 2000.

[21] L. P. Kaelbling, M. L. Littman, and A. W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.

[22] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge MA, 1998.

[23] S. Pandey, D. Chakrabarti, and D. Agarwal.
Multi-armed bandit problems with dependent arms. In Proceedings of the 24th International Conference on Machine Learning, pages 721–728. ACM, 2007.

[24] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.