{"title": "Shallow Updates for Deep Reinforcement Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3135, "page_last": 3145, "abstract": "Deep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN) have achieved state-of-the-art results in a variety of challenging, high-dimensional domains. This success is mainly attributed to the power of deep neural networks to learn rich domain representations for approximating the value function or policy.  Batch reinforcement learning methods with linear representations, on the other hand, are more stable and require less hyper parameter tuning. Yet, substantial feature engineering is necessary to achieve good results. In this work we propose a hybrid approach -- the Least Squares Deep Q-Network (LS-DQN), which combines rich feature representations learned by a DRL algorithm with the stability of a linear least squares method. We do this by periodically re-training the last hidden layer of a DRL network with a batch least squares update. Key to our approach is a Bayesian regularization term for the least squares update, which prevents over-fitting to the more recent data. We tested LS-DQN on five Atari games and demonstrate significant improvement over vanilla DQN and Double-DQN. We also investigated the reasons for the superior performance of our method. Interestingly, we found that the performance improvement can be attributed to the large batch size used by the LS method when optimizing the last layer.", "full_text": "Shallow Updates for Deep Reinforcement Learning\n\nNir Levine\u2217\n\nTom Zahavy\u2217\n\nDept. of Electrical Engineering\n\nDept. of Electrical Engineering\n\nThe Technion - Israel Institute of Technology\n\nThe Technion - Israel Institute of Technology\n\nIsrael, Haifa 3200003\n\nlevin.nir1@gmail.com\n\nIsrael, Haifa 3200003\n\ntomzahavy@campus.technion.ac.il\n\nDaniel J. Mankowitz\n\nDept. 
of Electrical Engineering\n\nThe Technion - Israel Institute of Technology\n\nIsrael, Haifa 3200003\n\ndanielm@tx.technion.ac.il\n\nAviv Tamar\n\nDept. of Electrical Engineering and\nComputer Sciences, UC Berkeley\n\nBerkeley, CA 94720\n\navivt@berkeley.edu\n\nShie Mannor\n\nDept. of Electrical Engineering\n\nThe Technion - Israel Institute of Technology\n\nIsrael, Haifa 3200003\n\nshie@ee.technion.ac.il\n\n* Joint \ufb01rst authors. Ordered alphabetically by \ufb01rst name.\n\nAbstract\n\nDeep reinforcement learning (DRL) methods such as the Deep Q-Network (DQN)\nhave achieved state-of-the-art results in a variety of challenging, high-dimensional\ndomains. This success is mainly attributed to the power of deep neural networks to\nlearn rich domain representations for approximating the value function or policy.\nBatch reinforcement learning methods with linear representations, on the other\nhand, are more stable and require less hyper parameter tuning. Yet, substantial\nfeature engineering is necessary to achieve good results. In this work we propose a\nhybrid approach \u2013 the Least Squares Deep Q-Network (LS-DQN), which combines\nrich feature representations learned by a DRL algorithm with the stability of a\nlinear least squares method. We do this by periodically re-training the last hidden\nlayer of a DRL network with a batch least squares update. Key to our approach\nis a Bayesian regularization term for the least squares update, which prevents\nover-\ufb01tting to the more recent data. We tested LS-DQN on \ufb01ve Atari games and\ndemonstrate signi\ufb01cant improvement over vanilla DQN and Double-DQN. We also\ninvestigated the reasons for the superior performance of our method. 
Interestingly,\nwe found that the performance improvement can be attributed to the large batch\nsize used by the LS method when optimizing the last layer.\n\n1 Introduction\n\nReinforcement learning (RL) is a field of research that uses dynamic programming (DP; Bertsekas\n2008), among other approaches, to solve sequential decision making problems. The main challenge\nin applying DP to real world problems is an exponential growth of computational requirements as the\nproblem size increases, known as the curse of dimensionality (Bertsekas, 2008).\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nRL tackles the curse of dimensionality by approximating terms in the DP calculation such as the\nvalue function or policy. Popular function approximators for this task include deep neural networks,\nhenceforth termed deep RL (DRL), and linear architectures, henceforth termed shallow RL (SRL).\nSRL methods have enjoyed wide popularity over the years (see, e.g., Tsitsiklis et al. 1997; Bertsekas\n2008 for extensive reviews). In particular, batch algorithms based on a least squares (LS) approach,\nsuch as Least Squares Temporal Difference (LSTD, Lagoudakis & Parr 2003) and Fitted-Q Iteration\n(FQI, Ernst et al. 2005) are known to be stable and data efficient. However, the success of these\nalgorithms crucially depends on the quality of the feature representation. Ideally, the representation\nencodes rich, expressive features that can accurately represent the value function. However, in\npractice, finding such good features is difficult and often hampers the usage of linear function\napproximation methods.\nIn DRL, on the other hand, the features are learned together with the value function in a deep\narchitecture. 
Recent advancements in DRL using convolutional neural networks demonstrated learning\nof expressive features (Zahavy et al., 2016; Wang et al., 2016) and state-of-the-art performance in\nchallenging tasks such as video games (Mnih et al. 2015; Tessler et al. 2017; Mnih et al. 2016), and\nGo (Silver et al., 2016). To date, the most impressive DRL results (e.g., the works of Mnih et al.\n2015, Mnih et al. 2016) were obtained using online RL algorithms, based on a stochastic gradient\ndescent (SGD) procedure.\nOn the one hand, SRL is stable and data efficient. On the other hand, DRL learns powerful\nrepresentations. This motivates us to ask: can we combine DRL with SRL to leverage the benefits of\nboth?\nIn this work, we develop a hybrid approach that combines batch SRL algorithms with online DRL.\nOur main insight is that the last layer in a deep architecture can be seen as a linear representation,\nwith the preceding layers encoding features. Therefore, the last layer can be learned using standard\nSRL algorithms. Following this insight, we propose a method that repeatedly re-trains the last hidden\nlayer of a DRL network with a batch SRL algorithm, using data collected throughout the DRL run.\nWe focus on value-based DRL algorithms (e.g., the popular DQN of Mnih et al. 2015) and on\nSRL based on LS methods (footnote 1), and propose the Least Squares DQN algorithm (LS-DQN). Key to our\napproach is a novel regularization term for the least squares method that uses the DRL solution as a\nprior in a Bayesian least squares formulation. Our experiments demonstrate that this hybrid approach\nsignificantly improves performance on the Atari benchmark for several combinations of DRL and\nSRL methods.\nTo support our results, we performed an in-depth analysis to tease out the factors that make our hybrid\napproach outperform DRL. 
Interestingly, we found that the improved performance is mainly due to\nthe large batch size of SRL methods compared to the small batch size that is typical for DRL.\n\n2 Background\n\nIn this section we describe our RL framework and several shallow and deep RL algorithms that will\nbe used throughout the paper.\nRL Framework: We consider a standard RL formulation (Sutton & Barto, 1998) based on a Markov\nDecision Process (MDP). An MDP is a tuple \u27e8S, A, R, P, \u03b3\u27e9, where S is a finite set of states, A\nis a finite set of actions, and \u03b3 \u2208 [0, 1] is the discount factor. A transition probability function\nP : S \u00d7 A \u2192 \u2206_S maps states and actions to a probability distribution over next states. Finally,\nR : S \u00d7 A \u2192 [R_min, R_max] denotes the reward. The goal in RL is to learn a policy \u03c0 : S \u2192 \u2206_A that\nsolves the MDP by maximizing the expected discounted return E[\u2211_{t=0}^{\u221e} \u03b3^t r_t | \u03c0]. Value based RL\nmethods make use of the action value function Q^\u03c0(s, a) = E[\u2211_{t=0}^{\u221e} \u03b3^t r_t | s_t = s, a_t = a, \u03c0], which\nrepresents the expected discounted return of executing action a \u2208 A from state s \u2208 S and following\nthe policy \u03c0 thereafter. The optimal action value function Q^\u2217(s, a) obeys a fundamental recursion\nknown as the Bellman equation Q^\u2217(s, a) = E[r_t + \u03b3 max_{a'} Q^\u2217(s_{t+1}, a') | s_t = s, a_t = a].\n\nFootnote 1: Our approach can be generalized to other DRL/SRL variants.\n\n2.1 SRL algorithms\n\nLeast Squares Temporal Difference Q-Learning (LSTD-Q): LSTD (Barto & Crites, 1996) and\nLSTD-Q (Lagoudakis & Parr, 2003) are batch SRL algorithms. LSTD-Q learns a control policy \u03c0\nfrom a batch of samples by estimating a linear approximation \u02c6Q^\u03c0 = \u03a6w^\u03c0 of the action value function\nQ^\u03c0 \u2208 R^{|S||A|}, where w^\u03c0 \u2208 R^k are a set of weights and \u03a6 \u2208 R^{|S||A|\u00d7k} is a feature matrix. 
Each row\nof \u03a6 represents a feature vector for a state-action pair \u27e8s, a\u27e9. The weights w^\u03c0 are learned by enforcing\n\u02c6Q^\u03c0 to satisfy a fixed point equation w.r.t. the projected Bellman operator, resulting in a system of\nlinear equations Aw^\u03c0 = b, where A = \u03a6^T(\u03a6 \u2212 \u03b3P\u03a0_\u03c0\u03a6) and b = \u03a6^T R. Here, R \u2208 R^{|S||A|} is the\nreward vector, P \u2208 R^{|S||A|\u00d7|S|} is the transition matrix and \u03a0_\u03c0 \u2208 R^{|S|\u00d7|S||A|} is a matrix describing\nthe policy. Given a set of N_SRL samples D = {s_i, a_i, r_i, s_{i+1}}_{i=1}^{N_SRL}, we can approximate A and b\nwith the following empirical averages:\n\n\u02dcA = (1/N_SRL) \u2211_{i=1}^{N_SRL} [\u03c6(s_i, a_i)^T (\u03c6(s_i, a_i) \u2212 \u03b3 \u03c6(s_{i+1}, \u03c0(s_{i+1})))],   \u02dcb = (1/N_SRL) \u2211_{i=1}^{N_SRL} [\u03c6(s_i, a_i)^T r_i].   (1)\n\nThe weights w^\u03c0 can be calculated using a least squares minimization: \u02dcw^\u03c0 = arg min_w \u2016\u02dcAw \u2212 \u02dcb\u2016_2\nor by calculating the pseudo-inverse: \u02dcw^\u03c0 = \u02dcA^\u2020 \u02dcb. LSTD-Q is an off-policy algorithm: the same set of\nsamples D can be used to train any policy \u03c0 so long as \u03c0(s_{i+1}) is defined for every s_{i+1} in the set.\nFitted Q Iteration (FQI): The FQI algorithm (Ernst et al., 2005) is a batch SRL algorithm that computes\niterative approximations of the Q-function using regression. At iteration N of the algorithm, the set D\ndefined above and the approximation from the previous iteration Q_{N\u22121} are used to generate supervised\nlearning targets: y_i = r_i + \u03b3 max_{a'} Q_{N\u22121}(s_{i+1}, a'), \u2200i \u2208 {1, ..., N_SRL}. These targets are then used by a supervised\nlearning (regression) method to compute the next function in the sequence Q_N, by minimizing\nthe MSE loss Q_N = argmin_Q \u2211_{i=1}^{N_SRL} (Q(s_i, a_i) \u2212 (r_i + \u03b3 max_{a'} Q_{N\u22121}(s_{i+1}, a')))^2. For a linear\nfunction approximation Q_n(s, a) = \u03c6^T(s, a) w_n, LS can be used to give the FQI solution\nw_n = arg min_w \u2016\u02dcAw \u2212 \u02dcb\u2016_2^2, where \u02dcA, \u02dcb are given by:\n\n\u02dcA = (1/N_SRL) \u2211_{i=1}^{N_SRL} [\u03c6(s_i, a_i)^T \u03c6(s_i, a_i)],   \u02dcb = (1/N_SRL) \u2211_{i=1}^{N_SRL} [\u03c6(s_i, a_i)^T y_i].   (2)\n\nThe FQI algorithm can also be used with non-linear function approximations such as trees (Ernst\net al., 2005) and neural networks (Riedmiller, 2005). The DQN algorithm (Mnih et al., 2015) can be\nviewed as an online form of FQI.\n\n2.2 DRL algorithms\n\nDeep Q-Network (DQN): The DQN algorithm (Mnih et al., 2015) learns the Q function by minimizing\nthe mean squared error of the Bellman equation, defined as E_{s_t, a_t, r_t, s_{t+1}} \u2016Q_\u03b8(s_t, a_t) \u2212 y_t\u2016_2^2,\nwhere y_t = r_t + \u03b3 max_{a'} Q_{\u03b8_target}(s_{t+1}, a'). The DQN maintains two separate networks, namely\nthe current network with weights \u03b8 and the target network with weights \u03b8_target. Fixing the target\nnetwork makes the DQN algorithm equivalent to FQI (see the FQI MSE loss defined above), where\nthe regression algorithm is chosen to be SGD (RMSPROP, Hinton et al. 2012). The DQN is an\noff-policy learning algorithm. Therefore, the tuples \u27e8s_t, a_t, r_t, s_{t+1}\u27e9 that are used to optimize the\nnetwork weights are first collected from the agent\u2019s experience and are stored in an Experience Replay\n(ER) buffer (Lin, 1993) providing improved stability and performance.\nDouble DQN (DDQN): DDQN (Van Hasselt et al., 2016) is a modification of the DQN\nalgorithm that addresses overly optimistic estimates of the value function. 
This is achieved by\nperforming action selection with the current network \u03b8 and evaluating the action with the target\nnetwork, \u03b8_target, yielding the DDQN target update y_t = r_t if s_{t+1} is terminal, otherwise\ny_t = r_t + \u03b3 Q_{\u03b8_target}(s_{t+1}, argmax_a Q_\u03b8(s_{t+1}, a)).\n\n3 The LS-DQN Algorithm\n\nWe now present a hybrid approach for DRL with SRL updates (footnote 2). Our algorithm, the LS-DQN\nAlgorithm, periodically switches between training a DRL network and re-training its last hidden layer\nusing an SRL method (footnote 3).\nWe assume that the DRL algorithm uses a deep network for representing the Q function (footnote 4), where\nthe last layer is linear and fully connected. Such networks have been used extensively in deep RL\nrecently (e.g., Mnih et al. 2015; Van Hasselt et al. 2016; Mnih et al. 2016). In such a representation,\nthe last layer, which approximates the Q function, can be seen as a linear combination of features (the\noutput of the penultimate layer), and we propose to learn more accurate weights for it using SRL.\nExplicitly, the LS-DQN algorithm begins by training the weights of a DRL network, w_k, using a\nvalue-based DRL algorithm for N_DRL steps (Line 2). LS-DQN then updates the last hidden layer\nweights, w_k^last, by executing LS-UPDATE: retraining the weights using an SRL algorithm with N_SRL\nsamples (Line 3).\nThe LS-UPDATE consists of the following steps. First, data trajectories D for the batch update are\ngathered using the current network weights, w_k (Line 7). In practice, the current experience replay\ncan be used and no additional samples need to be collected. The algorithm next generates new\nfeatures \u03a6(s, a) from the data trajectories using the current DRL network with weights w_k. This\nstep guarantees that we do not use samples with inconsistent features, as the ER contains features\nfrom \u2018old\u2019 network weights. 
Computationally, this step requires running a forward pass of the deep\nnetwork for every sample in D, and can be performed quickly using parallelization.\nOnce the new features are generated, LS-DQN uses an SRL algorithm to re-calculate the weights of\nthe last hidden layer w_k^last (Line 9).\nWhile the LS-DQN algorithm is conceptually straightforward, we found that naively running it\nwith off-the-shelf SRL algorithms such as FQI or LSTD resulted in instability and a degradation of\nthe DRL performance. The reason is that the \u2018slow\u2019 SGD computation in DRL essentially retains\ninformation from older training epochs, while the batch SRL method \u2018forgets\u2019 all data but the most\nrecent batch. In the following, we propose a novel regularization method for addressing this issue.\n\nAlgorithm 1 LS-DQN Algorithm\nRequire: w_0\n1: for k = 1 \u00b7\u00b7\u00b7 SRLiters do\n2:     w_k \u2190 trainDRLNetwork(w_{k\u22121})    \u25b7 Train the DRL network for N_DRL steps\n3:     w_k^last \u2190 LS-UPDATE(w_k)    \u25b7 Update the last layer weights with the SRL solution\n4: end for\n5:\n6: function LS-UPDATE(w)\n7:     D \u2190 gatherData(w)\n8:     \u03a6(s, a) \u2190 generateFeatures(D, w)\n9:     w^last \u2190 SRL-Algorithm(D, \u03a6(s, a))\n10:     return w^last\n11: end function\n\nRegularization\n\nOur goal is to improve the performance of a value-based DRL agent using a batch SRL algorithm.\nBatch SRL algorithms, however, do not leverage the knowledge that the agent has gained before the\nmost recent batch (footnote 5). We observed that this issue prevents the use of off-the-shelf implementations of\nSRL methods in our hybrid LS-DQN algorithm.\n\nFootnote 2: Code is available online at https://github.com/Shallow-Updates-for-Deep-RL\nFootnote 3: We refer the reader to Appendix B for a diagram of the algorithm.\nFootnote 4: The features in the last DQN layer are not action dependent. We generate action-dependent features \u03a6(s, a)\nby zero-padding to a one-hot state-action feature vector. 
See Appendix E for more details.\n\nFootnote 5: While conceptually, the data batch can include all the data seen so far, due to computational limitations, this\nis not a practical solution in the domains we consider.\n\nTo enjoy the benefits of both worlds, that is, a batch algorithm that can use the accumulated knowledge\ngained by the DRL network, we introduce a novel Bayesian regularization method for LSTD-Q and\nFQI that uses the last hidden layer weights of the DRL network w_k^last as a Bayesian prior for the\nSRL algorithm (footnote 6).\nSRL Bayesian Prior Formulation: We are interested in learning the weights of the last hidden\nlayer (w^last), using a least squares SRL algorithm. We pursue a Bayesian approach, where the prior\nweights distribution at iteration k of LS-DQN is given by w_prior \u223c N(w_k^last, \u03bb^{\u22122}), and we recall that\nw_k^last are the last hidden layer weights of the DRL network at iteration SRLiter = k. The Bayesian\nsolution for the regression problem in the FQI algorithm is given by (Box & Tiao, 2011)\n\nw^last = (\u02dcA + \u03bbI)^{\u22121}(\u02dcb + \u03bbw_k^last),\n\nwhere \u02dcA and \u02dcb are given in Equation 2. A similar regularization can be added to LSTD-Q based on a\nregularized fixed point equation (Kolter & Ng, 2009). Full details are in Appendix A.\n\n4 Experiments\n\nIn this section, we present experiments showcasing the improved performance attained by our\nLS-DQN algorithm compared to state-of-the-art DRL methods. Our experiments are divided into three\nsections. In Section 4.1, we start by investigating the behavior of SRL algorithms in high dimensional\nenvironments. We then show results for the LS-DQN on five Atari domains, in Section 4.2, and\ncompare the resulting performance to regular DQN and DDQN agents. 
Finally, in Section 4.3,\nwe present an ablative analysis of the LS-DQN algorithm, which clarifies the reasons behind our\nalgorithm\u2019s success.\n\n4.1 SRL Algorithms with High Dimensional Observations\n\nIn the first set of experiments, we explore how least squares SRL algorithms perform in domains\nwith high dimensional observations. This is an important step before applying an SRL method within\nthe LS-DQN algorithm. In particular, we focused on answering the following questions: (1) What\nregularization method to use? (2) How to generate data for the LS algorithm? (3) How many policy\nimprovement iterations to perform?\nTo answer these questions, we performed the following procedure: We trained DQN agents on two\ngames from the Arcade Learning Environment (ALE, Bellemare et al. 2013); namely, Breakout and Qbert,\nusing the vanilla DQN implementation (Mnih et al., 2015). For each DQN run, we (1) periodically\n(footnote 7) save the current DQN network weights and ER; (2) use an SRL algorithm (LSTD-Q or FQI) to\nre-learn the weights of the last layer, and (3) evaluate the resulting DQN network by temporarily\nreplacing the DQN weights with the SRL solution weights. After the evaluation, we restore\nthe original DQN weights and continue training.\nEach evaluation entails 20 roll-outs (footnote 8) with an \u03b5-greedy policy (similar to Mnih et al., \u03b5 = 0.05). This\nperiodic evaluation setup allowed us to effectively experiment with the SRL algorithms and obtain\nclear comparisons with DQN, without waiting for full DQN runs to complete.\n(1) Regularization: Experiments with standard SRL methods without any regularization yielded\npoor results. We found the main reason to be that the matrices used in the SRL solutions (Equations 1\nand 2) are ill-conditioned, resulting in instability. One possible explanation stems from the sparseness\nof the features. The DQN uses ReLU activations (Jarrett et al., 2009), which causes the network to\nlearn sparse feature representations. 
For example, once the DQN completed training on Breakout,\n96% of features were zero.\nOnce we added a regularization term, we found that the performance of the SRL algorithms improved.\nWe experimented with the \u2113_2 and Bayesian Prior (BP) regularizers (\u03bb \u2208 [0, 10^2]). While the \u2113_2\nregularizer showed competitive performance in Breakout, we found that the BP performed better\nacross domains (Figure 1, best regularizers chosen, shows the average score of each configuration\nfollowing the explained evaluation procedure, for the different epochs). Moreover, the BP regularizer\nwas not sensitive to the scale of the regularization coefficient. Regularizers in the range (10^{\u22121}, 10^1)\nperformed well across all domains. A table of average scores for different coefficients can be found\nin Appendix C.1. Note that we do not expect much improvement, as we restore the original\nDQN weights after evaluation.\n\nFootnote 6: The reader is referred to Ghavamzadeh et al. (2015) for an overview on using Bayesian methods in RL.\nFootnote 7: Every three million DQN steps, referred to as one epoch (out of a total of 50 million steps).\nFootnote 8: Each roll-out starts from a new (random) game and follows a policy until the agent loses all of its lives.\n\n(2) Data Gathering: We experimented with two mechanisms for generating data: (1) generating\nnew data from the current policy, and (2) using the ER. We found that the data generation mechanism\nhad a significant impact on the performance of the algorithms. When the data is generated only from\nthe current DQN policy (without ER) the SRL solution resulted in poor performance compared to a\nsolution using the ER (as was observed by Mnih et al. 2015). 
We believe that the main reason the ER\nworks well is that the ER contains data sampled from multiple (past) policies, and therefore exhibits\nmore exploration of the state space.\n(3) Policy Improvement: LSTD-Q and FQI are off-policy algorithms and can be applied iteratively\non the same dataset (e.g. LSPI, Lagoudakis & Parr 2003). However, in practice, we found that\nperforming multiple iterations did not improve the results. A possible explanation is that by improving\nthe policy, the policy reaches new areas in the state space that are not represented well in the current\nER, and therefore are not approximated well by the SRL solution and the current DRL network.\n\nFigure 1: Periodic evaluation for DQN (green), LS-DQN_LSTD-Q with Bayesian prior regularization\n(red, solid \u03bb = 10, dashed \u03bb = 1), and \u2113_2 regularization (blue, solid \u03bb = 0.001, dashed \u03bb = 0.0001).\n\n4.2 Atari Experiments\n\nWe next ran the full LS-DQN algorithm (Alg. 1) on five Atari domains: Asterix, Space Invaders,\nBreakout, Qbert and Bowling. We ran the LS-DQN using both DQN and DDQN as the DRL\nalgorithm, and using both LSTD-Q and FQI as the SRL algorithms. We chose to run an LS-update\nevery N_DRL = 500k steps, for a total of 50M steps (SRLiters = 100). We used the current ER\nbuffer as the \u2018generated\u2019 data in the LS-UPDATE function (line 7 in Alg. 1, N_SRL = 1M), and a\nregularization coefficient \u03bb = 1 for the Bayesian prior solution (both for FQI and LSTD-Q). We\nemphasize that we did not use any additional samples beyond the samples already obtained by the\nDRL algorithm.\nFigure 2 presents the learning curves of the DQN network, LS-DQN with LSTD-Q, and LS-DQN\nwith FQI (referred to as DQN, LS-DQN_LSTD-Q, and LS-DQN_FQI, respectively) on three domains:\nAsterix, Space Invaders and Breakout. Note that we use the same evaluation process as described\nin Mnih et al. (2015). 
We were also interested in a test to measure differences between learning\ncurves, and not only their maximal score. Hence we chose to perform the Wilcoxon signed-rank test on\nthe average scores between the three DQN variants. This non-parametric statistical test measures\nwhether related samples differ in their means (Wilcoxon, 1945). We found that the learning curves\nfor both LS-DQN_LSTD-Q and LS-DQN_FQI were statistically significantly better than those of DQN,\nwith p-values smaller than 1e-15 for all three domains.\n\nFigure 2: Learning curves of DQN (green), LS-DQN_LSTD-Q (red), and LS-DQN_FQI (blue).\n\nFootnote 9: Scores for DQN and DDQN were taken from Van Hasselt et al. (2016).\n\nTable 1 presents the maximum average scores along the learning curves of the five domains, when\nthe SRL algorithms were incorporated into both DQN agents and DDQN agents (the notation is\nsimilar, i.e., LS-DDQN_FQI) (footnote 9). Our algorithm, LS-DQN, attained better performance compared to the\nvanilla DQN agents, as seen by the higher scores in Table 1 and Figure 2. We observe an interesting\nphenomenon for the game Asterix: In Figure 2, the DQN\u2019s score \u201ccrashes\u201d to zero (as was observed\nby Van Hasselt et al. 2016). LS-DQN_LSTD-Q did not manage to resolve this issue, even though it\nachieved a significantly higher score than that of the DQN. LS-DQN_FQI, however, maintained steady\nperformance and did not \u201ccrash\u201d to zero. 
We found that, in general, incorporating FQI as an SRL\nalgorithm into the DRL agents resulted in improved performance.\n\nTable 1: Maximal average scores across five different Atari domains for each of the DQN variants.\n\nAlgorithm            Breakout   Space Invaders   Asterix     Qbert      Bowling\nDQN (footnote 9)     401.20     1975.50          6011.67     10595.83   42.40\nLS-DQN_LSTD-Q        420.00     3207.44          13704.23    10767.47   61.21\nLS-DQN_FQI           438.55     3360.81          13636.81    12981.42   75.38\nDDQN (footnote 9)    375.00     3154.60          15150.00    14875.00   70.50\nLS-DDQN_FQI          397.94     4400.83          16270.45    12727.94   80.75\n\n4.3 Ablative Analysis\n\nIn the previous section, we saw that the LS-DQN algorithm has improved performance, compared to\nthe DQN agents, across a number of domains. The goal of this section is to understand the reasons\nbehind the LS-DQN\u2019s improved performance by conducting an ablative analysis of our algorithm.\nFor this analysis, we used a DQN agent that was trained on the game of Breakout, in the same\nmanner as described in Section 4.1. We focus on analyzing the LS-DQN_FQI algorithm, which has the\nsame optimization objective as DQN (cf. Section 2), and postulate the following conjectures for its\nimproved performance:\n\n(i) The SRL algorithms use a Bayesian regularization term, which is not included in the DQN\nobjective.\n\n(ii) The SRL algorithms have fewer hyperparameters to tune and generate an explicit solution\ncompared to SGD-based DRL solutions.\n\n(iii) Large-batch methods perform better than small-batch methods when combining DRL with SRL.\n\n(iv) SRL algorithms focus on training the last layer and are easier to optimize.\n\nThe Experiments: We started by analyzing the learning method of the last layer (i.e., the \u2018shallow\u2019\npart of the learning process). 
We did this by optimizing the last layer, at each LS-UPDATE epoch,\nusing (1) FQI with a Bayesian prior and an LS solution, and (2) an ADAM (Kingma & Ba, 2014)\noptimizer with and without an additional Bayesian prior regularization term in the loss function. We\ncompared these approaches for different mini-batch sizes of 32, 512, and 4096 data points, and used\n\u03bb = 1 for all experiments.\nRelating to conjecture (ii), note that the FQI algorithm has only one hyper-parameter to tune and\nproduces an explicit solution using the whole dataset simultaneously. ADAM, on the other hand, has\nmore hyper-parameters to tune and works on different mini-batch sizes.\nThe Experimental Setup: The experiments were done in a periodic fashion similar to Section 4.1,\ni.e., testing behavior in different epochs over a vanilla DQN run. For both ADAM and FQI, we first\ncollected 80k data samples from the ER at each epoch. For ADAM, we performed 20 iterations over\nthe data, where each iteration consisted of randomly permuting the data, dividing it into mini-batches\nand optimizing using ADAM over the mini-batches (footnote 10). We then simulated the agent and reported average\nscores across 20 trajectories.\nThe Results: Figure 3 depicts the differences between the average scores of (1) and (2) and the\nDQN baseline scores. We see that larger mini-batches result in improved performance. Moreover,\nthe LS solution (FQI) outperforms the ADAM solutions for mini-batch sizes of 32 and 512 on most\nepochs, and even slightly outperforms the best of them (mini-batch size of 4096 and a Bayesian\nprior). In addition, a solution with a prior performs better than a solution without a prior.\nSummary: Our ablative analysis experiments strongly support conjectures (iii) and (iv) from above,\nfor explaining LS-DQN\u2019s improved performance. 
That is, large-batch methods perform better than\nsmall-batch methods when combining DRL with SRL as explained above; and SRL algorithms that\nfocus on training only the last layer are easier to optimize, as we see that optimizing the last layer\nimproved the score across epochs.\n\nFigure 3: Differences of the average scores, for different learning methods, compared to vanilla DQN.\nPositive values represent improvement over vanilla DQN. For example, for mini-batch of 32 (left\nfigure), in epoch 3, FQI (blue) out-performed vanilla DQN by 21, while ADAM with prior (red) and\nADAM without prior (green) under-performed by 46 and 96, respectively. Note that: (1) as the\nmini-batch size increases, the improvement of ADAM over DQN becomes closer to the improvement\nof FQI over DQN, and (2) optimizing the last layer improves performance.\n\nWe finish this section with an interesting observation. While the LS solution improves the\nperformance of the DRL agents, we found that the LS solution weights are very close to the baseline DQN\nsolution. See Appendix D for the full results. Moreover, the distance was inversely proportional to\nthe performance of the solution. That is, the FQI solution that performed the best was the closest (in\n\u2113_2 norm) to the DQN solution, and vice versa. There were orders of magnitude differences between\nthe norms of solutions that performed well and those that did not. Similar results, i.e., that large-batch\nsolutions find solutions that are close to the baseline, have been reported by Keskar et al. (2016). We\nfurther compare our results with the findings of Keskar et al. in the section to follow.\n\n5 Related work\n\nWe now review recent works that are related to this paper.\nRegularization: The general idea of applying regularization for feature selection, and to avoid\nover-fitting is a common theme in machine learning. 
However, applying it to RL algorithms is challenging\ndue to the fact that these algorithms are based on finding a fixed-point rather than optimizing a loss\nfunction (Kolter & Ng, 2009). Value-based DRL approaches do not use regularization layers (e.g.,\npooling, dropout and batch normalization), which are popular in other deep learning methods. The\nDQN, for example, has a relatively shallow architecture (three convolutional layers, followed by two\nfully connected layers) without any regularization layers. Recently, regularization was introduced\nin problems that combine value-based RL with other learning objectives. For example, Hester et al.\n(2017) combine RL with supervised learning from expert demonstration, and introduce regularization\nto avoid over-fitting the expert data; and Kirkpatrick et al. (2017) introduce regularization to avoid\ncatastrophic forgetting in transfer learning. SRL methods, on the other hand, perform well with\nregularization (Kolter & Ng, 2009) and have been shown to converge (Farahmand et al., 2009).\n\nFootnote 10: The selected hyper-parameters used for these experiments can be found in Appendix D, along with results\nfor one iteration of ADAM.\n\nBatch size: Our results suggest that a large batch LS solution for the last layer of a value-based DRL\nnetwork can significantly improve its performance. This result is somewhat surprising, as it has been\nobserved by practitioners that using larger batches in deep learning degrades the quality of the model,\nas measured by its ability to generalize (Keskar et al., 2016).\nHowever, our method differs from the experiments performed by Keskar et al. (2016) and therefore\ndoes not contradict them, for the following reasons: (1) The LS-DQN Algorithm uses the large batch\nsolution only for the last layer. The lower layers of the network are not affected by the large batch\nsolution and therefore do not converge to a sharp minimum. 
(2) The experiments of Keskar et al.\n(2016) were performed for classification tasks, whereas our algorithm minimizes an MSE loss. (3)\nKeskar et al. showed that large-batch solutions work well when piggy-backing (warm-starting) on a\nsmall-batch solution. Similarly, our algorithm mixes small and large batch solutions as it switches\nbetween them periodically.\nMoreover, it was recently observed that flat minima in practical deep learning model classes can be\nturned into sharp minima via re-parameterization without changing the generalization gap, and hence\nthe relation between sharpness and generalization requires further investigation (Dinh et al., 2017). In addition, Hoffer et al. (2017) showed that large-batch\ntraining can generalize as well as small-batch training by adapting the number of iterations.\nThus, we strongly believe that our findings on combining large and small batches in\nDRL are in agreement with recent results of other deep learning research groups.\nDeep and Shallow RL: Using the last hidden layer of a DNN as a feature extractor and learning the\nlast layer with a different algorithm has been addressed before in the literature, e.g., in the context of\ntransfer learning (Donahue et al., 2013). In RL, there have been competitive attempts to use SRL with\nunsupervised features to play Atari (Liang et al., 2016; Blundell et al., 2016), and to learn features\nautomatically followed by a linear control rule (Song et al., 2016), but to the best of our knowledge,\nthis is the first attempt that successfully combines DRL with SRL algorithms.\n\n6 Conclusion\n\nIn this work we presented LS-DQN, a hybrid approach that incorporates least-squares RL updates\nwithin online deep RL. LS-DQN obtains the best of both worlds: rich representations from deep RL\nnetworks as well as stability and data efficiency of least squares methods. 
Experiments with two deep\nRL methods and two least squares methods revealed that a hybrid approach consistently improves\nover vanilla deep RL in the Atari domain. Our ablative analysis indicates that the success of the\nLS-DQN algorithm is due to the large batch updates made possible by using least squares.\nThis work focused on value-based RL. However, our hybrid linear/deep approach can be extended\nto other RL methods, such as actor-critic (Mnih et al., 2016). More broadly, decades of research\non linear RL methods have provided methods with strong guarantees, such as approximate linear\nprogramming (Desai et al., 2012) and modified policy iteration (Scherrer et al., 2015). Our approach\nshows that with the correct modifications, such as our Bayesian regularization term, linear methods\ncan be combined with deep RL. This opens the door to future combinations of well-understood linear\nRL with deep representation learning.\n\nAcknowledgement This research was supported by the European Community\u2019s Seventh Framework\nProgram (FP7/2007-2013) under grant agreement 306638 (SUPREL). A. Tamar is supported in\npart by Siemens and the Viterbi Scholarship, Technion.\n\nReferences\nBarto, AG and Crites, RH. Improving elevator performance using reinforcement learning. Advances in neural\ninformation processing systems, 8:1017\u20131023, 1996.\n\nBellemare, Marc G, Naddaf, Yavar, Veness, Joel, and Bowling, Michael. The arcade learning environment: An\nevaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253\u2013279, 2013.\n\nBertsekas, Dimitri P. Approximate dynamic programming. 2008.\n\nBlundell, Charles, Uria, Benigno, Pritzel, Alexander, Li, Yazhe, Ruderman, Avraham, Leibo, Joel Z, Rae, Jack,\nWierstra, Daan, and Hassabis, Demis. Model-free episodic control. stat, 1050:14, 2016.\n\nBox, George EP and Tiao, George C. Bayesian inference in statistical analysis. 
John Wiley & Sons, 2011.\n\nDesai, Vijay V, Farias, Vivek F, and Moallemi, Ciamac C. Approximate dynamic programming via a smoothed\nlinear program. Operations Research, 60(3):655\u2013674, 2012.\n\nDinh, Laurent, Pascanu, Razvan, Bengio, Samy, and Bengio, Yoshua. Sharp minima can generalize for deep\nnets. arXiv preprint arXiv:1703.04933, 2017.\n\nDonahue, Jeff, Jia, Yangqing, Vinyals, Oriol, Hoffman, Judy, Zhang, Ning, Tzeng, Eric, and Darrell, Trevor.\nDecaf: A deep convolutional activation feature for generic visual recognition. In Proceedings of the 30th\ninternational conference on machine learning (ICML-13), pp. 647\u2013655, 2013.\n\nErnst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of\nMachine Learning Research, 6(Apr):503\u2013556, 2005.\n\nFarahmand, Amir M, Ghavamzadeh, Mohammad, Mannor, Shie, and Szepesv\u00e1ri, Csaba. Regularized policy\niteration. In Advances in Neural Information Processing Systems, pp. 441\u2013448, 2009.\n\nGhavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, Tamar, Aviv, et al. Bayesian reinforcement learning:\nA survey. Foundations and Trends\u00ae in Machine Learning, 8(5-6):359\u2013483, 2015.\n\nHester, Todd, Vecerik, Matej, Pietquin, Olivier, Lanctot, Marc, Schaul, Tom, Piot, Bilal, Sendonaris, Andrew,\nDulac-Arnold, Gabriel, Osband, Ian, Agapiou, John, et al. Learning from demonstrations for real world\nreinforcement learning. arXiv preprint arXiv:1704.03732, 2017.\n\nHinton, Geoffrey, Srivastava, Nitish, and Swersky, Kevin. Neural networks for machine learning lecture 6a\noverview of mini-batch gradient descent. 2012.\n\nHoffer, Elad, Hubara, Itay, and Soudry, Daniel. Train longer, generalize better: closing the generalization gap in\nlarge batch training of neural networks. arXiv preprint arXiv:1705.08741, 2017.\n\nJarrett, Kevin, Kavukcuoglu, Koray, LeCun, Yann, et al. What is the best multi-stage architecture for object\nrecognition? 
In Computer Vision, 2009 IEEE 12th International Conference on, pp. 2146\u20132153. IEEE, 2009.\n\nKeskar, Nitish Shirish, Mudigere, Dheevatsa, Nocedal, Jorge, Smelyanskiy, Mikhail, and Tang, Ping Tak Peter.\nOn large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint\narXiv:1609.04836, 2016.\n\nKingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980,\n2014.\n\nKirkpatrick, James, Pascanu, Razvan, Rabinowitz, Neil, Veness, Joel, Desjardins, Guillaume, Rusu, Andrei A,\nMilan, Kieran, Quan, John, Ramalho, Tiago, Grabska-Barwinska, Agnieszka, et al. Overcoming catastrophic\nforgetting in neural networks. Proceedings of the National Academy of Sciences, pp. 201611835, 2017.\n\nKolter, J Zico and Ng, Andrew Y. Regularization and feature selection in least-squares temporal difference\nlearning. In Proceedings of the 26th annual international conference on machine learning. ACM, 2009.\n\nLagoudakis, Michail G and Parr, Ronald. Least-squares policy iteration. Journal of machine learning research,\n4(Dec):1107\u20131149, 2003.\n\nLiang, Yitao, Machado, Marlos C, Talvitie, Erik, and Bowling, Michael. State of the art control of atari games\nusing shallow reinforcement learning. In Proceedings of the 2016 International Conference on Autonomous\nAgents & Multiagent Systems, 2016.\n\nLin, Long-Ji. Reinforcement learning for robots using neural networks. 1993.\n\nMnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A, Veness, Joel, Bellemare, Marc G,\nGraves, Alex, Riedmiller, Martin, Fidjeland, Andreas K, Ostrovski, Georg, et al. Human-level control through\ndeep reinforcement learning. Nature, 518(7540):529\u2013533, 2015.\n\nMnih, Volodymyr, Badia, Adria Puigdomenech, Mirza, Mehdi, Graves, Alex, Lillicrap, Timothy P, Harley,\nTim, Silver, David, and Kavukcuoglu, Koray. Asynchronous methods for deep reinforcement learning. 
In\nInternational Conference on Machine Learning, pp. 1928\u20131937, 2016.\n\nRiedmiller, Martin. Neural fitted q iteration\u2013first experiences with a data efficient neural reinforcement learning\nmethod. In European Conference on Machine Learning, pp. 317\u2013328. Springer, 2005.\n\nScherrer, Bruno, Ghavamzadeh, Mohammad, Gabillon, Victor, Lesner, Boris, and Geist, Matthieu. Approximate\nmodified policy iteration and its application to the game of tetris. Journal of Machine Learning Research,\n16:1629\u20131676, 2015.\n\nSilver, David, Huang, Aja, Maddison, Chris J, Guez, Arthur, Sifre, Laurent, Van Den Driessche, George,\nSchrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game\nof go with deep neural networks and tree search. Nature, 529(7587):484\u2013489, 2016.\n\nSong, Zhao, Parr, Ronald E, Liao, Xuejun, and Carin, Lawrence. Linear feature encoding for reinforcement\nlearning. In Advances in Neural Information Processing Systems, pp. 4224\u20134232, 2016.\n\nSutton, Richard and Barto, Andrew. Reinforcement Learning: An Introduction. MIT Press, 1998.\n\nTessler, Chen, Givony, Shahar, Zahavy, Tom, Mankowitz, Daniel J, and Mannor, Shie. A deep hierarchical\napproach to lifelong learning in minecraft. Proceedings of the National Conference on Artificial Intelligence\n(AAAI), 2017.\n\nTsitsiklis, John N, Van Roy, Benjamin, et al. An analysis of temporal-difference learning with function\napproximation. IEEE Transactions on Automatic Control, 42(5):674\u2013690, 1997.\n\nVan Hasselt, Hado, Guez, Arthur, and Silver, David. Deep reinforcement learning with double q-learning.\nProceedings of the National Conference on Artificial Intelligence (AAAI), 2016.\n\nWang, Ziyu, Schaul, Tom, Hessel, Matteo, van Hasselt, Hado, Lanctot, Marc, and de Freitas, Nando. Dueling\nnetwork architectures for deep reinforcement learning. 
In Proceedings of The 33rd International Conference\non Machine Learning, pp. 1995\u20132003, 2016.\n\nWilcoxon, Frank. Individual comparisons by ranking methods. Biometrics bulletin, 1(6):80\u201383, 1945.\n\nZahavy, Tom, Ben-Zrihem, Nir, and Mannor, Shie. Graying the black box: Understanding dqns. In Proceedings\nof The 33rd International Conference on Machine Learning, pp. 1899\u20131908, 2016.", "award": [], "sourceid": 1785, "authors": [{"given_name": "Nir", "family_name": "Levine", "institution": "Technion - Israel Institute of Technology"}, {"given_name": "Tom", "family_name": "Zahavy", "institution": "The Technion"}, {"given_name": "Daniel", "family_name": "Mankowitz", "institution": "Technion"}, {"given_name": "Aviv", "family_name": "Tamar", "institution": "UC Berkeley"}, {"given_name": "Shie", "family_name": "Mannor", "institution": "Technion"}]}