{"title": "Transfer of Value Functions via Variational Methods", "book": "Advances in Neural Information Processing Systems", "page_first": 6179, "page_last": 6189, "abstract": "We consider the problem of transferring value functions in reinforcement learning. We propose an approach that uses the given source tasks to learn a prior distribution over optimal value functions and provide an efficient variational approximation of the corresponding posterior in a new target task. We show our approach to be general, in the sense that it can be combined with complex parametric function approximators and distribution models, while providing two practical algorithms based on Gaussians and Gaussian mixtures. We theoretically analyze them by deriving a finite-sample analysis and provide a comprehensive empirical evaluation in four different domains.", "full_text": "Transfer of Value Functions via Variational Methods\n\nAndrea Tirinzoni*\nPolitecnico di Milano\nandrea.tirinzoni@polimi.it\n\nRafael Rodriguez Sanchez*\nPolitecnico di Milano\nrafaelalberto.rodriguez@polimi.it\n\nMarcello Restelli\nPolitecnico di Milano\nmarcello.restelli@polimi.it\n\nAbstract\n\nWe consider the problem of transferring value functions in reinforcement learning. We propose an approach that uses the given source tasks to learn a prior distribution over optimal value functions and provide an efficient variational approximation of the corresponding posterior in a new target task. We show our approach to be general, in the sense that it can be combined with complex parametric function approximators and distribution models, while providing two practical algorithms based on Gaussians and Gaussian mixtures. 
We theoretically analyze them by deriving a finite-sample analysis and provide a comprehensive empirical evaluation in four different domains.\n\n1 Introduction\n\nRecent advances have allowed reinforcement learning (RL) [34] to achieve impressive results in a wide variety of complex tasks, ranging from Atari [26] through the game of Go [33] to the control of sophisticated robotic systems [17, 24, 23]. The main limitation is that these RL algorithms still require an enormous amount of experience samples before successfully learning such complicated tasks. One of the most promising solutions to alleviate this problem is transfer learning, which focuses on reusing past knowledge available to the agent in order to reduce the sample complexity of learning new tasks. In the typical setting of transfer in RL [36], it is assumed that the agent has already solved a set of source tasks generated from some unknown distribution. Then, given a target task (drawn from the same distribution, or a slightly different one), the agent can rely on the knowledge from the source tasks to speed up the learning process. This reuse of knowledge is a significant advantage over plain RL, in which the agent learns each new task from scratch regardless of any previous learning experience. Several algorithms have been proposed in the literature to transfer the different elements involved in the learning process: experience samples [22, 35, 37], policies/options [11, 19], rewards [18], features [6], and parameters [10, 16, 12]. We refer the reader to [36, 20] for a thorough survey on transfer in RL.\nAssuming that the tasks follow a specific distribution, an intuitive choice when designing a transfer algorithm is to try to characterize the uncertainty over the target task. An ideal algorithm would then leverage prior knowledge from the source tasks while interacting with the target task, so as to reduce this uncertainty as quickly as possible. 
This simple intuition makes Bayesian methods appealing for transfer in RL, and many previous works have taken this direction. In [39], the authors assume that the tasks share similarities in their dynamics and rewards and propose a hierarchical Bayesian model for the distribution of these two elements. Similarly, in [21], the authors assume that tasks are similar in their value functions and design a different hierarchical Bayesian model for transferring such information. More recently, [10], and its extension [16], consider tasks whose dynamics are governed by some hidden parameters and propose efficient Bayesian models to quickly learn such parameters in new tasks. However, most of these algorithms require specific, and sometimes restrictive, assumptions (e.g., on the distributions involved or the function approximators adopted), which might limit their practical applicability. The importance of having transfer algorithms that alleviate the need for strong assumptions and easily adapt to different contexts motivates us to take a more general approach.\n\n*Equal contribution\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nSimilarly to [21], we assume that the tasks share similarities in their value functions and use the given source tasks to learn a distribution over such functions. Then, we use this distribution as a prior for learning the target task and propose a computationally efficient variational approximation of the corresponding posterior. Leveraging recent ideas from randomized value functions [27, 4], we design a Thompson sampling-based algorithm that efficiently explores the target task by repeatedly sampling from the posterior and acting greedily w.r.t. 
the sampled value function. We show that our approach is very general, in the sense that it can work with any parametric function approximator and any prior/posterior distribution model (in this paper we focus on Gaussians and Gaussian mixtures). In addition to this algorithmic contribution, we make a theoretical contribution by providing a finite-sample analysis of our approach, and an experimental contribution showing its empirical performance on four domains of increasing difficulty.\n\n2 Preliminaries\n\nWe consider a distribution D over tasks, where each task M_τ is modeled as a discounted Markov Decision Process (MDP). We define an MDP as a tuple M_τ = ⟨S, A, P_τ, R_τ, p_0, γ⟩, where S is the state space, A is a finite set of actions, P_τ(·|s, a) is the distribution of the next state s' given that action a is taken in state s, R_τ : S × A → R is the reward function, p_0 is the initial-state distribution, and γ ∈ [0, 1) is the discount factor. We assume the reward function to be uniformly bounded by a constant R_max > 0. A deterministic policy π : S → A is a mapping from states to actions. At the beginning of each episode of interaction, the initial state s_0 is drawn from p_0. Then, the agent takes the action a_0 = π(s_0), receives a reward R_τ(s_0, a_0), transitions to the next state s_1 ∼ P_τ(·|s_0, a_0), and the process is repeated. The goal is to find the policy maximizing the long-term return over a possibly infinite horizon: max_π J(π) := E_{M_τ,π}[ Σ_{t=0}^∞ γ^t R_τ(s_t, a_t) ]. To this end, we define the optimal value function of task M_τ, Q*_τ(s, a), as the expected return obtained by taking action a in state s and following an optimal policy thereafter. 
Then, an optimal policy π*_τ is a policy that is greedy with respect to the optimal value function, i.e., π*_τ(s) = argmax_a Q*_τ(s, a) for all states s. It can be shown (e.g., [28]) that Q*_τ is the unique fixed point of the optimal Bellman operator T_τ, defined by T_τ Q(s, a) = R_τ(s, a) + γ E_{s' ∼ P_τ}[ max_{a'} Q(s', a') ] for any value function Q. From now on, we adopt the term Q-function to denote any plausible value function, i.e., any function Q : S × A → R uniformly bounded by R_max / (1 − γ). In the following, to avoid cluttering the notation, we drop the subscript τ whenever there is no ambiguity.\nWe consider a parametric family of Q-functions, Q = { Q_w : S × A → R | w ∈ R^d }, and we assume each function in Q to be uniformly bounded by R_max / (1 − γ). When learning the optimal value function, a quantity of interest is how close a given function Q_w is to the fixed point of the Bellman operator. A possible measure is its Bellman error (or Bellman residual), defined by B_w := T Q_w − Q_w. Notice that Q_w is optimal if and only if B_w(s, a) = 0 for all s, a. If we assume the existence of a distribution ν over S × A, a sound objective is to directly minimize the squared Bellman error of Q_w under ν, denoted by ||B_w||²_ν. Unfortunately, it is well known that an unbiased estimator of this quantity requires two independent samples of the next state s' for each s, a (e.g., [25]). 
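To see the bias concretely, here is a toy numeric sketch (our own illustrative example, not from the paper): with zero reward and a zero value estimate, the true squared Bellman error is 0, yet squaring a single-sample TD target returns the variance of the bootstrapped target, while averaging the product of two independent next-state samples does not.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n = 0.9, 100_000

# One state, reward 0, current estimate Q(s, a) = 0; the next-state value is
# +1 or -1 with equal probability, so the true Bellman error is exactly 0.
v1 = rng.choice([-1.0, 1.0], size=n)            # first sample of the next state
v2 = rng.choice([-1.0, 1.0], size=n)            # independent second sample

single = np.mean((gamma * v1) ** 2)             # single-sample squared TD error
double = np.mean((gamma * v1) * (gamma * v2))   # double-sample estimator

# single is inflated to gamma^2 * Var of the target; double concentrates near 0
```

The double-sample estimator is unbiased here precisely because the two next-state draws are independent, which is rarely possible outside a simulator.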
In practice, the Bellman error is typically replaced by the TD error b(w), which approximates the former using a single transition sample ⟨s, a, s', r⟩: b(w) = r + γ max_{a'} Q_w(s', a') − Q_w(s, a). Finally, given a dataset D = ⟨s_i, a_i, r_i, s'_i⟩_{i=1}^N of N samples, the squared TD error is computed as ||B_w||²_D = (1/N) Σ_{i=1}^N (r_i + γ max_{a'} Q_w(s'_i, a') − Q_w(s_i, a_i))² = (1/N) Σ_{i=1}^N b_i(w)². Whenever the distinction is clear from the context, with a slight abuse of terminology, we refer to the squared Bellman error and to the squared TD error as Bellman error and TD error, respectively.\n\n3 Variational Transfer Learning\n\nIn this section, we describe our variational approach to transfer in RL. In Section 3.1, we begin by introducing our algorithm from a high-level perspective, so that any choice of prior and posterior distributions is possible. Then, in Sections 3.2 and 3.3, we propose practical implementations based on Gaussians and mixtures of Gaussians, respectively. We conclude with some considerations on how to optimize the proposed objective in Section 3.4.\n\n3.1 Algorithm\n\nLet us observe that the distribution D over tasks induces a distribution over optimal Q-functions. Furthermore, for any MDP, learning its optimal Q-function is sufficient to solve the problem. Thus, we can safely replace the distribution over tasks with the distribution over their optimal value functions. In our parametric setting, we reduce the latter to a distribution p(w) over weights.\nAssume, for the moment, that we know the distribution p(w), and consider a dataset D = ⟨s_i, a_i, r_i, s'_i⟩_{i=1}^N of samples from some task M_τ ∼ D that we want to solve. Then, we can compute the posterior distribution over weights given such a dataset by applying Bayes' theorem: p(w|D) ∝ p(D|w) p(w). 
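As a concrete illustration, the squared TD error over a dataset can be computed as follows (a minimal NumPy sketch; the batched interface `Q(states, w)`, returning one value per action, is our own assumption, not the paper's code):

```python
import numpy as np

def squared_td_error(Q, w, batch, gamma):
    """(1/N) * sum_i (r_i + gamma * max_a' Q(s'_i, a'; w) - Q(s_i, a_i; w))^2."""
    s, a, r, s_next = batch                           # arrays of length N
    targets = r + gamma * Q(s_next, w).max(axis=1)    # bootstrapped targets
    preds = Q(s, w)[np.arange(len(a)), a]             # Q-values of taken actions
    return float(np.mean((targets - preds) ** 2))

# Example with a toy linear Q-function: Q_w(s, a) = w[a] * s (purely illustrative)
def linear_q(states, w):
    return np.outer(states, w)                        # shape (N, |A|)
```

Note that squaring the single-sample target is exactly the biased estimate discussed above.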
Unfortunately, this cannot be directly used in practice since we do not have a model of the likelihood p(D|w). In such cases, it is very common to make strong assumptions on the MDPs or the Q-functions to get tractable posteriors. However, in our transfer setting, all distributions involved depend on the family of tasks under consideration, and making such assumptions is likely to limit the applicability to specific problems. Thus, we take a different approach to derive a more general, but still well-grounded, solution. Note that our final goal is to move the probability mass over the weights while minimizing some empirical loss measure, which in our case is the TD error ||B_w||²_D. Then, given a prior p(w), we know from PAC-Bayesian theory that the optimal Gibbs posterior q, which minimizes an oracle upper bound on the expected loss, takes the form (e.g., [9]):\n\nq(w) = e^{−Λ ||B_w||²_D} p(w) / ∫ e^{−Λ ||B_{w'}||²_D} p(dw'),     (1)\n\nfor some parameter Λ > 0. Since Λ is typically chosen to increase with the number of samples N, in the remainder we set it to λ^{−1} N for some constant λ > 0. Note that, whenever the term e^{−Λ ||B_w||²_D} can be interpreted as the actual likelihood of D, q becomes a classic Bayesian posterior. Although we now have an appealing distribution, the integral in the denominator of (1) is intractable even for simple Q-function models. Thus, we propose a variational approximation q_ξ by considering a simpler family of distributions parameterized by ξ ∈ Ξ. Then, our problem reduces to finding the variational parameters ξ such that q_ξ minimizes the Kullback-Leibler (KL) divergence w.r.t. the Gibbs posterior q. 
From the theory of variational inference (e.g., [7]), this can be shown to be equivalent to minimizing the well-known (negative) evidence lower bound (ELBO):\n\nmin_{ξ ∈ Ξ} L(ξ) = E_{w ∼ q_ξ}[ ||B_w||²_D ] + (λ/N) KL( q_ξ(w) || p(w) ).     (2)\n\nIntuitively, the approximate posterior balances between placing probability mass over those weights w that have low expected TD error (first term) and staying close to the prior distribution (second term). Assuming that we can compute the gradients of (2) w.r.t. the variational parameters ξ, our objective can be optimized using any stochastic optimization algorithm, as shown in the next subsections.\nWe now outline our general transfer procedure in Algorithm 1, deferring a description of specific choices for the involved distributions to the next two subsections. Although the distribution p(w) is not known in practice, we assume that the agent has solved a finite number of source tasks M_τ1, M_τ2, ..., M_τM and that we are given the set of their approximate solutions: Ws = {w_1, w_2, ..., w_M} such that Q_{w_j} ≈ Q*_{τ_j}. Using these weights, we start by estimating the prior distribution (line 1), and we initialize the variational parameters by minimizing the KL divergence w.r.t. such distribution (line 2).² Then, at each time step of interaction, we re-sample the weights from the current approximate posterior and act greedily w.r.t. the corresponding Q-function (lines 7, 8). 
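A minimal sketch of how the objective in (2) can be estimated in code (all callables are hypothetical placeholders of our own: `td_error` evaluates the squared TD error for given weights on a mini-batch, `sample_posterior` draws w ∼ q_ξ via reparameterization so that gradients can flow, and `kl_term` returns the KL term, closed-form or upper-bounded):

```python
def neg_elbo_estimate(td_error, sample_posterior, kl_term, lam, N, n_samples=10):
    """Monte Carlo estimate of (2): E_{w ~ q_xi}[||B_w||^2_D] + (lam / N) * KL(q_xi || p)."""
    # Expected TD error, approximated with a few posterior samples
    expected_loss = sum(td_error(sample_posterior()) for _ in range(n_samples)) / n_samples
    # Regularization toward the prior, scaled as in Eq. (2)
    return expected_loss + lam / N * kl_term()
```

Any stochastic optimizer can then be applied to the resulting estimate, as stated above.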
After collecting and storing the new experience (lines 9-10), we estimate the objective function gradient using a mini-batch of samples from the current dataset (line 11), and update the variational parameters (line 12).\n\nAlgorithm 1 Variational Transfer\nRequire: Target task M_τ, source weights Ws\n1: Estimate prior p(w) from Ws\n2: Initialize parameters: ξ ← argmin_ξ KL(q_ξ || p)\n3: Initialize dataset: D = ∅\n4: repeat\n5:   Sample initial state: s_0 ∼ p_0\n6:   while s_h is not terminal do\n7:     Sample weights: w ∼ q_ξ(w)\n8:     Take action a_h = argmax_a Q_w(s_h, a)\n9:     s_{h+1} ∼ P_τ(·|s_h, a_h), r_{h+1} = R_τ(s_h, a_h)\n10:    D ← D ∪ ⟨s_h, a_h, r_{h+1}, s_{h+1}⟩\n11:    Estimate gradient ∇_ξ L(ξ) using D' ⊆ D\n12:    Update ξ from ∇_ξ L(ξ) using any optimizer\n13:  end while\n14: until forever\n\n²If the prior and approximate posterior were in the same family of distributions, we could simply set ξ to the prior parameters. However, we are not making this assumption at this point.\n\nThe key property of our approach is the weight resampling at line 7, which resembles the well-known Thompson sampling approach adopted in multi-armed bandits [8] and closely relates to the recent value function randomization [27, 4]. At every step, we guess which task we are trying to solve based on our current belief, and we act as if that guess were true. This mechanism allows an efficient adaptive exploration of the target task. Intuitively, during the first steps of interaction, the agent is very uncertain about the current task, and such uncertainty induces stochasticity in the chosen actions, allowing a rather informed exploration to take place. 
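The episode loop of Algorithm 1 (lines 5-13) might be sketched as follows (the environment interface and the callables for posterior sampling, greedy action selection, and the ELBO update are all hypothetical placeholders, not the paper's code):

```python
def variational_transfer_episode(env, q_sample, greedy_action, update_posterior, dataset):
    """One episode of Algorithm 1: resample weights from the approximate posterior
    at every step and act greedily w.r.t. the sampled Q-function."""
    s = env.reset()                        # line 5: s_0 ~ p_0
    done = False
    while not done:                        # line 6
        w = q_sample()                     # line 7: w ~ q_xi(w)
        a = greedy_action(s, w)            # line 8: argmax_a Q_w(s, a)
        s_next, r, done = env.step(a)      # line 9: transition and reward
        dataset.append((s, a, r, s_next))  # line 10: store the transition
        update_posterior(dataset)          # lines 11-12: gradient step on the ELBO
        s = s_next
    return dataset
```

Resampling w inside the loop, rather than once per episode, is what produces the per-step Thompson sampling behavior discussed next.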
Consider, for instance, that actions that are bad on average across all tasks are unlikely to be sampled, while this cannot happen in uninformed exploration strategies, like ε-greedy, before learning takes place. As the learning process goes on, the algorithm quickly figures out which task it is solving, thus moving all the probability mass over the weights minimizing the TD error. From that point on, sampling from the posterior is approximately equivalent to deterministically taking such weights, and no more exploration is performed. Finally, notice the generality of the proposed approach: as long as the objective L is differentiable in the variational parameters ξ, and its gradients can be efficiently computed, any approximator for the Q-function and any prior/posterior distributions can be adopted. For the latter, we describe two practical choices in the next two sections.\n\n3.2 Gaussian Variational Transfer\n\nWe now restrict to a specific choice of the prior and posterior families that makes our algorithm very efficient and easy to implement. We assume that optimal Q-functions (or, better, their weights) follow a multivariate Gaussian distribution. That is, we model the prior as p(w) = N(μ_p, Σ_p) and we learn its parameters from the set of source weights using maximum likelihood estimation (with a small regularization to make sure the covariance is positive definite). Then, our variational family is the set of all well-defined Gaussian distributions, i.e., the variational parameters are Ξ = { (μ, Σ) | μ ∈ R^d, Σ ∈ R^{d×d}, Σ ≻ 0 }. 
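Estimating the Gaussian prior from the source weights is a regularized maximum likelihood fit; a minimal sketch (the regularization constant is our own assumption, not a value from the paper):

```python
import numpy as np

def estimate_gaussian_prior(source_weights, reg=1e-6):
    """Fit N(mu_p, Sigma_p) to the source-task weights by maximum likelihood,
    adding reg * I so the covariance stays positive definite."""
    W = np.asarray(source_weights)                        # shape (M, d)
    mu_p = W.mean(axis=0)
    centered = W - mu_p
    Sigma_p = centered.T @ centered / len(W) + reg * np.eye(W.shape[1])
    return mu_p, Sigma_p
```

The positive-definite regularization matters when the number of source tasks M is smaller than the weight dimension d, in which case the raw sample covariance is singular.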
To prevent the covariance from losing positive definiteness, we consider its Cholesky decomposition Σ = L L^T and learn the lower-triangular Cholesky factor L instead. In this case, deriving the gradient of the objective is very simple: both the KL divergence between two multivariate Gaussians and its gradients have a simple closed-form expression, while the expected log-likelihood can be easily differentiated by adopting the reparameterization trick (e.g., [15, 29]). We report these results in Appendix B.1.\n\n3.3 Mixture of Gaussians Variational Transfer\n\nAlthough the Gaussian assumption of the previous section is very appealing, as it allows for a simple and efficient way of computing the variational objective and its gradients, in practice it rarely allows us to describe the prior distribution accurately. In fact, even for families of tasks in which the reward and transition models are Gaussian, the Q-values might be far from normally distributed. Depending on the family of tasks under consideration and, since we are learning a distribution over weights, on the chosen function approximator, the prior might have arbitrarily complex shapes. When the information loss due to the Gaussian approximation becomes too severe, the algorithm is likely to fail at capturing any similarities between the tasks. We now propose a variant that successfully solves this problem, while keeping the algorithm efficient and simple enough to be applied in practice.\nGiven the source tasks' weights Ws, we model our estimated prior as a mixture of equally weighted isotropic Gaussians centered at each weight: p(w) = (1/|Ws|) Σ_{w_s ∈ Ws} N(w | w_s, σ²_p I). This model resembles a kernel density estimator [31] with bandwidth σ²_p and, due to its nonparametric nature, allows capturing arbitrarily complex distributions. Consistently with the prior, we model our approximate posterior as a mixture of Gaussians. 
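For illustration, the log-density of this kernel-density-style prior is a log-sum-exp over the source weights; a minimal sketch:

```python
import numpy as np

def mixture_prior_logpdf(w, source_weights, sigma2):
    """log p(w) for the isotropic-Gaussian mixture prior centered at the source
    weights (a kernel density estimate with bandwidth sigma2), computed via a
    numerically stable log-sum-exp."""
    W = np.asarray(source_weights)                  # (M, d) source-task weights
    d = W.shape[1]
    sq = ((w - W) ** 2).sum(axis=1)                 # squared distances to centers
    log_comp = -0.5 * sq / sigma2 - 0.5 * d * np.log(2 * np.pi * sigma2)
    m = log_comp.max()                              # shift for numerical stability
    return m + np.log(np.exp(log_comp - m).sum()) - np.log(len(W))
```

The log-sum-exp shift avoids underflow when the weights are far from all centers, which is common in high dimension.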
Using C components, our posterior is q_ξ(w) = (1/C) Σ_{i=1}^C N(w | μ_i, Σ_i), with variational parameters ξ = (μ_1, ..., μ_C, Σ_1, ..., Σ_C). Once again, we learn Cholesky factors instead of full covariances. Finally, since the KL divergence between two mixtures of Gaussians has no closed-form expression, we rely on an upper bound to this quantity, so that the negative ELBO still upper bounds the KL between the approximate and the exact posterior. Among the many upper bounds available, we adopt the one proposed in [14] (see Appendix B.2).\n\n3.4 Minimizing the TD Error\n\nFrom Sections 3.2 and 3.3, we know that differentiating the negative ELBO L w.r.t. ξ requires differentiating ||B_w||²_D w.r.t. w. Unfortunately, the TD error is well known to be non-differentiable due to the presence of the max operator. This issue is rarely a problem since typical value-based algorithms are semi-gradient methods, i.e., they do not differentiate the targets (see, e.g., Chapter 11 of [34]). However, our transfer setting is quite different from common RL. In fact, our algorithm is likely to start from Q-functions that are very close to an optimum and aims only to adapt the weights in some direction of lower error, so as to quickly converge to the solution of the target task. Unfortunately, this property does not hold for most semi-gradient algorithms. Even worse, many online RL algorithms combined with complex function approximators (e.g., DQNs) are well known to be unstable, especially when approaching an optimum, and require many tricks and much tuning to work well [30, 38]. This behavior is clearly undesirable in our case, as we only aim at adapting already good solutions. Thus, we consider using a residual gradient algorithm [5]. 
To differentiate the targets, we replace the optimal Bellman operator with the mellow Bellman operator introduced in [3], which adopts a softened version of max called mellowmax:\n\nmm_a Q_w(s, a) = (1/κ) log( (1/|A|) Σ_a e^{κ Q_w(s, a)} ),     (3)\n\nwhere κ is a hyperparameter and |A| is the number of actions. The mellow Bellman operator, which we denote by T̃, has several appealing properties: (i) it converges to the maximum as κ → ∞, (ii) it has a unique fixed point, and (iii) it is differentiable. Denoting by B̃_w = T̃ Q_w − Q_w the Bellman residual w.r.t. the mellow Bellman operator T̃, the corresponding TD error ||B̃_w||²_D is now differentiable w.r.t. w.\nAlthough residual algorithms have guaranteed convergence, they are typically much slower than their semi-gradient counterparts. [5] proposed to project the gradient in a direction that achieves higher learning speed while preserving convergence. This projection is obtained by including a parameter ψ ∈ [0, 1] in the TD error gradient:\n\n∇_w ||B̃_w||²_D = (2/N) Σ_{i=1}^N b̃_i(w) ( γψ ∇_w mm_{a'} Q_w(s'_i, a') − ∇_w Q_w(s_i, a_i) ),     (4)\n\nwhere b̃_i(w) = r_i + γ mm_{a'} Q_w(s'_i, a') − Q_w(s_i, a_i). Notice that ψ trades off between the semi-gradient (ψ = 0) and the full residual gradient (ψ = 1). A good criterion for choosing this parameter is to start with values close to zero (for faster learning) and move to higher values when approaching the optimum (to guarantee convergence).\n\n4 Theoretical Analysis\n\nA first important question we need to answer is whether replacing max with mellowmax in the Bellman operator constitutes a strong approximation or not. 
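As a concrete reference for the analysis, mellowmax in (3) is a shifted log-sum-exp and can be computed stably; a minimal sketch:

```python
import numpy as np

def mellowmax(q_values, kappa):
    """mm_a Q(s, a) = (1/kappa) * log( (1/|A|) * sum_a exp(kappa * Q(s, a)) ),
    computed via a shifted log-sum-exp for numerical stability.
    Tends to max as kappa -> inf and to the mean as kappa -> 0."""
    q = np.asarray(q_values, dtype=float)
    m = q.max()                                      # shift by the max Q-value
    return m + np.log(np.mean(np.exp(kappa * (q - m)))) / kappa
```

Unlike a plain softmax-weighted average, mellowmax is a non-expansion for every κ, which is what gives the operator a unique fixed point.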
It has been proven [3] that the mellow Bellman operator is a non-expansion under the L∞-norm and, thus, has a unique fixed point. However, how this fixed point differs from that of the optimal Bellman operator remains an open question. Since mellowmax monotonically converges to max as κ → ∞, it would be desirable if the fixed point of the corresponding operator also monotonically converged to the fixed point of the optimal one. We confirm that this property actually holds in the following theorem.\nTheorem 1. Let Q* be the fixed point of the optimal Bellman operator T. Define the action-gap function g(s) as the difference between the value of the best action and the second-best action at each state s. Let Q̃ be the fixed point of the mellow Bellman operator T̃ with parameter κ > 0 and denote by β_κ > 0 the inverse temperature of the induced Boltzmann distribution (as in [3]). Then:\n\n|| Q* − Q̃ ||_∞ ≤ (2γR_max / (1 − γ)²) || 1 / (1 + (1/|A|) e^{β_κ g}) ||_∞.\n\nThe proof is provided in Appendix A.1. Notice that Q̃ converges to Q* exponentially fast as κ (equivalently, β_κ) increases, provided the action gaps are all larger than zero. Notice that this result is of interest even outside our specific setting.\nThe second question we need to answer is whether we can provide any guarantee on our algorithm's performance when given limited data. To address this point, we consider the two variants of Algorithm 1 from Sections 3.2 and 3.3 with linear approximators. 
Specifically, we consider the family of linearly parameterized value functions Q_w(s, a) = w^T φ(s, a) with bounded weights ||w||₂ ≤ w_max and uniformly bounded features ||φ(s, a)||₂ ≤ φ_max. We assume only a finite dataset is available and provide a finite-sample analysis bounding the expected (mellow) Bellman error under the variational distribution minimizing the objective (2) for any fixed target task M_τ.\nTheorem 2. Let ξ̂ be the variational parameters minimizing the objective of Eq. (2) on a dataset D of N i.i.d. samples distributed according to M_τ and ν. Let w* = arginf_w ||B̃_w||²_ν and define υ(w*) := E_{N(w*, (1/N) I)}[ v(w) ], with v(w) := E_ν[ Var_{P_τ}[ b̃(w) ] ]. Then, there exist constants c_1, c_2, c_3 such that, with probability at least 1 − δ over the choice of the dataset D:\n\nE_{q_ξ̂}[ ||B̃_w||²_ν ] ≤ 2 ||B̃_{w*}||²_ν + υ(w*) + c_1 √(log(2/δ)/N) + (c_2 + λ d log N + λ φ(Ws))/N + c_3/N²,\n\nwhere φ(Ws) = ||w* − μ_p||_{Σ_p⁻¹} when the Gaussian version of Algorithm 1 is used with prior p(w) = N(μ_p, Σ_p) estimated from Ws, while:\n\nφ(Ws) = (1/σ²_p) Σ_{w ∈ Ws} ( e^{−β||w* − w||} / Σ_{w' ∈ Ws} e^{−β||w* − w'||} ) ||w* − w||     (5)\n\nis the softmin distance between the optimal and source weights when the mixture version of Algorithm 1 is used with C components and bandwidth σ²_p for the prior. 
Here, β = 1/(2σ²_p). We refer the reader to Appendix A.2 for the proof and a specific definition of the constants. Four main terms constitute our bound: the approximation error due to the limited hypothesis space (first term), the variance (second and third terms), the distance to the prior (fourth term), and a constant term decaying as O(N⁻²). As we might have expected, the only difference between the bounds for the two versions of Algorithm 1 is in the term φ(Ws), i.e., the distance between the optimal weights w* and the source weights Ws. Specifically, for the mixture version we have the (smoothed) minimum distance to the source tasks' weights (Equation (5)), while for the Gaussian one we have the distance to the mean of such weights. This property shows a clear advantage of the mixture version of Algorithm 1 over the Gaussian one: to tighten the bound, it is enough to have at least one source task that is close to the optimal solution of the target task. In fact, the Gaussian version requires the source tasks to be, on average, similar to the target task in order to perform well, while the mixture version only requires this property of one of them. In both cases, when the term φ(Ws) is small, the dominating error is due to the variance of the estimates, and, thus, the algorithm is expected to achieve good performance rather quickly as new data are collected. Furthermore, as N → ∞, the only error terms remaining are the irreducible approximation error due to the limited functional space and the variance term υ(w*). The latter is due to the fact that we minimize a biased estimate of the Bellman error and can be removed in cases where double sampling of the next state is possible (e.g., in simulation). We empirically verify these considerations in Section 6.\n\n5 Related Works\n\nOur approach is most closely related to [21]. 
Although both works assume the tasks to share similarities in their value functions, [21] considers only linear approximators and adopts a hierarchical Bayesian model of the corresponding weights' distribution, which is assumed Gaussian. On the other hand, our variational approximation allows for more general distribution families and can be combined with non-linear approximators. Furthermore, [21] proposes a Dirichlet process model for the case where weights cluster into different classes, which relates to our mixture formulation and highlights the importance of capturing more complicated task distributions. Finally, [21] considers the problem of jointly learning all given tasks, while we focus on transferring information from a set of source tasks to the target task. In [39], the authors propose a hierarchical Bayesian model for the distribution over MDPs. Unlike our approach and [21], they consider a distribution over transition probabilities and rewards, rather than value functions. In the same spirit as our method, they consider a Thompson sampling-based procedure which, at each iteration, samples a new task from the posterior and solves it. However, [39] considers only finite MDPs, which poses a severe limitation on the algorithm's applicability. On the contrary, our approach can handle high-dimensional tasks. In [10], the authors consider a family of tasks whose dynamics are governed by some hidden parameters and use Gaussian processes (GPs) to model such dynamics across tasks. Recently, [16] extended this approach by replacing GPs with Bayesian neural networks to obtain a more scalable method. Both approaches result in model-based algorithms that quickly adapt to new tasks by estimating their hidden parameters, while we propose a model-free method which does not require such assumptions.\nFinally, our approach relates to recent algorithms for meta-learning/fast adaptation of weights in neural networks [12, 13, 2]. 
Such approaches typically assume full access to the task distribution D (i.e., samples from D can be obtained on demand) and build meta-models that quickly adapt to new tasks drawn from the same distribution. On the other hand, we assume that only a fixed and limited set of source tasks, together with their approximate solutions, is available. Then, our goal is to speed up the learning process on a new target task from D by transferring only these data, without requiring additional source tasks or experience samples from them.\n\n6 Experiments\n\nIn this section, we provide an experimental evaluation of our approach in four different domains of increasing difficulty. In all experiments, we compare our Gaussian variational transfer algorithm (GVT) and the version using a c-component mixture of Gaussians (c-MGVT) to plain no-transfer RL (NT) with ε-greedy exploration, and to a simple transfer baseline in which we randomly pick one source Q-function and fine-tune from its weights (FT). Finally, in Section 6.4 we empirically demonstrate the differences between our approach and the previously discussed fast-adaptation algorithms. We report the detailed parameters, together with additional results, in Appendix C.\n\n6.1 The Rooms Problem\n\nFigure 1: Rooms problem.\n\nWe consider an agent navigating in the environment depicted in Figure 1. The agent starts in the bottom-left corner and must move from one room to another to reach the goal position in the top-right corner. The rooms are connected by small doors whose locations are unknown to the agent. The state space is modeled as a 10 × 10 continuous grid, while the action space is the set of 4 movement directions (up, right, down, left). After each action, the agent moves by 1 in the chosen direction, and the final position is corrupted by Gaussian noise N(0, 0.2). In case the agent hits a wall, its position remains unchanged. 
The reward is 1 when reaching the goal (after which the process terminates) and 0 otherwise, while the discount factor is γ = 0.99. In this experiment, we consider linearly parameterized Q-functions with 121 equally-spaced radial basis features.
We generate a set of 50 source tasks for the three-room environment of Figure 1 by sampling both door locations uniformly in the allowed space, and solve all of them by directly minimizing the TD error as presented in Section 3.4. Then, we use our algorithms to transfer from 10 source tasks sampled from the previously generated set. The average return over the last 50 learning episodes as a function of the number of iterations is shown in Figure 2a. Each curve is the result of 20 independent runs, each one resampling the target and source tasks, with 95% confidence intervals. Further details on the parameters adopted in this experiment are given in Appendix C.1. As expected, the no-transfer (NT) algorithm fails to learn the task in so few iterations due to the limited exploration provided by an ε-greedy policy. On the other hand, all our algorithms achieve a significant speed-up and converge to the optimal performance in a few iterations, with GVT being slightly slower. FT achieves good performance as well, but it takes longer to adapt a random source Q-function. Interestingly, we notice that there is no advantage in adopting more than 1 component for the posterior in MGVT. This result is intuitive since, as soon as the algorithm identifies the target task, all the components move towards the same region.
To better understand the differences between GVT and MGVT, we now consider transferring from a slightly different distribution than the one from which target tasks are drawn. We generate 50 source tasks again, but this time with the bottom door fixed at the center and the other one moving.
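The linear parameterization used in this domain can be sketched as follows. Placing the 121 centers on an 11 × 11 grid and the Gaussian feature width are our assumptions about "equally-spaced radial basis features"; this is a minimal illustration, not the authors' code.

```python
import numpy as np

# 121 Gaussian RBF centers on an 11 x 11 grid covering the 10 x 10 state space.
centers = np.array([[x, y] for x in np.linspace(0.0, 10.0, 11)
                           for y in np.linspace(0.0, 10.0, 11)])  # shape (121, 2)

def features(state, width=1.0):
    """Gaussian radial basis features phi(s) for a 2-D position."""
    sq_dists = np.sum((centers - np.asarray(state)) ** 2, axis=1)
    return np.exp(-sq_dists / (2.0 * width ** 2))  # shape (121,)

def q_value(state, action, W):
    """Linear Q-function: one weight vector per action, W has shape (4, 121)."""
    return W[action] @ features(state)
```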
Then, we repeat the previous experiment, allowing both doors to move when sampling target tasks. The results are shown in Figure 2b. Interestingly, MGVT seems almost unaffected by this change, proving that it has sufficient representation power to generalize to slightly different task distributions. The same does not hold for GVT, which is now unable to solve many of the sampled target tasks, as can be seen from the higher variance. Furthermore, the good performance of FT proves that GVT is, indeed, subject to a loss of information due to averaging the source weights. This result confirms that assuming Gaussian distributions can pose severe limitations in our transfer settings.

Figure 2: The online expected return achieved by each algorithm (NT, FT, GVT, 1-MGVT, 3-MGVT) as a function of the number of iterations: (a) Rooms: two doors moving; (b) Rooms: one door moving; (c) Cartpole; (d) Mountain Car; (e) Maze Navigation (first maze); (f) Maze Navigation (second maze). Each curve is the average of 20 independent runs; 95% confidence intervals are shown.

6.2 Classic Control

We now consider two well-known classic control environments: Cartpole and Mountain Car [34]. For both, we generate 20 source tasks by uniformly sampling their physical parameters (cart mass, pole mass, and pole length for Cartpole; car speed for Mountain Car) and solve them by directly minimizing the TD error as in the previous experiment. We parameterize Q-functions using neural networks with one layer of 32 hidden units for Cartpole and 64 for Mountain Car. A more detailed description of these two environments and their parameters is given in Appendix C.2. In this experiment, we use a Double Deep Q-Network (DDQN) [38] to provide a stronger no-transfer baseline for comparison. The results (same settings as in Section 6.1) are shown in Figures 2c and 2d.
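The bootstrap target computed by this DDQN baseline follows the standard Double Q-learning rule of [38]: the online network selects the next action, while the target network evaluates it, reducing overestimation. The sketch below is a generic illustration; the array shapes and names are ours.

```python
import numpy as np

def ddqn_targets(rewards, next_q_online, next_q_target, dones, gamma=0.99):
    """Double DQN bootstrap targets for a batch of transitions.
    next_q_online, next_q_target: arrays of shape (batch, n_actions),
    holding Q(s', .) under the online and target networks respectively."""
    best_actions = np.argmax(next_q_online, axis=1)            # online net selects
    next_values = next_q_target[np.arange(len(rewards)), best_actions]  # target net evaluates
    return rewards + gamma * (1.0 - dones) * next_values       # terminal states bootstrap to 0
```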
For Cartpole (Figure 2c), all variational transfer algorithms are almost zero-shot. This result is expected since, although we vary the system parameters in a wide range, the optimal Q-values of states near the balanced position are similar across all tasks. On the contrary, in Mountain Car (Figure 2d) the optimal Q-functions become very different when changing the car speed. This phenomenon hinders the learning of GVT in the target task, while MGVT achieves a good jump-start and converges in fewer iterations. As in the Rooms domain, the naive weight adaptation of FT makes it slower than MGVT in both domains.

6.3 Maze Navigation

Finally, we consider a robotic agent navigating mazes. At the beginning of each episode, the agent is dropped at a random position in a 10m² maze and must reach a goal area in the shortest time possible. The robot is equipped with sensors detecting its absolute position, its orientation, the distance to any obstacle within 2m in 9 equally-spaced directions, and whether the goal is present in the same range. The only actions available are move forward at a speed of 0.5m/s or rotate (in either direction) at a speed of π/8 rad/s. Each time step corresponds to 1s of simulation. The reward is 1 for reaching the goal and 0 otherwise, while the discount factor is γ = 0.99. For this experiment, we design a set of 20 different mazes and solve them using a DDQN with two layers of 32 neurons and ReLU activations. Then, we fix a target maze and transfer from 5 source mazes uniformly sampled from this set (excluding the chosen target). To further assess the robustness of our method, we now consider transferring from the Q-functions learned by DDQNs instead of those obtained by minimizing the TD error as in the previous domains.
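Recall that c-MGVT maintains a c-component Gaussian-mixture posterior over Q-function weights. Drawing a weight vector from such a mixture can be sketched as below; the diagonal-covariance form and all names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def sample_mixture_weights(means, vars_diag, mix_probs, rng):
    """Draw one Q-weight vector from a c-component diagonal-Gaussian mixture:
    first pick a component according to its mixing probability, then sample
    from that component's Gaussian.
    means, vars_diag: arrays of shape (c, d); mix_probs: array of shape (c,)."""
    k = rng.choice(len(mix_probs), p=mix_probs)                  # pick a component
    noise = rng.standard_normal(means[k].shape)
    return means[k] + np.sqrt(vars_diag[k]) * noise              # Gaussian sample
```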
From our considerations in Sections 3.4 and 4, the fixed points of the two algorithms are different, which creates a further challenge for our method. We show the results for two different target mazes in Figures 2e and 2f, while referring the reader to Appendix C.3 for their illustration and additional results. Once again, MGVT achieves a remarkable speed-up over (no-transfer) DDQN. This time, using 3 components achieves slightly better performance than using only 1, which is likely due to the fact that the task distribution is much more complicated than in the previous domains. For the same reason, GVT shows negative transfer and performs even worse than DDQN. Similarly, FT performs much worse than in the previous domains and transfers negatively in the more complicated target maze of Figure 2e.

Figure 3: MAML vs 3-MGVT in our navigation problems: (a) The Rooms Problem; (b) Maze Navigation.

6.4 A Comparison to Fast-Adaptation Algorithms

In order to provide a better understanding of the differences between our settings and the ones typically considered by fast-adaptation algorithms, we now show a comparison to the recently proposed meta-learner MAML [12]. We repeat the previous experiments, focusing on the navigation tasks, using two different versions of MAML. In the first one (MAML-full), we perform meta-training using the full distribution over tasks for a number of iterations that allows the meta-policy to converge.
In the second one (MAML-batch), we perform meta-training only on the same number of fixed source tasks as that used for our algorithm, again allowing the meta-policy to reach convergence. In both cases, we perform meta-testing on random tasks sampled from the full distribution. The results are shown in Figure 3 in comparison to our best algorithm (3-MGVT), where each curve is obtained by averaging 5 meta-testing runs for each of 4 different meta-policies. Additional details are given in Appendix C.4. In both cases, the full version of MAML achieves a much better jump-start and adapts much faster than our approach. However, this is no longer the case when limiting the number of source tasks. In fact, this situation reduces to the case in which the task distribution at meta-training is a discrete uniform over the fixed source tasks, while at meta-testing the algorithm is required to generalize to a different distribution. This case arises quite frequently in practice, and MAML was not specifically designed for it. Performance degrades even further when we explicitly add a shift to the meta-training distribution, as we did in Figure 2b for the rooms problem (MAML-shift in Figure 3a). Although we meta-trained on the full distribution, the final performance was even worse than the one using the fixed source tasks. Finally, notice that we compare the algorithms w.r.t. the number of gradient steps, even though our approach collects only one new sample at each iteration while MAML collects a full batch of trajectories.

7 Conclusion

We presented a variational method for transferring value functions in RL.
We showed our approach to be general, in the sense that it can be combined with several distributions and function approximators, while providing two practical algorithms based on Gaussians and mixtures of Gaussians, respectively. We analyzed both from a theoretical and an empirical perspective, showing that the Gaussian version has severe limitations, while the mixture version is much better suited to our transfer settings. We evaluated the proposed algorithms in different domains, showing that both achieve excellent performance in simple tasks, while only the mixture version is able to handle complex environments.
Since our algorithm effectively models the uncertainty over tasks, a relevant direction for future work is to design an algorithm that explicitly explores the target task to reduce such uncertainty. Furthermore, our variational approach could be extended to model a distribution over optimal policies instead of value functions, which might allow better transferred behavior.

Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Tesla K40c, Titan XP and Tesla V100 used for this research.

References

[1] Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1-41, 2016.

[2] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[3] Kavosh Asadi and Michael L. Littman. An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, pages 243-252, 2017.

[4] Kamyar Azizzadenesheli, Emma Brunskill, and Animashree Anandkumar.
Efficient exploration through Bayesian deep Q-networks. arXiv preprint arXiv:1802.04412, 2018.

[5] Leemon Baird. Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30-37. Elsevier, 1995.

[6] André Barreto, Will Dabney, Rémi Munos, Jonathan J. Hunt, Tom Schaul, Hado P. van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In Advances in Neural Information Processing Systems, pages 4055-4065, 2017.

[7] David M. Blei, Alp Kucukelbir, and Jon D. McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859-877, 2017.

[8] Sébastien Bubeck, Nicolò Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1-122, 2012.

[9] Olivier Catoni. PAC-Bayesian supervised classification: The thermodynamics of statistical learning. arXiv preprint arXiv:0712.0248, 2007.

[10] Finale Doshi-Velez and George Konidaris. Hidden parameter Markov decision processes: A semiparametric regression approach for discovering latent task parametrizations. In IJCAI: Proceedings of the Conference, volume 2016, page 1432, 2016.

[11] Fernando Fernández and Manuela Veloso. Probabilistic policy reuse in a reinforcement learning agent. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, pages 720-727. ACM, 2006.

[12] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.

[13] Erin Grant, Chelsea Finn, Sergey Levine, Trevor Darrell, and Thomas Griffiths. Recasting gradient-based meta-learning as hierarchical Bayes.
arXiv preprint arXiv:1801.08930, 2018.

[14] John R. Hershey and Peder A. Olsen. Approximating the Kullback-Leibler divergence between Gaussian mixture models. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2007.

[15] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303-1347, 2013.

[16] Taylor W. Killian, Samuel Daulton, George Konidaris, and Finale Doshi-Velez. Robust and efficient transfer learning with hidden parameter Markov decision processes. In Advances in Neural Information Processing Systems, pages 6250-6261, 2017.

[17] Jens Kober and Jan R. Peters. Policy search for motor primitives in robotics. In Advances in Neural Information Processing Systems, pages 849-856, 2009.

[18] George Konidaris and Andrew Barto. Autonomous shaping: Knowledge transfer in reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 489-496. ACM, 2006.

[19] George Konidaris and Andrew G. Barto. Building portable options: Skill transfer in reinforcement learning. In IJCAI, 2007.

[20] Alessandro Lazaric. Transfer in reinforcement learning: A framework and a survey. In Reinforcement Learning. 2012.

[21] Alessandro Lazaric and Mohammad Ghavamzadeh. Bayesian multi-task reinforcement learning. In ICML - 27th International Conference on Machine Learning, pages 599-606. Omnipress, 2010.

[22] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples in batch reinforcement learning. In Proceedings of the 25th International Conference on Machine Learning, 2008.

[23] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. End-to-end training of deep visuomotor policies.
The Journal of Machine Learning Research, 17(1):1334-1373, 2016.

[24] Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[25] Odalric-Ambrym Maillard, Rémi Munos, Alessandro Lazaric, and Mohammad Ghavamzadeh. Finite-sample analysis of Bellman residual minimization. In Proceedings of the 2nd Asian Conference on Machine Learning, pages 299-314, 2010.

[26] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 2015.

[27] Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. arXiv preprint arXiv:1402.0635, 2014.

[28] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.

[29] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[30] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. arXiv preprint arXiv:1511.05952, 2015.

[31] David W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.

[32] Yevgeny Seldin, François Laviolette, Nicolò Cesa-Bianchi, John Shawe-Taylor, and Peter Auer. PAC-Bayesian inequalities for martingales. IEEE Transactions on Information Theory, 2012.

[33] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, et al. Mastering the game of Go with deep neural networks and tree search.
Nature, 529(7587):484-489, 2016.

[34] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[35] Matthew E. Taylor, Nicholas K. Jong, and Peter Stone. Transferring instances for model-based reinforcement learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 488-505. Springer, 2008.

[36] Matthew E. Taylor and Peter Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(Jul):1633-1685, 2009.

[37] Andrea Tirinzoni, Andrea Sessa, Matteo Pirotta, and Marcello Restelli. Importance weighted transfer of samples in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4936-4945. PMLR, 2018.

[38] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2016.

[39] Aaron Wilson, Alan Fern, Soumya Ray, and Prasad Tadepalli. Multi-task reinforcement learning: A hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning, pages 1015-1022. ACM, 2007.