{"title": "Variance Reduced Policy Evaluation with Smooth Function Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 5784, "page_last": 5795, "abstract": "Policy evaluation with smooth and nonlinear function approximation has shown great potential for reinforcement learning. Compared to linear function approxi- mation, it allows for using a richer class of approximation functions such as the neural networks. Traditional algorithms are based on two timescales stochastic approximation whose convergence rate is often slow. This paper focuses on an offline setting where a trajectory of $m$ state-action pairs are observed. We formulate the policy evaluation problem as a non-convex primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it converges to an $\\epsilon$-stationary point using $O(m/\\epsilon)$ calls (in expectation) to a gradient oracle.", "full_text": "Variance Reduced Policy Evaluation with Smooth\n\nFunction Approximation\n\nHoi-To Wai\n\nThe Chinese University of Hong Kong\n\nShatin, Hong Kong\n\nhtwai@se.cuhk.edu.hk\n\nMingyi Hong\n\nUniversity of Minnesota\nMinneapolis, MN, USA\n\nmhong@umn.edu\n\nZhuoran Yang\n\nPrinceton University\nPrinceton, NJ, USA\nzy6@princeton.edu\n\nZhaoran Wang\n\nNorthwestern University\n\nEvanston, IL, USA\n\nzhaoranwang@gmail.com\n\nKexin Tang\n\nUniversity of Minnesota\nMinneapolis, MN, USA\n\ntangk@umn.edu\n\nAbstract\n\nPolicy evaluation with smooth and nonlinear function approximation has shown\ngreat potential for reinforcement learning. Compared to linear function approxi-\nmation, it allows for using a richer class of approximation functions such as the\nneural networks. Traditional algorithms are based on two timescales stochastic\napproximation whose convergence rate is often slow. 
This paper focuses on an offline setting where a trajectory of $m$ state-action pairs is observed. We formulate the policy evaluation problem as a non-convex primal-dual, finite-sum optimization problem, whose primal sub-problem is non-convex and dual sub-problem is strongly concave. We suggest a single-timescale primal-dual gradient algorithm with variance reduction, and show that it converges to an $\epsilon$-stationary point using $O(m/\epsilon)$ calls (in expectation) to a gradient oracle.

1 Introduction

In reinforcement learning (RL) [39], policy evaluation aims to estimate the value function that corresponds to a given policy. It serves as a crucial step in policy optimization algorithms [19, 17, 34, 35] for solving RL tasks. Perhaps the most popular family of methods is temporal-difference (TD) learning [9], which estimates the value function by minimizing loss functions based on the Bellman equation. These methods readily incorporate function approximation and have achieved great empirical success, e.g., when the value functions are parametrized by deep neural networks [26, 36]. In contrast to the wide application of policy evaluation with nonlinear function approximation, most analytical results on policy evaluation focus on the linear setting [41, 40, 23, 14, 42, 45, 3, 37, 8]. However, when it comes to nonlinear function approximation, TD methods can diverge [2, 43]. As a remedy, Bhatnagar et al. [4] proposed an online algorithm for minimizing a generalized mean-squared projected Bellman error (MSPBE) with smooth and nonlinear value functions. Asymptotic convergence of this algorithm is established via two-timescale stochastic approximation [5, 18] with diminishing step sizes. In a similar vein, Chung et al.
[7] established the convergence of TD-learning with neural networks, utilizing different step sizes for the top layer and the lower layers. However, non-asymptotic convergence for nonlinear policy evaluation remains an open problem, illustrating a clear gap between theory and practice.
In this work, we make a first attempt to bridge this gap by studying policy evaluation with smooth and nonlinear function approximation. We focus on the offline setting where we are provided with $m$ consecutive transitions from the policy to be evaluated, which is an important RL regime [20] and is closely related to the technique of experience replay [21]. Our contributions are two-fold:

• We recast the MSPBE minimization problem as a primal-dual optimization via Fenchel duality. Here, the objective function is a finite sum that is non-convex in the primal and strongly concave in the dual, constituting a one-sided non-convex primal-dual optimization problem.

• A variance reduced algorithm [cf. nPD-VR algorithm in Algorithm 1] is developed and applied to tackle the nonlinear policy evaluation problem. The algorithm performs primal-dual updates based on a single transition and has low per-iteration computational complexity. Unlike existing algorithms, the proposed algorithm uses a fixed set of step sizes, which is easier to tune, and requires only a single for-loop to implement. We analyze the non-asymptotic performance of the algorithm and show that it converges to an $\epsilon$-stationary point of the MSPBE within $O(m/\epsilon)$ calls to a gradient oracle, in expectation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Note that the resulting optimization problem is strictly more challenging than the plain non-convex minimization problems of recent interest, e.g., [30, 1, 15].
For instance, naive gradient-based updates for this problem might exhibit bizarre behaviors such as cycling [10]. To the best of our knowledge, the result in this paper constitutes the first convergence rate analysis for variance reduced policy evaluation with smooth and nonlinear function approximation.

Related Work
Our work extends the research on policy evaluation with linear function approximation [41, 40, 23, 38, 14, 42, 8, 45, 3, 44, 6, 37, 13]; see [9] for a comprehensive review. Among these works, ours is closely related to [14, 44, 6], which study single- and multi-agent policy evaluation in the offline setting. They also utilize Fenchel duality to obtain primal-dual optimization problems with a finite-sum structure, for which they provide variance-reduced optimization algorithms. Thanks to the linear function approximation, their objectives are strongly convex-concave, which enables a linear rate of convergence. Furthermore, [4, 7] appear to be the only convergent policy evaluation results with nonlinear function approximation. Both of their algorithms utilize two-timescale step sizes, which may yield slow convergence. Moreover, their convergence results depend on two-timescale stochastic approximation [5, 18], which uses the trajectory of an ODE to approximate that of a stochastic process. When specialized to an offline setting similar to ours, [4] can be viewed as the primal-dual stochastic gradient algorithm for our problem.
From the optimization point of view, the non-convex primal-dual optimization (a.k.a. non-convex min-max) problems that arise in the above nonlinear policy evaluation setting are difficult to tackle. Although recent works have focused on non-convex minimization problems [30, 1, 15], only a few have addressed non-convex min-max problems. Recently, Daskalakis and Panageas [10] and Daskalakis et al.
[11] studied the convergence of vanilla gradient descent/ascent (GDA), focusing on bilinear problems (thus without a non-convex component). An optimistic mirror descent algorithm is proposed in [25], and its convergence to a saddle point is established under a certain strong coherence assumption. In [28], algorithms for robust machine learning problems are proposed, where the problem is linear on one side and non-convex on the other. In [29], a proximally guided stochastic mirror descent method (PG-SMD) is proposed, which updates the variables simultaneously while adopting a double-loop rule in which the variables are updated in "stages". These algorithms yield convergence rates of order $O(1/\sqrt{K})$ and $O(1/K^{1/4})$, respectively. Recently, an oracle-based non-convex stochastic gradient descent method for generative adversarial networks was proposed in [31, 32], where the algorithm solves the maximization subproblem up to some small error. It was shown in [24] that a deterministic gradient descent/ascent min-max algorithm has an $O(1/K)$ convergence rate. In [27], an $O(1/K)$ convergence rate was proved for nonconvex-nonconcave min-max problems under Polyak-Łojasiewicz conditions.

Organization
In §2 we describe the setup for the policy evaluation problem with smooth (possibly nonlinear) function approximation. In §3 we describe the variance reduced method for policy evaluation and present results from a preliminary numerical experiment. In §4 we provide the main convergence result for the proposed variance reduced method, together with a few key lemmas and a proof outline.

2 Markov Decision Process and Nonlinear Function Approximation

Consider a Markov Decision Process (MDP) defined by $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$. We denote $\mathcal{S}$ as the state space and $\mathcal{A}$ as the action space; note that both $\mathcal{S}$ and $\mathcal{A}$ can be infinite.
Let $s \in \mathcal{S}$, $a \in \mathcal{A}$ be a state and an action, respectively. For each $a \in \mathcal{A}$, the operator $P^a$ is a Markov kernel describing the state transition upon taking action $a$. For any measurable function $f$ on $\mathcal{S}$, we have
$$ (P^a f)(s) = \int_{\mathcal{S}} P^a(s, s') f(s') \, \mu(ds'), \quad \forall\, s \in \mathcal{S}. \quad (1) $$
Lastly, the reward function $R(s, a)$ is the reward received after taking action $a$ in state $s$, and $\gamma \in (0, 1)$ is the discount factor.
A policy $\pi$ is defined through the conditional probability $\pi(a|s)$ of taking action $a$ given the current state $s$. Given a policy $\pi$, the expected instantaneous reward at state $s$ is defined as:
$$ R^\pi(s) := \mathbb{E}_{a \sim \pi(\cdot|s)}\big[R(s, a)\big], \quad \forall\, s \in \mathcal{S}. \quad (2) $$
In the policy evaluation problem, we are interested in the value function $V : \mathcal{S} \to \mathbb{R}$, defined as the discounted total reward over an infinite horizon with the initial state fixed at $s \in \mathcal{S}$:
$$ V(s) := \mathbb{E}\Big[ \textstyle\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \,\Big|\, s_0 = s,\; a_t \sim \pi(\cdot|s_t),\; s_{t+1} \sim P^{a_t}(s_t, \cdot) \Big]. \quad (3) $$
Let $\mathcal{M}(\mathcal{S})$ be the space of value functions on the state space $\mathcal{S}$; we define the Bellman operator $T^\pi : \mathcal{M}(\mathcal{S}) \to \mathcal{M}(\mathcal{S})$ as:
$$ (T^\pi f)(s) := \mathbb{E}\big[ R(s, a) + \gamma f(s') \,\big|\, a \sim \pi(\cdot|s),\; s' \sim P^a(s, \cdot) \big], \quad \forall\, s \in \mathcal{S}, \quad (4) $$
where $f$ is any measurable function defined on $\mathcal{S}$. Denote by $V(s)$ (resp. $R^\pi(s)$) the discounted total (resp. expected instantaneous) reward of the policy when initialized at a state $s \in \mathcal{S}$.
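For intuition, the Bellman operator of (4) can be sketched on a small finite MDP. The 3-state, 2-action instance below is our own made-up illustration (random kernels and rewards), not an example from the paper:

```python
import numpy as np

# A toy 3-state, 2-action MDP -- a made-up illustration, not from the paper.
n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_s), size=(n_a, n_s))   # P[a][s] = distribution of s'
R = rng.standard_normal((n_s, n_a))                # R(s, a)
pi = rng.dirichlet(np.ones(n_a), size=n_s)         # pi[s] = distribution of a

def bellman(f):
    """(T^pi f)(s) = sum_a pi(a|s) [ R(s,a) + gamma * sum_{s'} P^a(s,s') f(s') ]."""
    q = R + gamma * np.stack([P[a] @ f for a in range(n_a)], axis=1)   # q[s, a]
    return (pi * q).sum(axis=1)

# T^pi is a gamma-contraction, so iterating it from any starting function
# converges geometrically to the value function V of (3), its unique fixed point:
V = np.zeros(n_s)
for _ in range(500):
    V = bellman(V)
```

The fixed point can be cross-checked by solving the linear system $(I - \gamma P^\pi) V = R^\pi$ directly, where $P^\pi$ averages the kernels $P^a$ under the policy.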
The Bellman equation [39] shows that the value function $V : \mathcal{S} \to \mathbb{R}$ satisfies
$$ V(s) = R^\pi(s) + \gamma (P^\pi V)(s) = (T^\pi V)(s), \quad \forall\, s \in \mathcal{S}, \quad (5) $$
where we define the operator $P^\pi(\cdot,\cdot)$ as the expected Markov kernel under the policy $\pi$:
$$ P^\pi(s, s') := \int_{\mathcal{A}} P^a(s, s') \, \pi(a|s) \, \mu(da), \quad \forall\, (s, s') \in \mathcal{S} \times \mathcal{S}. \quad (6) $$
In this sense, the policy evaluation problem refers to solving for the $V : \mathcal{S} \to \mathbb{R}$ that satisfies (5).

2.1 Nonlinear Function Approximation

Solving for the function $V : \mathcal{S} \to \mathbb{R}$ in (5) is a non-trivial task, since the state space $\mathcal{S}$ is large (or even infinite) and the expected Markov kernel $P^\pi(\cdot,\cdot)$ is unknown. To address the first issue, a common approach is to approximate $V(s)$ by a parametric family of functions.
This paper considers approximating $V : \mathcal{S} \to \mathbb{R}$ from the family of parametric and smooth functions given by $\mathcal{F} = \{V_\theta : \theta \in \Theta\}$, where $\theta$ is a $d$-dimensional parameter vector and $\Theta$ is a compact, convex subset of $\mathbb{R}^d$. Note that $\mathcal{F}$ forms a differentiable manifold. For each $\theta$, $V_\theta$ is a map from $\mathcal{S}$ to $\mathbb{R}$, and the function is non-linear w.r.t. $\theta$. As we consider a family of smooth functions, the gradient and Hessian of $V_\theta(s)$ w.r.t. $\theta$ exist, and they are denoted by
$$ g_\theta(s) := (\nabla_\theta V_\theta)(s) \in \mathbb{R}^d, \qquad H_\theta(s) := (\nabla^2_\theta V_\theta)(s) \in \mathbb{R}^{d \times d}, \quad (7) $$
for each $s \in \mathcal{S}$ and $\theta \in \Theta$. We define $G_\theta := \mathbb{E}_{s \sim p^\pi(\cdot)}[g_\theta(s) g_\theta^\top(s)] \in \mathbb{R}^{d \times d}$, where $p^\pi(\cdot)$ is the stationary distribution of the MDP under policy $\pi$.
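As a concrete instance of (7), consider the toy smooth approximator $V_\theta(s) = \tanh(\theta^\top s)$; this is our own illustrative choice, not the network used in the paper. Its gradient and Hessian are available in closed form, and $G_\theta$ can be estimated from sampled states:

```python
import numpy as np

# Toy smooth family V_theta(s) = tanh(theta^T s) -- an illustrative choice,
# not the neural network used in the paper's experiments.
rng = np.random.default_rng(0)
d, n_samples = 4, 200
theta = rng.standard_normal(d)
states = rng.standard_normal((n_samples, d))   # stand-ins for samples s ~ p^pi

def V(th, s):
    return np.tanh(th @ s)

def g(th, s):
    # g_theta(s) = (1 - tanh^2(theta^T s)) s, the gradient of V w.r.t. theta
    return (1.0 - np.tanh(th @ s) ** 2) * s

def H(th, s):
    # H_theta(s) = -2 tanh(u)(1 - tanh^2(u)) s s^T with u = theta^T s
    t = np.tanh(th @ s)
    return -2.0 * t * (1.0 - t ** 2) * np.outer(s, s)

# Empirical estimate of G_theta = E[g_theta(s) g_theta(s)^T]; it is positive
# semidefinite by construction, and positive definite when the sampled
# gradients span R^d.
G_hat = np.mean([np.outer(g(theta, s), g(theta, s)) for s in states], axis=0)
```

The closed-form $g$ and $H$ can be checked against finite differences of $V$, and the eigenvalues of `G_hat` confirm the positive (semi)definiteness that the analysis below relies on.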
Throughout this paper, we assume that $G_\theta$ is a positive definite matrix for all $\theta \in \Theta$.
To find the best parameter $\theta^\star$ such that $V_{\theta^\star} : \mathcal{S} \to \mathbb{R}$ is the closest approximation to a value function $V$ that satisfies (5), Bhatnagar et al. [4] proposed to minimize the mean squared projected Bellman error (MSPBE), defined as follows:
$$ J(\theta) := \tfrac{1}{2} \big\| \Pi_\theta \big( T^\pi V_\theta - V_\theta \big) \big\|^2_{p^\pi(\cdot)}, \quad (8) $$
where the weighted norm $\|V\|^2_{p^\pi(\cdot)} = \int_{\mathcal{S}} p^\pi(s) |V(s)|^2 \mu(ds)$ is defined with the stationary distribution $p^\pi(s)$, and $\Pi_\theta$ is a projection onto the space of nonlinear functions $\mathcal{F}$ w.r.t. the metric $\|\cdot\|_{p^\pi(\cdot)}$, i.e., for any $f : \mathcal{S} \to \mathbb{R}$, we have $\Pi_\theta f = \arg\min_{V_\theta \in \mathcal{F}} \|f - V_\theta\|^2_{p^\pi(\cdot)}$. The following identities are shown in [4]:
$$ J(\theta) = \tfrac{1}{2} \, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s)) g_\theta(s)\big]^\top G_\theta^{-1} \, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s)) g_\theta(s)\big] $$
$$ = \tfrac{1}{2} \, \Big\| \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s)) g_\theta(s)\big] \Big\|^2_{G_\theta^{-1}} \quad (9) $$
$$ = \max_{w \in \mathbb{R}^d} \Big\{ -\tfrac{1}{2} \, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(w^\top g_\theta(s))^2\big] + \big\langle w, \, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s)) g_\theta(s)\big] \big\rangle \Big\}, $$
where the last equality is due to Fenchel duality.
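The last step of (9) is the standard conjugate trick for a positive definite quadratic. Writing $b(\theta)$ as a shorthand (our notation) for the expectation, the inner maximization can be solved in closed form:

```latex
% Shorthand: b(\theta) := \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s))\, g_\theta(s)\big],
% and note that \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(w^\top g_\theta(s))^2\big] = w^\top G_\theta w.
\max_{w \in \mathbb{R}^d} \Big\{ \langle w, b(\theta) \rangle - \tfrac{1}{2} w^\top G_\theta w \Big\}:
\qquad 0 = b(\theta) - G_\theta w^\star
\;\Longrightarrow\; w^\star = G_\theta^{-1} b(\theta).
```

Substituting $w^\star$ back gives $\langle G_\theta^{-1} b(\theta), b(\theta) \rangle - \tfrac{1}{2} b(\theta)^\top G_\theta^{-1} b(\theta) = \tfrac{1}{2} \|b(\theta)\|^2_{G_\theta^{-1}} = J(\theta)$, matching the second line of (9); the dual variable $w$ thus plays the role of $G_\theta^{-1} b(\theta)$ at optimality.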
With the above equivalence, the MSPBE minimization problem can be reformulated as a primal-dual optimization problem:
$$ \min_{\theta \in \Theta} \max_{w \in \mathbb{R}^d} L(\theta, w), \quad (10) $$
where
$$ L(\theta, w) := \big\langle w, \, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s)) g_\theta(s)\big] \big\rangle - \tfrac{1}{2} \, \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(w^\top g_\theta(s))^2\big]. \quad (11) $$
For convenience, we call $\theta$ the primal variable and $w$ the dual variable. For any fixed $\theta \in \Theta$, the function $L(\theta, w)$ is strongly concave in $w$ since $G_\theta$ is positive definite. Moreover, the primal and dual gradients are given respectively by:
$$ \nabla_\theta L(\theta, w) = \mathbb{E}_{s \sim p^\pi(\cdot)}\big[\big(T^\pi V_\theta(s) - V_\theta(s) - g_\theta^\top(s) w\big) H_\theta(s) w\big] + \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(g_\theta^\top(s) w)\big(\gamma \, \mathbb{E}_{s' \sim p^\pi(\cdot|s)}[g_\theta(s')] - g_\theta(s)\big)\big], \quad (12) $$
$$ \nabla_w L(\theta, w) = \mathbb{E}_{s \sim p^\pi(\cdot)}\big[(T^\pi V_\theta(s) - V_\theta(s)) g_\theta(s)\big] - \mathbb{E}_{s \sim p^\pi(\cdot)}\big[g_\theta(s) g_\theta^\top(s) w\big]. \quad (13) $$
The above follows from the gradient of the temporal difference error:
$$ \nabla_\theta \big(T^\pi V_\theta(s) - V_\theta(s)\big) = \gamma \, \mathbb{E}\big[g_\theta(s') \,\big|\, s' \sim P^a(s, \cdot),\; a \sim \pi(\cdot|s)\big] - g_\theta(s), $$
and we denote by $\mathbb{E}_{s' \sim p^\pi(\cdot|s)}[g_\theta(s')]$ the expectation considered above. The primal-dual projected gradient algorithm proceeds as
$$ \theta^{(k+1)} = \mathcal{P}_\Theta\big(\theta^{(k)} - \alpha_{k+1} \nabla_\theta L(\theta^{(k)}, w^{(k)})\big), \qquad w^{(k+1)} = w^{(k)} + \beta_{k+1} \nabla_w L(\theta^{(k)}, w^{(k)}), \quad (14) $$
where $\mathcal{P}_\Theta$ denotes the Euclidean projection onto the set $\Theta$.
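To make the iteration (14) concrete, here is a minimal sketch on a synthetic instance with the same one-sided structure. The nonlinear map $b(\theta) = A\theta + \sin(\theta)$ and the fixed positive definite $G$ are our own stand-ins for the expectations in (11), not the paper's model:

```python
import numpy as np

# Sketch of the primal-dual projected gradient iteration (14) on a synthetic
# instance of (10): L(theta, w) = <w, b(theta)> - 0.5 w^T G w, where
# b(theta) = A theta + sin(theta) and G are made-up stand-ins for the
# expectations in (11)-(13); they are not the paper's model.
A = np.array([[2.0, 0.5], [0.5, 1.5]])
G = np.array([[2.0, 0.3], [0.3, 1.0]])   # positive definite => strong concavity in w

def b_vec(th):
    return A @ th + np.sin(th)

def grad_theta(th, w):
    # nabla_theta L = J_b(theta)^T w, with J_b(theta) = A + diag(cos(theta))
    return (A + np.diag(np.cos(th))).T @ w

def grad_w(th, w):
    # nabla_w L = b(theta) - G w
    return b_vec(th) - G @ w

alpha, beta = 0.02, 0.1                  # primal / dual step sizes
theta, w = np.array([0.8, -0.5]), np.zeros(2)
for _ in range(5000):
    # both gradients are evaluated at the current pair (theta^{(k)}, w^{(k)})
    theta_next = np.clip(theta - alpha * grad_theta(theta, w), -1.0, 1.0)  # P_Theta
    w = w + beta * grad_w(theta, w)
    theta = theta_next
```

On this instance $b(\theta) = 0$ only at the origin, so the iterates approach the stationary point $(\theta, w) = (0, 0)$; with a genuinely non-convex primal component such convergence is exactly what needs proof, which is the subject of the rest of the paper.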
Applying the primal-dual gradient algorithm (14) is difficult, as evaluating the gradients $\nabla_\theta L(\theta^{(k)}, w^{(k)})$, $\nabla_w L(\theta^{(k)}, w^{(k)})$ requires computing the expectations in (12)-(13) (and may require computing second order moments of the involved quantities). In addition, while problem (10) is strongly concave in $w$, it is potentially non-convex in $\theta$, as the function $V_\theta(\cdot)$ is non-linear with respect to $\theta \in \Theta$. It is unknown whether the primal-dual gradient algorithm converges to a stationary (or saddle) point solution, and if it converges, the rate of convergence is unknown.

3 Variance Reduced Policy Evaluation with Nonlinear Approximation

We tackle the policy evaluation problem with smooth function approximation by focusing on a sample average version of problem (10). To fix ideas, we observe a trajectory of state-action pairs $\{s_1, a_1, s_2, a_2, \ldots, s_m, a_m, s_{m+1}\}$ generated from the policy $\pi$ that we wish to evaluate, and consider a sample average approximation of the stochastic objective function (11):
$$ L(\theta, w) := \frac{1}{m} \sum_{i=1}^m L_i(\theta, w), \quad \text{where} \quad L_i(\theta, w) := \big\langle w, \big(R(s_i, a_i) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i)\big) g_\theta(s_i) \big\rangle - \tfrac{1}{2} \big(w^\top g_\theta(s_i)\big)^2. \quad (15) $$
Our goal is to find a stationary point (to be defined later) of the finite-sum, non-convex, primal-dual problem:
$$ \min_{\theta \in \Theta} \max_{w \in \mathcal{W}} L(\theta, w) = \frac{1}{m} \sum_{i=1}^m L_i(\theta, w). \quad (16) $$

Algorithm 1 Nonconvex Primal-Dual Gradient with Variance Reduction (nPD-VR) Algorithm.
1: Input: a trajectory of state-action pairs $\{s_1, a_1, s_2, a_2, \ldots, s_m, a_m, s_{m+1}\}$ generated from a given policy; step sizes $\alpha, \beta > 0$; initialization points $\theta^{(0)} \in \Theta$, $w^{(0)} \in \mathbb{R}^d$ (the stored variables are initialized as $\theta^{(0)}_i = \theta^{(0)}$, $w^{(0)}_i = w^{(0)}$ for all $i$).
2: Compute the initial averaged gradients as:
$$ G_\theta^{(0)} = \frac{1}{m} \sum_{i=1}^m \nabla_\theta L_i(\theta^{(0)}, w^{(0)}), \qquad G_w^{(0)} = \frac{1}{m} \sum_{i=1}^m \nabla_w L_i(\theta^{(0)}, w^{(0)}). $$
3: for $k = 0, 1, 2, \ldots, K-1$ do
4: Select two indices $i_k, j_k$ independently and uniformly from $\{1, \ldots, m\}$.
5: Perform the primal-dual updates:
$$ \theta^{(k+1)} = \mathcal{P}_\Theta\Big( \theta^{(k)} - \beta \big[ G_\theta^{(k)} + \nabla_\theta L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}) \big] \Big), \quad (18) $$
$$ w^{(k+1)} = w^{(k)} + \alpha \big[ G_w^{(k)} + \nabla_w L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}) \big], \quad (19) $$
where the gradients can be given by (17).
6: Update the variables as:
$$ \big(\theta^{(k+1)}_i, w^{(k+1)}_i\big) = \begin{cases} \big(\theta^{(k)}, w^{(k)}\big) & \text{if } i = j_k, \\ \big(\theta^{(k)}_i, w^{(k)}_i\big) & \text{if } i \neq j_k, \end{cases} \quad (20) $$
$$ G_\theta^{(k+1)} = G_\theta^{(k)} + \frac{1}{m} \big[ \nabla_\theta L_{j_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{j_k}(\theta^{(k)}_{j_k}, w^{(k)}_{j_k}) \big], \qquad G_w^{(k+1)} = G_w^{(k)} + \frac{1}{m} \big[ \nabla_w L_{j_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{j_k}(\theta^{(k)}_{j_k}, w^{(k)}_{j_k}) \big]. \quad (21) $$
7: end for
8: Return: $(\theta^{(\tilde{K})}, w^{(\tilde{K})})$, an approximate stationary point of (16), where $\tilde{K}$ is independently and uniformly picked from $\{1, \ldots, K\}$.

Observe that if $m$ is sufficiently large and since $G_\theta$ is positive definite, the primal-dual objective function is strongly concave in $w$ but possibly non-convex in $\theta$ due to non-linearity. The above problem is hence a one-sided non-convex problem, which remains challenging to tackle.
An exact primal-dual gradient (PDG) algorithm following (14), but replacing the gradients of $L(\theta, w)$ in (11) by those of the sample average (15), may be applied to (16). In fact, by exploiting the one-sided non-convexity, Lu et al. [24] showed that a similar algorithm to the PDG algorithm indeed converges sublinearly to a stationary point of (16).
However, for large $m \gg 1$, implementing the PDG algorithm entails a high per-iteration complexity, since evaluating the full gradient requires $\Omega(m)$ FLOPS. Our idea is to derive a fast stochastic algorithm for function approximation by borrowing techniques from variance reduction methods [16, 12, 33, 30].
To fix notation, let $i \in \{1, \ldots, m\}$ and define the primal-dual gradient of the $i$-th sample:
$$ \begin{pmatrix} \nabla_\theta L_i(\theta, w) \\ \nabla_w L_i(\theta, w) \end{pmatrix} = \begin{pmatrix} \big(\delta_i(\theta) - g_\theta^\top(s_i) w\big) H_\theta(s_i) w + (g_\theta^\top(s_i) w)\big(\gamma g_\theta(s_{i+1}) - g_\theta(s_i)\big) \\ \delta_i(\theta) g_\theta(s_i) - g_\theta(s_i) g_\theta^\top(s_i) w \end{pmatrix}, \quad (17) $$
where $\delta_i(\theta) := R(s_i, a_i) + \gamma V_\theta(s_{i+1}) - V_\theta(s_i)$ is the $i$-th sampled temporal difference.
We propose the Nonconvex Primal-Dual Gradient with Variance Reduction (nPD-VR) algorithm for (16) in Algorithm 1. The algorithm is a natural extension of the non-convex SAGA algorithm introduced by [30] to the primal-dual, finite-sum setting of interest. Specifically, line 5 performs the primal-dual gradient update through an unbiased estimate of the gradient: denoting
$$ \widetilde{G}_\theta^{(k)} := G_\theta^{(k)} + \nabla_\theta L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_\theta L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}), \qquad \widetilde{G}_w^{(k)} := G_w^{(k)} + \nabla_w L_{i_k}(\theta^{(k)}, w^{(k)}) - \nabla_w L_{i_k}(\theta^{(k)}_{i_k}, w^{(k)}_{i_k}), \quad (22) $$
since $i_k$ is uniformly picked from $\{1, \ldots, m\}$, the expected values of $\widetilde{G}_\theta^{(k)}, \widetilde{G}_w^{(k)}$ (conditioned on the random variables generated up to iteration $k$) are the primal-dual gradients $\nabla_\theta L(\theta^{(k)}, w^{(k)})$, $\nabla_w L(\theta^{(k)}, w^{(k)})$, respectively. Meanwhile, the updates in line 6 keep refreshing the stored variables in memory.
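The bookkeeping of lines 5-6 (unbiased estimate plus table refresh) can be sketched in isolation. The snippet below applies the same SAGA-style estimator in the primal variable only, on a toy least-squares finite sum that is our own stand-in for the $L_i$'s:

```python
import numpy as np

# SAGA-style estimator in the spirit of lines 5-6 of Algorithm 1, shown on a
# toy finite sum f(theta) = (1/2m) sum_i (a_i^T theta - y_i)^2 (our stand-in).
rng = np.random.default_rng(1)
m, d = 50, 4
A = rng.standard_normal((m, d))
y = rng.standard_normal(m)

def grad_i(th, i):                  # per-sample gradient, playing the role of grad L_i
    return (A[i] @ th - y[i]) * A[i]

def full_grad(th):
    return A.T @ (A @ th - y) / m

theta = np.zeros(d)
stored = np.zeros((m, d))           # stored iterates theta_i^{(k)}, all at theta^{(0)}
G_avg = np.mean([grad_i(stored[i], i) for i in range(m)], axis=0)
beta = 0.01
for k in range(2000):
    i, j = rng.integers(m), rng.integers(m)   # independent indices i_k, j_k
    # line 5: unbiased estimate; conditioned on the past, E_i[v] = full_grad(theta)
    v = G_avg + grad_i(theta, i) - grad_i(stored[i], i)
    theta_next = theta - beta * v
    # line 6: refresh table entry j_k and the running average, using theta^{(k)}
    G_avg += (grad_i(theta, j) - grad_i(stored[j], j)) / m
    stored[j] = theta
    theta = theta_next
```

Averaging the estimator over all choices of $i$ recovers the full gradient exactly, which is the conditional unbiasedness used for (22); the cost per iteration is two per-sample gradient evaluations, independent of $m$.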
We remark that these updates are based on the index $j_k$, which is independent of the $i_k$ used in line 5. As we shall see in the analysis, this subtle detail of the algorithm allows us to prove that the variance of the gradient estimate is reduced [cf. Lemma 3].
As the nPD-VR algorithm employs an incremental update rule similar to the SAGA method, it is suitable for the big-data setting where $m \gg 1$. In particular, the cost of the updates in lines 5 and 6 is independent of $m$. Moreover, the proposed algorithm utilizes a fixed step size rule, which allows for adaptation to more dynamic data.
We remark that existing approaches [4, 7] have studied two-timescale stochastic approximation algorithms for tackling the stochastic problem (10); in a similar vein, a recent related work [22] proposed a double-loop algorithm that requires solving the dual problem (nearly) optimally. In contrast, the nPD-VR algorithm runs on a single timescale. The nPD-VR algorithm is more flexible and numerically stable, as we shall show in the convergence analysis.

3.1 Preliminary Numerical Experiments

We present preliminary experiments of learning the value function from the MountainCar dataset with $m = 5000$ via the nPD-VR algorithm. We ran Sarsa [39] to obtain a good policy, then generated a trajectory of state-action pairs. To learn the value function, we parameterize $V_\theta(\cdot)$ as a 2-layer neural network with $n$ hidden neurons and consider a discount factor $\gamma = 0.95$.
We set the constraints in (16) with $\Theta = [0, 1]^n$, and in addition we constrain $w$ to be bounded in $[0, 100]^n$ for better numerical stability, which can be enforced by incorporating a projection step after (19). For the nPD-VR algorithm, we set the step sizes as $\alpha = 10^{-4}$, $\beta = 10^{-8}$. Note that we approximate the Hessian in the gradient computation (17) with a diagonal approximation. As a benchmark, we also experiment with a single-timescale SGD on (16) with a diminishing step size. The trajectory of the objective $L(\theta^{(k)}, w^{(k)})$ is shown in Fig. 1. As seen, the objective of nPD-VR converges to (close to) zero within 4-5 passes over the data, while single-timescale SGD on (16) takes a long time (or fails) to converge.

Figure 1: Trajectory of the nPD-VR on the MountainCar dataset, where the value function is approximated by a 2-layer neural network with $n$ neurons. (Left) $n = 50$ neurons. (Right) $n = 100$ neurons.

4 Convergence Analysis

Before stating the main results, let us list a few assumptions on the nPD-VR algorithm and the primal-dual problem (16).
Assumption 1. For any $\theta \in \Theta$, the sum function $L(\theta, w)$ is $\mu$-strongly concave in $w$.
In the policy evaluation problem, Assumption 1 can be implied by taking a sufficient number of samples $m$ and exploiting the fact that $G_\theta$ in (7) is positive definite.
Assumption 2.
The iterates $\{\theta^{(k)}, w^{(k)}\}_{k \geq 0}$ generated by the nPD-VR algorithm stay within a compact set $\Theta \times \mathcal{W}$, for some $\mathcal{W} \subseteq \mathbb{R}^d$ which is compact and convex.
Due to the Euclidean projection in the primal update of $\theta$, the condition $\theta^{(k)} \in \Theta$ holds straightforwardly. Meanwhile, it may be difficult to verify $w^{(k)} \in \mathcal{W}$, as the dual update is unconstrained in general. The intuition is that, since $L(\theta, w)$ is strongly concave in $w$, for each $\bar{\theta} \in \Theta$ the maximizer of $L(\bar{\theta}, w)$ is unique, denoted $w^\star(\bar{\theta})$. Also due to the strong concavity, at each iteration $k$ and with a sufficiently small step size, the dual update pulls the dual variable towards $w^\star(\theta^{(k)})$, and therefore $w^{(k)}$ also stays within a compact set. Nevertheless, in our numerical experiments in Sec. 3.1, we find that incorporating an additional projection step in the dual update improves the numerical performance. Lastly, we assume:
Assumption 3. For each $i \in \{1, \ldots, m\}$, the gradient $\nabla_\theta L_i(\theta, w)$ (resp. $\nabla_w L_i(\theta, w)$) is $L_\theta$- (resp. $L_w$-) Lipschitz.
That is, we have:
$$ \|\nabla_\theta L_i(\theta, w) - \nabla_\theta L_i(\theta', w')\| \leq L_\theta \big( \|\theta - \theta'\| + \|w - w'\| \big), \qquad \|\nabla_w L_i(\theta, w) - \nabla_w L_i(\theta', w')\| \leq L_w \big( \|\theta - \theta'\| + \|w - w'\| \big), \quad (23) $$
for any $\theta, \theta' \in \Theta$ and any $w, w' \in \mathcal{W}$, where $\mathcal{W}$ is defined in Assumption 2.
Assumption 3 is mild and can be verified by using the compactness of $\mathcal{W}$ and checking (17). In particular, the assumption holds when the parametric family of functions has bounded, smooth gradients and Hessians.

Summary of Main Results
The primal-dual optimization problem (16) is one-sided constrained, i.e., only $\theta$ is constrained to $\Theta$ while $w$ is unconstrained. We quantify its convergence via the following stationarity measure. Define $\bar{\theta} = \mathcal{P}_\Theta\big(\theta - \beta \nabla_\theta L(\theta, w)\big)$ for any $(\theta, w) \in \Theta \times \mathbb{R}^d$. Observe that if $\|\bar{\theta} - \theta\| = 0$ and $\nabla_w L(\theta, w) = 0$, then $(\theta, w)$ is a (first order) stationary point. Inspired by this observation, the following stationarity measure emerges as a natural metric:
$$ G(\theta^{(k)}, w^{(k)}) := \frac{1}{\beta^2} \|\bar{\theta}^{(k)} - \theta^{(k)}\|^2 + \|\nabla_w L(\theta^{(k)}, w^{(k)})\|^2, \quad (24) $$
where $\bar{\theta}^{(k)}$ is defined through $(\theta^{(k)}, w^{(k)})$ as
$$ \bar{\theta}^{(k)} := \mathcal{P}_\Theta\big(\theta^{(k)} - \beta \nabla_\theta L(\theta^{(k)}, w^{(k)})\big). \quad (25) $$
Observe that if $G(\theta^{(k)}, w^{(k)}) = 0$, then the primal-dual solution $(\theta^{(k)}, w^{(k)})$ is a stationary point. Furthermore, the metric is roughly invariant to the step size, since $\|\bar{\theta}^{(k)} - \theta^{(k)}\|^2 = O(\beta^2)$. The following theorem shows the convergence of the nPD-VR algorithm:
Theorem 1. Suppose Assumptions 1-3 hold.
There exist step size parameters, of the order $\beta = \Theta(1/m)$, $\alpha = \Theta(1/m)$, such that for any $K \in \mathbb{N}$,
$$ \mathbb{E}\big[G(\theta^{(\tilde{K})}, w^{(\tilde{K})})\big] \leq \frac{F(K) + \frac{4}{\mu}(3 + 2m)\big(2 L_w^2 \alpha + L_\theta^2 \beta\big) \|\nabla_w L(\theta^{(0)}, w^{(0)})\|^2}{K \min\{\alpha, \beta/4\}}, \quad (26) $$
where $F(K) := \mathbb{E}\big[L(\theta^{(0)}, w^{(0)}) - L(\theta^{(K)}, w^{(K)})\big]$, and we recall that $\tilde{K}$ is a uniform random variable drawn from $\{1, \ldots, K\}$.
The above shows that the stationarity measure decays to zero at a sublinear rate. In particular, with step sizes of order $\alpha = \Theta(1/m)$, $\beta = \Theta(1/m)$, the number of iterations required to reach an $\epsilon$-stationary point [with $G(\theta, w) = O(\epsilon)$] is $O(m/\epsilon)$, provided that the strong concavity constant $\mu$ and the Lipschitz constants $L_\theta, L_w$ are independent of $m$.

Comparison to Prior Work
Non-asymptotic convergence of primal-dual gradient type algorithms to stationary points of (one-sided) non-convex problems has only recently been studied. Of close relation is [24], which studies a block coordinate descent version of a single-loop primal-dual gradient method, where the primal and dual updates are performed in sequence and complete gradients are evaluated at each iteration. Lu et al. [24] showed that their algorithm converges to an $\epsilon$-stationary point using $O(1/\epsilon)$ iterations, under a similar set of assumptions as ours. Since each iteration of [24] requires a complete gradient evaluation, the number of calls to a gradient oracle is thus $O(m/\epsilon)$. In [22, 29], several proximally guided stochastic mirror descent (PG-SMD) methods are proposed for primal-dual problems under a closely related set of assumptions. However, the PG-SMD methods in [22, 29] rely on a double-loop update in which the primal variables are updated at a faster pace than the dual variables.
Nevertheless, [22, 29] show that these methods converge to an $\epsilon$-stationary point using $O(m/\epsilon)$ gradient oracle calls. To the best of our knowledge, our algorithm is the first stochastic algorithm that can handle a finite-sum primal-dual problem such as (16) using a single loop and variance reduction techniques. Furthermore, the convergence rate of the proposed nPD-VR algorithm is on par with the state-of-the-art methods.

4.1 Proof Outline

Our analysis combines and improves recent techniques for analyzing non-convex optimization algorithms in [30, 24]. To facilitate the analysis, we denote the errors in the gradient estimates by $e_\theta^{(k)} := \widetilde{G}_\theta^{(k)} - \nabla_\theta L(\theta^{(k)}, w^{(k)})$ and $e_w^{(k)} := \widetilde{G}_w^{(k)} - \nabla_w L(\theta^{(k)}, w^{(k)})$, respectively. Detailed proofs of the results in this section can be found in the supplementary materials.

Key Lemmas
We begin by establishing a few lemmas for the convergence analysis. The first step is to control the change in objective value due to the primal update:
Lemma 1. Under Assumption 3, for any $k \in \mathbb{N}$, we have
$$ L(\theta^{(k+1)}, w^{(k)}) - L(\theta^{(k)}, w^{(k)}) \leq \Big( L_\theta - \frac{1}{2\beta} \Big) \|\bar{\theta}^{(k)} - \theta^{(k)}\|^2 + \Big( \frac{L_\theta}{2} - \frac{1}{2\beta} \Big) \|\theta^{(k+1)} - \theta^{(k)}\|^2 + \frac{\beta}{2} \|e_\theta^{(k)}\|^2. \quad (27) $$
The proof follows from the standard descent property of smooth functions, combined with the variance control technique introduced by [30].
Secondly, the progress made by the dual update obeys the following bounds:
Lemma 2. Under Assumptions 1-3,
For any k ∈ N, the change in objective value is bounded as:

L(θ^(k+1), w^(k+1)) − L(θ^(k+1), w^(k)) ≤ αL_w² (1 + α²L_w²) ‖θ^(k+1) − θ^(k)‖² + ( α/2 + μα²/2 + α³L_w²/2 ) ‖∇_wL(θ^(k), w^(k))‖² + ( α − μα²/2 ) ‖e_w^(k)‖²,   (28)

and the dual gradient is controlled by:

‖∇_wL(θ^(k), w^(k))‖² ≤ ( 1 + α²L_w² − 2μα ) ‖∇_wL(θ^(k−1), w^(k−1))‖² + (L_w²/(μα)) ( ‖θ^(k) − θ^(k−1)‖² + α² ‖e_w^(k−1)‖² ).   (29)

The bound (28) is a standard relation for the dual gradient update, while (29) is a consequence of the strong concavity of L(θ, ·) – it shows that ‖∇_wL(θ^(k), w^(k))‖ contracts after a dual update.

To control the gradient error terms ‖e_θ^(k)‖², ‖e_w^(k)‖² in expectation, we consider [see also Lemma 4 in the supplementary materials]

Δ^(k) := (1/m) Σ_{i=1}^m ( ‖θ^(k) − θ_i^(k)‖² + ‖w^(k) − w_i^(k)‖² ),   (30)

and notice that Δ^(0) = 0. Using Assumption 3, and when the step sizes are sufficiently small, we can establish a bound on Σ_{k=0}^K E[Δ^(k)] via the following lemma:

Lemma 3.
Under Assumption 3 and the condition on the step sizes that

δ(α, β) := 1/m − max{α, β} − 4L_w² ( α² + α(1 − 1/m) ) > 0,   (31)

for any K ≥ 1, we have

Σ_{k=0}^K E[Δ^(k)] ≤ (1/δ(α, β)) Σ_{k=0}^K E[ (2/β) ‖θ^(k+1) − θ^(k)‖² + 4α ‖∇_wL(θ^(k), w^(k))‖² ].   (32)

The proof of the lemma makes use of the properties of the nPD-VR algorithm and uses a new technique for proving the contraction of the variance in SAGA-type algorithms. Furthermore, note that if the step sizes satisfy

1/(2m) ≥ max{α, β} + 8αL_w²,   (a0)

then one has δ(α, β)^{−1} = ( 1/m − max{α, β} − 4L_w²(α² + α(1 − 1/m)) )^{−1} ≤ 2m. We simplify (32) into

Σ_{k=0}^K E[Δ^(k)] ≤ Σ_{k=0}^K E[ (4m/β) ‖θ^(k+1) − θ^(k)‖² + 8mα ‖∇_wL(θ^(k), w^(k))‖² ].   (33)

Proof of Theorem 1 Equipped with the lemmas above on the progress made by the primal-dual updates and the SAGA gradient estimation, our proof follows by analyzing (27), (28), (29). We remark that the proof technique used is new: it departs from the common Lyapunov/potential function approach pursued in recent papers [30, 24] on non-convex analysis.

To illustrate the idea, by carefully controlling the step sizes, we show that summing up the inequalities (27), (28) from k = 0 to k = K − 1 gives

Ω(min{α, β}) Σ_{k=0}^{K−1} E[G(θ^(k), w^(k))] ≤ O(α) Σ_{k=0}^{K−1} E[‖∇_wL(θ^(k), w^(k))‖²] + O(m − 1/β) Σ_{k=0}^{K−1} E[‖θ^(k+1) − θ^(k)‖²] + constant.   (34)

Using (29), the sum Σ_{k=0}^{K−1} E[‖∇_wL(θ^(k), w^(k))‖²] can be further upper bounded in the form constant × Σ_{k=0}^{K−1} E[‖θ^(k+1) − θ^(k)‖²] + constant.
Substituting the newly obtained bound, one can find a step size β > 0 such that the constant in front of the term E[‖θ^(k+1) − θ^(k)‖²] is negative. It follows that we can upper bound the right hand side of (34) by a constant independent of K. Subsequently, we observe that as K̃ is an independent random variable uniformly distributed on {0, ..., K − 1}, one has E[G(θ^(K̃), w^(K̃))] = K^{−1} Σ_{k=0}^{K−1} E[G(θ^(k), w^(k))], and applying (34) yields Theorem 1.

5 Conclusions and Extensions

In this paper, we have studied the policy evaluation problem in the case of smooth (possibly non-linear) function approximation. We consider an offline setting via sample average approximation of the Bellman equation. Although the sample size m can be large, we propose a simple and efficient variance reduced primal-dual update strategy to handle the one-sided non-convex optimization problem that arises. We analyze the non-asymptotic convergence rate of the algorithm towards a stationary point and demonstrate that it performs on par with state-of-the-art optimization methods, while the latter require higher implementation complexity.

Several extensions are worth studying: similar to the SAGA algorithm considered here, the SVRG algorithm [16] may benefit the nonconvex primal-dual optimization; as suggested by [30], using mini-batches can accelerate the convergence rate from O(m/K) to O(m^{2/3}/K).

Acknowledgement

H.-T. Wai is supported by the CUHK Direct Grant #4055113. M. Hong is supported in part by NSF under Grant CCF-1651825, CMMI-172775, CIF-1910385 and by AFOSR under grant 19RT0424.

References

[1] Z. Allen-Zhu and E. Hazan. Variance reduction for faster non-convex optimization.
In International Conference on Machine Learning, pages 699–707, 2016.

[2] L. Baird. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning, pages 30–37, 1995.

[3] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. arXiv preprint arXiv:1806.02450, 2018.

[4] S. Bhatnagar, D. Precup, D. Silver, R. S. Sutton, H. R. Maei, and C. Szepesvári. Convergent temporal-difference learning with arbitrary smooth function approximation. In Advances in Neural Information Processing Systems, pages 1204–1212, 2009.

[5] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint, volume 48. Springer, 2009.

[6] L. Cassano, K. Yuan, and A. H. Sayed. Multi-agent fully decentralized off-policy learning with linear convergence rates. arXiv preprint arXiv:1810.07792, 2018.

[7] W. Chung, S. Nath, A. Joseph, and M. White. Two-timescale networks for nonlinear value function approximation. In ICLR, 2019.

[8] G. Dalal, B. Szorenyi, G. Thoppe, and S. Mannor. Finite sample analysis of two-timescale stochastic approximation with applications to reinforcement learning. arXiv preprint arXiv:1703.05376, 2017.

[9] C. Dann, G. Neumann, and J. Peters. Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.

[10] C. Daskalakis and I. Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9256–9266, 2018.

[11] C. Daskalakis, A. Ilyas, V. Syrgkanis, and H. Zeng. Training GANs with optimism. arXiv preprint arXiv:1711.00141, 2017.

[12] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives.
In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.

[13] T. T. Doan, S. T. Maguluri, and J. Romberg. Convergence rates of distributed TD(0) with linear function approximation for multi-agent reinforcement learning. arXiv preprint arXiv:1902.07393, 2019.

[14] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, pages 1049–1058, 2017.

[15] S. Ghadimi and G. Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

[16] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[17] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.

[18] H. Kushner and G. G. Yin. Stochastic Approximation and Recursive Algorithms and Applications. Springer Science & Business Media, 2003.

[19] M. G. Lagoudakis and R. Parr. Least-squares policy iteration. Journal of Machine Learning Research, 4(Dec):1107–1149, 2003.

[20] S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement Learning, pages 45–73. Springer, 2012.

[21] L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

[22] Q. Lin, M. Liu, H. Rafique, and T. Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, 2018.

[23] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms.
In Conference on Uncertainty in Artificial Intelligence, pages 504–513, 2015.

[24] S. Lu, I. Tsaknakis, Y. Chen, and M. Hong. Hybrid block successive approximation for one-sided non-convex min-max problems: Algorithms and applications. arXiv preprint arXiv:1902.08294, 2019.

[25] P. Mertikopoulos, H. Zenati, B. Lecouat, C. Foo, V. Chandrasekhar, and G. Piliouras. Mirror descent in saddle-point problems: Going the extra (gradient) mile. CoRR, abs/1807.02629, 2018. URL http://arxiv.org/abs/1807.02629.

[26] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.

[27] M. Nouiehed, M. Sanjabi, J. D. Lee, and M. Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. arXiv preprint arXiv:1902.08297, 2019.

[28] Q. Qian, S. Zhu, J. Tang, R. Jin, B. Sun, and H. Li. Robust optimization over multiple domains. CoRR, abs/1805.07588, 2018. URL http://arxiv.org/abs/1805.07588.

[29] H. Rafique, M. Liu, Q. Lin, and T. Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

[30] S. J. Reddi, S. Sra, B. Poczos, and A. J. Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems, pages 1145–1153, 2016.

[31] M. Sanjabi, B. Jimmy, M. Razaviyayn, and J. D. Lee. On the convergence and robustness of training GANs with regularized optimal transport. In Advances in Neural Information Processing Systems, pages 7088–7098, 2018.

[32] M. Sanjabi, M. Razaviyayn, and J. D. Lee. Solving non-convex non-concave min-max games under Polyak-Łojasiewicz condition. arXiv preprint arXiv:1812.02878, 2018.

[33] M. Schmidt, N. Le Roux, and F.
Bach. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2):83–112, 2017.

[34] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[35] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[36] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354, 2017.

[37] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. arXiv preprint arXiv:1902.00923, 2019.

[38] M. S. Stanković and S. S. Stanković. Multi-agent temporal-difference learning with linear function approximation: Weak convergence under time-varying network topologies. In 2016 American Control Conference (ACC), pages 167–172. IEEE, 2016.

[39] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[40] R. S. Sutton, H. R. Maei, D. Precup, S. Bhatnagar, D. Silver, C. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning, pages 993–1000, 2009.

[41] R. S. Sutton, H. R. Maei, and C. Szepesvári. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems, pages 1609–1616, 2009.

[42] A. Touati, P.-L. Bacon, D. Precup, and P. Vincent. Convergent tree-backup and retrace with function approximation. arXiv preprint arXiv:1705.09322, 2017.

[43] J. N. Tsitsiklis and B. Van Roy.
Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems, pages 1075–1081, 1997.

[44] H.-T. Wai, Z. Yang, P. Z. Wang, and M. Hong. Multi-agent reinforcement learning via double averaging primal-dual optimization. In Advances in Neural Information Processing Systems, pages 9649–9660, 2018.

[45] Y. Wang, W. Chen, Y. Liu, Z.-M. Ma, and T.-Y. Liu. Finite sample analysis of the GTD policy evaluation algorithms in Markov setting. In Advances in Neural Information Processing Systems, pages 5504–5513, 2017.