{"title": "Policy Optimization Provably Converges to Nash Equilibria in Zero-Sum Linear Quadratic Games", "book": "Advances in Neural Information Processing Systems", "page_first": 11602, "page_last": 11614, "abstract": "We study the global convergence of policy optimization for finding the Nash equilibria (NE) in zero-sum linear quadratic (LQ) games. To this end, we first investigate the landscape of LQ games, viewing it as a nonconvex-nonconcave saddle-point problem in the policy space. Specifically, we show that despite its nonconvexity and nonconcavity, zero-sum LQ games have the property that the stationary point of the objective function with respect to the linear feedback control policies constitutes the NE of the game. Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the game. Moreover, we show that all these algorithms enjoy both globally sublinear and locally linear convergence rates. Simulation results are also provided to illustrate the satisfactory convergence properties of the algorithms. To the best of our knowledge, this work appears to be the first one to investigate the optimization landscape of LQ games, and provably show the convergence of policy optimization methods to the NE. Our work serves as an initial step toward understanding the theoretical aspects of policy-based reinforcement learning algorithms for zero-sum Markov games in general.", "full_text": "Policy Optimization Provably Converges to Nash\nEquilibria in Zero-Sum Linear Quadratic Games\n\nKaiqing Zhang\nECE and CSL\n\nUniversity of Illinois at Urbana-Champaign\n\nkzhang66@illinois.edu\n\nZhuoran Yang\n\nORFE\n\nPrinceton University\nzy6@princeton.edu\n\nTamer Ba\u00b8sar\nECE and CSL\n\nUniversity of Illinois at Urbana-Champaign\n\nbasar1@illinois.edu\n\nAbstract\n\nWe study the global convergence of policy optimization for \ufb01nding the Nash equi-\nlibria (NE) in zero-sum linear quadratic (LQ) games. 
To this end, we first investigate the landscape of LQ games, viewing it as a nonconvex-nonconcave saddle-point problem in the policy space. Specifically, we show that despite its nonconvexity and nonconcavity, zero-sum LQ games have the property that the stationary point of the objective function with respect to the linear feedback control policies constitutes the NE of the game. Building upon this, we develop three projected nested-gradient methods that are guaranteed to converge to the NE of the game. Moreover, we show that all these algorithms enjoy both globally sublinear and locally linear convergence rates. Simulation results are also provided to illustrate the satisfactory convergence properties of the algorithms. To the best of our knowledge, this work appears to be the first one to investigate the optimization landscape of LQ games, and provably show the convergence of policy optimization methods to the NE. Our work serves as an initial step toward understanding the theoretical aspects of policy-based reinforcement learning algorithms for zero-sum Markov games in general.

1 Introduction

Reinforcement learning [1] has achieved sensational progress recently in several prominent decision-making problems, e.g., playing the game of Go [2, 3], and playing real-time strategy games [4, 5]. Interestingly, all of these problems can be formulated as zero-sum Markov games involving two opposing players or teams. Moreover, their algorithmic frameworks are all based upon policy optimization (PO) methods such as actor-critic [6] and proximal policy optimization (PPO) [7], where the policies are parametrized and iteratively updated. Such popularity of PO methods is mainly attributed to the facts that: (i) they are easy to implement and can handle high-dimensional and continuous action spaces; (ii) they can readily incorporate advanced optimization results to facilitate the algorithm design [7, 8, 9]. 
Moreover, empirically, some observations have shown that PO methods usually converge faster than value-based ones [9, 10].
In contrast to the tremendous empirical success, theoretical understanding of policy optimization methods for zero-sum Markov games lags behind. Although the convergence of policy optimization algorithms to locally optimal policies has been established in the classical reinforcement learning setting with a single agent/player [11, 6, 12, 7, 13, 14], extending those theoretical guarantees to

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Nash equilibrium (NE) policies, a common solution concept in game theory also known as saddle-point equilibrium (SPE) in the zero-sum setting [15], suffers from the following two caveats.
First, since the players simultaneously determine their actions in the games, the decision-making problem faced by each player becomes non-stationary. As a result, single-agent algorithms fail to work due to lack of the Markov property [16]. Second, with parametrized policies, the policy optimization for finding NE in a function space is reduced to solving for NE in the policy parameter space, where the underlying game is in general nonconvex-nonconcave. Since nonconvex optimization problems are NP-hard [17] in the worst case, so is finding an NE in nonconvex-nonconcave saddle-point problems [18]. 
In fact, it has been showcased recently that vanilla gradient-based algorithms might have cyclic behaviors and fail to converge to any NE [19, 20, 21] in zero-sum games.
As an initial attempt at bridging the gap between theory and practice, we study the performance of PO methods on a simple but quintessential example of zero-sum Markov games, namely, zero-sum linear quadratic (LQ) games. In LQ games, the system evolves following linear dynamics controlled by both players, while the cost function depends quadratically on the states and joint control actions. Zero-sum LQ games find broad applications in H∞-control for robust control synthesis [15, 22], and risk-sensitive control [23, 24]. In fact, such an LQ setting can be used for studying general continuous control problems with adversarial disturbances/opponents, by linearizing the system of interest around the operational point [15]. Therefore, developing theory for the LQ setting may provide some insights into the local properties of general control settings. Our study is pertinent to the recent efforts on policy optimization for linear quadratic regulator (LQR) problems [25, 26, 27], a single-player counterpart of LQ games. As will be shown later, LQ games are more challenging to solve using PO methods, since they are not only nonconvex in the policy space for one player (as LQR), but also nonconcave for the other. 
Compared to PO for LQR, such nonconvexity-nonconcavity causes technical difficulties in showing the stabilizing properties along the iterations, an essential requirement for the iterative PO algorithms to be feasible. Additionally, in contrast to the recent non-asymptotic analyses on gradient methods for nonconvex-nonconcave saddle-point problems [28], the objective function lacks smoothness in LQ games, the main challenge identified in [25] for LQR.
To address these technical challenges, we first investigate the optimization landscape of LQ games, showing that the stationary point of the objective function constitutes the NE of the game, despite its nonconvexity and nonconcavity. We then propose three projected nested-gradient methods, which separate the updates into two loops with both gradient-based iterations. Such a nested-loop update mitigates the inherent non-stationarity of learning in games. The projection ensures the stabilizing property of the control along the iterations. The algorithms are guaranteed to converge to the NE, with globally sublinear and locally linear rates. Our results set theoretical foundations for developing model-free policy-based reinforcement learning algorithms for zero-sum LQ games.
Related Work. There is a huge body of literature on value-based methods for zero-sum Markov games; see, e.g., [29, 30, 31, 32, 33, 34] and the references therein. Specifically, for the linear quadratic setting, [35] proposed a Q-learning approximate dynamic programming approach. In contrast, the study of PO methods for zero-sum Markov games is limited; such methods are either empirical without any theoretical guarantees [36], or developed only for the tabular setting [37, 38, 39, 40]. Within the LQ setting, our work is related to the recent work on the global convergence of policy gradient (PG) methods for LQR [25, 26]. 
However, our setting is more challenging since it concerns a saddle-point problem with not only nonconvexity for the minimizer, but also nonconcavity for the maximizer.
Our work also falls into the realm of solving nonconvex-(non)concave saddle-point problems [41, 42, 43, 44, 45, 46], which has recently drawn great attention due to the popularity of training generative adversarial networks (GANs) [47, 48, 42, 49]. However, most of the existing results are either for the nonconvex but concave minimax setting [50, 42, 49], or only have asymptotic convergence results [41, 47, 48, 43, 44]. Two recent non-asymptotic analyses for this problem have been established under strong assumptions that the objective function is either weakly-convex and weakly-concave [51], or smooth [28]. However, LQ games satisfy neither of these assumptions. In addition, even asymptotically, basic gradient-based approaches may not converge to (local) Nash equilibria [45, 46], not even to stationary points, due to oscillatory behaviors [20]. In contrast to [45, 46], our results show the global convergence to an actual NE of the game (instead of any surrogate such as the local minimax in [46]).

Contribution. Our contribution is two-fold: i) we investigate the optimization landscape of zero-sum LQ games in the parametrized feedback control policy space, showing the desired property that stationary points constitute the Nash equilibria; ii) we develop projected nested-gradient methods that are proved to converge to the NE with globally sublinear and locally linear rates. We also provide several interesting simulation findings on solving this problem with PO methods, even beyond the settings we consider that enjoy theoretical guarantees. 
To the best of our knowledge, for the first time, policy-based methods with function approximation are shown to converge to the global Nash equilibria in a class of zero-sum Markov games, and also with convergence rate guarantees.

2 Background

Consider a zero-sum LQ game model, where the dynamics are characterized by a linear system

x_{t+1} = A x_t + B u_t + C v_t,   (2.1)

where the system state is x_t ∈ R^d, and the control inputs of players 1 and 2 are u_t ∈ R^{m_1} and v_t ∈ R^{m_2}, respectively. The matrices satisfy A ∈ R^{d×d}, B ∈ R^{d×m_1}, and C ∈ R^{d×m_2}. The objective of player 1 (player 2) is to minimize (maximize) the infinite-horizon value function

E_{x_0∼D} [ Σ_{t=0}^∞ c_t(x_t, u_t, v_t) ] = E_{x_0∼D} [ Σ_{t=0}^∞ (x_t^⊤ Q x_t + u_t^⊤ R_u u_t − v_t^⊤ R_v v_t) ],   (2.2)

where x_0 ∼ D is the initial state drawn from a distribution D, and the matrices Q ∈ R^{d×d}, R_u ∈ R^{m_1×m_1}, and R_v ∈ R^{m_2×m_2} are all positive definite. If the solution to (2.2) exists and the inf and sup can be interchanged, we refer to the solution value in (2.2) as the value of the game.
To investigate the property of the solution to (2.2), we first introduce the generalized algebraic Riccati equation (GARE) as follows:

P^* = A^⊤ P^* A + Q − [A^⊤ P^* B, A^⊤ P^* C] [R_u + B^⊤ P^* B, B^⊤ P^* C; C^⊤ P^* B, −R_v + C^⊤ P^* C]^{−1} [B^⊤ P^* A; C^⊤ P^* A],   (2.3)

where the semicolon separates block rows, and P^* denotes the minimal non-negative definite solution to (2.3). 
Under standard assumptions to be specified shortly, the value exists and can be characterized by P^* [15], i.e., for any x_0 ∈ R^d,

x_0^⊤ P^* x_0 = inf_{{u_t}_{t≥0}} sup_{{v_t}_{t≥0}} Σ_{t=0}^∞ c_t(x_t, u_t, v_t) = sup_{{v_t}_{t≥0}} inf_{{u_t}_{t≥0}} Σ_{t=0}^∞ c_t(x_t, u_t, v_t),   (2.4)

and there exists a pair of linear feedback stabilizing policies that makes the equality in (2.4) hold, i.e.,

u_t^* = −K^* x_t,  v_t^* = −L^* x_t,   (2.5)

with K^* ∈ R^{m_1×d} and L^* ∈ R^{m_2×d} being the control gain matrices for the minimizer and the maximizer, respectively. The values of K^* and L^* are given by

K^* = [R_u + B^⊤ P^* B − B^⊤ P^* C (−R_v + C^⊤ P^* C)^{−1} C^⊤ P^* B]^{−1} × [B^⊤ P^* A − B^⊤ P^* C (−R_v + C^⊤ P^* C)^{−1} C^⊤ P^* A],   (2.6)

L^* = [−R_v + C^⊤ P^* C − C^⊤ P^* B (R_u + B^⊤ P^* B)^{−1} B^⊤ P^* C]^{−1} × [C^⊤ P^* A − C^⊤ P^* B (R_u + B^⊤ P^* B)^{−1} B^⊤ P^* A].   (2.7)

Since the controller pair (K^*, L^*) achieves the value (2.4) for any x_0, the value of the game is thus E_{x_0∼D}(x_0^⊤ P^* x_0). We now introduce the assumption that guarantees the arguments above to hold.

Assumption 2.1. The following conditions hold: i) there exists a minimal positive definite solution P^* to the GARE (2.3) that satisfies R_v − C^⊤ P^* C > 0; ii) L^* satisfies Q − (L^*)^⊤ R_v L^* > 0.

The condition i) in Assumption 2.1 is a standard sufficient condition that ensures the existence of the value of the game [15, 35, 52]. 
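As an illustration of (2.3) and (2.6)-(2.7), the following sketch computes P^* by value iteration on the GARE and then forms (K^*, L^*). This is our own illustrative code, not an algorithm from this paper: the helper names and the toy usage are assumptions, and the fixed-point iteration is only expected to converge to the minimal solution when the game admits one (Assumption 2.1 i)).

```python
import numpy as np

def solve_gare(A, B, C, Q, Ru, Rv, iters=5000):
    """Value iteration on the GARE (2.3), starting from P = 0.

    Illustrative sketch: under Assumption 2.1 i), the iterates are the
    finite-horizon game values and converge to the minimal solution P*.
    """
    d = A.shape[0]
    P = np.zeros((d, d))
    for _ in range(iters):
        M = np.block([[Ru + B.T @ P @ B, B.T @ P @ C],
                      [C.T @ P @ B, -Rv + C.T @ P @ C]])
        N = np.vstack([B.T @ P @ A, C.T @ P @ A])
        S = np.hstack([A.T @ P @ B, A.T @ P @ C])
        P = A.T @ P @ A + Q - S @ np.linalg.solve(M, N)
    return P

def saddle_gains(A, B, C, P, Ru, Rv):
    """Control gains (K*, L*) from (2.6)-(2.7), given a solution P of (2.3)."""
    Du = Ru + B.T @ P @ B            # R_u + B^T P B
    Dv = -Rv + C.T @ P @ C           # -R_v + C^T P C (negative definite)
    K = np.linalg.solve(Du - B.T @ P @ C @ np.linalg.solve(Dv, C.T @ P @ B),
                        B.T @ P @ A - B.T @ P @ C @ np.linalg.solve(Dv, C.T @ P @ A))
    L = np.linalg.solve(Dv - C.T @ P @ B @ np.linalg.solve(Du, B.T @ P @ C),
                        C.T @ P @ A - C.T @ P @ B @ np.linalg.solve(Du, B.T @ P @ A))
    return K, L
```

On a toy system with a weak disturbance channel, the resulting pair (K^*, L^*) is stabilizing, consistent with Lemma 2.2.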
The additional condition ii) leads to the saddle-point property of the control pair (K^*, L^*), i.e., the controller sequence ({u_t^*}_{t≥0}, {v_t^*}_{t≥0}) generated from (2.5) constitutes the NE of the game (2.2), which is also unique. We formally state the arguments regarding (2.3)-(2.7) in the following lemma, whose proof is deferred to §C.1.

Lemma 2.2. Under Assumption 2.1 i), for any x_0 ∈ R^d, the value of the minimax game

inf_{{u_t}_{t≥0}} sup_{{v_t}_{t≥0}} Σ_{t=0}^∞ c_t(x_t, u_t, v_t)   (2.8)

exists, i.e., (2.4) holds, and (K^*, L^*) is stabilizing. Furthermore, under Assumption 2.1 ii), the controller sequence ({u_t^*}_{t≥0}, {v_t^*}_{t≥0}) generated from (2.5) constitutes the saddle-point of (2.8), i.e., the NE of the game, and it is unique.

Lemma 2.2 implies that the solution to (2.2) can be found by searching for (K^*, L^*) in the matrix space R^{m_1×d} × R^{m_2×d}, given by (2.6)-(2.7) for some P^* > 0 satisfying (2.3). Next, we aim to develop policy optimization methods that are guaranteed to converge to such a (K^*, L^*).

3 Policy Gradient and Landscape

By Lemma 2.2, we focus on finding the state feedback policies of players parameterized by u_t = −K x_t and v_t = −L x_t, such that ρ(A − BK − CL) < 1. 
Accordingly, we denote the corresponding expected cost in (2.2) as

C(K, L) := E_{x_0∼D} { Σ_{t=0}^∞ [ x_t^⊤ Q x_t + (K x_t)^⊤ R_u (K x_t) − (L x_t)^⊤ R_v (L x_t) ] }.   (3.1)

Also, define P_{K,L} as the unique solution to the Lyapunov equation

P_{K,L} = Q + K^⊤ R_u K − L^⊤ R_v L + (A − BK − CL)^⊤ P_{K,L} (A − BK − CL).   (3.2)

Then for any stabilizing control pair (K, L), it follows that C(K, L) = E_{x_0∼D}(x_0^⊤ P_{K,L} x_0). Also, we define Σ_{K,L} as the state correlation matrix, i.e., Σ_{K,L} := E_{x_0∼D} Σ_{t=0}^∞ x_t x_t^⊤. Our goal is to find the NE (K^*, L^*) using policy optimization methods that solve the following minimax problem

min_K max_L C(K, L),   (3.3)

such that for any K ∈ R^{m_1×d} and L ∈ R^{m_2×d},

C(K^*, L) ≤ C(K^*, L^*) ≤ C(K, L^*).

As it has been recognized in [25] that the LQR problem is nonconvex with respect to (w.r.t.) the control gain K, we note that in general, for some given L (or K), the minimization (or maximization) problem is not convex (or concave). This has in fact been the main challenge for the design of equilibrium-seeking algorithms for zero-sum LQ games. We formally state this in the following lemma, which is proved in §C.2.

Lemma 3.1 (Nonconvexity-Nonconcavity of C(K, L)). Define a subset Ω ⊂ R^{m_2×d} as

Ω := {L ∈ R^{m_2×d} | Q − L^⊤ R_v L > 0}.   (3.4)

Then there exists L ∈ Ω such that min_K C(K, L) is a nonconvex minimization problem; there exists K such that max_{L∈Ω} C(K, L) is a nonconcave maximization problem.

To facilitate the algorithm design, we establish the explicit expression of the policy gradient w.r.t. the parameters K and L in the following lemma, with a proof provided in §C.3.

Lemma 3.2 (Policy Gradient Expression). 
The policy gradients of C(K, L) have the form

∇_K C(K, L) = 2[(R_u + B^⊤ P_{K,L} B) K − B^⊤ P_{K,L} (A − CL)] Σ_{K,L},   (3.5)
∇_L C(K, L) = 2[(−R_v + C^⊤ P_{K,L} C) L − C^⊤ P_{K,L} (A − BK)] Σ_{K,L}.   (3.6)

To study the landscape of this nonconvex-nonconcave problem, we first examine the property of the stationary points of C(K, L), which are the points that gradient-based methods converge to.

Lemma 3.3 (Stationary Point Property). For a stabilizing control pair (K, L), i.e., ρ(A − BK − CL) < 1, suppose Σ_{K,L} is full-rank and (−R_v + C^⊤ P_{K,L} C) is invertible. If ∇_K C(K, L) = ∇_L C(K, L) = 0, and the induced matrix P_{K,L} defined in (3.2) is positive definite, then (K, L) constitutes the control gain pair at the Nash equilibrium.

Lemma 3.3, proved in §C.4, shows that the stationary point of C(K, L) suffices to characterize the NE of the game under certain conditions. In fact, for Σ_{K,L} to be full-rank, it suffices to let E_{x_0∼D} x_0 x_0^⊤ be full-rank, i.e., to use a random initial state x_0. This can be easily satisfied in practice.

4 Policy Optimization Algorithms

In this section, we propose three PO methods, based on policy gradients, to find the global NE of the LQ game. In particular, we develop nested-gradient (NG) methods, which first solve the inner optimization by PG methods, and then use the stationary-point solution to perform a gradient update for the outer optimization. One way to solve for the NE is to directly address the minimax problem (2.2). Success of this procedure, as pointed out in [25] for LQR, requires the stability guarantee of the system along the outer policy-gradient updates. However, unlike LQR, it is not clear so far if there exists a stepsize and/or condition on K that ensures such stability of the system along the outer-loop policy-gradient update. 
Instead, if we solve the maximin problem, which has the same value as (2.2) (see Lemma 2.2), then a simple projection step on the iterate L, as will be shown later, can guarantee the stability of the updates. Therefore, we aim to solve max_L min_K C(K, L).
For some given L, the inner minimization problem becomes an LQR problem with equivalent cost matrix Q̃_L = Q − L^⊤ R_v L and state transition matrix Ã_L = A − CL. Motivated by [25], we propose to find the stationary point of the inner problem, since the stationary point suffices to be the global optimum under certain conditions (see Corollary 4 in [25]). Let the stationary-point solution be K(L). By setting ∇_K C(K, L) = 0 and by Lemma 3.2, we have

K(L) = (R_u + B^⊤ P_{K(L),L} B)^{−1} B^⊤ P_{K(L),L} (A − CL).   (4.1)

We then substitute (4.1) into (3.2) to obtain the Riccati equation for the inner problem:

P_{K(L),L} = Q̃_L + Ã_L^⊤ P_{K(L),L} Ã_L − Ã_L^⊤ P_{K(L),L} B (R_u + B^⊤ P_{K(L),L} B)^{−1} B^⊤ P_{K(L),L} Ã_L.   (4.2)

Note that K(L) can be obtained using gradient-based algorithms as in [25]. For example, one can use the basic policy gradient update in the inner loop, i.e.,

K′ = K − α ∇_K C(K, L) = K − 2α [(R_u + B^⊤ P_{K,L} B) K − B^⊤ P_{K,L} Ã_L] Σ_{K,L},   (4.3)

where α > 0 denotes the stepsize, and P_{K,L} denotes the solution to (3.2) for the given (K, L). 
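The quantities driving the update (4.3) — P_{K,L} from (3.2), the state correlation Σ_{K,L}, and the gradients of Lemma 3.2 — can all be computed from two Lyapunov equations. Below is a minimal sketch (the helper names are ours, not the paper's; the fixed-point Lyapunov solver assumes ρ(A − BK − CL) < 1 and an initial-state covariance Σ_0 chosen for illustration):

```python
import numpy as np

def dlyap(F, M, iters=3000):
    # Solve X = M + F^T X F by fixed-point iteration (valid when rho(F) < 1).
    X = M.copy()
    for _ in range(iters):
        X = M + F.T @ X @ F
    return X

def cost_and_grads(A, B, C, Q, Ru, Rv, K, L, Sigma0):
    """Cost C(K, L) and the policy gradients (3.5)-(3.6) for stabilizing (K, L)."""
    Acl = A - B @ K - C @ L
    # P_{K,L} from the Lyapunov equation (3.2)
    P = dlyap(Acl, Q + K.T @ Ru @ K - L.T @ Rv @ L)
    # Sigma_{K,L} solves Sigma = Sigma0 + Acl Sigma Acl^T
    Sigma = dlyap(Acl.T, Sigma0)
    cost = np.trace(P @ Sigma0)   # C(K, L) = E[x0^T P_{K,L} x0]
    gK = 2 * ((Ru + B.T @ P @ B) @ K - B.T @ P @ (A - C @ L)) @ Sigma
    gL = 2 * ((-Rv + C.T @ P @ C) @ L - C.T @ P @ (A - B @ K)) @ Sigma
    return cost, gK, gL
```

A finite-difference check of the returned gradients against the cost is a simple way to validate such an implementation of Lemma 3.2.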
Alternatively, one can also use the following algorithms that exploit approximate second-order information to accelerate the update, i.e., the natural policy gradient update:

K′ = K − α ∇_K C(K, L) Σ_{K,L}^{−1} = K − 2α [(R_u + B^⊤ P_{K,L} B) K − B^⊤ P_{K,L} Ã_L],   (4.4)

and the Gauss-Newton update:

K′ = K − α (R_u + B^⊤ P_{K,L} B)^{−1} ∇_K C(K, L) Σ_{K,L}^{−1} = K − 2α (R_u + B^⊤ P_{K,L} B)^{−1} [(R_u + B^⊤ P_{K,L} B) K − B^⊤ P_{K,L} Ã_L].   (4.5)

Suppose K(L) in (4.1) can be obtained, regardless of the algorithm used. Then, we substitute K(L) back into ∇_L C(K(L), L) to obtain the nested gradient w.r.t. L, which has the following form:

∇_L C̃(L) := ∇_L C(K(L), L)   (4.6)
= 2 { [ −R_v + C^⊤ P_{K(L),L} C − C^⊤ P_{K(L),L} B (R_u + B^⊤ P_{K(L),L} B)^{−1} B^⊤ P_{K(L),L} C ] L   (4.7)
− C^⊤ P_{K(L),L} [ A − B (R_u + B^⊤ P_{K(L),L} B)^{−1} B^⊤ P_{K(L),L} A ] } Σ_{K(L),L}.

Note that the stationary-point condition of the outer loop, ∇_L C̃(L) = 0, is identical to that of ∇_L C(K(L), L) = 0, since

∇_L C̃(L) = ∇_L C(K(L), L) + ∇_L K(L) · ∇_K C(K(L), L) = ∇_L C(K(L), L),   (4.8)

where ∇_K C(K(L), L) = 0 by the definition of K(L). 
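As a concrete illustration of the inner loop: with stepsize α = 1/2, the Gauss-Newton update (4.5) reduces to K′ = (R_u + B^⊤ P_{K,L} B)^{−1} B^⊤ P_{K,L} Ã_L, which recovers a policy-iteration step for the inner LQR problem. The sketch below is our own illustrative code under this stepsize choice (helper names and toy data are assumptions):

```python
import numpy as np

def dlyap(F, M, iters=3000):
    # Solve X = M + F^T X F by fixed-point iteration (needs rho(F) < 1).
    X = M.copy()
    for _ in range(iters):
        X = M + F.T @ X @ F
    return X

def inner_gauss_newton(A, B, C, Q, Ru, Rv, L, K0, steps=50, alpha=0.5):
    """Inner-loop Gauss-Newton update (4.5) for a fixed L, from stabilizing K0.

    With alpha = 1/2 each step is K' = (Ru + B^T P B)^{-1} B^T P A_L,
    i.e., exact policy improvement for the inner LQR problem.
    """
    At = A - C @ L                 # \tilde{A}_L
    Qt = Q - L.T @ Rv @ L          # \tilde{Q}_L
    K = K0
    for _ in range(steps):
        Acl = At - B @ K
        P = dlyap(Acl, Qt + K.T @ Ru @ K)   # inner-loop P_{K,L} via (3.2)
        G = Ru + B.T @ P @ B
        K = K - 2 * alpha * np.linalg.solve(G, G @ K - B.T @ P @ At)
    return K, P
```

At convergence the iterate satisfies the stationarity condition (4.1), i.e., (R_u + B^⊤ P B) K = B^⊤ P Ã_L.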
Thus, a convergent point (K(L), L) that makes ∇_L C̃(L) = 0 satisfies both conditions ∇_K C(K(L), L) = 0 and ∇_L C(K(L), L) = 0, which implies from Lemma 3.3 that the convergent control pair (K(L), L) constitutes the Nash equilibrium. Thus, we propose the projected nested-gradient update in the outer loop to find the pair (K(L), L):

Projected Nested-Gradient:  L′ = P_Ω^{GD} [L + η ∇_L C̃(L)],   (4.9)

where Ω is some convex set in R^{m_2×d}, and P_Ω^{GD}[·] is the projection operator onto Ω defined as

P_Ω^{GD}[L̃] = argmin_{L∈Ω} Tr[(L − L̃)(L − L̃)^⊤],   (4.10)

i.e., the minimizer of the distance between L̃ and L in Frobenius norm. It is assumed that the set Ω is large enough such that the Nash equilibrium (K^*, L^*) is contained in it. By Assumption 2.1, there always exists a constant ζ with 0 < ζ < σ_min(Q̃_{L^*}), and one example of Ω that serves the purpose is

Ω := {L ∈ R^{m_2×d} | Q − L^⊤ R_v L ≥ ζ · I},   (4.11)

which contains L^* at the NE. Thus, the projection does not exclude the convergence to the NE. The following lemma, whose proof is in §C.5, shows that this set Ω is indeed convex and compact.

Lemma 4.1. The subset Ω ⊂ R^{m_2×d} defined in (4.11) is a convex and compact set.

Remark 4.2 (Constraint Set Ω). The projection is mainly for the purpose of theoretical analysis, and is not necessarily used in the implementation of the algorithm in practice. In fact, the simulation results in §6 show that the algorithms can converge without this projection in many cases. 
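As an aside on implementing (4.10)-(4.11): for general Q and R_v, the Frobenius projection onto Ω has no obvious closed form, but in the special isotropic case Q = q·I and R_v = r·I (an assumption made purely for illustration, not a setting from the paper), membership in Ω reduces to a bound on σ_max(L), and the projection simply clips the singular values of L:

```python
import numpy as np

def project_isotropic(L, q, r, zeta):
    """Frobenius projection (4.10) onto Omega = {L : q I - r L^T L >= zeta I},
    in the special isotropic case Q = q I, R_v = r I.

    Then Omega = {L : sigma_max(L) <= sqrt((q - zeta)/r)}, and the nearest
    point in Frobenius norm is obtained by clipping the singular values of L
    at that radius. This closed form is a sketch for this special case only.
    """
    c = np.sqrt((q - zeta) / r)
    U, s, Vt = np.linalg.svd(L, full_matrices=False)
    return U @ np.diag(np.minimum(s, c)) @ Vt
```

The design choice here relies on the standard fact that the Frobenius-nearest matrix within a spectral-norm ball is obtained by singular-value clipping; an already-feasible L is left unchanged.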
Such a projection is also implementable, since the set to project onto is convex, and the constraint is directly imposed on the policy parameter iterate L (not on some derived quantities, e.g., P_{K(L),L}).

Similarly, we develop the following projected natural nested-gradient update:

Projected Natural Nested-Gradient:  L′ = P_Ω^{NG} [L + η ∇_L C̃(L) Σ_{K(L),L}^{−1}],   (4.12)

where the projection operator P_Ω^{NG}[·] for the natural nested gradient is defined as

P_Ω^{NG}[L̃] = argmin_{Ľ∈Ω} Tr[(Ľ − L̃) Σ_{K(L),L} (Ľ − L̃)^⊤].   (4.13)

A weight matrix Σ_{K(L),L} is added for the convenience of the subsequent theoretical analysis. We note that the weight matrix Σ_{K(L),L} depends on the current iterate L in (4.12).
Moreover, we can develop the projected nested-gradient algorithm with preconditioning matrices. For example, if we assume that R_v − C^⊤ P_{K(L),L} C is positive definite, and define

W_L = R_v − C^⊤ [ P_{K(L),L} − P_{K(L),L} B (R_u + B^⊤ P_{K(L),L} B)^{−1} B^⊤ P_{K(L),L} ] C,   (4.14)

we have the following update, referred to as the projected Gauss-Newton nested-gradient update:

Projected Gauss-Newton Nested-Gradient:  L′ = P_Ω^{GN} [L + η W_L^{−1} ∇_L C̃(L) Σ_{K(L),L}^{−1}],   (4.15)

where the projection operator P_Ω^{GN}[·] is defined as

P_Ω^{GN}[L̃] = argmin_{Ľ∈Ω} Tr[W_L^{1/2} (Ľ − L̃) Σ_{K(L),L} (Ľ − L̃)^⊤ W_L^{1/2}].   (4.16)

The weight matrices Σ_{K(L),L} and W_L both depend on the current iterate L in (4.15).
Based on the updates above, it is straightforward to develop 
model-free versions of the NG algorithms using sampled data. In particular, we propose to first use zeroth-order optimization algorithms to find the stationary point of the inner LQR problem after a finite number of iterations. Since the Gauss-Newton update cannot be estimated via sampling, only the PG and natural PG updates are converted to model-free versions. The approximate stationary point is then substituted into the outer loop to perform the projected (natural) NG updates. Details of our model-free projected NG updates are provided in §A. Note that, building upon our theory below, high-probability convergence guarantees for these model-free counterparts can be established as in the LQR setting in [25].

5 Convergence Results

We start by establishing the convergence for the inner optimization problem as follows, which shows the globally linear convergence rates of the inner-loop policy gradient updates (4.3)-(4.5).

Proposition 5.1 (Global Convergence Rate of Inner-Loop Update). Suppose E_{x_0∼D} x_0 x_0^⊤ > 0 and Assumption 2.1 holds. For any L ∈ Ω, where Ω is defined in (3.4), it follows that: i) the inner-loop LQR problem always admits a solution, with a positive definite P_{K(L),L} and a stabilizing control pair (K(L), L); ii) there exists a constant stepsize α > 0 for each of the updates (4.3)-(4.5) such that the generated control pair sequences {(K_τ, L)}_{τ≥0} are always stabilizing; iii) the updates (4.3)-(4.5) enable the convergence of the cost value sequence {C(K_τ, L)}_{τ≥0} to the optimum C(K(L), L) with a linear rate.

The proof of Proposition 5.1, deferred to §B.2, primarily follows that of Theorem 7 in [25]. 
However, we provide additional stability arguments for the control pair (K_τ, L) along the iterations over τ.
We then establish the global convergence of the projected NG updates (4.9), (4.12), and (4.15). Before we state the results, we define the gradient mappings for the three projection operators P_Ω^{GN}, P_Ω^{NG}, and P_Ω^{GD} at any L ∈ Ω as follows:

Ĝ_L^* := ( P_Ω^{GN}[L + η W_L^{−1} ∇_L C̃(L) Σ_{K(L),L}^{−1}] − L ) / (2η),
G̃_L^* := ( P_Ω^{NG}[L + η ∇_L C̃(L) Σ_{K(L),L}^{−1}] − L ) / (2η),   (5.1)
Ǧ_L^* := ( P_Ω^{GD}[L + η ∇_L C̃(L)] − L ) / (2η).

Note that gradient mappings have been commonly adopted in the analysis of projected gradient descent methods in constrained optimization [53].

Theorem 5.2 (Global Convergence Rate of Outer-Loop Update). Suppose E_{x_0∼D} x_0 x_0^⊤ > 0, Assumption 2.1 holds, and the initial maximizer control satisfies L_0 ∈ Ω, where Ω is defined in (4.11). 
Then it follows that: i) at iteration t of the projected NG updates (4.9), (4.12), and (4.15), the inner-loop updates (4.3)-(4.5) converge to K(L_t) with a linear rate; ii) the control pair sequences {(K(L_t), L_t)}_{t≥0} generated from (4.9), (4.12), and (4.15) are always stabilizing (regardless of the stepsize choice η); iii) with proper choices of the stepsize η, the updates (4.9), (4.12), and (4.15) all converge to the Nash equilibrium (K^*, L^*) of the zero-sum LQ game (3.3) with O(1/t) rate, in the sense that the sequences {t^{−1} Σ_{τ=0}^{t−1} ‖Ĝ_{L_τ}^*‖²}_{t≥1}, {t^{−1} Σ_{τ=0}^{t−1} ‖G̃_{L_τ}^*‖²}_{t≥1}, and {t^{−1} Σ_{τ=0}^{t−1} ‖Ǧ_{L_τ}^*‖²}_{t≥1} all converge to zero with O(1/t) rate.

Since the set Ω in (4.11) is contained in the set defined in (3.4), the first two arguments follow directly by applying Proposition 5.1. The last argument shows that the iterates (K(L_t), L_t) generated from the projected NG updates converge to the Nash equilibrium at a sublinear rate. A detailed proof of Theorem 5.2 is provided in §B.3.
Due to the nonconvexity-nonconcavity of the problem (see Lemma 3.1), our result is pertinent to the recent work on finding a first-order stationary point for nonconvex-nonconcave minimax games under the Polyak-Łojasiewicz (PŁ) condition for one of the players [28]. Interestingly, the LQ games considered here also satisfy the one-sided PŁ condition in [28], since for a given L ∈ Ω, the inner problem is an LQR, which enables the use of Lemma 11 in [25] to show this. However, as recognized by [25] for LQR problems, the main challenge of the LQ games here, in contrast to the minimax game setting in [28], is coping with the lack of smoothness in the objective function.
This O(1/t) rate matches the sublinear convergence rate to first-order stationary points, instead of (local) Nash equilibria, in [28]. 
In contrast, by the landscape of zero-sum LQ games shown in Lemma 3.3, our convergence is to the global NE of the game whenever the projection step is inactive. In fact, in this case, the convergence rate can be improved to linear, as introduced next in Theorem 5.3. In addition, our rate also matches the (worst-case) global convergence rate of gradient descent and second-order algorithms for nonconvex optimization, either under the smoothness assumption on the objective [54, 55], or for a certain class of non-smooth objectives [56].
Compared to [25], the nested-gradient algorithms cannot be shown to have globally linear convergence rates so far, owing to the additional nonconcavity in L added to the standard LQR problem. Nonetheless, the PŁ property of the LQ games still enables a linear convergence rate near the Nash equilibrium. We formally establish the local convergence results in the following theorem, whose proof is provided in §B.4.

Theorem 5.3 (Local Convergence Rate of Outer-Loop Update). Under the conditions of Theorem 5.2, the projected NG updates (4.9), (4.12), and (4.15) all have locally linear convergence rates around the Nash equilibrium (K^*, L^*) of the LQ game (3.3), in the sense that the cost value sequence {C(K(L_t), L_t)}_{t≥0} converges to C(K^*, L^*), and the nested gradient norm square sequence {‖∇_L C̃(L_t)‖²}_{t≥0} converges to zero, both with linear rates.

Theorem 5.3 shows that when the proposed NG updates (4.9), (4.12), and (4.15) get closer to the NE (K^*, L^*), the local convergence rates can be improved from sublinear (see Theorem 5.2) to linear. This resembles the convergence property of (quasi-)Newton methods for nonconvex optimization, with globally sublinear and locally linear convergence rates. 
To the best of our knowledge, this appears to be the first such result on equilibrium-seeking for nonconvex-nonconcave minimax games, even under the smoothness assumption of [28].

Remark 5.4. We note that for the class of zero-sum LQ games in which Assumption 2.1 ii) fails to hold, there may not exist a set Ω of the form (4.11) that contains the NE (K∗, L∗). Even then, our global convergence results in Proposition 5.1 and Theorem 5.2 still hold, because the convergence is established in the sense of gradient mappings. In this case, the statement should be changed from global convergence to the NE to global convergence to the projection of the NE onto Ω. However, this may invalidate the local convergence statements in Theorem 5.3, as their proof relies on the projection operator being inactive around the NE.

6 Simulation Results

In this section, we provide numerical results illustrating the convergence properties of several PO methods. We consider two settings, referred to as Case 1 and Case 2, which are created based on the simulations in [35], with

A = [0.956488, 0.0816012, −0.0005; 0.0741349, 0.94121, −0.000708383; 0, 0, 0.132655], B = [−0.00550808, −0.096, 0.867345]ᵀ,

and Ru = Rv = I, Σ0 = 0.03·I. We choose Q = I and C = [0.00951892, 0.0038373, 0.001]ᵀ for Case 1, while Q = 0.01·I and C = [0.00951892, 0.0038373, 0.2]ᵀ for Case 2. By direct calculation, we have

Case 1: P∗ = [23.7658, 16.8959, 0.0937; 16.8959, 18.4645, 0.1014; 0.0937, 0.1014, 1.0107], Case 2: P∗ = [6.0173, 5.6702, −0.0071; 5.6702, 5.4213, −0.0067; −0.0071, −0.0067, 0.0102].

Thus, one can easily check that Rv − CᵀP∗C > 0 is satisfied for both Case 1 and Case 2, i.e., Assumption 2.1 i) holds.
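The positivity check above can be reproduced numerically. A small numpy sketch using the matrices as given (here Rv − CᵀP∗C is a scalar, since the disturbance input is one-dimensional; the pairing of the two P∗ matrices with the two cases follows the values reported above):

```python
import numpy as np

# P* and C for the two cases, as reported in the text; Rv = I (scalar).
P1 = np.array([[23.7658, 16.8959, 0.0937],
               [16.8959, 18.4645, 0.1014],
               [ 0.0937,  0.1014, 1.0107]])
C1 = np.array([[0.00951892], [0.0038373], [0.001]])

P2 = np.array([[ 6.0173,  5.6702, -0.0071],
               [ 5.6702,  5.4213, -0.0067],
               [-0.0071, -0.0067,  0.0102]])
C2 = np.array([[0.00951892], [0.0038373], [0.2]])

Rv = np.eye(1)

# Assumption 2.1 i): Rv - C^T P* C must be positive definite.
gap1 = float((Rv - C1.T @ P1 @ C1)[0, 0])
gap2 = float((Rv - C2.T @ P2 @ C2)[0, 0])
print(gap1, gap2)  # both positive, hence Assumption 2.1 i) holds
```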
However, for Case 1, λmin(Q − (L∗)ᵀRvL∗) = 0.8739 > 0, so Assumption 2.1 ii) is satisfied; for Case 2, λmin(Q − (L∗)ᵀRvL∗) = −0.0011 < 0, so it fails.

In both settings, we evaluate the convergence performance not only of our nested-gradient methods, but also of two types of variants: alternating-gradient (AG) and gradient-descent-ascent (GDA) methods. AG methods are based on the nested-gradient methods, but at each outer-loop iteration the inner-loop gradient-based updates perform only a finite number of iterations, instead of converging to the exact solution K(Lt) as in the nested-gradient methods; this follows the idea in [28]. The GDA methods perform policy gradient descent for the minimizer and ascent for the maximizer simultaneously. Detailed updates of these two types of methods are deferred to §D.

Figure 1 shows that for Case 1, our nested-gradient methods indeed converge globally to the NE. The cost C(K(L), L) monotonically increases to its value at the NE, and the convergence rate of the natural NG method sits between those of the other two NG methods. Also, we note that the convergence rates of the squared gradient mapping in (b) are linear; this is because, as shown in (c), λmin(Q̃L) remains positive along the iterations, i.e., the projection is never active. Hence our convergence here follows the local linear rates in Theorem 5.3, even though the initialization is random (i.e., global). Figure 2 further shows that even without Assumption 2.1 ii), i.e., in Case 2, all the PO methods mentioned successfully converge to the NE, although the cost sequences no longer converge monotonically. This motivates developing theory for other policy optimization methods, and for more general settings of LQ games.
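The three update patterns just described (nested, alternating, and simultaneous) can be contrasted on a toy saddle-point problem. This is a minimal Python sketch; the quadratic objective f(K, L) = (K − L)² − L², its saddle point (0, 0), and the stepsizes are illustrative assumptions, not the LQ game or the paper's actual updates:

```python
# Toy zero-sum objective f(K, L) = (K - L)^2 - L^2 with saddle point (0, 0).
# K is the minimizer's variable, L the maximizer's.

def grad_K(K, L):      # df/dK
    return 2.0 * (K - L)

def grad_L(K, L):      # df/dL = -2(K - L) - 2L = -2K
    return -2.0 * K

def nested(T=100, eta=0.1):
    # Nested gradient: inner minimization solved exactly (K(L) = L here),
    # then one outer ascent step on f(K(L), L) = -L^2.
    L = 0.9
    for _ in range(T):
        K = L                        # exact inner solution
        L = L + eta * (-2.0 * L)     # outer gradient ascent
    return K, L

def alternating(T=100, eta=0.1, inner_iters=5):
    # AG variant: only a finite number of inner descent steps per outer step.
    K, L = 0.5, 0.9
    for _ in range(T):
        for _ in range(inner_iters):
            K = K - eta * grad_K(K, L)
        L = L + eta * grad_L(K, L)
    return K, L

def gda(T=400, eta=0.1):
    # GDA: simultaneous descent (K) and ascent (L) steps. For this toy
    # objective GDA happens to converge; in general it need not.
    K, L = 0.5, 0.9
    for _ in range(T):
        K, L = K - eta * grad_K(K, L), L + eta * grad_L(K, L)
    return K, L

for name, fn in [("nested", nested), ("alternating", alternating), ("gda", gda)]:
    K, L = fn()
    print(name, K, L)  # all three approach the saddle point (0, 0)
```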
We note that no projection was imposed when implementing these algorithms in any of our experiments, which suggests that the projection here serves only the theoretical analysis. In fact, we have not found an instance of zero-sum LQ games that makes the projections active as the algorithms proceed. This motivates the theoretical study of projection-free algorithms in our future work. More simulation results can be found in §D.

Figure 1: Performance of the three projected NG methods for Case 1, where Assumption 2.1 ii) is satisfied. (a) shows the monotone convergence of the expected cost C(K(L), L) to the NE cost C(K∗, L∗); (b) shows the convergence of the squared gradient-mapping norm; (c) shows the evolution of the smallest eigenvalue of Q̃L = Q − LᵀRvL.

Figure 2: Convergence of the cost for Case 2, where Assumption 2.1 ii) is not satisfied. (a), (b), and (c) show the convergence of the NG, AG, and GDA methods, respectively.

7 Concluding Remarks

This paper has developed policy optimization methods, specifically projected nested-gradient methods, to solve for the Nash equilibria of zero-sum LQ games. Despite the nonconvexity-nonconcavity of the problem, the gradient-based algorithms have been shown to converge to the NE with globally sublinear and locally linear rates. This work appears to be the first to show that policy optimization methods converge to the NE of a class of zero-sum Markov games, with finite-iteration analyses.
Our simulation results have demonstrated the favorable convergence properties of our algorithms, even without the projection operator, and likewise for the gradient-descent-ascent algorithms with simultaneous updates of both players, even when Assumption 2.1 ii) is relaxed. Based on both the theory and the simulations, future directions include convergence analysis under a relaxed version of Assumption 2.1 and for projection-free versions of the algorithms, which we believe can be carried out with the techniques in our recent work [22]. Besides, developing policy optimization methods for general-sum LQ games is another interesting yet challenging future direction.

Acknowledgements

K. Zhang and T. Başar were supported in part by the US Army Research Laboratory (ARL) Cooperative Agreement W911NF-17-2-0196, and in part by the Office of Naval Research (ONR) MURI Grant N00014-16-1-2710. Z. Yang was supported by a Tencent PhD Fellowship.

References

[1] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

[2] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search.
Nature, 529(7587):484–489, 2016.

[3] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.

[4] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[5] Oriol Vinyals, Igor Babuschkin, Junyoung Chung, Michael Mathieu, Max Jaderberg, Wojciech M. Czarnecki, Andrew Dudzik, Aja Huang, Petko Georgiev, Richard Powell, Timo Ewalds, Dan Horgan, Manuel Kroiss, Ivo Danihelka, John Agapiou, Junhyuk Oh, Valentin Dalibard, David Choi, Laurent Sifre, Yury Sulsky, Sasha Vezhnevets, James Molloy, Trevor Cai, David Budden, Tom Paine, Caglar Gulcehre, Ziyu Wang, Tobias Pfaff, Toby Pohlen, Yuhuai Wu, Dani Yogatama, Julia Cohen, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy Lillicrap, Chris Apps, Koray Kavukcuoglu, Demis Hassabis, and David Silver. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.

[6] Vijay R Konda and John N Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000.

[7] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[8] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.

[9] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning.
In International Conference on Machine Learning, pages 1928–1937, 2016.

[10] Brendan O'Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and Q-learning. arXiv preprint arXiv:1611.01626, 2016.

[11] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, pages 1057–1063, 2000.

[12] Sham M Kakade. A natural policy gradient. In Advances in Neural Information Processing Systems, pages 1531–1538, 2002.

[13] Matteo Papini, Damiano Binaghi, Giuseppe Canonaco, Matteo Pirotta, and Marcello Restelli. Stochastic variance-reduced policy gradient. arXiv preprint arXiv:1806.05618, 2018.

[14] Kaiqing Zhang, Alec Koppel, Hao Zhu, and Tamer Başar. Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383, 2019.

[15] Tamer Başar and Pierre Bernhard. H∞ Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Springer Science & Business Media, 2008.

[16] Pablo Hernandez-Leal, Michael Kaisers, Tim Baarslag, and Enrique Munoz de Cote. A survey of learning in multiagent environments: Dealing with non-stationarity. arXiv preprint arXiv:1707.09183, 2017.

[17] Katta G Murty and Santosh N Kabadi. Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2):117–129, 1987.

[18] Robert S Chen, Brendan Lucier, Yaron Singer, and Vasilis Syrgkanis. Robust optimization for non-convex objectives. In Advances in Neural Information Processing Systems, pages 4705–4714, 2017.

[19] David Balduzzi, Sebastien Racaniere, James Martens, Jakob Foerster, Karl Tuyls, and Thore Graepel. The mechanics of n-player differentiable games. In International Conference on Machine Learning, pages 363–372, 2018.

[20] Eric Mazumdar and Lillian J Ratliff. On the convergence of competitive, multi-agent gradient-based learning. arXiv preprint arXiv:1804.05464, 2018.

[21] Leonard Adolphs, Hadi Daneshmand, Aurelien Lucchi, and Thomas Hofmann. Local saddle point optimization: A curvature exploitation approach. 2019.

[22] Kaiqing Zhang, Bin Hu, and Tamer Başar. Policy optimization for H2 linear control with H∞ robustness guarantee: Implicit regularization and global convergence. arXiv preprint arXiv:1910.09496, 2019.

[23] David Jacobson. Optimal stochastic linear systems with exponential performance criteria and their relation to deterministic differential games. IEEE Transactions on Automatic Control, 18(2):124–131, 1973.

[24] Peter Whittle. Risk-sensitive linear/quadratic/Gaussian control. Advances in Applied Probability, 13(4):764–777, 1981.

[25] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning, pages 1467–1476, 2018.

[26] Dhruv Malik, Ashwin Pananjady, Kush Bhatia, Koulik Khamaru, Peter L Bartlett, and Martin J Wainwright. Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. arXiv preprint arXiv:1812.08305, 2018.

[27] Stephen Tu and Benjamin Recht. The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. arXiv preprint arXiv:1812.03565, 2018.

[28] Maher Nouiehed, Maziar Sanjabi, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. arXiv preprint arXiv:1902.08297, 2019.

[29] Michael L Littman. Markov games as a framework for multi-agent reinforcement learning.
In International Conference on Machine Learning, pages 157–163, 1994.

[30] Michail G Lagoudakis and Ronald Parr. Value function approximation in zero-sum Markov games. In Conference on Uncertainty in Artificial Intelligence, pages 283–292, 2002.

[31] Vincent Conitzer and Tuomas Sandholm. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 67(1-2):23–43, 2007.

[32] Julien Pérolat, Bilal Piot, Bruno Scherrer, and Olivier Pietquin. On the use of non-stationary strategies for solving two-player zero-sum Markov games. In Conference on Artificial Intelligence and Statistics, 2016.

[33] Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Başar. Finite-sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783, 2018.

[34] Shaofeng Zou, Tengyu Xu, and Yingbin Liang. Finite-sample analysis for SARSA and Q-learning with linear function approximation. arXiv preprint arXiv:1902.02234, 2019.

[35] Asma Al-Tamimi, Frank L Lewis, and Murad Abu-Khalaf. Model-free Q-learning designs for linear discrete-time zero-sum games with application to H-infinity control. Automatica, 43(3):473–481, 2007.

[36] Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, pages 2817–2826, 2017.

[37] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, volume 17, pages 1021–1026, 2001.

[38] Bikramjit Banerjee and Jing Peng. Adaptive policy gradient in multiagent learning. In Conference on Autonomous Agents and Multiagent Systems, pages 686–692. ACM, 2003.

[39] Julien Pérolat, Bilal Piot, and Olivier Pietquin.
Actor-critic fictitious play in simultaneous move multistage games. In International Conference on Artificial Intelligence and Statistics, pages 919–928, 2018.

[40] Sriram Srinivasan, Marc Lanctot, Vinicius Zambaldi, Julien Pérolat, Karl Tuyls, Rémi Munos, and Michael Bowling. Actor-critic policy optimization in partially observable multiagent environments. In Advances in Neural Information Processing Systems, pages 3422–3435, 2018.

[41] Ashish Cherukuri, Bahman Gharesifard, and Jorge Cortes. Saddle-point dynamics: Conditions for asymptotic stability of saddle points. SIAM Journal on Control and Optimization, 55(1):486–511, 2017.

[42] Hassan Rafique, Mingrui Liu, Qihang Lin, and Tianbao Yang. Non-convex min-max optimization: Provable algorithms and applications in machine learning. arXiv preprint arXiv:1810.02060, 2018.

[43] Constantinos Daskalakis and Ioannis Panageas. The limit points of (optimistic) gradient descent in min-max optimization. In Advances in Neural Information Processing Systems, pages 9236–9246, 2018.

[44] Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chandrasekhar, and Georgios Piliouras. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. In International Conference on Learning Representations, 2019.

[45] Eric V Mazumdar, Michael I Jordan, and S Shankar Sastry. On finding local Nash equilibria (and only local Nash equilibria) in zero-sum games. arXiv preprint arXiv:1901.00838, 2019.

[46] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Minmax optimization: Stable limit points of gradient descent ascent are locally optimal. arXiv preprint arXiv:1902.00618, 2019.

[47] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium.
In Advances in Neural Information Processing Systems, pages 6626–6637, 2017.

[48] Vaishnavh Nagarajan and J Zico Kolter. Gradient descent GAN optimization is locally stable. In Advances in Neural Information Processing Systems, pages 5585–5595, 2017.

[49] Songtao Lu, Rahul Singh, Xiangyi Chen, Yongxin Chen, and Mingyi Hong. Understand the dynamics of GANs via primal-dual optimization. 2018.

[50] Paulina Grnarova, Kfir Y Levy, Aurelien Lucchi, Thomas Hofmann, and Andreas Krause. An online learning approach to generative adversarial networks. arXiv preprint arXiv:1706.03269, 2017.

[51] Qihang Lin, Mingrui Liu, Hassan Rafique, and Tianbao Yang. Solving weakly-convex-weakly-concave saddle-point problems as weakly-monotone variational inequality. arXiv preprint arXiv:1810.10207, 2018.

[52] Anton A Stoorvogel and Arie JTM Weeren. The discrete-time Riccati equation related to the H∞ control problem. IEEE Transactions on Automatic Control, 39(3):686–691, 1994.

[53] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.

[54] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. On the complexity of steepest descent, Newton's and regularized Newton's methods for nonconvex unconstrained optimization problems. SIAM Journal on Optimization, 20(6):2833–2852, 2010.

[55] Coralia Cartis, Nicholas IM Gould, and Philippe L Toint. Worst-case evaluation complexity and optimality of second-order methods for nonconvex smooth optimization. arXiv preprint arXiv:1709.07180, 2017.

[56] Koulik Khamaru and Martin J Wainwright. Convergence guarantees for a class of non-convex and non-smooth optimization problems. arXiv preprint arXiv:1804.09629, 2018.

[57] Huibert Kwakernaak and Raphael Sivan. Linear Optimal Control Systems, volume 1. Wiley-Interscience, New York, 1972.

[58] Dimitri P. Bertsekas.
Dynamic Programming and Optimal Control, volume 1. Athena Scientific, Belmont, MA, 2005.

[59] Steven G Krantz and Harold R Parks. The Implicit Function Theorem: History, Theory, and Applications. Springer Science & Business Media, 2012.

[60] Eugene E Tyrtyshnikov. A Brief Introduction to Numerical Analysis. Springer Science & Business Media, 2012.

[61] David Jacobson. On values and strategies for infinite-time linear quadratic games. IEEE Transactions on Automatic Control, 22(3):490–491, 1977.

[62] Jan R Magnus and Heinz Neudecker. Matrix differential calculus with applications to simple, Hadamard, and Kronecker products. Journal of Mathematical Psychology, 29(4):474–492, 1985.

[63] Michail M Konstantinov, P Hr Petkov, and Nicolai D Christov. Perturbation analysis of the discrete Riccati equation. Kybernetika, 29(1):18–29, 1993.

[64] Ji-Guang Sun. Perturbation theory for algebraic Riccati equations. SIAM Journal on Matrix Analysis and Applications, 19(1):39–65, 1998.

[65] Alexander Graham. Kronecker Products and Matrix Calculus with Applications. Courier Dover Publications, 2018.