{"title": "Online Reinforcement Learning in Stochastic Games", "book": "Advances in Neural Information Processing Systems", "page_first": 4987, "page_last": 4997, "abstract": "We study online reinforcement learning in average-reward stochastic games (SGs). An SG models a two-player zero-sum game in a Markov environment, where state transitions and one-step payoffs are determined simultaneously by a learner and an adversary. We propose the \\textsc{UCSG} algorithm that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent. This result improves previous ones under the same setting. The regret bound has a dependency on the \\textit{diameter}, which is an intrinsic value related to the mixing property of SGs. Slightly extended, \\textsc{UCSG} finds an $\\varepsilon$-maximin stationary policy with a sample complexity of $\\tilde{\\mathcal{O}}\\left(\\text{poly}(1/\\varepsilon)\\right)$, where $\\varepsilon$ is the error parameter. To the best of our knowledge, this extended result is the first in the average-reward setting. In the analysis, we develop Markov chain's perturbation bounds for mean first passage times and techniques to deal with non-stationary opponents, which may be of interest in their own right.", "full_text": "Online Reinforcement Learning in Stochastic Games\n\nChen-Yu Wei\n\nInstitute of Information Science\n\nAcademia Sinica, Taiwan\n\nYi-Te Hong\n\nInstitute of Information Science\n\nAcademia Sinica, Taiwan\n\nbahh723@iis.sinica.edu.tw\n\nted0504@iis.sinica.edu.tw\n\nChi-Jen Lu\n\nInstitute of Information Science\n\nAcademia Sinica, Taiwan\n\ncjlu@iis.sinica.edu.tw\n\nAbstract\n\nWe study online reinforcement learning in average-reward stochastic games (SGs).\nAn SG models a two-player zero-sum game in a Markov environment, where state\ntransitions and one-step payoffs are determined simultaneously by a learner and\nan adversary. 
We propose the UCSG algorithm that achieves a sublinear regret compared to the game value when competing with an arbitrary opponent. This result improves previous ones under the same setting. The regret bound has a dependency on the diameter, which is an intrinsic value related to the mixing property of SGs. If we let the opponent play an optimistic best response to the learner, UCSG finds an ε-maximin stationary policy with a sample complexity of Õ(poly(1/ε)), where ε is the gap to the best policy.

1 Introduction

Many real-world scenarios (e.g., markets, computer networks, board games) can be cast as multi-agent systems. The framework of Multi-Agent Reinforcement Learning (MARL) aims at learning to act in such systems. While in traditional reinforcement learning (RL) problems, Markov decision processes (MDPs) are widely used to model a single agent's interaction with the environment, stochastic games (SGs, [32]), as an extension of MDPs, are able to describe multiple agents' simultaneous interaction with the environment. In this view, SGs are well-suited to model MARL problems [24].

In this paper, two-player zero-sum SGs are considered. These games proceed like MDPs, with the exception that in each state, both players select their own actions simultaneously¹, which jointly determine the transition probabilities and their rewards. The zero-sum property requires that the two players' payoffs sum to zero. Thus, while one player (Player 1) wants to maximize his/her total reward, the other (Player 2) would like to minimize that amount. Similar to the case of MDPs, the reward can be discounted or undiscounted, and the game can be episodic or non-episodic.

In the literature, SGs are typically learned under two different settings, which we will call the online and offline settings, respectively.
In the offline setting, the learner controls both players in a centralized manner, and the goal is to find the equilibrium of the game [33, 21, 30]. This is also known as finding the worst-case optimality for each player (a.k.a. the maximin or minimax policy). In this case, we care about the sample complexity, i.e., how many samples are required to estimate the worst-case optimality such that the error is below some threshold. In the online setting, the learner controls only one of the players, and plays against an arbitrary opponent [24, 4, 5, 8, 31]. In this case, we care about the learner's regret, i.e., the difference between some benchmark measure and the learner's total reward earned in the learning process. This benchmark can be defined as the total reward when both players play optimal policies [5], or when Player 1 plays the best stationary response to Player 2 [4]. Some of the above online-setting algorithms can find the equilibrium simply through self-play.

Most previous results on offline sample complexity consider discounted SGs. Their bounds depend heavily on the chosen discount factor [33, 21, 30, 31]. However, as noted in [5, 19], the discounted setting might not be suitable for SGs that require long-term planning, because only finitely many steps are relevant in the reward function it defines. This paper, to the best of our knowledge, is the first to give an offline sample complexity bound of order Õ(poly(1/ε)) in the average-reward (undiscounted and non-episodic) setting, where ε is the error parameter.

¹Turn-based SGs, like Go, are special cases: in each state, one player's action set contains only a null action.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
A major difference between our algorithm and previous ones is that the two players play asymmetric roles in our algorithm: by focusing on finding only one player's worst-case optimal policy at a time, the sampling can be rather efficient. This resembles, but strictly extends, the methods of [13] for finding the maximin action in a two-stage game.

In the online setting, we are only aware of [5]'s R-MAX algorithm that deals with average-reward SGs and provides a regret bound. Considering a similar scenario and adopting the same regret definition, we significantly improve their bounds (see Appendix A for details). Another difference between our algorithm and theirs is that ours is able to output a currently best stationary policy at any stage of the learning process, while theirs only produces a Tε-step fixed-horizon policy for some input parameter Tε. The former could be more natural, since the worst-case optimal policy is itself a stationary policy.

The techniques used in this paper are most related to RL for MDPs based on the optimism principle [2, 19, 9] (see Appendix A). The optimism principle, built on concentration inequalities, automatically strikes a balance between exploitation and exploration, eliminating the need to manually adjust the learning rate or the exploration ratio. However, when importing analysis from MDPs to SGs, we face the challenge caused by the opponent's uncontrollability and non-stationarity. This prevents the learner from freely exploring the state space and renders previous analyses that rely on the perturbation analysis of stationary distributions [2] inapplicable. In this paper, we develop a novel way to replace the opponent's non-stationary policy with a stationary one in the analysis (introduced in Section 5.1), which facilitates the use of techniques based on perturbation analysis.
We hope that this technique can benefit future analyses concerning non-stationary agents in MARL.

One related topic is the robust MDP problem [29, 17, 23]. A robust MDP is an MDP in which some state-action pairs have adversarial rewards and transitions. It is often assumed in robust MDPs that the adversarial choices of the environment are not directly observable by the player, whereas in our SG setting, we assume that the actions of Player 2 can be observed. However, there are still difficulties in SGs that are not addressed by previous works on robust MDPs.

Here we compare our work to [23], a recent work on learning robust MDPs. In their setting, there are adversarial and stochastic state-action pairs, and their proposed OLRM2 algorithm tries to distinguish them. Under the scenario where the environment is fully adversarial, which is the counterpart to our setting, the worst-case transitions and rewards are all revealed to the learner, and what the learner needs to do is to perform maximin planning. In our case, however, the worst-case transitions and rewards are still to be learned, and the opponent's arbitrary actions may hinder the learner from learning this information. We would say that the contribution of [23] is orthogonal to ours.

Other lines of research related to SGs are on MDPs with adversarially changing reward functions [11, 27, 28, 10] and with adversarially changing transition probabilities [35, 1]. The assumptions in these works differ from ours in several respects, and therefore their results are not comparable to ours. However, they indeed provide other viewpoints on learning in stochastic games.

2 Preliminaries

Game Models and Policies. An SG is a 4-tuple M = (S, A, r, p). S denotes the state space and A = A¹ × A² the players' joint action space. We denote S = |S| and A = |A|. The game starts from an initial state s_1. Suppose at time t the players are at state s_t.
After the players play the joint actions (a¹_t, a²_t), Player 1 receives the reward r_t = r(s_t, a¹_t, a²_t) ∈ [0, 1] from Player 2, and both players visit state s_{t+1} following the transition probability p(·|s_t, a¹_t, a²_t). For simplicity, we consider deterministic rewards as in [3]. The extension to the stochastic case is straightforward. We shorten our notation by a := (a¹, a²) or a_t := (a¹_t, a²_t), and use abbreviations such as r(s_t, a_t) and p(·|s_t, a_t).

Without loss of generality, players are assumed to determine their actions based on the history. A policy π at time t maps the history up to time t, H_t = (s_1, a_1, r_1, ..., s_t) ∈ ℋ_t, to a probability distribution over actions. Such policies are called history-dependent policies, whose class is denoted by Π_HR. On the other hand, a stationary policy, whose class is denoted by Π_SR, selects actions as a function of the current state. For either class, joint policies (π¹, π²) are often written as π.

Average Return and the Game Value. Let the players play joint policy π. Define the T-step total reward as R_T(M, π, s) := ∑_{t=1}^T r(s_t, a_t), where s_1 = s, and the average reward as ρ(M, π, s) := lim_{T→∞} (1/T) E[R_T(M, π, s)], whenever the limit exists. In fact, the game value exists² [26]:

ρ*(M, s) := sup_{π¹} inf_{π²} lim_{T→∞} (1/T) E[R_T(M, π¹, π², s)].

If ρ(M, π, s) or ρ*(M, s) does not depend on the initial state s, we simply write ρ(M, π) or ρ*(M).

The Bias Vector.
For a stationary policy π, the bias vector h(M, π, ·) is defined, for each coordinate s, as

h(M, π, s) := E[ ∑_{t=1}^∞ (r(s_t, a_t) − ρ(M, π, s)) | s_1 = s, a_t ∼ π(·|s_t) ].   (1)

The bias vector satisfies the Bellman equation: ∀s ∈ S,

ρ(M, π, s) + h(M, π, s) = r(s, π) + ∑_{s'} p(s'|s, π) h(M, π, s'),

where r(s, π) := E_{a∼π(·|s)}[r(s, a)] and p(s'|s, π) := E_{a∼π(·|s)}[p(s'|s, a)].

The vector h(M, π, ·) describes the relative advantage among states under model M and (joint) policy π. The advantage (or disadvantage) of state s compared to state s' under policy π is defined as the difference between the accumulated rewards with initial states s and s', which, from (1), converges to the difference h(M, π, s) − h(M, π, s') asymptotically. For the ease of notation, the span of a vector v is defined as sp(v) := max_i v_i − min_i v_i. Therefore, if a model, together with any policy, induces large sp(h), then this model will be difficult to learn, because visiting a bad state costs a lot in the learning process. As shown in [3] for the MDP case, the regret has an inevitable dependency on sp(h(M, π*, ·)), where π* is the optimal policy.

On the other hand, sp(h(M, π, ·)) is closely related to the mean first passage time under the Markov chain induced by M and π. Actually, we have sp(h(M, π, ·)) ≤ T^π(M) := max_{s,s'} T^π_{s→s'}(M), where T^π_{s→s'}(M) denotes the expected time to reach state s' starting from s when the model is M and the player(s) follow the (joint) policy π.
This fact is intuitive, and the proof can be seen at Remark M.1.

Notations. In order to save space, we often write equations in vector or matrix form. We use vector inequalities: if u, v ∈ R^n, then u ≤ v ⇔ u_i ≤ v_i ∀i = 1, ..., n. For a general matrix game with matrix G of size n × m, we denote the value of the game as val G := max_{p∈Δ_n} min_{q∈Δ_m} pᵀGq = min_{q∈Δ_m} max_{p∈Δ_n} pᵀGq, where Δ_k is the probability simplex of dimension k. In SGs, given the estimated value function u(s') ∀s', we often need to solve the following matrix game equation:

v(s) = max_{π¹(·|s)} min_{π²(·|s)} E_{a¹∼π¹(·|s), a²∼π²(·|s)} [ r(s, a¹, a²) + ∑_{s'} p(s'|s, a¹, a²) u(s') ],

and this is abbreviated with the vector form v = val{r + Pu}. We also use solve1 G and solve2 G to denote the optimal solutions of p and q. In addition, the indicator function is denoted by 1{·}.

²Unlike in one-player MDPs, the sup and inf in the definition of ρ*(M, s) are not necessarily attainable. Moreover, players may not have stationary optimal policies.

3 Problem Settings and Results Overview

We assume that the game proceeds for T steps. In order to have meaningful regret bounds (i.e., sublinear in T), we must make some assumptions on the SG model itself. Our two different assumptions are:

Assumption 1. max_{s,s'} max_{π¹∈Π_SR} max_{π²∈Π_SR} T^{π¹,π²}_{s→s'}(M) ≤ D.

Assumption 2. max_{s,s'} max_{π²∈Π_SR} min_{π¹∈Π_SR} T^{π¹,π²}_{s→s'}(M) ≤ D.

Why we make these assumptions is as follows. Consider an SG model where the opponent (Player 2) has some way to lock the learner (Player 1) to some bad state.
The best strategy for the learner might be to totally avoid, if possible, entering that state. However, in the early stage of the learning process, the learner does not know this, and he/she has a certain probability of visiting that state and getting locked. This causes linear regret to the learner. Therefore, we assume the following: whatever policy the opponent executes, the learner always has some way to reach any state within some bounded time. This is essentially our Assumption 2.

Assumption 1 is the stronger one: it actually implies that under any policies executed by the players (not necessarily stationary; see Remark M.2), every state is visited within an average of D steps. We find that under this assumption, the asymptotic regret can be improved. This assumption also has a flavor similar to the conditions required for the convergence of Q-learning-type algorithms: they require that every state be visited infinitely often; see [18] for example.

These assumptions define notions of diameter that are specific to the SG model. It is known that under Assumption 1 or Assumption 2, both players have optimal stationary policies, and the game value is independent of the initial state. Thus we can simply write ρ*(M, s) as ρ*(M). For a proof of these facts, please refer to Theorem E.1 in the appendix.

3.1 Two Settings and Results Overview

We focus on training Player 1 and discuss two settings. In the online setting, Player 1 competes with an arbitrary Player 2. The regret is defined as

Reg^{(on)}_T = ∑_{t=1}^T (ρ*(M) − r(s_t, a_t)).

In the offline setting, we control both Player 1 and Player 2's actions, and find Player 1's maximin policy.
The sample complexity is defined as

L_ε = ∑_{t=1}^T 1{ρ*(M) − min_{π²} ρ(M, π¹_t, π²) > ε},

where π¹_t is a stationary policy being executed by Player 1 at time t. This definition is similar to those in [20, 19] for one-player MDPs. By the definition of L_ε, if we have an upper bound for L_ε and run the algorithm for T > L_ε steps, there is some t such that π¹_t is ε-optimal. We will explain how to pick this t in Section 7 and Appendix L.

It turns out that we can use almost the same algorithm to handle these two settings. Since learning in the online setting is more challenging, from now on we will mainly focus on the online setting, and leave the discussion about the offline setting to the end of the paper. Our results can be summarized by the following two theorems.

Theorem 3.1. Under Assumption 1, UCSG achieves Reg^{(on)}_T = Õ(D³S⁵A + DS√(AT)) w.h.p.³

Theorem 3.2. Under Assumption 2, UCSG achieves Reg^{(on)}_T = Õ(∛(DS²AT²)) w.h.p.

³We write "with high probability, g = Õ(f)" or "w.h.p., g = Õ(f)" to indicate "with probability ≥ 1 − δ, g = f1·O(f) + f2", where f1, f2 are some polynomials of log D, log S, log A, log T, log(1/δ).

4 Upper Confidence Stochastic Game Algorithm (UCSG)

Algorithm 1 UCSG
Input: S, A = A¹ × A², T.
Initialization: t = 1.
for phase k = 1, 2, ... do
  t_k = t.
  1. Initialize phase k: v_k(s, a) = 0, n_k(s, a) = max{1, ∑_{τ=1}^{t_k−1} 1{(s_τ, a_τ) = (s, a)}},
     n_k(s, a, s') = ∑_{τ=1}^{t_k−1} 1{(s_τ, a_τ, s_{τ+1}) = (s, a, s')}, p̂_k(s'|s, a) = n_k(s, a, s')/n_k(s, a), ∀s, a, s'.
  2. Update the confidence set: M_k = {M̃ : ∀s, a, p̃(·|s, a) ∈ P_k(s, a)}, where
     P_k(s, a) := CONF1(p̂_k(·|s, a), n_k(s, a)) ∩ CONF2(p̂_k(·|s, a), n_k(s, a)).
  3. Optimistic planning: (M¹_k, π¹_k) = MAXIMIN-EVI(M_k, γ_k), where γ_k := 1/√(t_k).
  4. Execute policies:
  repeat
    Draw a¹_t ∼ π¹_k(·|s_t); observe the reward r_t and the next state s_{t+1}.
    Set v_k(s_t, a_t) = v_k(s_t, a_t) + 1 and t = t + 1.
  until ∃(s, a) such that v_k(s, a) = n_k(s, a)
end for

Definitions of confidence regions, with δ_1 = δ/(2S²A log₂ T):

CONF1(p̂, n) := { p̃ ∈ [0, 1]^S : ‖p̃ − p̂‖_1 ≤ √(2S ln(1/δ_1)/n) },

CONF2(p̂, n) := { p̃ ∈ [0, 1]^S : ∀i, |√(p̃_i(1 − p̃_i)) − √(p̂_i(1 − p̂_i))| ≤ √(2 ln(6/δ_1)/(n − 1)),
  |p̃_i − p̂_i| ≤ min( √(ln(6/δ_1)/(2n)), √(2 p̂_i(1 − p̂_i) ln(6/δ_1)/n) + (7/(3(n − 1))) ln(6/δ_1) ) }.

The Upper Confidence Stochastic Game algorithm (UCSG) (Algorithm 1) extends UCRL2 [19], using the optimism principle to balance exploitation and exploration. It proceeds in phases (indexed by k), and only changes the learner's policy π¹_k at the beginning of each phase. The length of each phase is not fixed a priori, but depends on the statistics of past observations.

In the beginning of each phase k, the algorithm estimates the transition probabilities using empirical frequencies p̂_k(·|s, a) observed in previous phases (Step 1). With these empirical frequencies, it can then create a confidence region P_k(s, a) for each transition probability.
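To make Steps 1–2 concrete, here is a minimal sketch (our own illustration, not code from the paper; function names are hypothetical) of the empirical estimates and the L1 confidence region CONF1; the Bernstein-style region CONF2 would be intersected with it in the same way.

```python
import numpy as np

def empirical_estimates(trajectory, S, A):
    """Step 1: visit counts n_k(s,a) and empirical frequencies p_hat_k(.|s,a)
    from a list of (state, joint_action, next_state) triples."""
    n = np.zeros((S, A))
    n_next = np.zeros((S, A, S))
    for s, a, s_next in trajectory:
        n[s, a] += 1
        n_next[s, a, s_next] += 1
    n_clipped = np.maximum(n, 1)            # n_k(s,a) = max{1, count}
    p_hat = n_next / n_clipped[:, :, None]  # rows of unvisited pairs stay zero
    return n_clipped, p_hat

def conf1_radius(n, S, delta1):
    """Radius of the L1 ball CONF1: ||p_tilde - p_hat||_1 <= sqrt(2 S ln(1/delta1) / n)."""
    return np.sqrt(2 * S * np.log(1.0 / delta1) / n)

def in_conf1(p_tilde, p_hat, n, S, delta1):
    """Membership test for a candidate transition vector p_tilde."""
    return np.abs(p_tilde - p_hat).sum() <= conf1_radius(n, S, delta1)
```

For example, with two states and one joint action, a trajectory that always moves 0 → 1 and 1 → 0 yields p̂(1|0, a) = 1 and a wide early-phase radius that shrinks as 1/√n.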
The transition probabilities lying in the confidence regions constitute a set of plausible stochastic game models M_k, to which the true model M belongs with high probability (Step 2). Then, Player 1 optimistically picks one model M¹_k from M_k, and finds the optimal (stationary) policy π¹_k under this model (Step 3). Finally, Player 1 executes the policy π¹_k for a while, until some (s, a)-pair's number of occurrences is doubled during this phase (Step 4). The count v_k(s, a) records the number of steps the (s, a)-pair is observed in phase k; it is reset to zero in the beginning of every phase.

In Step 3, to pick an optimistic model and a policy is to pick M¹_k ∈ M_k and π¹_k ∈ Π_SR such that ∀s,

min_{π²} ρ(M¹_k, π¹_k, π², s) ≥ max_{M̃∈M_k} ρ*(M̃, s) − γ_k,   (2)

where γ_k denotes the error parameter for MAXIMIN-EVI. The LHS of (2) is well-defined because Player 2 has a stationary optimal policy under the MDP induced by M¹_k and π¹_k. Roughly speaking, (2) says that min_{π²} ρ(M¹_k, π¹_k, π², s) should approximate max_{M̃∈M_k, π¹} min_{π²} ρ(M̃, π¹, π², s) by an error no more than γ_k. That is, (M¹_k, π¹_k) are picked optimistically in M_k × Π_SR considering the most adversarial opponent.

4.1 Extended SG and Maximin-EVI

The calculation of M¹_k and π¹_k involves the technique of Extended Value Iteration (EVI), which also appears in [19] as a one-player version.

Consider the following SG, named M⁺. Let the state space S and Player 2's action space A² remain the same as in M.
Let A¹⁺, p⁺(·|·,·,·), r⁺(·,·,·) be Player 1's action set, the transition kernel, and the reward function of M⁺, such that for any a¹ ∈ A¹, a² ∈ A², and admissible transition probability p̃(·|s, a¹, a²) ∈ P_k(s, a¹, a²), there is an action a¹⁺ ∈ A¹⁺ such that p⁺(·|s, a¹⁺, a²) = p̃(·|s, a¹, a²) and r⁺(s, a¹⁺, a²) = r(s, a¹, a²). In other words, Player 1 selecting an action in A¹⁺ is equivalent to selecting an action in A¹ and simultaneously selecting an admissible transition probability in the confidence region P_k(·,·).

Suppose that M ∈ M_k. Then the extended SG M⁺ satisfies Assumption 2, because the true model M is embedded in M⁺. By Theorem E.1 in Appendix E, it has a constant game value ρ*(M⁺) independent of the initial state, and satisfies a Bellman equation of the form val{r + Pf} = ρ·e + f for some bounded function f(·), where e stands for the all-one constant vector. With the above conditions, we can use value iteration with the Schweitzer transform (a.k.a. aperiodic transform) [34] to solve for the optimal policy in the extended SG M⁺. We call it MAXIMIN-EVI. For the details of MAXIMIN-EVI, please refer to Appendix F. We only summarize the result with the following lemma.

Lemma 4.1. Suppose the true model M ∈ M_k. Then the estimated model M¹_k and stationary policy π¹_k output by MAXIMIN-EVI in Step 3 satisfy

∀s, min_{π²} ρ(M¹_k, π¹_k, π², s) ≥ max_{π¹} min_{π²} ρ(M, π¹, π², s) − γ_k.

Before diving into the analysis under the two assumptions, we first establish the following fact.

Lemma 4.2. With high probability, the true model M ∈ M_k for all phases k.

It is proved in Appendix D.
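As a rough illustration of the planning step, the sketch below (our own simplification, not the paper's MAXIMIN-EVI) runs relative value iteration with the aperiodicity transform on a known SG, taking the stage-game value over pure actions only; this is exact when every stage game has a pure saddle point (e.g., turn-based SGs), whereas MAXIMIN-EVI solves mixed-strategy matrix games over the extended model M⁺.

```python
import numpy as np

def stage_value_pure(G):
    """Maximin over pure actions: max_i min_j G[i, j]. Equals val(G) when the
    stage game has a pure saddle point; the general mixed-strategy val(G)
    would require solving a small linear program."""
    return G.min(axis=1).max()

def maximin_rvi(p, r, tau=0.9, iters=50, ref=0):
    """Relative value iteration for the maximin average reward.
    p has shape (S, A1, A2, S); r has shape (S, A1, A2).
    The aperiodicity (Schweitzer) transform p' = (1-tau) I + tau p leaves the
    average reward unchanged (same stationary distributions) but makes every
    induced chain aperiodic, so the iteration converges."""
    S = r.shape[0]
    eye = np.zeros_like(p)
    for s in range(S):
        eye[s, :, :, s] = 1.0
    p_ap = (1.0 - tau) * eye + tau * p
    h = np.zeros(S)
    rho = 0.0
    for _ in range(iters):
        q = p_ap @ h + r                       # shape (S, A1, A2)
        v = np.array([stage_value_pure(q[s]) for s in range(S)])
        rho, h = v[ref], v - v[ref]            # normalize at a reference state
    return rho, h
```

On a toy two-state SG where Player 1 can move to a state paying reward 1 forever, the iteration recovers the game value ρ* = 1.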
With Lemma 4.2, we can fairly assume M ∈ M_k in most of our analysis.

5 Analysis under Assumption 1

In this section, we import analysis techniques from one-player MDPs [2, 19, 22, 9]. We also develop some techniques that deal with non-stationary opponents.

We model Player 2's behavior in the most general way, i.e., assuming it uses a history-dependent randomized policy. Let H_t = (s_1, a_1, r_1, ..., s_{t−1}, a_{t−1}, r_{t−1}, s_t) ∈ ℋ_t be the history up to s_t; then we assume π²_t to be a mapping from ℋ_t to a distribution over A². We will simply write π²_t(·) and hide its dependency on H_t inside the subscript t. A similar definition applies to π¹_t(·). With abuse of notation, we denote by k(t) the phase where step t lies in, and thus our algorithm uses policy π¹_t(·) = π¹_{k(t)}(·|s_t). The notations π¹_t and π¹_k are used interchangeably. Let T_k := t_{k+1} − t_k be the length of phase k. We decompose the regret in phase k in the following way:

Λ_k := T_k ρ*(M) − ∑_{t=t_k}^{t_{k+1}−1} r(s_t, a_t) = ∑_{n=1}^4 Λ^{(n)}_k,   (3)

in which we define

Λ^{(1)}_k = T_k (ρ*(M) − min_{π²} ρ(M¹_k, π¹_k, π², s_{t_k})),
Λ^{(2)}_k = T_k (min_{π²} ρ(M¹_k, π¹_k, π², s_{t_k}) − ρ(M¹_k, π¹_k, π̄²_k, s_{t_k})),
Λ^{(3)}_k = T_k (ρ(M¹_k, π¹_k, π̄²_k, s_{t_k}) − ρ(M, π¹_k, π̄²_k)),
Λ^{(4)}_k = T_k ρ(M, π¹_k, π̄²_k) − ∑_{t=t_k}^{t_{k+1}−1} r(s_t, a_t),

where π̄²_k is some stationary policy of Player 2 which will be defined later. Since the actions of Player 2 are arbitrary, π̄²_k is imaginary and only exists in analysis.
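Written out, the four terms telescope (the intermediate averages cancel in pairs), which is exactly the decomposition (3):

```latex
\begin{aligned}
\sum_{n=1}^{4}\Lambda^{(n)}_k
&= T_k\Bigl(\rho^*(M) - \min_{\pi^2}\rho(M^1_k,\pi^1_k,\pi^2,s_{t_k})\Bigr)
 + T_k\Bigl(\min_{\pi^2}\rho(M^1_k,\pi^1_k,\pi^2,s_{t_k}) - \rho(M^1_k,\pi^1_k,\bar\pi^2_k,s_{t_k})\Bigr) \\
&\quad + T_k\Bigl(\rho(M^1_k,\pi^1_k,\bar\pi^2_k,s_{t_k}) - \rho(M,\pi^1_k,\bar\pi^2_k)\Bigr)
 + \Bigl(T_k\,\rho(M,\pi^1_k,\bar\pi^2_k) - \sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t)\Bigr) \\
&= T_k\,\rho^*(M) - \sum_{t=t_k}^{t_{k+1}-1} r(s_t,a_t) \;=\; \Lambda_k .
\end{aligned}
```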
Note that under Assumption 1, any stationary policy pair over M induces an irreducible Markov chain, so we do not need to specify the initial states for ρ(M, π¹_k, π̄²_k) in (3). Among the four terms, Λ^{(2)}_k is clearly non-positive, and Λ^{(1)}_k, by optimism, can be bounded using Lemma 4.1. Now remains to bound Λ^{(3)}_k and Λ^{(4)}_k.

5.1 Bounding ∑_k Λ^{(3)}_k and ∑_k Λ^{(4)}_k

The Introduction of π̄²_k. Λ^{(3)}_k and Λ^{(4)}_k involve the artificial policy π̄²_k, which is a stationary policy that replaces Player 2's non-stationary policy in the analysis. This replacement costs some constant regret but facilitates the use of perturbation analysis in regret bounding. The selection of π̄²_k is based on the principle that the behavior (e.g., total number of visits to some (s, a)) of the Markov chain induced by M, π¹_k, π̄²_k should be close to the empirical statistics. Intuitively, π̄²_k can be defined as

π̄²_k(a²|s) := ( ∑_{t=t_k}^{t_{k+1}−1} 1{s_t = s} π²_t(a²) ) / ( ∑_{t=t_k}^{t_{k+1}−1} 1{s_t = s} ).   (4)

Note two things, however. First, since we need the actual trajectory in defining this policy, it can only be defined after phase k has ended. Second, π̄²_k can be undefined because the denominator of (4) can be zero. However, this will not happen in too many steps. Actually, we have

Lemma 5.1. ∑_k T_k 1{π̄²_k not well-defined} ≤ Õ(DS²A) with high probability.

Before describing how we bound the regret with the help of π̄²_k and the perturbation analysis, we establish the following lemma:

Lemma 5.2.
We say the transition probability at time step t is ε-accurate if |p¹_k(s'|s_t, π_t) − p(s'|s_t, π_t)| ≤ ε ∀s', where p¹_k denotes the transition kernel of M¹_k. We let B_t(ε) = 1 if the transition probability at time t is ε-accurate; otherwise B_t(ε) = 0. Then for any state s, with high probability, ∑_{t=1}^T 1{s_t = s} 1{B_t(ε) = 0} ≤ Õ(A/ε²).

Now we are able to sketch the logic behind our proofs. Let's assume that π̄²_k models π²_k quite well, i.e., the expected frequency of every state-action pair induced by M, π¹_k, π̄²_k is close to the empirical frequency induced by M, π¹_k, π²_k. Then clearly, Λ^{(4)}_k is close to zero in expectation. The term Λ^{(3)}_k now becomes the difference of average reward between two Markov reward processes with slightly different transition probabilities. This term has a counterpart in [19] as a single-player version. Using similar analysis, we can prove that the dominant term of Λ^{(3)}_k is proportional to sp(h(M¹_k, π¹_k, π̄²_k, ·)). In the single-player case, [19] can directly claim that sp(h(M¹_k, π¹_k, ·)) ≤ D (see their Remark 8), but unfortunately, this is not the case in the two-player version.⁴

To continue, we resort to the perturbation analysis for the mean first passage times (developed in Appendix C).
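For a fixed induced chain, the mean first passage times T^π_{s→s'}(M) at the heart of this argument can be computed by solving one linear system per target state. A small illustrative sketch (our own, assuming an irreducible chain):

```python
import numpy as np

def mean_first_passage_times(P):
    """T[s, t] = expected number of steps to first reach t from s (T[t, t] = 0)
    for an irreducible chain with transition matrix P. For each target t,
    the hitting times m over the remaining states satisfy m = 1 + Q m,
    where Q is P with the row and column of t removed."""
    S = P.shape[0]
    T = np.zeros((S, S))
    for t in range(S):
        idx = [s for s in range(S) if s != t]
        Q = P[np.ix_(idx, idx)]
        m = np.linalg.solve(np.eye(S - 1) - Q, np.ones(S - 1))
        T[idx, t] = m
    return T
```

The maximum entry of this matrix plays the role of T^π(M), which upper-bounds sp(h(M, π, ·)) as noted in Section 2; e.g., for a two-state chain that leaves state 0 with probability 0.1 per step, T[0, 1] = 1/0.1 = 10.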
Lemma 5.2 shows that M¹_k will not be far from M for too many steps. Then Theorem C.9 in Appendix C tells that if M¹_k is close enough to M, T^{π¹_k, π̄²_k}(M¹_k) can be bounded by 2T^{π¹_k, π̄²_k}(M), and Assumption 1 guarantees that T^{π¹_k, π̄²_k}(M) ≤ D. As Remark M.1 implies that sp(h(M¹_k, π¹_k, π̄²_k, ·)) ≤ T^{π¹_k, π̄²_k}(M¹_k), we have sp(h(M¹_k, π¹_k, π̄²_k, ·)) ≤ 2T^{π¹_k, π̄²_k}(M) ≤ 2D.

The above approach leads to Lemma 5.3, which is a key in our analysis. We first define some notations. Under Assumption 1, any pair of stationary policies induces an irreducible Markov chain, which has a unique stationary distribution. If the policy pair π = (π¹, π²) is executed, we denote its stationary distribution by μ(M, π¹, π², ·) = μ(M, π, ·). Besides, denote v_k(s) := ∑_{t=t_k}^{t_{k+1}−1} 1{s_t = s}.

Lemma 5.3. ∑_k Λ^{(4)}_k 1{π̄²_k is well-defined} ≤ Õ(D√(ST) + DSA) with high probability.

We say a phase k is benign if the following hold true: the true model M lies in M_k, π̄²_k is well-defined, sp(h(M¹_k, π¹_k, π̄²_k, ·)) ≤ 2D, and μ(M, π¹_k, π̄²_k, s) ≤ 2v_k(s)/T_k ∀s. Finally, for benign phases, we can have the following two lemmas.

Lemma 5.4. ∑_k T_k 1{phase k is not benign} ≤ Õ(D³S⁵A) with high probability.

⁴The argument in [19] is simple: suppose that h(M¹_k, π¹_k, s) − h(M¹_k, π¹_k, s') > D; by the communicating assumption, there is a path from s' to s with expected time no more than D. Thus a policy that first goes from s' to s within D steps and then executes π¹_k will outperform π¹_k at s'. This leads to a contradiction. In two-player SGs, with a similar argument, we can also show that sp(h(M¹_k, π¹_k, π²*_k, ·)) ≤ D, where π²*_k is the best response to π¹_k under M¹_k. However, since Player 2 is uncontrollable, his/her policy π²_k (or π̄²_k) can be quite different from π²*_k, and thus sp(h(M¹_k, π¹_k, π̄²_k, ·)) ≤ D does not necessarily hold true.
Lemma 5.5. ∑_k Λ^{(3)}_k 1{phase k is benign} ≤ Õ(DS√(AT) + DS²A) with high probability.

Proof of Theorem 3.1. The regret proof starts from the decomposition (3). Λ^{(1)}_k is bounded with the help of Lemma 4.1: ∑_k Λ^{(1)}_k ≤ ∑_k T_k/√(t_k) = O(√T). ∑_k Λ^{(2)}_k ≤ 0 by definition. Then with Lemmas 5.1, 5.3, 5.4, and 5.5, we can bound ∑_k Λ^{(3)}_k and ∑_k Λ^{(4)}_k by Õ(D³S⁵A + DS√(AT)).

6 Analysis under Assumption 2

In Section 5, the main ingredient of the regret analysis lies in bounding the span of the bias vector, sp(h(M¹_k, π¹_k, π̄²_k, ·)). However, the same approach does not work here, because under the weaker Assumption 2 we do not have a bound on the mean first passage time under arbitrary policy pairs. Hence we adopt the approach of approximating the average-reward SG problem by a sequence of finite-horizon SGs. On a high level: first, with the help of Assumption 2, we approximate T times the game value of the original average-reward SG (i.e., the total reward in hindsight) by the sum of the values of H-step episodic SGs; second, we resort to [9]'s results to bound the H-step SGs' sample complexity and translate it into regret.

Approximation by repeated episodic SGs. For the approximation, the quantity H does not appear in UCSG but only in the analysis. The horizon T is divided into episodes, each of length H.
Index episodes by $i = 1, \dots, T/H$, and denote episode $i$'s first time step by $\tau_i$. We say $i \in ph(k)$ if all $H$ steps of episode $i$ lie in phase $k$. Define the $H$-step expected reward under joint policy $\pi$ with initial state $s$ as $V_H(M, \pi, s) := \mathbb{E}\left[\sum_{t=1}^{H} r_t \mid a_t \sim \pi, s_1 = s\right]$. Now we decompose the regret in phase $k$ as
$$\Delta_k := T_k \rho^* - \sum_{t=t_k}^{t_{k+1}-1} r(s_t, a_t) \le \sum_{n=1}^{6} \Delta^{(n)}_k, \qquad (5)$$
where
$$\Delta^{(1)}_k = \sum_{i \in ph(k)} H\Big(\rho^* - \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2)\Big),$$
$$\Delta^{(2)}_k = \sum_{i \in ph(k)} \Big(H \min_{\pi^2} \rho(M^1_k, \pi^1_k, \pi^2) - \min_{\pi^2} V_H(M^1_k, \pi^1_k, \pi^2, s_{\tau_i})\Big),$$
$$\Delta^{(3)}_k = \sum_{i \in ph(k)} \Big(\min_{\pi^2} V_H(M^1_k, \pi^1_k, \pi^2, s_{\tau_i}) - V_H(M^1_k, \pi^1_k, \pi^2_i, s_{\tau_i})\Big),$$
$$\Delta^{(4)}_k = \sum_{i \in ph(k)} \Big(V_H(M^1_k, \pi^1_k, \pi^2_i, s_{\tau_i}) - V_H(M, \pi^1_k, \pi^2_i, s_{\tau_i})\Big),$$
$$\Delta^{(5)}_k = \sum_{i \in ph(k)} \Big(V_H(M, \pi^1_k, \pi^2_i, s_{\tau_i}) - \sum_{t=\tau_i}^{\tau_{i+1}-1} r(s_t, a_t)\Big), \qquad \Delta^{(6)}_k = 2H.$$

Here, $\pi^2_i$ denotes Player 2's policy in episode $i$, which may be non-stationary. $\Delta^{(6)}_k$ accounts for the at most two incomplete episodes in phase $k$. $\Delta^{(1)}_k$ is related to the tolerance level we set for the MAXIMIN-EVI algorithm: $\Delta^{(1)}_k \le T_k \gamma_k = T_k/\sqrt{t_k}$. $\Delta^{(2)}_k$ is the error caused by approximating an infinite-horizon SG by a repeated episodic $H$-step SG (with possibly different initial states). $\Delta^{(3)}_k$ is clearly non-positive. It remains to bound $\Delta^{(2)}_k$, $\Delta^{(4)}_k$, and $\Delta^{(5)}_k$.

Lemma 6.1. By the Azuma-Hoeffding inequality, $\sum_k \Delta^{(5)}_k \le \tilde{\mathcal{O}}(\sqrt{HT})$ with high probability.

Lemma 6.2. Under Assumption 2, $\sum_k \Delta^{(2)}_k \le TD/H + \sum_k T_k \gamma_k$.

From sample complexity to regret bound. As the main contributor to the regret, $\Delta^{(4)}_k$ corresponds to the inaccuracy of the transition probability estimates. Here we largely reuse the results of [9], who consider a one-player episodic MDP with a fixed initial state distribution. Their main lemma states that the number of episodes in phases such that $|V_H(M^1_k, \pi_k, s_0) - V_H(M, \pi_k, s_0)| > \varepsilon$ will not exceed $\tilde{\mathcal{O}}(H^2 S^2 A/\varepsilon^2)$, where $s_0$ is their initial state in each episode. In other words, $\sum_k \frac{T_k}{H} \mathbf{1}\{|V_H(M^1_k, \pi_k, s_0) - V_H(M, \pi_k, s_0)| > \varepsilon\} = \tilde{\mathcal{O}}(H^2 S^2 A/\varepsilon^2)$. Note that their proof allows $\pi_k$ to be an arbitrarily selected non-stationary policy for phase $k$.

We can directly utilize their analysis, which we summarize as Theorem K.1 in the appendix. While their algorithm has an input $\varepsilon$, this input can be removed without affecting the bounds. This means that the PAC bound holds for arbitrarily selected $\varepsilon$. With the help of Theorem K.1, we have

Lemma 6.3. $\sum_k \Delta^{(4)}_k \le \tilde{\mathcal{O}}(S\sqrt{HAT} + HS^2 A)$ with high probability.

Proof of Theorem 3.2. With the decomposition (5) and the help of Lemmas 6.1, 6.2, and 6.3, the regret is bounded by $\tilde{\mathcal{O}}(\frac{TD}{H} + S\sqrt{HAT} + S^2 A H) = \tilde{\mathcal{O}}(\sqrt[3]{DS^2 A T^2})$ by selecting $H = \max\{D, \sqrt[3]{D^2 T/(S^2 A)}\}$.

7 Sample Complexity of Offline Training

In Section 3.1, we defined $L_\varepsilon$ to be the sample complexity of learning Player 1's maximin policy.
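As background on what a maximin policy computes: in a known zero-sum SG, classical Shapley (discounted) value iteration finds the game value by solving a one-step zero-sum matrix game at every state. The sketch below is only our illustration of that inner step; UCSG's MAXIMIN-EVI instead performs average-reward extended value iteration over a confidence set of models. The toy payoffs and the restriction to nondegenerate 2x2 stage games are simplifying assumptions of ours.

```python
import numpy as np

def matrix_game_value(G):
    """Value of a 2x2 zero-sum matrix game (row player maximizes).

    Uses a pure saddle point if one exists; otherwise the closed-form
    mixed-strategy value for nondegenerate 2x2 games."""
    maximin = max(min(row) for row in G)      # row player's pure-strategy guarantee
    minimax = min(max(col) for col in G.T)    # column player's pure-strategy guarantee
    if np.isclose(maximin, minimax):          # pure saddle point exists
        return float(maximin)
    denom = G[0, 0] + G[1, 1] - G[0, 1] - G[1, 0]   # nonzero in the nondegenerate case
    return float((G[0, 0] * G[1, 1] - G[0, 1] * G[1, 0]) / denom)

def shapley_iteration(r, p, gamma=0.9, iters=2000):
    """Discounted value iteration for a zero-sum SG (Shapley, 1953):
        V(s) <- val_{a,b}[ r(s,a,b) + gamma * sum_{s'} p(s'|s,a,b) V(s') ].
    r: payoffs, shape (S, A, B); p: transitions, shape (S, A, B, S)."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * (p @ V)               # one-step matrix game at each state
        V = np.array([matrix_game_value(Q[s]) for s in range(len(V))])
    return V

# One-state toy SG whose stage game [[2,0],[0,1]] has no pure saddle point
# (its mixed value is 2/3); the discounted game value is (2/3)/(1-gamma).
r = np.array([[[2.0, 0.0], [0.0, 1.0]]])   # shape (1, 2, 2)
p = np.ones((1, 2, 2, 1))                  # single state, all actions self-transition
V = shapley_iteration(r, p, gamma=0.9)
```

Adding a constant $c$ to every entry of a matrix game shifts its value by exactly $c$, which is why the fixed point in this one-state example is the stage-game value divided by $1-\gamma$.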
In our offline version of UCSG, in each phase $k$ we let both players select their own optimistic policies. After Player 1 has optimistically selected $\pi^1_k$, Player 2 then optimistically selects his policy $\pi^2_k$ based on the known $\pi^1_k$. Specifically, the model-policy pair $(M^2_k, \pi^2_k)$ is obtained by another extended value iteration on the extended MDP under fixed $\pi^1_k$, where Player 2's action set is extended. By setting the stopping threshold also to $\gamma_k$, we have
$$\rho(M^2_k, \pi^1_k, \pi^2_k, s) \le \min_{\tilde M \in \mathcal{M}_k} \min_{\pi^2} \rho(\tilde M, \pi^1_k, \pi^2, s) + \gamma_k \qquad (6)$$
when value iteration halts. With this selection rule, we are able to obtain the following theorems.

Theorem 7.1. Under Assumption 1, UCSG achieves $L_\varepsilon = \tilde{\mathcal{O}}(D^3 S^5 A + D^2 S^2 A/\varepsilon^2)$ w.h.p.

Theorem 7.2. Let Assumption 2 hold, and further assume that $\max_{s,s'} \max_{\pi^1 \in \Pi^{SR}} \min_{\pi^2 \in \Pi^{SR}} T^{\pi^1, \pi^2}_{s \to s'}(M) \le D$. Then UCSG achieves $L_\varepsilon = \tilde{\mathcal{O}}(DS^2 A/\varepsilon^3)$ w.h.p.

The algorithm comes with the following guarantee: if we run the offline version of UCSG for $T > L_\varepsilon$ steps, it can output a single stationary policy for Player 1 that is $\varepsilon$-optimal. We show how to output this policy in the proofs of Theorems 7.1 and 7.2.

8 Open Problems

In this work, we obtain regret bounds of $\tilde{\mathcal{O}}(D^3 S^5 A + DS\sqrt{AT})$ and $\tilde{\mathcal{O}}(\sqrt[3]{DS^2 A T^2})$ under different mixing assumptions. A natural open problem is how to improve these bounds, in both the leading terms and the constants. A lower bound can be inherited from the one-player MDP setting, which is $\Omega(\sqrt{DSAT})$ [19].

Another open problem is whether we can still learn the SG if we further weaken the assumptions to $\max_{s,s'} \min_{\pi^1} \min_{\pi^2} T^{\pi^1, \pi^2}_{s \to s'} \le D$. We have argued that with only this assumption, in general we cannot obtain sublinear regret in the online setting. However, it is still possible to obtain polynomial offline sample complexity if the two players cooperate to explore the state-action space.

Acknowledgments

We would like to thank all the anonymous reviewers who devoted their time to reviewing this work and gave us valuable feedback. We would like to give special thanks to the reviewer who reviewed this work's previous version at ICML; your detailed check of our proofs greatly improved the quality of this paper.

References
[1] Yasin Abbasi, Peter L. Bartlett, Varun Kanade, Yevgeny Seldin, and Csaba Szepesvári. Online learning in Markov decision processes with adversarially chosen transition probability distributions. In Advances in Neural Information Processing Systems, 2013.

[2] Peter Auer and Ronald Ortner. Logarithmic online regret bounds for undiscounted reinforcement learning. In Advances in Neural Information Processing Systems, 2007.

[3] Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2009.

[4] Michael Bowling and Manuela Veloso. Rational and convergent learning in stochastic games. In International Joint Conference on Artificial Intelligence, 2001.

[5] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 2002.

[6] Sébastien Bubeck and Aleksandrs Slivkins. The best of both worlds: Stochastic and adversarial bandits. In Conference on Learning Theory, 2012.

[7] Grace E. Cho and Carl D. Meyer.
Markov chain sensitivity measured by mean first passage times. Linear Algebra and Its Applications, 2000.

[8] Vincent Conitzer and Tuomas Sandholm. AWESOME: A general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents. Machine Learning, 2007.

[9] Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, 2015.

[10] Travis Dick, András György, and Csaba Szepesvári. Online learning in Markov decision processes with changing cost sequences. In Proceedings of the International Conference on Machine Learning, 2014.

[11] Eyal Even-Dar, Sham M. Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations Research, 2009.

[12] Awi Federgruen. On N-person stochastic games with denumerable state space. Advances in Applied Probability, 1978.

[13] Aurélien Garivier, Emilie Kaufmann, and Wouter M. Koolen. Maximin action identification: A new bandit framework for games. In Conference on Learning Theory, pages 1028-1050, 2016.

[14] Arie Hordijk. Dynamic programming and Markov potential theory. MC Tracts, 1974.

[15] Jeffrey J. Hunter. Generalized inverses and their application to applied probability problems. Linear Algebra and Its Applications, 1982.

[16] Jeffrey J. Hunter. Stationary distributions and mean first passage times of perturbed Markov chains. Linear Algebra and Its Applications, 2005.

[17] Garud N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257-280, 2005.

[18] Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 1994.

[19] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning.
Journal of Machine Learning Research, 2010.

[20] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, England, 2003.

[21] Michail G. Lagoudakis and Ronald Parr. Value function approximation in zero-sum Markov games. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann Publishers Inc., 2002.

[22] Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In International Conference on Algorithmic Learning Theory. Springer, 2012.

[23] Shiau Hong Lim, Huan Xu, and Shie Mannor. Reinforcement learning in robust Markov decision processes. Mathematics of Operations Research, 41(4):1325-1353, 2016.

[24] Michael L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1994.

[25] Andreas Maurer and Massimiliano Pontil. Empirical Bernstein bounds and sample variance penalization. In Conference on Learning Theory, 2009.

[26] Jean-François Mertens and Abraham Neyman. Stochastic games. International Journal of Game Theory, 1981.

[27] Gergely Neu, András Antos, András György, and Csaba Szepesvári. Online Markov decision processes under bandit feedback. In Advances in Neural Information Processing Systems, 2010.

[28] Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In AISTATS, 2012.

[29] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780-798, 2005.

[30] Julien Perolat, Bruno Scherrer, Bilal Piot, and Olivier Pietquin. Approximate dynamic programming for two-player zero-sum Markov games. In Proceedings of the International Conference on Machine Learning, 2015.

[31] H. L. Prasad, Prashanth L. A., and Shalabh Bhatnagar.
Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems. International Foundation for Autonomous Agents and Multiagent Systems, 2015.

[32] Lloyd S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 1953.

[33] Csaba Szepesvári and Michael L. Littman. Generalized Markov decision processes: Dynamic-programming and reinforcement-learning algorithms. In Proceedings of the International Conference on Machine Learning, 1996.

[34] J. van der Wal. Successive approximations for average reward Markov games. International Journal of Game Theory, 1980.

[35] Jia Yuan Yu and Shie Mannor. Arbitrarily modulated Markov decision processes. In Proceedings of the Conference on Decision and Control. IEEE, 2009.