{"title": "Reinforcement Learning to Play an Optimal Nash Equilibrium in Team Markov Games", "book": "Advances in Neural Information Processing Systems", "page_first": 1603, "page_last": 1610, "abstract": null, "full_text": "Reinforcement Learning to Play an Optimal\nNash Equilibrium in Team Markov Games\n\nXiaofeng Wang\nECE Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\n\nxiaofeng@andrew.cmu.edu\n\nTuomas Sandholm\n\nCS Department\n\nCarnegie Mellon University\n\nPittsburgh, PA 15213\nsandholm@cs.cmu.edu\n\nAbstract\n\nMultiagent learning is a key problem in AI. In the presence of multiple Nash equilibria, even agents with non-conflicting interests may not be able to learn an optimal coordination policy. The problem is exacerbated if the agents do not know the game and independently receive noisy payoffs. So, multiagent reinforcement learning involves two interrelated problems: identifying the game and learning to play. In this paper, we present optimal adaptive learning, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. We provide a convergence proof, and show that the algorithm\u2019s parameters are easy to set to meet the convergence conditions.\n\n1 Introduction\nMultiagent learning is a key problem in AI. For a decade, computer scientists have worked on extending reinforcement learning (RL) to multiagent settings [11, 15, 5, 17]. Markov games (aka. stochastic games) [16] have emerged as the prevalent model of multiagent RL. An approach called Nash-Q [9, 6, 8] has been proposed for learning the game structure and the agents\u2019 strategies (to a fixed point called Nash equilibrium where no agent can improve its expected payoff by deviating to a different strategy). Nash-Q converges if a unique Nash equilibrium exists, but generally there are multiple Nash equilibria. 
Even team Markov games (where the agents have common interests) can have multiple Nash equilibria, only some of which are optimal (that is, maximize the sum of the agents\u2019 discounted payoffs). Therefore, learning in this setting is highly nontrivial.\nA straightforward solution to this problem is to enforce a convention (social law). Boutilier proposed a tie-breaking scheme where agents choose individual actions in lexicographic order [1]. However, there are many settings where the designer is unable or unwilling to impose a convention. In these cases, agents need to learn to coordinate. Claus and Boutilier introduced fictitious play, an equilibrium selection technique in game theory, to RL. Their algorithm, joint action learner (JAL) [2], guarantees convergence to a Nash equilibrium in a team stage game. However, this equilibrium may not be optimal. The same problem prevails in other equilibrium-selection approaches in game theory such as adaptive play [18] and the evolutionary model proposed in [7].\nIn RL, the agents usually do not know the environmental model (game) up front and receive noisy payoffs. In this case, even the lexicographic approaches may not work because agents receive noisy payoffs independently and thus may never perceive a tie. Another significant problem in previous research is how a nonstationary exploration policy (required by RL) affects the convergence of equilibrium selection approaches, which have been studied under the assumption that agents either always take the best-response actions or make mistakes at a constant rate. In RL, learning to play an optimal Nash equilibrium in team Markov games has been posed as one of the important open problems [9]. 
While there have been heuristic approaches to this problem, no existing algorithm has been proposed that is guaranteed to converge to an optimal Nash equilibrium in this setting.\nIn this paper, we present optimal adaptive learning (OAL), the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game (Section 3). We prove its convergence, and show that OAL\u2019s parameters are easy to set to meet the convergence conditions (Section 4).\n2 The setting\n2.1 MDPs and reinforcement learning (RL)\nIn a Markov decision problem, there is one agent in the environment. A fully observable Markov decision problem (MDP) is a tuple $\\langle S, A, R, T \\rangle$ where $S$ is a finite state space; $A$ is the space of actions the agent can take; $R: S \\times A \\to \\mathbb{R}$ is a payoff function ($R(s,a)$ is the expected payoff for taking action $a$ in state $s$); and $T: S \\times A \\times S \\to [0,1]$ is a transition function ($T(s,a,s')$ is the probability of ending in state $s'$, given that action $a$ is taken in state $s$). An agent\u2019s deterministic policy (aka. strategy) is a mapping $\\pi$ from states to actions. We denote by $\\pi(s)$ the action that policy $\\pi$ prescribes in state $s$. The objective is to find a policy $\\pi$ that maximizes $E\\left(\\sum_{t=0}^{\\infty} \\gamma^t r_t\\right)$, where $r_t$ is the payoff at time $t$, and $\\gamma \\in [0,1)$ is a discount factor. There exists a deterministic optimal policy $\\pi^*$ [12]. The Q-function for this policy, $Q^*$, is defined by the set of equations $Q^*(s,a) = R(s,a) + \\gamma \\sum_{s' \\in S} T(s,a,s') \\max_{a' \\in A} Q^*(s',a')$. At any state $s$, the optimal policy chooses $\\arg\\max_a Q^*(s,a)$ [10].\nReinforcement learning can be viewed as a sampling method for estimating $Q^*$ when the payoff function $R$ and/or transition function $T$ are unknown. $Q^*(s,a)$ can be approximated by a function $Q_t(s,a)$ calculated from the agent\u2019s experience up to time $t$. The model-based approach uses samples to generate models of $R$ and $T$, and then iteratively computes $Q_t(s,a) = R_t(s,a) + \\gamma \\sum_{s' \\in S} T_t(s,a,s') \\max_{a' \\in A} Q_{t-1}(s',a')$.\nBased on $Q_t$, a learning policy assigns probabilities to actions at each state. If the learning policy has the \u201cGreedy in the Limit with Infinite Exploration\u201d (GLIE) property, then $Q_t$ will converge to $Q^*$ (with either a model-based or model-free approach) and the agent will converge in behavior to an optimal policy [14]. Using GLIE, every state-action pair is visited infinitely often, and in the limit the action selection is greedy with respect to the Q-function w.p.1. One common GLIE policy is Boltzmann exploration [14].\n2.2 Multiagent RL in team Markov games when the game is unknown\nA natural extension of an MDP to multiagent environments is a Markov game (aka. stochastic game) [16]. In this paper we focus on team Markov games, which are Markov games where each agent receives the same expected payoff (in the presence of noise, different agents may still receive different payoffs at a particular moment). In other words, there are no conflicts between the agents, but learning the game structure and learning to coordinate are nevertheless highly nontrivial.\nDefinition 1 A team Markov game (aka identical-interest stochastic game) $\\Gamma$ is a tuple $\\langle N, S, A, R, T \\rangle$, where $N$ is a set of $n$ agents; $S$ is a finite state space; $A = \\times_{i \\in N} A_i$ is a joint action space of $n$ agents; $R: S \\times A \\to \\mathbb{R}$ is the common expected payoff function; and $T: S \\times A \\times S \\to [0,1]$ is a transition function.\nThe objective of the $n$ agents is to find a deterministic joint policy (aka. joint strategy aka. strategy profile) $\\pi = \\{\\pi_1, \\ldots, \\pi_n\\}$ (where $\\pi: S \\to A$ and $\\pi_i: S \\to A_i$) so as to maximize the expected sum of their discounted payoffs. The Q-function, $Q^{\\pi}(s,a)$, is the expected sum of discounted payoffs given that the agents play joint action $a$ in state $s$ and follow policy $\\pi$ thereafter. The optimal Q-function $Q^*(s,a)$ is the Q-function for (each) optimal policy $\\pi^*$. So, $Q^*$ captures the game structure. The agents generally do not know $Q^*$ in advance. Sometimes, they know neither the payoff structure nor the transition probabilities.\nA joint policy $\\pi$ is a Nash equilibrium$^1$ if each individual policy is a best response to the others. That is, for all $i \\in N$, $s \\in S$ and any individual policy $\\pi_i'$, $Q^*(s, \\{\\pi_i(s), \\pi_{-i}(s)\\}) \\ge Q^*(s, \\{\\pi_i'(s), \\pi_{-i}(s)\\})$, where $\\pi_{-i}$ is the joint policy of all agents except agent $i$. (Likewise, throughout the paper, we use $-i$ to denote all agents but $i$, e.g., by $a_{-i}$ their joint action, by $A_{-i}$ their joint action set.) A Nash equilibrium is strict if the inequality above is strict. An optimal Nash equilibrium $\\pi^*$ is a Nash equilibrium that gives the agents the maximal expected sum of discounted payoffs. In team games, each optimal Nash equilibrium is an optimal joint policy (and there are no other optimal joint policies).\n$^1$ Throughout the paper, every Nash equilibrium that we discuss is also a subgame perfect Nash equilibrium. (This refinement of Nash equilibrium was first introduced in [13] for different games.)\nA joint action $a$ is optimal in state $s$ if $Q^*(s,a) \\ge Q^*(s,a')$ for all $a' \\in A$. If we treat $Q^*(s,a)$ as the payoff of joint action $a$ in state $s$, we obtain a team game in matrix form. We call such a game a state game for $s$. An optimal joint action in $s$ is an optimal Nash equilibrium of that state game. Thus, the task of optimal coordination in a team Markov game boils down to having all the agents play an optimal Nash equilibrium in state games.\nHowever, a coordination problem arises if there are multiple Nash equilibria. The 3-player game in Table 1 has three optimal Nash equilibria and six sub-optimal Nash equilibria. In this game, no existing equilibrium selection algorithm (e.g., fictitious play [3]) is guaranteed to learn to play an optimal Nash equilibrium. Furthermore, if the payoffs are only expectations over each agent\u2019s noisy payoffs and unknown to the agents before playing, even identification of these sub-optimal Nash equilibria during learning is nontrivial.\n\n   |       a1       |       a2       |       a3\n   |   b1   b2   b3 |   b1   b2   b3 |   b1   b2   b3\nc1 |   10  -20  -20 |  -20  -20    5 |  -20    5  -20\nc2 |  -20  -20    5 |  -20   10  -20 |    5  -20  -20\nc3 |  -20    5  -20 |    5  -20  -20 |  -20  -20   10\nTable 1: A three-player coordination game\n\n3 Optimal adaptive learning (OAL) algorithm\nWe first consider the case where agents know the game before playing. This enables the learning agents to construct a virtual game (VG) for each state $s$ of the team Markov game to eliminate all the strict suboptimal Nash equilibria in that state. Let $VG^*(s,a)$ be the payoff that the agents receive from the VG in state $s$ for a joint action $a$. We let $VG^*(s,a) = 1$ if $a = \\arg\\max_{a' \\in A} Q^*(s,a')$ and $VG^*(s,a) = 0$ otherwise. For example, the VG for the game in Table 1 gives payoff 1 for each optimal Nash equilibrium ($\\{a_1,b_1,c_1\\}$, $\\{a_2,b_2,c_2\\}$, and $\\{a_3,b_3,c_3\\}$), and payoff 0 to every other joint action. The VG in this example is weakly acyclic.\nDefinition 2 (Weakly acyclic game [18]) Let $G$ be an n-player game in matrix form. The best-response graph of $G$ takes each joint action $a \\in A$ as a vertex and connects two vertices $a$ and $a'$ with a directed edge $a \\to a'$ if and only if 1) $a \\ne a'$; 2) there exists exactly one agent $i$ such that $a_i'$ is a best response to $a_{-i}$ and $a_{-i}' = a_{-i}$. We say the game $G$ is weakly acyclic if in its best-response graph, from any initial vertex $a$, there exists a directed path to some vertex $a^*$ from which there is no outgoing edge.\nTo tackle the equilibrium selection problem for weakly acyclic games, Young [18] proposed a learning algorithm called adaptive play (AP), which works as follows. Let $a^t \\in A$ be a joint action played at time $t$ over an n-player game in matrix form. Fix integers $k$ and $m$ such that $1 \\le k \\le m$. When $t \\le m$, each agent $i$ randomly chooses its actions. Starting from $t = m+1$, each agent looks back at the $m$ most recent plays $h_t = (a^{t-m}, a^{t-m+1}, \\ldots, a^{t-1})$ and randomly (without replacement) selects $k$ samples from $h_t$. Let $K_t(a_{-i})$ be the number of times that a reduced joint action $a_{-i} \\in A_{-i}$ (a joint action without agent $i$\u2019s individual action) appears in the $k$ samples at $t$. Let $R_i(a)$ be agent $i$\u2019s payoff given that joint action $a$ has been played. Agent $i$ calculates its expected payoff w.r.t. its individual action $a_i$ as $EP(a_i) = \\sum_{a_{-i} \\in A_{-i}} R_i(\\{a_i, a_{-i}\\}) \\frac{K_t(a_{-i})}{k}$, and then randomly chooses an action from a set of best responses: $BR_t^i = \\{a_i \\mid a_i = \\arg\\max_{a_i' \\in A_i} EP(a_i')\\}$.\nYoung showed that AP in a weakly acyclic game converges to a strict Nash equilibrium w.p.1. Thus, AP on the VG for the game in Table 1 leads to an equilibrium with payoff 1 which is actually an optimal Nash equilibrium for the original game. Unfortunately, this does not extend to all VGs because not all VGs are weakly acyclic: in a VG without any strict Nash equilibrium, AP may not converge to the strategy profile with payoff 1.\nIn order to address more general settings, we now modify the notion of weakly acyclic game and adaptive play to accommodate weak optimal Nash equilibria.\nDefinition 3 (Weakly acyclic game w.r.t. a biased set (WAGB)): Let $D$ be a set containing some of the Nash equilibria of a game $G$ (and no other joint policies). Game $G$ is a WAGB if, from any initial vertex $a$, there exists a directed path to either a Nash equilibrium inside $D$ or a strict Nash equilibrium.\nWe can convert any VG to a WAGB by setting the biased set $D$ to include all joint policies that give payoff 1 (and no other joint policies). To solve such a game, we introduce a new learning algorithm for equilibrium selection. It enables each agent to deterministically select a best-response action once any Nash equilibrium in the biased set is attained (even if there exist several best responses when the Nash equilibrium is not strict). This is different from AP where players randomize their action selection when there are multiple best-response actions. We call our approach biased adaptive play (BAP). BAP works as follows. Let $D$ be the biased set composed of some Nash equilibria of a game in matrix form. Let $K_t$ be the set of $k$ samples drawn at time $t$, without replacement, from among the most recent $m$ joint actions. If (1) there exists a joint action $a' \\in D$ such that for all $a \\in K_t$, $a_{-i} \\subset a'$, and (2) there exists at least one joint action $a \\in K_t$ such that $a \\in D$, then agent $i$ chooses its best-response action $a_i$ such that $a_i \\subset a^{t'}$ and $t' = \\max\\{T \\mid a^T \\in K_t \\wedge a^T \\in D\\}$. That is, $a_i$ is contained in the most recent play of a Nash equilibrium inside $D$. On the other hand, if the two conditions above are not met, then agent $i$ chooses its best-response action in the same way as AP. As we will show, BAP (even with GLIE exploration) on a WAGB converges w.p.1 to either a Nash equilibrium in $D$ or a strict Nash equilibrium.\nSo far we tackled learning of coordination in team Markov games where the game structure is known. 
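As a rough illustration of the known-game case, the sketch below runs plain adaptive play (AP, not BAP) on the virtual game described for Table 1. The virtual game is encoded directly from the text: payoff 1 when all three agents choose the same index, payoff 0 otherwise. Everything else is an assumption made for the example and is not from the paper: the function names, the values m = 10 and k = 2, the number of steps, and the random seed.

```python
import itertools
import random

ACTIONS = (0, 1, 2)   # indices of each agent's three individual actions
N_AGENTS = 3


def virtual_game():
    """Virtual game for Table 1: payoff 1 for (j, j, j), payoff 0 otherwise."""
    joints = itertools.product(ACTIONS, repeat=N_AGENTS)
    return {a: 1 if len(set(a)) == 1 else 0 for a in joints}


def best_responses(agent, sample, vg):
    """Individual actions maximizing the agent's expected VG payoff against
    the other agents' play observed in `sample` (a list of joint actions)."""
    def expected(own):
        total = 0
        for joint in sample:
            trial = list(joint)
            trial[agent] = own          # keep the others' actions, swap in own
            total += vg[tuple(trial)]
        return total / len(sample)

    scores = {own: expected(own) for own in ACTIONS}
    top = max(scores.values())
    return [own for own in ACTIONS if scores[own] == top]


def adaptive_play(vg, m=10, k=2, steps=200, seed=0):
    """Run AP: m random initial plays, then each agent samples k of the last
    m plays and picks uniformly among its best responses to that sample."""
    rng = random.Random(seed)
    history = [tuple(rng.choice(ACTIONS) for _ in range(N_AGENTS))
               for _ in range(m)]
    for _ in range(steps):
        recent = history[-m:]
        joint = tuple(
            rng.choice(best_responses(i, rng.sample(recent, k), vg))
            for i in range(N_AGENTS)
        )
        history.append(joint)
    return history


if __name__ == "__main__":
    plays = adaptive_play(virtual_game())
    print("last joint action:", plays[-1])
```

In this particular virtual game, once the last m plays all equal one payoff-1 joint action, every sample an agent can draw makes repeating its own component the unique best response, so the history stays at that joint action from then on.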
Our real interests are in learning when the game is unknown. In multiagent reinforcement learning, $Q^*(s,a)$ is asymptotically approximated with $Q_t(s,a)$. Let $VG_t(s,a)$ be the virtual game w.r.t. $Q_t(s,a)$. Our question is how to construct $VG_t$ so as to assure $VG_t \\to VG^*$ w.p.1. Our method of achieving this makes use of the notion of $\\epsilon$-optimality.\nDefinition 4 Let $\\epsilon$ be a positive constant. A joint action $a$ is $\\epsilon$-optimal at state $s$ and time $t$ if $Q_t(s,a) + \\epsilon \\ge \\max_{a'} Q_t(s,a')$ for all $a' \\in A$. We denote the set of $\\epsilon$-optimal joint actions at state $s$ and time $t$ as $A^{\\epsilon}(s)$.\nThe idea is to use a decreasing $\\epsilon$-bound $\\epsilon_t$ to estimate $A^{\\epsilon_t}(s)$ at state $s$ and time $t$. All the joint actions belonging to the set are treated as optimal Nash equilibria in the virtual game $VG_t$ which give agents payoff 1. If $\\epsilon_t$ converges to zero at a rate slower than $Q_t$ converges to $Q^*$, then $VG_t \\to VG^*$ w.p.1. We make $\\epsilon_t$ proportional to a function $\\mathcal{N}(N_t) \\in [0,1]$ which decreases slowly and monotonically to zero with $N_t$, where $N_t$ is the smallest number of times that any state-action pair has been sampled so far. Now, we are ready to present the entire optimal adaptive learning (OAL) algorithm. As we will present thereafter, we craft $\\mathcal{N}(\\cdot)$ carefully using an understanding of the convergence rate of a model-based RL algorithm that is used to learn the game structure.\n\nOptimal adaptive learning algorithm (for agent $i$)\n1. Initialization: Set $t = 0$. For all $s \\in S$ and $a \\in A$, initialize the estimates $Q_t(s,a)$, $R_t(s,a)$ and $T_t(s,a,\\cdot)$, the counts $n_t(s,a)$, the virtual game $VG_t(s,a)$, and the bound $\\epsilon_t$.\n2. Learning of coordination policy: If $t \\le m$, randomly select an action; otherwise do\n(a) Update the virtual game $VG_t$ at state $s$: $VG_t(s,a) = 1$ if $a \\in A^{\\epsilon_t}(s)$, and $VG_t(s,a) = 0$ otherwise. Set $D = \\{a \\mid VG_t(s,a) = 1\\}$.\n(b) According to GLIE exploration, with the exploitation probability do\ni. Randomly select (without replacement) $k$ records from the $m$ most recent observations of others\u2019 joint actions played at state $s$.\nii. Calculate the expected payoff of individual action $a_i$ over the virtual game $VG_t(s,\\cdot)$ at current state $s$ as follows: $EP(a_i) = \\sum_{a_{-i} \\in A_{-i}} VG_t(s, \\{a_i, a_{-i}\\}) \\frac{K_t^m(s,a_{-i})}{k}$. Construct the best-response set at state $s$ and time $t$: $BR_t^i(s) = \\{a_i \\mid a_i = \\arg\\max_{a_i' \\in A_i} EP(a_i')\\}$.\niii. If conditions 1) and 2) of BAP are met, choose a best-response action with respect to the biased set $D$. Otherwise, randomly select a best-response action from $BR_t^i(s)$.\nOtherwise, randomly select an action to explore.\n3. Off-policy learning of game structure\n(a) Observe the state transition $s \\to s'$ and payoff $r$ under the joint action $a$. Do\ni. Increment the count $n_t(s,a)$.\nii. Update the payoff estimate $R_t(s,a)$ with $r$.\niii. Update the transition estimates $T_t(s,a,\\cdot)$ with the observed successor $s'$.\niv. For all $s \\in S$ and $a \\in A$ do $Q_{t+1}(s,a) = R_t(s,a) + \\gamma \\sum_{s' \\in S} T_t(s,a,s') \\max_{a' \\in A} Q_t(s',a')$.\n(b) $t = t + 1$; $N_t = \\min_{s,a} n_t(s,a)$.\n(c) Set $\\epsilon_t = C \\cdot \\mathcal{N}(N_t)$ (see Section 4.2 for the construction of $\\mathcal{N}(\\cdot)$).\n(d) Update $A^{\\epsilon_t}(s)$ for all $s \\in S$.\n\nHere, $n_t(s,a)$ is the number of times a joint action $a$ has been played in state $s$ by time $t$. $C$ is a positive constant (any value works). $K_t^m(s,a_{-i})$ is the number of times that a joint action $a_{-i}$ appears in agent $i$\u2019s $k$ samples (at time $t$) from the most recent $m$ joint actions taken in state $s$.\n4 Proof of convergence of OAL\nIn this section, we prove that OAL converges to an optimal Nash equilibrium. Throughout, we make the common RL assumptions: payoffs are bounded, and the number of states and actions is finite. The proof is organized as follows. In Section 4.1 we show that OAL agents learn optimal coordination if the game is known. Specifically, we show that BAP against a WAGB with known game structure converges to a Nash equilibrium under GLIE exploration. Then in Section 4.2 we show that OAL agents will learn the game structure. Specifically, any virtual game can be converted to a WAGB which will be learned surely. Finally, these two tracks merge in Section 4.3 which shows that OAL agents will learn the game structure and optimal coordination. Due to limited space, we omit most proofs. They can be found at: www.cs.cmu.edu/~sandholm/oal.ps.\n4.1 Learning to coordinate in a known game\nIn this section, we first model our biased adaptive play (BAP) algorithm with best-response action selection as a stationary Markov chain. In the second half of this section we then model BAP with GLIE exploration as a nonstationary Markov chain.\n4.1.1 BAP as a stationary Markov chain\nConsider BAP with randomly selected initial $m$ plays. We take the initial $h = (a^1, a^2, \\ldots, a^m)$ as the initial state of the Markov chain. The definition of the other states is inductive: A successor of state $h$ is any state $h'$ obtained by deleting the left-most element of $h$ and appending a new right-most element. The only exception is that all the states $h = (a, a, \\ldots, a)$ with $a$ being either a member of the biased set $D$ or a strict Nash equilibrium are grouped into a unique terminal state. Any state directing to $h = (a, a, \\ldots, a)$ is treated as directly connected to the terminal state.\nLet $P$ be the state transition matrix of the above Markov chain. Let $h'$ be a successor of $h$ and let $a' = \\{a_1', a_2', \\ldots, a_n'\\}$ ($n$ players) be the new element that was appended to the right of $h$ to get $h'$. Let $p_{hh'}$ be the transition probability from $h$ to $h'$. Now, $p_{hh'} > 0$ if and only if for each agent $i$, there exists a sample of size $k$ in $h$ to which $a_i'$ is $i$\u2019s best response according to the action-selection rule of BAP. Because agent $i$ chooses such a sample with a probability independent of time $t$, the Markov chain is stationary. Finally, due to our clustering of multiple states into a terminal state, for any state $h$ connected to the terminal state, the probability of moving from $h$ to the terminal state is $\\sum_{h'} p_{hh'}$, summed over the clustered states $h'$.\nIn the above model, once the system reaches the terminal state, each agent\u2019s best response is to repeat its most recent action. 
This is straightforward if in the actual terminal state $h = (a, a, \\ldots, a)$ (which is one of the states that were clustered to form the terminal state), $a$ is a strict Nash equilibrium. If $a$ is only a weak Nash equilibrium (in this case, $a \\in D$), BAP biases each agent to choose its most recent action because conditions (1) and (2) of BAP are satisfied. Therefore, the terminal state is an absorbing state of the finite Markov chain.$^2$ On the other hand, the above analysis shows that the terminal state essentially is composed of multiple absorbing states. Therefore, if agents come into the terminal state, they will be stuck in a particular state in it forever instead of cycling around multiple states in it.\nTheorem 1 Let G be a weakly acyclic game w.r.t. a biased set D. Let L(a) be the length of the shortest directed path in the best-response graph of G from a joint action a to either an absorbing vertex or a vertex in D, and let $L_G = \\max_a L(a)$. If $m \\ge k(L_G + 2)$, then, w.p.1, biased adaptive play in G converges to either a strict Nash equilibrium or a Nash equilibrium in D.\nTheorem 1 says that the stationary Markov chain for BAP in a WAGB (given $m \\ge k(L_G + 2)$) has a unique stationary distribution in which only the terminal state appears.\n4.1.2 BAP with GLIE exploration as a nonstationary Markov chain\nWithout knowing the game structure, the learners need to use exploration to estimate their payoffs. In this section we show that such exploration does not hurt the convergence of BAP. We show this by first modeling BAP with GLIE exploration as a nonstationary Markov chain.\nWith GLIE exploration, at every time step $t$, each joint action occurs with positive probability. This means that the system transitions from the state it is in to any of the successor states with positive probability. 
On the other hand, the agents\u2019 action-selection becomes increasingly greedy over time. In the limit, with probability one, the transition probabilities converge to those of BAP with no exploration. Therefore, we can model the learning process with a sequence of transition matrices $\\{P_t\\}_{t=1}^{\\infty}$ such that $\\lim_{t \\to \\infty} P_t = P$, where $P$ is the transition matrix of the stationary Markov chain describing BAP without exploration.\n$^2$ Akin to how we modeled BAP as a stationary Markov chain above, Young modeled adaptive play (AP) as a stationary Markov chain [18]. There are two differences. First, unlike AP\u2019s, BAP\u2019s action selection is biased. Second, in Young\u2019s model, it is possible to have several absorbing states while in our model, at most one absorbing state exists (for any team game, our model has exactly one absorbing state). This is because we cluster all the absorbing states into one. This allows us to prove our main convergence theorem.\nOur objective here is to show that on a WAGB, BAP with GLIE exploration will converge to the (\u201cclustered\u201d) terminal state. For that, we use the following lemma (which is a combination of Theorems V4.4 and V4.5 from [4]).\nLemma 2 Let $P$ be the finite transition matrix of a stationary Markov chain with a unique stationary distribution $\\mu$. Let $\\{P_t\\}_{t=1}^{\\infty}$ be a sequence of finite transition matrices. Let $f^{(0)}$ be a probability vector and denote $f^{(j,l)} = f^{(0)} P_{j+1} P_{j+2} \\cdots P_{j+l}$. If $\\lim_{t \\to \\infty} P_t = P$, then $\\lim_{l \\to \\infty} f^{(j,l)} = \\mu$ for all $j \\ge 0$.\nUsing this lemma and Theorem 1, we can prove the following theorem.\nTheorem 3 (BAP with GLIE) On a WAGB G, w.p.1, BAP with GLIE exploration (and $m \\ge k(L_G + 2)$) converges to either a strict Nash equilibrium or a Nash equilibrium in D.\n4.2 Learning the virtual game\nSo far, we have shown that if the game structure is known in a WAGB, then BAP will converge to the terminal state. To prove optimal convergence of the OAL algorithm, we need to further demonstrate that 1) every virtual game is a WAGB, and 2) in OAL, the \u201ctemporary\u201d virtual game $VG_t$ will converge to the \u201ccorrect\u201d virtual game $VG^*$ w.p.1.\nThe first of these two issues is handled by the following lemma:\nLemma 4 The virtual game VG of any n-player team state game is a weakly acyclic game w.r.t. a biased set that contains all the optimal Nash equilibria, and no other joint actions. (By the definition of a virtual game, there are no strict Nash equilibria other than optimal ones.) The length of the shortest best-response path $L_{VG} \\le n$.\nLemma 4 implies that BAP in a known virtual game with GLIE exploration will converge to an optimal Nash equilibrium. This is because (by Theorem 3) BAP in a WAGB will converge to either a Nash equilibrium in a biased set $D$ or a strict Nash equilibrium, and (by Lemma 4) any virtual game is a WAGB with all such Nash equilibria being optimal.\nThe following two lemmas are the last link of our proof chain. 
They show that OAL will cause agents to obtain the correct virtual game almost surely.\nLemma 5 In any team Markov game, (part 3 of) OAL assures that as $t \\to \\infty$, [...] for some constant [...] w.p.1.\nUsing Lemma 5, the following lemma is easy to prove.\nLemma 6 Consider any team Markov game. Let $E_t$ be the event that for all $t' \\ge t$, $VG_{t'} = VG^*$ in the OAL algorithm in a given state. If 1) $\\epsilon_t$ decreases monotonically to zero ($\\lim_{t \\to \\infty} \\epsilon_t = 0$), and 2) [...], then $\\lim_{t \\to \\infty} P(E_t) = 1$.\nLemma 6 states that if the criterion for including a joint action among the $\\epsilon$-optimal joint actions in OAL is not made strict too quickly (quicker than the iterated logarithm), then agents will identify all optimal joint actions with probability one. In this case, they set up the correct virtual game. It is easy to make OAL satisfy this condition. E.g., any function [...] will do.\n\n4.3 Main convergence theorem\nNow we are ready to prove that OAL converges to an optimal Nash equilibrium in any team Markov game, even when the game structure is unknown. The idea is to show that the OAL agents learn the game structure (VGs) and the optimal coordination policy (over these VGs). OAL tackles these two learning problems simultaneously; specifically, it interleaves BAP (with GLIE exploration) with learning of game structure. However, the convergence proof does not make use of this fact. 
Instead, the proof proceeds by showing that the VGs are learned first, and coordination second (the learning algorithm does not even itself know when the switch occurs, but it does occur w.p.1).\nTheorem 7 (Optimal convergence) In any team Markov game among $n$ agents, if (1) [...], and (2) $\\epsilon_t$ satisfies Lemma 6, then the OAL algorithm converges to an optimal Nash equilibrium w.p.1.\nProof. According to [1], a team Markov game can be decomposed into a sequence of state games. The optimal equilibria of these state games form the optimal policy $\\pi^*$ for the game. By the definition of GLIE exploration, each state in the finite state space will be visited infinitely often w.p.1. Thus, it is sufficient to only prove that the OAL algorithm will converge to the optimal policy over individual state games w.p.1.\nLet $E_t$ be the event that $VG_{t'} = VG^*$ at that state for all $t' \\ge t$. Let $\\epsilon_1$ be any positive constant. If Condition (2) of the theorem is satisfied, by Lemma 6 there exists a time $T_1(\\epsilon_1)$ such that $P(E_t) > 1 - \\epsilon_1$ if $t > T_1(\\epsilon_1)$.\nIf $E_t$ occurs and Condition (1) of the theorem is satisfied, by Theorem 3, OAL will converge to either a strict Nash equilibrium or a Nash equilibrium in the biased set w.p.1. Furthermore, by Lemma 4, we know that the biased set contains all of the optimal Nash equilibria (and nothing else), and there are no strict Nash equilibria outside the biased set. Therefore, if $E_t$ occurs, then OAL converges to an optimal Nash equilibrium w.p.1. Let $\\epsilon_2$ be any positive constant, and let $F_t$ be the event that the agents play an optimal joint action at a given state for all $t' \\ge t$. With this notation, we can reword the previous sentence: there exists a time $T_2(\\epsilon_2, t)$ such that if $t' > T_2(\\epsilon_2, t)$, then $P(F_{t'} \\mid E_t) > 1 - \\epsilon_2$.\nPut together, there exists a time $T_3(\\epsilon_1, \\epsilon_2)$ such that if $t > T_3(\\epsilon_1, \\epsilon_2)$, then $P(F_t) > (1 - \\epsilon_1)(1 - \\epsilon_2) > 1 - \\epsilon_1 - \\epsilon_2$. Because $\\epsilon_1$ and $\\epsilon_2$ are only used in the proof (they are not parameters of the OAL algorithm), we can choose them to be arbitrarily small. Therefore, OAL converges to an optimal Nash equilibrium w.p.1.\n\n5 Conclusions and future research\nWith multiple Nash equilibria, multiagent RL becomes difficult even when agents do not have conflicting interests. In this paper, we present OAL, the first algorithm that converges to an optimal Nash equilibrium with probability 1 in any team Markov game. In future work, we will consider extending the algorithm to some general-sum Markov games.\n\nAcknowledgments\nWang is supported by NSF grant IIS-0118767, the DARPA OASIS program, and the PASIS project at CMU. Sandholm is supported by NSF CAREER Award IRI-9703122, and NSF grants IIS-9800994, ITR IIS-0081246, and ITR IIS-0121678.\n\nReferences\n[1] C. Boutilier. Planning, learning and coordination in multi-agent decision processes. In TARK, 1996.\n[2] C. Claus and C. Boutilier. 
The dynamics of reinforcement learning in cooperative multi-agent systems. In AAAI, 1998.\n[3] D. Fudenberg and D. K. Levine. The theory of learning in games. MIT Press, 1998.\n[4] D. L. Isaacson and R. W. Madsen. Markov chains: theory and applications. John Wiley and Sons, Inc, 1976.\n[5] G. Wei\u00df. Learning to coordinate actions in multi-agent systems. In IJCAI, 1993.\n[6] J. Hu and M. P. Wellman. Multiagent reinforcement learning: theoretical framework and an algorithm. In ICML, 1998.\n[7] M. Kandori, G. J. Mailath, and R. Rob. Learning, mutation, and long run equilibria in games. Econometrica, 61(1):29\u201356, 1993.\n[8] M. Littman. Friend-or-Foe Q-learning in general-sum games. In ICML, 2001.\n[9] M. L. Littman. Value-function reinforcement learning in Markov games. J. of Cognitive System Research, 2:55\u201366, 2000.\n[10] M. L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley, 1994.\n[11] M. Tan. Multi-agent reinforcement learning: independent vs. cooperative agents. In ICML, 1993.\n[12] R. A. Howard. Dynamic programming and Markov processes. MIT Press, 1960.\n[13] R. Selten. Spieltheoretische Behandlung eines Oligopolmodells mit Nachfragetr\u00e4gheit. Zeitschrift f\u00fcr die gesamte Staatswissenschaft, 12:301\u2013324, 1965.\n[14] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for single-step on-policy reinforcement learning algorithms. Machine Learning, 2000.\n[15] S. Sen, M. Sekaran, and J. Hale. Learning to coordinate without sharing information. In AAAI, 1994.\n[16] F. Thuijsman. Optimality and equilibrium in stochastic games. Centrum voor Wiskunde en Informatica, 1992.\n[17] T. Sandholm and R. Crites. Learning in the iterated prisoner\u2019s dilemma. Biosystems, 37:147\u2013166, 1995.\n[18] H. Young. The evolution of conventions. 
Econometrica, 61(1):57\u201384, 1993.\n\nFootnote to the proof of Theorem 7: Theorem 3 requires [...]. If Condition (1) of our main theorem is satisfied ([...]), then by Lemma 4, we do have [...].", "award": [], "sourceid": 2171, "authors": [{"given_name": "Xiaofeng", "family_name": "Wang", "institution": null}, {"given_name": "Tuomas", "family_name": "Sandholm", "institution": null}]}