{"title": "Bandit Learning in Concave N-Person Games", "book": "Advances in Neural Information Processing Systems", "page_first": 5661, "page_last": 5671, "abstract": "This paper examines the long-run behavior of learning with bandit feedback in non-cooperative concave games. The bandit framework accounts for extremely low-information environments where the agents may not even know they are playing a game; as such, the agents\u2019 most sensible choice in this setting would be to employ a no-regret learning algorithm. In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.", "full_text": "Bandit Learning in Concave N-Person Games\n\nMario Bravo\n\nUniversidad de Santiago de Chile\n\nDepartamento de Matem\u00e1tica y Ciencia de la Computaci\u00f3n\n\nmario.bravo.g@usach.cl\n\nDavid Leslie\n\nLancaster University & PROWLER.io\n\nd.leslie@lancaster.ac.uk\n\nPanayotis Mertikopoulos\n\nUniv. Grenoble Alpes, CNRS, Inria, Grenoble INP\n\nLIG 38000 Grenoble, France.\n\npanayotis.mertikopoulos@imag.fr\n\nAbstract\n\nThis paper examines the long-run behavior of learning with bandit feedback in\nnon-cooperative concave games. The bandit framework accounts for extremely low-\ninformation environments where the agents may not even know they are playing a\ngame; as such, the agents\u2019 most sensible choice in this setting would be to employ\na no-regret learning algorithm. 
In general, this does not mean that the players' behavior stabilizes in the long run: no-regret learning may lead to cycles, even with perfect gradient information. However, if a standard monotonicity condition is satisfied, our analysis shows that no-regret learning based on mirror descent with bandit feedback converges to Nash equilibrium with probability 1. We also derive an upper bound for the convergence rate of the process that nearly matches the best attainable rate for single-agent bandit stochastic optimization.

1 Introduction

The bane of decision-making in an unknown environment is regret: no one wants to realize in hindsight that the decision policy they employed was strictly inferior to a plain policy prescribing the same action throughout. For obvious reasons, this issue becomes considerably more intricate when the decision-maker is subject to situational uncertainty and the "fog of war": when the only information at the optimizer's disposal is the reward obtained from a given action (the so-called "bandit" framework), is it even possible to design a no-regret policy? Especially in the context of online convex optimization (repeated decision problems with continuous action sets and convex costs), this problem becomes even more challenging because the decision-maker typically needs to infer gradient information from the observation of a single scalar. Nonetheless, despite this extra degree of difficulty, this question has been shown to admit a positive answer: regret minimization is possible, even with bandit feedback (Flaxman et al., 2005; Kleinberg, 2004).

In this paper, we consider a multi-agent extension of this framework where, at each stage n = 1, 2, . . . , of a repeated decision process, the reward of an agent is determined by the actions of all agents via a fixed mechanism: a non-cooperative N-person game.
In general, the agents – or players – might be completely oblivious to this mechanism, perhaps even ignoring its existence: for instance, when choosing how much to bid for a good in an online auction, an agent is typically unaware of who the other bidders are, what their specific valuations are, etc. Hence, lacking any knowledge about the game, it is only natural to assume that agents will at least seek to achieve a minimal worst-case guarantee and minimize their regret. As a result, a fundamental question that arises is a) whether the agents' sequence of actions stabilizes to a rationally admissible state under no-regret learning; and b) if it does, whether convergence is affected by the information available to the agents.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Related work. In finite games, no-regret learning guarantees that the players' time-averaged, empirical frequency of play converges to the game's set of coarse correlated equilibria (CCE), and the rate of this convergence is O(1/n) for (λ, µ)-smooth games (Foster et al., 2016; Syrgkanis et al., 2015). In general, however, this set might contain highly subpar, rationally inadmissible strategies: for instance, Viossat and Zapechelnyuk (2013) provide examples of CCE that assign positive selection probability only to strictly dominated strategies. In the class of potential games, Cohen et al. (2017) recently showed that the actual sequence of play (i.e., the sequence of actions that determine the agents' rewards at each stage) converges under no-regret learning, even with bandit feedback.
Outside this class, however, the players' chosen actions may cycle in perpetuity, even in simple, two-player zero-sum games with full information (Mertikopoulos et al., 2018a,b); in fact, depending on the parameters of the players' learning process, agents could even exhibit a fully unpredictable, aperiodic and chaotic behavior (Palaiopanos et al., 2017). As such, without further assumptions in place, no-regret learning in a multi-agent setting does not necessarily imply convergence to a unilaterally stable, equilibrium state.

In the broader context of games with continuous action sets (the focal point of this paper), the long-run behavior of no-regret learning is significantly more challenging to analyze. In the case of mixed-strategy learning, Perkins and Leslie (2014) and Perkins et al. (2017) showed that mixed-strategy learning based on stochastic fictitious play converges to an ε-perturbed Nash equilibrium in potential games (but may lead to as much as O(εn) regret in the process). More relevant for our purposes is the analysis of Nesterov (2009), who showed that the time-averaged sequence of play induced by a no-regret dual averaging (DA) process with noisy gradient feedback converges to Nash equilibrium in monotone games (a class which, in turn, contains all concave potential games).

The closest antecedent to our approach is the recent work of Mertikopoulos and Zhou (2018), who showed that the actual sequence of play generated by dual averaging converges to Nash equilibrium in the class of variationally stable games (which includes all monotone games). To do so, the authors first showed that a naturally associated continuous-time dynamical system converges, and then used the so-called asymptotic pseudotrajectory (APT) framework of Benaïm (1999) to translate this result to discrete time. Similar APT techniques were also used in a very recent preprint by Bervoets et al.
(2018) to establish the convergence of a payoff-based learning algorithm in two classes of\none-dimensional concave games: games with strategic complements, and ordinal potential games\nwith isolated equilibria. The algorithm of Bervoets et al. (2018) can be seen as a special case of\nmirror descent coupled with a two-point gradient estimation process, suggesting several interesting\nlinks with our paper.\n\nOur contributions.\nIn this paper, we drop all feedback assumptions and we focus on the bandit\nframework where the only information at the players\u2019 disposal is the payoffs they receive at each\nstage. As we discussed above, this lack of information complicates matters considerably because\nplayers must now estimate their payoff gradients from their observed rewards. What makes matters\neven worse is that an agent may introduce a signi\ufb01cant bias in the (concurrent) estimation process of\nanother, so traditional, multiple-point estimation techniques for derivative-free optimization cannot\nbe applied (at least, not without signi\ufb01cant communication overhead between players).\nTo do away with player coordination requirements, we focus on learning processes which could\nbe sensibly deployed in a single-agent setting and we show that, in monotone games, the sequence\nof play induced by a wide class of no-regret learning policies converges to Nash equilibrium with\nprobability 1. 
Furthermore, by specializing to the class of strongly monotone games, we show that the rate of convergence is O(n−1/3), i.e., it is nearly optimal with respect to the attainable O(n−1/2) rate for bandit, single-agent stochastic optimization with strongly convex and smooth objectives (Agarwal et al., 2010; Shamir, 2013).

We are not aware of a similar Nash equilibrium convergence result for concave games with general convex action spaces and bandit feedback: the analysis of Mertikopoulos and Zhou (2018) requires first-order feedback, while the analysis of Bervoets et al. (2018) only applies to one-dimensional games. We find this outcome particularly appealing for practical applications of game theory (e.g., in network routing) because it shows that in a wide class of (possibly very complicated) nonlinear games, the Nash equilibrium prediction does not require full rationality, common knowledge of rationality, flawless execution, or even the knowledge that a game is being played: a commonly-used, individual no-regret algorithm suffices.

2 Problem setup and preliminaries

Concave games. Throughout this paper, we will focus on games with a finite number of players i ∈ N = {1, . . . , N} and continuous action sets. During play, every player i ∈ N selects an action xi from a compact convex subset Xi of a di-dimensional normed space Vi; subsequently, based on each player's individual objective and the action profile x = (xi; x−i) ≡ (x1, . . . , xN) of all players' actions, every player receives a reward, and the process repeats. In more detail, writing X ≡ ∏i Xi for the game's action space, we assume that each player's reward is determined by an associated payoff (or utility) function ui : X → R. Since players are not assumed to "know the game" (or even that they are involved in one), these payoff functions might be a priori unknown, especially with respect to the dependence on the actions of other players. Our only structural assumption for ui will be that ui(xi; x−i) is concave in xi for all x−i ∈ X−i ≡ ∏j≠i Xj, i ∈ N.

With all this in hand, a concave game will be a tuple G ≡ G(N, X, u) with players, action spaces and payoffs defined as above. Below, we briefly discuss some examples thereof:

Example 2.1 (Cournot competition). In the standard Cournot oligopoly model, there is a finite set of firms indexed by i = 1, . . . , N, each supplying the market with a quantity xi ∈ [0, Ci] of some good (or service), up to the firm's production capacity Ci. By the law of supply and demand, the good is priced as a decreasing function P(xtot) of the total amount xtot = ∑i xi supplied to the market, typically following a linear model of the form P(xtot) = a − b xtot for positive constants a, b > 0. The utility of firm i is then given by

ui(xi; x−i) = xi P(xtot) − ci xi,   (2.1)

i.e., it comprises the total revenue from producing xi units of the good in question minus the associated production cost (in the above, ci > 0 represents the marginal production cost of firm i).

Example 2.2 (Resource allocation auctions). Consider a service provider with a number of splittable resources s ∈ S = {1, . . . , S} (bandwidth, server time, GPU cores, etc.). These resources can be leased to a set of N bidders (players) who can place monetary bids xis ≥ 0 for the utilization of each resource s ∈ S up to each player's total budget bi, i.e., ∑s∈S xis ≤ bi. Once all bids are in, resources are allocated proportionally to each player's bid, i.e., the i-th player gets ρis = qs xis / (cs + ∑j∈N xjs) units of the s-th resource (where qs denotes the available units of said resource and cs ≥ 0 is the "entry barrier" for bidding on it). A simple model for the utility of player i is then given by

ui(xi; x−i) = ∑s∈S [gi ρis − xis],   (2.2)

with gi denoting the marginal gain of player i from acquiring a unit slice of resources.

For more examples of monotone games, see Scutari et al. (2010), D'Oro et al. (2015), Mertikopoulos and Belmega (2016), and references therein.

Nash equilibrium and monotone games. The most widely used solution concept for non-cooperative games is that of a Nash equilibrium (NE), defined here as any action profile x* ∈ X that is resilient to unilateral deviations, viz.

ui(x*i; x*−i) ≥ ui(xi; x*−i)   for all xi ∈ Xi, i ∈ N.   (NE)

By the classical existence theorem of Debreu (1952), every concave game admits a Nash equilibrium. Moreover, thanks to the individual concavity of the game's payoff functions, Nash equilibria can also be characterized via the first-order optimality condition

⟨vi(x*), xi − x*i⟩ ≤ 0   for all xi ∈ Xi,   (2.3)

where vi(x) denotes the individual payoff gradient of the i-th player, i.e.,

vi(x) = ∇i ui(xi; x−i),   (2.4)

with ∇i denoting differentiation with respect to xi.¹ In terms of regularity, it will be convenient to assume that each vi is Lipschitz continuous; to streamline our presentation, this will be our standing assumption in what follows.

¹We adopt here the standard convention of treating vi(x) as an element of the dual space Yi ≡ Vi* of Vi, with ⟨yi, xi⟩ denoting the duality pairing between yi ∈ Yi and xi ∈ Xi ⊆ Vi.

Starting with the seminal work of Rosen (1965), much of the literature on continuous games and their applications has focused on games that
satisfy a condition known as diagonal strict concavity (DSC). In its simplest form, this condition posits that there exist positive constants λi > 0 such that

∑i∈N λi ⟨vi(x′) − vi(x), x′i − xi⟩ < 0   for all x, x′ ∈ X, x ≠ x′.   (DSC)

Owing to the formal similarity between (DSC) and the various operator monotonicity conditions in optimization (see e.g., Bauschke and Combettes, 2017), games that satisfy (DSC) are commonly referred to as (strictly) monotone. As was shown by Rosen (1965, Theorem 2), monotone games admit a unique Nash equilibrium x* ∈ X, which, in view of (DSC) and (NE), is also the unique solution of the (weighted) variational inequality

∑i∈N λi ⟨vi(x), xi − x*i⟩ < 0   for all x ≠ x*.   (VI)

This property of Nash equilibria of monotone games will play a crucial role in our analysis and we will use it freely in the rest of our paper.

In terms of applications, monotonicity gives rise to a very rich class of games. As we show in the paper's supplement, Examples 2.1 and 2.2 both satisfy diagonal strict concavity (with a nontrivial choice of weights for the latter), as do atomic splittable congestion games in networks with parallel links (Orda et al., 1993; Sorin and Wan, 2016), multi-user covariance matrix optimization problems in multiple-input and multiple-output (MIMO) systems (Mertikopoulos et al., 2017), and many other problems where online decision-making is the norm. Namely, the class of monotone games contains all strictly convex-concave zero-sum games and all games that admit a (strictly) concave potential, i.e., a function f : X → R such that vi(x) = ∇i f(x) for all x ∈ X, i ∈ N.
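To make the monotonicity condition concrete, the following minimal numerical sketch (with illustrative constants that are not taken from the paper) instantiates the linear Cournot model of Example 2.1 and spot-checks (DSC) with unit weights λi = 1:

```python
import numpy as np

def cournot_grad(x, a=10.0, b=1.0, c=None):
    """Individual payoff gradients v_i(x) for the linear Cournot model of
    Example 2.1: v_i(x) = P(x_tot) + x_i * P'(x_tot) - c_i = a - b*x_tot - b*x_i - c_i."""
    c = np.full_like(x, 1.0) if c is None else c
    return a - b * x.sum() - b * x - c

def dsc_gap(x, xp, lam=None):
    """Left-hand side of (DSC): sum_i lam_i * <v_i(x') - v_i(x), x'_i - x_i>."""
    lam = np.ones_like(x) if lam is None else lam
    return float(np.sum(lam * (cournot_grad(xp) - cournot_grad(x)) * (xp - x)))

# Spot-check (DSC) on random pairs of action profiles: the gap must be negative.
rng = np.random.default_rng(0)
for _ in range(1000):
    x, xp = rng.uniform(0, 5, size=3), rng.uniform(0, 5, size=3)
    assert dsc_gap(x, xp) < 0
```

For this linear model the check is in fact an identity: writing Δi = x′i − xi, the gap equals −b(∑i Δi)² − b ∑i Δi², which is strictly negative whenever x′ ≠ x, so the Cournot game is monotone with λi = 1.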
In view of all this (and unless explicitly stated otherwise), we will focus throughout on monotone games; for completeness, we also include in the supplement a straightforward second-order test for monotonicity.

3 Regularized no-regret learning

We now turn to the learning methods that players could employ to increase their individual rewards in an online manner. Building on Zinkevich's (2003) online gradient descent policy, the most widely used algorithmic schemes for no-regret learning in the context of online convex optimization invariably revolve around the idea of regularization. To name but the most well-known paradigms, "following the regularized leader" (FTRL) explicitly relies on best-responding to a regularized aggregate of the reward functions revealed up to a given stage, while online mirror descent (OMD) and its variants use a linear surrogate thereof. All these no-regret policies fall under the general umbrella of "regularized learning" and their origins can be traced back to the seminal mirror descent (MD) algorithm of Nemirovski and Yudin (1983).²

The basic idea of mirror descent is to generate a new feasible point x+ by taking a so-called "mirror step" from a starting point x along the direction of an "approximate gradient" vector y (which we treat here as an element of the dual space Y ≡ ∏i Yi of V ≡ ∏i Vi).³ To do so, let hi : Xi → R be a continuous and Ki-strongly convex distance-generating (or regularizer) function, i.e.,

hi(t xi + (1 − t) x′i) ≤ t hi(xi) + (1 − t) hi(x′i) − ½ Ki t(1 − t) ‖x′i − xi‖²,   (3.1)

for all xi, x′i ∈ Xi and all t ∈ [0, 1]. In terms of smoothness (and in a slight abuse of notation), we also assume that the subdifferential of hi admits a continuous selection, i.e., a continuous function ∇hi : dom ∂hi → Yi such that ∇hi(xi) ∈ ∂hi(xi) for all xi ∈ dom ∂hi.⁴ Then, letting h(x) = ∑i hi(xi) for x ∈ X (so h is strongly convex with modulus K = mini Ki), we get a pseudo-distance on X via the relation

D(p, x) = h(p) − h(x) − ⟨∇h(x), p − x⟩,   (3.2)

for all p ∈ X, x ∈ dom ∂h.

²In a utility maximization setting, mirror descent should be called mirror ascent because players seek to maximize their rewards (as opposed to minimizing their losses). Nonetheless, we keep the term "descent" throughout because, despite the role reversal, it is the standard name associated with the method.
³For concreteness (and in a slight abuse of notation), we assume in what follows that V is equipped with the product norm ‖x‖² = ∑i ‖xi‖² and Y with the dual norm ‖y‖* = max{⟨y, x⟩ : ‖x‖ ≤ 1}.
⁴Recall here that the subdifferential of hi at xi ∈ Xi is defined as ∂hi(xi) ≡ {yi ∈ Yi : hi(x′i) ≥ hi(xi) + ⟨yi, x′i − xi⟩ for all x′i ∈ Vi}, with the standard convention that hi(xi) = +∞ if xi ∈ Vi \ Xi. By standard results, the domain of subdifferentiability dom ∂hi ≡ {xi ∈ Xi : ∂hi(xi) ≠ ∅} of hi satisfies X°i ⊆ dom ∂hi ⊆ Xi.

This pseudo-distance is known as the Bregman divergence and we have D(p, x) ≥ 0 with equality if and only if x = p; on the other hand, D may fail to be symmetric and/or satisfy the triangle inequality so, in general, it is not a bona fide distance function on X. Nevertheless, we also have D(p, x) ≥ ½ K ‖x − p‖² (see the paper's supplement), so the convergence of a sequence Xn to p can be checked by showing that D(p, Xn) → 0. For technical reasons, it will be convenient to also assume the converse, i.e., that D(p, Xn) → 0 when Xn → p. This condition is known in the literature as "Bregman reciprocity" (Chen and Teboulle, 1993), and it will be our blanket assumption in what follows (note that it is trivially satisfied by Examples 3.1 and 3.2 below).

Now, as with true Euclidean distances, D(p, x) induces a prox-mapping given by

Px(y) = arg min over x′ ∈ X of {⟨y, x − x′⟩ + D(x′, x)}   (3.3)

for all x ∈ dom ∂h and all y ∈ Y. Just like its Euclidean counterpart below, the prox-mapping (3.3) starts with a point x ∈ dom ∂h and steps along the dual (gradient-like) vector y ∈ Y to produce a new feasible point x+ = Px(y).
Standard examples of this process are:

Example 3.1 (Euclidean projections). Let h(x) = ½ ‖x‖² denote the squared Euclidean norm. Then, the induced prox-mapping is

Px(y) = Π(x + y),   (3.4)

with Π(x) = arg min over x′ ∈ X of ‖x′ − x‖² denoting the standard Euclidean projection onto X. Hence, the update rule x+ = Px(y) boils down to a "vanilla", Euclidean projection step along y.

Example 3.2 (Entropic regularization and multiplicative weights). Suppressing the player index for simplicity, let X be a d-dimensional simplex and consider the entropic regularizer h(x) = ∑j xj log xj. The induced pseudo-distance is the so-called Kullback–Leibler (KL) divergence DKL(p, x) = ∑j pj log(pj/xj), which gives rise to the prox-mapping

Px(y) = (x1 exp(y1), . . . , xd exp(yd)) / ∑j xj exp(yj)   (3.5)

for all x ∈ X°, y ∈ Y. The update rule x+ = Px(y) is widely known as the multiplicative weights (MW) algorithm and plays a central role for learning in multi-armed bandit problems and finite games (Arora et al., 2012; Auer et al., 1995; Freund and Schapire, 1999).

With all this in hand, the multi-agent mirror descent (MD) algorithm is given by the recursion

Xn+1 = PXn(γn ˆvn),   (MD)

where γn is a variable step-size sequence and ˆvn = (ˆvi,n)i∈N is a generic feedback sequence of estimated gradients. In the next section, we detail how this sequence is generated with first- or zeroth-order (bandit) feedback.

4 First-order vs. bandit feedback

4.1 First-order feedback.

A common assumption in the literature is that players are able to obtain gradient information by querying a first-order oracle (Nesterov, 2004),
i.e., a "black-box" feedback mechanism that outputs an estimate ˆvi of the individual payoff gradient vi(x) of the i-th player at the current action profile x = (xi; x−i) ∈ X. This estimate could be either perfect, giving ˆvi = vi(x) for all i ∈ N, or imperfect, returning noisy information of the form ˆvi = vi(x) + Ui, where Ui denotes the oracle's error (random, systematic, or otherwise).

Having access to a perfect oracle is usually a tall order, either because payoff gradients are difficult to compute directly (especially without global knowledge), because they involve an expectation over a possibly unknown probability law, or for any number of other reasons. It is therefore more common to assume that each player has access to a stochastic oracle which, when called against a sequence of actions Xn ∈ X, produces a sequence of gradient estimates ˆvn = (ˆvi,n)i∈N that satisfies the following statistical assumptions:

a) Unbiasedness: E[ˆvn | Fn] = v(Xn).
b) Finite mean square: E[‖ˆvn‖*² | Fn] ≤ V² for some finite V ≥ 0.   (4.1)

In terms of measurability, the expectation in (4.1) is conditioned on the history Fn of Xn up to stage n; in particular, since ˆvn is generated randomly from Xn, it is not Fn-measurable (and hence not adapted). To make this more transparent, we will write ˆvn = v(Xn) + Un+1, where Un is an adapted martingale difference sequence with E[‖Un+1‖*² | Fn] ≤ σ² for some finite σ ≥ 0.

4.2 Bandit feedback.

Now, if players don't have access to a first-order oracle – the so-called bandit or payoff-based framework – they will need to derive an individual gradient estimate from the only information at their disposal: the actual payoffs they receive at each stage.
When a function can be queried at multiple points (as few as two in practice), there are efficient ways to estimate its gradient via directional sampling techniques as in Agarwal et al. (2010). In a game-theoretic setting, however, multiple-point estimation techniques do not apply because, in general, a player's payoff function depends on the actions of all players. Thus, when a player attempts to get a second query of their payoff function, this function may have already changed due to the query of another player – i.e., instead of sampling ui(·; x−i), the i-th player would be sampling ui(·; x′−i) for some x′−i ≠ x−i.

Following Spall (1997) and Flaxman et al. (2005), we posit instead that players rely on a simultaneous perturbation stochastic approximation (SPSA) approach that allows them to estimate their individual payoff gradients vi based on a single function evaluation. In detail, the key steps of this one-shot estimation process for each player i ∈ N are:

0. Fix a query radius δ > 0.⁵
1. Pick a pivot point xi ∈ Xi where player i seeks to estimate their payoff gradient.
2. Draw a vector zi from the unit sphere Si ≡ Sdi of Vi ≡ Rdi and play ˆxi = xi + δzi.⁶
3. Receive ˆui = ui(ˆxi; ˆx−i) and set

ˆvi = (di/δ) ˆui zi.   (4.2)

By adapting a standard argument based on Stokes' theorem (detailed in the supplement), it can be shown that ˆvi is an unbiased estimator of the individual gradient of the δ-smoothed payoff function

uδi(x) = [1 / (vol(δBi) ∏j≠i vol(δSj))] ∫ over δBi ∫ over ∏j≠i δSj of ui(xi + wi; x−i + z−i) dz1 ··· dwi ··· dzN,   (4.3)

with Bi ≡ Bdi denoting the unit ball of Vi.
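As a quick sanity check, the one-shot estimate (4.2) can be simulated for a single player with a toy quadratic payoff (a hypothetical example, not from the paper): averaging many independent one-shot estimates should recover the gradient of the δ-smoothed payoff, which for a quadratic coincides with the true gradient.

```python
import numpy as np

# Toy payoff u(x) = -||x||^2 with known gradient -2x (hypothetical example).
rng = np.random.default_rng(1)
x0 = np.array([0.5, -0.25, 1.0])
n, delta, d = 200_000, 0.2, 3

# Draw n independent directions uniformly on the unit sphere S^{d-1}.
Z = rng.standard_normal((n, d))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)

# One payoff query per sample, then form (d/delta) * u(x0 + delta*z) * z as in (4.2).
u_vals = -np.sum((x0 + delta * Z) ** 2, axis=1)
est = (d / delta) * (u_vals[:, None] * Z).mean(axis=0)
print(est)   # should be close to the true gradient -2 * x0
```

Note the bias-variance trade-off already visible here: shrinking `delta` reduces the smoothing bias but inflates the factor `d/delta`, so far more samples are needed for the average to settle.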
The Lipschitz continuity of vi guarantees that ‖∇i ui − ∇i uδi‖∞ = O(δ), so this estimate becomes more and more accurate as δ → 0⁺. On the other hand, the second moment of ˆvi grows as O(1/δ²), implying in turn that the variability of ˆvi grows unbounded as δ → 0⁺. This manifestation of the bias-variance dilemma plays a crucial role in designing no-regret policies with bandit feedback (Flaxman et al., 2005; Kleinberg, 2004), so δ must be chosen with care.

Before dealing with this choice, though, it is important to highlight two feasibility issues that arise with the single-shot SPSA estimate (4.2). The first has to do with the fact that the perturbation direction zi is chosen from the unit sphere Si, so it may fail to be tangent to Xi, even when xi is interior. To iron out this wrinkle, it suffices to sample zi from the intersection of Si with the affine hull of Xi in Vi; on that account (and without loss of generality), we will simply assume in what follows that each Xi is a convex body of Vi, i.e., it has nonempty topological interior.

⁵For simplicity, we take δ equal for all players; the extension to player-specific δ is straightforward, so we omit it.
⁶We tacitly assume here that the query directions zi ∈ Sdi are drawn independently across players.

The second feasibility issue concerns the size of the perturbation step: even if zi is a feasible direction of motion, the query point ˆxi = xi + δzi may be infeasible if xi is too close to the boundary of Xi. For this reason, we will introduce a "safety net" in the spirit of Agarwal et al. (2010), and we will constrain the set of possible pivot points xi to lie within a suitably shrunk zone of X.

In detail, let Bri(pi) be an ri-ball centered at pi ∈ Xi so that Bri(pi) ⊆ Xi.
Then, instead of perturbing xi by zi, we consider the feasibility adjustment

wi = zi − (xi − pi)/ri,   (4.4)

and each player plays ˆxi = xi + δwi instead of xi + δzi. In other words, this adjustment moves each pivot to xδi = xi − (δ/ri)(xi − pi), i.e., O(δ)-closer to the interior base point pi, and then perturbs xδi by δzi. Feasibility of the query point is then ensured by noting that

ˆxi = xδi + δzi = (1 − δ/ri) xi + (δ/ri)(pi + ri zi),   (4.5)

so ˆxi ∈ Xi if δ/ri < 1 (since pi + ri zi ∈ Bri(pi) ⊆ Xi).

The difference between this estimator and the oracle framework we discussed above is twofold. First, each player's realized action is ˆxi = xi + δwi, not xi, so there is a disparity between the point at which payoffs are queried and the action profile where the oracle is called. Second, the resulting estimator ˆv is not unbiased, so the statistical assumptions (4.1) for a stochastic oracle do not hold. In particular, given the feasibility adjustment (4.4), the estimate (4.2) with ˆx given by (4.5) satisfies

E[ˆvi] = ∇i uδi(xδi; xδ−i),   (4.6)

so there are two sources of systematic error: an O(δ) perturbation in the function, and an O(δ) perturbation of each player's pivot point from xi to xδi. Hence, to capture both sources of bias and separate them from the random noise, we will write

ˆvi = vi(x) + Ui + bi,   (4.7)

where Ui = ˆvi − E[ˆvi] and bi = ∇i uδi(xδ) − ∇i ui(x).
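The following sketch (a hypothetical one-player example on X = [0, 2], with constants not taken from the paper) illustrates both points numerically: the adjusted query (4.5) never leaves the feasible set, while the empirical bias of the estimate (4.2) shrinks like O(δ) and its variance blows up like O(1/δ²):

```python
import numpy as np

# Toy payoff u(x) = -(x - 1)^2 on X = [0, 2]; true gradient at x = 0.5 is 1.
rng = np.random.default_rng(4)
u = lambda x: -(x - 1.0) ** 2
x, p, r = 0.5, 1.0, 1.0                        # pivot and safety ball B_r(p) = [0, 2]

def one_shot(delta, n=400_000):
    """Empirical mean and variance of n one-shot estimates (4.2) at pivot x."""
    z = rng.choice([-1.0, 1.0], size=n)        # d = 1: the unit sphere is {-1, +1}
    w = z - (x - p) / r                        # feasibility adjustment (4.4)
    x_hat = x + delta * w                      # realized actions, as in (4.5)
    assert np.all((0.0 <= x_hat) & (x_hat <= 2.0))   # query always feasible
    v_hat = (1.0 / delta) * u(x_hat) * z       # one-shot estimates (4.2)
    return v_hat.mean(), v_hat.var()

for delta in (0.2, 0.02):
    mean, var = one_shot(delta)
    print(delta, abs(mean - 1.0), var)         # bias ~ O(delta), variance ~ O(1/delta^2)
```

For this quadratic toy payoff the bias can be computed in closed form: E[ˆv] = 1 − δ, so shrinking δ by a factor of 10 cuts the bias tenfold while the variance grows by roughly two orders of magnitude.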
We are thus led to the following manifestation of the bias-variance dilemma: the bias term b in (4.7) is O(δ), but the second moment of the noise term U is O(1/δ²); as such, an increase in accuracy (small bias) would result in a commensurate loss of precision (large noise variance). Balancing these two factors will be a key component of our analysis in the next section.

5 Convergence analysis and results

Combining the learning framework of Section 3 with the single-shot gradient estimation machinery of Section 4, we obtain the following variant of (MD) with payoff-based, bandit feedback:

ˆXn = Xn + δn Wn,
Xn+1 = PXn(γn ˆvn).   (MD-b)

In the above, the perturbations Wn and the estimates ˆvn are given respectively by (4.4) and (4.2), i.e.,

Wi,n = Zi,n − (Xi,n − pi)/ri,
ˆvi,n = (di/δn) ui(ˆXn) Zi,n,   (5.1)

and Zi,n is drawn independently and uniformly across players at each stage n (see also Algorithm 1 for a pseudocode implementation and Fig. 1 for a schematic representation).

In the rest of this paper, our goal will be to determine the equilibrium convergence properties of this scheme in concave N-person games. Our first asymptotic result below shows that, under (MD-b), the players' learning process converges to Nash equilibrium in monotone games:

Theorem 5.1.
Suppose that the players of a monotone game $G \equiv G(\mathcal{N}, \mathcal{X}, u)$ follow (MD-b) with step-size $\gamma_n$ and query radius $\delta_n$ such that
$$\lim_{n\to\infty} \gamma_n = \lim_{n\to\infty} \delta_n = 0, \qquad \sum_{n=1}^{\infty} \gamma_n = \infty, \qquad \sum_{n=1}^{\infty} \gamma_n \delta_n < \infty, \qquad \text{and} \qquad \sum_{n=1}^{\infty} \frac{\gamma_n^2}{\delta_n^2} < \infty. \tag{5.2}$$
Then, the sequence of realized actions $\hat X_n$ converges to Nash equilibrium with probability 1.

Algorithm 1: Multi-agent mirror descent with bandit feedback (player indices suppressed)
Require: step-size $\gamma_n > 0$, query radius $\delta_n > 0$, safety ball $B_r(p) \subseteq \mathcal{X}$
1: choose $X \in \operatorname{dom} \partial h$  # initialization
2: repeat at each stage $n = 1, 2, \dots$
3:   draw $Z$ uniformly from $S^d$  # perturbation direction
4:   set $W \leftarrow Z - r^{-1}(X - p)$  # query direction
5:   play $\hat X \leftarrow X + \delta_n W$  # choose action
6:   receive $\hat u \leftarrow u(\hat X)$  # get payoff
7:   set $\hat v \leftarrow (d/\delta_n)\,\hat u\, Z$  # estimate gradient
8:   update $X \leftarrow P_X(\gamma_n \hat v)$  # update pivot
9: until end

Figure 1: Schematic representation of Algorithm 1 with ordinary, Euclidean projections. To reduce visual clutter, we did not include the feasibility adjustment $r^{-1}(x - p)$ in the action selection step $X_n \mapsto \hat X_n$.
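As an illustration of Algorithm 1, here is a minimal sketch of its Euclidean incarnation run on a simple two-player strongly monotone game; the game, the action sets $\mathcal{X}_i = [-1, 1]$, and the tuning constants are illustrative choices for this sketch, not examples from the paper.

```python
import numpy as np

def payoffs(x):
    """Illustrative 2-player game on X_i = [-1, 1]; its gradient map has
    Jacobian [[-2, 1], [-1, -2]], so the game is strongly monotone (beta = 4),
    with unique Nash equilibrium x* = (0.4, -0.2)."""
    x1, x2 = x
    return np.array([-(x1 - 0.5) ** 2 + x1 * x2, -x2 ** 2 - x1 * x2])

def run_md_b(n_steps, gamma=0.5, delta=0.25, p=0.0, r=1.0, seed=0):
    """Mirror descent with bandit feedback (MD-b), Euclidean version.

    Schedules as in Theorem 5.2: gamma_n = gamma/n, delta_n = delta/n^(1/3),
    with gamma = 0.5 > 1/(3*beta) = 1/12 for this game. The safety ball is
    B_r(p) = [-1, 1] itself; in one dimension Z is a uniform random sign.
    """
    rng = np.random.default_rng(seed)
    x = np.array([-0.9, 0.9])                    # initial pivots
    for n in range(1, n_steps + 1):
        g_n, d_n = gamma / n, delta / n ** (1 / 3)
        z = rng.choice([-1.0, 1.0], size=2)      # Z uniform on the sphere (d_i = 1)
        w = z - (x - p) / r                      # feasibility adjustment (4.4)
        x_hat = x + d_n * w                      # realized actions (feasible: d_n < r)
        u = payoffs(x_hat)                       # bandit feedback: payoffs only
        v_hat = (1.0 / d_n) * u * z              # one-point estimate (4.2), d_i = 1
        x = np.clip(x + g_n * v_hat, -1.0, 1.0)  # Euclidean projection step
    return x

x_final = run_md_b(100_000)
print(np.linalg.norm(x_final - np.array([0.4, -0.2])))  # distance to equilibrium
```

Despite each player seeing only a scalar payoff at each stage, the pivots drift toward the (interior) Nash equilibrium, in line with Theorem 5.1.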
Even though the setting is different, the conditions (5.2) for the tuning of the algorithm's parameters are akin to those encountered in Kiefer–Wolfowitz stochastic approximation schemes and serve a similar purpose. First, the conditions $\lim_{n\to\infty} \gamma_n = 0$ and $\sum_{n=1}^{\infty} \gamma_n = \infty$ respectively mitigate the method's inherent randomness and ensure a horizon of sufficient length. The requirement $\lim_{n\to\infty} \delta_n = 0$ is also straightforward to explain: as players accrue more information, they need to decrease the sampling bias in order to have any hope of converging. However, as we discussed in Section 4, decreasing $\delta$ also increases the variance of the players' gradient estimates, which might grow to infinity as $\delta \to 0$. The crucial observation here is that new gradients enter the algorithm with a weight of $\gamma_n$, so the aggregate bias after $n$ stages is of the order of $O(\sum_{k=1}^{n} \gamma_k \delta_k)$ and its variance is $O(\sum_{k=1}^{n} \gamma_k^2/\delta_k^2)$. If these error terms can be controlled, there is an underlying drift that emerges over time and which steers the process to equilibrium. We make this precise in the supplement by using a suitably adjusted variant of the Bregman divergence as a quasi-Fejér energy function for (MD-b) and relying on a series of (sub)martingale convergence arguments to establish the convergence of $\hat X_n$ (first along a subsequence, then with probability 1).

Of course, since Theorem 5.1 is asymptotic in nature, it is not clear how to choose $\gamma_n$ and $\delta_n$ so as to optimize the method's convergence rate. Heuristically, if we take schedules of the form $\gamma_n = \gamma/n^p$ and $\delta_n = \delta/n^q$ with $\gamma, \delta > 0$ and $0 < p, q \le 1$, the only conditions imposed by (5.2) are $p + q > 1$ and $p - q > 1/2$.
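To spell out the last step (a routine check, not part of the paper's main text): with these schedules, the summability requirements of (5.2) reduce to $p$-series criteria,

```latex
\sum_{n=1}^{\infty} \gamma_n \delta_n
  = \gamma\delta \sum_{n=1}^{\infty} n^{-(p+q)} < \infty
  \;\Longleftrightarrow\; p + q > 1,
\qquad
\sum_{n=1}^{\infty} \frac{\gamma_n^2}{\delta_n^2}
  = \frac{\gamma^2}{\delta^2} \sum_{n=1}^{\infty} n^{-2(p-q)} < \infty
  \;\Longleftrightarrow\; p - q > \tfrac{1}{2},
```

while $\sum_{n} \gamma_n = \gamma \sum_{n} n^{-p} = \infty$ holds automatically because $p \le 1$, and $\gamma_n, \delta_n \to 0$ for all $p, q > 0$.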
However, as we discussed above, the aggregate bias in the algorithm after $n$ stages is $O(\sum_{k=1}^{n} \gamma_k \delta_k) = O(1/n^{p+q-1})$ and its variance is $O(\sum_{k=1}^{n} \gamma_k^2/\delta_k^2) = O(1/n^{2p-2q-1})$: if the conditions (5.2) are satisfied, both error terms vanish, but they might do so at very different rates. By equating these exponents in order to bridge this gap, we obtain $q = p/3$; moreover, since the single-shot SPSA estimator (4.2) introduces a $\Theta(\delta_n)$ random perturbation, $q$ should be taken as large as possible to ensure that this perturbation vanishes at the fastest possible rate. As a result, the most suitable choice for $p$ and $q$ seems to be $p = 1$, $q = 1/3$, leading to an error bound of $O(1/n^{1/3})$.

We show below that this bound is indeed attainable for games that are strongly monotone, i.e., games that satisfy the following stronger variant of diagonal strict concavity:
$$\sum_{i\in\mathcal{N}} \lambda_i \langle v_i(x') - v_i(x), x_i' - x_i \rangle \le -\frac{\beta}{2} \|x' - x\|^2 \tag{$\beta$-DSC}$$
for some $\lambda_i, \beta > 0$ and for all $x, x' \in \mathcal{X}$. Focusing for expository reasons on the most widely used, Euclidean incarnation of the method (Example 3.1), we have:

Theorem 5.2. Let $x^*$ be the (necessarily unique) Nash equilibrium of a $\beta$-strongly monotone game. If the players follow (MD-b) with Euclidean projections and parameters $\gamma_n = \gamma/n$ and $\delta_n = \delta/n^{1/3}$ with $\gamma > 1/(3\beta)$ and $\delta > 0$, we have
$$\mathbb{E}[\|\hat X_n - x^*\|^2] = O(n^{-1/3}). \tag{5.3}$$
Theorem 5.2 is our main finite-time analysis result, so some remarks are in order.
First, the step-size schedule $\gamma_n \propto 1/n$ is not required to obtain an $O(n^{-1/3})$ convergence rate: as we show in the paper's supplement, more general schedules of the form $\gamma_n \propto 1/n^p$ and $\delta_n \propto 1/n^q$ with $p > 3/4$ and $q = p/3 > 1/4$ still guarantee an $O(n^{-1/3})$ rate of convergence for (MD-b). To put things in perspective, we also show in the supplement that if (MD) is run with first-order oracle feedback satisfying the statistical assumptions (4.1), the rate of convergence becomes $O(1/n)$. Viewed in this light, the price for not having access to gradient information is no higher than $O(n^{-2/3})$ in terms of the players' equilibration rate.

Finally, it is also worth comparing the bound (5.3) to the attainable rates for stochastic convex optimization (the single-player case). For problems with objectives that are both strongly convex and smooth, Agarwal et al. (2010) attained an $O(n^{-1/2})$ convergence rate with bandit feedback, which Shamir (2013) showed is unimprovable. Thus, in the single-player case, the bound (5.3) is off by $n^{1/6}$ and coincides with the bound of Agarwal et al. (2010) for strongly convex functions that are not necessarily smooth. One reason for this gap is that the $\Theta(n^{-1/2})$ bound of Shamir (2013) concerns the smoothed-out time average $\bar X_n = n^{-1} \sum_{k=1}^{n} X_k$, while our analysis concerns the sequence of realized actions $\hat X_n$. This difference is semantically significant: in optimization, the query sequence is just a means to an end, and only the algorithm's output (i.e., $\bar X_n$) matters. In a game-theoretic setting, however, it is the players' realized actions that determine their rewards at each stage, so the figure of merit is the actual sequence of play $\hat X_n$.
This sequence is more difficult to control, so this disparity is, perhaps, not too surprising; nevertheless, we believe that this gap can be closed by using a more sophisticated single-shot estimate, e.g., as in Ghadimi and Lan (2013). We defer this analysis to future work.

6 Concluding remarks

The most sensible choice for agents who are oblivious to the presence of each other (or who are simply conservative) is to deploy a no-regret learning algorithm. With this in mind, we studied the long-run behavior of individual regularized no-regret learning policies and showed that, in monotone games, play converges to equilibrium with probability 1, and the rate of convergence almost matches the optimal rates of single-agent stochastic convex optimization. Nevertheless, several questions remain open: whether there is an intrinsic information-theoretic obstacle to bridging this gap; whether our convergence rate estimates hold with high probability (and not just in expectation); and whether our analysis extends to a fully decentralized setting where the players' updates need not be synchronous. We intend to address these questions in future work.

Acknowledgments

M. Bravo gratefully acknowledges the support provided by FONDECYT grant 11151003. P. Mertikopoulos was partially supported by the Huawei HIRP flagship grant ULTRON and the French National Research Agency (ANR) grant ORACLESS (ANR-16-CE33-0004-01). Part of this work was carried out with financial support by the ECOS project C15E03.

References

Agarwal, Alekh, Ofer Dekel, Lin Xiao. 2010. Optimal algorithms for online convex optimization with multi-point bandit feedback. COLT '10: Proceedings of the 23rd Annual Conference on Learning Theory.

Arora, Sanjeev, Elad Hazan, Satyen Kale. 2012. The multiplicative weights update method: A meta-algorithm and applications.
Theory of Computing 8(1) 121–164.

Auer, Peter, Nicolò Cesa-Bianchi, Yoav Freund, Robert E. Schapire. 1995. Gambling in a rigged casino: The adversarial multi-armed bandit problem. Proceedings of the 36th Annual Symposium on Foundations of Computer Science.

Bauschke, Heinz H., Patrick L. Combettes. 2017. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. 2nd ed. Springer, New York, NY, USA.

Benaïm, Michel. 1999. Dynamics of stochastic approximation algorithms. Jacques Azéma, Michel Émery, Michel Ledoux, Marc Yor, eds., Séminaire de Probabilités XXXIII, Lecture Notes in Mathematics, vol. 1709. Springer Berlin Heidelberg, 1–68.

Bervoets, Sebastian, Mario Bravo, Mathieu Faure. 2018. Learning with minimal information in continuous games. https://arxiv.org/abs/1806.11506.

Chen, Gong, Marc Teboulle. 1993. Convergence analysis of a proximal-like minimization algorithm using Bregman functions. SIAM Journal on Optimization 3(3) 538–543.

Cohen, Johanne, Amélie Héliou, Panayotis Mertikopoulos. 2017. Learning with bandit feedback in potential games. NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems.

Debreu, Gerard. 1952. A social equilibrium existence theorem. Proceedings of the National Academy of Sciences of the USA 38(10) 886–893.

D'Oro, Salvatore, Panayotis Mertikopoulos, Aris L. Moustakas, Sergio Palazzo. 2015. Interference-based pricing for opportunistic multi-carrier cognitive radio systems. IEEE Trans. Wireless Commun. 14(12) 6536–6549.

Flaxman, Abraham D., Adam Tauman Kalai, H. Brendan McMahan. 2005. Online convex optimization in the bandit setting: gradient descent without a gradient. SODA '05: Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms. 385–394.

Foster, Dylan J., Thodoris Lykouris, Karthik Sridharan, Éva Tardos. 2016.
Learning in games: Robustness of fast convergence. NIPS '16: Proceedings of the 30th International Conference on Neural Information Processing Systems. 4727–4735.

Freund, Yoav, Robert E. Schapire. 1999. Adaptive game playing using multiplicative weights. Games and Economic Behavior 29 79–103.

Ghadimi, Saeed, Guanghui Lan. 2013. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 23(4) 2341–2368.

Kleinberg, Robert D. 2004. Nearly tight bounds for the continuum-armed bandit problem. NIPS '04: Proceedings of the 18th Annual Conference on Neural Information Processing Systems.

Mertikopoulos, Panayotis, E. Veronica Belmega. 2016. Learning to be green: Robust energy efficiency maximization in dynamic MIMO-OFDM systems. IEEE J. Sel. Areas Commun. 34(4) 743–757.

Mertikopoulos, Panayotis, E. Veronica Belmega, Romain Negrel, Luca Sanguinetti. 2017. Distributed stochastic optimization via matrix exponential learning. IEEE Trans. Signal Process. 65(9) 2277–2290.

Mertikopoulos, Panayotis, Bruno Lecouat, Houssam Zenati, Chuan-Sheng Foo, Vijay Chandrasekhar, Georgios Piliouras. 2018a. Optimistic mirror descent in saddle-point problems: Going the extra (gradient) mile. https://arxiv.org/abs/1807.02629.

Mertikopoulos, Panayotis, Christos H. Papadimitriou, Georgios Piliouras. 2018b. Cycles in adversarial regularized learning. SODA '18: Proceedings of the 29th Annual ACM-SIAM Symposium on Discrete Algorithms.

Mertikopoulos, Panayotis, Zhengyuan Zhou. 2018. Learning in games with continuous action sets and unknown payoff functions. Mathematical Programming.

Nemirovski, Arkadi Semen, David Berkovich Yudin. 1983. Problem Complexity and Method Efficiency in Optimization. Wiley, New York, NY.

Nesterov, Yurii. 2004. Introductory Lectures on Convex Optimization: A Basic Course. No.
87 in Applied Optimization, Kluwer Academic Publishers.

Nesterov, Yurii. 2009. Primal-dual subgradient methods for convex problems. Mathematical Programming 120(1) 221–259.

Orda, Ariel, Raphael Rom, Nahum Shimkin. 1993. Competitive routing in multi-user communication networks. IEEE/ACM Trans. Netw. 1(5) 614–627.

Palaiopanos, Gerasimos, Ioannis Panageas, Georgios Piliouras. 2017. Multiplicative weights update with constant step-size in congestion games: Convergence, limit cycles and chaos. NIPS '17: Proceedings of the 31st International Conference on Neural Information Processing Systems.

Perkins, Steven, David S. Leslie. 2014. Stochastic fictitious play with continuous action sets. Journal of Economic Theory 152 179–213.

Perkins, Steven, Panayotis Mertikopoulos, David S. Leslie. 2017. Mixed-strategy learning with continuous action sets. IEEE Trans. Autom. Control 62(1) 379–384.

Rosen, J. B. 1965. Existence and uniqueness of equilibrium points for concave N-person games. Econometrica 33(3) 520–534.

Scutari, Gesualdo, Francisco Facchinei, Daniel Pérez Palomar, Jong-Shi Pang. 2010. Convex optimization, game theory, and variational inequality theory in multiuser communication systems. IEEE Signal Process. Mag. 27(3) 35–49.

Shamir, Ohad. 2013. On the complexity of bandit and derivative-free stochastic convex optimization. COLT '13: Proceedings of the 26th Annual Conference on Learning Theory.

Sorin, Sylvain, Cheng Wan. 2016. Finite composite games: Equilibria and dynamics. Journal of Dynamics and Games 3(1) 101–120.

Spall, James C. 1997. A one-measurement form of simultaneous perturbation stochastic approximation. Automatica 33(1) 109–112.

Syrgkanis, Vasilis, Alekh Agarwal, Haipeng Luo, Robert E. Schapire. 2015. Fast convergence of regularized learning in games.
NIPS '15: Proceedings of the 29th International Conference on Neural Information Processing Systems. 2989–2997.

Viossat, Yannick, Andriy Zapechelnyuk. 2013. No-regret dynamics and fictitious play. Journal of Economic Theory 148(2) 825–842.

Zinkevich, Martin. 2003. Online convex programming and generalized infinitesimal gradient ascent. ICML '03: Proceedings of the 20th International Conference on Machine Learning. 928–936.