NeurIPS 2020

Review 1

Summary and Contributions: This paper introduces the notion of contextual games, a class of games where the payoff of each player depends on both the action profile and the current context. A typical example is a routing game, where the delay faced by an agent depends not only on the paths chosen by all the agents, but also on the context, e.g., edge capacities, weather, etc. From a single agent's point of view, this is just an adversarial contextual bandit problem, and existing algorithms can be used to obtain a no-regret learning algorithm. However, the paper imposes kernel-based regularity assumptions to obtain improved regret bounds. In particular, the main contributions are the following:
1. When the contexts come from a small finite set Z, the authors prove a bound of \sqrt{T|Z| \log K}. (*)
2. When the contexts are chosen from a c-dimensional space, the authors prove a bound of \sqrt{cTK \log K}.
3. The authors introduce the concept of contextual coarse correlated equilibrium and show that if the players follow the proposed contextual bandit algorithms, then each player incurs vanishing regret. Moreover, an extension of the standard price-of-anarchy argument shows that the optimal contextual welfare can be achieved up to the PoA ratio.
(*) Please see my later comments regarding these points.
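As a reading aid (my own sketch, not the paper's algorithm): in the finite-context case, a bound of order \sqrt{T|Z| \log K} is what one would expect from running a separate full-information multiplicative-weights (Hedge) learner per context. The function below is a hypothetical illustration of that idea only; the paper's actual algorithm handles bandit feedback and kernel-based regularity.

```python
import math

def contextual_hedge(reward_fn, context_seq, K, eta=0.1):
    """Toy full-information Hedge learner, run separately per context.

    weights[z] is a multiplicative-weights vector over K actions; the
    round-t context selects which vector gets updated. Illustration of
    the finite-context setting only, not the paper's algorithm.
    """
    weights = {z: [1.0] * K for z in set(context_seq)}
    for z in context_seq:
        for a in range(K):
            # exponential update; rewards are assumed to lie in [0, 1]
            weights[z][a] *= math.exp(eta * reward_fn(z, a))
    # normalize each per-context weight vector into a distribution
    dists = {}
    for z, w in weights.items():
        total = sum(w)
        dists[z] = [wi / total for wi in w]
    return dists
```

With two contexts that each favor a different action, the per-context distributions quickly concentrate on the action that is optimal for that context.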

Strengths:
1. The main contribution of the paper is the introduction of contextual games. These games are more general than standard repeated games and are better models of routing games; on the other hand, unlike stochastic games, they are not hard to solve.
2. Another contribution is the use of kernel-based regularity assumptions to derive improved bounds for contextual bandits. However, as I point out in the Relation to Prior Work section, this is not entirely new.
3. I view this paper as mainly making a new theoretical contribution to the literature on multi-armed bandits and learning in games, and it is definitely of interest to the NeurIPS community.
4. I think the theorems and the proofs are correct. However, I am not convinced by the claim that the dependence on K can be improved to logarithmic, as the paper provides no bound on the term \gamma_T.

Weaknesses:
1. One of the terms appearing in the regret (Theorems 1 and 2) is \sqrt{\gamma_T T}. But how does \gamma_T change as we increase the number of actions K and the number of agents N? As upper bounds, the authors provide two examples, d \log T and \log^{d+1} T, where d is the dimension of the space of (action profile, context) tuples. Since the number of action tuples grows exponentially with N, could it be that d becomes large very quickly and dominates the other term in the regret? It seems to me that, depending on the dimension of the domain, the claim that the regret depends only logarithmically on the number of actions K might not hold.
2. The authors rightly point out that, from a single agent's point of view, the problem can be thought of as an adversarial contextual bandit. However, instead of using existing algorithms, they design new algorithms to obtain improved bounds. But what is the limit of such improvements if we just make some smoothness assumptions? The paper does not prove or discuss any lower bounds.
3. The price-of-anarchy result (Proposition 6) effectively replaces the sequence of \lambda's and \mu's by their maximum and minimum, respectively. But I don't think this is the right generalization of the classic result to the contextual setting. Why should the price of anarchy be bad if there is just one round t that gives poor choices of \lambda(z_t) and \mu(z_t)?
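For reference, the scalings behind this concern (standard maximum-information-gain bounds from the Gaussian-process bandit literature; reading d as the joint action-profile/context dimension is the point at issue) can be written as:

```latex
R(T) = O\!\left(\sqrt{\gamma_T \, T}\right), \qquad
\gamma_T = O(d \log T) \ \text{(linear kernel)}, \qquad
\gamma_T = O\!\left(\log^{d+1} T\right) \ \text{(squared-exponential kernel)}.
```

If d grows with the number of agents N, the \sqrt{\gamma_T T} term can come to dominate the \sqrt{T \log K} term.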

Correctness: I read the paper carefully and I think the theorems and propositions are correct as stated. I am not entirely convinced that the regret scales logarithmically in K, as the authors claim in line 63. What if the term \gamma_T grows faster than \log K and dominates the first term in the regret? I thought the experiment was correct and definitely showed that the proposed method dominates the other approaches.

Clarity: I thought the paper was well-written and I did not have any major problem following the arguments presented in the paper. However, I have a few suggestions:
> I would have liked to see more details about the GP-MW algorithm, since it is crucial for designing Algorithm 1.
> It might be a good idea to move the background on price of anarchy to the related work section.
> The term pseudo-regret is used without a definition.
> The main text introduces the term \gamma_T and says that it is a sample-complexity parameter that can be bounded analytically. Perhaps, in the introduction, the authors should highlight why such a term is necessary and provide more explanation of its definition.

Relation to Prior Work: The authors claim that the prior literature on contextual bandits does not exploit the fact that similar contexts and outcomes should produce similar rewards. However, there have been some attempts to model such similarity information, for example:
> Contextual Bandits with Similarity Information, Slivkins, 2011.
> Showing Relevant Ads via Lipschitz Context Multi-Armed Bandits, Lu et al., 2010.
Both papers assume that the reward, viewed as a function of the context and the action, is Lipschitz continuous. Lu et al. also derive a regret bound of the form T^{(c+1)/(c+2)}, which appears as one of the terms in Theorem 1. It would be helpful if the authors could clarify how the kernel-based regularity assumption compares with these approaches. The authors should also have discussed the results on data-dependent bounds for contextual bandits. The authors claim that their regret bound scales logarithmically with K (the number of actions), compared to the \sqrt{K} bound of previous contextual bandit algorithms. Recently, the following paper gave some improvement in this regard:
> Taking a Hint: How to Leverage Loss Predictors in Contextual Bandits, Wei et al., 2020.
Otherwise, I felt that the paper covered the prior literature carefully.

Reproducibility: Yes

Additional Feedback: Update after Rebuttal: The authors addressed my criticisms about the dependence on \gamma_T and the price of anarchy. However, in the worst case, \gamma_T can increase linearly with N, which implies that the proposed bound is meaningful only in settings with a small number of agents. Additionally, the authors acknowledged my comment about the price of anarchy but did not provide a way to improve the bound. Therefore, I am keeping my score as it is.

Review 2

Summary and Contributions: The paper introduces contextual games: repeated games in which, at each round, there is a context that all players can see and that affects their reward functions. It is assumed that the reward functions are smooth with respect to some kernel, so that similar contexts yield similar rewards. Players have unknown reward functions, but they can observe the strategy profile at the end of each round. The authors propose a multiplicative-weights-type algorithm that achieves sublinear regret when played against other players and an oblivious adversary that chooses the context sequence. Based on this result, it is shown that the dynamics of the game converge to a contextual coarse correlated equilibrium (introduced and defined here). For smooth games, a bound on the gap from the social optimum is provided. Enhanced results for the case of i.i.d. contexts are also shown. The proposed algorithm is then demonstrated on a traffic routing game.

Strengths: This is a nice and creative paper that provides a framework that can be useful for the ML community. The theoretical results are sound and interesting, and the experiments serve as an interesting use case.

Weaknesses: From a technical point of view, the result is an incremental enhancement of [32] and follows by connecting known results. As such, the significance of the paper rests on the originality and usefulness of the novel framework of contextual games. This is by itself of course fine, since impact and usefulness are possibly the most important aspects anyway. The main weakness of this paper is that the usefulness and motivation of the results are a bit vague, because it's not clear why selfish players would follow the proposed algorithm. From a game-theoretic point of view, the approach here misses the dynamic aspect of repeated games: a more meaningful equilibrium notion for a repeated game takes the known history of the game into account (i.e., subgame-perfect equilibrium). From an online-learning point of view, the standard story that "bypasses" this conceptual difficulty is that each agent doesn't know it's facing other agents, and plays a good (no-regret) algorithm to guarantee its performance against any arbitrary adversary. If this adversary happens to be N-1 other players, then the nice finding is that a Nash equilibrium emerges from the interaction. However, this paper assumes that players can observe the strategy profile. Even more, the algorithm requires players to approximate the reward functions. With approximately known reward functions and full awareness of the presence of N-1 other players, it's not clear why players would agree to run Algorithm 1 when they could adopt sophisticated dynamic strategies (cf. folk theorems). For this reason, I find lines 76-77 a bit misleading (what important results are recovered? It would be better to be explicit). The (somewhat implicit) assumption that T is known also doesn't help clarify in which kind of scenario running Algorithm 1 makes sense. Of course, players don't have to be selfish.
But if players are cooperative, then it's not clear why they would want to converge to a contextual coarse correlated equilibrium. The smoothness assumption, together with the provided bound on the gap from the social optimum, does provide some justification for considering this algorithm from a cooperative point of view, and the experiments also support this. However, if this is the main motivation, it should be discussed and compared to more cooperative approaches. My hunch is that this is the best way to strengthen the contribution of this paper. That said, I tend to like the paper despite these weaknesses. I'm very open to hearing the authors' interpretation of the results (and I believe readers would be happy to see it as well).

Correctness: The proofs in the paper are easy to follow and rigorous.

Clarity: The paper is very well written and well-organized.

Relation to Prior Work: Previous work is well discussed. The game-theoretic literature on dynamic equilibria in repeated games should also be mentioned. For many people, a repeated game is more than a learning scenario with T rounds.

Reproducibility: Yes

Review 3

Summary and Contributions: The paper considers a repeated game setting where the game may differ in each round but is characterized by a round-dependent, known context. The context affects the game payoffs, and provided that the difference in contexts bounds the difference in payoffs, the authors present algorithms that play contextual games with vanishing (contextual) regret, relating these results to contextual bandits. They then study the equilibrium outcome and show that players converge to this equilibrium when they play no-regret algorithms (mirroring the result for the static game case). The authors also show that the welfare of the equilibrium can be bounded for smooth games. Finally, they complement their theoretical work with an empirical analysis of contextual traffic routing.

Strengths: 1. The model is well-motivated and supported with empirical evaluation. The connection to contextual bandits is also welcome. 2. I appreciated the definition and analysis of the equilibrium that arises as a consequence of these repeated contextual games.

Weaknesses: 1. It would have been nice to study the equilibrium quality empirically as well, in addition to the evaluation of the algorithms.

Correctness: The claims and proofs seem correct.

Clarity: The paper is clear.

Relation to Prior Work: yes

Reproducibility: Yes

Additional Feedback: I was wondering how the notion of contextual games relates to that of succinct games (which are described by a polynomial number of parameters; see "Computing Correlated Equilibria in Multi-Player Games"). POST-REBUTTAL: I appreciate the authors' responses. Regarding welfare (I didn't do a good job describing this), I'd be interested to understand the quality of the equilibrium compared to the optimal welfare (say, an empirical version of the price of anarchy). The authors introduce a new equilibrium concept (which is quite broad, by analogy with CCE being a broad equilibrium class), but there is no discussion of whether solutions to this equilibrium concept have reasonable welfare (compared to the optimum). What the authors point out is that the plots show welfare comparisons against benchmarks that perform worse, but not against the optimal benchmark. I won't hold that interpretation of my comment against the authors, and I've kept the score the same.

Review 4

Summary and Contributions: This paper introduces a class of games called contextual games. These are repeated games in which the players receive an observable "context" at each round. The assumption leveraged by the rest of the paper is that similar contexts lead to similar outcomes (formalized through kernel-based regularity). The authors also introduce contextual regret and algorithms for minimizing it. Analogously, contextual (coarse) correlated equilibria are introduced, as they are closely related to minimizing the contextual regret. The paper also includes an experiment on a traffic routing problem, where the context represents the capacity of the network at each round.

Strengths: I enjoyed the paper; it is well motivated, and the introduced concepts are clear and well defined. The important results are theoretically grounded, and the presented theorems make sense. The experiment does not feel artificial and shows that the introduced methods and contexts can be used in practice.

Weaknesses: I feel that a limitation is the feedback model. The authors assume that after each round the agent not only receives a (noisy) observation, but also observes the actions of all the other players. This assumption is unrealistic in many settings. While I am not an expert in this particular line of work, some of the results also look like relatively simple combinations of previous work. This is not necessarily bad, as the authors do a very good job of referring to previous results as they introduce their techniques (which I appreciate). It still sometimes feels like putting existing things together in a straightforward way (e.g., GP-MW -> C.GP-MW, Strategy 2). A minor limitation is that the introduced models and techniques do not deal with sequential settings, but that could arguably be addressed in an extension (perhaps as future work).

Correctness: All seems correct.

Clarity: The paper is clearly written and reads well.

Relation to Prior Work: Previous work seems to be well referenced.

Reproducibility: Yes

Additional Feedback: Line 183: I think you are missing z_t as a parameter of the optimal policy function. Experiment: Maybe add a note that your reward function satisfies your assumptions. Post Rebuttal: No change to my score; I liked the paper and was happy with the feedback provided to the other reviewers' comments.