NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 5499 Bandits with Feedback Graphs and Switching Costs

### Reviewer 1

This work considers nonstochastic bandits with feedback graphs and switching cost. This is the combination of two settings that were previously studied. The regret of bandit learning with a known static feedback graph over K actions scales with sqrt{alpha*T} (ignoring log(K) factors), where alpha is the independent set of the graph, and this is provably tight. In the case of bandits with switching costs, the regret is K^{1/3}*T^{2/3}, which is also provably tight. In earlier work [RF2019], an upper bound of alpha^{1/3}*T^{2/3} was proven for the setting studied in this work. This bound holds even when the graph changes with time and is only revealed to the algorithm after each play (the so-called "uninformed" setting). This work fixes a mistake in the lower bound proven in [RF2019] and shows that alpha^{1/3}*T^{2/3} cannot be improved in the uninformed setting when the adversary controls both the loss sequence and the graph sequence. In the informed setting this works shows that the regret can be improved to \sqrt{beta}*T^{2/3}, where beta is the dominating set of the feedback graph. This is obtained by performing a star decomposition of the graph, running a bandit instance on each star, and combining these instances via a mini-batched version of the corraling algorithm of Agarwal et. al. A lower bound of beta^{1/3} T^{2/3} (as usual ignoring log factors) is proven for the class of disjoint union of stars in the informed setting. A further scenario considers policy regret with losses that depend of the past m actions, where m is fixed, and feedback graphs. The feedback model is peculiar: the loss of adjacent actions is only available when the same action was played in the last m steps. In this setting an upper bound of m^{1/3}*sqrt{b}*T^{2/3} is proven again using the minibatched version of the corraling algorithm. The motivation for combining switched regret with feedback graphs is a bit weak. The paper contains some interesting and clever ideas, especially in the design of the lower bounds. The use of the corraling technique feels suboptimal, and leaves indeed a gap between upper and lower bound. The feedback assumption for the policy regret feels quite arbitrary. Why restricting the feedback graph is needed to prove an upper bound on the regret? In summary, this submission provides suboptimal bounds in a weakly motivated setting. On the other hand, the techniques are nontrivial and there are some clever ideas in the proofs. AFTER AUTHOR FEEDBACK: After reading the author feedback, and after going through the other reviewers' comments, I decided to increase my score by 1 in recognition of the technical contributions.

### Reviewer 2

This paper studies adversarial multi-armed bandit with feedback graphs and switching costs: after playing an action, the player observes the losses of all neighboring actions and suffers an additional penalty if there was a switch. The algorithm is based on decomposing the graph into star graphs. For each star graph, they apply a local variant of Exp3 with mini-batches and combine them with a mini-batched variant of Corral. The authors provide upper-bound and lower-bound on the regret in terms of the domination number (\beta) of the graph. Strengths and weaknesses: + I liked the simplicity of the solution to divide the problem into star graphs. The domination number introduced seems to be a natural quantity for this problem. +/- To my opinion, the setting seems somewhat contrived combining feedback graphs and switching costs. The application to policy regret with counterfactual however provides a convincing example that the analysis can be useful and inspire future work. +/- The main part of the paper is rather clear and well written. Yet, I found the proofs in the appendices sometimes a bit hard to follow with sequences of unexplained equations. I would suggest to had some details. - There is a gap between the lower bound and the upper-bound (\sqrt(\beta) instead of \beta^{1/3}). In particular, for some graphs, the existing bound with the independence number may be better. This is also true for the results on the adaptive adversary and the counterfactual feedback. Other remarks: - Was the domination number already introduced for feedback graphs without switching costs? If yes, existing results for this problem should be cited. If not, it would be interesting to state what kind of results your analysis would provide without using the mini-batches. - Note that the length of the mini-batches tau_t may be non-integers. This should be clarified to be sure there are no side effects. For instance, what happens if $\tau_t << 1$? I am not sure if the analysis is still valid. - A better (more formal) definition of the independence and the domination numbers should be provided. It took me some time to understand their meaning. - Alg 1 and Thm 3.1: Since only upper-bounds on the pseudo-regret are provided, the exploration parameter gamma seems to be useless, isn't it? The choice gamma=0 seems to be optimal. A remark on high-probability upper-bounds and the role of gamma might be interesting. In particular, do you think your analysis (which is heavily based on expectations) can be extended to high-probability bounds on the regret? - I understand that this does not suit the analysis (which uses the equivalence in expectation btw Alg1 and Alg6) but it seems to be suboptimal (at least in practice) to discard all the feedbacks obtained while playing non-revealing actions. It would be nice to have practical experiments to understand better if we lose something here. It would be also nice to compare it with existing algorithms. Typos: - p2, l86: too many )) - Thm 3.1: A constant 2 in the number of switches is missing. - p13, l457: some notations seem to be undefined (w_t, W_t). - p14, you may add a remark - p15, l458: the number of switches can be upper-bounded by **twice** the number of times the revealing action is played - p16, l514: I did not understand why Thm 3.1 implies the condition of Thm C.5 with alpha=1/2 and not 1. By the way, (rho_t) should be non-decreasing for this condition to hold.

### Reviewer 3

The paper studies adversarial bandits with feedback graph and switching cost among arms. Both the feedback graphs and switching cost for bandit learning have been extensively studied, but combining two aspects together seem to be only covered by one recent prior work Rangi and Franceschetti [2019]. The main technical merit of the paper seems to be connecting the regret upper and lower bound with the dominance number of the feedback graph, denoted as \beta(G). Prior studies only connect regret bounds with independence number \alpha(G), which by definition is greater than or equal to the dominance number. This seems to be a useful advancement in the study of feedback graphs with switching costs. However, the technical result given in the paper is not really a technical improvement of the prior result, because the result of Rangi and Franceschetti [2019] show a regret upper bound of $O(\alpha(G)^{1/3} T^{2/3})$, while the current paper only shows $O(\sqrt{\beta(G)} T^{2/3})$ upper bound. Therefore the term $\sqrt{\beta{G}}$ does not beat $\alpha(G)^{1/3}$ technically, and the authors only conjecture that some adaptation of the algorithm could achieve $\beta{G}^{1/3}$ but was not able to realize that. This could be viewed as a technical limitation of the paper. Besides the above limitations, I also few that some other limitations of the paper prevent it to be more impactful. One is on the motivation. It looks to me that combining feedback graphs with switching cost is more from a theoretical curiosity in extending our knowledge on various bandit algorithms rather than from demanding application scenarios. I do not see a compelling argument in the paper on why this is particularly useful in practice. From this aspect, perhaps the paper would be appreciated more in a more theory oriented conference such as COLT or AISTATS. Another is on the technical novelty of the paper. It looks like both the upper bound and lower bound analyses are based on existing techniques, such as mini-batching and corralling algorithm for the upper bound, and multi-scale random walk for the lower bound. I do clearly see the technical novelty of the paper. The third issue is on empirical evaluation. Given the above limitations, I would feel that some empirical evaluations of the proposed algorithm and comparison with prior work would compensate the incremental nature of the work, but unfortunately the authors do not provide any empirical evidence of the superiority of the proposed algorithm. Overall, I feel that the paper has some interesting theoretical advancement in the theoretical study of bandit algorithms, but I have some reservation on its overall impact, and thus gives the current marginal acceptance score.