{"title": "Bandits with Feedback Graphs and Switching Costs", "book": "Advances in Neural Information Processing Systems", "page_first": 10397, "page_last": 10407, "abstract": "We study the adversarial multi-armed bandit problem where the learner is supplied with partial observations modeled by a \\emph{feedback graph} and where shifting to a new action incurs a fixed \\emph{switching cost}. We give two new algorithms for this problem in the informed setting. Our best algorithm achieves a pseudo-regret of $\\tilde O(\\gamma(G)^{\\frac{1}{3}}T^{\\frac{2}{3}})$, where $\\gamma(G)$ is the domination number of the feedback graph. This significantly improves upon the previous best result for the same problem, which was based on the independence number of $G$. We also present matching lower bounds for our result that we describe in detail. Finally, we give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.", "full_text": "Bandits with Feedback Graphs and Switching Costs\n\nRaman Arora\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21204\narora@cs.jhu.edu\n\nTeodor V. Marinov\n\nDept. of Computer Science\nJohns Hopkins University\n\nBaltimore, MD 21204\ntmarino2@jhu.edu\n\nCourant Institute of Math. Sciences\n\nMehryar Mohri\nGoogle Research &\n\nNew York, NY 10012\nmohri@google.com\n\nAbstract\n\nWe study the adversarial multi-armed bandit problem where the learner is supplied\nwith partial observations modeled by a feedback graph and where shifting to a new\naction incurs a \ufb01xed switching cost. We give two new algorithms for this problem in\nthe informed setting. Our best algorithm achieves a pseudo-regret of \u02dcO((G) 1\n3 ),\nwhere (G) is the domination number of the feedback graph. This signi\ufb01cantly\nimproves upon the previous best result for the same problem, which was based on\nthe independence number of G. 
We also present matching lower bounds for our result, which we describe in detail. Finally, we give a new algorithm with improved policy regret bounds when partial counterfactual feedback is available.

1 Introduction

A general framework for sequential learning is that of online prediction with expert advice [Littlestone and Warmuth, 1994, Cesa-Bianchi et al., 1997, Freund and Schapire, 1997], which consists of repeated interactions between a learner and the environment. The learner maintains a distribution over a set of experts or actions. At each round, the loss assigned to each action is revealed. The learner incurs the expected value of these losses under her current distribution and then updates that distribution. The learner's goal is to minimize her regret, which, in the simplest case, is defined as the difference between her cumulative loss over a finite number of rounds of interaction and that of the best expert in hindsight. The scenario just described corresponds to the so-called full information setting, where the learner is informed of the loss of all actions at each round. In the bandit setting, only the loss of the selected action is revealed to the learner. These settings are both special instances of a general model of online learning with side information introduced by Mannor and Shamir [2011], where the information available to the learner is specified by a feedback graph. In an undirected feedback graph, each vertex represents an action, and an edge between vertices $a$ and $a'$ indicates that the loss of action $a'$ is observed when action $a$ is selected, and vice-versa. The bandit setting corresponds to a feedback graph reduced to only self-loops at each vertex, and the full information setting to a fully connected graph. Online learning with feedback graphs has been further extensively analyzed by Alon et al.
[2013, 2017] and several other authors [Alon et al., 2015, Kocák et al., 2014, Cohen et al., 2016, Yun et al., 2018, Cortes et al., 2018].

In many applications, the learner also incurs a cost when switching to a new action. Consider, for example, a commercial bank that issues various credit card products, many of which are similar, e.g., different branded cards with comparable fees and interest rates. At each round, the bank offers a specific product to a particular sub-population (e.g., customers at a store). The payoff observed for this action also reveals feedback for related cards and similar sub-populations. At the same time, offering a different product to a group incurs a switching cost in terms of designing a new marketing campaign. Another example of a problem with a feedback graph and switching costs is a large company seeking to allocate and reallocate employees to different tasks so that productivity is maximized. Employees with similar skills, e.g., technical expertise or people skills, can be expected to perform comparably on the same task. Reassigning employees between tasks, however, is associated with a cost for retraining and readjustment time. We refer the reader to Appendix B for more motivating examples.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The focus of this paper is to understand the fundamental tradeoffs between exploration and exploitation in online learning with feedback graphs and switching costs, and to design learning algorithms with provably optimal guarantees. We consider the general case of a feedback graph $G$ with a set of vertices or actions $V$. In the expert setting with no switching cost, the min-max optimal regret is achieved by the weighted-majority or Hedge algorithm [Littlestone and Warmuth, 1994, Freund and Schapire, 1997], and is in $\Theta(\sqrt{\log(|V|)\, T})$.
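As a point of reference, the exponential-weights update behind weighted-majority/Hedge can be sketched in a few lines. This is an illustrative sketch with our own function name and interface, not the paper's implementation:

```python
import numpy as np

def hedge(losses, eta):
    """Exponential-weights (Hedge) over a loss matrix.

    losses: (T, K) array of losses in [0, 1]; eta: learning rate.
    Returns the algorithm's total expected loss.
    """
    T, K = losses.shape
    log_w = np.zeros(K)               # log-weights, for numerical stability
    total = 0.0
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()                  # current distribution over experts
        total += p @ losses[t]        # expected loss under that distribution
        log_w -= eta * losses[t]      # multiplicative-weights update
    return total
```

With a suitable learning rate, the cumulative loss of this update trails the best expert by the $O(\sqrt{\log(|V|)\, T})$ margin quoted above.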
In the bandit setting, the extension of these algorithms, EXP3 [Auer et al., 2002], achieves a regret of $O(\sqrt{|V| \log(|V|)\, T})$. The min-max optimal regret of $\Theta(\sqrt{|V| T})$ can be achieved by the INF algorithm [Audibert and Bubeck, 2009]. The $\sqrt{|V|}$ term in the bandit setting is inherently related to the additional exploration needed to observe the loss of all actions.

The scenario of online learning with side information modeled by feedback graphs, which interpolates between the full information and the bandit settings, was introduced by Mannor and Shamir [2011]. When the feedback graph $G$ is fixed over time and is undirected, a regret of the order $O(\sqrt{\alpha(G) \log(|V|)\, T})$ can be achieved, with a lower bound of $\Omega(\sqrt{\alpha(G) T})$, where $\alpha(G)$ denotes the independence number of $G$. There has been a large body of work studying different settings of this problem with time-varying graphs $(G_t)_{t=1}^T$, in both the directed and undirected cases, and in both the so-called informed setting, where, at each round, the learner receives the graph before selecting an action, and the uninformed setting, where it is only made available after the learner has selected an action and updated its distribution [Alon et al., 2013, Kocák et al., 2014, Alon et al., 2015, Cohen et al., 2016, Alon et al., 2017, Cortes et al., 2018].

For the expert setting augmented with switching costs, the min-max optimal regret remains in $\tilde\Theta(\sqrt{\log(|V|)\, T})$. However, classical algorithms such as Hedge or Follow-the-Perturbed-Leader [Kalai and Vempala, 2005] no longer achieve the optimal regret bound. Several algorithms designed by Kalai and Vempala [2005], Geulen et al. [2010], Gyorgy and Neu [2014] do achieve this min-max optimal regret. In the setting of bandits with switching costs, the lower bound was carefully investigated by Cesa-Bianchi et al. [2013] and Dekel et al. [2014] and shown to be in $\tilde\Omega(|V|^{\frac{1}{3}} T^{\frac{2}{3}})$.
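For concreteness, the importance-weighted loss estimator underlying EXP3 can be sketched as follows. This is illustrative code under our own naming; it omits the explicit exploration term and the mini-batching modifications that are the subject of this paper:

```python
import numpy as np

def exp3(losses, eta, seed=None):
    """EXP3 with importance-weighted loss estimates (illustrative sketch).

    losses: (T, K) losses in [0, 1]; only the played action's loss is observed.
    Returns the sequence of played actions.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    log_w = np.zeros(K)
    actions = []
    for t in range(T):
        p = np.exp(log_w - log_w.max())
        p /= p.sum()
        a = rng.choice(K, p=p)         # play a single action
        actions.append(a)
        est = np.zeros(K)
        est[a] = losses[t, a] / p[a]   # unbiased importance-weighted estimate
        log_w -= eta * est
    return actions
```

Because only one coordinate of the estimate is nonzero, its variance scales with $1/p_t(a)$, which is the source of the extra $\sqrt{|V|}$ factor relative to the full information setting.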
This lower bound is asymptotically matched by mini-batching the EXP3 algorithm, as proposed by Arora et al. [2012].

The only work we are familiar with that studies both bandits with switching costs and side information is that of Rangi and Franceschetti [2019]. The authors propose two algorithms for time-varying feedback graphs in the uninformed setting. When reduced to the fixed feedback graph setting, their regret bound becomes $\tilde O(\alpha(G)^{\frac{1}{3}} T^{\frac{2}{3}})$. We note that, in the informed setting with a fixed feedback graph, this bound can be achieved by applying the mini-batching technique of Arora et al. [2012] to the EXP3-SET algorithm of Alon et al. [2013].

Our main contributions are two-fold. First, we propose two algorithms for online learning in the informed setting with a fixed feedback graph $G$ and switching costs. Our best algorithm admits a pseudo-regret bound in $\tilde O(\gamma(G)^{\frac{1}{3}} T^{\frac{2}{3}})$, where $\gamma(G)$ is the domination number of $G$. We note that the domination number $\gamma(G)$ can be substantially smaller than the independence number $\alpha(G)$, and therefore that our algorithm significantly improves upon the previous work of Rangi and Franceschetti [2019] in the informed setting. We also extend our results to achieve a policy regret bound in $\tilde O((m\gamma(G))^{\frac{1}{3}} T^{\frac{2}{3}})$ when partial counterfactual feedback is available. The $\tilde O(\gamma(G)^{\frac{1}{3}} T^{\frac{2}{3}})$ regret bound in the switching costs setting might seem at odds with a lower bound stated by Rangi and Franceschetti [2019]. However, the lower bound of Rangi and Franceschetti [2019] can be shown to be technically inaccurate (see Appendix C). Our second main contribution is a lower bound in $\tilde\Omega(T^{\frac{2}{3}})$ for any non-complete feedback graph. We also extend this lower bound to $\tilde\Omega(\gamma(G)^{\frac{1}{3}} T^{\frac{2}{3}})$ for a class of feedback graphs that we describe in detail. In Appendix I, we show a lower bound for the setting of evolving feedback graphs, matching the originally stated lower bound in [Rangi and Franceschetti, 2019].

The rest of this paper is organized as follows. In Section 2, we describe in detail the setup we analyze and introduce the relevant notation. In Section 3, we describe our main algorithms and results. We further extend our algorithms and analysis to the setting of online learning in reactive environments (Section 4). In Section 5, we present and discuss in detail lower bounds for this problem.

2 Problem Setup and Notation

We study a repeated game between an adversary and a player over $T$ rounds. For any $n \in \mathbb{N}$, we denote by $[n]$ the set of integers $\{1, \ldots, n\}$. At each round $t \in [T]$, the player selects an action $a_t \in V$ and incurs a loss $\ell_t(a_t)$, as well as a cost of one if switching between distinct actions in consecutive rounds ($a_t \neq a_{t-1}$). For convenience, we define $a_0$ as an element not in $V$, so that the first action always incurs a switching cost. The regret $R_T$ of any sequence of actions $(a_t)_{t=1}^T$ is thus defined by $R_T = \max_{a \in V} \sum_{t=1}^T \ell_t(a_t) - \ell_t(a) + M$, where $M = \sum_{t=1}^T 1_{a_t \neq a_{t-1}}$ is the number of action switches in that sequence. We will assume an oblivious adversary, or, equivalently, that the sequence of losses for all actions is determined by the adversary before the start of the game. The performance of an algorithm $A$ in this setting is measured by its pseudo-regret $R_T(A)$, defined by

$$R_T(A) = \max_{a \in V} \mathbb{E}\Bigg[\sum_{t=1}^T \Big(\ell_t(a_t) + 1_{a_t \neq a_{t-1}}\Big) - \ell_t(a)\Bigg],$$

where the expectation is taken over the player's randomized choice of actions. The regret of $A$ is defined as $\mathbb{E}[R_T]$, with the expectation outside of the maximum. In the following, we will abusively refer to $R_T(A)$ as the regret of $A$, to shorten the terminology.

We also assume that the player has access to an undirected graph $G = (V, E)$, which determines which expert losses can be observed at each round.
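The pseudo-regret definition above, with its unit switching cost and the convention that the first action always counts as a switch, can be evaluated directly for a realized play sequence. The following is a small sketch of ours, not part of the paper:

```python
def regret_with_switching(losses, actions):
    """Realized regret with unit switching costs.

    losses: losses[t][a] is the loss of action a at round t.
    actions: the played sequence (a_t); a_0 lies outside V, so the
    first action always incurs a switching cost.
    """
    T = len(actions)
    switches = sum(1 for t in range(T) if t == 0 or actions[t] != actions[t - 1])
    play_loss = sum(losses[t][actions[t]] for t in range(T))
    best_loss = min(sum(losses[t][a] for t in range(T))
                    for a in range(len(losses[0])))
    return play_loss + switches - best_loss
```

For example, alternating between two actions whose losses alternate in the opposite pattern achieves zero playing loss but pays the full switching cost, which is exactly the tension the lower bounds in Section 5 exploit.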
The vertex set $V$ is the set of experts (or actions), and the graph specifies that, if at round $t$ the player selects action $a_t$, then the losses of all experts whose vertices are adjacent to that of $a_t$ can be observed: $\ell_t(a)$ for $a \in N(a_t)$, where $N(a_t)$ denotes the neighborhood of $a_t$ in $G$, defined for any $u \in V$ by $N(u) = \{v : (u, v) \in E\}$. We will denote by $\deg(u) = |N(u)|$ the degree of $u \in V$ in graph $G$. We assume that $G$ admits a self-loop at every vertex, which implies that the player can at least observe the loss of their own action (bandit information). In all our figures, self-loops are omitted for the sake of simplicity.

We assume that the feedback graph is available to the player at the beginning of the game (informed setting). The independence number of $G$ is the size of a maximum independent set in $G$ and is denoted by $\alpha(G)$. The domination number of $G$ is the size of a minimum dominating set and is denoted by $\gamma(G)$. The following inequality holds for all graphs $G$: $\gamma(G) \leq \alpha(G)$ [Bollobás and Cockayne, 1979, Goddard and Henning, 2013]. In general, $\gamma(G)$ can be substantially smaller than $\alpha(G)$, with $\gamma(G) = 1$ and $\alpha(G) = |V| - 1$ in some cases. We note that all our results can be straightforwardly extended to the case of directed graphs.

3 An Adaptive Mini-batch Algorithm

In this section, we describe an algorithm for online learning with switching costs, using adaptive mini-batches. All proofs are deferred to Appendix D.

The standard exploration versus exploitation dilemma in the bandit setting is further complicated in the presence of a feedback graph: if a poor action reveals the losses of all other actions, do we play the poor action?
The lower bound construction of Mannor and Shamir [2011] suggests that we should not, since it might be better to just switch between the other actions. Adding switching costs, however, modifies the price of exploration, and the lower bound argument of Mannor and Shamir [2011] no longer holds. It is in fact possible to show that EXP3 and its graph feedback variants switch too often in the presence of two good actions, thereby incurring $\Omega(T)$ regret due to the switching costs. One way to deal with the switching costs problem is to adapt the fixed mini-batch technique of Arora et al. [2012]. That technique, however, treats all actions equally, while, in the presence of switching costs, actions that provide additional information are more valuable. We deal with the issues just discussed by adopting the idea that the mini-batch sizes could depend both on how favorable an action is and on how much information an action provides about good actions.

Algorithm 1 Algorithm for star graphs
Input: Star graph $G(V, E)$, learning rates $(\eta_t)$, exploration rate $\gamma \in [0, 1]$, maximum mini-batch $\tau$.
Output: Action sequence $(a_t)_{t=1}^T$.
1: $q_1 = \frac{1}{|V|}$
2: while $\sum_t \lfloor \tau_t \rfloor \leq T$ do
3:   $p_t = (1 - \gamma) q_t + \gamma\, \delta(r)$   % $\delta(r)$ is the Dirac distribution on $r$
4:   Draw $a_t \sim p_t$, set $\tau_t = p_t(r)\, \tau$
5:   if $a_{t-1} \neq r$ and $a_t \neq r$ then
6:     Set $a_t = a_{t-1}$
7:   end if
8:   Play $a_t$ for the next $\lfloor \tau_t \rfloor$ iterations
9:   Set $\hat\ell_t(i) = \sum_{j=t}^{t + \lfloor \tau_t \rfloor - 1} \frac{I(a_t = r)\, \ell_j(i)}{p_t(r)}$
10:  For all $i \in V$, $q_{t+1}(i) = \frac{q_t(i) \exp(-\eta_t \hat\ell_t(i))}{\sum_{j \in V} q_t(j) \exp(-\eta_t \hat\ell_t(j))}$
11:  $t = t + 1$
12: end while

3.1 Algorithm for Star Graphs

We start by studying a simple feedback graph case in which one action is adjacent to all other actions, with none of these other actions admitting other neighbors. For an example, see Figure 1. We call such graphs star graphs, and we refer to the action adjacent to all other actions as the revealing action.
The revealing action is denoted by $r$. Since only the revealing action can convey additional information about other actions, we select our mini-batch size to be proportional to the quality of this action. Also, to prevent our algorithm from switching between two non-revealing actions too often, we simply disallow that, and permit switching only between the revealing action and a non-revealing action. Finally, we disregard any feedback a non-revealing action provides us. This simplifies the analysis of the regret of our algorithm. The pseudocode of the algorithm is given in Algorithm 1.

Figure 1: Example of a star graph.

The following intuition guides the design of our algorithm and its analysis. We need to visit the revealing action sufficiently often to derive information about all other actions, which is governed by the explicit exploration factor $\gamma$. If $r$ is a good action, our regret will not be too large if we visit it often and spend a large amount of time there. On the other hand, if $r$ is poor, then the algorithm should not sample it often and, when it does, it should not spend too much time there. Disallowing direct switches between non-revealing actions also prevents the algorithm from switching between two good non-revealing actions too often. The only remaining question is: do we observe enough information about each action to be able to devise a low regret strategy? The following regret guarantee provides a precise positive response.

Theorem 3.1. Suppose that the inequality $\mathbb{E}[\ell_t^2(i)] \leq \rho$ holds for all $t \leq T$ and all $i \in V$, for some $\rho$, and that $\gamma \geq \frac{1}{\tau}$. Then, for any action $a \in V$, Algorithm 1 admits the following guarantee:

$$\mathbb{E}\Bigg[\sum_{t=1}^T \ell_t(a_t) - \ell_t(a)\Bigg] \leq \frac{\log(|V|)}{\eta} + T \eta \tau \rho + \gamma T.$$

Furthermore, the algorithm does not switch more than $2T/\tau$ times, in expectation.

The exploration parameter $\gamma$ is needed to ensure that $\tau_t = p_t(r)\tau \geq 1$, so that at every iteration of the while loop Algorithm 1 plays at least one action. The bound assumed on the second moment $\mathbb{E}[\ell_t^2(i)]$ might seem unusual, since in the adversarial setting we do not assume a randomization of the losses. For now, the reader can simply take this as a bound on the squared loss, that is, $\ell_t^2(i) \leq \rho$. The role of this expectation and the source of the randomness will become clear in Section 3.3. We note that the star graph admits independence number $\alpha(G) = |V| - 1$ and domination number $\gamma(G) = 1$. In this case, the algorithms of Rangi and Franceschetti [2019] and variants of the mini-batching algorithm only guarantee a regret bound of the order $\tilde O(\alpha(G)^{\frac{1}{3}} T^{\frac{2}{3}})$, while Algorithm 1 guarantees a regret bound of the order $\tilde O(T^{\frac{2}{3}})$ when we set $\eta = 1/T^{\frac{2}{3}}$, $\tau = T^{\frac{1}{3}}$, and $\gamma = 1/T^{\frac{1}{3}}$.

3.2 Algorithm for General Feedback Graphs

We now extend Algorithm 1 to handle arbitrary feedback graphs.
The pseudocode of this more general algorithm is given in Algorithm 2.

Algorithm 2 Algorithm for general feedback graphs
Input: Graph $G(V, E)$, learning rates $(\eta_t)$, exploration rate $\gamma \in [0, 1]$, maximum mini-batch $\tau$.
Output: Action sequence $(a_t)_t$.
1: Compute an approximate dominating set $R$
2: $q_1 \equiv Unif(V)$, $u \equiv Unif(R)$
3: while $\sum_t \tau_t \leq T$ do
4:   $p_t = (1 - \gamma) q_t + \gamma u$
5:   Draw $i \sim p_t$, set $\tau_t = p_t(r_i)\, \tau$, where $r_i$ is the dominating vertex for $i$, and set $a_t = i$
6:   if $a_{t-1} \notin R$ and $a_t \notin R$ then
7:     Set $a_t = a_{t-1}$
8:   end if
9:   Play $a_t$ for the next $\lfloor \tau_t \rfloor$ iterations
10:  Set $\hat\ell_t(i) = \sum_{j=t}^{t + \lfloor \tau_t \rfloor - 1} \frac{I(a_t = r_i)\, \ell_j(i)}{p_t(r_i)}$
11:  For all $i \in V$, $q_{t+1}(i) = \frac{q_t(i) \exp(-\eta_t \hat\ell_t(i))}{\sum_{j \in V} q_t(j) \exp(-\eta_t \hat\ell_t(j))}$
12:  $t = t + 1$
13: end while

The first step of Algorithm 2 consists of computing an approximate minimum dominating set for $G$ using the Greedy Set Cover algorithm [Chvatal, 1979]. The Greedy Set Cover algorithm naturally partitions $G$ into disjoint star graphs with revealing actions/vertices in the dominating set $R$. Next, Algorithm 2 associates with each star graph its revealing arm $r \in R$. The mini-batch size at time $t$ now depends on the probability $p_t(r)$ of sampling a revealing action $r$, as in Algorithm 1. There are several key differences, however, which we now point out. Unlike in Algorithm 1, the mini-batch size can change between rounds even if the action remains fixed. This occurs when the newly sampled action is associated with a new revealing action in $R$ but is different from that revealing action. This difference introduces some complications, because $\tau_t$ conditioned on all prior actions $a_{1:t-1}$ is still a random variable, while it is deterministic in Algorithm 1. We also allow switches between any action and any vertex $r \in R$. This might seem to be a peculiar choice.
For example, allowing only switches within each star graph in the partition, and only between revealing vertices, seems more natural. Allowing switches between any vertex and any revealing action benefits exploration while still being sufficient for controlling the number of switches. If we further constrain the number of switches by using the more natural approach, it is possible that not enough information is received about each action, leading to worse regret guarantees. We leave the investigation of such more natural approaches to future work. Algorithm 2 admits the following regret bound.

Theorem 3.2. For any $\gamma \geq \frac{|R|}{\tau}$, the expected regret of Algorithm 2 is at most

$$\frac{\log(|V|)}{\eta} + \eta \tau T + \gamma T.$$

Further, if the algorithm is augmented similarly to Algorithm 7, then it switches between actions at most $\frac{2T|R|}{\tau}$ times.

Setting $\eta = 1/(|R|^{\frac{1}{3}} T^{\frac{2}{3}})$, $\tau = |R|^{\frac{2}{3}} T^{\frac{1}{3}}$, and $\gamma = |R|^{\frac{1}{3}}/T^{\frac{1}{3}}$ recovers a pseudo-regret bound of $\tilde O(|R|^{\frac{1}{3}} T^{\frac{2}{3}})$, with an expected number of switches bounded by $2|R|^{\frac{1}{3}} T^{\frac{2}{3}}$. We note that $|R| = O(\gamma(G) \log(|V|))$, and thus the regret bound of our algorithm scales like $\gamma(G)^{\frac{1}{3}} T^{\frac{2}{3}}$. Further, this is a strict improvement over the results of Rangi and Franceschetti [2019], as their result shows a scaling of $\alpha(G)^{\frac{1}{3}} T^{\frac{2}{3}}$. The proof of Theorem 3.2 can be found in Appendix D.3.

3.3 Corralling Star Graph Algorithms

An alternative natural method to tackle the general feedback graph problem is to use the recent corralling algorithm of Agarwal et al. [2016]. Corralling star graph algorithms was in fact our initial approach. In this section, we describe that technique, even though it does not seem to achieve an optimal rate. Here too, the first step consists of computing an approximate minimum dominating set. Next, we initialize an instance of Algorithm 1 for each star graph. Finally, we combine all of the star graph algorithms via a mini-batched version of the corralling algorithm of Agarwal et al. [2016]. Mini-batching is necessary to avoid switching between star graph algorithms too often. The pseudocode of this algorithm is given in Algorithm 3. Since during each mini-batch we sample a single star graph algorithm, we need to construct appropriate unbiased estimators of the losses $\ell_{j_t}$, which we feed back to the sampled star graph algorithm. The bound on the second moment of these estimators is exactly what Theorem 3.1 requires.

Algorithm 3 Corralling star-graph algorithms
Input: Feedback graph $G(V, E)$, learning rate $\eta$, mini-batch size $\tau$.
Output: Action sequence $(a_t)_{t=1}^T$.
1: Compute an approximate minimum dominating set $R$ and initialize $|R|$ base star-graph algorithms $B_1, B_2, \ldots, B_{|R|}$ (Algorithm 1), each with step size $\frac{\eta'}{2|R|}$, mini-batch size $\tau$, and exploration rate $1/\tau$.
2: $T' = \frac{T}{\tau}$, $\gamma = \frac{1}{T'}$, $\tilde\eta = \exp\big(\frac{1}{\log(T)}\big)$, $\eta_{1,i} = \eta$, $\rho_{1,i} = 2|R|$ for all $i \in [|R|]$, $q_1 = p_1 = \frac{1}{|R|}$
3: for $t = 1, \ldots, T'$ do
4:   Draw $i_t \sim p_t$
5:   for $j_t = (t-1)\tau + 1, \ldots, (t-1)\tau + \tau$ do
6:     Receive action $a^i_{j_t}$ from $B_i$ for all $i \in [|R|]$
7:     Set $a_{j_t} = a^{i_t}_{j_t}$, play $a_{j_t}$ and observe loss $\ell_{j_t}(a_{j_t})$
8:     Send $\frac{\ell_{j_t}(a_{j_t})}{p_t(i_t)} I\{i = i_t\}$ as loss to algorithm $B_i$ for all $i \in [|R|]$
9:     Update $\hat\ell_t(i) = \hat\ell_t(i) + \frac{1}{\tau} \frac{\ell_{j_t}(a_{j_t})}{p_t(i_t)} I\{i = i_t\}$
10:  end for
11:  Update $q_{t+1} = $ Algorithm 4$(q_t, \hat\ell_t, \eta_t)$
12:  Set $p_{t+1} = (1 - \gamma) q_{t+1} + \gamma \frac{1}{|R|}$
13:  for $i = 1, \ldots, |R|$ do
14:    if $\frac{1}{p_t(i)} > \rho_{t,i}$ then
15:      Set $\rho_{t+1,i} = \frac{2}{p_t(i)}$, $\eta_{t+1,i} = \tilde\eta\, \eta_{t,i}$, and restart the $i$-th star-graph algorithm with updated step size $\frac{\eta'}{\rho_{t+1,i}}$
16:    else
17:      Set $\rho_{t+1,i} = \rho_{t,i}$, $\eta_{t+1,i} = \eta_{t,i}$
18:    end if
19:  end for
20: end for

Algorithm 4 Log-Barrier-OMD$(q_t, \ell_t, \eta_t)$
Input: Previous distribution $q_t$, loss vector $\ell_t$, learning rate vector $\eta_t$.
Output: Updated distribution $q_{t+1}$.
1: Find $\lambda \in [\min_i \ell_t(i), \max_i \ell_t(i)]$ such that $\sum_{i=1}^{|R|} \frac{1}{\frac{1}{q_t(i)} + \eta_{t,i}(\ell_t(i) - \lambda)} = 1$
2: Return $q_{t+1}$ such that $\frac{1}{q_{t+1}(i)} = \frac{1}{q_t(i)} + \eta_{t,i}(\ell_t(i) - \lambda)$

Our algorithm admits the following guarantees.

Theorem 3.3. Let $\tau = T^{\frac{1}{4}}$, $\eta = |R|^{\frac{1}{3}}/(T^{\frac{1}{3}} \log(|V|))$, and $\eta' = 1/(40c \log(T')\, T^{\frac{1}{2}})$, where $c$ is a constant independent of $T$, $\tau$, $|V|$ and $|R|$.
Then, for any $a \in V$, the following inequality holds for Algorithm 3:

$$\mathbb{E}\Bigg[\sum_{t=1}^T \ell_t(a_t) - \ell_t(a)\Bigg] \leq \tilde O\Big(\sqrt{|R|}\, T^{\frac{2}{3}}\Big).$$

Furthermore, the expected number of switches of the algorithm is bounded by $T^{\frac{2}{3}} |R|^{\frac{1}{3}}$.

This bound is suboptimal compared to the $\gamma(G)^{\frac{1}{3}}$-dependency achieved by Algorithm 2. We conjecture that this gap is an artifact of the analysis of the corralling algorithm of Agarwal et al. [2016]. However, we were unable to improve on the current regret bound by simply corralling.

4 Policy Regret with Partial Counterfactual Feedback

In this section, we consider games played against an adaptive adversary, who can select losses based on the player's past actions. In that scenario, the notion of pseudo-regret is no longer meaningful or interpretable, as pointed out by Arora et al. [2012]. Instead, the authors proposed the notion of policy regret, defined as follows: $\max_{a \in V} \sum_{t=1}^T \ell_t(a_1, \ldots, a_t) - \sum_{t=1}^T \ell_t(a, \ldots, a)$, where the benchmark action $a$ does not depend on the player's actions. Since it is impossible to achieve $o(T)$ policy regret when the $t$-th loss is allowed to depend on all past actions of the player, the authors made the natural assumption that the adversary is $m$-memory bounded, that is, that the $t$-th loss can only depend on the past $m$ actions chosen by the player. In that case, the known min-max policy regret bounds are in $\tilde\Theta(|V|^{\frac{1}{3}} T^{\frac{2}{3}})$ [Dekel et al., 2014], ignoring the dependency on $m$.

Here, we show that the dependency on $|V|$ can be improved in the presence of partial counterfactual feedback. We assume that partial feedback on losses with memory $m$ is available. We restrict the feedback graph to admitting only vertices for repeated $m$-tuples of actions in $V$; that is, we can only observe additional feedback for losses of the type $\ell_t(a, a, \ldots, a)$, where $a \in V$. For a motivating example, consider the problem of prescribing treatment plans to incoming patients with certain disorders. Two patients that are similar, for example patients in the same disease sub-type or with similar physiological attributes, when prescribed different treatments, reveal counterfactual feedback about alternative treatments for each other.

Our algorithm for incorporating such partial feedback to minimize policy regret is based on our algorithm for general feedback graphs (Algorithm 2). The learner receives feedback about $m$-memory bounded losses in the form of $m$-tuples. We simplify the representation by replacing each $m$-tuple vertex in the graph by a single action, that is, vertex $(a, \ldots, a)$ is represented as $a$.

Algorithm 5 Policy regret with side observations
Input: Feedback graph $G(V, E)$, learning rate $\eta$, mini-batch size $\tau$, where $\eta$ and $\tau$ are set as in Theorem 3.3.
Output: Action sequence $(a_t)_t$.
1: Transform feedback graph $G$ from $m$-tuples to actions and initialize Algorithm 2.
2: for $t = 1, \ldots, T/m$ do
3:   Sample action $a_t$ from $p_t$ generated by Algorithm 2 and play it for the next $m$ rounds.
4:   if $a_{t-1} = a_t$ then
5:     Observe mini-batched loss $\hat\ell_t(a_t) = \frac{1}{m} \sum_{j=1}^m \ell_{(t-1)m+j}(a_t)$ and additional side observations. Feed mini-batched loss and additional side observations to Algorithm 2.
6:   else
7:     Set $\hat\ell_t(a_t) = 0$ and set additional feedback losses to 0. Feed losses to Algorithm 2.
8:   end if
9: end for

As described in Algorithm 5, the input stream of $T$ losses is split into mini-batches of size $m$, indexed by $t$, such that $\hat\ell_t(\cdot) = \frac{1}{m} \sum_{j=1}^m \ell_{(t-1)m+j}(\cdot)$. This sequence of losses, $(\hat\ell_t)_{t=1}^{T/m}$, could be fed as input to Algorithm 2 if it were not for the constraint on the additional feedback.
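The batching of the loss stream into the averages $\hat\ell_t$ can be sketched as follows. This is an illustrative helper of ours, assuming $T$ is divisible by $m$:

```python
def minibatch_losses(losses, m):
    """Average a loss stream into batches of size m.

    losses: list of per-round loss vectors (length T, each of length K).
    Returns T/m averaged loss vectors, the t-th being
    (1/m) * sum of losses over rounds (t-1)m+1 .. tm.
    """
    T, K = len(losses), len(losses[0])
    batched = []
    for t in range(0, T, m):
        avg = [sum(losses[j][a] for j in range(t, t + m)) / m for a in range(K)]
        batched.append(avg)
    return batched
```

Holding the action fixed within each batch is what makes the averaged loss meaningful against an $m$-memory bounded adversary: after $m$ repetitions, the loss depends only on the current action.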
Suppose that between the $t$-th mini-batch and the $(t+1)$-st mini-batch, Algorithm 2 decides to switch actions, so that $a_{t+1} \neq a_t$. In that case, no additional feedback is available for $\hat\ell_{t+1}(a_{t+1})$ and the algorithm cannot proceed as normal. To fix this minor issue, the feedback provided to Algorithm 2 is that the loss of action $a_{t+1}$ was 0 and that all actions adjacent to $a_{t+1}$ also incurred 0 loss. This modification of losses cannot occur more often than the number of switches performed by Algorithm 2. Since the expected number of switches is bounded by $O(\gamma(G)^{\frac{1}{3}} T^{\frac{2}{3}})$, the modification does not affect the total expected regret.

Theorem 4.1. The expected policy regret of Algorithm 5 is bounded as $\tilde O((m\gamma(G))^{\frac{1}{3}} T^{\frac{2}{3}})$.

The proof of the above theorem can be found in Appendix E. Let us point out that Algorithm 5 requires knowledge of (or an upper bound on) the memory of the adversary, unlike the algorithm proposed by Arora et al. [2012]. We conjecture that this is due to the adaptive mini-batch technique of our algorithm. In particular, we believe that for $m$-memory bounded adversaries, it is necessary to repeat each sampled action $a_t$ at least $m$ times.

5 Lower Bound

The main tool for constructing lower bounds when switching costs are involved is the stochastic process constructed by Dekel et al. [2014]. The crux of the proof consists of a carefully designed multi-scale random walk. The two characteristics of this random walk are its depth and its width. At time $t$, the depth of the walk is the number of previous rounds on which the value of the current round depends. The width of the walk measures how far apart in time two rounds that depend on each other are. The loss of each action is equal to the value of the random walk at each time step, and the loss of the best action is slightly better by a small positive constant.
The depth of the process controls how well the losses concentrate in the interval $[0, 1]$.¹ The width of the walk controls the variance between losses of different actions and ensures that it is impossible to gain information about the best action unless one switches between different actions.

5.1 Lower Bound for Non-complete Graphs

We first verify that the dependence on the time horizon cannot be improved from $T^{\frac{2}{3}}$ for any feedback graph in which at least one edge is missing, that is, in which there exist two vertices that do not reveal information about each other. Without loss of generality, assume that the two vertices not joined by an edge are $v_1$ and $v_2$. Take any vertex that is a shared neighbor and denote this vertex by $v_3$ (see Figure 2 for an example).

Figure 2: Feedback graph for switching costs.

We set the loss of action $v_3$ and of all other vertices to be equal to one. We now focus the discussion on the subgraph with vertices $\{v_1, v_2, v_3\}$. The losses of actions $v_1$ and $v_2$ are set according to the construction in [Dekel et al., 2014]. Since $\{v_1, v_2\}$ forms an independent set, the player would need to switch between these vertices to gain information about the best action. This is also what the lower bound proof of Rangi and Franceschetti [2019] is based upon. However, it is important to realize that the construction in Dekel et al. [2014] also allows for gaining information about the best action if its loss is revealed together with some other loss constructed from the stochastic process. In that case, playing vertex $v_3$ would provide such information. This is a key property which Rangi and Franceschetti [2019] seem to have missed in their lower bound proof. We discuss this mistake carefully in Appendix C and provide a lower bound matching what the authors claim in the uninformed setting in Appendix I.
Our discussion suggests that we should set the price for revealing information about multiple actions according to the switching cost, and this is why the losses of all vertices outside of the independent set are equal to one. We note that the losses of the best action are much smaller than one sufficiently often, so that enough instantaneous regret is incurred when pulling action $v_3$. Our main result follows, and its proof can be found in Appendix F.

Theorem 5.1. For any non-complete feedback graph $G$, there exists a sequence of losses on which any algorithm $A$ in the informed setting incurs expected regret at least

$$R_T(A) \geq \Omega\left(\frac{T^{2/3}}{\log(T)}\right).$$

5.2 Lower Bound for Disjoint Union of Star Graphs

How do we construct a lower bound for a disjoint union of star graphs? First, note that if two adjacent vertices are allowed to admit losses set according to the stochastic process and one of them is the best vertex, then we could distinguish it in time $O(\sqrt{T})$ by repeatedly playing the other vertex. This suggests that losses set according to the stochastic process should be reserved for vertices in an independent set. Second, it is important to keep track of the amount of information revealed by common neighbors.

Consider the feedback graph of Figure 3.

[Figure 3: Disjoint union of star graphs, with revealing vertices $v_1, v_2, v_3, v_4$.]

This disjoint union of star graphs admits a domination number equal to four, and its minimum dominating set is $\{v_1, v_2, v_3, v_4\}$. Probably the most natural way to set up the losses of the vertices is to set the losses of the maximum independent set, which consists of the colored vertices, according to the construction of Dekel et al. [2014], and the losses of the minimum dominating set equal to one. Let $v_1$ be the vertex with the highest degree.
Whenever the sampled best action is not adjacent to $v_1$, switching between that action and $v_1$ reveals information proportional to $\deg(v_1)$ about it. On the other hand, no matter how we sample the best action as a neighbor of $v_1$, it is then enough to play $v_1$ to gain enough information about it. If $I$ denotes the maximum independent set, the above reasoning shows that only $O(T^{2/3}\,|I|/\deg(v_1))$ rounds of switching are needed to distinguish the best action. Since $\deg(v_1)$ can be made arbitrarily large, and thus $|I|/\deg(v_1)$ gets arbitrarily close to one, the regret lower bound becomes independent of the domination number and equal to $\tilde\Omega(T^{2/3})$.

We now present a construction for the disjoint union of star graphs which guarantees a lower bound of $\tilde\Omega\big(\gamma(G)^{1/3} T^{2/3}\big)$. The idea behind our construction is to choose an independent set such that none of its members have a common neighbor, thereby avoiding the problem described above. We note that such an independent set cannot have size greater than $\gamma(G)$. Let $R$ be the set of revealing vertices for the star graphs. We denote by $V_i$ the set of vertices associated with the star graph with revealing vertex $v_i$. To construct the losses, we first sample an active vertex for each star graph from its leaves. The active vertices are represented in red in Figure 3. This forms an independent set $I$ indexed by $R$. Next, we follow the construction of Dekel et al. [2014] for the vertices in $I$, by first sampling a best vertex uniformly at random from $I$ and then setting the losses in $I$ according to the multi-scale random walk. All other losses are set to one. For any star graph consisting of a single vertex, we treat the vertex as a non-revealing vertex. This construction guarantees the following.

Theorem 5.2. The expected regret of any algorithm $A$ on a disjoint union of star graphs is lower bounded as follows:

$$R_T(A) \geq \Omega\left(\frac{\gamma(G)^{1/3}\, T^{2/3}}{\log(T)}\right).$$
The proof of this theorem can be found in Appendix G. This result can be viewed as a consequence of that of Dekel et al. [2014], but it can also be proven in an alternative fashion. The general idea is to count the amount of information gained about the randomly sampled best vertex. For example, a strategy that switches between two revealing vertices $v_i$ and $v_j$ gains information proportional to $\deg(v_i)\deg(v_j)$. The lower bound follows from carefully counting the information gained by switching between revealing vertices. This counting argument can be generalized beyond the disjoint union of star graphs by considering an appropriate pair of minimal dominating/maximal independent sets. We give an argument for the disjoint union of star graphs in Appendix G and leave a detailed argument for general graphs to future work.

6 Conclusion

We presented an extensive analysis of online learning with feedback graphs and switching costs in the adversarial setting, a scenario relevant to several applications in practice. We gave a new algorithm whose regret guarantee depends only on the domination number. We also presented a matching lower bound for a family of graphs that includes disjoint unions of star graphs. The technical tools introduced in our proofs are likely to help derive a lower bound for all graph families. We further derived an algorithm with more favorable policy regret guarantees in the presence of feedback graphs.

Acknowledgements

This research was partly supported by NSF BIGDATA grants IIS-1546482 and IIS-1838139, and by NSF CCF-1535987, NSF IIS-1618662, and a Google Research Award.

References

Alekh Agarwal, Haipeng Luo, Behnam Neyshabur, and Robert E. Schapire. Corralling a band of bandit algorithms. arXiv preprint arXiv:1612.06246, 2016.

Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. From bandits to experts: A tale of domination and independence.
In Advances in Neural Information Processing Systems, pages 1610–1618, 2013.

Noga Alon, Nicolò Cesa-Bianchi, Ofer Dekel, and Tomer Koren. Online learning with feedback graphs: Beyond bandits. In JMLR Workshop and Conference Proceedings, volume 40. Microtome Publishing, 2015.

Noga Alon, Nicolò Cesa-Bianchi, Claudio Gentile, Shie Mannor, Yishay Mansour, and Ohad Shamir. Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785–1826, 2017.

Raman Arora, Ofer Dekel, and Ambuj Tewari. Online bandit learning against an adaptive adversary: from regret to policy regret. In Proceedings of the 29th International Conference on Machine Learning, pages 1747–1754, 2012.

Jean-Yves Audibert and Sébastien Bubeck. Minimax policies for adversarial and stochastic bandits. In COLT, pages 217–226, 2009.

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.

Béla Bollobás and Ernest J. Cockayne. Graph-theoretic parameters concerning domination, independence, and irredundance. Journal of Graph Theory, 3(3):241–249, 1979.

Swapna Buccapatnam, Atilla Eryilmaz, and Ness B. Shroff. Stochastic bandits with side observations on networks. In The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, pages 289–300. ACM, 2014.

Stephane Caron, Branislav Kveton, Marc Lelarge, and Smriti Bhagat. Leveraging side observations in stochastic bandits. In UAI, 2012.

Nicolò Cesa-Bianchi, Yoav Freund, David Haussler, David P. Helmbold, Robert E. Schapire, and Manfred K. Warmuth. How to use expert advice. Journal of the ACM (JACM), 44(3):427–485, 1997.

Nicolò Cesa-Bianchi, Ofer Dekel, and Ohad Shamir. Online learning with switching costs and other adaptive adversaries.
In Advances in Neural Information Processing Systems, pages 1160–1168, 2013.

Vasek Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.

Alon Cohen, Tamir Hazan, and Tomer Koren. Online learning with feedback graphs without the graphs. In International Conference on Machine Learning, pages 811–819, 2016.

Corinna Cortes, Giulia DeSalvo, Claudio Gentile, Mehryar Mohri, and Scott Yang. Online learning with abstention. In 35th ICML, 2018.

Ofer Dekel, Jian Ding, Tomer Koren, and Yuval Peres. Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing, pages 459–467. ACM, 2014.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

Sascha Geulen, Berthold Vöcking, and Melanie Winkler. Regret minimization for online buffering problems using the weighted majority algorithm. In COLT, pages 132–143, 2010.

Wayne Goddard and Michael A. Henning. Independent domination in graphs: A survey and recent results. Discrete Mathematics, 313(7):839–854, 2013.

Andras Gyorgy and Gergely Neu. Near-optimal rates for limited-delay universal lossy source coding. IEEE Transactions on Information Theory, 60(5):2823–2834, 2014.

Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems. Journal of Computer and System Sciences, 71(3):291–307, 2005.

Tomáš Kocák, Gergely Neu, Michal Valko, and Rémi Munos. Efficient learning by implicit exploration in bandit problems with side observations. In Advances in Neural Information Processing Systems, pages 613–621, 2014.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm.
Information and Computation, 108(2):212–261, 1994.

Fang Liu, Swapna Buccapatnam, and Ness Shroff. Information directed sampling for stochastic bandits with graph feedback. In 32nd AAAI Conference on Artificial Intelligence, 2018.

Shie Mannor and Ohad Shamir. From bandits to experts: On the value of side-observations. In Advances in Neural Information Processing Systems, pages 684–692, 2011.

Anshuka Rangi and Massimo Franceschetti. Online learning with feedback graphs and switching costs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2435–2444, 2019.

Aristide Tossou, Christos Dimitrakakis, and Devdatt Dubhashi. Thompson sampling for stochastic bandits with graph feedback. In 31st AAAI Conference on Artificial Intelligence, 2017.

Yifan Wu, András György, and Csaba Szepesvári. Online learning with Gaussian payoffs and side observations. In Advances in Neural Information Processing Systems 28, pages 1360–1368. Curran Associates, Inc., 2015a.

Yifan Wu, András György, and Csaba Szepesvári. Online learning with Gaussian payoffs and side observations. In NIPS, pages 1360–1368, 2015b.

Donggyu Yun, Alexandre Proutiere, Sumyeong Ahn, Jinwoo Shin, and Yung Yi. Multi-armed bandit with additional observations. Proc. ACM Meas. Anal. Comput. Syst., 2(1):13:1–13:22, 2018.