{"title": "Stochastic Online Greedy Learning with Semi-bandit Feedbacks", "book": "Advances in Neural Information Processing Systems", "page_first": 352, "page_last": 360, "abstract": "The greedy algorithm is extensively studied in the field of combinatorial optimization for decades. In this paper, we address the online learning problem when the input to the greedy algorithm is stochastic with unknown parameters that have to be learned over time. We first propose the greedy regret and $\\epsilon$-quasi greedy regret as learning metrics comparing with the performance of offline greedy algorithm. We then propose two online greedy learning algorithms with semi-bandit feedbacks, which use multi-armed bandit and pure exploration bandit policies at each level of greedy learning, one for each of the regret metrics respectively. Both algorithms achieve $O(\\log T)$ problem-dependent regret bound ($T$ being the time horizon) for a general class of combinatorial structures and reward functions that allow greedy solutions. We further show that the bound is tight in $T$ and other problem instance parameters.", "full_text": "Stochastic Online Greedy Learning with\n\nSemi-bandit Feedbacks\n\nTian Lin\n\nTsinghua University\n\nBeijing, China\n\nWei Chen\n\nMicrosoft Research\n\nBeijing, China\n\nlintian06@gmail.com\n\nlapordge@gmail.com\n\nweic@microsoft.com\n\nJian Li\n\nTsinghua University\n\nBeijing, China\n\nAbstract\n\nThe greedy algorithm is extensively studied in the \ufb01eld of combinatorial optimiza-\ntion for decades. In this paper, we address the online learning problem when the\ninput to the greedy algorithm is stochastic with unknown parameters that have\nto be learned over time. We \ufb01rst propose the greedy regret and \u0001-quasi greedy\nregret as learning metrics comparing with the performance of of\ufb02ine greedy algo-\nrithm. 
We then propose two online greedy learning algorithms with semi-bandit feedbacks, which use multi-armed bandit and pure exploration bandit policies at each level of greedy learning, one for each of the regret metrics respectively. Both algorithms achieve an O(log T) problem-dependent regret bound (T being the time horizon) for a general class of combinatorial structures and reward functions that allow greedy solutions. We further show that the bound is tight in T and other problem instance parameters.\n\n1 Introduction\n\nThe greedy algorithm is simple and easy to implement, and can be applied to solve a wide range of complex optimization problems, either with exact solutions (e.g., minimum spanning tree [19, 25]) or approximate solutions (e.g., maximum coverage [11] or influence maximization [17]). Moreover, for many practical problems, the greedy algorithm often serves as the first heuristic of choice and performs well in practice even when it does not provide a theoretical guarantee.\nThe classical greedy algorithm assumes that a certain reward function is given, and it constructs the solution iteratively. In each phase, it searches for a locally optimal element that maximizes the marginal gain in reward, and adds it to the solution. We refer to this case as the offline greedy algorithm with a given reward function, and to the corresponding problem as the offline problem. The phase-by-phase process of the greedy algorithm naturally forms a decision sequence illustrating the decision flow in finding the solution, which we call the greedy sequence. We characterize the decision class as an accessible set system, a general combinatorial structure encompassing many interesting problems.\nIn many real applications, however, the reward function is stochastic and is not known in advance, and the reward is only instantiated based on the unknown distribution after the greedy sequence is selected. 
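To make the phase-by-phase offline greedy process described above concrete, here is a minimal sketch; the set system, item weights, and names (`greedy`, `feasible`, `reward`) are illustrative stand-ins, not the paper's notation:

```python
# Minimal sketch of the offline greedy algorithm: starting from the empty
# solution, repeatedly add an accessible element with the largest marginal
# gain until no feasible extension remains. `feasible` and `reward` are
# illustrative oracles.

def greedy(ground_set, feasible, reward):
    solution = set()
    while True:
        # accessible elements: additions that keep the solution feasible
        candidates = [e for e in ground_set
                      if e not in solution and feasible(solution | {e})]
        if not candidates:
            return solution  # maximal: no feasible extension remains
        # local search step: maximize the marginal gain of reward
        best = max(candidates,
                   key=lambda e: reward(solution | {e}) - reward(solution))
        solution.add(best)

# Toy instance: pick at most 2 of 4 weighted items (additive reward).
weights = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 0.5}
result = greedy(set(weights), lambda S: len(S) <= 2,
                lambda S: sum(weights[e] for e in S))
```

On this toy instance the two heaviest items are selected, one per phase.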
For example, in the influence maximization problem [17], social influence is propagated in a social network from the selected seed nodes following a stochastic model with unknown parameters, and one wants to find the optimal seed set of size k that generates the largest influence spread, which is the expected number of nodes influenced in a cascade. In this case, the reward of seed selection is only instantiated after the seeds are selected, and is only one of the random outcomes. Therefore, when the stochastic reward function is unknown, we aim at maximizing the expected reward over time while gradually learning the key parameters of the expected reward function. This falls in the domain of online learning, and we refer to the online algorithm as the strategy of the player, who makes sequential decisions, interacts with the environment, obtains feedbacks, and accumulates her reward. For online greedy algorithms in particular, at each time step the player selects and plays a candidate decision sequence while the environment instantiates the reward function; the player then collects the values of the instantiated function at every phase of the decision sequence as the feedbacks (hence the name semi-bandit feedbacks [2]), and takes the value of the final phase as the reward accumulated in this step.\nThe typical objective for an online algorithm is to make sequential decisions that compete with the optimal solution of the offline problem, where the reward function is known a priori. For online greedy algorithms, we instead compare against the solution of the offline greedy algorithm, and minimize the gap in cumulative reward over time, termed the greedy regret. 
Furthermore, in some problems such as influence maximization, the reward function is estimated with error even for the offline problem [17], and thus the greedily selected element at each phase may contain some ε error. We call such a greedy sequence an ε-quasi greedy sequence. To accommodate these cases, we also define the metric of ε-quasi greedy regret, which compares the online solution against the minimum offline solution over all ε-quasi greedy sequences.\nIn this paper, we propose two online greedy algorithms targeted at the two regret metrics respectively. The first algorithm OG-UCB uses the stochastic multi-armed bandit (MAB) [22, 8], in particular the well-known UCB policy [3], as the building block to minimize the greedy regret. We apply the UCB policy to every phase by associating a confidence bound with each arm, and then greedily choose the arm with the highest upper confidence bound in the decision process. For the second scenario, where we tolerate an ε-error in each phase, we propose a first-explore-then-exploit algorithm OG-LUCB to minimize the ε-quasi greedy regret. For every phase in the greedy process, OG-LUCB applies the LUCB policy [16, 9], which uses upper and lower confidence bounds to eliminate arms. It first explores each arm until the lower bound of one arm is higher than the upper bound of any other arm within an ε-error; the current phase then switches to exploiting that best arm, and the algorithm continues to the next phase. Both OG-UCB and OG-LUCB achieve the problem-dependent O(log T) bound in terms of the respective regret metrics, where the coefficients in front of log T depend on the elements along the greedy sequence (a.k.a. its decision frontier) corresponding to the instance of the learning problem. 
The two algorithms have complementary advantages: when we really target the greedy regret (setting ε to 0 for OG-LUCB), OG-UCB has a slightly better regret guarantee and does not need an artificial switch between exploration and exploitation; when we are satisfied with the ε-quasi greedy regret, OG-LUCB works, but OG-UCB cannot be adapted to this case and may suffer a larger regret. We also show a problem instance in this paper where the upper bound is tight against the lower bound in T and other problem parameters.\nWe further show that our algorithms can be easily extended to the knapsack problem, and applied to stochastic online maximization for consistent functions and submodular functions, etc., in the supplementary material.\nTo summarize, our contributions include the following: (a) To the best of our knowledge, we are the first to propose the framework using the greedy regret and ε-quasi greedy regret to characterize the online performance of the stochastic greedy algorithm for different scenarios, and it works for a wide class of accessible set systems and general reward functions; (b) We propose Algorithms OG-UCB and OG-LUCB that achieve the problem-dependent O(log T) regret bound; and (c) We also show that the upper bound matches the lower bound (up to a constant factor).\nDue to the space constraint, the analysis of the algorithms, applications, and the empirical evaluation of the lower bound are moved to the supplementary material.\n\nRelated Work. The multi-armed bandit (MAB) problem for both stochastic and adversarial settings [22, 4, 6] has been widely studied for decades. Most works focus on minimizing the cumulative regret over time [3, 14], or identifying the optimal solution in terms of pure exploration bandits [1, 16, 7]. Among those works, there is one line of research that generalizes MAB to combinatorial learning problems [8, 13, 2, 10, 21, 23, 9]. 
Our paper belongs to this line, considering stochastic learning with semi-bandit feedbacks, while we focus on the greedy algorithm, its structure, and its performance measure, which have not been addressed.\nThe classical greedy algorithms in the offline setting are studied in many applications [19, 25, 11, 5], and there is a line of work [15, 18] focusing on characterizing the greedy structure for solutions. We adopt their characterizations of accessible set systems to the online setting of greedy learning. There is also a branch of work using the greedy algorithm to solve online learning problems, while they require knowledge of the exact form of the reward function, restricting to special functions such as linear [2, 20] and submodular rewards [26, 12]. Our work does not assume the exact form, and it covers a much larger class of combinatorial structures and reward functions.\n\n2 Preliminaries\n\nOnline combinatorial learning can be formulated as a repeated game between the environment and the player under the stochastic multi-armed bandit framework.\nLet E = {e1, e2, . . . , en} be a finite ground set of size n, and F be a collection of subsets of E. We consider the accessible set system (E, F) satisfying the following two axioms: (1) ∅ ∈ F; (2) if S ∈ F and S ≠ ∅, then there exists some e ∈ E such that S \\ {e} ∈ F. We define any set S ⊆ E as a feasible set if S ∈ F. For any S ∈ F, its accessible set is defined as N(S) := {e ∈ E \\ S : S ∪ {e} ∈ F}. We say a feasible set S is maximal if N(S) = ∅. Define the largest length of any feasible set as m := max_{S∈F} |S| (m ≤ n), and the largest width of any feasible set as W := max_{S∈F} |N(S)| (W ≤ n). We say that such an accessible set system (E, F) is the decision class of the player. 
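As a toy illustration of the two axioms and of the quantities N(S), m, and W just defined, consider the following sketch (the particular E and F are invented for illustration):

```python
# Toy accessible set system (E, F): F is listed explicitly to make N(S),
# maximality, the largest length m, and the largest width W concrete.
E = {1, 2, 3}
# Axioms: the empty set is in F, and every nonempty S in F has some e
# with S \ {e} still in F (accessibility). Both hold for this F.
F = [frozenset(), frozenset({1}), frozenset({2}),
     frozenset({1, 2}), frozenset({1, 3})]

def accessible_set(S):
    """N(S): elements of E \\ S whose addition keeps the set feasible."""
    return {e for e in E - S if frozenset(S | {e}) in F}

def is_maximal(S):
    return not accessible_set(S)

m = max(len(S) for S in F)                        # largest feasible-set size
W = max(len(accessible_set(set(S))) for S in F)   # largest width
```

Here {1, 2} and {1, 3} are the maximal feasible sets, with m = W = 2.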
In the class of combinatorial learning problems, the size of F is usually very large (e.g., exponential in m, W and n).\nBeginning with the empty set, the accessible set system (E, F) ensures that any feasible set S can be acquired by adding elements one by one in some order (cf. Lemma A.1 in the supplementary material for more details), which naturally forms the decision process of the player. For convenience, we say the player can choose a decision sequence, defined as an ordered sequence of feasible sets σ := ⟨S0, S1, . . . , Sk⟩ ∈ F^{k+1} satisfying ∅ = S0 ⊂ S1 ⊂ · · · ⊂ Sk and, for any i = 1, 2, . . . , k, Si = Si−1 ∪ {si} where si ∈ N(Si−1). Besides, we call a decision sequence σ maximal if and only if Sk is maximal.\nLet Ω be an arbitrary set. The environment draws i.i.d. samples ω1, ω2, . . . from Ω at each time t = 1, 2, . . . , following a predetermined but unknown distribution. Consider a reward function f : F × Ω → R that is bounded and non-decreasing1 in its first parameter, while the exact form of the function is agnostic to the player. We use the shorthand ft(S) := f(S, ωt) to denote the reward of any given S at time t, and denote the expected reward as f̄(S) := E_{ω1}[f1(S)], where the expectation E_{ωt} is taken over the randomness of the environment at time t. 
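The stochastic reward setup just described can be simulated as follows; the additive reward and the Bernoulli element weights are hypothetical choices, made only to illustrate that the player sees realizations f_t(S) = f(S, ω_t) whose empirical average approaches the expected reward:

```python
# Simulation of the stochastic reward setup: the environment draws i.i.d.
# samples ω_t, and the player only ever observes realizations
# f_t(S) = f(S, ω_t). The additive f and Bernoulli element weights below
# are invented purely for illustration.
import random

random.seed(0)
ELEMENTS = ["a", "b", "c"]

def f(S, omega):
    """omega assigns each element a realized weight; f is monotone in S."""
    return sum(omega[e] for e in S)

def draw_omega():
    # each element's weight is Bernoulli(1/2), so E[f_1(S)] = |S| / 2
    return {e: random.randint(0, 1) for e in ELEMENTS}

# averaging the observed f_t(S) over many rounds approaches E[f_1(S)] = 1.0
S, T = {"a", "b"}, 20000
estimate = sum(f(S, draw_omega()) for _ in range(T)) / T
```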
For ease of presentation,\nwe assume that the reward function for any time t is normalized with arbitrary alignment as follows:\n(1) ft(\u2205) = L (for any constant L \u2265 0); (2) for any S \u2208 F, e \u2208 N (S), ft(S \u222a{e})\u2212 ft(S) \u2208 [0, 1].\nTherefore, reward function f (\u00b7,\u00b7) is implicitly bounded within [L, L + m].\nWe extend the concept of arms in MAB, and introduce notation a := e|S to de\ufb01ne an arm, repre-\nsenting the selected element e based on the pre\ufb01x S, where S is a feasible set and e \u2208 N (S); and\nde\ufb01ne A := {e|S : \u2200S \u2208 F,\u2200e \u2208 N (S)} as the arm space. Then, we can de\ufb01ne the marginal\nreward for function ft as ft(e|S) := ft(S \u222a {e}) \u2212 ft(S), and the expected marginal reward for f\nas f (e|S) := f (S \u222a{e})\u2212 f (S). Notice that the use of arms characterizes the marginal reward, and\nalso indicates that it is related to the player\u2019s previous decision.\n\n2.1 The Of\ufb02ine Problem and The Of\ufb02ine Greedy Algorithm\n\nIn the of\ufb02ine problem, we assume that f is provided as a value oracle. Therefore, the objective is\nto \ufb01nd the optimal solution S\u2217 = arg maxS\u2208F f (S), which only depends on the player\u2019s decision.\nWhen the optimal solution is computationally hard to obtain, usually we are interested in \ufb01nding\na feasible set S+ \u2208 F such that f (S+) \u2265 \u03b1f (S\u2217) where \u03b1 \u2208 (0, 1], then S+ is called an \u03b1-\napproximation solution. That is a typical case where the greedy algorithm comes into play.\nThe of\ufb02ine greedy algorithm is a local search algorithm that re\ufb01nes the solution phase by\nphase. It goes as follows: (a) Let G0 = \u2205; (b) For each phase k = 0, 1, . . . , \ufb01nd\ngk+1 = arg maxe\u2208N (Gk) f (e|Gk), and let Gk+1 = Gk \u222a {gk+1}; (c) The above process ends\nwhen N (Gk+1) = \u2205 (Gk+1 is maximal). We de\ufb01ne the maximal decision sequence \u03c3G :=\n(cid:104)G0, G1, . . 
. , GmG⟩ (mG is its length) found by the offline greedy as the greedy sequence. For simplicity, we assume that it is unique.\n\n1Therefore, the optimal solution is a maximal decision sequence.\n\nOne important feature is that the greedy algorithm uses a polynomial number of calls (poly(m, W, n)) to the offline oracle, even though the size of F or A may be exponentially large.\nIn some cases, such as the offline influence maximization problem [17], the value of f(·) can only be accessed with some error or estimated approximately. Sometimes, even though f(·) can be computed exactly, we may only need an approximate maximizer in each greedy phase in favor of computational efficiency (e.g., efficient submodular maximization [24]). To capture such scenarios, we say a maximal decision sequence σ = ⟨S0, S1, . . . , Sm′⟩ is an ε-quasi greedy sequence (ε ≥ 0) if the greedy decision can tolerate ε error in every phase, i.e., for each k = 0, 1, . . . , m′ − 1 and Sk+1 = Sk ∪ {sk+1}, f(sk+1|Sk) ≥ max_{s∈N(Sk)} f(s|Sk) − ε. Notice that there could be many ε-quasi greedy sequences, and we denote σQ := ⟨Q0, Q1, . . . , QmQ⟩ (mQ is its length) as the one with the minimum reward, that is, f(QmQ) is minimized over all ε-quasi greedy sequences.\n\n2.2 The Online Problem\n\nIn the online case, in contrast, f is not provided. The player can only access one of the functions f1, f2, . . . , generated by the environment, at each time step during a repeated game.\nFor each time t, the game proceeds in the following three steps: (1) the environment draws an i.i.d. sample ωt ∈ Ω from its predetermined distribution without revealing it; (2) the player may, based on her previous knowledge, select a decision sequence σt = ⟨S0, S1, . . 
, Smt⟩, which reflects the process of her decision phase by phase; (3) the player then plays σt and gains reward ft(Smt), while observing the intermediate feedbacks ft(S0), ft(S1), . . . , ft(Smt) to update her knowledge. We refer to such feedbacks as semi-bandit feedbacks in the decision order.\nFor any time t = 1, 2, . . . , denote σt = ⟨S^t_0, S^t_1, . . . , S^t_{mt}⟩ and S^t := S^t_{mt}. The player is to make sequential decisions, and the classical objective is to minimize the cumulative gap of rewards against the optimal solution [3] or the approximation solution [10]. For example, when the optimal solution S* = arg max_{S∈F} E[f1(S)] can be solved in the offline problem, we minimize the expected cumulative regret R(T) := T · E[f1(S*)] − Σ_{t=1}^T E[ft(S^t)] over the time horizon T, where the expectation is taken over the randomness of the environment and the possibly random algorithm of the player. In this paper, we are interested in online algorithms that are comparable to the solution of the offline greedy algorithm, namely the greedy sequence σG = ⟨G0, G1, . . . , GmG⟩. Thus, the objective is to minimize the greedy regret defined as\n\nRG(T) := T · E[f1(GmG)] − Σ_{t=1}^T E[ft(S^t)].  (1)\n\nGiven ε ≥ 0, we define the ε-quasi greedy regret as\n\nRQ(T) := T · E[f1(QmQ)] − Σ_{t=1}^T E[ft(S^t)],  (2)\n\nwhere σQ = ⟨Q0, Q1, . . . , QmQ⟩ is the minimum ε-quasi greedy sequence.\nWe remark that if the offline greedy algorithm provides an α-approximation solution (with 0 < α ≤ 1), then the greedy regret (or ε-quasi greedy regret) also provides the α-approximation regret, which is the regret compared to the α fraction of the optimal solution, as defined in [10].\nIn the rest of the paper, our goal is to design the player's policy to be comparable to the offline greedy; in other words, RG(T)/T = f̄(GmG) − (1/T) Σ_{t=1}^T E[ft(S^t)] = o(1), where f̄ denotes the expected reward. Thus, achieving sublinear greedy regret RG(T) = o(T) is our main focus.\n\n3 The Online Greedy and Algorithm OG-UCB\n\nIn this section, we propose our Online Greedy (OG) algorithm with the UCB policy to minimize the greedy regret (defined in (1)).\nFor any arm a = e|S ∈ A, playing a at time t yields the marginal reward as a random variable Xt(a) = ft(a), in which the random event ωt ∈ Ω is i.i.d., and we denote µ(a) as its true mean (i.e.,
, k do\n\n(cid:46) update according to signals from MaxOracle\n\n(cid:46) until a maximal sequence is found\n\nif h0, h1,\u00b7\u00b7\u00b7 , hi\u22121 are all true then\n\nUpdate \u02c6X(si|Si\u22121) and N (si|Si\u22121) according to (3).\n\n(cid:113) 3 ln t\nSubroutine 2 UCB(A, \u02c6X(\u00b7), N (\u00b7), t) to implement MaxOracle\n2N (a), for each a \u2208 A\nSetup: con\ufb01dence radius radt(a) :=\n1: if \u2203a \u2208 A, \u02c6X(a) is not initialized then\n(cid:110) \u02c6X(a) + radt(a)\n2:\n3: else\n4:\n\nreturn (a, true)\nt \u2190 arg maxa\u2208A\nI +\n\n, and return (I +\n\n(cid:111)\n\nt , true)\n\n(cid:46) break ties arbitrarily\n(cid:46) to initialize arms\n(cid:46) apply UCB\u2019s rule\n\n\u00b5(a) := E [X1(a)]). Let \u02c6X(a) be the empirical mean for the marginal reward of a, and N (a) be the\ncounter of the plays. More speci\ufb01cally, denote \u02c6Xt(a) and Nt(a) for particular \u02c6X(a) and N (a) at\nthe beginning of the time step t, and they are evaluated as follows:\n\n(cid:80)t\u22121\n(cid:80)t\u22121\n\ni=1\n\ni=1 fi(a)Ii(a)\n\nIi(a)\n\nt\u22121(cid:88)\n\ni=1\n\n\u02c6Xt(a) =\n\n, Nt(a) =\n\nIi(a),\n\n(3)\n\nwhere Ii(a) \u2208 {0, 1} indicates whether a is updated at time i. In particular, assume that our algo-\nrithm is lazy-initialized so that each \u02c6X(a) and N (a) is 0 by default, until a is played.\nThe Online Greedy algorithm (OG) proposed in Algorithm 1 serves as a meta-algorithm allow-\ning different implementations of Subroutine MaxOracle. For every time t, OG calls MaxOracle\n(Line 5, to be speci\ufb01ed later) to \ufb01nd the local maximal phase by phase, until the decision sequence \u03c3t\nis made. Then, it plays sequence \u03c3t, observes feedbacks and gains the reward (Line 8). Meanwhile,\nOG collects the Boolean signals (hk) from MaxOracle during the greedy process (Line 5), and up-\ndate estimators \u02c6X(\u00b7) and N (\u00b7) according to those signals (Line 10). 
On the other hand, MaxOracle\ntakes accessible arms A, estimators \u02c6X(\u00b7), N (\u00b7), and counted time t(cid:48), and returns an arm from A and\nsignal hk \u2208 {true, false} to instruct OG whether to update estimators for the following phase.\nThe classical UCB [3] can be used to implement MaxOracle, which is described in Subroutine 2.\nWe term our algorithm OG, in which MaxOracle is implemented by Subroutine 2 UCB, as Algo-\nrithm OG-UCB. A few remarks are in order: First, Algorithm OG-UCB chooses an arm with the\nhighest upper con\ufb01dence bound for each phase. Second, the signal hk is always true, meaning that\nOG-UCB always update empirical means of arms along the decision sequence. Third, because we\nuse lazy-initialized \u02c6X(\u00b7) and N (\u00b7), the memory is allocated only when it is needed.\n\n3.1 Regret Bound of OG-UCB\nS := arg maxe\u2208N (S) f (e|S), and we use\nFor any feasible set S, de\ufb01ne the greedy element for S as g\u2217\nS} for convenience. Denote F\u2020 := {S \u2208 F : S is maximal} as the collection\nN\u2212(S) := N (S) \\ {g\u2217\nof all maximal feasible sets in F. We use the following gaps to measure the performance of the\nalgorithm.\n\n5\n\n\fDe\ufb01nition 3.1 (Gaps). The gap between the maximal greedy feasible set GmG and any S \u2208 F is\nde\ufb01ned as \u2206(S) := f (GmG ) \u2212 f (S) if it is positive, and 0 otherwise. We de\ufb01ne the maximum gap\nas \u2206max = f (GmG ) \u2212 minS\u2208F\u2020 f (S), which is the worst penalty for any maximal feasible set. 
For any arm a = e|S ∈ A, we define the unit gap of a (i.e., the gap for one phase) as\n\nΔ(a) = Δ(e|S) := f(g*_S|S) − f(e|S) if e ≠ g*_S;  f(g*_S|S) − max_{e′∈N−(S)} f(e′|S) if e = g*_S.  (4)\n\nFor any arm a = e|S ∈ A, we define the sunk-cost gap (irreversible once selected) as\n\nΔ*(a) = Δ*(e|S) := max{ f(GmG) − min_{V: V∈F†, S∪{e}≺V} f(V), 0 },  (5)\n\nwhere for two feasible sets A and B, A ≺ B means that A is a prefix of B in some decision sequence; that is, there exists a decision sequence σ = ⟨S0 = ∅, S1, . . . , Sk⟩ such that Sk = B and, for some j < k, Sj = A. Thus, Δ*(e|S) is the largest gap we may incur after we have fixed our prefix selection to be S ∪ {e}, and it is upper bounded by Δmax.\nDefinition 3.2 (Decision frontier). For any decision sequence σ = ⟨S0, S1, . . . , Sk⟩, define the decision frontier Γ(σ) := ∪_{i=1}^k {e|S_{i−1} : e ∈ N(S_{i−1})} ⊆ A as the arms that need to be explored in the decision sequence σ, and Γ−(σ) := ∪_{i=1}^k {e|S_{i−1} : e ∈ N−(S_{i−1})} similarly.\nTheorem 3.1 (Greedy regret bound). For any time T, Algorithm OG-UCB (Algorithm 1 with Subroutine 2) achieves the greedy regret\n\nRG(T) ≤ Σ_{a∈Γ−(σG)} ( 6Δ*(a) ln T / Δ(a)² + (π²/3 + 1) Δ*(a) ),  (6)\n\nwhere σG is the greedy decision sequence.\nWhen m = 1, the above theorem immediately recovers the regret bound of the classical UCB [3] (with Δ*(a) = Δ(a)). 
The greedy regret is bounded by O(mW Δmax log T / Δ²), where Δ is the minimum unit gap (Δ = min_{a∈A} Δ(a)), and the memory cost is at most proportional to the regret. For a special class of linear bandits, a simple extension where we treat arms e|S and e|S′ as the same can make OG-UCB essentially the same as OMM in [20], while the regret is O(n/Δ · log T) and the memory cost is O(n) (cf. Appendix F.1 of the supplementary material).\n\n4 Relaxing the Greedy Sequence with ε-Error Tolerance\n\nIn this section, we propose an online algorithm called OG-LUCB, which learns an ε-quasi greedy sequence, with the goal of minimizing the ε-quasi greedy regret (in (2)). We learn ε-quasi greedy sequences by a first-explore-then-exploit policy, which utilizes results from PAC learning with a fixed confidence setting. In Section 4.1, we implement MaxOracle via the LUCB policy and derive its exploration time; we then assume knowledge of the time horizon T in Section 4.2 and analyze the ε-quasi greedy regret; and in Section 4.3, we show that the assumption of knowing T can be further removed.\n\n4.1 OG with a first-explore-then-exploit policy\n\nGiven ε ≥ 0 and failure probability δ ∈ (0, 1), we use Subroutine 3 LUCBε,δ to implement the subroutine MaxOracle in Algorithm OG. We call the resulting algorithm OG-LUCBε,δ. Specifically, Subroutine 3 is adapted from CLUCB-PAC in [9], and specialized to explore the top-one element in the support of [0, 1] (i.e., set R = 1/2, width(M) = 2 and Oracle = arg max in [9]). Assume that I^exploit(·) is lazy-initialized. 
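A minimal sketch of the LUCB-style separation test used at a single greedy phase may help; the helper `lucb_step`, its argument names, and the numeric values are hypothetical, and the confidence radii are passed in directly rather than computed from rad_t(a):

```python
# Sketch of the LUCB-style separation test at one greedy phase: commit to
# the empirically best arm once every rival's upper confidence bound is
# within eps of the best arm's lower confidence bound; otherwise keep
# exploring the more uncertain of the two contending arms.
def lucb_step(arms, x_hat, rad, eps):
    best = max(arms, key=lambda a: x_hat[a])          # empirically best arm
    lower_best = x_hat[best] - rad[best]              # its lower bound
    rival = max((a for a in arms if a != best),
                key=lambda a: x_hat[a] + rad[a])      # strongest competitor
    if x_hat[rival] + rad[rival] - lower_best > eps:
        # not yet separated: sample the arm with the wider confidence interval
        return max((best, rival), key=lambda a: rad[a]), False
    return best, True                                 # separated: exploit

# separated case: the rival's upper bound (0.6) is below best's lower (0.75)
arm, done = lucb_step(["u", "v"], {"u": 0.8, "v": 0.5},
                      {"u": 0.05, "v": 0.1}, eps=0.0)
```

Once `done` is true for a phase, the arm is cached and played thereafter, mirroring the switch to the exploitation stage.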
Subroutine 3 LUCBε,δ(A, X̂(·), N(·), t) to implement MaxOracle\nSetup: radt(a) := √(ln(4W t³/δ) / (2N(a))), for each a ∈ A; I^exploit(·) to cache arms for exploitation\n1: if I^exploit(A) is initialized then return (I^exploit(A), true)  ▷ in the exploitation stage\n2: if ∃a ∈ A such that X̂(a) is not initialized then\n3:  return (a, false)  ▷ to initialize arms; break ties arbitrarily\n4: else\n5:  Ît ← arg max_{a∈A} X̂(a)\n6:  ∀a ∈ A, X′(a) ← X̂(a) + radt(a) if a ≠ Ît; X̂(a) − radt(a) if a = Ît  ▷ perturb arms\n7:  I′t ← arg max_{a∈A} X′(a)\n8:  if X′(I′t) − X′(Ît) > ε then  ▷ not separated; in the exploration stage\n9:   I″t ← arg max_{i∈{Ît, I′t}} radt(i), and return (I″t, false)\n10:  else  ▷ separated\n11:   I^exploit(A) ← Ît  ▷ initialize I^exploit(A) with Ît\n12:   return (I^exploit(A), true)  ▷ in the exploitation stage\n\nFor each greedy phase, the algorithm first explores each arm in A in the exploration stage, during which the return flag (the second return field) is always false; when the optimal arm is found (initializing I^exploit(A) with Ît), it sticks to I^exploit(A) in the exploitation stage for the subsequent time steps, and the return flag for this phase becomes true. The main algorithm OG then uses these flags in such a way that it updates arm estimates for phase i if and only if all phases
This avoids maintaining useless arm estimates and is\na major memory saving comparing to OG-UCB.\nIn Algorithm OG-LUCB\u0001,\u03b4, we de\ufb01ne the total exploration time T E = T E(\u03b4), such that for any\ntime t \u2265 T E, OG-LUCB\u0001,\u03b4 is in the exploitation stage for all greedy phases encountered in the\nalgorithm. This also means that after time T E, in every step we play the same maximal decision\nsequence \u03c3 = (cid:104)S0, S1,\u00b7\u00b7\u00b7 , Sk(cid:105) \u2208 F k+1, which we call a stable sequence. Following a common\npractice, we de\ufb01ne the hardness coef\ufb01cient with pre\ufb01x S \u2208 F as\n\n1\n\nmax{\u2206(e|S)2, \u00012} , where \u2206(e|S) is de\ufb01ned in (4).\n\n(7)\n\n(cid:88)\n\nH \u0001\n\nS :=\n\ne\u2208N (S)\n\nRewrite de\ufb01nitions with respect to the \u0001-quasi regret. Recall that \u03c3Q = (cid:104)Q0, Q1, . . . , QmQ(cid:105) is\nthe minimum \u0001-quasi greedy sequence. In this section, we rewrite the gap \u2206(S) := max{f (QmQ)\u2212\nf (S), 0} for any S \u2208 F, the maximum gap \u2206max := f (QmQ ) \u2212 minS\u2208F\u2020 f (S), and \u2206\u2217(a) =\n\n\u2206\u2217(e|S) := max(cid:8)f (QmQ) \u2212 minV :V \u2208F\u2020,S\u222a{e}\u227aV f (V ), 0(cid:9), for any arm a = e|S \u2208 A.\n\nThe following theorem shows that, with high probability, we can \ufb01nd a stable \u0001-quasi greedy se-\nquence, and the total exploration time is bounded.\nTheorem 4.1 (High probability exploration time). Given any \u0001 \u2265 0 and \u03b4 \u2208 (0, 1), suppose after\nthe total exploration time T E = T E(\u03b4), Algorithm OG-LUCB\u0001,\u03b4 (Algorithm 1 with Subroutine 3)\nsticks to a stable sequence \u03c3 = (cid:104)S0, S1,\u00b7\u00b7\u00b7 , Sm(cid:48)(cid:105) where m(cid:48) is its length. 
With probability at least $1 - m\delta$, the following claims hold: (1) $\sigma$ is an $\epsilon$-quasi greedy sequence; (2) the total exploration time satisfies $T^E \le 127 \sum_{k=0}^{m'-1} H^\epsilon_{S_k} \ln\left(1996\, W H^\epsilon_{S_k}/\delta\right)$.

4.2 Time Horizon T is Known
Knowing the time horizon $T$, we may let $\delta = \frac{1}{T}$ in OG-LUCB$_{\epsilon,\delta}$ to derive the $\epsilon$-quasi regret as follows.
Theorem 4.2. Given any $\epsilon \ge 0$, when the total time $T$ is known, let Algorithm OG-LUCB$_{\epsilon,\delta}$ run with $\delta = \frac{1}{T}$. Suppose $\sigma = \langle S_0, S_1, \cdots, S_{m'} \rangle$ is the sequence selected at time $T$. Define the function
$$R^{Q,\sigma}(T) := \sum_{e|S \in \Gamma(\sigma)} \Delta^*(e|S) \min\left\{\frac{127}{\Delta(e|S)^2}, \frac{113}{\epsilon^2}\right\} \ln\left(1996\, W H^\epsilon_S T\right) + \Delta_{\max} m,$$
where $m$ is the largest length of a feasible set and $H^\epsilon_S$ is defined in (7). Then, the $\epsilon$-quasi regret satisfies $R^Q(T) \le R^{Q,\sigma}(T) = O\left(\frac{W m \Delta_{\max}}{\max\{\Delta^2, \epsilon^2\}} \log T\right)$, where $\Delta$ is the minimum unit gap.
In general, the two bounds (Theorem 3.1 and Theorem 4.2) are for different regret metrics, and thus cannot be directly compared. When $\epsilon = 0$, OG-UCB is slightly better, but only in the constant before $\log T$. On the other hand, when we are satisfied with the $\epsilon$-quasi greedy regret, OG-LUCB$_{\epsilon,\delta}$ may work better for some large $\epsilon$, since the bound takes the maximum (in the denominator) of the problem-dependent term $\Delta(e|S)$ and the fixed constant $\epsilon$, and the memory cost is only $O(mW)$.

4.3 Time Horizon T is not Known
When the time horizon $T$ is not known, we can apply the "squaring trick" and restart the algorithm in each epoch as follows. Define the duration of epoch $\ell$ as $\phi_\ell$ and its accumulated time as $\tau_\ell$, where
$$\phi_\ell := e^{2^\ell}; \qquad \tau_\ell := \begin{cases} 0, & \ell = 0 \\ \sum_{s=1}^{\ell} \phi_s, & \ell \ge 1 \end{cases}. \quad (8)$$
For any time horizon $T$, define the final epoch $K = K(T)$ as the epoch in which $T$ lies, that is, $\tau_{K-1} < T \le \tau_K$. Our algorithm OG-LUCB-R is then proposed in Algorithm 4.

Algorithm 4 OG-LUCB-R (i.e., OG-LUCB with Restart)
Require: $\epsilon$
1: for epoch $\ell = 1, 2, \cdots$ do
2:   Clean $\hat{X}(\cdot)$ and $N(\cdot)$ for all arms, and restart OG-LUCB$_{\epsilon,\delta}$ with $\delta = \frac{1}{\phi_\ell}$ ($\phi_\ell$ is defined in (8)).
3:   Run OG-LUCB$_{\epsilon,\delta}$ for $\phi_\ell$ time steps (exit halfway if the time horizon is reached).

The following theorem shows that the $O(\log T)$ $\epsilon$-quasi regret still holds, with a slight blowup of the constant hidden in the big $O$ notation (for completeness, the explicit constant before $\log T$ can be found in Theorem D.7 of the supplementary material).
Theorem 4.3. Given any $\epsilon \ge 0$, use $\phi_\ell$ and $\tau_\ell$ defined in (8) and the function $R^{Q,\sigma}(T)$ defined in Theorem 4.2. In Algorithm OG-LUCB-R, suppose $\sigma^{(\ell)} = \langle S^{(\ell)}_0, S^{(\ell)}_1, \cdots, S^{(\ell)}_{m^{(\ell)}} \rangle$ is the sequence selected by the end of the $\ell$-th epoch of OG-LUCB$_{\epsilon,\delta}$, where $m^{(\ell)}$ is its length.
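As a side note, the epoch schedule in (8) is easy to sanity-check numerically. The following is a minimal Python sketch (the function names `phi`, `tau`, and `final_epoch` are ours, not from the paper):

```python
import math

def phi(l):
    """Duration of epoch l >= 1: phi_l = e^(2^l), as in (8)."""
    return math.e ** (2 ** l)

def tau(l):
    """Accumulated time after epoch l: tau_0 = 0, tau_l = sum_{s=1}^l phi_s."""
    return sum(phi(s) for s in range(1, l + 1))

def final_epoch(T):
    """Final epoch K(T): the unique K with tau_{K-1} < T <= tau_K."""
    K = 1
    while tau(K) < T:
        K += 1
    return K
```

Note that the schedule squares each epoch's length, $\phi_{\ell+1} = \phi_\ell^2$, so only $K(T) = O(\log \log T)$ restarts are needed to cover any horizon $T$.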
For any time $T$, denote the final epoch as $K = K(T)$ such that $\tau_{K-1} < T \le \tau_K$; then the $\epsilon$-quasi regret satisfies $R^Q(T) \le \sum_{\ell=1}^{K} R^{Q,\sigma^{(\ell)}}(\phi_\ell) = O\left(\frac{W m \Delta_{\max}}{\max\{\Delta^2, \epsilon^2\}} \log T\right)$, where $\Delta$ is the minimum unit gap.

5 Lower Bound on the Greedy Regret

Consider the problem of selecting one element from each of $m$ bandit instances, where the player sequentially collects a prize at every phase. For simplicity, we call it the prize-collecting problem, defined as follows. For each bandit instance $i = 1, 2, \ldots, m$, denote the set $E_i = \{e_{i,1}, e_{i,2}, \ldots, e_{i,W}\}$ of size $W$. The accessible set system is defined as $(E, \mathcal{F})$, where $E = \bigcup_{i=1}^{m} E_i$, $\mathcal{F} = \bigcup_{i=1}^{m} \mathcal{F}_i \cup \{\emptyset\}$, and $\mathcal{F}_i = \{S \subseteq E : |S| = i, \forall k: 1 \le k \le i, |S \cap E_k| = 1\}$. The reward function $f : \mathcal{F} \times \Omega \to [0, m]$ is non-decreasing in the first parameter, and the form of $f$ is unknown to the player. Let the minimum unit gap be $\Delta := \min\{f(g^*_S|S) - f(e|S) : \forall S \in \mathcal{F}, \forall e \in N^-(S)\} > 0$, where its value is also unknown to the player. The objective of the player is to minimize the greedy regret.
Denote the greedy sequence as $\sigma^G = \langle G_0, G_1, \cdots, G_m \rangle$, and the greedy arms as $A^G = \{g^*_{G_{i-1}}|G_{i-1} : \forall i = 1, 2, \cdots, m\}$. We say an algorithm is consistent if the total number of plays of all arms $a \in \mathcal{A} \setminus A^G$ is in $o(T^\eta)$ for any $\eta > 0$, i.e., $\mathbb{E}[\sum_{a \in \mathcal{A} \setminus A^G} N_T(a)] = o(T^\eta)$.
Theorem 5.1. For any consistent algorithm, there exists a problem instance of the prize-collecting problem such that, as time $T$ tends to $\infty$, for any minimum unit gap $\Delta \in (0, \frac{1}{4})$ with $\Delta^2 \ge 3W\xi^{m-1}$ for some constant $\xi \in (0, 1)$, the greedy regret satisfies $R^G(T) = \Omega\left(\frac{mW \ln T}{\Delta^2}\right)$.

We remark that the detailed problem instance and the greedy regret can be found in Theorem E.2 of the supplementary material. Furthermore, we may also restrict the maximum gap $\Delta_{\max}$ to $\Theta(1)$, yielding the lower bound $R^G(T) = \Omega\left(\frac{mW \Delta_{\max} \ln T}{\Delta^2}\right)$ for any sufficiently large $T$. For the upper bound, OG-UCB (Theorem 3.1) gives $R^G(T) = O\left(\frac{mW \Delta_{\max}}{\Delta^2} \log T\right)$. Thus, the upper bound of OG-UCB matches the lower bound within a constant factor.

Acknowledgments
Jian Li was supported in part by the National Basic Research Program of China grants 2015CB358700, 2011CBA00300, 2011CBA00301, and the National NSFC grants 61202009, 61033001, 61361136003.

References
[1] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In COLT, 2010.
[2] J.-Y. Audibert, S. Bubeck, and G. Lugosi. Minimax policies for combinatorial prediction games. arXiv preprint arXiv:1105.4871, 2011.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.
[4] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
[5] A. Björner and G. M. Ziegler. Introduction to greedoids. Matroid Applications, 40:284–357, 1992.
[6] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. arXiv preprint arXiv:1204.5721, 2012.
[7] S. Bubeck, R. Munos, and G. Stoltz.
Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852, 2011.
[8] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. Journal of Computer and System Sciences, 78(5):1404–1422, 2012.
[9] S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In NIPS, 2014.
[10] W. Chen, Y. Wang, and Y. Yuan. Combinatorial multi-armed bandit: General framework and applications. In ICML, 2013.
[11] V. Chvatal. A greedy heuristic for the set-covering problem. Mathematics of Operations Research, 4(3):233–235, 1979.
[12] V. Gabillon, B. Kveton, Z. Wen, B. Eriksson, and S. Muthukrishnan. Adaptive submodular maximization in bandit setting. In NIPS, 2013.
[13] Y. Gai, B. Krishnamachari, and R. Jain. Learning multiuser channel allocations in cognitive radio networks: A combinatorial multi-armed bandit formulation. In DySPAN. IEEE, 2010.
[14] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. arXiv preprint arXiv:1102.2490, 2011.
[15] P. Helman, B. M. Moret, and H. D. Shapiro. An exact characterization of greedy structures. SIAM Journal on Discrete Mathematics, 6(2):274–283, 1993.
[16] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, 2012.
[17] D. Kempe, J. Kleinberg, and É. Tardos. Maximizing the spread of influence through a social network. In SIGKDD, 2003.
[18] B. Korte and L. Lovász. Greedoids and linear objective functions. SIAM Journal on Algebraic Discrete Methods, 5(2):229–238, 1984.
[19] J. B. Kruskal. On the shortest spanning subtree of a graph and the traveling salesman problem. Proceedings of the American Mathematical Society, 7(1):48–50, 1956.
[20] B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson. Matroid bandits: Fast combinatorial optimization with learning. arXiv preprint arXiv:1403.5045, 2014.
[21] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari. Tight regret bounds for stochastic combinatorial semi-bandits. arXiv preprint arXiv:1410.0949, 2014.
[22] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[23] T. Lin, B. Abrahao, R. Kleinberg, J. Lui, and W. Chen. Combinatorial partial monitoring game with linear feedback and its applications. In ICML, 2014.
[24] B. Mirzasoleiman, A. Badanidiyuru, A. Karbasi, J. Vondrak, and A. Krause. Lazier than lazy greedy. In Proc. Conference on Artificial Intelligence (AAAI), 2015.
[25] R. C. Prim. Shortest connection networks and some generalizations. Bell System Technical Journal, 36(6):1389–1401, 1957.
[26] M. Streeter and D. Golovin. An online algorithm for maximizing submodular functions. In NIPS, 2009.