{"title": "Phased Exploration with Greedy Exploitation in Stochastic Combinatorial Partial Monitoring Games", "book": "Advances in Neural Information Processing Systems", "page_first": 2433, "page_last": 2441, "abstract": "Partial monitoring games are repeated games where the learner receives feedback that might be different from adversary's move or even the reward gained by the learner. Recently, a general model of combinatorial partial monitoring (CPM) games was proposed \\cite{lincombinatorial2014}, where the learner's action space can be exponentially large and adversary samples its moves from a bounded, continuous space, according to a fixed distribution. The paper gave a confidence bound based algorithm (GCB) that achieves $O(T^{2/3}\\log T)$ distribution independent and $O(\\log T)$ distribution dependent regret bounds. The implementation of their algorithm depends on two separate offline oracles and the distribution dependent regret additionally requires existence of a unique optimal action for the learner. Adopting their CPM model, our first contribution is a Phased Exploration with Greedy Exploitation (PEGE) algorithmic framework for the problem. Different algorithms within the framework achieve $O(T^{2/3}\\sqrt{\\log T})$ distribution independent and $O(\\log^2 T)$ distribution dependent regret respectively. Crucially, our framework needs only the simpler ``argmax'' oracle from GCB and the distribution dependent regret does not require existence of a unique optimal action. Our second contribution is another algorithm, PEGE2, which combines gap estimation with a PEGE algorithm, to achieve an $O(\\log T)$ regret bound, matching the GCB guarantee but removing the dependence on size of the learner's action space. However, like GCB, PEGE2 requires access to both offline oracles and the existence of a unique optimal action. 
Finally, we discuss how our algorithm can be efficiently applied to a CPM problem of practical interest: namely, online ranking with feedback at the top.", "full_text": "Phased Exploration with Greedy Exploitation in\n\nStochastic Combinatorial Partial Monitoring Games\n\nSougata Chaudhuri\nDepartment of Statistics\n\nUniversity of Michigan Ann Arbor\n\nsougata@umich.edu\n\nAmbuj Tewari\nDepartment of Statistics and Department of EECS\n\nUniversity of Michigan Ann Arbor\n\ntewaria@umich.edu\n\nAbstract\n\nPartial monitoring games are repeated games where the learner receives feedback that might be different from the adversary's move or even the reward gained by the learner. Recently, a general model of combinatorial partial monitoring (CPM) games was proposed [1], where the learner's action space can be exponentially large and the adversary samples its moves from a bounded, continuous space, according to a fixed distribution. The paper gave a confidence bound based algorithm (GCB) that achieves O(T^{2/3} log T) distribution independent and O(log T) distribution dependent regret bounds. The implementation of their algorithm depends on two separate offline oracles, and the distribution dependent regret additionally requires existence of a unique optimal action for the learner. Adopting their CPM model, our first contribution is a Phased Exploration with Greedy Exploitation (PEGE) algorithmic framework for the problem. Different algorithms within the framework achieve O(T^{2/3} √(log T)) distribution independent and O(log^2 T) distribution dependent regret respectively. Crucially, our framework needs only the simpler "argmax" oracle from GCB, and the distribution dependent regret does not require existence of a unique optimal action. 
Our second contribution is another algorithm, PEGE2, which combines gap estimation with a PEGE algorithm, to achieve an O(log T) regret bound, matching the GCB guarantee but removing the dependence on the size of the learner's action space. However, like GCB, PEGE2 requires access to both offline oracles and the existence of a unique optimal action. Finally, we discuss how our algorithm can be efficiently applied to a CPM problem of practical interest: namely, online ranking with feedback at the top.\n\n1 Introduction\n\nPartial monitoring (PM) games are repeated games played between a learner and an adversary over discrete time points. At every time point, the learner and adversary each simultaneously select an action, from their respective action sets, and the learner gains a reward, which is a function of the two actions. In PM games, the learner receives limited feedback, which might neither be the adversary's move (full information games) nor the reward gained (bandit games). In stochastic PM games, the adversary generates actions which are independent and identically distributed according to a distribution fixed before the start of the game and unknown to the learner. The learner's objective is to develop a learning strategy that incurs low regret over time, based on the feedback received during the course of the game. Regret is defined as the difference between the cumulative reward of the learner's strategy and that of the best fixed learner's action in hindsight. 
The usual learning strategies in online games combine some form of exploration (getting feedback on certain learner's actions) and exploitation (playing the perceived optimal action based on current estimates).\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nStarting with early work in the 2000s [2, 3], the study of finite PM games reached a culmination point with a comprehensive and complete classification [4]. We refer the reader to these works for more references and also note that newer results continue to appear [5]. Finite PM games restrict both the learner's and adversary's action spaces to be finite, with a very general feedback model. All finite partial monitoring games can be classified into one of four categories, with minimax regret Θ(T), Θ(T^{2/3}), Θ(T^{1/2}) and Θ(1). The classification is governed by global and local observability properties pertaining to a game [4]. Another line of work has extended the traditional multi-armed bandit problem (MAB) [6] to include combinatorial action spaces for the learner (CMAB) [7, 8]. The combinatorial action space can be exponentially large, rendering traditional MAB algorithms, designed for small finite action spaces, impractical, with regret bounds scaling with the size of the action space. The CMAB algorithms exploit a finite subset of base actions, which are specific to the structure of the problem at hand, leading to practical algorithms and regret bounds that do not scale with, or scale very mildly with, the size of the learner's action space.\nWhile finite PM and CMAB problems have witnessed a lot of activity, there is only one paper [1] on combinatorial partial monitoring (CPM) games, to the best of our knowledge. In that paper, the authors combined the combinatorial aspect of CMAB with the limited feedback aspect of finite PM games, to develop a CPM model. 
The model extended PM games to include combinatorial\naction spaces for learner, which might be exponentially large, and in\ufb01nite action spaces for the\nadversary. Neither of these situations can be handled by generic algorithms for \ufb01nite PM games.\nSpeci\ufb01cally, the model considered an action space X for the learner, that has a small subset of actions\nde\ufb01ning a global observable set (see Assumption 2 in Section 2). The adversary\u2019s action space is\na continuous, bounded vector space with the adversary sampling moves from a \ufb01xed distribution\nover the vector space. The reward function considered is a general non-linear function of learner\u2019s\nand adversary\u2019s actions, with some restrictions (see Assumptions 1 & 3 in Section 2). The model\nincorporated a linear feedback mechanism where the feedback received is a linear transformation of\nadversary\u2019s move. Inspired by the classic con\ufb01dence bound algorithms for MABs, such as UCB [6],\nthe authors proposed a Global Con\ufb01dence Bound (GCB) algorithm that enjoyed two types of regret\nbound. The \ufb01rst one was a distribution independent O(T 2/3 log T ) regret bound and the second\none was a distribution dependent O(log T ) regret bound. A distribution dependent regret bound\ninvolves factors speci\ufb01c to the adversary\u2019s \ufb01xed distribution, while distribution independent means\nthe regret bound holds over all possible distributions in a broad class of distributions. Both bounds\nalso had a logarithmic dependence on |X|. The algorithm combined online estimation with two\nof\ufb02ine computational oracles. The \ufb01rst oracle \ufb01nds the action(s) achieving maximum value of reward\nfunction over X , for a particular adversary action (argmax oracle), and the second oracle \ufb01nds the\naction(s) achieving second maximum value of reward function over X , for a particular adversary\naction (arg-secondmax oracle). 
Moreover, the distribution dependent regret bound requires existence of a unique optimal learner action. The inspiration for the CPM model came from various applications like crowdsourcing and matching problems like matching products with customers.\nOur Contributions. We adopt the CPM model proposed earlier [1]. However, instead of using upper confidence bound techniques, our work is motivated by another classic technique developed for MABs, namely that of forced exploration. This technique was already used in the classic paper of Robbins [9] and has also been called "forcing with certainty equivalence" in the control theory literature [10]. We develop a Phased Exploration with Greedy Exploitation (PEGE) algorithmic framework (Section 3), borrowing the PEGE terminology from work on linearly parameterized bandits [11]. When the framework is instantiated with different parameters, it achieves O(T^{2/3} √(log T)) distribution independent and O(log^2 T) distribution dependent regret. Significantly, the framework combines online estimation with only the argmax oracle from GCB, which is a practical advantage over requiring an additional arg-secondmax oracle. Moreover, the distribution dependent regret does not require existence of a unique optimal action. Uniqueness of the optimal action can be an unreasonable assumption, especially in the presence of a combinatorial action space. Our second contribution is another algorithm, PEGE2 (Section 4), that combines a PEGE algorithm with gap estimation, to achieve a distribution dependent O(log T) regret bound, thus matching the GCB regret guarantee in terms of T and gap. Here, gap refers to the difference between the expected rewards of the optimal and second optimal learner's actions. However, like GCB, PEGE2 does require access to both the oracles and existence of a unique optimal action for O(log T) regret, and its regret is never larger than O(T^{2/3} log T) when there is no unique optimal action. 
A crucial advantage of PEGE and PEGE2 over GCB is that all our regret bounds are independent of |X|, only depending on the size of the small global observable set. Thus, though we have adopted the CPM model [1], our regret bounds are meaningful for countably infinite or even continuous learner's action space, whereas the GCB regret bound has an explicit logarithmic dependence on |X|. We provide a detailed comparison of our work with the GCB algorithm in Section 5. Finally, we discuss how our algorithms can be efficiently applied in the CPM problem of online ranking with feedback restricted to top ranked items (Section 6), a problem already considered [12] but analyzed in a non-stochastic setting.\n\n2 Preliminaries and Assumptions\n\nThe online game is played between a learner and an adversary, over discrete rounds indexed by t = 1, 2, . . .. The learner's action set is denoted as X, which can be exponentially large. The adversary's action set is the infinite set [0, 1]^n. The adversary fixes a distribution p on [0, 1]^n before the start of the game (the adversary's strategy), with p unknown to the learner. At each round of the game, the adversary samples θ(t) ∈ [0, 1]^n according to p, with E_{θ(t)∼p}[θ(t)] = θ*_p. The learner chooses x(t) ∈ X and gets reward r(x(t), θ(t)). However, the learner might not get to know either θ(t) (as in a full information game) or r(x(t), θ(t)) (as in a bandit game). In fact, the learner receives, as feedback, a linear transformation of θ(t). That is, every action x ∈ X has an associated transformation matrix M_x ∈ R^{m_x × n}. On playing action x(t), the learner receives a feedback M_{x(t)} · θ(t) ∈ R^{m_{x(t)}}. Note that the game with the defined feedback mechanism subsumes full information and bandit games. 
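As a toy numerical sketch of the feedback mechanism just described (the vectors and matrices below are made up for illustration, not taken from the paper), the two classic special cases look like:

```python
import numpy as np

# Sketch of the linear feedback model: on playing action x, the learner
# observes M_x @ theta(t) rather than theta(t) itself. Toy instance, n = 3.
n = 3
theta = np.array([0.2, 0.7, 0.5])   # adversary's move theta(t) in [0,1]^n

# Full information: M_x = I_{n x n} for every x, so the feedback is theta.
M_full = np.eye(n)
assert np.allclose(M_full @ theta, theta)

# Bandit feedback for a linear reward r(x, theta) = x . theta: taking M_x = x
# (viewed as a 1 x n matrix) makes the feedback equal to the reward gained.
x = np.array([1.0, 0.0, 1.0])       # an illustrative learner action
M_x = x.reshape(1, n)
reward = float(x @ theta)           # r(x, theta)
assert np.allclose(M_x @ theta, reward)
```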
M_x = I_{n×n}, ∀x, makes it a full information game, since M_x · θ = θ. If r(x, θ) = x · θ, then M_x = x ∈ R^n (viewed as a 1 × n matrix) makes it a bandit game, since the feedback equals the reward. The dimension n, action space X, reward function r(·,·) and transformation matrices M_x, ∀x ∈ X, are known to the learner. The goal of the learner is to minimize the expected regret, which, for a given time horizon T, is:\n\nR(T) = T · max_{x∈X} E[r(x, θ)] − Σ_{t=1}^{T} E[r(x(t), θ(t))]   (1)\n\nwhere the expectation in the first term is taken over θ, w.r.t. distribution p, and the second expectation is taken over θ and possible randomness in the learner's algorithm.\nAssumption 1. (Restriction on Reward Function) The first assumption is that E_{θ∼p}[r(x, θ)] = r̄(x, θ*_p), for some function r̄(·,·). That is, the expected reward is a function of x and θ*_p, which is always satisfied if r(x, θ) is a linear function of θ, or if distribution p happens to be any distribution with support [0, 1]^n and fully parameterized by its mean θ*_p. With this assumption, the expected regret becomes:\n\nR(T) = T · r̄(x*, θ*_p) − Σ_{t=1}^{T} E[r̄(x(t), θ*_p)].   (2)\n\nFor distribution dependent regret bounds, we define gaps in expected rewards: Let x* ∈ S(θ*_p) = argmax_{x∈X} r̄(x, θ*_p). Then Δ_x = r̄(x*, θ*_p) − r̄(x, θ*_p), Δ_max = max{Δ_x : x ∈ X} and Δ = min{Δ_x : x ∈ X, Δ_x > 0}.\nAssumption 2. (Existence of Global Observable Set) The second assumption is on the existence of a global observable set, which is a subset of the learner's action set and is required for estimating an adversary's move θ. 
The global observable set is defined as follows: for a set of actions σ = {x_1, x_2, . . . , x_{|σ|}} ⊆ X, let their transformation matrices be stacked in a top down fashion to obtain a (Σ_{i=1}^{|σ|} m_{x_i}) × n dimensional matrix M_σ. σ is said to be a global observable set if M_σ has full column rank, i.e., rank(M_σ) = n. Then, the Moore-Penrose pseudoinverse M_σ^+ satisfies M_σ^+ M_σ = I_{n×n}.\nWithout the assumption on the existence of a global observable set, it might be the case that even if the learner plays all actions in X on the same θ, the learner might not be able to recover θ (as M_σ^+ M_σ = I_{n×n} will not hold without the full rank assumption). In that case, the learner might not be able to distinguish between θ*_{p_1} and θ*_{p_2}, corresponding to two different adversary strategies. Then, with non-zero probability, the learner can suffer Ω(T) regret, and no learner strategy can guarantee a regret sub-linear in T (this intuition forms the base of the global observability condition in [2]). Note that the size of the global observable set is small, i.e., |σ| ≤ n. A global observable set can be found by including an action x in σ if it strictly increases the rank of M_σ, till the rank reaches n. There can, of course, be more than one global observable set.\nAssumption 3. (Lipschitz Continuity of Expected Reward Function) The third assumption is on the Lipschitz continuity of the expected reward function in its second argument. 
More precisely, it is assumed that ∃ R > 0 such that ∀ x ∈ X, for any θ_1 and θ_2, |r̄(x, θ_1) − r̄(x, θ_2)| ≤ R ||θ_1 − θ_2||_2. This assumption is reasonable since otherwise, a small error in the estimation of the mean reward vector θ*_p can introduce a large change in expected reward, leading to difficulty in controlling regret over time. The Lipschitz condition holds trivially for expected reward functions which are linear in the second argument. The continuity assumption, along with the fact that the adversary's moves are in [0, 1]^n, implies boundedness of the expected reward for any learner's action and any adversary's action. We denote R_max = max_{x∈X, θ∈[0,1]^n} r̄(x, θ).\nThe three assumptions above will be made throughout. However, the fourth assumption will only be made in a subset of our results.\nAssumption 4. (Unique Optimal Action) The optimal action x* = argmax_{x∈X} r̄(x, θ*_p) is unique. Denote a second best action (which may not be unique) by x*_− = argmax_{x∈X, x≠x*} r̄(x, θ*_p). Note that Δ = r̄(x*, θ*_p) − r̄(x*_−, θ*_p).\n\n3 Phased Exploration with Greedy Exploitation\n\nAlgorithm 1 (PEGE) uses the classic idea of doing exploration in phases that are successively further apart from each other. In between exploration phases, we select an action greedily by completely trusting the current estimates. The constant β controls how much we explore in a given phase, and the constant α, along with the function C(·), determines how much we exploit. 
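As a small, self-contained illustration of this phase schedule (with made-up |σ| and tuning constant, not the authors' code), the exploration and exploitation lengths in phase b under the two instantiations used later in Theorems 1 and 2 look like:

```python
import math

# Sketch of the PEGE phase schedule: in phase b, each of the |sigma| actions
# in the global observable set is played b**beta times, and the greedy action
# is then played for exp(C(b**alpha)) rounds.
def phase_lengths(b, size_sigma, alpha, beta, C):
    explore = size_sigma * round(b ** beta)
    exploit = round(math.exp(C(b ** alpha)))
    return explore, exploit

# Theorem 1 instantiation: C(a) = log a, alpha = 1/2, beta = 0 gives constant
# exploration per phase and exploitation growing like sqrt(b).
assert phase_lengths(16, 3, 0.5, 0.0, math.log) == (3, 4)

# Theorem 2 instantiation: C(a) = h * a, alpha = 1, beta = 1 (h an
# illustrative tuning parameter) gives exploration growing linearly in b and
# exploitation growing exponentially in b.
h = 0.5
explore, exploit = phase_lengths(4, 3, 1.0, 1.0, lambda a: h * a)
assert explore == 12 and exploit == round(math.exp(2.0))
```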
This idea is classic in the bandit literature [9-11] but has not been applied to the CPM framework, to the best of our knowledge.\n\nAlgorithm 1 The PEGE Algorithmic Framework\n1: Inputs: α, β and function C(·) (to determine amount of exploration/exploitation in each phase).\n2: For b = 1, 2, . . . ,\n3: Exploration\n4: For i = 1 to |σ| (σ is the global observable set)\n5: For j = 1 to b^β\n6: Let t_{j,i} = t and θ(t_{j,i}, b) = θ(t), where t is the current time point\n7: Play x_i ∈ σ and get feedback M_{x_i} · θ(t_{j,i}, b) ∈ R^{m_{x_i}}.\n8: End For\n9: End For\n10: Estimation\n11: θ̃_{j,i} = M_σ^+ (M_{x_1} · θ(t_{j,1}, i), . . . , M_{x_{|σ|}} · θ(t_{j,|σ|}, i)) ∈ R^n.\n12: θ̂(b) = (Σ_{i=1}^{b} Σ_{j=1}^{i^β} θ̃_{j,i}) / (Σ_{j=1}^{b} j^β) ∈ R^n.\n13: x(b) ∈ argmax_{x∈X} r̄(x, θ̂(b)).\n14: Exploitation\n15: For i = 1 to exp(C(b^α))\n16: Play x(b).\n17: End For\n18: End For\n\nIt is easy to see that the estimators in Algorithm 1 have the following properties: E_p[θ̃_{j,i}] = M_σ^+ (M_{x_1} · θ*_p, . . . , M_{x_{|σ|}} · θ*_p) = M_σ^+ M_σ · θ*_p = θ*_p, and hence E_p[θ̂] = θ*_p. Using the fact that M_σ^+ = (M_σ^T M_σ)^{-1} M_σ^T, we also have the following bound on the estimation error of θ*_p:\n\n||θ̃_{j,i} − θ*_p||_2 ≤ ||M_σ^+ (M_{x_1} · θ(t_{j,1}, i), . . . , M_{x_{|σ|}} · θ(t_{j,|σ|}, i)) − M_σ^+ M_σ · θ*_p||_2 = ||(M_σ^T M_σ)^{-1} Σ_{k=1}^{|σ|} M_{x_k}^T M_{x_k} · (θ(t_{j,k}, i) − θ*_p)||_2 ≤ √n Σ_{k=1}^{|σ|} ||(M_σ^T M_σ)^{-1} M_{x_k}^T M_{x_k}||_2 =: β_σ   (3)\n\nwhere the constant β_σ defined above depends only on the structure of the linear transformation matrices of the global observable set and not on the adversary strategy p.\nOur first result is about the regret of Algorithm 1 when, within phase number b, the exploration part spends |σ| rounds (constant w.r.t. b) and the exploitation part grows polynomially with b.\nTheorem 1. (Distribution Independent Regret) When Algorithm 1 is initialized with the parameters C(a) = log a, α = 1/2 and β = 0, and the online game is played over T rounds, we get the following bound on expected regret:\n\nR(T) ≤ R_max |σ| T^{2/3} + 2R β_σ T^{2/3} √(log 2e^2 + 2 log T) + R_max   (4)\n\nwhere β_σ is the constant as defined in Eq. (3).\n\nOur next result is about the regret of Algorithm 1 when, within phase number b, the exploration part spends |σ| · b rounds (linearly increasing with b) and the exploitation part grows exponentially with b.\nTheorem 2. 
(Distribution Dependent Regret) When Algorithm 1 is initialized with the parameters C(a) = h · a, for a tuning parameter h > 0, α = 1 and β = 1, and the online game is played over T rounds, we get the following bound on expected regret:\n\nR(T) ≤ Σ_{x∈σ} Δ_x (log T / h)^2 + (4 √(2π) e^2 R Δ_max β_σ / Δ) · e^{h^2 (2R^2 β_σ^2) / Δ^2}.   (5)\n\nSuch an explicit bound for a PEGE algorithm, one that is polylogarithmic in T and explicitly states the multiplicative and additive constants involved, is not known, to the best of our knowledge, even in the bandit literature (e.g., earlier bounds [10] are asymptotic), whereas here we prove it in the CPM setting. Note that the additive constant above, though finite, blows up exponentially fast as Δ → 0 for a fixed h. It is well behaved, however, if the tuning parameter h is on the same scale as Δ. This line of thought motivates us to estimate the gap to within constant factors and then feed that estimate into a PEGE algorithm. This is what we will do in the next section.\n\n4 Combining Gap Estimation with PEGE\n\nAlgorithm 2 tries to estimate the gap Δ to within a constant multiplicative factor. However, if there is no unique optimal action, or when the true gap is small, gap estimation can take a very large amount of time. To prevent that from happening, the algorithm also takes in a threshold T_0 as input and definitely stops if the threshold is reached. The result below assures us that, with high probability, the algorithm behaves as expected. That is, if there is a unique optimal action and the gap is large enough to be estimated with a given confidence before the threshold T_0 kicks in, it will output an estimate Δ̂ in the range [0.5Δ, 1.5Δ]. 
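A minimal sketch of this stopping behavior, with illustrative placeholder values for R, β_σ and δ (and pretending the empirical gap has already converged to the true gap), shows how exploration continues until the confidence width has shrunk below the gap:

```python
import math

# Sketch of the stopping idea behind gap estimation: stop as soon as the
# empirical gap between best and second-best actions exceeds 6*w(b), where
# w(b) (Eq. (6)) shrinks like sqrt(log(b)/b). Constants are placeholders.
def w(b, R, beta_sigma, delta):
    return math.sqrt(R**2 * beta_sigma**2
                     * math.log(4 * math.e**2 * b**2 / delta) / b)

R, beta_sigma, delta = 1.0, 1.0, 0.1
true_gap = 0.8  # pretend the empirical gap equals the true gap

b = 1
while true_gap <= 6 * w(b, R, beta_sigma, delta):
    b += 1  # keep exploring: the confidence width is still too large

# On stopping, the width is small relative to the gap, which is what makes
# the output a constant-factor estimate of the true gap (Theorem 3).
assert true_gap > 6 * w(b, R, beta_sigma, delta)
assert true_gap <= 6 * w(b - 1, R, beta_sigma, delta)
```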
On the other hand, if there is no unique optimal action, it does not generate an estimate of Δ and instead runs out of the exploration budget T_0.\nTheorem 3. (Gap Estimation within Constant Factors) Let T_0 ≥ 1 and δ ∈ (0, 1), and define T_1(δ) = (256 R^2 β_σ^2 / Δ^2) log(512 e^2 R^2 β_σ^2 / (Δ^2 δ)), T_2(δ) = (16 R^2 β_σ^2 / Δ^2) log(4e^2 / δ). Consider Algorithm 2 run with\n\nw(b) = √( R^2 β_σ^2 log(4e^2 b^2 / δ) / b ).   (6)\n\nThen, the following 3 claims hold.\n\n1. Suppose Assumption 4 holds and T_1(δ) < T_0. Then, with probability at least 1 − δ, Algorithm 2 stops in T_1(δ) episodes and outputs an estimate Δ̂ that satisfies (1/2) Δ ≤ Δ̂ ≤ (3/2) Δ.\n2. Suppose Assumption 4 holds and T_0 ≤ T_1(δ). Then, with probability at least 1 − δ, the algorithm either outputs "threshold exceeded" or outputs an estimate Δ̂ that satisfies (1/2) Δ ≤ Δ̂ ≤ (3/2) Δ. Furthermore, if it outputs Δ̂, it must be the case that the algorithm stopped at an episode b such that T_2(δ) < b < T_0.\n3. Suppose Assumption 4 fails. Then, with probability at least 1 − δ, Algorithm 2 stops in T_0 episodes and outputs "threshold exceeded".\n\nAlgorithm 2 Algorithm for Gap Estimation\n1: Inputs: T_0 (exploration threshold) and δ (confidence parameter)\n2: For b = 1, 2, . . . ,\n3: Exploration\n4: For i = 1 to |σ|\n5: (Denote) t_i = t and θ(t_i, b) = θ(t) (t is the current time point).\n6: Play x_i ∈ σ and get feedback M_{x_i} · θ(t_i, b) ∈ R^{m_{x_i}}.\n7: End For\n8: Estimation\n9: θ̃_b = M_σ^+ (M_{x_1} · θ(t_1, b), . . . , M_{x_{|σ|}} · θ(t_{|σ|}, b)) ∈ R^n.\n10: θ̂(b) = (Σ_{i=1}^{b} θ̃_i) / b ∈ R^n.\n11: Stopping Rule (w(b) is defined as in Eq. (6))\n12: If argmax_{x∈X} r̄(x, θ̂(b)) is unique:\n13: x̂(b) = argmax_{x∈X} r̄(x, θ̂(b))\n14: x̂_−(b) = argmax_{x∈X, x≠x̂(b)} r̄(x, θ̂(b)) (need not be unique)\n15: If r̄(x̂(b), θ̂(b)) − r̄(x̂_−(b), θ̂(b)) > 6w(b):\n16: STOP and output Δ̂ = r̄(x̂(b), θ̂(b)) − r̄(x̂_−(b), θ̂(b))\n17: End If\n18: End If\n19: If b > T_0:\n20: STOP and output "threshold exceeded"\n21: End If\n22: End For\n\nEquipped with Theorem 3, we are now ready to combine Algorithm 2 with Algorithm 1 to give Algorithm 3. Algorithm 3 first calls Algorithm 2. If Algorithm 2 outputs an estimate Δ̂, it is fed into Algorithm 1. If the threshold T_0 is exceeded, then the remaining time is spent in pure exploitation. Note that by choosing T_0 to be of order T^{2/3}, we can guarantee a worst case regret of the same order even when the unique optimality assumption fails. For PM games that are globally observable but not locally observable, such a distribution independent O(T^{2/3}) bound is known to be optimal [4].\nTheorem 4. (Regret Bound for PEGE2) Consider Algorithm 3 run with knowledge of the number T of rounds. Consider the distribution independent bound\n\nB_1(T) = 2 (2R β_σ |σ|^2 R_max^2 T)^{2/3} √(log(4e^2 T^3)) + R_max,\n\nand the distribution dependent bound\n\nB_2(T) = (R_max |σ| + Σ_{x∈σ} Δ_x) · (256 R^2 β_σ^2 / Δ^2) log(512 e^2 R^2 β_σ^2 T / Δ^2) + 36 R^2 β_σ^2 log T / Δ^2 + 8 e^2 R^2 β_σ^2 / Δ^2 + R_max.\n\nIf Assumption 4 fails, then the expected regret of Algorithm 3 is bounded as R(T) ≤ B_1(T). 
If Assumption 4 holds, then the expected regret of Algorithm 3 is bounded as\n\nR(T) ≤ { B_2(T) if T_1(δ) < T_0 ; O(T^{2/3} log T) if T_0 ≤ T_1(δ) },   (7)\n\nwhere T_1(δ) is as defined in Theorem 3 and δ, T_0 are as defined in Algorithm 3.\n\nIn the above theorem, note that T_1(δ) scales as Θ((1/Δ^2) log(T/Δ^2)) and T_0 as Θ(T^{2/3}). Thus, the two cases in Eq. (7) correspond to large gap and small gap situations respectively.\n\nAlgorithm 3 Algorithm Combining PEGE with Gap Estimation (PEGE2)\n1: Input: T (total number of rounds)\n2: Call Algorithm 2 with inputs T_0 = (2R β_σ T / (|σ| R_max))^{2/3} and δ = 1/T\n3: If Algorithm 2 returns "threshold exceeded":\n4: Let θ̂(T_0) be the latest estimate of θ*_p maintained by Algorithm 2\n5: Play x̂(T_0) = argmax_{x∈X} r̄(x, θ̂) for the remaining T − T_0 |σ| rounds\n6: Else:\n7: Let Δ̂ be the gap estimate produced by Algorithm 2\n8: For all remaining time steps, run Algorithm 1 with parameters C(a) = ha, with h = Δ̂^2 / (9R^2 β_σ^2), α = 1, β = 0\n9: End If\n\n5 Comparison with GCB Algorithm\n\nWe provide a detailed comparison of our results with those obtained for GCB [1]. (a) While we use the same CPM model, our solution is inspired by the forced exploration technique while GCB is inspired by the confidence bound technique, both of which are classic in the bandit literature. (b) One instantiation of our PEGE framework gives an O(T^{2/3} √(log T)) distribution independent regret bound (Theorem 1), which does not require a call to the arg-secondmax oracle. 
This is of substantial\npractical advantage over GCB since even for linear optimization problems over polyhedra, standard\nroutines usually do not have option of computing action(s) that achieve second maximum value\nfor the objective function. (c) Another instantiation of the PEGE framework gives an O(log2 T )\ndistribution dependent regret bound (Theorem 2), which neither requires call to arg-secondmax oracle\nnor the assumption of existence of unique optimal action for learner. This is once again important,\nsince the assumption of existence of unique optimal action might be impractical, especially for\nexponentially large action space. However, the caveat is that improper setting of the tuning parameter\nh in Theorem 2 can lead to an exponentially large additive component in the regret. (d) A crucial\npoint, which we had highlighted in the beginning, is that the regret bounds achieved by PEGE and\nPEGE2 do not have dependence on size of learner\u2019s action space, i.e., |X|. The dependence is only on\nthe size of global observable set \u03c3, which is guaranteed to be not more than dimension of adversary\u2019s\naction space. Thus, though we have adopted the CPM model [1], our algorithms achieve meaningful\nregret bounds for countably in\ufb01nite or even continuous learner\u2019s action space. In contrast, the GCB\nregret bounds have explicit, logarithmic dependence on size of learner\u2019s action space. Thus, their\nresults cannot be extended to problems with in\ufb01nite learner\u2019s action space (see Section 6 for an\nexample), and are restricted to large, but \ufb01nite action spaces. (e) The PEGE2 algorithm is a true\nanalogue of the GCB algorithm, matching the regret bounds of GCB in terms of T and gap \u2206 with\nthe advantage that it has no dependence on |X|. The disadvantage, however, is that PEGE2 requires\nknowledge of time horizon T , while GCB is an anytime algorithm. 
It remains an open problem to design an algorithm that combines the strengths of PEGE2 and GCB.\n\n6 Application to Online Ranking\n\nA recent paper studied the problem of online ranking with feedback restricted to top ranked items [12]. The problem was studied in a non-stochastic setting, i.e., it was assumed that an oblivious adversary generates reward vectors. Moreover, the learner's action space was exponentially large in the number of items to be ranked. The paper made the connection of the problem setting to PM games (but not combinatorial PM games) and proposed an efficient algorithm for the specific problem at hand. However, a careful reading of the paper shows that their algorithmic techniques can handle the CPM model we have discussed so far, but in the non-stochastic setting. The reward function is linear in both learner's and adversary's moves, the adversary's move is restricted to a finite space of vectors, and feedback is a linear transformation of the adversary's move. In this section, we give a brief description of the problem setting and show how our algorithms can be used to efficiently solve the problem of online ranking with feedback on top ranked items in the stochastic setting. We also give an example of how the ranking problem setting can be somewhat naturally extended to one which has a continuous action space for the learner, instead of a large but finite action space.\nThe paper considered an online ranking problem, where a learner repeatedly re-ranks a set of n fixed items, to satisfy diverse users' preferences, who visit the system sequentially. Each learner action x is a permutation of the n items. Each user has a like/dislike preference for each item, varying between users, with each user's preferences encoded as an n length binary relevance vector θ. 
Once the ranked list of items is presented to the user, the user scans through the items but gives relevance feedback only on the top ranked item. However, the learner's performance is judged on the full ranked list and the unrevealed, full relevance vector. Thus, we have a PM game, where neither the adversary-generated relevance vector nor the reward is revealed to the learner. The paper showed how a number of practical ranking measures, like Discounted Cumulative Gain (DCG), can be expressed as a linear function, i.e., r(x, θ) = f(x) · θ. The practical motivation of the work was learning a ranking strategy that satisfies diverse user preferences under limited feedback, with the feedback restricted due to user burden constraints and privacy concerns.

Online Ranking with Feedback at Top as a Stochastic CPM Game. We show that online ranking with feedback on top ranked items is a specific instance of the CPM model and that our key assumptions are satisfied, so our algorithms can be applied. The learner's action space is the finite but exponentially large space of X = n! permutations. The adversary's move is an n-dimensional relevance vector, and thus is restricted to {0, 1}^n (a finite space of size 2^n) contained in [0, 1]^n. In the stochastic setting, we can assume that the adversary samples θ ∈ {0, 1}^n from a fixed distribution on that space. Since the feedback on playing a permutation is the relevance of the top ranked item, each move x has an associated transformation matrix (here a vector) M_x ∈ {0, 1}^n, with 1 in the position of the item ranked at the top by x and 0 everywhere else. Thus, M_x · θ gives the relevance of the item ranked at the top by x. The global observable set σ is any set of n actions in which each action, in turn, puts a distinct item on top.
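To make the reduction concrete, the objects above can be sketched in Python (the helper names and the DCG discount are our own illustration, not from the paper): f(x) is the vector of per-item DCG discounts induced by permutation x, M_x is the indicator of the top-ranked item, M_x · θ is the only feedback the learner observes, and f(x) · θ is the unobserved reward.

```python
import numpy as np

def dcg_features(perm):
    """f(x) for DCG: the item at rank k (1-based) receives discount 1/log2(k+1).
    perm[j] is the item placed at rank j+1."""
    f = np.zeros(len(perm))
    for j, item in enumerate(perm):
        f[item] = 1.0 / np.log2(j + 2)
    return f

def feedback_vector(perm):
    """M_x: indicator vector of the top-ranked item, so M_x . theta is the
    relevance feedback on the item ranked first."""
    M = np.zeros(len(perm))
    M[perm[0]] = 1.0
    return M

# One toy round with n = 4 items.
perm = [2, 0, 3, 1]              # learner's move: item 2 on top, then 0, 3, 1
theta = np.array([1, 0, 1, 0])   # adversary's binary relevance vector
reward = dcg_features(perm) @ theta       # unobserved reward r(x, theta) = f(x) . theta
feedback = feedback_vector(perm) @ theta  # observed: relevance of the top item only
```

Stacking the feedback vectors of n actions that each place a distinct item on top yields an n × n permutation matrix, i.e., M_σ.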
Hence, M_σ is an n × n permutation matrix. Assumption 1 is satisfied because the reward function is linear in θ and r̄(x, θ*_p) = f(x) · θ*_p, where E_p[θ] = θ*_p ∈ [0, 1]^n. Assumption 2 is satisfied since a global observable set of size n always exists and can be found easily. In fact, there will be multiple global observable sets, with the freedom to choose any one of them. Assumption 3 is satisfied because the expected reward function is linear in its second argument. The Lipschitz constant is max_{x ∈ X} ||f(x)||_2, which is always less than some small polynomial factor of n, depending on the specific f(·). The value of β_σ is easily seen to be n^{3/2}. The argmax oracle returns the permutation that simply sorts the items according to their corresponding θ values. The arg-secondmax oracle is more complicated, though feasible: it requires first sorting the items according to θ, then comparing each pair of consecutive items to find where the least drop in reward value occurs, and switching the corresponding items.

Likely Failure of the Unique Optimal Action Assumption. Assumption 4 is unlikely to hold in this problem setting (though it is of course theoretically possible). The mean relevance vector θ*_p effectively reflects the average preference of all users for each of the n items. It is very likely that at least a few items will not be liked by anyone and will thus always be ranked at the bottom. Equally possible is that two items have the same user preference on average and can be exchanged without hurting the optimal ranking. Thus, the existence of a unique optimal ranking, which requires every item to have a different average user preference from every other item, is unlikely.
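The two sorting-based oracles described above can be sketched as follows (a minimal illustration with our own function names; the discount vector stands in for the per-rank weights of f, assumed strictly decreasing):

```python
import numpy as np

def argmax_oracle(theta):
    """Optimal permutation: sort items by theta in descending order.
    Returns perm with perm[j] = item placed at rank j+1."""
    return list(np.argsort(-np.asarray(theta, dtype=float), kind="stable"))

def arg_secondmax_oracle(theta, discount):
    """Second-best permutation for reward sum_j discount[j] * theta[perm[j]]:
    start from the sorted order and swap the adjacent pair whose exchange
    causes the least drop in reward."""
    best = argmax_oracle(theta)
    theta = np.asarray(theta, dtype=float)
    # Swapping ranks j and j+1 lowers the reward by
    # (discount[j] - discount[j+1]) * (theta[best[j]] - theta[best[j+1]]) >= 0.
    drops = [
        (discount[j] - discount[j + 1]) * (theta[best[j]] - theta[best[j + 1]])
        for j in range(len(best) - 1)
    ]
    j = int(np.argmin(drops))          # adjacent swap with the least reward drop
    second = list(best)
    second[j], second[j + 1] = second[j + 1], second[j]
    return second

theta = [0.9, 0.2, 0.5]
discount = [1.0, 1.0 / np.log2(3), 0.5]  # strictly decreasing per-rank weights
best = argmax_oracle(theta)
second = arg_secondmax_oracle(theta, discount)
```

Note that when two θ values tie, the least drop is zero and the oracle returns a distinct permutation with the same reward, which is exactly the situation in which the unique optimal action assumption fails.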
When Assumption 4 fails, the PEGE algorithm can still be applied to get poly-logarithmic regret (Theorem 2), but GCB will only achieve O(T^{2/3} log T) regret.

A PM Game with Infinite Learner Action Space. We give a simple modification of the ranking problem above to show how the learner can have a continuous action space. The learner now ranks the items by producing an n-dimensional score vector x ∈ [0, 1]^n and sorting the items according to their scores. Thus the learner's action space is now an uncountably infinite, continuous space. As before, the user sees the ranked list and gives relevance feedback on the top ranked item. The learner's performance is now judged by a continuous loss function, instead of a discrete-valued ranking measure, since its moves lie in a continuous space. Consider the simplest loss, viz., the squared "loss" r(x, θ) = −||x − θ||²_2 (note the negative sign, which keeps the reward interpretation). Since the relevance vectors θ are in {0, 1}^n, so that θ_i² = θ_i, it is easy to see that r̄(x, θ*_p) = E_{θ∼p}[r(x, θ)] = −||x||²_2 + 2x · θ*_p − 1 · θ*_p. Thus, the Lipschitz condition is satisfied. The global observable set is still of size n, with the n actions being any n score vectors whose sorted orders place each of the n items, in turn, on top. β_σ remains the same as before, and argmax_x E_{θ∼p}[r(x, θ)] = E_{θ∼p}[θ] = θ*_p. Both PEGE and PEGE2 can achieve meaningful regret bounds for this problem, while GCB cannot.

Acknowledgements

We acknowledge the support of NSF via grants IIS 1452099 and CCF 1422157.

References

[1] Tian Lin, Bruno Abrahao, Robert Kleinberg, John Lui, and Wei Chen. Combinatorial partial monitoring game with linear feedback and its applications.
In Proceedings of the 31st International Conference on Machine Learning, pages 901–909. ACM, 2014.

[2] Antonio Piccolboni and Christian Schindelhauer. Discrete prediction games with arbitrary feedback and loss. In Proceedings of the 14th Annual Conference on Computational Learning Theory, pages 208–223. Springer, 2001.

[3] Nicolò Cesa-Bianchi, Gábor Lugosi, and Gilles Stoltz. Regret minimization under partial monitoring. Mathematics of Operations Research, pages 562–580, 2006.

[4] Gábor Bartók et al. Partial monitoring–classification, regret bounds, and algorithms. Mathematics of Operations Research, 39(4):967–997, 2014.

[5] Junpei Komiyama, Junya Honda, and Hiroshi Nakagawa. Regret lower bound and optimal algorithm in finite stochastic partial monitoring. In Advances in Neural Information Processing Systems, pages 1783–1791, 2015.

[6] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[7] Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and applications. In Proceedings of the 30th International Conference on Machine Learning, pages 151–159, 2013.

[8] Branislav Kveton, Zheng Wen, Azin Ashkan, and Csaba Szepesvári. Tight regret bounds for stochastic combinatorial semi-bandits. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pages 535–543, 2015.

[9] Herbert Robbins. Some aspects of the sequential design of experiments. In Herbert Robbins Selected Papers, pages 169–177. Springer, 1985.

[10] Rajeev Agrawal and Demosthenis Teneketzis. Certainty equivalence control with forcing: revisited. Systems & Control Letters, 13(5):405–412, 1989.

[11] Paat Rusmevichientong and John N. Tsitsiklis. Linearly parameterized bandits.
Mathematics of Operations Research, 35(2):395–411, 2010.

[12] Sougata Chaudhuri and Ambuj Tewari. Online ranking with top-1 feedback. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 129–137. ACM, 2015.

[13] Thomas P. Hayes. A large-deviation inequality for vector-valued martingales. Combinatorics, Probability and Computing, 2005.