{"title": "Identifying Outlier Arms in Multi-Armed Bandit", "book": "Advances in Neural Information Processing Systems", "page_first": 5204, "page_last": 5213, "abstract": "We study a novel problem lying at the intersection of two areas: multi-armed bandit and outlier detection. Multi-armed bandit is a useful tool to model the process of incrementally collecting data for multiple objects in a decision space. Outlier detection is a powerful method to narrow down the attention to a few objects after the data for them are collected. However, no one has studied how to detect outlier objects while incrementally collecting data for them, which is necessary when data collection is expensive. We formalize this problem as identifying outlier arms in a multi-armed bandit. We propose two sampling strategies with theoretical guarantee, and analyze their sampling efficiency. Our experimental results on both synthetic and real data show that our solution saves 70-99% of data collection cost from baseline while having nearly perfect accuracy.", "full_text": "Identifying Outlier Arms in Multi-Armed Bandit \u2217\n\nHonglei Zhuang1\u2020\n\nChi Wang2\n\nYifan Wang3\n\nhzhuang3@illinois.edu\n\nwang.chi@microsoft.com\n\nyifan-wa16@mails.tsinghua.edu.cn\n\n1University of Illinois at Urbana-Champaign\n\n2Microsoft Research, Redmond\n\n3Tsinghua University\n\nAbstract\n\nWe study a novel problem lying at the intersection of two areas: multi-armed bandit\nand outlier detection. Multi-armed bandit is a useful tool to model the process\nof incrementally collecting data for multiple objects in a decision space. Outlier\ndetection is a powerful method to narrow down the attention to a few objects after\nthe data for them are collected. However, no one has studied how to detect outlier\nobjects while incrementally collecting data for them, which is necessary when data\ncollection is expensive. We formalize this problem as identifying outlier arms in a\nmulti-armed bandit. 
We propose two sampling strategies with theoretical guarantee,\nand analyze their sampling ef\ufb01ciency. Our experimental results on both synthetic\nand real data show that our solution saves 70-99% of data collection cost from\nbaseline while having nearly perfect accuracy.\n\n1\n\nIntroduction\n\nA multi-armed bandit models a set of items (arms), each associated with an unknown probability\ndistribution of rewards. An observer can iteratively select an item and request a sample reward from\nits distribution. This model has been predominant in modeling a broad range of applications, such\nas cold-start recommendation [24], crowdsourcing [13] etc. In some applications, the objective is to\nmaximize the collected rewards while playing the bandit (exploration-exploitation setting [7, 5, 23]);\nin others, the goal is to identify an optimal object among multiple candidates (pure exploration\nsetting [6]).\nIn the pure exploration setting, rich literature is devoted to the problem of identifying the top-K arms\nwith largest reward expectations [8, 15, 20]. We consider a different scenario, in which one is more\nconcerned about \u201coutlier arms\u201d with extremely high/low expectation of rewards that substantially\ndeviate from others. Such arms are valuable as they usually provide novel insight or imply potential\nerrors.\nFor example, suppose medical researchers are testing the effectiveness of a biomarker X (e.g.,\nthe existence of a certain gene sequence) in distinguishing several different diseases with similar\n\n\u2217The authors would like to thank anonymous reviewers for their helpful comments.\n\u2020Part of this work was done while the \ufb01rst author was an intern at Microsoft Research. The \ufb01rst author\nwas sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. 
W911NF-09-\n2-0053 (NSCTA), National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, and grant\n1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K)\ninitiative (www.bd2k.nih.gov). The views and conclusions contained in this document are those of the author(s)\nand should not be interpreted as representing the of\ufb01cial policies of the U.S. Army Research Laboratory or the\nU.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government\npurposes notwithstanding any copyright notation hereon.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fsymptoms. They need to perform medical tests (e.g., gene sequencing) on patients with each disease\nof interest, and observe if X\u2019s degree of presence is signi\ufb01cantly higher in a certain disease than other\ndiseases. In this example, a disease can be modeled as an arm. The researchers can iteratively select\na disease with which they sample a patient and perform the medical test to observe the presence\nof X. The reward is 1 if X is fully present, and 0 if fully absent. To make sure the biomarker is\nuseful, researchers look for the disease with an extremely high expectation of reward compared\nto other diseases, instead of merely searching for the disease with the highest reward expectation.\nThe identi\ufb01cation of \u201coutlier\u201d diseases is required to be suf\ufb01ciently accurate (e.g., correct with 99%\nprobability). Meanwhile, it should be achieved with a minimal number of medical tests in order\nto save the cost. Hence, a good sampling strategy needs to be developed to both guarantee the\ncorrectness and save cost.\nAs a generalization of the above example, we study a novel problem of identifying outlier arms in\nmulti-armed bandits. We de\ufb01ne the criterion of outlierness by extending an established rule of thumb,\n3\u03c3 rule. 
The detection of such outliers requires calculating an outlier threshold that depends on the\nmean reward of all arms, and outputting the arms whose expected reward lies above that threshold.\nWe specifically study pure exploration strategies in a fixed confidence setting, which aims to output\nthe correct results with probability no less than 1 − δ.\nExisting methods for top-K arm identification cannot be directly applied, mainly because the number\nof outliers is unknown a priori. The problem also differs from the thresholding bandit problem [26],\nas the outlier threshold depends on the (unknown) reward configuration of all the arms, and hence\nalso needs to be explored. Given the outlierness criterion, the key challenges in tackling this problem\nare: i) how to guarantee that the identified outlier arms truly satisfy the criterion; and ii) how to design\nan efficient sampling strategy which balances the trade-off between exploring individual arms and\nexploring the outlier threshold.\nIn this paper, we make the following major contributions:\n• We propose a Round-Robin sampling algorithm, with a theoretical guarantee of its correctness as\nwell as a theoretical upper bound of its total number of pulls.\n• We further propose an improved algorithm, Weighted Round-Robin, with the same correctness\nguarantee and a better upper bound on its total number of pulls.\n• We verify our algorithms on both synthetic and real datasets. Our Round-Robin algorithm has\nnear 100% accuracy, while reducing the cost of a competitive baseline by up to 99%. Our Weighted\nRound-Robin algorithm further reduces the cost by around 60%, with even smaller error.\n\n2 Related Work\n\nWe present studies related to our problem in different areas.\nMulti-armed bandit. Multi-armed bandit is an extensively studied topic. 
A classic setting is to\nregard the feedback of pulling an arm as a reward and aim to optimize the exploration-exploitation\ntrade-off [7, 5, 23]. In an alternative setting, the goal is to identify an optimal object at a small cost,\nand the cost is related to the number of pulls rather than the feedback. This is the "pure exploration"\nsetting [6]. Early work dates back to the 1950s under the subject of sequential design of experiments [27].\nRecent applications in crowdsourcing, big data-driven experimentation, etc., have revitalized this field.\nThe problem we study also falls into the general category of pure exploration bandits.\nWithin this category, a number of studies focus on best arm identification [4, 6, 14, 15], as well\nas finding top-K arms [8, 15, 20]. These studies focus on designing algorithms with a probabilistic\nguarantee of finding the correct top-K arms, and on improving the number of pulls required by the\nalgorithm. Typical cases of study include: (a) fixed confidence, in which the algorithm needs to return the correct\ntop-K arms with probability above a threshold; (b) fixed budget, in which the algorithm needs to\nmaximize the probability of correctness within a certain number of pulls. While there are promising\nadvances in recent theoretical work, optimal algorithms in general cases remain an open problem.\nFinding top-K arms is different from finding outlier arms, because top arms are not necessarily\noutliers. Yet the analysis methods are useful and inspiring for our study.\nThere are also studies [26, 11] on the thresholding bandit problem, where the aim is to find the set of\narms whose expected rewards are larger than a given threshold. 
However, since the outlier threshold\ndepends on the unknown expected rewards of all the arms, these algorithms cannot be applied to our\nproblem.\nSome studies [12, 16] propose a generalized objective to find the set of arms with the largest sum\nof reward expectations under a given combinatorial constraint. The constraint is independent of the\nrewards (e.g., the set must have K elements). Our problem is different, as the outlier constraint\ndepends on the reward configuration of all the arms.\nA few studies on clustering bandits [17, 22] aim to identify the internal cluster structure between arms.\nTheir objective is different from outlier detection. Moreover, they do not study a pure-exploration\nscenario.\nCarpentier and Valko [9] propose the notion of "extreme bandits" to detect a different kind of outlier:\nThey look for extreme values of individual rewards from each pull. Using the medical example\nin Section 1, the goal can be interpreted as finding a patient with an extremely high concentration of a\nbiomarker. With that goal, the arm with the heaviest tail in its distribution is favored, because it is\nmore likely to generate extremely large rewards than other arms. In contrast, our objective is to find\narms with extremely large expectations of rewards.\nOutlier detection. Outlier detection has been studied for decades [10, 18]. Most existing work\nfocuses on finding outlier data points among observed data points in a dataset. We do not aim to find\noutlier data points among the observed data points (rewards). Instead, we look for the outlier arms\nwhich generate these rewards. Also, these rewards are not provided to the algorithm at the beginning,\nand the algorithm needs to proactively pull each arm to obtain more reward samples.\nSampling techniques have been used in detecting outlier data points among observed data points, with very\ndifferent purposes. 
In [1], outlier detection is reduced to a classification problem and an active\nlearning algorithm is proposed to selectively sample data points for training the outlier detector.\nIn [28, 29], a subset of data points is uniformly sampled to accelerate the outlier detector. Kollios et al.\n[21] propose a biased sampling strategy. Zimek et al. [30] and Liu et al. [25] use subsampling techniques\nto introduce diversity in order to apply ensemble methods for better outlier detection performance. In\noutlier arm identification, the purpose of sampling is to estimate the reward expectation of each arm,\nwhich is a hidden variable and can only be estimated from sampled rewards.\nThere are also studies on outlier detection when the uncertainty of data points is considered [2, 19].\nHowever, these algorithms do not attempt to actively request more information about data points to\nreduce the uncertainty, which is a different setting from our work.\n\n3 Problem Definition\n\nIn this section, we describe the problem of identifying outlier arms in a multi-armed bandit. We start\nby recalling the setting of the multi-armed bandit model.\nMulti-armed bandit. A multi-armed bandit (MAB) consists of n arms, where each arm is associated\nwith a reward distribution. The (unknown) expectation of each reward distribution is denoted by $y_i$.\nAt each iteration, the algorithm is allowed to select an arm i to play (pull), and obtain a sample reward\n$x_i^{(j)} \in \mathbb{R}$ from the corresponding distribution, where j indexes the j-th sample obtained\nfrom the i-th arm. We further use $x_i$ to represent all the samples obtained from the i-th arm, and y to\nrepresent the configuration of all the $y_i$'s.\nProblem definition. We study how to identify outlier arms with extremely high reward expectations\ncompared to other arms in the bandit. 
To define "outlier arms", we adopt a general statistical rule\nnamed k-sigma: the arms with reward expectations higher than the mean plus k standard deviations\nof all arms are considered outliers. Formally, we define the mean of all the n arms' reward\nexpectations as well as their standard deviation as:\n\n$$\mu_y = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \sigma_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (y_i - \mu_y)^2}$$\n\nWe define a threshold function based on the above quantities as:\n\n$$\theta = \mu_y + k\sigma_y$$\n\nAn arm i is defined as an outlier arm iff $y_i > \theta$, and is defined as a normal (non-outlier) arm iff $y_i < \theta$.\nWe denote the set of outlier arms as $\Omega = \{i \in [n] \mid y_i > \theta\}$.\nIn a multi-armed bandit setting, the value of $y_i$ for each arm is unknown. Instead, the system needs\nto pull one arm at each iteration to obtain a sample, and estimate the value $y_i$ for each arm and the\nthreshold $\theta$ from all the obtained samples $x_i, \forall i$. We introduce the following estimators:\n\n$$\hat{y}_i = \frac{1}{m_i}\sum_{j} x_i^{(j)}, \quad \hat{\mu}_y = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i, \quad \hat{\sigma}_y = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - \hat{\mu}_y)^2}, \quad \hat{\theta} = \hat{\mu}_y + k\hat{\sigma}_y$$\n\nwhere $m_i$ is the number of times arm i has been pulled.\nWe focus on the fixed confidence setting. The objective is to design an efficient pulling algorithm,\nsuch that the algorithm returns the true set of outlier arms $\Omega$ with probability at least 1 − δ (δ is\nusually a small constant). The fewer the total number of pulls, i.e. $T = \sum_i m_i$, the better, because each\npull has an economic or time cost. 
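To make the criterion concrete, here is a minimal sketch of the k-sigma rule applied to a known configuration y (in the bandit setting the y_i are unknown and must be replaced by the estimates defined above; the function names are ours):

```python
import math

def ksigma_threshold(y, k):
    """Outlier threshold theta = mu_y + k * sigma_y over the arms'
    (estimated) reward expectations y, per the k-sigma rule."""
    n = len(y)
    mu = sum(y) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in y) / n)  # population form, as in the definition
    return mu + k * sigma

def outlier_set(y, k):
    """Indices of arms whose expectation exceeds the threshold."""
    theta = ksigma_threshold(y, k)
    return {i for i, v in enumerate(y) if v > theta}

# Toy configuration: five comparable arms and one that deviates strongly.
y = [0.30, 0.32, 0.28, 0.31, 0.29, 0.90]
print(outlier_set(y, k=2))  # -> {5}
```

Note that the threshold moves with every y_i, which is exactly why exploring any single arm also sharpens the outlier criterion for all the others.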
Note that this is a pure exploration setting, i.e., the reward incurred\nduring exploration is irrelevant to the cost.\n\n4 Algorithms\n\nIn this section, we propose several algorithms, and present the theoretical guarantee of each.\n\n4.1 Round-Robin Algorithm\n\nThe simplest algorithm is to pull arms in a round-robin way. That is, the algorithm starts from\narm 1, pulls arms 2, 3, ··· in turn, and goes back to arm 1 after it has iterated over all n arms.\nThe process continues until a certain termination condition is met.\nIntuitively, the algorithm should terminate when it is confident about whether each arm is an outlier.\nWe achieve this by using the confidence interval of each arm's reward expectation as well as the\nconfidence interval of the outlier threshold. If the significance levels of these intervals are carefully\nset, and each reward expectation's confidence interval has no overlap with the threshold's confidence\ninterval, we can safely terminate the algorithm while guaranteeing correctness with the desired high\nprobability. In the following, we first discuss the formal definition of the confidence intervals, as well as\nhow to set the significance levels. Then we present the formal termination condition.\n\nConfidence intervals. We provide a general definition of confidence intervals for $\hat{y}_i$ and $\hat{\theta}$. 
The\nconfidence interval for $\hat{y}_i$ at significance level $\delta'$ is defined as $[\hat{y}_i - \beta_i(m_i, \delta'),\ \hat{y}_i + \beta_i(m_i, \delta')]$, such\nthat:\n\n$$P(\hat{y}_i - y_i > \beta_i(m_i, \delta')) < \delta', \quad \text{and} \quad P(\hat{y}_i - y_i < -\beta_i(m_i, \delta')) < \delta'$$\n\nSimilarly, the confidence interval for $\hat{\theta}$ at significance level $\delta'$ is defined as $[\hat{\theta} - \beta_\theta(m, \delta'),\ \hat{\theta} + \beta_\theta(m, \delta')]$, such that:\n\n$$P(\hat{\theta} - \theta > \beta_\theta(m, \delta')) < \delta', \quad \text{and} \quad P(\hat{\theta} - \theta < -\beta_\theta(m, \delta')) < \delta'$$\n\nThe concrete form of the confidence interval may vary with the reward distribution associated with each\narm. We defer the discussion of its concrete form to Section 4.3.\nIn our algorithm, we update the significance level $\delta'$ for the above confidence intervals at each\niteration. After T pulls, $\delta'$ should be set as:\n\n$$\delta' = \frac{6\delta}{\pi^2 (n+1) T^2} \qquad (1)$$\n\nIn the following discussion, we omit the parameters in $\beta_i$ and $\beta_\theta$ when they are clear from the context.\nActive arms. At any time, if $\hat{y}_i$'s confidence interval overlaps with $\hat{\theta}$'s confidence interval, then the\nalgorithm cannot confidently tell whether arm i is an outlier or a normal arm. We call such arms active,\nand vice versa. Formally, an arm i is active, denoted as ACTIVE_i = TRUE, iff\n\n$$\begin{cases} \hat{y}_i - \beta_i < \hat{\theta} + \beta_\theta, & \text{if } \hat{y}_i > \hat{\theta}; \\ \hat{y}_i + \beta_i > \hat{\theta} - \beta_\theta, & \text{otherwise.} \end{cases} \qquad (2)$$\n\nWe denote the set of active arms as $A = \{i \in [n] \mid \text{ACTIVE}_i = \text{TRUE}\}$. With this definition, the\ntermination condition is simply $A = \emptyset$. 
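These ingredients can be sketched as follows. The significance schedule follows Eq. (1) and the active-arm test follows Eq. (2); `beta_hoeffding` uses the bounded-reward form discussed later in Section 4.3, and all function names are ours:

```python
import math

def delta_prime(delta, n, T):
    """Significance level after T pulls, per Eq. (1)."""
    return 6.0 * delta / (math.pi ** 2 * (n + 1) * T ** 2)

def beta_hoeffding(R, m, dp):
    """Hoeffding-style radius for one arm pulled m times,
    assuming rewards bounded in an interval of width R."""
    return R * math.sqrt(math.log(1.0 / dp) / (2.0 * m))

def is_active(y_hat, beta_i, theta_hat, beta_theta):
    """Eq. (2): an arm is active iff its interval overlaps the threshold's."""
    if y_hat > theta_hat:
        return y_hat - beta_i < theta_hat + beta_theta
    return y_hat + beta_i > theta_hat - beta_theta

# An arm estimated well above the threshold with tight intervals is inactive...
print(is_active(0.9, 0.05, 0.6, 0.05))  # -> False
# ...but widening its interval makes it active again.
print(is_active(0.9, 0.35, 0.6, 0.05))  # -> True
```

The sampling algorithms below stop as soon as `is_active` is false for every arm, which is precisely the termination condition A = ∅.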
When this condition is met, we return the result set:\n\n$$\hat{\Omega} = \{i \mid \hat{y}_i > \hat{\theta}\} \qquad (3)$$\n\nThe algorithm is outlined in Algorithm 1.\n\nAlgorithm 1: Round-Robin Algorithm (RR)\nInput: n arms, outlier parameter k\nOutput: A set $\hat{\Omega}$ of outlier arms\n1 Pull each arm i once, ∀i ∈ [n]; // Initialization\n2 T ← n;\n3 Update $\hat{y}_i$, $m_i$, $\beta_i$, ∀i ∈ [n], and $\hat{\theta}$, $\beta_\theta$;\n4 i ← 1;\n5 while A ≠ ∅ do // Round-robin\n6     i ← i % n + 1;\n7     Pull arm i;\n8     T ← T + 1;\n9     Update $\hat{y}_i$, $m_i$, $\beta_i$ and $\hat{\theta}$, $\beta_\theta$;\n10 return $\hat{\Omega}$ according to Eq. (3);\n\nTheoretical results. We first show that if the algorithm terminates with no active arms, the returned\noutlier set will be correct with high probability.\nTheorem 1 (Correctness). With probability 1 − δ, if the algorithm terminates after a certain number\nof pulls T when there are no active arms, i.e., A = ∅, then the returned set of outliers will be\ncorrect, i.e., $\hat{\Omega} = \Omega$.\n\nWe can also provide an upper bound on the efficiency of the algorithm in the specific case when all the\nreward distributions are bounded within [a, b], where b − a = R. In this case, the confidence intervals\ncan be instantiated as discussed in Section 4.3, and we accordingly obtain the following result:\nTheorem 2. 
With probability 1 − δ, the total number of pulls T needed for the algorithm to terminate\nis bounded by\n\n$$T \le 8R^2 H_{RR}\left[\log\left(\frac{2R^2\pi^2(n+1)H_{RR}}{3\delta}\right) + 1\right] + 4n \qquad (4)$$\n\nwhere\n\n$$H_{RR} = H_1\left(1 + \sqrt{l(k)}\right)^2, \quad H_1 = \frac{n}{\min_{i\in[n]}(y_i - \theta)^2}, \quad l(k) = \left[\sqrt{\frac{(1 + k\sqrt{n-1})^2}{n}} + \sqrt{\frac{k^2}{2\log\left(\frac{\pi^2 n^3}{6\delta}\right)}}\right]^2$$\n\n4.2 Weighted Round-Robin Algorithm\n\nThe round-robin algorithm evenly distributes resources to all the arms. Intuitively, active arms deserve\nmore pulls than inactive arms, since the algorithm is already almost sure about whether an inactive arm is\nan outlier.\nBased on this idea, we propose an improved algorithm. We allow the algorithm to sample the active\narms ρ times as often as inactive arms, where ρ ≥ 1 is a real constant. Since ρ is not necessarily an\ninteger, we use a method similar to stride scheduling to guarantee that the ratio between the numbers of pulls\nof active and inactive arms is approximately ρ in the long run. The algorithm still pulls by iterating\nover all the arms. However, after each arm is pulled, the algorithm can decide either to stay at this\narm for a few "extra pulls," or proceed to the next arm. If the arm pulled at the T-th iteration is the\nsame as the arm pulled at the (T − 1)-th iteration, we call the T-th pull an "extra pull." Otherwise,\nwe call it a "regular pull." We keep a counter $c_i$ for each arm i. When T > n, after the algorithm\nperforms a regular pull on arm i, we add ρ to the counter $c_i$. If this arm is still active, we keep pulling\nthis arm until $m_i \ge c_i$ or it becomes inactive. 
Otherwise we proceed to the next arm to perform the\nnext regular pull.\nThis algorithm is named Weighted Round-Robin (WRR), and outlined in Algorithm 2.\n\nAlgorithm 2: Weighted Round-Robin Algorithm (WRR)\nInput: n arms, outlier parameter k, ρ\nOutput: A set $\hat{\Omega}$ of outlier arms\n1 Pull each arm i once, ∀i ∈ [n]; // Initialization\n2 T ← n;\n3 Update $\hat{y}_i$, $m_i$, $\beta_i$, ∀i ∈ [n], and $\hat{\theta}$, $\beta_\theta$;\n4 $c_i$ ← 0, ∀i ∈ [n];\n5 i ← 1;\n6 while A ≠ ∅ do\n7     i ← i % n + 1; // Next regular pull\n8     $c_i$ ← $c_i$ + ρ;\n9     repeat\n10        Pull arm i;\n11        T ← T + 1;\n12        Update $\hat{y}_i$, $m_i$, $\beta_i$ and $\hat{\theta}$, $\beta_\theta$;\n13    until i ∉ A ∨ $m_i \ge c_i$;\n14 return $\hat{\Omega}$ according to Eq. (3);\n\nTheoretical results. Since the Weighted Round-Robin algorithm has the same termination condition,\naccording to Theorem 1, it has the same correctness guarantee.\nWe can also bound the total number of pulls needed for this algorithm when the reward distributions\nare bounded.\nTheorem 3. With probability 1 − δ, the total number of pulls T needed for the Weighted Round-Robin\nalgorithm to terminate is bounded by\n\n$$T \le 8R^2 H_{WRR}\left[\log\left(\frac{2R^2\pi^2(n+1)H_{WRR}}{3\delta}\right) + 1\right] + 2(\rho + 2)n \qquad (5)$$\n\nwhere\n\n$$H_{WRR} = \left(\frac{H_1}{\rho} + \frac{(\rho - 1)H_2}{\rho}\right)\left(1 + \sqrt{l(k)\rho}\right)^2, \quad H_2 = \sum_i \frac{1}{(y_i - \theta)^2}$$\n\nDetermining ρ. One important parameter in this algorithm is ρ. For bounded reward distributions,\nwe have a closed-form upper bound on T of $O(H_{WRR} \log \frac{H_{WRR}}{\delta})$. The lower bound on T is independent\nof ρ. We conjecture the lower bound to be $\Omega(H_2 \log \frac{H_2}{\delta})$. We aim to find the ρ that minimizes the\ngap between the upper bound and the lower bound. We formalize the objective as finding a ρ to
We formalize the objective as \ufb01nding a \u03c1 to\nminimize HWRR/H2. Since we do not know the reward distribution con\ufb01guration y, we use the\nminimax principle to \ufb01nd \u03c1\u2217 that optimizes the most dif\ufb01cult con\ufb01guration y, namely\n\n\u03b4\n\nSince H1\noptimal value \u03c1\u2217 as\n\nn \u2264 H2 \u2264 H1, and HWRR\n\nH2\n\n\u03c1\u2217 = argmin\n\u03c1\u22651\n\nsup\n\ny\n\nHWRR\n\nH2\n\nis monotonically increasing with regard to H1\nH2\n\n, we can obtain the\n\n\u03c1\u2217 =\n\n3\n\n(n \u2212 1) 2\nl 1\n3 (k)\n\n(6)\n\nTheoretical comparison with RR. We compare theses two algorithms by comparing their upper\nbounds. Essentially, we study HWRR/HRR since the two bounds only differ in this term after a small\n\n6\n\n\fconstant is ignored. We have\n\nHWRR\nHRR\n\n=\n\n(cid:18) 1\n\n\u03c1\n\n(cid:19)(cid:18) 1 +(cid:112)l(k)\u03c1\n1 +(cid:112)l(k)\n\n(cid:19)2\n\n+\n\n\u03c1 \u2212 1\n\u03c1\n\nH2\nH1\n\n(7)\n\nThe ratio between H2 and H1 indicates how much cost WRR will save from RR. Notice that\nH1 \u2264 1. In the degenerated case H2/H1 = 1, WRR does not save any cost from RR. This\nn \u2264 H2\n1\ncase occurs only when all arms have identical reward expectations, which is rare and not interesting.\nHowever, if H2/H1 = 1/n, by setting \u03c1 to the optimal value in Eq. (6), it is possible to save a\nsubstantial portion of pulls. In this scenario, the RR algorithm will iteratively pull all the arms until\nthe arm closest to the threshold i\u2217 con\ufb01dently determined as outlier or normal. However, the WRR\nalgorithm is able to invest more pulls on arm i\u2217 as it remains active, while pulling other arms for\nfewer times, only to obtain a more precise estimate of the outlier threshold.\n\n4.3 Con\ufb01dence Interval Instantiation\n\nWith different prior knowledge of reward distributions, con\ufb01dence intervals can be instantiated\ndifferently. 
We introduce the confidence interval for a relatively general scenario, where reward\ndistributions are bounded.\nBounded distribution. Suppose the reward distribution of each arm is bounded in [a, b], and\nR = b − a. According to Hoeffding's inequality and McDiarmid's inequality, we can derive the confidence\nintervals as\n\n$$\beta_i(m_i, \delta') = R\sqrt{\frac{1}{2m_i}\log\left(\frac{1}{\delta'}\right)}, \qquad \beta_\theta(m, \delta') = R\sqrt{\frac{l(k)}{2h(m)}\log\left(\frac{1}{\delta'}\right)}$$\n\nwhere $m_i$ is the number of pulls of arm i so far, and h(m) is the harmonic mean of all the $m_i$'s.\nBernoulli distribution. In many real applications, each arm returns a binary sample 0 or 1, drawn\nfrom a Bernoulli distribution. We use the following confidence intervals heuristically.\nWe leverage a confidence interval presented in [3], defined as\n\n$$\beta_i(m_i, \delta') = z_{\delta'/2}\sqrt{\frac{\tilde{p}(1-\tilde{p})}{m_i}}, \qquad \beta_\theta(m, \delta') = \sqrt{\sum_i \left[\left(\frac{k\hat{y}_i}{n\sqrt{\hat{\sigma}_y}}\right)^2 + \frac{1}{n}\right]\beta_i^2}$$\n\nwhere\n\n$$\tilde{p} = \frac{m_i^+ + z_{\delta'/2}^2/2}{m_i + z_{\delta'/2}^2}, \qquad z_{\delta'/2} = \operatorname{erf}^{-1}(1 - \delta'/2)$$\n\n$m_i^+$ is the number of samples equal to 1 among the $m_i$ samples, and $z_{\delta'/2}$ is the value of the inverse\nerror function.\n\n5 Experimental Results\n\nIn this section, we present experiments to evaluate both the effectiveness and efficiency of the proposed\nalgorithms.\n\n5.1 Datasets\n\nSynthetic. We construct several synthetic datasets with a varying number of arms n = 20, 50, 100, 200, and varying k = 2, 2.5, 3. There are 12 configurations in total. For each configuration, we generate 10 random test cases. 
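Since the experiments below use Bernoulli rewards, the heuristic interval of Section 4.3 is the one actually exercised. A sketch of its per-arm radius follows; the function name is ours, and we substitute the standard normal quantile for $z_{\delta'/2}$, which may differ from the paper's exact inverse-error-function constant:

```python
from statistics import NormalDist

def bernoulli_radius(successes, m, dp):
    """Agresti-Coull-style radius in the spirit of [3]: recenter at p~ and
    use a normal-approximation width. Assumption: z is the standard normal
    quantile, not the paper's erf^-1(1 - dp/2)."""
    z = NormalDist().inv_cdf(1.0 - dp / 2.0)
    p_tilde = (successes + z * z / 2.0) / (m + z * z)
    return z * (p_tilde * (1.0 - p_tilde) / m) ** 0.5

# 30 ones out of 100 pulls at significance 0.05:
radius = bernoulli_radius(successes=30, m=100, dp=0.05)
print(round(radius, 3))
```

The recentered estimate p~ keeps the radius well-behaved even when an arm has returned all zeros or all ones, which matters early on when each arm has only been pulled a few times.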
For each arm, we draw its reward from a Bernoulli\ndistribution Bern($y_i$).\nTwitter. We consider the following application of detecting outlier locations with respect to keywords\nfrom Twitter data. A user has a set of candidate regions $L = \{l_1, \cdots, l_n\}$, and is interested in finding\noutlier regions where tweets are extremely likely to contain a keyword w. In this application, each\nregion corresponds to an arm. A region has an unknown probability of generating a tweet containing\nthe keyword, which can be regarded as a Bernoulli distribution. We collect a Twitter dataset with\n1,500,000 tweets from NYC, each associated with its latitude and longitude. We divide the entire space\ninto regions of 2″ × 2″ in latitude and longitude, respectively. We select 47 regions with more than\n5,000 tweets as arms, and select 20 keywords as test cases.\n\n5.2 Setup\n\nMethods for comparison. Since the problem is new, there is no directly comparable solution in\nexisting work. We design two baselines for comparative study.\n• Naive Round-Robin (NRR). We play arms in a round-robin fashion, and terminate as soon as we\nfind that the estimated outlier set $\hat{\Omega}$ has not changed in the last 1/δ consecutive pulls. $\hat{\Omega}$ is defined\nas in Eq. (3). This baseline reflects how well the problem can be solved by RR with a heuristic\ntermination condition.\n• Iterative Best Arm Identification (IB). We apply a state-of-the-art best arm identification algorithm [12] iteratively. We first apply it to all n arms until it terminates, and then remove the best\narm and apply it to the remaining arms. We repeat this process until the current best arm is not in $\hat{\Omega}$,\nwhere the threshold function is heuristically estimated based on the current data. We then return the\ncurrent $\hat{\Omega}$. 
This is a strong baseline that leverages the existing solution to best-arm identification.\n\nWe then compare them with our two proposed algorithms, Round-Robin (RR) and Weighted Round-Robin (WRR).\nParameter configurations. For both of our algorithms, we derived the confidence intervals based\non the Bernoulli distribution. Since some algorithms take an extremely long time to terminate in certain\ncases, we place a cap on the total number of pulls. Once an algorithm runs for $10^7$ pulls, it\nis forced to terminate and output the current estimated outlier set $\hat{\Omega}$. We set δ = 0.1.\nFor each test case, we run the experiments 10 times, and take the average of both the correctness\nmetrics and the number of pulls.\n\n5.3 Results\n\nPerformance on Synthetic. Figure 1(a) shows the correctness of each algorithm when n varies. It\ncan be observed that both of our proposed algorithms achieve perfect correctness on all the test sets.\nIn comparison, the NRR baseline never achieves the desired level of correctness. Based on this\nperformance, the naive baseline NRR does not qualify as an acceptable algorithm, so we\nonly measure the efficiency of the remaining algorithms.\nWe plot the average number of pulls each algorithm takes before termination, varying with the number\nof arms n, in Figure 1(b). On all the different configurations of n, IB takes a much larger number\nof pulls than WRR and RR, which makes it one to three orders of magnitude as costly as WRR and RR. At\nthe same time, RR is also substantially slower than WRR, with the gap gradually increasing as n\nincreases. This shows that our design of additional pulls helps. Figure 1(c) further shows that in 80% of\nthe test cases, WRR saves more than 40% of the cost relative to RR; in about half of the test cases, WRR\nsaves more than 60% of the cost.\nPerformance on Twitter. Figure 2(a) shows the correctness of different algorithms on the Twitter\ndataset. 
As one can see, both of our proposed algorithms meet the correctness requirement, i.e., the\nprobability of returning the exactly correct outlier set is higher than 1 − δ. The NRR baseline is\nfar from reaching that bar. The IB baseline barely meets the bar, and the precision, recall and F1\nmeasures show that its returned result is on average a good approximation of the correct result, with an\naverage F1 metric close to 0.95. This once again confirms that IB is a strong baseline.\nWe compare the efficiency of the IB, RR and WRR algorithms in Figure 2(b). In this figure, we plot\nthe cost reduction percentage for both RR and WRR in comparison with IB. WRR is a clear winner.\nIn almost 80% of the test cases, it saves more than 50% of IB's cost, and in about 40% of the test\ncases, it saves more than 75% of IB's cost. In contrast, RR's performance is comparable to IB. In\napproximately 30% of the test cases, RR is actually slower than IB and has negative cost reduction,\nthough in another 40% of the test cases, RR saves more than 50% of IB's cost.\n\nFigure 1: Effectiveness and efficiency studies on the Synthetic data set: (a) % Exactly Correct; (b) Avg. #Pulls vs. n; (c) WRR's Cost Reduction wrt RR. Cap indicates the maximum\nnumber of pulls we allow an algorithm to run.\nFigure 2: Effectiveness and efficiency studies on the Twitter dataset: (a) Correctness comparison; (b) Cost Reduction wrt IB.\nFigure 3: Ratio between avg. #pulls with a given ρ and with ρ = ρ*.\n\nTuning ρ. In order to experimentally justify our selection of the ρ value, we test the performance of\nWRR on a specific setting of the synthetic data set (n = 15, k = 2.5) with varying preset ρ values.\nFigure 3 shows the average number of pulls over 10 test cases for each ρ in {1.5, 2, . . . , 5}, compared\nto the performance with ρ = ρ* according to Eq. (6). 
It can be observed that the performance of ρ = ρ∗ is very close to the best performance, attained at ρ = 3. A further investigation reveals that the ratio H1/H2 of these test cases varies from 3 to 14. Although we choose ρ∗ based on the extreme assumption H1/H2 = n, its average performance is found to be close to optimal even when the data do not satisfy this assumption.

6 Conclusion

In this paper, we study a novel problem of identifying outlier arms, i.e., arms whose reward expectations are extremely high/low compared to the other arms in a multi-armed bandit. We propose a Round-Robin algorithm and a Weighted Round-Robin algorithm with correctness guarantees. We also derive upper bounds on the number of pulls for both algorithms when the reward distributions are bounded. We conduct experiments on both synthetic and real data to verify our algorithms. Possible extensions of this work include deriving a lower bound for this problem and extending the problem to a PAC setting.

References
[1] N. Abe, B. Zadrozny, and J. Langford. Outlier detection by active learning. In KDD, pages 504–509. ACM, 2006.
[2] C. C. Aggarwal and P. S. Yu. Outlier detection with uncertain data. In SDM, pages 483–493. SIAM, 2008.
[3] A. Agresti and B. A. Coull. Approximate is better than "exact" for interval estimation of binomial proportions. The American Statistician, 52(2):119–126, 1998.
[4] J.-Y. Audibert and S. Bubeck. Best arm identification in multi-armed bandits. In COLT, 2010.
[5] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem.
Machine Learning, 47(2-3):235–256, 2002.
[6] S. Bubeck, R. Munos, and G. Stoltz. Pure exploration in finitely-armed and continuous-armed bandits. Theoretical Computer Science, 412(19):1832–1852, 2011.
[7] S. Bubeck, N. Cesa-Bianchi, et al. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5(1):1–122, 2012.
[8] S. Bubeck, T. Wang, and N. Viswanathan. Multiple identifications in multi-armed bandits. In ICML, pages 258–265, 2013.
[9] A. Carpentier and M. Valko. Extreme bandits. In NIPS, pages 1089–1097, 2014.
[10] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):15:1–15:58, 2009.
[11] L. Chen and J. Li. On the optimal sample complexity for best arm identification. arXiv preprint arXiv:1511.03774, 2015.
[12] S. Chen, T. Lin, I. King, M. R. Lyu, and W. Chen. Combinatorial pure exploration of multi-armed bandits. In NIPS, pages 379–387, 2014.
[13] P. Donmez, J. G. Carbonell, and J. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In KDD, pages 259–268. ACM, 2009.
[14] E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7(Jun):1079–1105, 2006.
[15] V. Gabillon, M. Ghavamzadeh, and A. Lazaric.
Best arm identification: A unified approach to fixed budget and fixed confidence. In NIPS, pages 3212–3220, 2012.
[16] V. Gabillon, A. Lazaric, M. Ghavamzadeh, R. Ortner, and P. Bartlett. Improved learning complexity in combinatorial pure exploration bandits. In AISTATS, pages 1004–1012, 2016.
[17] C. Gentile, S. Li, and G. Zappella. Online clustering of bandits. In ICML, pages 757–765, 2014.
[18] V. J. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
[19] B. Jiang and J. Pei. Outlier detection on uncertain data: Objects, instances, and inferences. In ICDE, pages 422–433. IEEE, 2011.
[20] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. In ICML, pages 655–662, 2012.
[21] G. Kollios, D. Gunopulos, N. Koudas, and S. Berchtold. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Transactions on Knowledge and Data Engineering, 15(5):1170–1187, 2003.
[22] N. Korda, B. Szörényi, and L. Shuai. Distributed clustering of linear bandits in peer to peer networks. In JMLR Workshop and Conference Proceedings, volume 48, pages 1301–1309. International Machine Learning Society, 2016.
[23] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
[24] L. Li, W. Chu, J. Langford, and R. E. Schapire. A contextual-bandit approach to personalized news article recommendation. In WWW, pages 661–670. ACM, 2010.
[25] H. Liu, Y. Zhang, B. Deng, and Y. Fu. Outlier detection via sampling ensemble. In Big Data, pages 726–735. IEEE, 2016.
[26] A. Locatelli, M. Gutzeit, and A. Carpentier. An optimal algorithm for the thresholding bandit problem.
In ICML, pages 1690–1698. JMLR.org, 2016.
[27] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
[28] M. Sugiyama and K. Borgwardt. Rapid distance-based outlier detection via sampling. In NIPS, pages 467–475, 2013.
[29] M. Wu and C. Jermaine. Outlier detection by sampling with accuracy guarantees. In KDD, pages 767–772. ACM, 2006.
[30] A. Zimek, M. Gaudet, R. J. Campello, and J. Sander. Subsampling for efficient and effective unsupervised outlier detection ensembles. In KDD, pages 428–436. ACM, 2013.