{"title": "Active Learning from Imperfect Labelers", "book": "Advances in Neural Information Processing Systems", "page_first": 2128, "page_last": 2136, "abstract": "We study active learning where the labeler can not only return incorrect labels but also abstain from labeling. We consider different noise and abstention conditions of the labeler. We propose an algorithm which utilizes abstention responses, and analyze its statistical consistency and query complexity under fairly natural assumptions on the noise and abstention rate of the labeler. This algorithm is adaptive in a sense that it can automatically request less queries with a more informed or less noisy labeler. We couple our algorithm with lower bounds to show that under some technical conditions, it achieves nearly optimal query complexity.", "full_text": "Active Learning from Imperfect Labelers\n\nSongbai Yan\n\nKamalika Chaudhuri\n\nUniversity of California, San Diego\n\nUniversity of California, San Diego\n\nyansongbai@eng.ucsd.edu\n\nkamalika@cs.ucsd.edu\n\nTara Javidi\n\nUniversity of California, San Diego\n\ntjavidi@eng.ucsd.edu\n\nAbstract\n\nWe study active learning where the labeler can not only return incorrect labels but\nalso abstain from labeling. We consider different noise and abstention conditions\nof the labeler. We propose an algorithm which utilizes abstention responses,\nand analyze its statistical consistency and query complexity under fairly natural\nassumptions on the noise and abstention rate of the labeler. This algorithm is\nadaptive in a sense that it can automatically request less queries with a more\ninformed or less noisy labeler. We couple our algorithm with lower bounds to show\nthat under some technical conditions, it achieves nearly optimal query complexity.\n\n1\n\nIntroduction\n\nIn active learning, the learner is given an input space X , a label space L, and a hypothesis class H\nsuch that one of the hypotheses in the class generates ground truth labels. 
Additionally, the learner has at its disposal a labeler to which it can pose interactive queries about the labels of examples in the input space. Note that the labeler may output a noisy version of the ground truth label (a flipped label). The goal of the learner is to learn a hypothesis in H which is close to the hypothesis that generates the ground truth labels.\n\nThere has been a significant amount of literature on active learning, both theoretical and practical. Previous theoretical work on active learning has mostly focused on the above basic setting [2, 4, 7, 10, 25] and has developed algorithms under a number of different models of label noise. A handful of exceptions include [3], which allows class conditional queries, [5], which allows requesting counterexamples to current version spaces, and [23, 26], where the learner has access to a strong labeler and one or more weak labelers.\n\nIn this paper, we consider a more general setting where, in addition to providing a possibly noisy label, the labeler can sometimes abstain from labeling. This scenario arises naturally in difficult labeling tasks and has been considered in computer vision by [11, 15]. Our goal in this paper is to investigate this problem from a foundational perspective, and explore what kind of conditions are needed, and how an abstaining labeler can affect properties such as consistency and query complexity of active learning algorithms.\n\nThe setting of active learning with an abstaining noisy labeler was first considered by [24], who looked at learning binary threshold classifiers based on queries to a labeler whose abstention rate is higher closer to the decision boundary. 
They primarily looked at the case when the abstention rate at a distance Δ from the decision boundary is less than 1 − Θ(Δ^α), and the rate of label flips at the same distance is less than 1/2 − Θ(Δ^β); under these conditions, they provided an active learning algorithm that, given parameters α and β, outputs a classifier with error ε using Õ(ε^(−α−2β)) queries to the labeler.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nHowever, there are several limitations to this work. The primary limitation is that the parameters α and β need to be known to the algorithm, which is not usually the case in practice. A second major limitation is that even if the labeler has nice properties, such as abstention rates that increase sharply close to the boundary, their algorithm is unable to exploit these properties to reduce the number of queries. A third and final limitation is that their analysis only applies to one-dimensional thresholds, and not to more general decision boundaries.\n\nIn this work, we provide an algorithm which is able to exploit nice properties of the labeler. Our algorithm is statistically consistent under very mild conditions, namely when the abstention rate is non-decreasing as we get closer to the decision boundary. Under slightly stronger conditions as in [24], our algorithm has the same query complexity. However, if the abstention rate of the labeler increases strictly monotonically close to the decision boundary, then our algorithm adapts and does substantially better. It simply exploits the increasing abstention rate close to the decision boundary, and does not even have to rely on the noisy labels! 
Specifically, when applied to the case where the noise rate is at most 1/2 − Θ(Δ^β) and the abstention rate is 1 − Θ(Δ^α) at distance Δ from the decision boundary, our algorithm can output a classifier with error ε based on only Õ(ε^(−α)) queries.\n\nAn important property of our algorithm is that the improvement in query complexity is achieved in a completely adaptive manner; unlike previous work [24], our algorithm needs no information whatsoever on the abstention rates or rates of label noise. Thus our result also strengthens existing results on active learning from (non-abstaining) noisy labelers by providing an adaptive algorithm that achieves the same performance as [6] without knowledge of noise parameters.\n\nWe extend our algorithm so that it applies to any smooth d-dimensional decision boundary in a non-parametric setting, not just one-dimensional thresholds, and we complement it with lower bounds on the number of queries that need to be made to any labeler. Our lower bounds generalize the lower bounds in [24], and show that our upper bounds are nearly optimal. We also present an example that shows that at least a relaxed version of the monotonicity property is necessary to achieve this performance gain; if the abstention rate plateaus around the decision boundary, then our algorithm needs to query and rely on the noisy labels (resulting in higher query complexity) in order to find a hypothesis close to the one generating the ground truth labels.\n\n1.1 Related work\n\nThere has been a considerable amount of work on active learning, most of which involves labelers that are not allowed to abstain. 
Theoretical work on this topic largely falls under two categories: the membership query model [6, 13, 18, 19], where the learner can request the label of any example in the instance space, and the PAC model, where the learner is given a large set of unlabeled examples from an underlying unlabeled data distribution, and can request labels of a subset of these examples. Our work, and also that of [24], builds on the membership query model.\n\nThere has also been a lot of work on active learning under different noise models. The problem is relatively easy when the labeler always provides the ground truth labels; see [8, 9, 12] for work in this setting in the PAC model, and [13] for the membership query model. Perhaps the simplest setting of label noise is random classification noise, where each label is flipped with a probability that is independent of the unlabeled instance. [14] shows how to address this kind of noise in the PAC model by repeatedly querying an example until the learner is confident of its label; [18, 19] provide more sophisticated algorithms with better query complexities in the membership query model. A second setting is when the noise rate increases closer to the decision boundary; this setting has been studied under the membership query model by [6] and in the PAC model by [4, 10, 25]. A final setting is agnostic PAC learning, where a fixed but arbitrary fraction of labels may disagree with the label assigned by the optimal hypothesis in the hypothesis class. Active learning is known to be particularly difficult in this setting; however, algorithms and associated label complexity bounds have been provided by [1, 2, 4, 10, 12, 25] among others.\n\nOur work expands on the membership query model, and our abstention and noise models are related to a variant of the Tsybakov noise condition. A setting similar to ours was considered by [6, 24]. 
[6] considers a non-abstaining labeler, and provides a near-optimal binary-search-style active learning algorithm; however, their algorithm is non-adaptive. [24] gives nearly matching lower and upper query complexity bounds for active learning with abstention feedback, but they only give a non-adaptive algorithm for learning one-dimensional thresholds, and only study the situation where the abstention rate is upper-bounded by a polynomial function. Besides [24], [11, 15] study active learning with abstention feedback in computer vision applications. However, these works are based on heuristics and do not provide any theoretical guarantees.\n\n2 Settings\n\nNotation. 1[A] is the indicator function: 1[A] = 1 if A is true, and 0 otherwise. For x = (x1, . . . , xd) ∈ R^d (d > 1), denote (x1, . . . , x_(d−1)) by x̃. Define ln x = log_e x, log x = log_(4/3) x, and [ln ln]_+(x) = ln ln max{x, e^e}. We use Õ and Θ̃ to hide logarithmic factors in 1/ε, 1/δ, and d.\n\nDefinition. Suppose γ ≥ 1. A function g : [0, 1]^(d−1) → R is (K, γ)-Hölder smooth if it is continuously differentiable up to the ⌊γ⌋-th order, and for any x, y ∈ [0, 1]^(d−1), |g(y) − Σ_(m=0)^(⌊γ⌋) (∂^m g(x)/m!) (y − x)^m| ≤ K ‖y − x‖^γ. We denote this class of functions by Σ(K, γ).\n\nWe consider active learning for binary classification. We are given an instance space X = [0, 1]^d and a label space L = {0, 1}. Each instance x ∈ X is assigned a label l ∈ {0, 1} by an underlying function h* : X → {0, 1}, unknown to the learning algorithm, in a hypothesis space H of interest. The learning algorithm has access to any x ∈ X, but no access to their labels. 
Instead, it can only obtain label information through interactions with a labeler, whose relation to h* is to be specified later. The objective of the algorithm is to sequentially select the instances to query for label information and output a classifier ĥ that is close to h* while making as few queries as possible.\n\nWe consider a non-parametric setting as in [6, 17] where the hypothesis space is the smooth boundary fragment class H = {h_g(x) = 1[x_d > g(x̃)] | g : [0, 1]^(d−1) → [0, 1] is (K, γ)-Hölder smooth}. In other words, the decision boundaries of classifiers in this class are epigraphs of smooth functions (see Figure 1 for an example). We assume h*(x) = 1[x_d > g*(x̃)] ∈ H. When d = 1, H reduces to the space of threshold functions {h_θ(x) = 1[x > θ] : θ ∈ [0, 1]}.\n\nThe performance of a classifier h(x) = 1[x_d > g(x̃)] is evaluated by the L1 distance between the decision boundaries: ‖g − g*‖ = ∫_([0,1]^(d−1)) |g(x̃) − g*(x̃)| dx̃.\n\nThe learning algorithm can only obtain label information by querying a labeler who is allowed to abstain from labeling or return an incorrect label (flipping between 0 and 1). For each query x ∈ [0, 1]^d, the labeler L will return y ∈ Y = {0, 1, ⊥} (⊥ means that the labeler abstains from providing a 0/1 label) according to some distribution P_L(Y = y | X = x). When it is clear from the context, we will drop the subscript from P_L(Y | X). Note that while the labeler can declare its indecision by outputting ⊥, we do not allow classifiers in our hypothesis space to output ⊥.\n\nIn our active learning setting, our goal is to output a boundary g that is close to g* while making as few interactive queries to the labeler as possible. 
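As a concrete illustration of this response model, the following sketch simulates an abstaining, noisy labeler for a one-dimensional threshold. The particular rate functions and the constants alpha, beta, and C are hypothetical choices consistent with the conditions introduced below; they are not quantities fixed by the paper.

```python
import random

def make_labeler(theta_star, alpha=1.0, beta=1.0, C=0.5):
    """A simulated abstaining, noisy labeler for a 1-d threshold theta_star.

    Illustrative only: the abstention rate 1 - delta**alpha and the flip rate
    0.5 * (1 - C * delta**beta), where delta is the distance to the boundary,
    are example rates of the shape the paper's conditions allow.
    """
    def labeler(x):
        delta = min(abs(x - theta_star), 1.0)   # distance to the boundary
        if random.random() < 1 - delta ** alpha:
            return None                          # None plays the role of "⊥"
        truth = 1 if x > theta_star else 0
        flip_prob = 0.5 * (1 - C * delta ** beta)
        return 1 - truth if random.random() < flip_prob else truth
    return labeler

random.seed(0)
labeler = make_labeler(theta_star=0.3)
responses = [labeler(0.9) for _ in range(1000)]  # query far from the boundary
# Far from the boundary, a majority of the non-abstention labels are correct (1),
# and the abstention rate is bounded away from 1.
```

Closer to the boundary (say x = 0.31) the same labeler abstains almost always and flips labels with probability near 1/2, which is exactly the regime the algorithm below must cope with.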
In particular, we want to find an algorithm with low query complexity Λ(ε, δ, A, L, g*), which is defined as the minimum number of queries that Algorithm A, acting on samples with ground truth g*, should make to a labeler L to ensure that the output classifier h_g(x) = 1[x_d > g(x̃)] has the property ‖g − g*‖ = ∫_([0,1]^(d−1)) |g(x̃) − g*(x̃)| dx̃ ≤ ε with probability at least 1 − δ over the responses of L.\n\n2.1 Conditions\n\nWe now introduce three conditions on the response of the labeler with increasing strictness. Later we will provide an algorithm whose query complexity improves with increasing strictness of conditions.\n\nCondition 1. The response distribution of the labeler P(Y | X) satisfies:\n\n• (abstention) For any x̃ ∈ [0, 1]^(d−1) and x_d, x′_d ∈ [0, 1], if |x_d − g*(x̃)| ≥ |x′_d − g*(x̃)|, then P(⊥ | (x̃, x_d)) ≤ P(⊥ | (x̃, x′_d));\n\n• (noise) For any x ∈ [0, 1]^d, P(Y ≠ 1[x_d > g*(x̃)] | x, Y ≠ ⊥) ≤ 1/2.\n\nCondition 1 means that the closer x is to the decision boundary (x̃, g*(x̃)), the more likely the labeler is to abstain from labeling. This complies with the intuition that instances closer to the decision boundary are harder to classify. We also assume the 0/1 labels can be flipped with probability as large as 1/2. In other words, we allow unbounded noise.\n\n[Figures 1–3 plot P(Y = ⊥ | X = x), P(Y = 1 | X = x), and P(Y = 0 | X = x) as functions of x.]\n\nFigure 1: A classifier with boundary g(x̃) = (x1 − 0.4)^2 + 0.1 for d = 2. Label 1 is assigned to the region above, 0 to the region below (red region).\n\nFigure 2: The distributions above satisfy Conditions 1 and 2, but the abstention feedback is useless since P(⊥ | x) is flat between x = 0.2 and 0.4.\n\nFigure 3: Distributions above satisfy Conditions 1, 2, and 3.\n\nCondition 2. Let C, β be non-negative constants, and f : [0, 1] → [0, 1] be a nondecreasing function. The response distribution P(Y | X) satisfies:\n\n• (abstention) P(⊥ | x) ≤ 1 − f(|x_d − g*(x̃)|);\n\n• (noise) P(Y ≠ 1[x_d > g*(x̃)] | x, Y ≠ ⊥) ≤ (1/2)(1 − C |x_d − g*(x̃)|^β).\n\nCondition 2 requires the abstention and noise probabilities to be upper-bounded, and these upper bounds decrease as x moves further away from the decision boundary. The abstention rate can be 1 at the decision boundary, so the labeler may always abstain at the decision boundary. The condition on the noise matches the popular Tsybakov noise condition [22].\n\nCondition 3. Let f : [0, 1] → [0, 1] be a nondecreasing function such that ∃ 0 < c < 1 with f(b)/f(a) ≤ 1 − c for all 0 < a ≤ 1 and 0 ≤ b ≤ (2/3)a. The response distribution satisfies P(⊥ | x) = 1 − f(|x_d − g*(x̃)|).\n\nAn example where Condition 3 holds is P(⊥ | x) = 1 − |x − 0.3|^α (α > 0).\n\nCondition 3 requires the abstention rate to increase monotonically close to the decision boundary as in Condition 1. In addition, it requires the abstention probability P(⊥ | (x̃, x_d)) not to be too flat with respect to x_d. For example, when d = 1, P(⊥ | x) = 0.68 for 0.2 ≤ x ≤ 0.4 (shown in Figure 2) does not satisfy Condition 3, and abstention responses are not informative, since this abstention rate alone yields no information on the location of the decision boundary. 
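The ratio requirement in Condition 3 can be spot-checked numerically. The sketch below is a finite-grid check, not a proof; it verifies that a polynomial rate f(x) = x^α satisfies the requirement with c = 1 − (2/3)^α, while a rate that plateaus, as in the Figure 2 example, fails it. The specific plateau function is ours, chosen for illustration.

```python
def satisfies_condition3_ratio(f, c, grid=200):
    """Spot-check Condition 3 on a finite grid: f(b)/f(a) <= 1 - c must hold
    whenever 0 < a <= 1 and 0 <= b <= (2/3)*a.  A numeric sketch, not a proof."""
    for i in range(1, grid + 1):
        a = i / grid
        for j in range(0, int((2 / 3) * a * grid) + 1):
            b = j / grid
            if b > (2 / 3) * a:
                continue
            # small tolerance so exact-equality cases are not lost to rounding
            if f(a) > 0 and f(b) / f(a) > 1 - c + 1e-12:
                return False
    return True

# f(x) = x**alpha satisfies the condition with c = 1 - (2/3)**alpha, since
# f(b)/f(a) = (b/a)**alpha <= (2/3)**alpha for b <= (2/3)*a.
alpha = 0.5
assert satisfies_condition3_ratio(lambda x: x ** alpha, c=1 - (2 / 3) ** alpha)

# A rate that plateaus fails: f is constant on a region, so the ratio there
# stays near 1 (e.g., a = 0.35, b = 0.2 gives 0.32/0.35 > 1 - c).
plateau = lambda x: max(x, 0.32)
assert not satisfies_condition3_ratio(plateau, c=1 - (2 / 3) ** alpha)
```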
In contrast, P(⊥ | x) = 1 − √|x − 0.3| (shown in Figure 3) satisfies Condition 3, and the learner could infer it is getting close to the decision boundary when it starts receiving more abstention responses.\n\nNote that here c, f, C, β are unknown and arbitrary parameters that characterize the complexity of the learning task. We want to design an algorithm that does not require knowledge of these parameters but still achieves nearly optimal query complexity.\n\n3 Learning one-dimensional thresholds\n\nIn this section, we start with the one-dimensional case (d = 1) to demonstrate the main idea. We will generalize these results to a multidimensional instance space in the next section.\n\nWhen d = 1, the decision boundary g* becomes a point in [0, 1], and the corresponding classifier is a threshold function over [0, 1]. In other words, the hypothesis space becomes H = {h_θ(x) = 1[x > θ] : θ ∈ [0, 1]}. We denote the ground truth decision boundary by θ* ∈ [0, 1]. We want to find a θ̂ ∈ [0, 1] such that |θ̂ − θ*| is small while making as few queries as possible.\n\n3.1 Algorithm\n\nThe proposed algorithm is a binary search style algorithm, shown as Algorithm 1. (For the sake of simplicity, we assume log(1/(2ε)) is an integer.)\n\nAlgorithm 1 The active learning algorithm for learning thresholds\n1: Input: δ, ε\n2: [L0, R0] ← [0, 1]\n3: for k = 0, 1, 2, . . . , log(1/(2ε)) − 1 do\n4: Define three quartiles: Uk ← (3Lk + Rk)/4, Mk ← (Lk + Rk)/2, Vk ← (Lk + 3Rk)/4\n5: A(u), A(m), A(v), B(u), B(v) ← empty arrays\n6: for n = 1, 2, . . . do\n7: Query at Uk, Mk, Vk, and receive labels X(u)_n, X(m)_n, X(v)_n\n8: for w ∈ {u, m, v} do ⊲ Record whether X(w) = ⊥ in A(w), and the 0/1 label (as −1/1) in B(w) if X(w) ≠ ⊥\n9: if X(w) ≠ ⊥ then\n10: A(w).append(1), B(w).append(2·1[X(w) = 1] − 1)\n11: else\n12: A(w).append(0)\n13: end if\n14: end for\n⊲ Check if the differences of abstention responses are statistically significant\n15: if CHECKSIGNIFICANT-VAR({A(u)_i − A(m)_i}_(i=1)^n, δ/(4 log(1/(2ε)))) then\n16: [Lk+1, Rk+1] ← [Uk, Rk]; break\n17: else if CHECKSIGNIFICANT-VAR({A(v)_i − A(m)_i}_(i=1)^n, δ/(4 log(1/(2ε)))) then\n18: [Lk+1, Rk+1] ← [Lk, Vk]; break\n19: end if\n⊲ Check if the differences between 0 and 1 labels are statistically significant\n20: if CHECKSIGNIFICANT({−B(u)_i}_(i=1)^(B(u).length), δ/(4 log(1/(2ε)))) then\n21: [Lk+1, Rk+1] ← [Uk, Rk]; break\n22: else if CHECKSIGNIFICANT({B(v)_i}_(i=1)^(B(v).length), δ/(4 log(1/(2ε)))) then\n23: [Lk+1, Rk+1] ← [Lk, Vk]; break\n24: end if\n25: end for\n26: end for\n27: Output: θ̂ = (L_(log(1/(2ε))) + R_(log(1/(2ε))))/2\n\nAlgorithm 1 takes a desired precision ε and confidence level δ as its input, and returns an estimate θ̂ of the decision boundary θ*. The algorithm maintains an interval [Lk, Rk] in which θ* is believed to lie, and shrinks this interval iteratively. To find the subinterval that contains θ*, Algorithm 1 relies on two auxiliary functions (given in Procedure 2) to conduct adaptive sequential hypothesis tests regarding subintervals of the interval [Lk, Rk].\n\nSuppose θ* ∈ [Lk, Rk]. 
Algorithm 1 tries to shrink this interval to 3/4 of its length in each iteration by repetitively querying on the quartiles Uk = (3Lk + Rk)/4, Mk = (Lk + Rk)/2, Vk = (Lk + 3Rk)/4. To determine which specific subinterval to choose, the algorithm uses 0/1 labels and abstention responses simultaneously. Since the ground truth labels are determined by 1[x > θ*], one can infer that if the number of queries that return label 0 at Uk (Vk) is statistically significantly more (less) than label 1, then θ* should be on the right (left) side of Uk (Vk). Similarly, from Condition 1, if the number of non-abstention responses at Uk (Vk) is statistically significantly more than the number of non-abstention responses at Mk, then θ* should be closer to Mk than to Uk (Vk).\n\nAlgorithm 1 relies on the ability to shrink the search interval via statistically comparing the numbers of obtained labels at locations Uk, Mk, Vk. As a result, a main building block of Algorithm 1 is to test whether i.i.d. bounded random variables Yi are greater in expectation than i.i.d. bounded random variables Zi with statistical significance. In Procedure 2, we have two test functions, CheckSignificant and CheckSignificant-Var, that take i.i.d. random variables {Xi = Yi − Zi} (|Xi| ≤ 1) and a confidence level δ as their input, and output whether it is statistically significant to conclude EXi > 0.\n\nProcedure 2 Adaptive sequential testing\n⊲ D0, D1 are absolute constants defined in Proposition 1 and Proposition 2. {Xi} are i.i.d. random variables bounded by 1. δ is the confidence level. Detect if EX > 0.\n1: function CHECKSIGNIFICANT({Xi}_(i=1)^n, δ)\n2: p(n, δ) ← D0 (1 + ln(1/δ) + √(4n ([ln ln]_+ (4n) + ln(1/δ))))\n3: Return Σ_(i=1)^n Xi ≥ p(n, δ)\n4: end function\n5: function CHECKSIGNIFICANT-VAR({Xi}_(i=1)^n, δ)\n6: Calculate the empirical variance Var = (n/(n−1)) (Σ_(i=1)^n Xi^2 − (1/n)(Σ_(i=1)^n Xi)^2)\n7: q(n, Var, δ) ← D1 (1 + ln(1/δ) + √((Var + ln(1/δ) + 1) ([ln ln]_+ (Var + ln(1/δ) + 1) + ln(1/δ))))\n8: Return n ≥ ln(1/δ) AND Σ_(i=1)^n Xi ≥ q(n, Var, δ)\n9: end function\n\nCheckSignificant is based on the following uniform concentration result regarding the empirical mean:\n\nProposition 1. Suppose X1, X2, . . . is a sequence of i.i.d. random variables with X1 ∈ [−2, 2] and EX1 = 0. Take any 0 < δ < 1. Then there is an absolute constant D0 such that with probability at least 1 − δ, for all n > 0 simultaneously, |Σ_(i=1)^n Xi| ≤ D0 (1 + ln(1/δ) + √(4n ([ln ln]_+ (4n) + ln(1/δ)))).\n\nIn Algorithm 1, we use CheckSignificant to detect whether the expected number of queries that return label 0 at location Uk (Vk) is more/less than the expected number of label 1 with statistical significance.\n\nCheckSignificant-Var is based on the following uniform concentration result, which further utilizes the empirical variance Vn = (n/(n−1)) (Σ_(i=1)^n Xi^2 − (1/n)(Σ_(i=1)^n Xi)^2):\n\nProposition 2. 
There is an absolute constant D1 such that with probability at least 1 − δ, for all n ≥ ln(1/δ) simultaneously, |Σ_(i=1)^n Xi| ≤ D1 (1 + ln(1/δ) + √((1 + ln(1/δ) + Vn) ([ln ln]_+ (1 + ln(1/δ) + Vn) + ln(1/δ)))).\n\nThe use of the variance results in a tighter bound when Var(Xi) is small.\n\nIn Algorithm 1, we use CheckSignificant-Var to detect the statistical significance of the relative order of the number of queries that return non-abstention responses at Uk (Vk) compared to the number of non-abstention responses at Mk. This results in a better query complexity than using CheckSignificant under Condition 3, since the variance of the number of abstention responses approaches 0 when the interval [Lk, Rk] zooms in on θ*.¹\n\n3.2 Analysis\n\nFor Algorithm 1 to be statistically consistent, we only need Condition 1.\n\nTheorem 1. Let θ* be the ground truth. If the labeler L satisfies Condition 1 and Algorithm 1 stops to output θ̂, then |θ* − θ̂| ≤ ε with probability at least 1 − δ/2.\n\n¹We do not apply CheckSignificant-Var to 0/1 labels, because unlike the difference between the numbers of abstention responses at Uk (Vk) and Mk, the variance of the difference between the numbers of 0 and 1 labels stays above a positive constant.\n\nUnder additional Conditions 2 and 3, we can derive upper bounds on the query complexity of our algorithm. (Recall f and β are defined in Conditions 2 and 3.)\n\nTheorem 2. Let θ* be the ground truth, and θ̂ be the output of Algorithm 1. Under Conditions 1 and 2, with probability at least 1 − δ, Algorithm 1 makes at most Õ((1/f(ε/2)) ε^(−2β)) queries.\n\nTheorem 3. Let θ* be the ground truth, and θ̂ be the output of Algorithm 1. Under Conditions 1 and 3, with probability at least 1 − δ, Algorithm 1 makes at most Õ(1/f(ε/2)) queries.\n\nThe query complexity given by Theorem 3 is independent of the β that determines the flipping rate, and is consequently smaller than the bound in Theorem 2. This improvement is due to the use of abstention responses, which become much more informative under Condition 3.\n\n3.3 Lower Bounds\n\nIn this subsection, we give lower bounds on query complexity in the one-dimensional case and establish the near optimality of Algorithm 1. We will give corresponding lower bounds for the high-dimensional case in the next section.\n\nThe lower bound in [24] can be easily generalized to Condition 2:\n\nTheorem 4. ([24]) There is a universal constant δ0 ∈ (0, 1) and a labeler L satisfying Conditions 1 and 2, such that for any active learning algorithm A, there is a θ* ∈ [0, 1] such that for small enough ε, Λ(ε, δ0, A, L, θ*) ≥ Ω((1/f(ε)) ε^(−2β)).\n\nOur query complexity bound (Theorem 3) for the algorithm is also almost tight under Conditions 1 and 3 with a polynomial abstention rate.\n\nTheorem 5. 
There is a universal constant δ0 ∈ (0, 1) and a labeler L satisfying Conditions 1, 2, and 3 with f(x) = C′x^α (C′ > 0 and 0 < α ≤ 2 are constants), such that for any active learning algorithm A, there is a θ* ∈ [0, 1] such that for small enough ε, Λ(ε, δ0, A, L, θ*) ≥ Ω(ε^(−α)).\n\n3.4 Remarks\n\nOur results confirm the intuition that learning with abstention is easier than learning with noisy labels. This is true because a noisy label might mislead the learning algorithm, but an abstention response never does. Our analysis shows, in particular, that if the labeler never abstains and outputs completely noisy labels with probability bounded by 1 − |x − θ*|^γ (i.e., P(Y ≠ 1[x > θ*] | x) ≤ (1/2)(1 − |x − θ*|^γ)), then the near optimal query complexity of Õ(ε^(−2γ)) is significantly larger than the near optimal Õ(ε^(−γ)) query complexity associated with a labeler who only abstains with probability P(Y = ⊥ | x) ≤ 1 − |x − θ*|^γ and never flips a label. More precisely, while in both cases the labeler outputs the same amount of corrupted labels, the query complexity of the abstention-only case is significantly smaller than that of the noise-only case.\n\nNote that the query complexity of Algorithm 1 consists of two kinds of queries: queries which return 0/1 labels and are used by the function CheckSignificant, and queries which return abstentions and are used by the function CheckSignificant-Var. Algorithm 1 stops querying when the responses of one of the two kinds of queries are statistically significant. Under Condition 2, our proof actually shows that the optimal number of queries is dominated by the number of queries used by the CheckSignificant function. 
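For concreteness, the CheckSignificant stopping rule of Procedure 2 can be sketched as follows. The absolute constant D0 is not specified in the text, so the value used below is a placeholder for illustration, not the paper's tuned constant.

```python
import math

def lnln_plus(x):
    """[ln ln]_+(x) = ln ln max(x, e^e), as in the paper's notation section."""
    return math.log(math.log(max(x, math.e ** math.e)))

def check_significant(xs, delta, d0=1.0):
    """Sketch of CheckSignificant from Procedure 2: declare E[X] > 0 once the
    running sum exceeds the iterated-logarithm threshold p(n, delta).
    d0 stands in for the unspecified absolute constant D0."""
    n = len(xs)
    p = d0 * (1 + math.log(1 / delta)
              + math.sqrt(4 * n * (lnln_plus(4 * n) + math.log(1 / delta))))
    return sum(xs) >= p

# With X_i ≡ 1 (e.g., every non-abstention label at U_k is 0), the sum grows
# linearly in n while the threshold grows like sqrt(n log log n), so the test
# does not fire for a handful of samples but eventually does.
assert not check_significant([1.0] * 5, delta=0.05)
assert check_significant([1.0] * 200, delta=0.05)
```

The anytime-valid threshold is what lets the algorithm query each triple of quartiles adaptively, stopping as soon as either the label counts or the abstention counts separate.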
In other words, a simplified variant of Algorithm 1 which excludes the use of abstention feedback is near optimal. Similarly, under Condition 3, the optimal query complexity is dominated by the number of queries used by the CheckSignificant-Var function. Hence the variant of Algorithm 1 which disregards 0/1 labels would be near optimal.\n\n4 The multidimensional case\n\nWe follow [6] to generalize the results from one-dimensional thresholds to the d-dimensional (d > 1) smooth boundary fragment class Σ(K, γ).\n\nAlgorithm 3 The active learning algorithm for the smooth boundary fragment class\n1: Input: δ, ε, γ\n2: M ← Θ(ε^(−1/γ)); L ← {0/M, 1/M, . . . , (M − 1)/M}^(d−1)\n3: For each l ∈ L, apply Algorithm 1 with parameters (ε, δ/M^(d−1)) to learn a threshold g_l that approximates g*(l)\n4: Partition the instance space into cells {I_q} indexed by q ∈ {0, 1, . . . , M/γ − 1}^(d−1), where I_q = [q1 γ/M, (q1 + 1) γ/M] × · · · × [q_(d−1) γ/M, (q_(d−1) + 1) γ/M]\n5: For each cell I_q, perform a polynomial interpolation: g_q(x̃) = Σ_(l ∈ I_q ∩ L) g_l Q_(q,l)(x̃), where Q_(q,l)(x̃) = Π_(i=1)^(d−1) Π_(j=0, j ≠ M l_i − γ q_i)^(γ) (x̃_i − (γ q_i + j)/M) / (l_i − (γ q_i + j)/M)\n6: Output: g(x̃) = Σ_(q ∈ {0,1,...,M/γ−1}^(d−1)) g_q(x̃) 1[x̃ ∈ I_q]\n\n4.1 Lower bounds\n\nTheorem 6. 
There are universal constants δ0 ∈ (0, 1), c0 > 0, and a labeler L satisfying Conditions 1 and 2, such that for any active learning algorithm A, there is a g* ∈ Σ(K, γ) such that for small enough ε, Λ(ε, δ0, A, L, g*) ≥ Ω((1/f(c0 ε)) ε^(−2β−(d−1)/γ)).\n\nTheorem 7. There is a universal constant δ0 ∈ (0, 1) and a labeler L satisfying Conditions 1, 2, and 3 with f(x) = C′x^α (C′ > 0 and 0 < α ≤ 2 are constants), such that for any active learning algorithm A, there is a g* ∈ Σ(K, γ) such that for small enough ε, Λ(ε, δ0, A, L, g*) ≥ Ω(ε^(−α−(d−1)/γ)).\n\n4.2 Algorithm and Analysis\n\nRecall that the decision boundary of the smooth boundary fragment class can be seen as the epigraph of a smooth function [0, 1]^(d−1) → [0, 1]. For d > 1, we can reduce the problem to the one-dimensional problem by discretizing the first d − 1 dimensions of the instance space and then performing a polynomial interpolation. The algorithm is shown as Algorithm 3. For the sake of simplicity, we assume γ and M/γ in Algorithm 3 are integers.\n\nWe have similar consistency guarantees and upper bounds as in the one-dimensional case.\n\nTheorem 8. Let g* be the ground truth. If the labeler L satisfies Condition 1 and Algorithm 3 stops to output g, then ‖g* − g‖ ≤ ε with probability at least 1 − δ/2.\n\nTheorem 9. Let g* be the ground truth, and g be the output of Algorithm 3. Under Conditions 1 and 2, with probability at least 1 − δ, Algorithm 3 makes at most Õ((d/f(ε/2)) ε^(−2β−(d−1)/γ)) queries.\n\nTheorem 10. Let g* be the ground truth, and g be the output of Algorithm 3. Under Conditions 1 and 3, with probability at least 1 − δ, Algorithm 3 makes at most Õ((d/f(ε/2)) ε^(−(d−1)/γ)) queries.\n\nAcknowledgments. We thank NSF under IIS-1162581, CCF-1513883, and CNS-1329819 for research support.\n\nReferences\n\n[1] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.\n\n[2] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72. ACM, 2006.\n\n[3] Maria-Florina Balcan and Steve Hanneke. Robust interactive learning. In Proceedings of the 25th Conference on Learning Theory, 2012.\n\n[4] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints. In NIPS, 2010.\n\n[5] Alina Beygelzimer, Daniel Hsu, John Langford, and Chicheng Zhang. Search improves label for active learning. arXiv preprint arXiv:1602.07265, 2016.\n\n[6] Rui M. Castro and Robert D. Nowak. Minimax bounds for active learning. IEEE Transactions on Information Theory, 54(5):2339–2353, 2008.\n\n[7] Yuxin Chen, S. Hamed Hassani, Amin Karbasi, and Andreas Krause. Sequential information maximization: When is greedy near-optimal? In Proceedings of the 28th Conference on Learning Theory, pages 338–363, 2015.\n\n[8] D. A. Cohn, L. E. Atlas, and R. E. Ladner. Improving generalization with active learning. Machine Learning, 15(2), 1994.\n\n[9] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.\n\n[10] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS, 2007.\n\n[11] Meng Fang and Xingquan Zhu. 
I don't know the label: Active learning with blind knowledge. In Pattern Recognition (ICPR), 2012 21st International Conference on, pages 2238–2241. IEEE, 2012.
[12] Steve Hanneke. Teaching dimension and the complexity of active learning. In Learning Theory, pages 66–81. Springer, 2007.
[13] Tibor Hegedűs. Generalized teaching dimensions and the query complexity of learning. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 108–117. ACM, 1995.
[14] M. Kääriäinen. Active learning in the non-realizable case. In ALT, 2006.
[15] Christoph Käding, Alexander Freytag, Erik Rodner, Paul Bodesheim, and Joachim Denzler. Active learning and discovery of object categories in the presence of unnameable instances. In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 4343–4352. IEEE, 2015.
[16] Yuan-Chuan Li and Cheh-Chih Yeh. Some equivalent forms of Bernoulli's inequality: A survey. Applied Mathematics, 4(07):1070, 2013.
[17] Stanislav Minsker. Plug-in approach to active learning. Journal of Machine Learning Research, 13(Jan):67–90, 2012.
[18] Mohammad Naghshvar, Tara Javidi, and Kamalika Chaudhuri. Bayesian active learning with non-persistent noise. IEEE Transactions on Information Theory, 61(7):4080–4098, 2015.
[19] R. D. Nowak. The geometry of generalized binary search. IEEE Transactions on Information Theory, 57(12):7893–7906, 2011.
[20] Maxim Raginsky and Alexander Rakhlin. Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems, pages 1026–1034, 2011.
[21] Aaditya Ramdas and Akshay Balsubramani. Sequential nonparametric testing with the law of the iterated logarithm. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.
[22] A. B. Tsybakov.
Optimal aggregation of classifiers in statistical learning. Annals of Statistics, 32:135–166, 2004.
[23] Ruth Urner, Shai Ben-David, and Ohad Shamir. Learning from weak teachers. In International Conference on Artificial Intelligence and Statistics, pages 1252–1260, 2012.
[24] Songbai Yan, Kamalika Chaudhuri, and Tara Javidi. Active learning from noisy and abstention feedback. In Communication, Control, and Computing (Allerton), 2015 53rd Annual Allerton Conference on. IEEE, 2015.
[25] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442–450, 2014.
[26] Chicheng Zhang and Kamalika Chaudhuri. Active learning from weak and strong labelers. In Advances in Neural Information Processing Systems, pages 703–711, 2015.
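As an informal illustration of the interpolation in step 5 of Algorithm 3 (not part of the paper's algorithms), the following sketch evaluates the per-cell Lagrange interpolation in the special case d − 1 = 1, for a single cell with γ = 2. The function names `lagrange_basis` and `interpolate_cell` are hypothetical; `lagrange_basis` plays the role of Q_{q,l} restricted to one dimension. Since the basis reproduces polynomials of degree up to γ, interpolating γ + 1 exact values of a degree-γ boundary recovers it exactly on the cell.

```python
import numpy as np

def lagrange_basis(nodes, k, x):
    """Q_{q,l} for l = nodes[k]: equals 1 at nodes[k] and 0 at every other node."""
    x = np.asarray(x, dtype=float)
    out = np.ones_like(x)
    for j, node in enumerate(nodes):
        if j != k:
            out *= (x - node) / (nodes[k] - node)
    return out

def interpolate_cell(g_l, nodes, x):
    """g_q(x) = sum over grid points l in the cell of g_l * Q_{q,l}(x)."""
    return sum(v * lagrange_basis(nodes, k, x) for k, v in enumerate(g_l))

# gamma = 2: three equispaced grid points in the cell [0, 1]. Interpolating
# exact values of a degree-2 boundary g*(x) = x^2 recovers it on the whole cell.
nodes = np.array([0.0, 0.5, 1.0])
g_l = nodes ** 2                      # threshold estimates at the grid points
x = np.linspace(0.0, 1.0, 5)
print(interpolate_cell(g_l, nodes, x))
```

In Algorithm 3 the estimates g_l come from Algorithm 1 rather than from g* directly, so the interpolation error adds the per-threshold estimation error ε to the O(ε) approximation error of the degree-γ fit.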