{"title": "Search Improves Label for Active Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3342, "page_last": 3350, "abstract": "We investigate active learning with access to two distinct oracles: LABEL (which is standard) and SEARCH (which is not). The SEARCH oracle models the situation where a human searches a database to seed or counterexample an existing solution. SEARCH is stronger than LABEL while being natural to implement in many situations. We show that an algorithm using both oracles can provide exponentially large problem-dependent improvements over LABEL alone.", "full_text": "Search Improves Label for Active Learning\n\nAlina Beygelzimer\nYahoo Research\nNew York, NY\n\nbeygel@yahoo-inc.com\n\nJohn Langford\n\nMicrosoft Research\n\nNew York, NY\n\njcl@microsoft.com\n\nDaniel Hsu\n\nColumbia University\n\nNew York, NY\n\ndjhsu@cs.columbia.edu\n\nChicheng Zhang\nUC San Diego\nLa Jolla, CA\n\nchz038@cs.ucsd.edu\n\nAbstract\n\nWe investigate active learning with access to two distinct oracles: LABEL (which\nis standard) and SEARCH (which is not). The SEARCH oracle models the situation\nwhere a human searches a database to seed or counterexample an existing solution.\nSEARCH is stronger than LABEL while being natural to implement in many situ-\nations. We show that an algorithm using both oracles can provide exponentially\nlarge problem-dependent improvements over LABEL alone.\n\n1\n\nIntroduction\n\nMost active learning theory is based on interacting with a LABEL oracle: An active learner observes\nunlabeled examples, each with a label that is initially hidden. The learner provides an unlabeled\nexample to the oracle, and the oracle responds with the label. 
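On a finite pool, this LABEL interaction can be made concrete. The following is a minimal, illustrative simulation (the class and function names are ours, not from the paper) in which a learner restricts its LABEL queries to the current disagreement region of threshold classifiers, so that only logarithmically many queries are needed:

```python
class LabelOracle:
    """Wraps the hidden labels; every call to label() is one LABEL query."""
    def __init__(self, target_w):
        self.target_w = target_w        # hidden threshold: +1 iff x >= target_w
        self.num_queries = 0

    def label(self, x):
        self.num_queries += 1
        return 1 if x >= self.target_w else -1

def active_learn_threshold(pool, oracle):
    """Binary-search the sorted pool, querying LABEL only inside the current
    disagreement region; returns the index of the first +1 point."""
    lo, hi = -1, len(pool)              # pool[:lo+1] is -1, pool[hi:] is +1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if oracle.label(pool[mid]) == 1:
            hi = mid
        else:
            lo = mid
    return hi
```

For a pool of n points, this uses about log2(n) LABEL queries, an instance of the exponential improvements over passive learning discussed next.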
Using LABEL in an active learning\nalgorithm is known to give (sometimes exponentially large) problem-dependent improvements in\nlabel complexity, even in the agnostic setting where no assumption is made about the underlying\ndistribution [e.g., Balcan et al., 2006, Hanneke, 2007, Dasgupta et al., 2007, Hanneke, 2014].\nA well-known de\ufb01ciency of LABEL arises in the presence of rare classes in classi\ufb01cation problems,\nfrequently the case in practice [Attenberg and Provost, 2010, Simard et al., 2014]. Class imbalance\nmay be so extreme that simply \ufb01nding an example from the rare class can exhaust the labeling budget.\nConsider the problem of learning interval functions in [0, 1]. Any LABEL-only active learner needs at\nleast \u03a9(1/\ufffd) LABEL queries to learn an arbitrary target interval with error at most \ufffd [Dasgupta, 2005].\nGiven any positive example from the interval, however, the query complexity of learning intervals\ncollapses to O(log(1/\ufffd)), as we can just do a binary search for each of the end points.\nA natural approach used to overcome this hurdle in practice is to search for known examples of the\nrare class [Attenberg and Provost, 2010, Simard et al., 2014]. Domain experts are often adept at\n\ufb01nding examples of a class by various, often clever means. For instance, when building a hate speech\n\ufb01lter, a simple web search can readily produce a set of positive examples. Sending a random batch of\nunlabeled text to LABEL is unlikely to produce any positive examples at all.\nAnother form of interaction common in practice is providing counterexamples to a learned predictor.\nWhen monitoring the stream \ufb01ltered by the current hate speech \ufb01lter, a human editor may spot a\nclear-cut example of hate speech that seeped through the \ufb01lter. The editor, using all the search tools\navailable to her, may even be tasked with searching for such counterexamples. 
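To make the earlier intervals example concrete: given a single positive seed inside the target interval, binary search recovers each endpoint with O(log(1/ε)) LABEL queries. A toy sketch (names are ours, not the paper's code) which assumes the target interval lies strictly inside (0, 1), so that 0.0 and 1.0 serve as known negative anchors:

```python
def learn_interval(label, x_plus, eps):
    """Locate both endpoints of a target interval to precision eps, starting
    from one positive seed x_plus inside it.  Assumes the interval lies
    strictly inside (0, 1), so 0.0 and 1.0 act as negative anchors."""
    queries = 0

    def bisect(neg, pos):
        # invariant: label(pos) = +1 and label(neg) = -1, so the endpoint
        # separating them stays inside the bracket as it halves
        nonlocal queries
        while abs(pos - neg) > eps:
            mid = (pos + neg) / 2
            queries += 1
            if label(mid) == 1:
                pos = mid
            else:
                neg = mid
        return (neg + pos) / 2

    left = bisect(0.0, x_plus)          # left endpoint lies in (0, x_plus]
    right = bisect(1.0, x_plus)         # right endpoint lies in [x_plus, 1)
    return left, right, queries
```

Without the seed, any LABEL-only strategy needs Ω(1/ε) queries just to hit the interval; with it, roughly 2·log2(1/ε) queries suffice.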
The goal of the\nlearning system is then to interactively restrict the searchable space, guiding the search process to\nwhere it is most effective.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fCounterexamples can be ineffective or misleading in practice as well. Reconsidering the intervals\nexample above, a counterexample on the boundary of an incorrect interval provides no useful\ninformation about any other examples. What is a good counterexample? What is a natural way to\nrestrict the searchable space? How can the intervals problem be generalized?\nWe de\ufb01ne a new oracle, SEARCH, that provides counterexamples to version spaces. Given a set of\npossible classi\ufb01ers H mapping unlabeled examples to labels, a version space V \u2286 H is the subset\nof classi\ufb01ers still under consideration by the algorithm. A counterexample to a version space is a\nlabeled example which every classi\ufb01er in the version space classi\ufb01es incorrectly. When there is no\ncounterexample to the version space, SEARCH returns nothing.\nHow can a counterexample to the version space be used? We consider a nested sequence of hypothesis\nclasses of increasing complexity, akin to Structural Risk Minimization (SRM) in passive learning\n[see, e.g., Vapnik, 1982, Devroye et al., 1996]. When SEARCH produces a counterexample to the\nversion space, it gives a proof that the current hypothesis class is too simplistic to solve the problem\neffectively. We show that this guided increase in hypothesis complexity results in a radically lower\nLABEL complexity than directly learning on the complex space. Sample complexity bounds for\nmodel selection in LABEL-only active learning were studied by Balcan et al. [2010], Hanneke [2011].\nSEARCH can easily model the practice of seeding discussed earlier. 
If the \ufb01rst hypothesis class\nhas just the constant always-negative classi\ufb01er h(x) = \u22121, a seed example with label +1 is a\ncounterexample to the version space. Our most basic algorithm uses SEARCH just once before using\nLABEL, but it is clear from inspection that multiple seeds are not harmful, and they may be helpful if\nthey provide the proof required to operate with an appropriately complex hypothesis class.\nDe\ufb01ning SEARCH with respect to a version space rather than a single classi\ufb01er allows us to formalize\n\u201ccounterexample far from the boundary\u201d in a general fashion which is compatible with the way\nLABEL-based active learning algorithms work.\n\nRelated work. The closest oracle considered in the literature is the Class Conditional Query\n(CCQ) [Balcan and Hanneke, 2012] oracle. A query to CCQ speci\ufb01es a \ufb01nite set of unlabeled\nexamples and a label while returning an example in the subset with the speci\ufb01ed label, if one exists.\nIn contrast, SEARCH has an implicit query set that is an entire region of the input space rather than\na \ufb01nite set. Simple searches over this large implicit domain can more plausibly discover relevant\ncounterexamples: When building a detector for penguins in images, the input to CCQ might be a\nset of images and the label \u201cpenguin\u201d. Even if we are very lucky and the set happens to contain a\npenguin image, a search amongst image tags may fail to \ufb01nd it in the subset because it is not tagged\nappropriately. SEARCH is more likely to discover counterexamples\u2014surely there are many images\ncorrectly tagged as having penguins.\nWhy is it natural to de\ufb01ne a query region implicitly via a version space? There is a practical\nreason\u2014it is a concise description of a natural region with an ef\ufb01ciently implementable membership\n\ufb01lter [Beygelzimer et al., 2010, 2011, Huang et al., 2015]. 
(Compare this to an oracle call that has\nto explicitly enumerate a large set of examples. The algorithm of Balcan and Hanneke [2012] uses\nsamples of size roughly d\u03bd/\ufffd2.)\nThe use of SEARCH in this paper is also substantially different from the use of CCQ by Balcan and\nHanneke [2012]. Our motivation is to use SEARCH to assist LABEL, as opposed to using SEARCH\nalone. This is especially useful in any setting where the cost of SEARCH is signi\ufb01cantly higher\nthan the cost of LABEL\u2014we hope to avoid using SEARCH queries whenever it is possible to make\nprogress using LABEL queries. This is consistent with how interactive learning systems are used in\npractice. For example, the Interactive Classi\ufb01cation and Extraction system of Simard et al. [2014]\ncombines LABEL with search in a production environment.\nThe \ufb01nal important distinction is that we require SEARCH to return the label of the optimal predictor\nin the nested sequence. For many natural sequences of hypothesis classes, the Bayes optimal\nclassi\ufb01er is eventually in the sequence, in which case it is equivalent to assuming that the label in a\ncounterexample is the most probable one, as opposed to a randomly-drawn label from the conditional\ndistribution (as in CCQ and LABEL).\nIs this a reasonable assumption? Unlike with LABEL queries, where the labeler has no choice of\nwhat to label, here the labeler chooses a counterexample. If a human editor \ufb01nds an unquestionable\n\n2\n\n\fexample of hate speech that seeped through the \ufb01lter, it is quite reasonable to assume that this\ncounterexample is consistent with the Bayes optimal predictor for any sensible feature representation.\n\nOrganization. Section 2 formally introduces the setting. Section 3 shows that SEARCH is at least\nas powerful as LABEL. Section 4 shows how to use SEARCH and LABEL jointly in the realizable\nsetting where a zero-error classi\ufb01er exists in the nested sequence of hypothesis classes. 
Section 5\nhandles the agnostic setting where LABEL is subject to label noise, and shows an amortized approach\nto combining the two oracles with a good guarantee on the total cost.\n\n2 De\ufb01nitions and Setting\nIn active learning, there is an underlying distribution D over X \u00d7 Y, where X is the instance space\nand Y := {\u22121, +1} is the label space. The learner can obtain independent draws from D, but the\nlabel is hidden unless explicitly requested through a query to the LABEL oracle. Let DX denote the\nmarginal of D over X .\nWe consider learning with a nested sequence of hypotheses classes H0 \u2282 H1 \u2282 \u00b7\u00b7\u00b7 \u2282 Hk \u00b7\u00b7\u00b7 ,\nwhere Hk \u2286 YX has VC dimension dk. For a set of labeled examples S \u2286 X \u00d7 Y, let Hk(S) :=\n{h \u2208 Hk : \u2200(x, y) \u2208 S \ufffd h(x) = y} be the set of hypotheses in Hk consistent with S. Let\nerr(h) := Pr(x,y)\u223cD[h(x) \ufffd= y] denote the error rate of a hypothesis h with respect to distribution\nD, and err(h, S) be the error rate of h on the labeled examples in S. Let h\u2217k = arg minh\u2208Hk err(h)\nbreaking ties arbitrarily and let k\u2217 := arg mink\u22650 err(h\u2217k) breaking ties in favor of the smallest such\nk. For simplicity, we assume the minimum is attained at some \ufb01nite k\u2217. Finally, de\ufb01ne h\u2217 := h\u2217k\u2217,\nthe optimal hypothesis in the sequence of classes. The goal of the learner is to learn a hypothesis\nwith error rate not much more than that of h\u2217.\nIn addition to LABEL, the learner can also query SEARCH with a version space.\n\nOracle SEARCHH (V ) (where H \u2208 {Hk}\u221ek=0)\ninput: Set of hypotheses V \u2282 H\noutput: Labeled example (x, h\u2217(x)) s.t. h(x) \ufffd= h\u2217(x) for all h \u2208 V , or \u22a5 if there is no such\n\nexample.\n\nThus if SEARCHH (V ) returns an example, this example is a systematic mistake made by all hypothe-\nses in V . 
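A simulated version of this oracle on a finite pool makes the definition concrete (illustrative only; the paper treats SEARCH abstractly, with the human supplying the example):

```python
def search(version_space, h_star, pool):
    """Simulated SEARCH oracle over a finite pool: return a labeled example
    (x, h_star(x)) on which *every* hypothesis in the version space errs,
    or None (standing in for the paper's bottom symbol) if no systematic
    mistake exists.  The all(...) below is vacuously true when the version
    space is empty, so an empty version space always yields some example."""
    for x in pool:
        y = h_star(x)
        if all(h(x) != y for h in version_space):
            return (x, y)
    return None
```

For instance, if the version space holds only the always-negative classifier, any positive example of h* is a valid counterexample, which is exactly how a seed is modeled.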
(If V = ∅, we expect SEARCH to return some example, i.e., not ⊥.)
Our analysis is given in terms of the disagreement coefficient of Hanneke [2007], which has been a central parameter for analyzing active learning algorithms. Define the region of disagreement of a set of hypotheses V as Dis(V) := {x ∈ X : ∃h, h′ ∈ V s.t. h(x) ≠ h′(x)}. The disagreement coefficient of V at scale r is θV(r) := sup_{h∈V, r′≥r} PrDX[Dis(BV(h, r′))]/r′, where BV(h, r′) := {h′ ∈ V : Pr_{x∼DX}[h′(x) ≠ h(x)] ≤ r′} is the ball of radius r′ around h.
The Õ(·) notation hides factors that are polylogarithmic in 1/δ and quantities that do appear, where δ is the usual confidence parameter.

3 The Relative Power of the Two Oracles

Although SEARCH cannot always implement LABEL efficiently, it is as effective at reducing the region of disagreement. The clearest example is learning threshold classifiers H := {hw : w ∈ [0, 1]} in the realizable case, where hw(x) = +1 if w ≤ x ≤ 1, and −1 if 0 ≤ x < w. A simple binary search with LABEL achieves an exponential improvement in query complexity over passive learning. The agreement region of any set of threshold classifiers with thresholds in [wmin, wmax] is [0, wmin) ∪ [wmax, 1]. Since SEARCH is allowed to return any counterexample in the agreement region, there is no mechanism for forcing SEARCH to return the label of a particular point we want. However, this is not needed to achieve logarithmic query complexity with SEARCH: if binary search starts with querying the label of x ∈ [0, 1], we can instead query SEARCHH(Vx), where Vx := {hw ∈ H : w < x}. If SEARCH returns ⊥, we know that the target w* ≤ x and can safely reduce the region of disagreement to [0, x).
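Both outcomes of such a SEARCH query drive a binary search. A toy simulation (function names are ours; the simulated oracle adversarially returns the least informative counterexample, yet the query count stays logarithmic):

```python
from bisect import bisect_right

def make_search(w_star, grid):
    """Simulated SEARCH for V_x = {h_w : w < x}.  A counterexample is any
    point z >= x that every h_w in V_x labels +1 but the target h_{w*}
    labels -1, i.e. any z with x <= z < w_star; we adversarially return
    the smallest such z."""
    calls = [0]
    def search_vx(x):
        calls[0] += 1
        for z in grid:
            if x <= z < w_star:
                return (z, -1)          # counterexample from the agreement region
        return None                     # stands in for the bottom symbol
    return search_vx, calls

def find_threshold(search_vx, grid):
    """Binary search driven only by SEARCH: None means w* <= x (shrink the
    disagreement region from above); a counterexample (z, -1) means w* > z
    (shrink it from below)."""
    lo, hi = 0, len(grid) - 1           # invariant: w* is grid[i], lo <= i <= hi
    while lo < hi:
        mid = (lo + hi) // 2
        result = search_vx(grid[mid])
        if result is None:
            hi = mid
        else:
            lo = bisect_right(grid, result[0])
    return grid[lo]
```

Even though the oracle never reveals the label of the exact point we care about, each query at least halves the candidate set of thresholds.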
If SEARCH returns a counterexample (x0,\u22121) with x0 \u2265 x, we know that\nw\u2217 > x0 and can reduce the region of disagreement to (x0, 1].\n\n3\n\n\fThis observation holds more generally. In the proposition below, we assume that LABEL(x) = h\u2217(x)\nfor simplicity. If LABEL(x) is noisy, the proposition holds for any active learning algorithm that\ndoesn\u2019t eliminate any h \u2208 H : h(x) = LABEL(x) from the version space.\nProposition 1. For any call x \u2208 X to LABEL such that LABEL(x) = h\u2217(x), we can construct a call\nto SEARCH that achieves a no lesser reduction in the region of disagreement.\n\n:= H+1(x)\n\nProof. For any V \u2286 H, let HSEARCH(V ) be the hypotheses in H consistent with the output of\nSEARCHH (V ): if SEARCHH (V ) returns a counterexample (x, y) to V , then HSEARCH(V ) := {h \u2208\nH : h(x) = y}; otherwise, HSEARCH(V ) := V . Let HLABEL(x) := {h \u2208 H : h(x) = LABEL(x)}.\nAlso, let Vx\n:= {h \u2208 H : h(x) = +1}. We will show that Vx is such that\nHSEARCH(Vx) \u2286 HLABEL(x), and hence Dis(HSEARCH(Vx)) \u2286 Dis(HLABEL(x)).\nThere are two cases to consider: If h\u2217(x) = +1, then SEARCHH (Vx) returns \u22a5. In this case,\nHLABEL(x) = HSEARCH(Vx) = H+1(x), and we are done. If h\u2217(x) = \u22121, SEARCH(Vx) returns a\nvalid counterexample (possibly (x,\u22121)) in the region of agreement of H+1(x), eliminating all of\nH+1(x). Thus HSEARCH(Vx) \u2282 H \\ H+1(x) = HLABEL(x), and the claim holds also.\nAs shown by the problem of learning intervals on the line, SEARCH can be exponentially more\npowerful than LABEL.\n\n4 Realizable Case\n\nWe now turn to general active learning algorithms that combine SEARCH and LABEL. We focus\non algorithms using both SEARCH and LABEL since LABEL is typically easier to implement than\nSEARCH and hence should be used where SEARCH has no signi\ufb01cant advantage. 
(Whenever SEARCH\nis less expensive than LABEL, Section 3 suggests a transformation to a SEARCH-only algorithm.)\nThis section considers the realizable case, in which we assume that the hypothesis h\u2217 = h\u2217k\u2217 \u2208 Hk\u2217\nhas err(h\u2217) = 0. This means that LABEL(x) returns h\u2217(x) for any x in the support of DX .\n4.1 Combining LABEL and SEARCH\n\nOur algorithm (shown as Algorithm 1) is called LARCH, because it combines LABEL and SEARCH.\nLike many selective sampling methods, LARCH uses a version space to determine its LABEL queries.\nFor concreteness, we use (a variant of) the algorithm of Cohn et al. [1994], denoted by CAL, as a\nsubroutine in LARCH. The inputs to CAL are: a version space V , the LABEL oracle, a target error\nrate, and a con\ufb01dence parameter; and its output is a set of labeled examples (implicitly de\ufb01ning a new\nversion space). CAL is described in Appendix B; its essential properties are speci\ufb01ed in Lemma 1.\nLARCH differs from LABEL-only active learners (like CAL) by \ufb01rst calling SEARCH in Step 3. If\nSEARCH returns \u22a5, LARCH checks to see if the last call to CAL resulted in a small-enough error,\nhalting if so in Step 6, and decreasing the allowed error rate if not in Step 8. If SEARCH instead\nreturns a counterexample, the hypothesis class Hk must be impoverished, so in Step 12, LARCH\nincreases the complexity of the hypothesis class to the minimum complexity suf\ufb01cient to correctly\nclassify all known labeled examples in S. After the SEARCH, CAL is called in Step 14 to discover a\nsuf\ufb01ciently low-error (or at least low-disagreement) version space with high probability.\nWhen LARCH advances to index k (for any k \u2264 k\u2217), its set of labeled examples S may imply a\nversion space Hk(S) \u2286 Hk that can be actively-learned more ef\ufb01ciently than the whole of Hk. 
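The control flow just described can be sketched end to end on a tiny discrete domain. This is an illustrative brute-force simulation, not the paper's implementation: hypotheses are enumerated explicitly, and the CAL step degenerates to labeling every point of the current disagreement region, which suffices on a finite realizable pool:

```python
from itertools import product

POINTS = range(15)                      # small finite instance space

def predict(ivs, x):
    """A hypothesis is a tuple of disjoint intervals; +1 inside, -1 outside."""
    return 1 if any(a <= x <= b for a, b in ivs) else -1

def hypothesis_class(k):
    """H_k: unions of at most k disjoint intervals over POINTS (this sketch
    only materializes classes up to k = 2)."""
    singles = [(a, b) for a in POINTS for b in POINTS if a <= b]
    hs = [()]                           # H_0: the always-negative hypothesis
    if k >= 1:
        hs += [(iv,) for iv in singles]
    if k >= 2:
        hs += [(u, v) for u, v in product(singles, singles) if u[1] + 1 < v[0]]
    return hs

def consistent(k, S):
    return [h for h in hypothesis_class(k)
            if all(predict(h, x) == y for x, y in S)]

def larch_sketch(h_star):
    """Outer loop in the spirit of LARCH: SEARCH for a systematic mistake;
    on a counterexample, raise k minimally; otherwise label the
    disagreement region (the role CAL plays on a finite realizable pool)."""
    S, k, n_search, n_label = set(), 0, 0, 0
    while True:
        V = consistent(k, S)
        n_search += 1                   # one SEARCH query per iteration
        e = next(((x, predict(h_star, x)) for x in POINTS
                  if all(predict(h, x) != predict(h_star, x) for h in V)), None)
        if e is not None:               # counterexample found: grow k
            S.add(e)
            while not consistent(k, S):
                k += 1
            continue
        dis = [x for x in POINTS if len({predict(h, x) for h in V}) > 1]
        if not dis:                     # version space agrees everywhere: done
            return V[0], k, n_search, n_label
        for x in dis:                   # LABEL queries on the disagreement region
            n_label += 1
            S.add((x, predict(h_star, x)))
```

On a target that is a union of k* = 2 intervals, the sketch mirrors the analysis: it settles at k = 2 after a handful of SEARCH queries, with LABEL queries confined to disagreement regions along the way.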
In our analysis, we quantify this through the disagreement coefficient of Hk(S), which may be markedly smaller than that of the full Hk.

Algorithm 1 LARCH
input: Nested hypothesis classes H0 ⊂ H1 ⊂ ··· ; oracles LABEL and SEARCH; learning parameters ε, δ ∈ (0, 1)
1: initialize S ← ∅, (index) k ← 0, ℓ ← 0
2: for i = 1, 2, . . . do
3:   e ← SEARCHHk(Hk(S))
4:   if e = ⊥ then    # no counterexample found
5:     if 2^−ℓ ≤ ε then
6:       return any h ∈ Hk(S)
7:     else
8:       ℓ ← ℓ + 1
9:     end if
10:  else    # counterexample found
11:    S ← S ∪ {e}
12:    k ← min{k′ : Hk′(S) ≠ ∅}
13:  end if
14:  S ← S ∪ CAL(Hk(S), LABEL, 2^−ℓ, δ/(i² + i))
15: end for

The following theorem bounds the oracle query complexity of Algorithm 1 for learning with both SEARCH and LABEL in the realizable setting. The proof is in Section 4.2.
Theorem 1. Assume that err(h*) = 0. For each k′ ≥ 0, let θk′(·) be the disagreement coefficient of Hk′(S[k′]), where S[k′] is the set of labeled examples S in LARCH at the first time that k ≥ k′. Fix any ε, δ ∈ (0, 1). If LARCH is run with inputs hypothesis classes {Hk}∞k=0, oracles LABEL and SEARCH, and learning parameters ε, δ, then with probability at least 1 − δ: LARCH halts after at most k* + log2(1/ε) for-loop iterations and returns a classifier with error rate at most ε; furthermore, it draws at most Õ(k* dk*/ε) unlabeled examples from DX, makes at most k* + log2(1/ε) queries to SEARCH, and at most Õ((k* + log(1/ε)) · (maxk′≤k* θk′(ε)) · dk* · log²(1/ε)) queries to LABEL.

Union-of-intervals example. We now show an implication of Theorem 1 in the case where the target hypothesis h* is the union of non-trivial intervals in X := [0, 1], assuming that DX is uniform. For k ≥ 0, let Hk be the hypothesis class of the union of up to k intervals in [0, 1], with H0 containing only the always-negative hypothesis. (Thus, h* is the union of k* non-empty intervals.) The disagreement coefficient of H1 is Ω(1/ε), and hence LABEL-only active learners like CAL are not very effective at learning with such classes. However, the first SEARCH query by LARCH provides a counterexample to H0, which must be a positive example (x1, +1). Hence, H1(S[1]) (where S[1] is defined in Theorem 1) is the class of intervals that contain x1, with disagreement coefficient θ1 ≤ 4.
Now consider the inductive case. Just before LARCH advances its index to a value k (for any k ≤ k*), SEARCH returns a counterexample (x, h*(x)) to the version space; every hypothesis in this version space (which could be empty) is a union of fewer than k intervals.
If the version space is empty, then\nS must already contain positive examples from at least k different intervals in h\u2217 and at least k \u2212 1\nnegative examples separating them. If the version space is not empty, then the point x is either a\npositive example belonging to a previously uncovered interval in h\u2217 or a negative example splitting\nan existing interval. In either case, S[k] contains positive examples from at least k distinct intervals\nseparated by at least k \u2212 1 negative examples. The disagreement coef\ufb01cient of the set of unions of k\nintervals consistent with S[k] is at most 4k, independent of \ufffd.\nThe VC dimension of Hk is O(k), so Theorem 1 implies that with high probability, LARCH makes at\nmost k\u2217 + log(1/\ufffd) queries to SEARCH and \u02dcO((k\u2217)3 log(1/\ufffd) + (k\u2217)2 log3(1/\ufffd)) queries to LABEL.\n\n4.2 Proof of Theorem 1\n\nThe proof of Theorem 1 uses the following lemma regarding the CAL subroutine, proved in Ap-\npendix B. It is similar to a result of Hanneke [2011], but an important difference here is that the input\nversion space V is not assumed to contain h\u2217.\nLemma 1. Assume LABEL(x) = h\u2217(x) for every x in the support of DX . For any hypothesis set\nV \u2286 YX with VC dimension d < \u221e, and any \ufffd, \u03b4 \u2208 (0, 1), the following holds with probability at\nleast 1 \u2212 \u03b4. CAL(V, LABEL, \ufffd, \u03b4) returns labeled examples T \u2286 {(x, h\u2217(x)) : x \u2208 X} such that for\nany h in V (T ), Pr(x,y)\u223cD[h(x) \ufffd= y \u2227 x \u2208 Dis(V (T ))] \u2264 \ufffd; furthermore, it draws at most \u02dcO(d/\ufffd)\nunlabeled examples from DX , and makes at most \u02dcO (\u03b8V (\ufffd) \u00b7 d \u00b7 log2(1/\ufffd)) queries to LABEL.\nWe now prove Theorem 1. 
By Lemma 1 and a union bound, there is an event with probability at least 1 − Σi≥1 δ/(i² + i) ≥ 1 − δ such that each call to CAL made by LARCH satisfies the high-probability guarantee from Lemma 1. We henceforth condition on this event.
We first establish the guarantee on the error rate of a hypothesis returned by LARCH. By the assumed properties of LABEL and SEARCH, and the properties of CAL from Lemma 1, the labeled examples S in LARCH are always consistent with h*. Moreover, the return property of CAL implies that at the end of any loop iteration, with the present values of S, k, and ℓ, we have Pr(x,y)∼D[h(x) ≠ y ∧ x ∈ Dis(Hk(S))] ≤ 2^−ℓ for all h ∈ Hk(S). (The same holds trivially before the first loop iteration.) Therefore, if LARCH halts and returns a hypothesis h ∈ Hk(S), then there is no counterexample to Hk(S), and Pr(x,y)∼D[h(x) ≠ y ∧ x ∈ Dis(Hk(S))] ≤ ε. These consequences and the law of total probability imply err(h) = Pr(x,y)∼D[h(x) ≠ y ∧ x ∈ Dis(Hk(S))] ≤ ε.
We next consider the number of for-loop iterations executed by LARCH. Let Si, ki, and ℓi be, respectively, the values of S, k, and ℓ at the start of the i-th for-loop iteration in LARCH. We claim that if LARCH does not halt in the i-th iteration, then one of k and ℓ is incremented by at least one. Clearly, if there is no counterexample to Hki(Si) and 2^−ℓi > ε, then ℓ is incremented by one (Step 8). If, instead, there is a counterexample (x, y), then Hki(Si ∪ {(x, y)}) = ∅, and hence k is incremented to some index larger than ki (Step 12). This proves that ki+1 + ℓi+1 ≥ ki + ℓi + 1. We also have ki ≤ k*, since h* ∈ Hk* is consistent with S, and ℓi ≤ log2(1/ε), as long as LARCH does not halt in for-loop iteration i. So the total number of for-loop iterations is at most k* + log2(1/ε). Together with Lemma 1, this bounds the number of unlabeled examples drawn from DX.
Finally, we bound the number of queries to SEARCH and LABEL. The number of queries to SEARCH is the same as the number of for-loop iterations, which is at most k* + log2(1/ε). By Lemma 1 and the fact that V(S′ ∪ S′′) ⊆ V(S′) for any hypothesis space V and sets of labeled examples S′, S′′, the number of LABEL queries made by CAL in the i-th for-loop iteration is at most Õ(θki(ε) · dki · ℓi² · polylog(i)). The claimed bound on the number of LABEL queries made by LARCH now readily follows by taking a max over i, and using the facts that i ≤ k* + log2(1/ε) and dk′ ≤ dk* for all k′ ≤ k*.

4.3 An Improved Algorithm

LARCH is somewhat conservative in its use of SEARCH, interleaving just one SEARCH query between sequences of LABEL queries (from CAL). Often, it is advantageous to advance to higher complexity hypothesis classes quickly, as long as there is justification to do so. Counterexamples from SEARCH provide such justification, and a ⊥ result from SEARCH also provides useful feedback about the current version space: outside of its disagreement region, the version space is in complete agreement with h* (even if the version space does not contain h*). Based on these observations, we propose an improved algorithm for the realizable setting, which we call SEABEL. Due to space limitations, we present it in Appendix C. We prove the following performance guarantee for SEABEL.
Theorem 2. Assume that err(h*) = 0.
Let \u03b8k(\u00b7) denote the disagreement coef\ufb01cient of V ki\nat the \ufb01rst iteration i in SEABEL where ki \u2265 k. Fix any \ufffd, \u03b4 \u2208 (0, 1).\nIf SEABEL is run\nwith inputs hypothesis classes {Hk}\u221ek=0, oracles SEARCH and LABEL, and learning parame-\nters \ufffd, \u03b4 \u2208 (0, 1), then with probability 1 \u2212 \u03b4: SEABEL halts and returns a classi\ufb01er with\nerror rate at most \ufffd; furthermore, it draws at most \u02dcO((dk\u2217 + log k\u2217)/\ufffd) unlabeled examples\nfrom DX , makes at most k\u2217 + O (log(dk\u2217 /\ufffd) + log log k\u2217) queries to SEARCH, and at most\n\u02dcO (maxk\u2264k\u2217 \u03b8k(2\ufffd) \u00b7 (dk\u2217 log2(1/\ufffd) + log k\u2217)) queries to LABEL.\nIt is not generally possible to directly compare Theorems 1 and 2 on account of the algorithm-\ndependent disagreement coef\ufb01cient bounds. However, in cases where these disagreement coef\ufb01cients\nare comparable (as in the union-of-intervals example), the SEARCH complexity in Theorem 2 is\nslightly higher (by additive log terms), but the LABEL complexity is smaller than that from Theorem 1\nby roughly a factor of k\u2217. For the union-of-intervals example, SEABEL would learn target union of\nk\u2217 intervals with k\u2217 + O(log(k\u2217/\ufffd)) queries to SEARCH and \u02dcO((k\u2217)2 log2(1/\ufffd)) queries to LABEL.\n\ni\n\n5 Non-Realizable Case\n\nIn this section, we consider the case where the optimal hypothesis h\u2217 may have non-zero error rate,\ni.e., the non-realizable (or agnostic) setting. In this case, the algorithm LARCH, which was designed\nfor the realizable setting, is no longer applicable. First, examples obtained by LABEL and SEARCH\nare of different quality: those returned by SEARCH always agree with h\u2217, whereas the labels given\nby LABEL need not agree with h\u2217. 
Moreover, the version spaces (even when k = k\u2217) as de\ufb01ned by\nLARCH may always be empty due to the noisy labels.\n\n6\n\n\fAnother complication arises in our SRM setting that differentiates it from the usual agnostic active\nlearning setting. When working with a speci\ufb01c hypothesis class Hk in the nested sequence, we\nmay observe high error rates because (i) the \ufb01nite sample error is too high (but additional labeled\nexamples could reduce it), or (ii) the current hypothesis class Hk is impoverished. In case (ii), the best\nhypothesis in Hk may have a much larger error rate than h\u2217, and hence lower bounds [K\u00e4\u00e4ri\u00e4inen,\n2006] imply that active learning on Hk instead of Hk\u2217 may be substantially more dif\ufb01cult.\nThese dif\ufb01culties in the SRM setting are circumvented by an algorithm that adaptively estimates the\nerror of h\u2217. The algorithm, A-LARCH (Algorithm 5), is presented in Appendix D.\nTheorem 3. Assume err(h\u2217) = \u03bd. Let \u03b8k(\u00b7) denote the disagreement coef\ufb01cient of V ki\ni at the \ufb01rst\niteration i in A-LARCH where ki \u2265 k. Fix any \ufffd, \u03b4 \u2208 (0, 1). If A-LARCH is run with inputs hypothe-\nsis classes {Hk}\u221ek=0, oracles SEARCH and LABEL, learning parameter \u03b4, and unlabeled example\nbudget \u02dcO((dk\u2217 + log k\u2217)(\u03bd + \ufffd)/\ufffd2), then with probability 1 \u2212 \u03b4: A-LARCH returns a classi\ufb01er\nwith error rate \u2264 \u03bd + \ufffd; it makes at most k\u2217 + O (log(dk\u2217 /\ufffd) + log log k\u2217) queries to SEARCH, and\n\u02dcO (maxk\u2264k\u2217 \u03b8k(2\u03bd + 2\ufffd) \u00b7 (dk\u2217 log2(1/\ufffd) + log k\u2217) \u00b7 (1 + \u03bd2/\ufffd2)) queries to LABEL.\nThe proof is in Appendix D. 
The LABEL query complexity is at least a factor of k* better than that in Hanneke [2011], and sometimes exponentially better, thanks to the reduced disagreement coefficient of the version space when consistency constraints are incorporated.

5.1 AA-LARCH: an Opportunistic Anytime Algorithm

In many practical scenarios, termination conditions based on quantities like a target excess error rate ε are undesirable. The target ε is unknown, and we instead prefer an algorithm that performs as well as possible until a cost budget is exhausted. Fortunately, when the primary cost being considered is LABEL queries, there are many LABEL-only active learning algorithms that readily work in such an "anytime" setting [see, e.g., Dasgupta et al., 2007, Hanneke, 2014].
The situation is more complicated when we consider both SEARCH and LABEL: we can often make substantially more progress with SEARCH queries than with LABEL queries (as the error rate of the best hypothesis in Hk′ for k′ > k can be far lower than in Hk). AA-LARCH (Algorithm 2) shows that although these queries come at a higher cost, the cost can be amortized.
AA-LARCH relies on several subroutines: SAMPLE-AND-LABEL, ERROR-CHECK, PRUNE-VERSION-SPACE and UPGRADE-VERSION-SPACE (Algorithms 6, 7, 8, and 9). The detailed descriptions are deferred to Appendix E. SAMPLE-AND-LABEL performs standard disagreement-based selective sampling using oracle LABEL; labels of examples in the disagreement region are queried, otherwise inferred. PRUNE-VERSION-SPACE prunes the version space given the labeled examples collected, based on standard generalization error bounds. ERROR-CHECK checks if the best hypothesis in the version space has large error; SEARCH is used to find a systematic mistake for the version space; if either event happens, AA-LARCH calls UPGRADE-VERSION-SPACE to increase k, the level of our working hypothesis class.

Algorithm 2 AA-LARCH
input: Nested hypothesis set H0 ⊆ H1 ⊆ ··· ; oracles LABEL and SEARCH; learning parameter δ ∈ (0, 1); SEARCH-to-LABEL cost ratio τ; dataset size upper bound N.
output: hypothesis ĥ.
1: Initialize: consistency constraints S ← ∅, counter c ← 0, k ← 0, verified labeled dataset L̃ ← ∅, working labeled dataset L0 ← ∅, unlabeled examples processed i ← 0, Vi ← Hk(S).
2: loop
3:   Reset counter c ← 0.
4:   repeat
5:     if ERROR-CHECK(Vi, Li, δi) then
6:       (k, S, Vi) ← UPGRADE-VERSION-SPACE(k, S, ∅)
7:       Vi ← PRUNE-VERSION-SPACE(Vi, L̃, δi)
8:       Li ← L̃
9:       continue loop
10:    end if
11:    i ← i + 1
12:    (Li, c) ← SAMPLE-AND-LABEL(Vi−1, LABEL, Li−1, c)
13:    Vi ← PRUNE-VERSION-SPACE(Vi−1, Li, δi)
14:  until c = τ or li = N
15:  e ← SEARCHHk(Vi)
16:  if e ≠ ⊥ then
17:    (k, S, Vi) ← UPGRADE-VERSION-SPACE(k, S, {e})
18:    Vi ← PRUNE-VERSION-SPACE(Vi, L̃, δi)
19:    Li ← L̃
20:  else
21:    Update verified dataset L̃ ← Li.
22:    Store temporary solution ĥ ← arg minh′∈Vi err(h′, L̃).
23:  end if
24: end loop

Theorem 4. Assume err(h*) = ν. Let θk′(·) denote the disagreement coefficient of Vi at the first iteration i after which k ≥ k′. Fix any ε ∈ (0, 1). Let nε = Õ(maxk≤k* θk(2ν + 2ε) dk* (1 + ν²/ε²)) and define Cε = 2(nε + k*τ). Run Algorithm 2 with a nested sequence of hypotheses {Hk}∞k=0, oracles LABEL and SEARCH, confidence parameter δ, cost ratio τ ≥ 1, and upper bound N = Õ(dk*/ε²). If the cost spent is at least Cε, then with probability 1 − δ, the current hypothesis ĥ has error at most ν + ε.
The proof is in Appendix E. A comparison to Theorem 3 shows that AA-LARCH is adaptive: for any cost complexity C, the excess error rate ε is roughly at most twice that achieved by A-LARCH.

6 Discussion

The SEARCH oracle captures a powerful form of interaction that is useful for machine learning. Our theoretical analyses of LARCH and variants demonstrate that SEARCH can substantially improve LABEL-based active learners, while being plausibly cheaper to implement than oracles like CCQ.
Are there examples where CCQ is substantially more powerful than SEARCH? This is a key question, because a good active learning system should use minimally powerful oracles. Another key question is: Can the benefits of SEARCH be provided in a computationally efficient general purpose manner?

References

Josh Attenberg and Foster J. Provost. Why label when you can search? Alternatives to active learning for applying human resources to build classification models under extreme class imbalance. In Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, July 25-28, 2010, pages 423–432, 2010.

Maria-Florina Balcan and Steve Hanneke. Robust interactive learning. In COLT, 2012.

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In ICML, 2006.

Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.

Alina Beygelzimer, Daniel Hsu, John Langford, and Tong Zhang. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems 23, 2010.

Alina Beygelzimer, Daniel Hsu, Nikos Karampatziakis, John Langford, and Tong Zhang. Efficient active learning. In ICML Workshop on Online Trading of Exploration and Exploitation, 2011.

David A. Cohn, Les E.
Atlas, and Richard E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

Sanjoy Dasgupta. Coarse sample complexity bounds for active learning. In Advances in Neural Information Processing Systems 18, 2005.

Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.

Luc Devroye, László Györfi, and Gábor Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.

Steve Hanneke. A bound on the label complexity of agnostic active learning. In ICML, pages 249–278, 2007.

Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309, 2014. ISSN 1935-8237. doi: 10.1561/2200000037.

Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In Advances in Neural Information Processing Systems 28, 2015.

Matti Kääriäinen. Active learning in the non-realizable case. In Algorithmic Learning Theory, 17th International Conference, ALT 2006, pages 63–77, 2006.

Patrice Y. Simard, David Maxwell Chickering, Aparna Lakshmiratan, Denis Xavier Charles, Léon Bottou, Carlos Garcia Jurado Suarez, David Grangier, Saleema Amershi, Johan Verwey, and Jina Suh. ICE: Enabling non-experts to build models interactively for large-scale lopsided problems. CoRR, abs/1409.4814, 2014. URL http://arxiv.org/abs/1409.4814.

Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, 1982.

Vladimir N. Vapnik and Alexey Ya. Chervonenkis.
On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.
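The two-oracle control flow of AA-LARCH (Algorithm 2) can be sketched in Python. This is an illustrative reconstruction under simplifying assumptions, not the paper's implementation: ERROR-CHECK, PRUNE-VERSION-SPACE, SAMPLE-AND-LABEL, and UPGRADE-VERSION-SPACE are replaced by schematic stand-ins (exact version-space pruning over a finite grid of threshold classifiers, disagreement-based label queries, and a SEARCH oracle that never returns a counterexample in the realizable toy), and the δi confidence schedules are omitted.

```python
import random

def aa_larch(search, sample_and_label, prune, error_check, upgrade, tau, N, V0):
    """Schematic control flow of AA-LARCH: spend LABEL queries in an inner
    loop, then verify progress with a single SEARCH query.  A returned
    counterexample upgrades the hypothesis class; otherwise the working
    labeled data becomes verified.  Stops once N labels are verified."""
    S, k = [], 0            # consistency constraints from SEARCH; class index
    verified = []           # verified labeled dataset (L-tilde in the paper)
    working = []            # working labeled dataset (L_i)
    V, best = V0(k, S), None
    while True:
        c = 0               # LABEL cost spent since the last SEARCH query
        while True:         # "repeat ... until c = tau or |L_i| = N"
            if error_check(V, working):
                k, S, V = upgrade(k, S, None)   # inconsistency: move up a class
                V = prune(V, verified)
                working = list(verified)
                break
            working, c = sample_and_label(V, working, c)
            V = prune(V, working)
            if c >= tau or len(working) >= N:
                e = search(k, V)                # one (expensive) SEARCH query
                if e is not None:
                    k, S, V = upgrade(k, S, e)  # counterexample: upgrade class
                    V = prune(V, verified)
                    working = list(verified)
                else:                           # no counterexample: verify work
                    verified = list(working)
                    best = min(V, key=lambda h: sum(h(x) != y for x, y in verified))
                break
        if len(verified) >= N:
            return best

def demo(seed=0):
    """Toy realizable run: threshold classifiers on [0, 1], true threshold 0.5."""
    random.seed(seed)
    grid = [i / 10 for i in range(11)]
    hyp = lambda t: (lambda x: x >= t)          # hypotheses as callables
    truth = hyp(0.5)
    stream = iter(random.random() for _ in range(200))

    def sample_and_label(V, working, c):
        x = next(stream)
        preds = {h(x) for h in V}
        if len(preds) > 1:                      # version space disagrees: query LABEL
            return working + [(x, truth(x))], c + 1
        return working + [(x, preds.pop())], c  # agreement: infer the label for free

    prune = lambda V, data: [h for h in V if all(h(x) == y for x, y in data)]
    error_check = lambda V, data: len(V) == 0   # schematic stand-in for ERROR-CHECK
    upgrade = lambda k, S, e: (k + 1, S + ([e] if e else []), [hyp(t) for t in grid])
    search = lambda k, V: None                  # realizable: no counterexample exists

    return aa_larch(search, sample_and_label, prune, error_check, upgrade,
                    tau=3, N=10, V0=lambda k, S: [hyp(t) for t in grid])
```

In the toy run, the inner loop alternates between querying LABEL where the version space disagrees and inferring labels elsewhere; once τ labels are spent or N examples are collected, a single SEARCH query either upgrades the class or verifies the working dataset, mirroring the repeat/until structure of Algorithm 2.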