{"title": "Active Learning with Oracle Epiphany", "book": "Advances in Neural Information Processing Systems", "page_first": 2820, "page_last": 2828, "abstract": "We present a theoretical analysis of active learning with more realistic interactions with human oracles. Previous empirical studies have shown oracles abstaining on difficult queries until accumulating enough information to make label decisions. We formalize this phenomenon with an \u201coracle epiphany model\u201d and analyze active learning query complexity under such oracles for both the realizable and the agnos- tic cases. Our analysis shows that active learning is possible with oracle epiphany, but incurs an additional cost depending on when the epiphany happens. Our results suggest new, principled active learning approaches with realistic oracles.", "full_text": "Active Learning with Oracle Epiphany\n\nTzu-Kuo Huang \u2217\n\nUber Advanced Technologies Group\n\nPittsburgh, PA 15201\n\nAra Vartanian\n\nUniversity of Wisconsin\u2013Madison\n\nMadison, WI 53706\n\nSaleema Amershi\nMicrosoft Research\nRedmond, WA 98052\n\nAbstract\n\nLihong Li\n\nMicrosoft Research\nRedmond, WA 98052\n\nXiaojin Zhu\n\nUniversity of Wisconsin\u2013Madison\n\nMadison, WI 53706\n\nWe present a theoretical analysis of active learning with more realistic interactions\nwith human oracles. Previous empirical studies have shown oracles abstaining on\ndif\ufb01cult queries until accumulating enough information to make label decisions.\nWe formalize this phenomenon with an \u201coracle epiphany model\u201d and analyze active\nlearning query complexity under such oracles for both the realizable and the agnos-\ntic cases. Our analysis shows that active learning is possible with oracle epiphany,\nbut incurs an additional cost depending on when the epiphany happens. 
Our results suggest new, principled active learning approaches with realistic oracles.

1 Introduction

There is currently a wide gap between theory and practice of active learning with oracle interaction. Theoretical active learning assumes an omniscient oracle. Given a query x, the oracle simply answers its label y by drawing from the conditional distribution p(y | x). This oracle model is motivated largely by its convenience for analysis. However, there is mounting empirical evidence from psychology and human-computer interaction research that humans behave in far more complex ways. The oracle may abstain on some queries [Donmez and Carbonell, 2008] (note this is distinct from classifier abstention [Zhang and Chaudhuri, 2014, El-Yaniv and Wiener, 2010]), or their answers can be influenced by the identity and order of previous queries [Newell and Ruths, 2016, Sarkar et al., 2016, Kulesza et al., 2014] and by incentives [Shah and Zhou, 2015]. Theoretical active learning has yet to account for such richness in human behaviors, which are critical to designing principled algorithms to effectively learn from human annotators.

This paper takes a step toward bridging this gap. Specifically, we formalize and analyze the phenomenon of “oracle epiphany.” Consider active learning from a human oracle to build a webpage classifier for basketball vs. other topics. It is well known in practice that no matter how simple the task looks, the oracle can encounter difficult queries. The oracle may easily answer webpage queries that are obviously about basketball or obviously not about the sport, until she encounters a webpage on basketball jerseys. Here, the oracle cannot immediately decide how to label (“Does this jersey webpage qualify as a webpage about basketball?”). One solution is to allow the oracle to abstain by answering with a special I-don’t-know label [Donmez and Carbonell, 2008]. 
More interestingly, Kulesza et al. [2014] demonstrated that with proper user interface support, the oracle may temporarily abstain on similar queries but then have an “epiphany”: she may suddenly decide how to label all basketball apparel-related webpages. Empirical evidence in [Kulesza et al., 2014] suggests that epiphany may be induced by the accumulative effect of seeing multiple similar queries. If a future basketball-jersey webpage query arrives, the oracle will no longer abstain but will answer with the label she determined during epiphany. In this way, the oracle improves herself on the subset of the input space that corresponds to basketball apparel-related webpages.

∗Part of this work was done while the author was with Microsoft Research.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Empirical evidence also suggests that oracle abstention, and subsequent epiphany, may happen separately on different subsets of the input space. When building a cooking vs. others text classifier, Kulesza et al. [2014] observed oracle epiphany on a subset of cooking supplies documents, and separately on the subset of culinary service documents; on gardening vs. others, they observed separate oracle epiphany on plant information and on local garden documents; on travel vs. others, they observed separate oracle epiphany on photography, rental cars, and medical tourism documents.

Our contributions are three-fold: (i) we formalize oracle epiphany in Section 2; (ii) we analyze EPICAL, a variant of the CAL algorithm [Cohn et al., 1994], for realizable active learning with oracle epiphany in Section 3; and (iii) we analyze Oracular-EPICAL, a variant of the Oracular-CAL algorithm [Hsu, 2010, Huang et al., 2015], for agnostic active learning in Section 4. 
Our query complexity bounds show that active learning is possible with oracle epiphany, although we may incur a penalty waiting for epiphany to happen. This is verified with simulations in Section 5, which highlight the nuanced dependency between query complexity and epiphany parameters.

2 Problem Setting

As in standard active learning, we are given a hypothesis class H ⊆ Y^X for some input space X and a binary label set Y ≜ {−1, 1}. There is an unknown distribution µ over X × Y, from which examples are drawn IID. The marginal distribution over X is µX. Define the expected classification error, or risk, of a classifier h ∈ H to be err(h) ≜ E(x,y)∼µ[1(h(x) ≠ y)]. As usual, the active learning goal is as follows: given any fixed ε, δ ∈ (0, 1), we seek an active learning algorithm which, with probability at least 1 − δ, returns a hypothesis with classification error at most ε after sending a “small” number of queries to the oracle. What is unique here is an “oracle epiphany model.”

The input space consists of two disjoint sets X = K ∪ U. The oracle knows the label for items in K (for “known”) but initially does not know the labels in U (for “unknown”). The oracle will abstain if a query comes from U (unless epiphany happens, see below). Furthermore, U is partitioned into K disjoint subsets U = U1 ∪ U2 ∪ . . . ∪ UK. These correspond to the photography/rental cars/medical tourism subsets in the travel task earlier. 
The active learner knows neither the partition nor K. When the active learner submits a query x ∈ X to the oracle, the learner will receive one of three outcomes in Y+ ≜ {−1, 1, ⊥}, where ⊥ indicates I-don’t-know abstention.

Importantly, we assume that epiphany is modeled as K Markov chains: whenever a unique x ∈ Uk is queried on some unknown region k ∈ {1, . . . , K} which did not experience epiphany yet, the oracle has a probability β ∈ [0, 1] of epiphany on that region. If epiphany happens, the oracle then understands how to label everything in Uk. In effect, the state of Uk is flipped from unknown to known. Epiphany is irrevocable: Uk will stay known from now on and the oracle will answer accordingly for all future x therein. Thus the oracle will only answer ⊥ if Uk remains unknown. The requirement for a unique x is to prevent a trivial active learning algorithm which repeatedly queries the same ⊥ item in an attempt to induce oracle epiphany. This requirement does not pose difficulty for analysis if µX is continuous on X, since all queries will be unique with probability one.

Therefore, our oracle epiphany model is parameterized by (β, K, U1, . . . , UK). All our analyses below will be based on this epiphany model. Of course, the model is only an approximation to real human oracle behaviors; in Section 6 we will discuss more sophisticated epiphany models for future work.

3 The Realizable Case

In this section, we study the realizable active learning case, where we assume there exists some h∗ ∈ H such that the label of an example x ∈ X is y = h∗(x). It follows that err(h∗) = 0. Although the realizability assumption is strong, the analysis is insightful on the role of epiphany. We will show that the worst-case query complexity has an additional 1/β dependence. 
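Before analyzing learners against this oracle, it may help to see the epiphany model itself in code. The following is a minimal simulation sketch of the (β, K, U1, . . . , UK) oracle of Section 2; it is ours, not from the paper, and all names (EpiphanyOracle, label_fn, region_fn) are illustrative:

```python
import random

class EpiphanyOracle:
    # Simulates the oracle epiphany model of Section 2: queries in the known
    # set K are always labeled; a *unique* query in an unknown region U_k
    # triggers epiphany on U_k with probability beta, after which the whole
    # region flips to known, irrevocably. Otherwise the oracle abstains.
    ABSTAIN = None  # stands in for the I-don't-know answer (the ⊥ symbol)

    def __init__(self, label_fn, region_fn, num_regions, beta, seed=0):
        self.label_fn = label_fn    # target labeling x -> {-1, +1}
        self.region_fn = region_fn  # x -> region index in 0..K-1, or -1 if x is in the known set
        self.known = [False] * num_regions
        self.beta = beta
        self.rng = random.Random(seed)
        self.seen = set()           # only unique queries can induce epiphany

    def query(self, x):
        k = self.region_fn(x)
        if k == -1 or self.known[k]:
            return self.label_fn(x)
        if x not in self.seen:      # a repeated query cannot induce epiphany
            self.seen.add(x)
            if self.rng.random() < self.beta:
                self.known[k] = True   # epiphany: U_k flips to known
                return self.label_fn(x)
        return EpiphanyOracle.ABSTAIN
```

With beta = 1 this reduces to the standard omniscient oracle, and with beta = 0 every query inside U is answered with an abstention forever, matching the two boundary cases used in Section 5.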
We also discuss nice cases where this 1/β can be avoided, depending on U’s interaction with the disagreement region. Furthermore, our analysis focuses on the K = 1 case; that is, the oracle has only one unknown region U = U1. This case is the simplest but captures the essence of the algorithm we propose in this section.

For convenience, we will drop the subscript and write U. In the next section, we will eliminate both assumptions, and present and analyze an algorithm for the agnostic case with an arbitrary K ≥ 1.

We modify the standard CAL algorithm [Cohn et al., 1994] to accommodate oracle epiphany. The modified algorithm, which we call EPICAL for “epiphany CAL,” is given in Alg. 1. Like CAL, EPICAL receives a stream of unlabeled items; it maintains a version space; if the unlabeled item falls into the disagreement region of the version space, the oracle is queried. The essential difference from CAL is that if the oracle answers ⊥, no update to the version space happens. The stopping criterion ensures that the true risk of any hypothesis in the version space is at most ε, with high probability.

Algorithm 1 EPICAL
Input: ε, δ, oracle, X, H
Version space V ← H
Disagreement region D ← {x ∈ X | ∃h, h′ ∈ V, h(x) ≠ h′(x)}
for t = 1, 2, 3, . . . 
do
    Sample an unlabeled example from the marginal distribution restricted to D: xt ∼ µX|D
    Query oracle with xt to get yt
    if yt ≠ ⊥ then
        V ← {h ∈ V | h(xt) = yt}
        D ← {x ∈ X | ∃h, h′ ∈ V, h(x) ≠ h′(x)}
    end if
    if µX(D) ≤ ε then
        Return any h ∈ V
    end if
end for

Our analysis is based on the following observation: before oracle epiphany and ignoring all queries that result in ⊥, EPICAL behaves exactly the same as CAL on an induced active-learning problem. The induced problem has input space K, but with a projected hypothesis space we detail below. Hence, standard CAL analysis bounds the number of queries to find a good hypothesis in the induced problem. Now consider the sequence of probabilities of getting a ⊥ label in each step of EPICAL. If these probabilities tend to be small, EPICAL will terminate with an ε-risk hypothesis without even having to wait for epiphany. If these probabilities tend to be large, we may often hit the unknown region U. But the number of such steps is bounded because epiphany will happen with high probability.

Formally, we define the induced active-learning problem as follows. The input space is X̄ ≜ K, and the output space is still Y. The sampling distribution is µ̄X(x) ≜ µX(x)1(x ∈ K)/µX(K). The hypothesis space is the projection of H onto X̄: H̄ ≜ {h̄ ∈ Y^X̄ | ∃h ∈ H, ∀x ∈ X̄ : h̄(x) = h(x)}. Clearly, the induced problem is still realizable; let h̄∗ be the projected target hypothesis. Let θ be the disagreement coefficient [Hanneke, 2014] for the original problem without unknown regions. 
The induced problem potentially has a different disagreement coefficient:

θ̄ ≜ sup_{r>0} r^{−1} · E_{x∼µ̄X}[1(∃h̄ ∈ H̄ s.t. h̄∗(x) ≠ h̄(x), E_{x′∼µ̄X}[1(h̄(x′) ≠ h̄∗(x′))] ≤ r)].

Let m̄ be the number of queries required for the CAL algorithm to find a hypothesis of ε/2 risk with probability 1 − δ/4 in the induced problem. It is known [Hanneke, 2014, Theorem 5.1] that

m̄ ≤ M̄ ≜ θ̄ (dim(H̄) ln θ̄ + ln(4/δ)) ln(2/ε),

where dim(·) is the VC dimension. Similarly, let mCAL be the number of queries required for CAL to find a hypothesis of ε risk with probability 1 − δ/4 in the original problem, and we have mCAL ≤ MCAL ≜ θ (dim(H) ln θ + ln(4/δ)) ln(1/ε). Furthermore, define m⊥ ≜ |{t | yt = ⊥}| to be the number of queries in EPICAL for which the oracle returns ⊥. We define Ut to be U for an iteration t before epiphany, and ∅ after that. We define Dt to be the disagreement region D at iteration t. Finally, define the unknown fraction within disagreement as αt ≜ µX(Dt ∩ Ut)/µX(Dt). We are now ready to state the main result of this section.

Theorem 1. Given any ε and δ, EPICAL will, with probability at least 1 − δ, return an ĥ ∈ H with err(ĥ) ≤ ε, after making at most MCAL + M̄ + (3/β) ln(4/δ) queries.

Remark. The bound above consists of three terms. The first is the standard CAL query complexity bound with an omniscient oracle. 
The other two are the price we pay when the oracle is imperfect. The second term is the query complexity for finding a low-risk hypothesis in the induced active-learning problem. In situations where µX(U) = ε/2 and β ≪ 1, it is hard to induce epiphany, but it suffices to find a hypothesis from H̄ with ε/2 risk in the induced problem (which implies at most ε risk under the original distribution µX); this indicates that M̄ is unavoidable in some cases. The third term is roughly the extra query complexity required to induce epiphany. It is unavoidable in the worst case: when U = X, one has to wait for oracle epiphany to start collecting labeled examples to infer h∗; the average number of steps until epiphany is on the order of 1/β. Finally, note that not all three terms contribute simultaneously to the query complexity of EPICAL. As we will see in the analysis and in the experiments, usually one or two of them will dominate, depending on how U interacts with the disagreement region. Summing them up simplifies our exposition, without changing the order of the worst-case bounds.

Our analysis starts with the definition of the following two events. Lemmas 2 and 3 show that they hold with high probability when running EPICAL; the proofs are delegated to Appendix A. Define:

E⊥ ≜ { m⊥ ≤ (1/β) ln(4/δ) }  and  Eα ≜ { |{t | αt > 1/2}| ≤ (2/β) ln(4/δ) }.

Lemma 2. Pr{E⊥} ≥ 1 − δ/4.
Lemma 3. Pr{Eα} ≥ 1 − δ/4.
Lemma 4. Assume event Eα holds. Then, the number of queries from K before oracle epiphany or before EPICAL terminates, whichever happens first, is at most m̄ + (2/β) ln(4/δ).

Proof. (sketch) Denote the quantity by m. 
Before epiphany, V and D in EPICAL behave in exactly the same way as in CAL on K. It takes m̄ queries to get to ε/2 accuracy in K by the definition of m̄. If m ≤ m̄, then m < m̄ + (2/β) ln(4/δ) trivially, and we are done. Otherwise, it must be the case that αt > 1/2 for every step after V reaches ε/2 accuracy on K. Suppose not. Then there is a step t where αt ≤ 1/2. Note that V reaching ε/2 accuracy on K implies µX(Dt) − µX(Dt ∩ Ut) ≤ ε/2. Together with αt = µX(Dt ∩ Ut)/µX(Dt) ≤ 1/2, we have µX(Dt) ≤ ε. But this would have triggered termination of EPICAL at step t, a contradiction. Since we assume Eα holds, we have m ≤ m̄ + (2/β) ln(4/δ).

Proof of Theorem 1. We will prove the query complexity bound, assuming (i) events E⊥ and Eα hold; and (ii) M̄ and MCAL successfully upper bound the corresponding query complexity of standard CAL. By Lemmas 2 and 3 and a union bound, the above holds with probability at least 1 − δ.

Suppose epiphany happens before EPICAL terminates. By event E⊥ and Lemma 4, the total number of queried examples before epiphany is at most m̄ + (3/β) ln(4/δ). After epiphany, the total number of queries is no more than that of running CAL from scratch; this number is at most MCAL. Therefore, the total query complexity is at most M̄ + MCAL + (3/β) ln(4/δ).

Suppose epiphany does not happen before EPICAL terminates. In this case, the number of queries in the unknown region is at most (1/β) ln(4/δ) (event E⊥), and the number of queries in the known region is at most m̄ + (2/β) ln(4/δ) (Lemma 4). 
Thus, the total number of queries is at most M̄ + (3/β) ln(4/δ).

4 The Agnostic Case

In the agnostic setting the best hypothesis, h∗ ≜ arg min_h err(h), has a nonzero error. We want an active learning algorithm that, for a given accuracy ε > 0, returns a hypothesis h with small regret reg(h, h∗) ≜ err(h) − err(h∗) ≤ ε while making a small number of queries. Among existing agnostic active learning algorithms we choose to adapt the Oracular-CAL algorithm, first proposed by Hsu [2010] and later improved by Huang et al. [2015]. Oracular-CAL makes no assumption on H or µ, and can be implemented solely with an empirical risk minimization (ERM) subroutine, which is often well approximated by convex optimization over a surrogate loss in practice. This is a significant advantage over several existing agnostic algorithms, which either explicitly maintain a version space, as done in A² [Balcan et al., 2006], or require a constrained ERM routine [Dasgupta et al., 2007] that may not be well approximated efficiently in practice. IWAL [Beygelzimer et al., 2010] and Active Cover [Huang et al., 2015] are agnostic algorithms that are implementable with an ERM routine, both using importance weights to correct for querying bias. But in the presence of ⊥’s, choosing proper importance weights becomes challenging. Moreover, the improved Oracular-CAL [Huang et al., 2015] we use² has stronger guarantees than IWAL, and in fact, the best known worst-case guarantees among efficient, agnostic active learning algorithms.

Algorithm 2 Oracular-EPICAL
1: Set c1 ≜ 4 and c2 ≜ 2√6 + 9. Let η0 ≜ 1 and ηt ≜ (12/t) ln(32t|H| ln t / δ), t ≥ 1.
2: Initialize labeled data Z0 ← ∅, the version space V1 ← H, and the ERM h1 as any h ∈ H.
3: for t = 1, 2, . . . do
4:     Observe new example xt, where (xt, yt) ∼ µ i.i.d.
5:     if xt ∈ Dt ≜ {x ∈ X | ∃(h, h′) ∈ Vt² s.t. h(x) ≠ h′(x)} then
6:         Query oracle with xt.
7:         Zt ← Zt−1 ∪ {(xt, yt)} if the oracle returns yt; Zt ← Zt−1 if the oracle returns ⊥.
8:         ut ← 1(oracle returns ⊥).
9:     else
10:        Zt ← Zt−1 ∪ {(xt, ht(xt))}.  // update the labeled data with the current ERM’s prediction
11:        ut ← 0.
12:    end if
13:    err(h, Zt) ≜ (1/t) Σ_{i=1}^t [1(xi ∈ Di)(1 − ui)1(h(xi) ≠ yi) + 1(xi ∉ Di)1(h(xi) ≠ hi(xi))].
14:    ht+1 ← arg min_{h∈H} err(h, Zt).
15:    bt ← (1/t) Σ_{i=1}^t ui.
16:    Δt ← c1 √(ηt err(ht+1, Zt)) + c2(ηt + bt).
17:    Vt+1 ← {h ∈ H | err(h, Zt) − err(ht+1, Zt) ≤ Δt}.
18: end for

Our proposed algorithm, Oracular-EPICAL, is given in Alg. 2. Note that t here counts unlabeled data, while in Alg. 1 it counts queries. Roughly speaking, Oracular-EPICAL also has an additive factor of O(K/β) compared to Oracular-CAL’s query complexity. It keeps a growing set Z of labeled examples. If the unlabeled example xt falls in the disagreement region, the algorithm queries its label: when the oracle returns a label yt, the algorithm adds xt and yt to Z; when the oracle returns ⊥, no update to Z happens. If xt is outside the disagreement region, the algorithm adds xt and the label predicted by the current ERM hypothesis ht(xt) to Z. Alg. 2 keeps an indicator ut, which records whether ⊥ was returned on xt, and it always updates the ERM and the version space after every new xt. 
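This per-round bookkeeping can be sketched in a few lines. The sketch below is ours, not the authors’ implementation; names such as oracular_epical_round and the state dictionary are illustrative, the oracle signals ⊥ by returning None, and the ln t factor inside ηt is guarded at t = 1:

```python
import math

def oracular_epical_round(t, x, oracle, hypotheses, version_space, erm, data, state,
                          c1=4.0, c2=2.0 * math.sqrt(6.0) + 9.0, delta=0.05):
    # One round of Alg. 2 (sketch) for a finite list of classifiers.
    # 'data' plays the role of Z, 'state' holds the running abstention count.
    in_disagreement = any(h(x) != erm(x) for h in version_space)
    if in_disagreement:
        y = oracle.query(x)
        if y is None:                 # oracle abstained: Z is not updated
            state['num_abstain'] += 1
        else:
            data.append((x, y))
    else:
        data.append((x, erm(x)))      # impute the current ERM's prediction
    # Recompute the ERM and the threshold Delta_t, which grows additively
    # with b_t, the running average of abstentions (the key modification).
    def emp_err(h):
        return sum(1 for (xi, yi) in data if h(xi) != yi) / max(t, 1)
    erm = min(hypotheses, key=emp_err)
    b_t = state['num_abstain'] / t
    eta_t = (12.0 / t) * math.log(32.0 * t * len(hypotheses)
                                  * max(math.log(t), 1.0) / delta)
    delta_t = c1 * math.sqrt(eta_t * emp_err(erm)) + c2 * (eta_t + b_t)
    version_space = [h for h in hypotheses if emp_err(h) - emp_err(erm) <= delta_t]
    return erm, version_space, in_disagreement
```

Note that rounds on which the oracle abstained simply contribute zero terms to the empirical error, exactly as the (1 − ui) filter in line 13 of Alg. 2 prescribes.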
For simplicity we assume a finite H; this can be extended to H with finite VC dimension.

The critical modification we make here to accommodate oracle abstention is that the threshold Δt defining the version space additively depends on the average number of ⊥’s received up to round t. This allows us to show that Oracular-EPICAL retains the favorable bias guarantee of Oracular-CAL: with high probability, all of the imputed labels are consistent with the classifications of h∗, so imputation never pushes the algorithm away from h∗. Oracular-EPICAL only uses the version space in the disagreement test. With the same technique used by Oracular-CAL, summarized in Appendix B, the algorithm is able to perform the test solely with an ERM routine.

We now state Oracular-EPICAL’s general theoretical guarantees, which hold for any oracle model, and then specialize them for the epiphany model in Section 2. We start with a consistency result:

Theorem 5 (Consistency Guarantee). Pick any 0 < δ < 1/e and let Δ∗t := c1 √(ηt err(h∗)) + c2(ηt + bt). With probability at least 1 − δ, the following holds for all t ≥ 1:

err(h) − err(h∗) ≤ 4Δ∗t  for all h ∈ Vt+1,  (1)
and  err(h∗, Zt) − err(ht+1, Zt) ≤ Δt.  (2)

²This improved version of Oracular-CAL defines the version space using a tighter threshold than the one used by Hsu [2010], and has the same worst-case guarantees as Active Cover [Huang et al., 2015].

All hypotheses in the current version space, including the current ERM, have controlled expected regrets. Compared with Oracular-CAL’s consistency guarantee, this is worse by an additive factor of O(bt), the average number of ⊥’s over t examples. Importantly, h∗ always remains in the version space, as implied by (2). This guarantees that all predicted labels used by the algorithm are consistent with h∗, since the entire version space makes the same prediction. The query complexity bound is:

Theorem 6 (Query Complexity Bound). Let Qt ≜ Σ_{i=1}^t 1(xi ∈ Di) denote the total number of queries Alg. 2 makes after observing t examples. Under the conditions of Theorem 5, with probability at least 1 − δ the following holds: ∀t > 0, Qt is bounded by

4θ err(h∗) t + θ · O( √(t err(h∗) ln(t|H|/δ)) ln² t + ln(t|H|/δ) ln t + t bt ln t + 8 ln(8t² ln t/δ) ),

where θ denotes the disagreement coefficient [Hanneke, 2014].

Again, this result is worse than Oracular-CAL’s query complexity [Huang et al., 2015] by an additive factor. The magnitude of this factor is less trivial than it seems: since the algorithm increases the threshold by bt, it includes more hypotheses in the version space, which may cause the algorithm to query a lot more. However, our analysis shows that the number of queries only increases by O(t bt ln t), i.e., ln t times the total number of ⊥’s received over t examples.

The full proofs of both theorems are in Appendix C. Here we provide the key ingredient. Consider an imaginary dataset Z†t where all the labels queried by the algorithm but not returned by the oracle are imputed, and define the error on this imputed data:

err(h, Z†t) ≜ (1/t) Σ_{i=1}^t [1(xi ∈ Di)1(h(xi) ≠ yi) + 1(xi ∉ Di)1(h(xi) ≠ hi(xi))].  (3)

Note that the version space Vt and therefore the disagreement region Dt are still defined in terms of err(h, Zt), not err(h, Z†t). Also define the empirical regrets between two hypotheses h and h′: reg(h, h′, Zt) ≜ err(h, Zt) − err(h′, Zt), and reg(h, h′, Z†t) on Z†t in the same way. The empirical error and regret on Z†t are not observable, but can be easily bounded by observable quantities:

err(h, Zt) ≤ err(h, Z†t) ≤ err(h, Zt) + bt,  (4)
|reg(h, h′, Zt) − reg(h, h′, Z†t)| ≤ bt,  (5)

where bt = Σ_{i=1}^t ui / t is also observable. Using a martingale analysis resembling Huang et al. [2015]’s for Oracular-CAL, we prove concentration of the empirical regret reg(h, h∗, Z†t) to its expectation. For every h ∈ Vt+1, the algorithm controls its empirical regret on Zt, which bounds reg(h, h∗, Z†t) by the above. This leads to a bound on the expected regret of h. The query complexity analysis follows the standard framework of Hsu [2010] and Huang et al. [2015].

Next, we specialize the guarantees to the oracle epiphany model in Section 2:

Corollary 7. Assume the epiphany model in Section 2. Fix ε > 0, δ > 0. Let d̃ ≜ ln(|H|/(εδ)), K̃ ≜ K ln(K/δ) and e∗ ≜ err(h∗). 
With probability at least 1 − δ, the following holds: the ERM hypothesis h_{tε+1} satisfies err(h_{tε+1}) − e∗ ≤ ε, where tε = O( (e∗/ε² + 1/ε) d̃ + K̃/(εβ) ), and the total number of queries made up to round tε is

θ · O( (e∗/ε)² d̃ + (K̃ e∗)/(βε) + (d̃ + K̃/β) ln( (d̃ + K̃/β)/ε ) ).

The proof is in Appendix D. This corollary reveals how the epiphany parameters K and β affect query complexity. Setting K̃ = 0 recovers the result for a perfect oracle, showing that the (unlabeled) sample complexity tε worsens by an additive factor of K̃/(βε) in both realizable and agnostic settings. For query complexity, in the realizable setting the bound becomes θ · O( ln((d̃ + K̃/β)/ε) (d̃ + K̃/β) ). In the agnostic setting, the leading term in our bound is θ · O( (e∗/ε)² d̃ + (K̃ e∗)/(βε) ). In both cases, our bounds are worse by roughly an additive factor of O(K̃/β) than bounds for perfect oracles.

As for the effect of U, the above corollary is a worst-case result: it uses an upper bound on t bt that holds even for U = X. For certain U’s the upper bound can be much tighter. 
For example, if U ∩ Dt = ∅ for sufficiently large t, then t bt will be O(1) for all β, with or without epiphany.

5 Experiments

To complement our theoretical results, we present two simulated experiments on active learning with oracle epiphany: learning a 1D threshold classifier and handwritten digit recognition (OCR). Specifically, we will highlight query complexity dependency on the epiphany parameter β and on U.

EPICAL on 1D Threshold Classifiers. Take µX to be the uniform distribution over the interval X = [0, 1]. Our hypothesis space is the set of threshold classifiers H = {ha : a ∈ [0, 1]} where ha(x) = 1(x ≥ a). We choose h∗ = h_{1/2} and set the target classification error at ε = 0.05.

We illustrate epiphany with a single unknown region K = 1, U = U1. However, we contrast two shapes of U: in one set of experiments we set U = [0.4, 0.6], which contains the decision boundary 0.5. In this case, the active learner EPICAL must induce oracle epiphany in order to achieve ε risk. In another set of experiments U = [0.7, 0.9], where we expect the learner to be able to “bypass” the need for epiphany. Intuitively, this latter U could soon be excluded from the disagreement region.

For both U, we systematically vary the oracle epiphany parameter β ∈ {2^{-6}, 2^{-5}, . . . , 2^0}. A small β means epiphany is less likely per query, thus we expect the learner to spend more queries trying to induce epiphany in the case of U = [0.4, 0.6]. In contrast, β may not matter much in the case of U = [0.7, 0.9] since epiphany may not be required. Note that β = 2^0 = 1 reverts back to the standard active learning oracle, since epiphany always happens immediately. 
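For 1D thresholds the version space is an interval of candidate thresholds and the disagreement region equals that interval, so EPICAL can be simulated directly. The sketch below is ours, under those assumptions; the helper name epical_1d and the tuple encoding of U are illustrative:

```python
import random

def epical_1d(beta, U, eps=0.05, seed=0, max_queries=2000):
    # Sketch of EPICAL (Alg. 1) for thresholds h_a(x) = 1(x >= a) with target
    # a* = 1/2 and uniform mu_X on [0, 1]. The version space is the interval
    # [lo, hi] of thresholds consistent with the labels seen so far, so
    # mu_X(D) = hi - lo. The oracle abstains on the interval U until a
    # per-unique-query epiphany with probability beta (model of Section 2).
    rng = random.Random(seed)
    lo, hi = 0.0, 1.0
    known = False
    queries = 0
    while hi - lo > eps and queries < max_queries:
        x = rng.uniform(lo, hi)        # sample from mu_X restricted to D
        queries += 1
        if not known and U[0] <= x <= U[1]:
            if rng.random() < beta:
                known = True           # epiphany: U flips to known
            else:
                continue               # oracle answered with an abstention; no update
        y = 1 if x >= 0.5 else -1      # oracle labels with h*
        if y == 1:
            hi = min(hi, x)            # label +1 rules out thresholds above x
        else:
            lo = max(lo, x)            # label -1 rules out thresholds below x
    return queries, (lo + hi) / 2.0
```

Running this with U = [0.4, 0.6] and small β reproduces the qualitative behavior discussed below: queries are spent inducing epiphany when U straddles the boundary, while with U = [0.7, 0.9] the unknown region soon leaves the disagreement interval and β barely matters.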
We run each combination of β and U for 10,000 trials.

Figure 1: EPICAL results on 1D threshold classifiers. (a) Queries vs. β for U = [0.4, 0.6] and (b) for U = [0.7, 0.9], each comparing EPICAL with passive learning; (c) excess queries vs. 1/β for U = [0.4, 0.6].

The results are shown in Figure 1. As expected, (a) shows a clear dependency on β. This indicates that epiphany is necessary in the case U = [0.4, 0.6] for learning to be successful. In contrast, the dependence on β vanishes in (b) when U is shifted sufficiently away from the target threshold (and thus from later disagreement regions). The oracle need not reach epiphany for learning to happen. Note that (b) does not contradict the EPICAL query complexity analysis, since Theorem 1 is a worst-case bound that must hold true for all U.

To further clarify the role of β, note that the EPICAL query complexity bound predicts an additive term of O(1/β) on top of the standard CAL query complexities (i.e., both M̄ and MCAL). This term represents “excess queries” needed to induce epiphany. In Figure 1(c) we plot this excess against 1/β for U = [0.4, 0.6]. Excess is computed as the number of EPICAL queries minus the average number of queries for β = 1. Indeed, we see a near-linear relationship between excess queries and 1/β.

Finally, as a baseline we compare EPICAL to passive learning. In passive learning x1, x2, . . . are chosen randomly according to µX instead of adaptively. Note that passive learning here is also subject to oracle epiphany. That is, the labels yt are produced by the same oracle epiphany model, and some of them can be ⊥ initially. 
Our passive learning simply maintains a version space; if it encounters ⊥, it does not update the version space. All EPICAL results are better than passive learning.

Oracular-EPICAL on OCR. We consider the binary classification task of 5 vs. other digits on MNIST [LeCun et al., 1998]. This allows us to design the unknown regions {Uk} as certain other digits, making the experiments more interpretable. Furthermore, we can control how confusable the U digits are with "5" to observe the influence on oracle epiphany.

Although Alg. 2 is efficiently implementable with an ERM routine, it still requires two calls to a supervised learning algorithm on every new example. To scale it up, we implement an approximate version of Alg. 2 that uses online optimization in place of the ERM; more details are in Appendix E. While efficient in practice, this online algorithm may not retain Alg. 2's theoretical guarantees.

Figure 2: Oracular-EPICAL results on OCR. (a) U = "3"; (b) U = "1".

We use epiphany parameters β ∈ {1, 10^{-1}, 10^{-2}, 10^{-3}, 10^{-4}, 0}, K = 1, and U is either "3" or "1". By using β = 1 and β = 0, we include the boundary cases where the oracle is perfect or never has an epiphany. The two different U's correspond to two contrasting scenarios: "3" is among the "nearest" digits to "5", as measured by the binary classification error between "5" and every other single digit, while "1" is the farthest. The two U's are about the same size, each covering roughly 10% of the data. More details and experimental results with other choices of U can be found in Appendix E.

For each combination of β and U, we perform 100 random trials. In each trial, we run both the online version of Alg.
2 and online passive logistic regression (also subject to oracle epiphany) over a randomly permuted training set of 60,000 examples, and check the error of the online ERM on the 10,000 test examples every 10 queries, from 200 up to our query budget of 13,000. In each trial we record the smallest number of queries needed to achieve a test error of 4%. Fig. 2(a) and Fig. 2(b) show the median of this number over the 100 random trials, with error bars at the 25th and 75th quantiles. The effect of β on query complexity is dramatic for the near U = "3" but subdued for the far U = "1". In particular, for U = "3", small β's force active learning to query as many labels as passive learning; the flattening at 13,000 means no algorithm could achieve a 4% test error within our query budget. For U = "1", active learning is always much better than passive learning regardless of β. Again, this illustrates that both β and U affect the query complexity. As performance references, passive learning on the entire labeled training set achieves a test error of 2.6%, while predicting the majority class (non-5) has a test error of 8.9%.

6 Discussions

Our analysis reveals a worst-case O(1/β) term in query complexity due to the wait for epiphany, and we hypothesize Ω(K/β) to be the tight lower bound. This immediately raises the question: can we decouple active learning queries from epiphany induction? What if the learner can quickly induce epiphany by showing the oracle a screenful of unlabeled items at a time, without the oracle labeling them? This possibility is hinted at in empirical studies. For example, Kulesza et al. [2014] observed epiphanies resulting from seeing items.
Then there is a tradeoff between two learner actions toward the oracle: asking a query (getting a label, plus a small contribution toward epiphany) or showing several items (getting no labels, but a potentially large contribution toward epiphany). One must formalize the cost and benefit of this tradeoff. Of course, real human behaviors are even richer. Epiphanies may be reversible on certain queries, where the oracle begins to have doubts about her previous labeling. Extending our model under more relaxed assumptions is an interesting open question for future research.

Acknowledgments

This work is supported in part by NSF grants IIS-0953219, IIS-1623605, DGE-1545481, CCF-1423237, and by the University of Wisconsin-Madison Graduate School with funding from the Wisconsin Alumni Research Foundation.

References

Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd International Conference on Machine Learning, pages 65–72. ACM, 2006.

Alina Beygelzimer, John Langford, Tong Zhang, and Daniel J. Hsu. Agnostic active learning without constraints. In Advances in Neural Information Processing Systems, pages 199–207, 2010.

David Cohn, Les Atlas, and Richard Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

Sanjoy Dasgupta, Claire Monteleoni, and Daniel J. Hsu. A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems, pages 353–360, 2007.

Pinar Donmez and Jaime G. Carbonell. Proactive learning: cost-sensitive active learning with multiple imperfect oracles.
In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 619–628. ACM, 2008.

Ran El-Yaniv and Yair Wiener. On the foundations of noise-free selective classification. The Journal of Machine Learning Research, 11:1605–1641, 2010.

Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

Daniel J. Hsu. Algorithms for Active Learning. PhD thesis, University of California, San Diego, 2010.

Tzu-Kuo Huang, Alekh Agarwal, Daniel J. Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. In NIPS, pages 2737–2745, 2015.

Sham M. Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems 21, 2009.

Nikos Karampatziakis and John Langford. Online importance weight aware updates. In UAI, pages 392–399, 2011.

Todd Kulesza, Saleema Amershi, Rich Caruana, Danyel Fisher, and Denis Xavier Charles. Structured labeling for facilitating concept evolution in machine learning. In CHI, pages 3075–3084, 2014.

Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Edward Newell and Derek Ruths. How one microtask affects another. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, pages 3155–3166, 2016.

Advait Sarkar, Cecily Morrison, Jonas F. Dorn, Rishi Bedi, Saskia Steinheimer, Jacques Boisvert, Jessica Burggraaff, Marcus D'Souza, Peter Kontschieder, Samuel Rota Bulò, et al. Setwise comparison: Consistent, scalable, continuum labels for computer vision. In CHI, 2016.

Nihar Bhadresh Shah and Denny Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing.
In Advances in Neural Information Processing Systems, pages 1–9, 2015.

Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems, pages 442–450, 2014.