{"title": "Revisiting Perceptron: Efficient and Label-Optimal Learning of Halfspaces", "book": "Advances in Neural Information Processing Systems", "page_first": 1056, "page_last": 1066, "abstract": "It has been a long-standing problem to efficiently learn a halfspace using as few labels as possible in the presence of noise. In this work, we propose an efficient Perceptron-based algorithm for actively learning homogeneous halfspaces under the uniform distribution over the unit sphere. Under the bounded noise condition~\\cite{MN06}, where each label is flipped with probability at most $\\eta < \\frac 1 2$, our algorithm achieves a near-optimal label complexity of $\\tilde{O}\\left(\\frac{d}{(1-2\\eta)^2}\\ln\\frac{1}{\\epsilon}\\right)$ in time $\\tilde{O}\\left(\\frac{d^2}{\\epsilon(1-2\\eta)^3}\\right)$. Under the adversarial noise condition~\\cite{ABL14, KLS09, KKMS08}, where at most a $\\tilde \\Omega(\\epsilon)$ fraction of labels can be flipped, our algorithm achieves a near-optimal label complexity of $\\tilde{O}\\left(d\\ln\\frac{1}{\\epsilon}\\right)$ in time $\\tilde{O}\\left(\\frac{d^2}{\\epsilon}\\right)$. Furthermore, we show that our active learning algorithm can be converted to an efficient passive learning algorithm that has near-optimal sample complexities with respect to $\\epsilon$ and $d$.", "full_text": "Revisiting Perceptron: Efficient and Label-Optimal Learning of Halfspaces

Songbai Yan
UC San Diego
La Jolla, CA
yansongbai@ucsd.edu

Chicheng Zhang*
Microsoft Research
New York, NY
chicheng.zhang@microsoft.com

Abstract

It has been a long-standing problem to efficiently learn a halfspace using as few labels as possible in the presence of noise. In this work, we propose an efficient Perceptron-based algorithm for actively learning homogeneous halfspaces under the uniform distribution over the unit sphere.
Under the bounded noise condition [49], where each label is flipped with probability at most η < 1/2, our algorithm achieves a near-optimal label complexity of Õ(d/(1−2η)² · ln(1/ε)) in time Õ(d²/(ε(1−2η)³)). Under the adversarial noise condition [6, 45, 42], where at most an Ω̃(ε) fraction of labels can be flipped, our algorithm achieves a near-optimal label complexity of Õ(d · ln(1/ε)) in time Õ(d²/ε). Furthermore, we show that our active learning algorithm can be converted to an efficient passive learning algorithm that has near-optimal sample complexities with respect to ε and d.

1 Introduction

We study the problem of designing efficient noise-tolerant algorithms for actively learning homogeneous halfspaces in the streaming setting. We are given access to a data distribution from which we can draw unlabeled examples, and a noisy labeling oracle O that we can query for labels. The goal is to find a computationally efficient algorithm that learns a halfspace which best classifies the data while making as few queries to the labeling oracle as possible.

Active learning arises naturally in many machine learning applications where unlabeled examples are abundant and cheap, but labeling requires human effort and is expensive. For those applications, one natural question is whether we can learn an accurate classifier using as few labels as possible. Active learning addresses this question by allowing the learning algorithm to sequentially select examples to query for labels, avoiding label requests that are less informative or that can be inferred from previously-observed examples.

There has been a large body of work on the theory of active learning, showing sharp distribution-dependent label complexity bounds [21, 11, 34, 27, 35, 46, 60, 41].
However, most of these general active learning algorithms rely on solving empirical risk minimization problems, which are computationally hard in the presence of noise [5]. On the other hand, existing computationally efficient algorithms for learning halfspaces [17, 29, 42, 45, 6, 23, 7, 8] are not optimal in terms of label requirements. These algorithms have different degrees of noise tolerance (e.g. adversarial noise [6], malicious noise [43], random classification noise [3], bounded noise [49], etc.), and run in time polynomial in 1/ε and d. Some of them naturally exploit the utility of active learning [6, 7, 8], but they do not achieve the sharpest label complexity bounds, in contrast to those computationally-inefficient active learning algorithms [10, 9, 60].

Therefore, a natural question is: is there an active learning algorithm for halfspaces that is both computationally efficient and near-minimal in its label requirement? This has been posed as an open problem in [50]. In the realizable setting, [26, 10, 9, 56] give efficient algorithms that have the optimal label complexity of Õ(d ln(1/ε)) under some distributional assumptions. However, the challenge still remains open in the nonrealizable setting. It has been shown that learning halfspaces with agnostic noise is hard even when the unlabeled distribution is Gaussian [44].

* Work done while at UC San Diego.
2 We use Õ(f(·)) := O(f(·) ln f(·)), and Ω̃(f(·)) := Ω(f(·)/ln f(·)). We say f(·) = Θ̃(g(·)) if f(·) = Õ(g(·)) and f(·) = Ω̃(g(·)).

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Nonetheless, we give an affirmative answer to this question under two moderate noise settings: bounded noise and adversarial noise.

1.1 Our Results

We propose a Perceptron-based algorithm, ACTIVE-PERCEPTRON, for actively learning homogeneous halfspaces under the uniform distribution over the unit sphere. It works under two noise settings: bounded noise and adversarial noise. Our work answers an open question by [26] on whether Perceptron-based active learning algorithms can be modified to tolerate label noise.

In the η-bounded noise setting (also known as the Massart noise model [49]), the label of an example x ∈ R^d is generated by sign(u · x) for some underlying halfspace u, and flipped with probability η(x) ≤ η < 1/2. Our algorithm runs in time Õ(d²/(ε(1−2η)³)), and requires Õ(d/(1−2η)² · ln(1/ε)) labels. We show that this label complexity is nearly optimal by providing an almost matching information-theoretic lower bound of Ω(d/(1−2η)² · ln(1/ε)). Our time and label complexities substantially improve over the state of the art result of [8], which runs in time Õ(d^{O(1/(1−2η)^4)} · 1/ε) and requires Õ(d^{O(1/(1−2η)^4)} · ln(1/ε)) labels.

Our main theorem on learning under bounded noise is as follows:

Theorem 2 (Informal).
Suppose the labeling oracle O satisfies the η-bounded noise condition with respect to u. Then for ACTIVE-PERCEPTRON, with probability at least 1 − δ: (1) The output halfspace v is such that P[sign(v · X) ≠ sign(u · X)] ≤ ε; (2) The number of label queries to oracle O is at most Õ(d/(1−2η)² · ln(1/ε)); (3) The number of unlabeled examples drawn is at most Õ(d/(ε(1−2η)³)); (4) The algorithm runs in time Õ(d²/(ε(1−2η)³)).

In addition, we show that our algorithm also works in a more challenging setting, the ν-adversarial noise setting [6, 42, 45].³ In this setting, the examples still come i.i.d. from a distribution, but the only assumption on the labels is that P[sign(u · X) ≠ Y] ≤ ν for some halfspace u. Under this assumption, the Bayes classifier may not be a halfspace. We show that our algorithm achieves an error of ε while tolerating a noise level of ν = Ω(ε/(ln d + ln ln(1/ε))). It runs in time Õ(d²/ε), and requires only Õ(d · ln(1/ε)) labels, which is near-optimal. ACTIVE-PERCEPTRON has a label complexity bound that matches the state of the art result of [39]⁴, while having a lower running time.

Our main theorem on learning under adversarial noise is as follows:

Theorem 3 (Informal). Suppose the labeling oracle O satisfies the ν-adversarial noise condition with respect to u, where ν ≤ Θ(ε/(ln d + ln(1/δ) + ln ln(1/ε))).
Then for ACTIVE-PERCEPTRON, with probability at least 1 − δ: (1) The output halfspace v is such that P[sign(v · X) ≠ sign(u · X)] ≤ ε; (2) The number of label queries to oracle O is at most Õ(d · ln(1/ε)); (3) The number of unlabeled examples drawn is at most Õ(d/ε); (4) The algorithm runs in time Õ(d²/ε).

3 Note that the adversarial noise model is not the same as that in online learning [18], where each example can be chosen adversarially.
4 The label complexity bound is implicit in [39] by a refined analysis of the algorithm of [6] (see their Lemma 8 for details).

Table 1: A comparison of algorithms for active learning of halfspaces under the uniform distribution, in the η-bounded noise model.

Algorithm | Label Complexity | Time Complexity
[10, 9, 60] | Õ(d/(1−2η)² · ln(1/ε)) | superpoly(d, 1/ε)⁵
[8] | Õ(d^{O(1/(1−2η)^4)} · ln(1/ε)) | Õ(d^{O(1/(1−2η)^4)} · 1/ε)
Our Work | Õ(d/(1−2η)² · ln(1/ε)) | Õ(d²/(ε(1−2η)³))

Table 2: A comparison of algorithms for active learning of halfspaces under the uniform distribution, in the ν-adversarial noise model.

Algorithm | Noise Tolerance | Label Complexity | Time Complexity
[60] | ν = Ω(ε) | Õ(d ln(1/ε)) | superpoly(d, 1/ε)
[39] | ν = Ω(ε) | Õ(d ln(1/ε)) | poly(d, 1/ε)
Our Work | ν = Ω(ε/(ln d + ln ln(1/ε))) | Õ(d ln(1/ε)) | Õ(d² · 1/ε)

Throughout the paper, ACTIVE-PERCEPTRON is shown to work if the unlabeled examples are drawn uniformly from the unit sphere.
The algorithm and analysis can be easily generalized to any spherically symmetric distribution, for example isotropic Gaussian distributions. They can also be generalized to distributions whose densities with respect to the uniform distribution are bounded away from 0.

In addition, we show in Section 6 that ACTIVE-PERCEPTRON can be converted to a passive learning algorithm, PASSIVE-PERCEPTRON, that has near-optimal sample complexities with respect to ε and d under the two noise settings. We defer the discussion to the end of the paper.

2 Related Work

Active Learning. The recent decades have seen much success in both the theory and practice of active learning; see the excellent surveys by [54, 37, 25]. On the theory side, many label-efficient active learning algorithms have been proposed and analyzed. An incomplete list includes [21, 11, 34, 27, 35, 46, 60, 41]. Most of these algorithms rely on solving empirical risk minimization problems, which are computationally hard in the presence of noise [5].

Computational Hardness of Learning Halfspaces. Efficient learning of halfspaces is one of the central problems in machine learning [22]. In the realizable case, it is well known that linear programming will find a consistent hypothesis over the data efficiently. In the nonrealizable setting, however, the problem is much more challenging.

A series of papers have shown the hardness of learning halfspaces with agnostic noise [5, 30, 33, 44, 23]. The state of the art result [23] shows that under standard complexity-theoretic assumptions, there exists a data distribution such that the best linear classifier has error o(1), but no polynomial-time algorithm can achieve an error of at most 1/2 − 1/d^c for every c > 0, even with improper learning.
[44] shows that under standard assumptions, even if the unlabeled distribution is Gaussian, any agnostic halfspace learning algorithm must run in time (1/ε)^{Ω(ln d)} to achieve an excess error of ε. These results indicate that, to have nontrivial guarantees on learning halfspaces with noise in polynomial time, one has to make additional assumptions on the data distribution over instances and labels.

Efficient Active Learning of Halfspaces. Despite considerable efforts, there are only a few halfspace learning algorithms that are both computationally-efficient and label-efficient, even under the uniform distribution. In the realizable setting, [26, 10, 9] propose computationally efficient active learning algorithms which have an optimal label complexity of Õ(d ln(1/ε)).

Since learning halfspaces in the general agnostic setting is believed to be hard, it is natural to consider algorithms that work under more moderate noise conditions. Under the bounded noise setting [49], the only known algorithms that are both label-efficient and computationally-efficient are [7, 8]. [7] uses a margin-based framework which queries the labels of examples near the decision boundary. To achieve computational efficiency, it adaptively chooses a sequence of hinge loss minimization problems to optimize, as opposed to directly optimizing the 0-1 loss. It works only when the label flipping probability upper bound η is small (η ≤ 1.8 × 10⁻⁶). [8] improves over [7] by adapting a polynomial regression procedure into the margin-based framework.

5 The algorithm needs to minimize 0-1 loss, the best known method for which requires superpolynomial time.
It works for any η < 1/2, but its label complexity is Õ(d^{O(1/(1−2η)^4)} · ln(1/ε)), which is far worse than the information-theoretic lower bound Ω(d/(1−2η)² · ln(1/ε)). Recently [20] gives an efficient algorithm with a near-optimal label complexity under the membership query model, where the learner can query on synthesized points. In contrast, in our stream-based model the learner can only query on points drawn from the data distribution. We note that learning in the stream-based model is harder than in the membership query model, and it is unclear how to transform the DC algorithm in [20] into a computationally efficient stream-based active learning algorithm.

Under the more challenging ν-adversarial noise setting, [6] proposes a margin-based algorithm that reduces the problem to a sequence of hinge loss minimization problems. Their algorithm achieves an error of ε in polynomial time when ν = Ω(ε), but requires Õ(d² ln(1/ε)) labels. Later, [39] performs a refined analysis to achieve a near-optimal label complexity of Õ(d ln(1/ε)), but the time complexity of the algorithm is still an unspecified high-order polynomial.

Tables 1 and 2 present comparisons between our results and the results most closely related to ours in the literature. Due to space limitations, discussions of additional related work are deferred to Appendix A.

3 Definitions and Settings

We consider learning homogeneous halfspaces under the uniform distribution. The instance space X is the unit sphere in R^d, which we denote by S^{d−1} := {x ∈ R^d : ||x|| = 1}. We assume d ≥ 3 throughout this paper. The label space Y = {+1, −1}. We assume all data points (x, y) are drawn i.i.d. from an underlying distribution D over X × Y. We denote by D_X the marginal of D over X (which is uniform over S^{d−1}), and by D_{Y|X} the conditional distribution of Y given X.
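This data model is easy to simulate. The sketch below (the choices of u, d, and η are illustrative, not from the paper) draws uniform examples from S^{d−1} by normalizing Gaussian vectors, and labels them with an η-bounded-noise oracle that always flips with probability exactly η, the worst case the condition allows:

```python
import numpy as np

def sample_sphere(n, d, rng):
    """Draw n i.i.d. points uniformly from S^{d-1} by normalizing
    standard Gaussian vectors (rotational invariance gives uniformity)."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def bounded_noise_labels(x, u, eta, rng):
    """Labels sign(u . x), each flipped independently with probability eta:
    the extreme case of the eta-bounded noise condition."""
    clean = np.sign(x @ u)
    flip = rng.random(len(x)) < eta
    return np.where(flip, -clean, clean)

rng = np.random.default_rng(0)
d, eta = 5, 0.1                       # illustrative values
u = np.eye(d)[0]                      # underlying halfspace u = e_1
X = sample_sphere(10000, d, rng)
Y = bounded_noise_labels(X, u, eta, rng)
err_u = np.mean(np.sign(X @ u) != Y)  # empirical error of h_u, concentrates near eta
```

Here err_u concentrates around η, illustrating that h_u is the Bayes classifier under bounded noise.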
Our algorithm is allowed to draw unlabeled examples x ∈ X from D_X, and to make queries to a labeling oracle O for labels. Upon query x, O returns a label y drawn from D_{Y|X=x}. The hypothesis class of interest is the set of homogeneous halfspaces H := {h_w(x) = sign(w · x) | w ∈ S^{d−1}}. For any hypothesis h ∈ H, we define its error rate err(h) := P_D[h(X) ≠ Y]. We will drop the subscript D in P_D when it is clear from the context. Given a dataset S = {(X_1, Y_1), . . . , (X_m, Y_m)}, we define the empirical error rate of h over S as err_S(h) := (1/m) Σ_{i=1}^m 1{h(X_i) ≠ Y_i}.

Definition 1 (Bounded Noise [49]). We say that the labeling oracle O satisfies the η-bounded noise condition for some η ∈ [0, 1/2) with respect to u, if for any x, P[Y ≠ sign(u · x) | X = x] ≤ η.

It can be seen that under the η-bounded noise condition, h_u is the Bayes classifier.

Definition 2 (Adversarial Noise [6]). We say that the labeling oracle O satisfies the ν-adversarial noise condition for some ν ∈ [0, 1] with respect to u, if P[Y ≠ sign(u · X)] ≤ ν.

For two unit vectors v_1, v_2, denote by θ(v_1, v_2) = arccos(v_1 · v_2) the angle between them. The following lemma gives relationships between errors and angles (see also Lemma 1 in [8]).

Lemma 1.
For any v_1, v_2 ∈ S^{d−1}, |err(h_{v_1}) − err(h_{v_2})| ≤ P[h_{v_1}(X) ≠ h_{v_2}(X)] = θ(v_1, v_2)/π. Additionally, if the labeling oracle satisfies the η-bounded noise condition with respect to u, then for any vector v, |err(h_v) − err(h_u)| ≥ (1 − 2η) P[h_v(X) ≠ h_u(X)] = ((1 − 2η)/π) θ(v, u).

Given access to unlabeled examples drawn from D_X and a labeling oracle O, our goal is to find a polynomial-time algorithm A such that with probability at least 1 − δ, A outputs a halfspace h_v ∈ H with P[sign(v · X) ≠ sign(u · X)] ≤ ε, for some target accuracy ε and confidence δ. (By Lemma 1, this guarantees that the excess error of h_v is at most ε, namely err(h_v) − err(h_u) ≤ ε.) The desired algorithm should make as few queries to the labeling oracle O as possible.

We say an algorithm A achieves a label complexity of Λ(ε, δ) if, for any target halfspace h_u ∈ H, with probability at least 1 − δ, A outputs a halfspace h_v ∈ H such that err(h_v) ≤ err(h_u) + ε, and requests at most Λ(ε, δ) labels from oracle O.

4 Main Algorithm

Our main algorithm, ACTIVE-PERCEPTRON (Algorithm 1), works in epochs. It works under the bounded and the adversarial noise models, if its sample schedule {m_k} and band width {b_k} are set appropriately with respect to each noise model. At the beginning of each epoch k, it assumes an upper bound of π/2^k on θ(v_{k−1}, u), the angle between the current iterate v_{k−1} and the underlying halfspace u. As we will see, this can be shown to hold with high probability inductively.
Then, it calls procedure MODIFIED-PERCEPTRON (Algorithm 2) to find a new iterate v_k, which can be shown to have an angle with u of at most π/2^{k+1} with high probability. The algorithm ends when a total of k_0 = ⌈log₂(1/ε)⌉ epochs have passed.

For simplicity, we assume for the rest of the paper that the angle between the initial halfspace v_0 and the underlying halfspace u is acute, that is, θ(v_0, u) ≤ π/2; Appendix F shows that this assumption can be removed with a constant overhead in terms of label and time complexities.

Algorithm 1 ACTIVE-PERCEPTRON
Input: Labeling oracle O, initial halfspace v_0, target error ε, confidence δ, sample schedule {m_k}, band width {b_k}.
Output: Learned halfspace v.
1: Let k_0 = ⌈log₂(1/ε)⌉.
2: for k = 1, 2, . . . , k_0 do
3:   v_k ← MODIFIED-PERCEPTRON(O, v_{k−1}, π/2^k, δ/(k(k+1)), m_k, b_k).
4: end for
5: return v_{k_0}.

Procedure MODIFIED-PERCEPTRON (Algorithm 2) is the core component of ACTIVE-PERCEPTRON. It sequentially performs a modified Perceptron update rule on the selected new examples (x_t, y_t) [51, 17, 26]:

w_{t+1} ← w_t − 2 · 1{y_t w_t · x_t < 0} (w_t · x_t) · x_t    (1)

Define θ_t := θ(w_t, u). Update rule (1) implies the following relationship between θ_{t+1} and θ_t (see Lemma 8 in Appendix E for its proof):

cos θ_{t+1} − cos θ_t = −2 · 1{y_t w_t · x_t < 0} (w_t · x_t) · (u · x_t)    (2)

This motivates us to take cos θ_t as our measure of progress; we would like to drive cos θ_t up to 1 (so that θ_t goes down to 0) as fast as possible.

To this end, MODIFIED-PERCEPTRON samples new points x_t under time-varying distributions D_{X|R_t} and queries for their labels, where R_t = {x ∈ S^{d−1} : b/2 ≤ w_t · x ≤ b} is a band inside the unit sphere. The rationale behind the choice of R_t is twofold:
1. We set R_t to have a probability mass of Ω̃(ε), so that the time complexity of rejection sampling is at most Õ(1/ε) per example. Moreover, in the adversarial noise setting, we set R_t large enough to dominate the noise of magnitude ν = Ω̃(ε).

2. Unlike the active Perceptron algorithm in [26] or other margin-based approaches (for example [55, 10]), where examples with small margin are queried, we query the labels of examples with margin in the range [b/2, b]. From a technical perspective, this ensures that θ_t decreases by a decent amount in expectation (see Lemmas 9 and 10 for details).

Following the insight of [32], we remark that the modified Perceptron update (1) on distribution D_{X|R_t} can be alternatively viewed as performing stochastic gradient descent on a special non-convex loss function ℓ(w, (x, y)) = min(1, max(0, −1 − (2/b) y w · x)). It is an interesting open question whether optimizing this new loss function can lead to improved empirical results for learning halfspaces.

Algorithm 2 MODIFIED-PERCEPTRON
Input: Labeling oracle O, initial halfspace w_0, angle upper bound θ, confidence δ, number of iterations m, band width b.
Output: Improved halfspace w_m.
1: for t = 0, 1, 2, . . . , m − 1 do
2:   Define region R_t = {x ∈ S^{d−1} : b/2 ≤ w_t · x ≤ b}.
3:   Rejection sample x_t ∼ D_{X|R_t}. In other words, draw x_t from D_X until x_t is in R_t. Query O for its label y_t.
4:   w_{t+1} ← w_t − 2 · 1{y_t w_t · x_t < 0} · (w_t · x_t) · x_t.
5: end for
6: return w_m.

5 Performance Guarantees

We show that ACTIVE-PERCEPTRON works in the bounded and the adversarial noise models, achieving computational efficiency and near-optimal label complexities.
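For concreteness, Algorithms 1 and 2 can be sketched in a few lines of code. This is a simplified simulation with illustrative sample schedules and band widths, not the precise parameter settings of the theorems below:

```python
import numpy as np

def modified_perceptron(oracle, draw_x, w0, m, b, rng):
    """Algorithm 2 sketch: rejection-sample x_t with margin in [b/2, b],
    query its label, and apply the reflection update (1)."""
    w = w0 / np.linalg.norm(w0)
    for _ in range(m):
        while True:                 # rejection sampling from D_{X|R_t}
            x = draw_x(rng)
            margin = w @ x
            if b / 2 <= margin <= b:
                break
        y = oracle(x, rng)
        if y * margin < 0:          # mistake: reflect w; note ||w|| stays 1
            w = w - 2 * margin * x
    return w

def active_perceptron(oracle, draw_x, v0, eps, ms, bs, rng):
    """Algorithm 1 sketch: run MODIFIED-PERCEPTRON for k0 epochs,
    halving the target angle upper bound each epoch."""
    v = v0
    k0 = int(np.ceil(np.log2(1 / eps)))
    for k in range(k0):
        v = modified_perceptron(oracle, draw_x, v, ms[k], bs[k], rng)
    return v

# Tiny realizable demo: d = 3, u = e_1, noiseless oracle.
rng = np.random.default_rng(1)
d = 3
u = np.eye(d)[0]
draw_x = lambda rng: (g := rng.standard_normal(d)) / np.linalg.norm(g)
oracle = lambda x, rng: 1.0 if u @ x >= 0 else -1.0
v0 = np.array([np.cos(1.2), np.sin(1.2), 0.0])   # theta(v0, u) = 1.2 < pi/2
v = active_perceptron(oracle, draw_x, v0, 0.25,
                      ms=[100, 100], bs=[0.25, 0.125], rng=rng)
```

In the realizable case, every update is a reflection that strictly increases cos θ_t (since the band only contains positive-margin points, a mistake means u · x_t < 0 < w_t · x_t), so θ(v, u) never exceeds θ(v_0, u); with noise, the expectation-level progress is exactly what Lemmas 2 and 3 below quantify.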
To this end, we first give a lower bound on the label complexity under bounded noise, and then give computational and label complexity upper bounds under the two noise conditions respectively. We defer all proofs to the Appendix.

5.1 A Lower Bound under Bounded Noise

We first present an information-theoretic lower bound on the label complexity in the bounded noise setting under the uniform distribution. This extends the distribution-free lower bounds of [53, 37], and generalizes the realizable-case lower bound of [47] to the bounded noise setting. Our lower bound can also be viewed as an extension of Theorem 3 of [59]; specifically, it addresses the hardness under the α-Tsybakov noise condition where α = 0 (while Theorem 3 of [59] provides lower bounds for α ∈ (0, 1)).

Theorem 1. For any d > 4, 0 ≤ η < 1/2, 0 < ε ≤ 1/(4π), 0 < δ ≤ 1/4, and any active learning algorithm A, there is a u ∈ S^{d−1} and a labeling oracle O that satisfies the η-bounded noise condition with respect to u, such that if, with probability at least 1 − δ, A makes at most n queries of labels to O and outputs v ∈ S^{d−1} such that P[sign(v · X) ≠ sign(u · X)] ≤ ε, then n ≥ Ω(d log(1/ε)/(1−2η)² + η log(1/δ)/(1−2η)²).

5.2 Bounded Noise

We establish Theorem 2 in the bounded noise setting. The theorem implies that, with appropriate settings of input parameters, ACTIVE-PERCEPTRON efficiently learns a halfspace of excess error at most ε with probability at least 1 − δ, under the assumption that D_X is uniform over the unit sphere and O has bounded noise. In addition, it queries at most Õ(d/(1−2η)² · ln(1/ε)) labels.
This matches the lower bound of Theorem 1, and improves over the state of the art result of [8], where a label complexity of Õ(d^{O(1/(1−2η)^4)} · ln(1/ε)) is shown using a different algorithm.

The proof and the precise settings of the parameters (m_k and b_k) are given in Appendix C.

Theorem 2 (ACTIVE-PERCEPTRON under Bounded Noise). Suppose Algorithm 1 has inputs labeling oracle O that satisfies the η-bounded noise condition with respect to halfspace u, initial halfspace v_0 such that θ(v_0, u) ∈ [0, π/2], target error ε, confidence δ, sample schedule {m_k} where m_k = Θ(d/(1−2η)² · (ln(d/(1−2η)²) + ln(k/δ))), and band width {b_k} where b_k = Θ(2^{−k}(1−2η)/(√d ln(k m_k/δ))). Then with probability at least 1 − δ:

1. The output halfspace v is such that P[sign(v · X) ≠ sign(u · X)] ≤ ε.
2. The number of label queries is O(d/(1−2η)² · ln(1/ε) · (ln(d/(1−2η)²) + ln(1/δ) + ln ln(1/ε))).
3. The number of unlabeled examples drawn is O(d/(1−2η)³ · (ln(d/(1−2η)²) + ln(1/δ) + ln ln(1/ε))² · (1/ε) ln(1/ε)).
4. The algorithm runs in time O(d²/(1−2η)³ · (ln(d/(1−2η)²) + ln(1/δ) + ln ln(1/ε))² · (1/ε) ln(1/ε)).

The theorem follows from Lemma 2 below. The key ingredient of the lemma is a delicate analysis of the dynamics of the angles {θ_t}_{t=0}^m, where θ_t = θ(w_t, u) is the angle between the iterate w_t and the halfspace u. Since x_t is randomly sampled and y_t is noisy, we are only able to show that θ_t decreases by a decent amount in expectation.
To remedy the stochastic fluctuations, we apply martingale concentration inequalities to carefully control the upper envelope of the sequence {θ_t}_{t=0}^m.

Lemma 2 (MODIFIED-PERCEPTRON under Bounded Noise). Suppose Algorithm 2 has inputs labeling oracle O that satisfies the η-bounded noise condition with respect to halfspace u, initial halfspace w_0 and angle upper bound θ ∈ (0, π/2] such that θ(w_0, u) ≤ θ, confidence δ, number of iterations m = Θ(d/(1−2η)² · (ln(d/(1−2η)²) + ln(1/δ))), and band width b = Θ(θ(1−2η)/(√d ln(m/δ))). Then with probability at least 1 − δ:

1. The output halfspace w_m is such that θ(w_m, u) ≤ θ/2.
2. The number of label queries is O(d/(1−2η)² · (ln(d/(1−2η)²) + ln(1/δ))).
3. The number of unlabeled examples drawn is O(d/(1−2η)³ · (ln(d/(1−2η)²) + ln(1/δ))² · (1/θ)).
4. The algorithm runs in time O(d²/(1−2η)³ · (ln(d/(1−2η)²) + ln(1/δ))² · (1/θ)).

5.3 Adversarial Noise

We establish Theorem 3 in the adversarial noise setting. The theorem implies that, with appropriate settings of input parameters, ACTIVE-PERCEPTRON efficiently learns a halfspace of excess error at most ε with probability at least 1 − δ, under the assumption that D_X is uniform over the unit sphere and O has an adversarial noise of magnitude ν = Ω(ε/(ln d + ln ln(1/ε))). In addition, it queries at most Õ(d ln(1/ε)) labels. Our label complexity bound is information-theoretically optimal [47], and matches the state of the art result of [39].
The benefit of our approach is computational: it has a running time of Õ(d²/ε), while [39] needs to solve a convex optimization problem whose running time is a polynomial in d and 1/ε of unspecified degree.

The proof and the precise settings of the parameters (m_k and b_k) are given in Appendix C.

Theorem 3 (ACTIVE-PERCEPTRON under Adversarial Noise). Suppose Algorithm 1 has inputs labeling oracle O that satisfies the ν-adversarial noise condition with respect to halfspace u, initial halfspace v_0 such that θ(v_0, u) ≤ π/2, target error ε, confidence δ, sample schedule {m_k} where m_k = Θ(d(ln d + ln(k/δ))), and band width {b_k} where b_k = Θ(2^{−k}/(√d ln(k m_k/δ))). Additionally, ν ≤ Ω(ε/(ln d + ln(1/δ) + ln ln(1/ε))). Then with probability at least 1 − δ:

1. The output halfspace v is such that P[sign(v · X) ≠ sign(u · X)] ≤ ε.
2. The number of label queries is O(d · ln(1/ε) · (ln d + ln(1/δ) + ln ln(1/ε))).
3. The number of unlabeled examples drawn is O(d · (ln d + ln(1/δ) + ln ln(1/ε))² · (1/ε) ln(1/ε)).
4. The algorithm runs in time O(d² · (ln d + ln(1/δ) + ln ln(1/ε))² · (1/ε) ln(1/ε)).

The theorem follows from Lemma 3 below, whose proof is similar to that of Lemma 2.

Lemma 3 (MODIFIED-PERCEPTRON under Adversarial Noise).
Suppose Algorithm 2 has inputs labeling oracle O that satisfies the ν-adversarial noise condition with respect to halfspace u, initial halfspace w_0 and angle upper bound θ ∈ (0, π/2] such that θ(w_0, u) ≤ θ, confidence δ, number of iterations m = Θ(d(ln d + ln(1/δ))), and band width b = Θ(θ/(√d ln(m/δ))). Additionally, ν ≤ Ω(θ/ln(m/δ)). Then with probability at least 1 − δ:

1. The output halfspace w_m is such that θ(w_m, u) ≤ θ/2.
2. The number of label queries is O(d · (ln d + ln(1/δ))).
3. The number of unlabeled examples drawn is O(d · (ln d + ln(1/δ))² · (1/θ)).
4. The algorithm runs in time O(d² · (ln d + ln(1/δ))² · (1/θ)).

6 Implications to Passive Learning

ACTIVE-PERCEPTRON can be converted to a passive learning algorithm, PASSIVE-PERCEPTRON, for learning homogeneous halfspaces under the uniform distribution over the unit sphere. PASSIVE-PERCEPTRON has PAC sample complexities close to the lower bounds under the two noise models. We give a formal description of PASSIVE-PERCEPTRON in Appendix B. We give its formal guarantees in the corollaries below, which are immediate consequences of Theorems 2 and 3.

In the η-bounded noise model, the sample complexity of PASSIVE-PERCEPTRON improves over the state of the art result of [8], where a sample complexity of Õ(d^{O(1/(1−2η)^4)}/ε) is obtained. The bound has the same dependency on ε and d as the minimax upper bound of Θ̃(d/(ε(1−2η))) by [49], which is achieved by a computationally inefficient ERM algorithm.

Corollary 1 (PASSIVE-PERCEPTRON under Bounded Noise).
Suppose PASSIVE-PERCEPTRON has inputs distribution $D$ that satisfies the $\eta$-bounded noise condition with respect to $u$, initial halfspace $v_0$, target error $\epsilon$, confidence $\delta$, sample schedule $\{m_k\}$ where $m_k = \Theta\big(\frac{d}{(1-2\eta)^2}\big(\ln\frac{d}{(1-2\eta)^2} + \ln\frac{k}{\delta}\big)\big)$, and band width $\{b_k\}$ where $b_k = \Theta\big(\frac{2^{-k}(1-2\eta)}{\sqrt{d}\ln(km_k/\delta)}\big)$. Then with probability at least $1-\delta$: (1) the output halfspace $v$ is such that $\operatorname{err}(h_v) \leq \operatorname{err}(h_u) + \epsilon$; (2) the number of labeled examples drawn is $\tilde{O}\big(\frac{d}{\epsilon(1-2\eta)^3}\big)$; (3) the algorithm runs in time $\tilde{O}\big(\frac{d^2}{\epsilon(1-2\eta)^3}\big)$.

In the $\nu$-adversarial noise model, the sample complexity of PASSIVE-PERCEPTRON matches the minimax optimal sample complexity upper bound of $\tilde{\Theta}(\frac{d}{\epsilon})$ obtained in [39]. As in active learning, our algorithm has a faster running time than [39].

Corollary 2 (PASSIVE-PERCEPTRON under Adversarial Noise). Suppose PASSIVE-PERCEPTRON has inputs distribution $D$ that satisfies the $\nu$-adversarial noise condition with respect to $u$, initial halfspace $v_0$, target error $\epsilon$, confidence $\delta$, sample schedule $\{m_k\}$ where $m_k = \Theta\big(d(\ln d + \ln\frac{k}{\delta})\big)$, and band width $\{b_k\}$ where $b_k = \Theta\big(\frac{2^{-k}}{\sqrt{d}\ln(km_k/\delta)}\big)$. Furthermore $\nu = \Omega\big(\frac{\epsilon}{\ln\ln\frac{1}{\epsilon}+\ln d}\big)$. Then with probability at least $1-\delta$: (1) the output halfspace $v$ is such that $\operatorname{err}(h_v) \leq \operatorname{err}(h_u) + \epsilon$; (2) the number of labeled examples drawn is $\tilde{O}(\frac{d}{\epsilon})$; (3) the algorithm runs in time $\tilde{O}(\frac{d^2}{\epsilon})$.

Tables 3 and 4 present comparisons between our results and the results most closely related to ours.

Acknowledgments.
The authors thank Kamalika Chaudhuri for help and support, Hongyang Zhang for thought-provoking initial conversations, Jiapeng Zhang for helpful discussions, and the anonymous reviewers for their insightful feedback. Much of this work is supported by NSF IIS-1167157 and 1162581.

Table 3: A comparison of algorithms for PAC learning halfspaces under the uniform distribution, in the $\eta$-bounded noise model.

Algorithm   Sample Complexity                                            Time Complexity
[8]         $\tilde{O}\big(d^{O(1/(1-2\eta)^4)} \cdot \frac{1}{\epsilon}\big)$   $\tilde{O}\big(d^{O(1/(1-2\eta)^4)}\big)$
ERM [49]    $\tilde{O}\big(\frac{d}{(1-2\eta)\epsilon}\big)$             superpoly$(d, \frac{1}{\epsilon})$
Our Work    $\tilde{O}\big(\frac{d}{(1-2\eta)^3\epsilon}\big)$           $\tilde{O}\big(\frac{d^2}{(1-2\eta)^3} \cdot \frac{1}{\epsilon}\big)$

Table 4: A comparison of algorithms for PAC learning halfspaces under the uniform distribution, in the $\nu$-adversarial noise model where $\nu = \Omega\big(\frac{\epsilon}{\ln\ln\frac{1}{\epsilon}+\ln d}\big)$.

Algorithm   Sample Complexity                 Time Complexity
[39]        $\tilde{O}(\frac{d}{\epsilon})$   poly$(d, \frac{1}{\epsilon})$
ERM [57]    $\tilde{O}(\frac{d}{\epsilon})$   superpoly$(d, \frac{1}{\epsilon})$
Our Work    $\tilde{O}(\frac{d}{\epsilon})$   $\tilde{O}(\frac{d^2}{\epsilon})$

References

[1] Alekh Agarwal. Selective sampling algorithms for cost-sensitive multiclass prediction. ICML (3), 28:1220–1228, 2013.

[2] Nir Ailon, Ron Begleiter, and Esther Ezra. Active learning using smooth relative regret approximations with applications. Journal of Machine Learning Research, 15(1):885–920, 2014.

[3] Dana Angluin and Philip Laird. Learning from noisy examples. Machine Learning, 2(4):343–370, 1988.

[4] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.

[5] Sanjeev Arora, László Babai, Jacques Stern, and Z. Sweedyk.
The hardness of approximate optima in lattices, codes, and systems of linear equations. In 34th Annual Symposium on Foundations of Computer Science, pages 724–733. IEEE, 1993.

[6] Pranjal Awasthi, Maria Florina Balcan, and Philip M. Long. The power of localization for efficiently learning linear separators with noise. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing, pages 449–458. ACM, 2014.

[7] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Ruth Urner. Efficient learning of linear separators under bounded noise. In COLT, pages 167–190, 2015.

[8] Pranjal Awasthi, Maria-Florina Balcan, Nika Haghtalab, and Hongyang Zhang. Learning and 1-bit compressed sensing under asymmetric noise. In Proceedings of the 29th Conference on Learning Theory, COLT 2016, 2016.

[9] M.-F. Balcan and P. M. Long. Active and passive learning of linear separators under log-concave distributions. In COLT, 2013.

[10] M.-F. Balcan, A. Z. Broder, and T. Zhang. Margin based active learning. In COLT, 2007.

[11] M.-F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. J. Comput. Syst. Sci., 75(1):78–89, 2009.

[12] Maria-Florina Balcan and Vitaly Feldman. Statistical active learning algorithms. In NIPS, pages 1295–1303, 2013.

[13] Maria-Florina Balcan and Hongyang Zhang. S-concave distributions: Towards broader distributions for noise-tolerant and sample-efficient learning algorithms. arXiv preprint arXiv:1703.07758, 2017.

[14] Maria-Florina Balcan, Steve Hanneke, and Jennifer Wortman Vaughan. The true sample complexity of active learning. Machine Learning, 80(2-3):111–139, 2010.

[15] A. Beygelzimer, D. Hsu, J. Langford, and T. Zhang. Agnostic active learning without constraints.
In NIPS, 2010.

[16] Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. Importance weighted active learning. In Proceedings of the Twenty-Sixth International Conference on Machine Learning, 2009.

[17] Avrim Blum, Alan M. Frieze, Ravi Kannan, and Santosh Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22(1/2):35–52, 1998.

[18] Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.

[19] Nicolò Cesa-Bianchi, Claudio Gentile, and Francesco Orabona. Robust bounds for classification via selective sampling. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, Montreal, Quebec, Canada, June 14-18, 2009, pages 121–128, 2009.

[20] Lin Chen, Hamed Hassani, and Amin Karbasi. Near-optimal active learning of halfspaces via query synthesis in the noisy setting. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[21] David A. Cohn, Les E. Atlas, and Richard E. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.

[22] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, 2000.

[23] Amit Daniely. Complexity theoretic limitations on learning halfspaces. arXiv preprint arXiv:1505.05800, 2015.

[24] S. Dasgupta. Coarse sample complexity bounds for active learning. In NIPS, 2005.

[25] Sanjoy Dasgupta. Two faces of active learning. Theoretical Computer Science, 412(19):1767–1781, 2011.

[26] Sanjoy Dasgupta, Adam Tauman Kalai, and Claire Monteleoni. Analysis of perceptron-based active learning. In Learning Theory, 18th Annual Conference on Learning Theory, COLT 2005, Bertinoro, Italy, June 27-30, 2005, Proceedings, pages 249–263, 2005.

[27] Sanjoy Dasgupta, Daniel Hsu, and Claire Monteleoni.
A general agnostic active learning algorithm. In Advances in Neural Information Processing Systems 20, 2007.

[28] Ofer Dekel, Claudio Gentile, and Karthik Sridharan. Selective sampling and active learning from single and multiple teachers. Journal of Machine Learning Research, 13(Sep):2655–2697, 2012.

[29] John Dunagan and Santosh Vempala. A simple polynomial-time rescaling algorithm for solving linear programs. In Proceedings of the Thirty-Sixth Annual ACM Symposium on Theory of Computing, pages 315–320. ACM, 2004.

[30] Vitaly Feldman, Parikshit Gopalan, Subhash Khot, and Ashok Kumar Ponnuswami. New results for learning noisy parities and halfspaces. In 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2006), pages 563–574. IEEE, 2006.

[31] Y. Freund, H. S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2-3):133–168, 1997.

[32] Andrew Guillory, Erick Chastain, and Jeff Bilmes. Active learning as non-convex optimization. In International Conference on Artificial Intelligence and Statistics, pages 201–208, 2009.

[33] Venkatesan Guruswami and Prasad Raghavendra. Hardness of learning halfspaces with noise. SIAM Journal on Computing, 39(2):742–765, 2009.

[34] S. Hanneke. A bound on the label complexity of agnostic active learning. In ICML, 2007.

[35] S. Hanneke. Theoretical Foundations of Active Learning. PhD thesis, Carnegie Mellon University, 2009.

[36] Steve Hanneke. Rates of convergence in active learning. The Annals of Statistics, 39(1):333–361, 2011.

[37] Steve Hanneke. Theory of disagreement-based active learning. Foundations and Trends in Machine Learning, 7(2-3):131–309, 2014.

[38] Steve Hanneke and Liu Yang. Surrogate losses in passive and active learning. arXiv preprint arXiv:1207.3772, 2012.

[39] Steve Hanneke, Varun Kanade, and Liu Yang.
Learning with a drifting target concept. In International Conference on Algorithmic Learning Theory, pages 149–164. Springer, 2015.

[40] D. Hsu. Algorithms for Active Learning. PhD thesis, UC San Diego, 2010.

[41] Tzu-Kuo Huang, Alekh Agarwal, Daniel Hsu, John Langford, and Robert E. Schapire. Efficient and parsimonious agnostic active learning. CoRR, abs/1506.08669, 2015.

[42] Adam Tauman Kalai, Adam R. Klivans, Yishay Mansour, and Rocco A. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.

[43] Michael Kearns and Ming Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.

[44] Adam Klivans and Pravesh Kothari. Embedding hard learning problems into Gaussian space. In APPROX/RANDOM 2014, pages 793–809, 2014.

[45] Adam R. Klivans, Philip M. Long, and Rocco A. Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740, 2009.

[46] V. Koltchinskii. Rademacher complexities and bounding the excess risk in active learning. JMLR, 2010.

[47] Sanjeev R. Kulkarni, Sanjoy K. Mitter, and John N. Tsitsiklis. Active learning using arbitrary binary valued queries. Machine Learning, 11(1):23–35, 1993.

[48] Philip M. Long. On the sample complexity of PAC learning half-spaces against the uniform distribution. IEEE Transactions on Neural Networks, 6(6):1556–1559, 1995.

[49] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, pages 2326–2366, 2006.

[50] Claire Monteleoni. Efficient algorithms for general active learning. In International Conference on Computational Learning Theory, pages 650–652. Springer, 2006.

[51] T. S. Motzkin and I. J. Schoenberg. The relaxation method for linear inequalities.
Canadian Journal of Mathematics, 6(3):393–404, 1954.

[52] Francesco Orabona and Nicolò Cesa-Bianchi. Better algorithms for selective sampling. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 433–440, 2011.

[53] Maxim Raginsky and Alexander Rakhlin. Lower bounds for passive and active learning. In Advances in Neural Information Processing Systems, pages 1026–1034, 2011.

[54] Burr Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.

[55] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. Journal of Machine Learning Research, 2(Nov):45–66, 2001.

[56] Christopher Tosh and Sanjoy Dasgupta. Diameter-based active learning. In ICML, pages 3444–3452, 2017.

[57] Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 16(2):264–280, 1971.

[58] Liwei Wang. Smoothness, disagreement coefficient, and the label complexity of agnostic active learning. Journal of Machine Learning Research, 12(Jul):2269–2292, 2011.

[59] Yining Wang and Aarti Singh. Noise-adaptive margin-based active learning and lower bounds under Tsybakov noise condition. In AAAI, 2016.

[60] Chicheng Zhang and Kamalika Chaudhuri. Beyond disagreement-based agnostic active learning. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada, pages 442–450, 2014.

[61] Yuchen Zhang, Percy Liang, and Moses Charikar. A hitting time analysis of stochastic gradient Langevin dynamics.
In COLT, pages 1980–2022, 2017.