{"title": "Linear Classification and Selective Sampling Under Low Noise Conditions", "book": "Advances in Neural Information Processing Systems", "page_first": 249, "page_last": 256, "abstract": "We provide a new analysis of an efficient margin-based algorithm for selective sampling in classification problems. Using the so-called Tsybakov low noise condition to parametrize the instance distribution, we show bounds on the convergence rate to the Bayes risk of both the fully supervised and the selective sampling versions of the basic algorithm. Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate $n^{-(1+\\alpha)/(3+\\alpha)}$, with labels being sampled at the same rate (here $n$ denotes the sample size, and $\\alpha > 0$ is the exponent in the low noise condition). We compare this convergence rate to the rate $n^{-(1+\\alpha)/(2+\\alpha)}$ achieved by the fully supervised algorithm using all labels. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.", "full_text": "Linear Classi\ufb01cation and Selective Sampling\n\nUnder Low Noise Conditions\n\nGiovanni Cavallanti\n\nDSI, Universit`a degli Studi di Milano, Italy\n\nNicol`o Cesa-Bianchi\n\nDSI, Universit`a degli Studi di Milano, Italy\n\ncavallanti@dsi.unimi.it\n\ncesa-bianchi@dsi.unimi.it\n\nClaudio Gentile\n\nDICOM, Universit`a dell\u2019Insubria, Italy\n\nclaudio.gentile@uninsubria.it\n\nAbstract\n\nWe provide a new analysis of an ef\ufb01cient margin-based algorithm for selective\nsampling in classi\ufb01cation problems. Using the so-called Tsybakov low noise con-\ndition to parametrize the instance distribution, we show bounds on the conver-\ngence rate to the Bayes risk of both the fully supervised and the selective sampling\nversions of the basic algorithm. 
Our analysis reveals that, excluding logarithmic factors, the average risk of the selective sampler converges to the Bayes risk at rate N^{−(1+α)(2+α)/(2(3+α))}, where N denotes the number of queried labels, and α > 0 is the exponent in the low noise condition. For all α > √3 − 1 ≈ 0.73 this convergence rate is asymptotically faster than the rate N^{−(1+α)/(2+α)} achieved by the fully supervised version of the same classifier, which queries all labels, and for α → ∞ the two rates exhibit an exponential gap. Experiments on textual data reveal that simple variants of the proposed selective sampler perform much better than popular and similarly efficient competitors.\n\n1 Introduction\n\nIn the standard online learning protocol for binary classification the learner receives a sequence of instances generated by an unknown source. Each time a new instance is received the learner predicts its binary label, and is then given the true label of the current instance before the next instance is observed. This protocol is natural in many applications, for instance weather forecasting or stock market prediction, because Nature (or the market) is spontaneously disclosing the true label after each learner's guess. On the other hand, in many other applications obtaining labels may be an expensive process. In order to address this problem, a variant of online learning that has been proposed is selective sampling. In this modified protocol the true label of the current instance is never revealed unless the learner decides to issue an explicit query. The learner's performance is then measured with respect to both the number of mistakes (made on the entire sequence of instances) and the number of queries. A natural sampling strategy is one that tries to identify labels which are likely to be useful to the algorithm, and then queries those ones only. 
This strategy somehow needs to combine a measure of utility of examples with a measure of confidence. In the case of learning with linear functions, a statistic that has often been used to quantify both utility and confidence is the margin. In [10] this approach was employed to define a selective sampling rule that queries a new label whenever the margin of the current instance, with respect to the current linear hypothesis, is smaller (in magnitude) than an adaptively adjusted threshold. Margins were computed using a linear learning algorithm based on an incremental version of Regularized linear Least-Squares (RLS) for classification. Although this selective sampling algorithm is efficient, and has simple variants working quite well in practice, the rate of convergence to the Bayes risk was never assessed in terms of natural distributional parameters, thus preventing a full understanding of the properties of this algorithm.\n\nWe improve on those results in several ways, making three main contributions: (i) By coupling the Tsybakov low noise condition, used to parametrize the instance distribution, with the linear model of [10], defining the conditional distribution of labels, we prove that the fully supervised RLS (all labels are queried) converges to the Bayes risk at rate Õ(n^{−(1+α)/(2+α)}), where α ≥ 0 is the noise exponent in the low noise condition. (ii) Under the same low noise condition, we prove that the RLS-based selective sampling rule of [10] converges to the Bayes risk at rate Õ(n^{−(1+α)/(3+α)}), with labels being queried at rate Õ(n^{−α/(2+α)}). Moreover, we show that similar results can be established for a mistake-driven (i.e., space and time efficient) variant. 
(iii) We perform experiments\non a real-world medium-size dataset showing that variants of our mistake-driven sampler compare\nfavorably with other selective samplers proposed in the literature, like the ones in [11, 16, 20].\nRelated work. Selective sampling, originally introduced by Cohn, Atlas and Ladner in [13, 14],\ndiffers from the active learning framework as in the latter the learner has more freedom in selecting\nwhich instances to query. For example, in Angluin\u2019s adversarial learning with queries (see [1] for a\nsurvey), the goal is to identify an unknown boolean function f from a given class, and the learner\ncan query the labels (i.e., values of f) of arbitrary boolean instances. Castro and Nowak [9] study a\nframework in which the learner also queries arbitrary domain points. However, in their case labels\nare stochastically related to instances (which are real vectors). They prove risk bounds in terms\nof nonparametric characterizations of both the regularity of the Bayes decision boundary and the\nbehavior of the noise rate in its proximity. In fact, a large statistical literature on adaptive sampling\nand sequential hypothesis testing exists (see for instance the detailed description in [9]) which is\nconcerned with problems that share similarities with active learning. The idea of querying small\nmargin instances when learning linear classi\ufb01ers has been explored several times in different active\nlearning contexts. Campbell, Cristianini and Smola [8], and also Tong and Koller [23], study a pool-\nbased model of active learning, where the algorithm is allowed to interactively choose which labels\nto obtain from an i.i.d. pool of unlabeled instances. A landmark result in the selective sampling\nprotocol is the query-by-committee algorithm of Freund, Seung, Shamir and Tishby [17]. 
In the realizable (noise-free) case, and under strong distributional assumptions, this algorithm is shown to require exponentially fewer labels than instances when learning linear classifiers (see also [18] for a more practical implementation). An exponential advantage in the realizable case is also obtained with a simple variant of the Perceptron algorithm by Dasgupta, Kalai and Monteleoni [16], under the sole assumption that instances are drawn from the uniform distribution over the unit ball in R^d. In the general statistical learning case, under no assumptions on the joint distribution of labels and instances, selective sampling bears no such exponential advantage. For instance, Kääriäinen shows that, in order to approach the risk of the best linear classifier f* within error ε, at least Ω((η/ε)²) labels are needed, where η is the risk of f*. A much more general nonparametric lower bound for active learning is obtained by Castro and Nowak [9]. General selective sampling strategies for the nonrealizable case have been proposed in [3, 4, 15]. However, none of these learning algorithms seems to be computationally efficient when learning linear classifiers in the general agnostic case.\n\n2 Learning protocol and data model\n\nWe consider the following online selective sampling protocol. At each step t = 1, 2, ... the sampling algorithm (or selective sampler) receives an instance x_t ∈ R^d and outputs a binary prediction for the associated label y_t ∈ {−1, +1}. After each prediction, the algorithm has the option of "sampling" (issuing a query) in order to receive the label y_t. We call the pair (x_t, y_t) an example. After seeing the label y_t, the algorithm can choose whether or not to update its internal state using the new information encoded by (x_t, y_t).\nWe assume instances x_t are realizations of i.i.d. 
random variables X_t drawn from an unknown distribution on the surface of the unit Euclidean sphere in R^d, so that ‖X_t‖ = 1 for all t ≥ 1. Following [10], we assume that labels y_t are generated according to the following simple linear noise model: there exists a fixed and unknown vector u ∈ R^d, with Euclidean norm ‖u‖ = 1, such that E[Y_t | X_t = x_t] = u^⊤x_t for all t ≥ 1. Hence X_t = x_t has label 1 with probability (1 + u^⊤x_t)/2 ∈ [0, 1]. Note that SGN(f*), for f*(x) = u^⊤x, is the Bayes optimal classifier for this noise model. In the following, all probabilities P and expectations E are understood with respect to the joint distribution of the i.i.d. data process {(X₁, Y₁), (X₂, Y₂), ...}. We use P_t to denote conditioning on (X₁, Y₁), ..., (X_t, Y_t). Let f : R^d → R be an arbitrary measurable function. The instantaneous regret R(f) is the excess risk of SGN(f) w.r.t. the Bayes risk, i.e., R(f) = P(Y₁ f(X₁) < 0) − P(Y₁ f*(X₁) < 0). Let f₁, f₂, ... be a sequence of real functions where each f_t is measurable w.r.t. the σ-algebra generated by (X₁, Y₁), ..., (X_{t−1}, Y_{t−1}), X_t. When (X₁, Y₁), ..., (X_{t−1}, Y_{t−1}) is understood from the context, we write f_t as a function of X_t only. Let R_{t−1}(f_t) be the instantaneous conditional regret R_{t−1}(f_t) = P_{t−1}(Y_t f_t(X_t) < 0) − P_{t−1}(Y_t f*(X_t) < 0). Our goal is to bound the expected cumulative regret E[R₀(f₁) + R₁(f₂) + ··· + R_{n−1}(f_n)], as a function of n, and other relevant quantities. Observe that, although the learner's predictions can only depend on the queried examples, the regret is computed over all time steps, including the ones when the selective sampler did not issue a query. 
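To make the data model concrete, the linear noise model above can be simulated in a few lines. This is a minimal illustrative sketch: the dimension and the target vector u are arbitrary choices, not quantities from the paper. Labels are ±1 coin flips with P(Y = +1 | X = x) = (1 + u^⊤x)/2, so the conditional label mean recovers the Bayes margin u^⊤x.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10
u = rng.standard_normal(d)
u /= np.linalg.norm(u)           # ||u|| = 1, the unknown target vector

def sample_example(rng):
    """Draw (x, y) from the linear noise model: x uniform on the unit
    sphere, P(y = +1 | x) = (1 + u.x) / 2, so E[y | x] = u.x."""
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)       # instance on the unit Euclidean sphere
    p_plus = (1.0 + u @ x) / 2.0
    y = 1 if rng.random() < p_plus else -1
    return x, y

# Conditioning on a fixed instance x0, the empirical label mean
# approaches the Bayes margin u.x0, and SGN(u.x0) is the Bayes prediction.
x0, _ = sample_example(rng)
labels = [1 if rng.random() < (1 + u @ x0) / 2 else -1 for _ in range(200000)]
print(abs(np.mean(labels) - u @ x0) < 0.01)
```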
In order to model the distribution of the instances around the hyperplane u^⊤x = 0, we use the Mammen-Tsybakov low noise condition [24]:\n\nThere exist c > 0 and α ≥ 0 such that P(|f*(X₁)| < ε) ≤ c ε^α for all ε > 0.   (1)\n\nWhen the noise exponent α is 0 the low noise condition becomes vacuous. In order to study the case α → ∞, one can use the following equivalent formulation of (1), see, e.g., [5]: P(f*(X₁) f(X₁) < 0) ≤ c R(f)^{α/(1+α)} for all measurable f : R^d → R. With this formulation, one can show that α → ∞ implies the hard margin condition |f*(X₁)| ≥ 1/(2c) w.p. 1.\n\n3 Algorithms and theoretical analysis\n\nWe consider linear classifiers predicting the value of Y_t through SGN(w_t^⊤ X_t), where w_t ∈ R^d is a dynamically updated weight vector which might be intended as the current estimate for u. Our w_t is an RLS estimator defined over the set of previously queried examples. More precisely, let N_t be the number of queried examples during the first t time steps, let S_{t−1} = [x′_1, ..., x′_{N_{t−1}}] be the matrix of the queried instances up to time t − 1, and let y_{t−1} = [y′_1, ..., y′_{N_{t−1}}]^⊤ be the vector of the corresponding labels. Then the RLS estimator is defined by\n\nw_t = (I + S_{t−1} S_{t−1}^⊤ + x_t x_t^⊤)^{−1} S_{t−1} y_{t−1},   (2)\n\nwhere I is the d × d identity matrix. Note that w_t depends on the current instance x_t. The RLS estimator in this particular form has been first considered by Vovk [25] and by Azoury and Warmuth [2]. Compared to standard RLS, here x_t acts by further reducing the variance of w_t. We use Δ̂_t to denote the margin w_t^⊤ X_t whenever w_t is understood from the context, and we use Δ_t to denote the Bayes margin f*(X_t) = u^⊤X_t. Thus Δ̂_t is the current approximation to Δ_t. Note that Δ̂_t is measurable w.r.t. the σ-algebra generated by (X₁, Y₁), ..., (X_{t−1}, Y_{t−1}), X_t. The RLS estimator (2) can be stored in space Θ(d²), which we need for the inverse of I + S_{t−1} S_{t−1}^⊤ + x_t x_t^⊤. Moreover, using a standard formula for small-rank adjustments of inverse matrices, we can compute updates and predictions in time Θ(d²). The algorithm in (2) can also be expressed in dual variable form. This is needed, for instance, when we want to use the feature expansion facility provided by kernel functions. In this case, at time t the RLS estimator (2) can be represented in O(N²_{t−1}) space. The update time is also quadratic in N_{t−1}.\n\nOur first result establishes a regret bound for the fully supervised algorithm, i.e., the algorithm that predicts using RLS as in (2), queries the label of every instance, and stores all examples. This result is the baseline against which we measure the performance of our selective sampling algorithm. The regret bound is expressed in terms of the whole spectrum of the process covariance matrix E[X₁X₁^⊤].\n\nTheorem 1 Assume the low noise condition (1) holds with exponent α ≥ 0 and constant c > 0. Then the expected cumulative regret after n steps of the fully supervised algorithm based on (2) is bounded by E[(4c(1 + ln|I + S_n S_n^⊤|))^{(1+α)/(2+α)}] n^{1/(2+α)}. This, in turn, is bounded from above by (4c(1 + Σ_{i=1}^d ln(1 + nλ_i)))^{(1+α)/(2+α)} n^{1/(2+α)} = O((d ln n)^{(1+α)/(2+α)} n^{1/(2+α)}). Here |·| denotes the determinant of a matrix, S_n = [X₁, X₂, ..., X_n], and λ_i is the i-th eigenvalue of E[X₁X₁^⊤].\n\nWhen α = 0 (corresponding to a vacuous noise condition) the bound of Theorem 1 reduces to O(√(d n ln n)). When α → ∞ (corresponding to a hard margin condition) the bound gives the logarithmic behavior O(d ln n). Notice that Σ_{i=1}^d ln(1 + nλ_i) is substantially smaller than d ln n whenever the spectrum of E[X₁X₁^⊤] is rapidly decreasing. In fact, the second bound is clearly meaningful even when d = ∞, while the third one only applies to the finite dimensional case.\n\nParameters: λ > 0, ρ_t > 0 for each t ≥ 1.\nInitialization: weight vector w = (0, ..., 0)^⊤; storage counter N = 0.\nAt each time t = 1, 2, ... do the following:\n1. Observe instance x_t ∈ R^d : ‖x_t‖ = 1;\n2. Predict the label y_t ∈ {−1, 1} with SGN(w_t^⊤ x_t), where w_t is as in (2);\n3. If N ≤ ρ_t then query label y_t and store (x_t, y_t);\n4. Else if Δ̂²_t ≤ (128 ln t)/(λN) then schedule the query of y_{t+1};\n5. If (x_t, y_t) is scheduled to be stored, then increment N and update w using (x_t, y_t).\n\nFigure 1: The selective sampling algorithm.\n\nFast rates of convergence have typically been proven for batch-style algorithms, such as empirical risk minimizers and SVM (see, e.g., [24, 22]), rather than for online algorithms. A reference closer to our paper is Ying and Zhou [26], where the authors prove bounds for online linear classification using the low noise condition (1), though under different distributional assumptions.\nOur second result establishes a new regret bound, under low noise conditions, for the selective sampler introduced in [10]. 
This variant, described in Figure 1, queries all labels (and stores all examples) during an initial stage of length at least (16d)/λ², where λ denotes the smallest nonzero eigenvalue of the process covariance matrix E[X₁X₁^⊤]. When this transient regime is over, the sampler issues a query at time t based on both the query counter N_{t−1} and the margin Δ̂_t. Specifically, if evidence is collected that the number N_{t−1} of stored examples is smaller than our current estimate of 1/Δ²_t, that is, if Δ̂²_t ≤ (128 ln t)/(λN_{t−1}), then we query (and store) the label of the next instance x_{t+1}. Note that the margin threshold explicitly depends, through λ, on additional information about the data-generating process. This additional information is needed because, unlike the fully supervised classifier of Theorem 1, the selective sampler queries labels at random steps. This prevents us from bounding the sum of conditional variances of the involved RLS estimator through ln|I + S_n S_n^⊤|, as we can do when proving Theorem 1 (see below). Instead, we have to individually bound each conditional variance term via the smallest empirical eigenvalue of the correlation matrix. The transient regime in Figure 1 is exactly needed to ensure that this smallest empirical eigenvalue gets close enough to λ. Compared to the analysis contained in [10], we are able to better capture the two main aspects of the selective sampling protocol: first, we control the probability of making a mistake when we do not query labels; second, the algorithm is able to adaptively optimize the sampling rate by exploiting the additional information provided by the examples having small margin. The appropriate sampling rate clearly depends on the (unknown) amount of noise α, which the algorithm implicitly learns on the fly. 
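The estimator (2) that produces these margins, together with the small-rank inverse adjustment mentioned in Section 3, can be sketched numerically. The data below are synthetic placeholders, the Sherman-Morrison identity is one standard choice for the rank-one adjustment, and t and lam in the query test are illustrative numbers, not values from the analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5

# Synthetic stored (queried) examples: columns of S, labels y.
S = rng.standard_normal((d, 8))
S /= np.linalg.norm(S, axis=0)            # instances on the unit sphere
y = rng.choice([-1.0, 1.0], size=8)
x = rng.standard_normal(d)
x /= np.linalg.norm(x)                    # current instance x_t

# RLS estimator (2): w_t = (I + S S^T + x x^T)^{-1} S y.
A = np.eye(d) + S @ S.T + np.outer(x, x)
w = np.linalg.solve(A, S @ y)
margin = w @ x                            # margin estimate for x_t

# Sherman-Morrison rank-one update: maintaining the inverse of
# I + S S^T and adjusting it for x x^T costs Theta(d^2), with no
# fresh matrix inversion.
B_inv = np.linalg.inv(np.eye(d) + S @ S.T)
Bx = B_inv @ x
A_inv = B_inv - np.outer(Bx, Bx) / (1.0 + x @ Bx)
w_fast = A_inv @ (S @ y)

# Query test of Figure 1 (t, lam are illustrative placeholders).
t, lam, N = 100, 0.2, S.shape[1]
query_next = margin ** 2 <= 128 * np.log(t) / (lam * N)

print(np.allclose(w, w_fast))             # both routes agree
```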
In this respect, our algorithm is more properly an adaptive sampler, rather than a selective sampler. Finally, we stress that it is fairly straightforward to add to the algorithm in Figure 1 a mistake-driven rule for storing examples. Such a rule provides that, when a small margin is detected, a query be issued (and the next example be stored) only if SGN(Δ̂_t) ≠ y_t (i.e., only if the current prediction is mistaken). This turns out to be highly advantageous from a computational standpoint, because of the sparsity of the computed solution. It is easy to adapt our analysis to obtain even for this algorithm the same regret bound as the one established in Theorem 2. However, in this case we can only give guarantees on the expected number of stored examples (which can indeed be much smaller than the actual number of queried labels).\n\nTheorem 2 Assume the low noise condition (1) holds with unknown exponent α ≥ 0 and assume the selective sampler of Figure 1 is run with ρ_t = (16/λ²) max{d, ln t}. Then, after n steps, the expected cumulative regret is bounded by O((d + ln n)/λ² + ((ln n)/λ)^{(1+α)/(3+α)} n^{2/(3+α)}), whereas the expected number of queried labels (including the stored ones) is bounded by O((d + ln n)/λ² + ((ln n)/λ)^{α/(2+α)} n^{2/(2+α)}).\n\nThe proof, sketched below, hinges on showing that Δ̂_t is an almost unbiased estimate of the true margin Δ_t, and relies on known concentration properties of i.i.d. processes. In particular, we show that our selective sampler is able to adaptively estimate the number of queries needed to ensure a 1/t increase of the regret when a query is not issued at time t.\n\nAs expected, when we compare our semi-supervised selective sampler (Theorem 2) to the fully supervised "yardstick" (Theorem 1), we see that the per-step regret of the former vanishes at a significantly slower rate than the latter, i.e., n^{−(1+α)/(3+α)} vs. n^{−(1+α)/(2+α)}. Note, however, that the per-step regret of the semi-supervised algorithm vanishes faster than its fully-supervised counterpart when both regrets are expressed in terms of the number N of issued queries. To see this, consider first the case α → ∞ (the hard margin case, essentially analyzed in [10]). Then both algorithms have a per-step regret of order (ln n)/n. However, since the semi-supervised algorithm makes only N = O(ln n) queries, we have that, as a function of N, the per-step regret of the semi-supervised algorithm is of order N/e^N, whereas the fully supervised one has only (ln N)/N. We have thus recovered the exponential advantage observed in previous works [16, 17]. When α = 0 (vacuous noise conditions), the per-step regret rates in terms of N become (excluding logarithmic factors) of order N^{−1/3} in the semi-supervised case and of order N^{−1/2} in the fully supervised case. Hence, there is a critical value of α where the semi-supervised bound becomes better. In order to find this critical value we write the rates of the per-step regret for 0 ≤ α < ∞, obtaining N^{−(1+α)(2+α)/(2(3+α))} (semi-supervised algorithm) and N^{−(1+α)/(2+α)} (fully supervised algorithm). By comparing the two exponents we find that, asymptotically, the semi-supervised rate is better than the fully supervised one for all values of α > √3 − 1. This indicates that selective sampling is advantageous when the noise level (as modeled by the Mammen-Tsybakov condition) is not too high. 
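The critical value √3 − 1 can be double-checked numerically by intersecting the two per-step exponents. This is a small illustrative script, not part of the analysis:

```python
# Per-step regret exponents in terms of the number N of issued queries:
# semi-supervised: N^{-(1+a)(2+a)/(2(3+a))}, fully supervised: N^{-(1+a)/(2+a)}.
def semi(a):
    return (1 + a) * (2 + a) / (2 * (3 + a))

def full(a):
    return (1 + a) / (2 + a)

# At a = 0 the supervised exponent (1/2) beats the semi-supervised one (1/3);
# for large a the ordering flips.  Locate the crossover by bisection.
lo, hi = 0.0, 10.0
for _ in range(100):
    mid = (lo + hi) / 2
    if semi(mid) < full(mid):
        lo = mid
    else:
        hi = mid

print(abs(lo - (3 ** 0.5 - 1)) < 1e-9)   # crossover at sqrt(3) - 1 ~ 0.732
```

Solving (2 + α)² = 2(3 + α) by hand gives α² + 2α − 2 = 0, i.e., α = √3 − 1, matching the bisection result.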
Finally, observe that, the way it is stated now, the bound of Theorem 2 only applies to the finite-dimensional (d < ∞) case. It turns out this is a fixable artifact of our analysis, rather than an intrinsic limitation of the selective sampling scheme in Figure 1. See Remark 3 below.\n\nProof of Theorem 1. The proof proceeds by relating the classification regret to the square loss regret via a comparison theorem. The square loss regret is then controlled by applying a known pointwise bound. For all measurable f : R^d → R, let R_φ(f) = E[(1 − Y₁ f(X₁))² − (1 − Y₁ f*(X₁))²] be the square loss regret, and R_{φ,t−1} its conditional version. We apply the comparison theorem from [5] with the ψ-transform function ψ(z) = z² associated with the square loss. Under the low noise condition (1) this yields R(f) ≤ (4c R_φ(f))^{(1+α)/(2+α)} for all measurable f. We thus have E[Σ_{t=1}^n R_{t−1}(f_t)] ≤ E[Σ_{t=1}^n (4c R_{φ,t−1}(f_t))^{(1+α)/(2+α)}] ≤ E[n ((4c/n) Σ_{t=1}^n R_{φ,t−1}(f_t))^{(1+α)/(2+α)}], the last term following from Jensen's inequality. Further, we observe that in our probabilistic model f*(x) = u^⊤x is Bayes optimal for the square loss. In fact, for any unit norm x ∈ R^d, we have f*(x) = arginf_{z∈R} ((1 − z)² (1 + u^⊤x)/2 + (1 + z)² (1 − u^⊤x)/2) = u^⊤x. Hence Σ_{t=1}^n R_{φ,t−1}(f_t) = Σ_{t=1}^n ((Y_t − w_t^⊤X_t)² − (Y_t − u^⊤X_t)²), which, in turn, can be bounded pointwise (see, e.g., [12, Theorem 11.8]) by 1 + ln|I + S_n S_n^⊤|. Putting together gives the first bound. Next, we take the bound just obtained and apply Jensen's inequality twice, first to the concave function (·)^{(1+α)/(2+α)} of a real argument, and then to the concave function ln|·| of a (positive definite) matrix argument. Observing that E S_n S_n^⊤ = E[Σ_{t=1}^n X_t X_t^⊤] = n E X₁X₁^⊤ yields the second bound. The third bound derives from the second one just by using λ_i ≤ 1. □\n\nProof sketch of Theorem 2. We aim at bounding from above the cumulative regret Σ_{t=1}^n (P(Y_t Δ̂_t < 0) − P(Y_t Δ_t < 0)), which, according to our probabilistic model, can be shown to be at most c n ε^{1+α} + Σ_{t=1}^n P(Δ_t Δ̂_t ≤ 0, |Δ_t| ≥ ε). The last sum is upper bounded by\n\nΣ_{t=1}^n P(N_{t−1} ≤ ρ_t)   (I)\n+ Σ_{t=1}^n P(Δ̂²_t ≤ (128 ln t)/(λN_{t−1}), N_{t−1} > ρ_t, |Δ_t| ≥ ε)   (II)\n+ Σ_{t=1}^n P(Δ_t Δ̂_t ≤ 0, Δ̂²_t > (128 ln t)/(λN_{t−1}), N_{t−1} > ρ_t)   (III)\n\nwhere: (I) are the initial time steps; (II) are the time steps on which we trigger the query of the next label (because Δ̂²_t is smaller than the threshold at time t); (III) are the steps that do not trigger any queries at all. Note that (III) bounds the regret over non-sampled examples. In what follows, we sketch the way we bound each of the three terms separately. A bound on (I) is easily obtained as (I) ≤ ρ_n = O((d + ln n)/λ²), just because ρ_n ≥ ρ_t for all t ≤ n. To bound (II) and (III) we need to exploit the fact that the subsequence of stored instances and labels is a sequence of i.i.d. random variables distributed as (X₁, Y₁), see [10]. This allows us to carry out a (somewhat involved) bias-variance analysis showing that, for any fixed number N_{t−1} = s of stored examples, Δ̂_t is an almost unbiased estimator of Δ_t, whose bias and variance tend to vanish as 1/s when s is sufficiently large. In particular, if |Δ_t| ≥ ε then Δ̂_t ≈ Δ_t as long as N_{t−1} is of the order of (ln n)/(λε²). The variance of Δ̂_t is controlled by known results (the one we used is [21, Theorem 4.2]) on the concentration of the eigenvalues of an empirical correlation matrix (1/s) Σ_i X_i X_i^⊤ to the eigenvalues of the process covariance matrix E[X₁X₁^⊤]. For such a result to apply, we have to impose that N_{t−1} ≥ ρ_t. By suitably combining these concentration results we can bound term (II) by O((d + ln n)/(λε²)) and term (III) by O(ln n). Putting together and choosing ε of the order of ((ln n)/(λn))^{1/(3+α)} gives the desired regret bound. The bound on the number of queried labels is obtained in a similar way. □\n\nRemark 3 The linear dependence on d in Theorem 2 derives from a direct application of the concentration results in [21]. In fact, it is possible to take into account in a fairly precise manner the way the process spectrum decreases (e.g., [6, 7]), thereby extending the above analysis to the infinite-dimensional case. In this paper, however, we decided to stick to the simpler analysis leading to Theorem 2, since the resulting bounds would be harder to read, and would somehow obscure the understanding of regret and sampling rate behavior as a function of n.\n\n4 Experimental analysis\n\nIn evaluating the empirical performance of our selective sampling algorithm, we consider two additional variants obtained by slightly modifying Step 4 in Figure 1. The first variant (which we just call SS, Selective Sampler) queries the current label instead of the next one. The rationale here is that we want to leverage the more informative content of small margin instances. 
The second variant is a mistake-driven version (referred to as SSMD, Selective Sampling Mistake Driven) that queries the current label (and stores the corresponding example) only if the label gets mispredicted. For clarity, the algorithm in Figure 1 will then be called SSNL (Selective Sampling Next Label), since it queries the next label whenever a small margin is observed. For all three algorithms we dropped the initial transient regime (Step 3 in Figure 1).\n\nWe run our experiments on the first, in chronological order, 40,000 newswire stories from the Reuters Corpus Volume 1 dataset (RCV1). Every example in this dataset is encoded as a vector of real attributes computed through a standard TF-IDF bag-of-words processing of the original news stories, and is tagged with zero or more labels from a set of 102 classes. The online categorization of excerpts from a newswire feed is a realistic learning problem for selective sampling algorithms, since a newswire feed consists of a large amount of uncategorized data with a high labeling cost. The classification performance is measured using a macroaveraged F-measure 2RP/(R + P), where P is the precision (fraction of correctly classified documents among all documents that were classified positive for the given topic) and R is the recall (fraction of correctly classified documents among all documents that are labelled with the given topic). All algorithms presented here are evaluated using dual variable implementations and linear kernels.\n\nThe results are summarized in Figures 2 and 3. The former only refers to (an average over) the 50 most frequent categories, while the latter includes them all. In Figure 2 (left) we show how SSMD compares to SSNL, and to its most immediate counterpart, SS. 
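For reference, the macroaveraged F-measure used in these comparisons is computed per topic from precision and recall and then averaged over topics. A toy sketch (the topic counts below are made-up numbers, purely for illustration):

```python
# Macroaveraged F-measure: for each topic compute F = 2RP/(R+P) from
# precision P and recall R, then average over topics.

def f_measure(true_pos, pred_pos, correct):
    """F-measure from counts: true_pos documents actually positive,
    pred_pos classified positive, correct both positive and so classified."""
    if pred_pos == 0 or true_pos == 0 or correct == 0:
        return 0.0
    p = correct / pred_pos     # precision
    r = correct / true_pos     # recall
    return 2 * r * p / (r + p)

# Two hypothetical topics: (actual positives, predicted positives, overlap).
topics = [(100, 80, 60), (40, 50, 30)]
macro_f = sum(f_measure(*t) for t in topics) / len(topics)
print(round(macro_f, 4))       # 0.6667
```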
In Figure 2 (right) we compare SSMD to other algorithms that are known to have good empirical performance, including the second-order version of the label efficient classifier (SOLE), as described in [11], and the DKMPERC variant of the DKM algorithm (see, e.g., [16, 20]). DKMPERC differs from DKM in that it adopts a standard perceptron update rule. The perceptron algorithm (PERC) and its second-order counterpart (SOP) are reported as a reference, since they are designed to query all labels. In particular, SOP is a mistake-driven variant of the algorithm analyzed in Theorem 1. It is reasonable to assume that in a selective sampling setup we are interested in the performance achieved when the fraction of queried labels stays below some threshold, say 10%. In this range of sampling rates, SSMD has the steepest increase in the achieved F-measure and surpasses every other algorithm. Unsurprisingly, as the number of queried labels gets larger, SSMD, SOLE, and SOP exhibit similar behaviors. Moreover, the less than ideal plot of SSNL seems to confirm the intuition that querying small margin instances provides a significant advantage. Under our test conditions DKMPERC proved ineffective, probably because most tasks in the RCV1 dataset are not linearly separable. A similar behavior was observed in [20]. It is fair to remark that DKMPERC is a perceptron-like linear-threshold classifier, while the other algorithms considered here are based on the more computationally intensive ridge-regression-like procedure.

[Figure 2: two panels plotting F-measure (y-axis, 0 to 0.75) against the fraction of queried labels (x-axis, 0.01 to 0.1); left panel: SSMD, SSNL, SS; right panel: SSMD, DKMperc, SOLE, SOP, PERC.]

Figure 2: Average F-measure obtained by different algorithms after 40,000 examples, as a function of the fraction of queried labels. The average only refers to the 50 most frequent categories. Points are obtained by repeatedly running each algorithm with different values of its parameters (in Figure 1, the relevant parameter is λ). Trend lines are computed as approximate cubic splines connecting consecutive points.

[Figure 3: two panels over 102 topics (x-axis); left panel: normalized number of stored examples and normalized norm of the SVM weight vector; right panel: F-measure, fraction of positive examples, and fraction of queried labels.]

Figure 3: Left: correlation between the fraction of stored examples and the difficulty of each binary task, as measured by the separation margin. Right: F-measure achieved on the different binary classification tasks compared to the number of positive examples in each topic, and to the fraction of queried labels (including the stored ones). In both plots, topics are sorted by decreasing frequency of positive examples. The two plots are produced by SSMD with a specific value of the λ parameter. Varying λ does not significantly alter the reported trend.

In our selective sampling framework it is important to investigate how harder problems influence the sampling rate of an algorithm and, for each binary problem, to assess the impact of the number of positive examples on F-measure performance. Coarsely speaking, we would expect the hard topics to be the infrequent ones. Here we focus on SSMD since it is reasonably the best candidate, among our selective samplers, for application to real-world problems.
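For concreteness, the evaluation metric used throughout this section can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names are ours, and the macro average shown here simply averages per-topic scores, as done in Figure 2 over the 50 most frequent categories.

```python
def f_measure(y_true, y_pred):
    """F-measure (harmonic mean of precision and recall) for one binary topic.

    Labels are 1 (positive) and 0 (negative).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        # No true positives: both precision and recall are zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_average_f(per_topic_results):
    """Average the F-measure over a list of (y_true, y_pred) pairs, one per topic."""
    scores = [f_measure(y_true, y_pred) for y_true, y_pred in per_topic_results]
    return sum(scores) / len(scores)
```

For example, a topic predicted with one true positive, one false positive, and one false negative has precision and recall both equal to 1/2, hence F-measure 1/2.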
In Figure 3 (left) we report the fraction of examples stored by SSMD on each of the 102 binary learning tasks (i.e., on each individual topic, including the infrequent ones), and the corresponding levels of F-measure and queried labels (right). Note that in both plots topics are sorted by frequency, with the most frequent categories appearing on the left. We represent the difficulty of a learning task by the norm of the weight vector obtained by running the C-SVM algorithm on that task.¹ Figure 3 (left) clearly shows that SSMD raises the storage rate on difficult problems. In particular, even if two different tasks have largely different numbers of positive examples, the storage rate achieved by SSMD on those tasks may be similar when the norm of the weight vectors computed by C-SVM is nearly the same. On the other hand, the right plot shows (to our surprise) that the achieved F-measure is fairly independent of the number of positive examples, but this independence is obtained at the cost of querying more and more labels. In other words, SSMD seems to realize the difficulty of learning infrequent topics and, in order to achieve a good F-measure performance, compensates by querying many more labels.

¹The actual values were computed using SVM-LIGHT [19] with default parameters. Since the examples in the Reuters Corpus Volume 1 are cosine normalized, the choice of default parameters amounts to indirectly setting the parameter C to approximately 1.0.

References

[1] D. Angluin. Queries revisited. In 12th ALT, pages 12–31. Springer, 2001.
[2] K.S. Azoury and M.K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43(3):211–246, 2001.
[3] M.F. Balcan, A. Beygelzimer, and J. Langford. Agnostic active learning. In 23rd ICML, pages 65–72. ACM Press, 2006.
[4] M.F. Balcan, A. Broder, and T. Zhang.
Margin-based active learning. In 20th COLT, pages 35–50. Springer, 2007.
[5] P.L. Bartlett, M.I. Jordan, and J.D. McAuliffe. Convexity, classification, and risk bounds. JASA, 101(473):138–156, 2006.
[6] G. Blanchard, O. Bousquet, and L. Zwald. Statistical properties of kernel principal component analysis. Machine Learning, 66:259–294, 2007.
[7] M.L. Braun. Accurate error bounds for the eigenvalues of the kernel matrix. JMLR, 7:2303–2328, 2006.
[8] C. Campbell, N. Cristianini, and A. Smola. Query learning with large margin classifiers. In 17th ICML, pages 111–118. Morgan Kaufmann, 2000.
[9] R. Castro and R.D. Nowak. Minimax bounds for active learning. IEEE Trans. IT, 2008. To appear.
[10] N. Cesa-Bianchi, A. Conconi, and C. Gentile. Learning probabilistic linear-threshold classifiers via selective sampling. In 16th COLT, pages 373–387. Springer, 2003.
[11] N. Cesa-Bianchi, C. Gentile, and L. Zaniboni. Worst-case analysis of selective sampling for linear classification. JMLR, 7:1205–1230, 2006.
[12] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, 2006.
[13] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning. Machine Learning, 15(2):201–221, 1994.
[14] D. Cohn, L. Atlas, and R. Ladner. Training connectionist networks with queries and selective sampling. In NIPS 2. MIT Press, 1990.
[15] S. Dasgupta, D. Hsu, and C. Monteleoni. A general agnostic active learning algorithm. In NIPS 20, pages 353–360. MIT Press, 2008.
[16] S. Dasgupta, A.T. Kalai, and C. Monteleoni. Analysis of perceptron-based active learning. In 18th COLT, pages 249–263. Springer, 2005.
[17] Y. Freund, S. Seung, E. Shamir, and N. Tishby. Selective sampling using the query by committee algorithm. Machine Learning, 28(2/3):133–168, 1997.
[18] R. Gilad-Bachrach, A. Navot, and N.
Tishby. Query by committee made real. In NIPS 18, 2005.
[19] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods: Support Vector Learning. MIT Press, 1999.
[20] C. Monteleoni and M. Kääriäinen. Practical online active learning for classification. In 24th IEEE CVPR, pages 249–263. IEEE Computer Society Press, 2007.
[21] J. Shawe-Taylor, C.K.I. Williams, N. Cristianini, and J. Kandola. On the eigenspectrum of the Gram matrix and the generalization error of kernel-PCA. IEEE Trans. IT, 51(7):2510–2522, 2005.
[22] I. Steinwart and C. Scovel. Fast rates for support vector machines using Gaussian kernels. Annals of Statistics, 35:575–607, 2007.
[23] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In 17th ICML, pages 999–1006. Morgan Kaufmann, 2000.
[24] A. Tsybakov. Optimal aggregation of classifiers in statistical learning. The Annals of Statistics, 32(1):135–166, 2004.
[25] V. Vovk. Competitive on-line statistics. International Statistical Review, 69:213–248, 2001.
[26] Y. Ying and D.X. Zhou. Online regularized classification algorithms. IEEE Transactions on Information Theory, 52:4775–4788, 2006.