{"title": "Data-Driven Online to Batch Conversions", "book": "Advances in Neural Information Processing Systems", "page_first": 267, "page_last": 274, "abstract": null, "full_text": "Data-Driven Online to Batch Conversions\nOfer Dekel and Yoram Singer School of Computer Science and Engineering The Hebrew University, Jerusalem 91904, Israel { oferd,singer} @cs.huji.ac.il\n\nAbstract\nOnline learning algorithms are typically fast, memory efficient, and simple to implement. However, many common learning problems fit more naturally in the batch learning setting. The power of online learning algorithms can be exploited in batch settings by using online-to-batch conversions techniques which build a new batch algorithm from an existing online algorithm. We first give a unified overview of three existing online-to-batch conversion techniques which do not use training data in the conversion process. We then build upon these data-independent conversions to derive and analyze data-driven conversions. Our conversions find hypotheses with a small risk by explicitly minimizing datadependent generalization bounds. We experimentally demonstrate the usefulness of our approach and in particular show that the data-driven conversions consistently outperform the data-independent conversions.\n\n1 Introduction\nBatch learning is probably the most common supervised machine-learning setting. In the batch setting, instances are drawn from a domain X and are associated with target values from a target set Y . The learning algorithm is given a training set of examples, where each example is an instance-target pair, and attempts to identify an underlying rule that can be used to predict the target values of new unseen examples. In other words, we would like the algorithm to generalize from the training set to the entire domain of examples. The target space Y can be either discrete, as in the case of classification, or continuous, as in the case of regression. Concretely, the learning algorithm is confined to a predetermined set of candidate hypotheses H, where each hypothesis h  H is a mapping from X to Y , and the algorithm must select a \"good\" hypothesis from H. The quality of different hypotheses in H is evaluated with respect to a loss function , where (y , y  ) is interpreted as the penalty for predicting the target value y  when the correct target is y . Therefore, (y , h(x)) indicates how well hypothesis h performs with respect to the example (x, y ). When Y is a discrete set, we often use the 0-1 loss, defined by (y , y  ) = 1y=y . We also assume that there exists a probability distribution D over the product space X  Y , and that the training set was sampled i.i.d. from this distribution. Moreover, the existence of D enables us to reason about the average performance of an hypothesis over its entire domain. Formally, the risk of an hypothesis h is defined to be, RiskD (h) = E(x,y)D [(y , h(x))] . (1 )\n\n\f\nThe goal of a batch learning algorithm is to use the training set to find a hypothesis that does well on average, or more formally, to find h  H with a small risk. In contrast to the batch learning setting, online learning takes place in a sequence of rounds. On any given round, t, the learning algorithm receives a single instance xt  X and predicts its target value using an hypothesis ht-1 , which was generated on the previous round. On the first round, the algorithm uses a default hypothesis h0 . Immediately after the prediction is made, the correct target value yt is revealed and the algorithm suffers an instantaneous loss of (yt , ht-1 (xt )). Finally, the online algorithm may use the newly obtained example (xt , yt ) to improve its prediction strategy, namely to replace ht-1 with a new hypothesis ht . Alternatively, the algorithm may choose to stick with its current hypothesis and sets ht = ht-1 . An online algorithm is therefore defined by its default hypothesis h0 and the update rule it uses to define new hypotheses. The cumulative loss suffered on a sequence of rounds is the sum of instantaneous losses suffered on each one of the rounds in the sequence. In the online setting there is typically no need for any statistical assumptions since there is no notion of generalization. The goal of the online algorithm is simply to suffer a small cumulative loss on the sequence of examples it is given, and examples that are not in this sequence are entirely irrelevant. Throughout this paper, we assume that we have access to a good online learning algorithm A for the task on hand. Moreover, A is computationally efficient and easy to implement. However, the learning problem we face fits much more naturally within the batch learning setting. We would like to develop a batch algorithm B that exhibits the desirable characteristics of A but also has good generalization properties. A simple and powerful way to achieve this is to use an online-to-batch conversion technique. This is a general name for any technique which uses A as a building block in the construction of B . Several different online-to-batch conversion techniques have been developed over the years. Littlestone and Warmuth [11] introduced an explicit relation between compression and learnability, which immediately lent itself to a conversion technique for classification algorithms. Gallant [7] presented the Pocket algorithm, a conversion of Rosenblatt's online Perceptron to the batch setting. Littlestone [10] presented the Cross-Validation conversion which was further developed by Cesa-Bianchi, Conconi and Gentile [2]. All of these techniques begin by presenting the training set (x1 , y1 ), . . . , (xm , ym ) to A in some arbitrary order. As A performs the m online rounds, it generates a sequence of online hypotheses which it uses to make predictions on each round. This sequence includes the default hypothesis h0 and the m hypotheses h1 , . . . , hm generated by the update rule. The aforementioned techniques all share a common property: they all choose h, the output of the batch algorithm B , to be one of the online hypotheses h0 , . . . , hm . In this paper, we focus on a second family of conversions, which evolved somewhat later and is due to the work of Helmbold and Warmuth [8], Freund and Schapire [6] and CesaBianchi, Conconi and Gentile [2]. The conversion strategies in this family also begin by using A to generate the sequence of online hypotheses. However, instead of relying on a single hypothesis from the sequence, they set h to be some combination of the entire sequence. Another characteristic shared by these three conversions is that the training data does not play a part in determining how the online hypotheses are combined. That is, the training data is not used in any way other than to generate the sequence h0 , . . . , hm . In this sense, these conversion techniques are data-independent. In this paper, we build on the foundations of these data-independent conversions, and define conversion techniques that explicitly use the training data to derive the batch algorithm from the online algorithm. By doing so, we effectively define the data-driven counterparts of the algorithms in [8, 6, 2]. This paper is organized as follows. In Sec. 2 we review the data-independent conversion techniques from [8, 6, 2] and give a simple unified analysis for all three conversions. At the same time, we present a general framework which serves as a building-block for our datadriven conversions. Then, in Sec. 3, we derive three special cases of the general framework\n\n\f\nand demonstrate some useful properties of the data-driven conversions. Finally, in Sec. 4, we compare the different conversion techniques on several benchmark datasets and show that our data-driven approach outperforms the existing data-independent approach.\n\n2 Voting, Averaging, and Sampling\nThe first conversion we discuss is the voting conversion [6], which applies to problems where the target space Y is discrete (and relatively small), such as classification problems. The conversion presents the training set (x1 , y1 ), . . . , (xm , ym ) to the online algorithm A, which generates the sequence of online hypotheses, h0 , . . . , hm . The conversion then outputs the hypothesis hV , which is defined as follows: given an input x  X , each online hypothesis casts a vote of hi (x) and then hV outputs the target value that receives the highest number of votes. For simplicity, assume that ties are broken arbitrarily. The second conversion is the averaging conversion [2] which applies to problems where Y is a convex set. For example, this conversion is applicable to margin-based online classifiers or to regression problems where, in both cases, Y = R. This conversion also begins by using A to m generate h0 , . . . , hm . Then the batch hypothesis hA is defined to be m1 1 i=0 hi (x). The + third and last conversion discussed here is the sampling conversion [8]. This conversion is the most general and applicable to any learning problem, however this generality comes at a price. The resulting hypothesis, hS , is a stochastic function and not a deterministic one. In other words, if applied twice to the same instance, hS may output different target values. Again, this conversion begins by applying A to the training set and obtaining the sequence of online hypotheses. Every time hS is evaluated, it randomly selects one of h0 , . . . , hm and uses it to make the prediction. Since hS is a stochastic function, the definition of RiskD (hS ) changes slightly and expectation in Eq. (1) is taken also over the random function hS . Simple data-dependent bounds on the risk of hV , hA and hS can be derived, and these bounds are special cases of the more general analysis given below. We now describe a simple generalization of these three conversion techniques. It is reasonable to assume that some of the online hypotheses generated by A are better than others. For instance, the default hypothesis h0 is determined without observing even a single training example. This surfaces the question whether it is possible to isolate the \"best\" online hypotheses and only use them to define the batch hypothesis. Formally, let [m] denote the set {0, . . . , m} and let I be some non-empty subset of [m]. Now define hV (x) to be the hypothesis which I performs voting as described above, with the single difference that only the members of i {hi : i  I } participate in the vote. Similarly, define hA (x) = (1/|I |) I I hi (x), and let hS be the stochastic function that randomly chooses a function from the set {hi : i  I } I every time it is evaluated, and predicts according to it. The data-independent conversions presented in the beginning of this section are obtained by setting I = [m]. Our idea is to use the training data to find a set I which induces the batch hypotheses hV , hA , and hS with I I I the smallest risk. Since there is an exponential number of potential subsets of [m], we need to restrict ourselves to a smaller set of candidate sets. Formally, let I be a family of subsets of [m], and we restrict our search for I to the family I . Following in the footsteps of [2], we make the simplifying assumption that none of the sets in I include the largest index m. This is a technical assumption which can be relaxed at the price of a slightly less elegant analysis. We use two intuitive concepts to guide our search for I . First, for any set J  [m - 1], j define L(J ) = (1/|J |) J (yj +1 , hj (xj +1 )). L(J ) is the empirical evaluation of the loss of the hypotheses indexed by J . We would like to find a set J for which L(J ) is small since we expect that good empirical loss of the online hypotheses indicates a low risk of the batch hypothesis. Second, we would like |J | to be large so that the presence of a few bad online hypotheses in J will not have a devastating effect on the performance of the batch hypothesis. The trade-off between these two competing concepts can be formalized\n\n\f\nas follows. Let C be a non-negative constant and define,  (J ) = L(J ) + C |J |- 2 .\n1\n\n(2 )\n\nThe function  decreases as the average empirical loss L(J ) decreases, and also as |J | increases. It therefore captures the intuition described above. The function  serves as our yardstick when evaluating the candidates in I . Specifically, we set I = arg minJ I  (J ). Below we formally justify our choice of  , and specifically show that  (J ) is a rather tight upper bound on the risk of hA , hV and hS . The first lemma relates the risk of these functions J J J with the average risk of the hypotheses indexed by J . Lemma 1. Let (x1 , y1 ), . . . , (xm , ym ) be a sequence of examples which is presented to the online algorithm A and let h0 , . . . , hm be the resulting sequence of online hypotheses. Let J be a non-empty subset of [m - 1] and let  : Y  Y  R+ be a loss function. (1) If  is i the 0-1 loss then RiskD (hV )  (2/|J |) i J RiskD (hi (x)). (2) If  is convex in its second J argument then RiskD (hA )  (1/|Ji|) J RiskD (hi (x)). (3) For any loss function  it J holds that RiskD (hS ) = (1/|J |) J J RiskD (hi (x)).\n\nProof. Beginning with the voting conversion, recall that the loss function being used is the 0-1 loss, namely there is a single correct prediction which incurs a loss of 0 and every other prediction incurs a loss of 1. For any example (x, y ), if more than half of the hypotheses in {hi }iJ predict the correct outcome then clearly hV also predicts this outcome and J (y , hV (x)) = 0. Therefore, if (y , hV (x)) = 1 then at least half of the hypotheses in J J i {hi }iJ make incorrect predictions and (|J |/2)  J (y , hi (x)). We therefore get, 2i (y , hV (x))  (y , hi (x)) . J |J |\nJ\n\nThe above holds for any example (x, y ) and therefore also holds after taking expectations on both sides of the inequality. The bound now follows from the linearity of expectation and the definition of the risk function in Eq. (1). Moving on to the second claim of the lemma, we assume that  is convex in its second argument. The claim now follows from a direct application of Jensen's inequality. Finally, hS chooses its outcome by randomly choosing an hypothesis in {hi : i  J }, J where the probability of choosing each hypothesis in this set equals (1/|J |). Therefore, the i expected loss suffered by hS on an example (x, y ) is (1/|J |) J (y , hi (x)). The risk of J hS is simply the expected value of this term with respect to the random selection of (x, y ). J Again using the linearity of expectation, we obtain the third claim of the lemma. The next lemma relates the average risk of the hypotheses indexed by J with the empirical performance of these hypotheses, L(J ). In the following lemma, we use capital letters to emphasize that we are dealing with random variables. Lemma 2. Let (X1 , Y1 ), . . . , (Xm , Ym ) be a sequence of examples independently sampled according to D. Let, H0 , . . . , Hm be the sequence of online hypotheses generated by A while observing this sequence of examples. Assume that the loss function  is upperbounded by R. Then for any J  [m - 1], 1 < , -2 i C Pr RiskD (Hi ) >  (J ) ex p |J | 2R 2\nJ\n\nwhere C is the constant used in the definition of  (Eq. (2)).\n\nThe proof of this lemma is a direct application of Azuma's bound on the concentration of Lipschitz martingales [1], and is identical to that of Proposition 1 in [2]. For concreteness,\n\n\f\nwe now focus on the averaging conversion and note that the analyses of the other two conversion strategies are virtually identical. By combining the first claim of Lemma 1 with Lemma 2, we get that for any J  I it holds that RiskD (hA )   (J ) with probability at J - . least 1 - exp C 2 /(2R2 ) Using the union bound, RiskD (hA )   (J ) for all J  I J simultaneously with probability at least, -2 . C 1 - |I | exp 2R 2 The greater the value of C , the more  is influenced by the term |J |. On the other hand, a large value of C increases the probability that  indeed upper bounds RiskD (hA ) for all J J  I . In conclusion, we have theoretically justified our choice of  in Eq. (2).\n\n3 Concrete Data-Driven Conversions\nIn this section we build on the ideas of the previous section and derive three concrete datadriven conversion techniques. Suffix Conversion: An intuitive argument against selecting I = [m], as done by the dataindependent conversions, is that many online algorithms tend to generate bad hypotheses during the first few rounds of learning. As previously noted, the default hypothesis h0 is determined without observing any training data, and we should expect the first few online hypotheses to be inferior to those that are generated further along. This argument motivates us to consider subsets J of the form {a, a + 1, . . . , m - 1}, where a is a positive integer less than or equal to m - 1. Li [9] proposed this idea in the context of the voting conversion and gave a heuristic criterion for choosing a. Our formal setting gives a different criterion for choosing a. In this conversion we define I to be the set of all suffixes of [m - 1]. After the algorithm generates h0 , . . . , hm , we set I to be I = arg minJ I  (J ). Interval Conversion: Kernel-based hypotheses are functions that take the form, h(x) = n j =1 j K (zj , x), where K is a Mercer kernel, z1 , . . . , zn are instances, often referred to as support patterns and 1 , . . . , n are real weights. A variety of different batch algorithms produce kernel-based hypotheses, including the Support Vector Machine [12]. An important learning problem, which is currently addressed by only a handful of algorithms, is to learn a kernel-based hypothesis h which is defined by at most B support patterns. The parameter B is a predefined constant often referred to as the budget of support patterns. Naturally, kernel-based hypotheses which are represented by a few support patterns are memory efficient and faster to calculate. A similar problem arises in the online learning setting where the goal is to construct online algorithms where each online hypothesis hi is a kernel-based function defined by at most B vectors. Several online algorithms have been proposed for this problem [4, 13, 5]. First note that the data-independent conversions, with I = [m], are inadequate for this setting. Although each individual online hypothesis is defined by at most B vectors, hA is defined by the union of these sets, which can be much larger than B . To convert a budget-constrained online algorithm A into a budget-constrained batch algorithm, we make an additional assumption on the update strategy employed by A. We assume that whenever A updates its online hypothesis, it adds a single new support pattern into the set used to represent the kernel hypothesis, and possibly removes some other pattern from this set. The algorithms in [4, 13, 5] all fall into this category. Therefore, if we choose I to be the set {a, a + 1, . . . , b} for some integers 0  a < b < m, and A updates its hypothesis k times during rounds a + 1 through b, then hA is defined by at most B + k I support patterns. Concretely, define I to be the set of all non-empty intervals in [m - 1]. With C set properly,  (J ) bounds RiskD (hA ) for every J  I with high probability. Next, J\n\n\f\nJ0,7\n\nh\n\n0\n\nJ\n\n0,1\n\nJ\n\n0,3\n\nJ4,7 J2,3\n\nJ8,11 J6,7\n\nh1\n\nh\n\n2\n\nh3\n\nFigure 1: An illustration of the tree-based conversion.\n\nh\n\n4\n\nJ\n\n4,5\n\nh5\n\nh\n\n6\n\nh7\n\nh\n\n8\n\nJ\n\n8,9\n\nJ10,11\n\nh9\n\nh\n\n10\n\nh11\n\nh12 . . .\n\ngenerate h0 , . . . , hm by running A with a budget parameter of B /2. Finally, choose I to be the set in I which contains at most B /2 updates and also minimizes the  function. By construction, the resulting hypothesis, hA , is defined using at most B support patterns. I Tree-Based Conversion: A drawback of the suffix conversions is that it must be performed in two consecutive stages. First h0 , . . . , hm are generated and stored in memory. Only then can we calculate  (J ) for every J  I and perform the conversion. Therefore, the memory requirements of this conversions grow linearly with m. We now present a conversion that can sidestep this problem by interleaving the conversion with the online hypothesis generation. This conversion slightly deviates from the general framework described in the previous section: instead of predefining a set of candidates I , we construct the optimal subset I in a recursive manner. As a consequence, the analysis in the previous section does not directly provide a generalization bound for this conversion. Assume for a moment that m is a power of 2. For all 0  a  m - 1 define Ja,a = {a}. Now, assume that we have already constructed the sets Ja,b and Jc,d , where a, b, c, d are integers such that a < d, b = (a + d - 1)/2, and c = b + 1. Given these sets, define Ja,d as follows: J if  (Ja,b )   (Jc,d )   (Ja,b )   (Ja,b  Jc,d ) a,b Jc,d if  (Jc,d )   (Ja,b )   (Jc,d )   (Ja,b  Jc,d ) . (3) Ja,d = Ja,b  Jc,d otherwise Finally, define I = J0,m-1 and output the batch hypothesis hA . An illustration of this I process is given in Fig. 1. Note that the definition of I requires only m - 1 recursive evaluations of Eq. (3). When m is not a power of 2, we can pad the sequence of online hypotheses with virtual hypotheses, each of which attains an infinite loss. This conversion can be performed in parallel with the online rounds since on round t we already have all of the information required to calculate Ja,b for all b < t. In the special case where the instances are vectors in Rn , h0 , . . . , hm are linear hypotheses and we use the averaging technique, the implementation of the tree-based conversion becomes memory efficient. Specifically, assume that each hi takes the form hi (x) = wi  x where wi is a vector of weights in Rn . In this case, storing an online hj pothesis hi is y equivalent to storing its weight vector wi . For any J  [m - 1], storing J hj requires j wj . Hence, once we calculate Ja,b we can storing the single n-dimensional vector J discard the original online hypotheses ha , . . . , hb and instead merely keep hAa,b . Moreover, J in order to calculate  we do not need to keep the set Ja,b itself but rather the values L(Ja,b ) and |Ja,b |. Overall, storing hAa,b , L(Ja,b ), and |Ja,b | requires only a constant amount of J memory. It can be verified using an inductive argument that the overall memory utilization of this conversion is O(log(m)), which is significantly less than the O(m) space required by the suffix conversion.\n\n4 Experiments\nWe now turn to an empirical evaluation of the averaging and voting conversions. We chose multiclass classification as the underlying task and used the multiclass version of\n\n\f\n3-fold\nMNIST LETTER\n2 0 -2 1 0 -1 1\n\n4-fold\n\n5-fold\n\n6-fold\n\n7-fold\n\n8-fold\n\n9-fold\n\n10-fold\n\nUS P S I S OL E T\n\n0 -1 4 0 -4 S I T S I T S I T S I T S I T S I T S I T S I T\n\nFigure 2: Comparison of the three data-driven averaging conversions with the dataindependent averaging conversion, for different datasets (Y-axis) and different training-set sizes (X-axis). Each bar shows the difference between the error percentages of a datadriven conversion (suffix (S), interval (I) or tree-based (T)) and of the data-independent conversion. Error bars show standard deviation over the k folds.\n\nthe Passive-Aggressive (PA) algorithm [3] as the online algorithm. The PA algorithm is a kernel-based large-margin online classifier. To apply the voting conversion, Y should be a finite set. Indeed, in multiclass categorization problems the set Y consists of all possible labels. To apply the averaging conversion Y must be a convex set. To achieve this, we use the fact that PA associates a margin value with each class, and define Y = Rs (where s is the number of classes). In our experiments, we used the datasets LETTER, MNIST, USPS (training set only), and ISOLET. These datasets are of size 20000, 70000, 7291 and 7797 respectively. MNIST and USPS both contain images of handwritten digits and thus induce 10-class problems. The other datasets contain images (LETTER) and utterances (ISOLET) of the English alphabet. We did not use the standard splits into training set and test set and instead performed crossvalidation in all of our experiments. For various values of k , we split each dataset into k parts, trained each algorithm using each of these parts and tested on the k - 1 remaining parts. Specifically, we ran this experiment for k = 3, . . . , 10. The reason for doing this is that the experiment is most interesting when the training sets are small and the learning task becomes difficult. We applied the data-independent averaging and voting conversions, as well as the three data-driven variants of these conversions (6 data-driven conversions in all). The interval conversion was set to choose an interval containing 500 updates. The parameter C was arbitrarily set to 3. Additionally, we evaluated the test error of the last hypothesis generated by the online algorithm, hm . It is common malpractice amongst practitioners to use hm as if it were a batch hypothesis, instead of using an online-to-batch conversion. As a byproduct of our experiments, we show that hm performs significantly worse than any of the conversion techniques discussed in this paper. The kernel used in all of the experiments is the Gaussian kernel with default kernel parameters. We would like to emphasize that our goal was not to achieve state-of-the-art results on these datasets but rather to compare the different conversion strategies on the same sequence of hypotheses. To achieve the best results, one would have to tune C and the various kernel parameters. The results for the different variants of the averaging conversion are depicted in Fig. 2.\n\n\f\nLETTER 5-fold LETTER 10-fold MNIST 5-fold MNIST 10-fold USPS 5-fold USPS 10-fold ISOLET 5-fold ISOLET 10-fold\n\nlast 29.9  1.8 37.3  2.1 7.2  0.5 13.8  2.3 9.7  1.0 12.7  4.7 20.1  3.8 28.6  3.6\n\naverage 21.2  0.5 26.9  0.7 5.9  0.4 9.5  0.8 7.5  0.4 10.1  0.7 17.6  4.1 25.8  2.8\n\naverage-sfx 20.5  0.6 26.5  0.6 5.3  0.6 9.1  0.8 7.1  0.4 9.5  0.8 16.7  3.3 22.7  3.3\n\nvoting 23.4  0.8 30.2  1.0 7.0  0.5 8.7  0.5 9.4  0.4 12.5  1.0 20.6  3.4 29.3  3.1\n\nvoting-sfx 21.5  0.8 27.9  0.6 6.5  0.5 8.0  0.5 8.8  0.3 11.3  0.6 18.3  3.9 26.7  4.0\n\nTable 1: Percent of errors averaged over the k folds with standard deviation. Results are given for the last online hypothesis (hm ), the data-independent averaging and voting conversions, and their suffix variants. The lowest error on each row is shown in bold. For each dataset and each training-set size, we present a bar-plot which represents by how much each of the data-driven averaging conversions improves over the data-independent averaging conversion. For instance, the left bar in each plot shows the difference between the test errors of the suffix conversion and the data-independent conversion. A negative value means that the data-driven technique outperforms the data-independent one. The results clearly indicate that the suffix and tree-based conversions consistently improve over the data-independent conversion. The interval conversion does not improve as much and occasionally even looses to the data-independent conversion. However, this is a small price to pay in situations where it is important to generate a compact kernel-based hypothesis. Due to the lack of space, we omit a similar figure for the voting conversion and merely note that the plots are very similar to the ones in Fig. 2. In Table 1 we give some concrete values of test error, and compare data-independent and data-driven versions of averaging and voting, using the suffix conversion. As a reference, we also give the results obtained by the last hypothesis generated by the online algorithm. In all of the experiments, the data-driven conversion outperforms the data-independent conversion. In general, averaging exhibits better results than voting, while the last online hypothesis is almost always inferior to all of the online-to-batch conversions.\n\nReferences\n[1] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68:357367, 1967. [2] N. Cesa-Bianchi, A. Conconi, and C.Gentile. On the generalization ability of on-line learning algorithms. IEEE Transactions on Information Theory, 2004. [3] K. Crammer, O. Dekel, J. Keshet, S. Shalev-Shwartz, and Y. Singer. Online passive aggressive algorithms. Journal of Machine Learning Research, 2006. [4] K. Crammer, J. Kandola, and Y. Singer. Online classification on a budget. NIPS 16, 2003. [5] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The Forgetron: A kernel-based perceptron on a fixed budget. NIPS 18, 2005. [6] Y. Freund and R. E. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3):277296, 1999. [7] S. I. Gallant. Optimal linear discriminants. ICPR 8, pages 849852. IEEE, 1986. [8] D. P. Helmbold and M. K. Warmuth. On weak learning. Journal of Computer and System Sciences, 50:551573, 1995. [9] Y. Li. Selective voting for perceptron-like on-line learning. In ICML 17, 2000. [10] N. Littlestone. From on-line to batch learning. COLT 2, pages 269284, July 1989. [11] N. Littlestone and M. Warmuth. Relating data compression and learnability. Unpublished manuscript, November 1986. [12] V. N. Vapnik. Statistical Learning Theory. Wiley, 1998. [13] J. Weston, A. Bordes, and L. Bottou. Online (and offline) on a tighter budget. AISTAT 10, 2005.\n\n\f\n", "award": [], "sourceid": 2775, "authors": [{"given_name": "Ofer", "family_name": "Dekel", "institution": null}, {"given_name": "Yoram", "family_name": "Singer", "institution": null}]}