{"title": "On the Generalization Ability of On-Line Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 359, "page_last": 366, "abstract": null, "full_text": "On the Generalization Ability of On-line Learning Algorithms\n\nNicolò Cesa-Bianchi\nDTI, University of Milan\nvia Bramante 65\n26013 Crema, Italy\ncesa-bianchi@dti.unimi.it\n\nAlex Conconi\nDTI, University of Milan\nvia Bramante 65\n26013 Crema, Italy\nconconi@dti.unimi.it\n\nClaudio Gentile\nDSI, University of Milan\nvia Comelico 39\n20135 Milano, Italy\ngentile@dsi.unimi.it\n\nAbstract\n\nIn this paper we show that on-line algorithms for classification and regression can be naturally used to obtain hypotheses with good data-dependent tail bounds on their risk. Our results are proven without requiring complicated concentration-of-measure arguments and they hold for arbitrary on-line learning algorithms. Furthermore, when applied to concrete on-line algorithms, our results yield tail bounds that in many cases are comparable or better than the best known bounds.\n\n1 Introduction\n\nOne of the main contributions of the recent statistical theories for regression and classification problems [21, 19] is the derivation of functionals of certain empirical quantities (such as the sample error or the sample margin) that provide uniform risk bounds for all the hypotheses in a certain class. This approach has some known weak points. First, obtaining tight uniform risk bounds in terms of meaningful empirical quantities is generally a difficult task. 
Second, searching for the hypothesis minimizing a given empirical functional is often computationally expensive and, furthermore, the minimizing algorithm is seldom incremental (if new data is added to the training set then the algorithm needs to be run again from scratch).\n\nOn-line learning algorithms, such as the Perceptron algorithm [17], the Winnow algorithm [14], and their many variants [16, 6, 13, 10, 2, 9], are general methods for solving classification and regression problems that can be used in a fully incremental fashion. That is, they need (in most cases) a short time to process each new training example and adjust their current hypothesis. While the behavior of these algorithms is well understood in the so-called mistake bound model [14], where no assumptions are made on the way the training sequence is generated, there are fewer results concerning how to use these algorithms to obtain hypotheses with small statistical risk.\n\nLittlestone [15] proposed a method for obtaining small risk hypotheses from a run of an arbitrary on-line algorithm by using a cross validation set to test each one of the hypotheses generated during the run. This method does not require any convergence property of the on-line algorithm and provides risk tail bounds that are sharper than those obtainable choosing, for instance, the hypothesis in the run that survived the longest. 
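Littlestone's conversion can be sketched as follows. This is a minimal illustration in our own notation, not the exact procedure of [15]: we use the classical Perceptron as the on-line algorithm and the 0-1 error on a held-out set as the test; all function names are ours.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(v):
    # Ties are broken toward +1.
    return 1 if v >= 0 else -1

def perceptron_update(w, x, y):
    # One on-line trial of the classical Perceptron: update only on a mistake.
    if sign(dot(w, x)) != y:
        w = [wi + y * xi for wi, xi in zip(w, x)]
    return w

def online_to_batch_cv(train, validation, dim):
    # Run the on-line algorithm once over `train`, keeping every
    # intermediate hypothesis, then return the hypothesis with the
    # smallest 0-1 loss on the held-out `validation` set.
    w = [0.0] * dim                      # deterministic initial hypothesis
    hypotheses = [list(w)]
    for x, y in train:
        w = perceptron_update(w, x, y)
        hypotheses.append(list(w))
    def val_error(h):
        return sum(sign(dot(h, x)) != y for x, y in validation) / len(validation)
    return min(hypotheses, key=val_error)
```

Any on-line learner with the same update signature can be plugged in; the conversion itself only needs the sequence of intermediate hypotheses and a held-out sample.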
Helmbold, Warmuth, and others [11, 6, 8] showed that, without using any cross-validation sets, one can obtain expected risk bounds (as opposed to the more informative tail bounds) for a hypothesis randomly drawn among those generated during the run.\n\nIn this paper we prove, via refinements and extensions of the previous analyses, that on-line algorithms naturally lead to good data-dependent tail bounds without employing the complicated concentration-of-measure machinery needed by other frameworks [19]. In particular we show how to obtain, from an arbitrary on-line algorithm, hypotheses whose risk is close to M(Z^n)/n with high probability (Theorems 2 and 3), where n is the amount of training data and M(Z^n) is a data-dependent quantity measuring the cumulative loss of the on-line algorithm on the actual training data. When applied to concrete algorithms, the loss bound M(Z^n) translates into a function of meaningful data-dependent quantities. For classification problems, the mistake bound for the p-norm Perceptron algorithm yields a tail risk bound in terms of the empirical distribution of the margins — see (4). For regression problems, the square loss bound for ridge regression yields a tail risk bound in terms of the eigenvalues of the Gram matrix — see (5).\n\n2 Preliminaries and notation\n\nLet X and Y be arbitrary sets and Z = X × Y. An example is a pair z = (x, y), where x is an instance belonging to X and y is the label associated with x. Random variables will be denoted in upper case and their realizations in lower case. We let Z = (X, Y) be the pair of random variables taking values in X and Y, respectively. Throughout the paper, we assume that data are generated i.i.d. according to an unknown probability distribution over X × Y. All probabilities and expectations will be understood with respect to this underlying distribution. 
We use the short-hand Z^n to denote the vector-valued random variable (Z_1, ..., Z_n), where Z_t = (X_t, Y_t).\n\nA hypothesis h is any (measurable) mapping from instances x ∈ X to predictions h(x) ∈ D, where D is a given decision space. The risk of h is defined by risk(h) = E ℓ(h(X), Y), where ℓ: D × Y → R is a nonnegative loss function. Unless otherwise specified, we will assume that ℓ takes values in [0, K] for some known K. The on-line algorithms we investigate are defined within a well-known mathematical model, which is a generalization of a learning model introduced by Littlestone [14] and Angluin [1]. Let a training sequence Z^n = ((x_1, y_1), ..., (x_n, y_n)) ∈ (X × Y)^n be fixed. In this learning model, an on-line algorithm processes the examples in Z^n one at a time in trials, generating a sequence of hypotheses H_0, H_1, ..., H_n. At the beginning of the t-th trial, the algorithm receives the instance x_t and uses its current hypothesis H_{t-1} to compute a prediction H_{t-1}(x_t) ∈ D for the label y_t. 
Then, the true value of the label y_t is disclosed and the algorithm suffers a loss ℓ(H_{t-1}(x_t), y_t), measuring how bad the prediction H_{t-1}(x_t) is for the label y_t. Before the next trial begins, the algorithm generates a new hypothesis H_t, which may or may not be equal to H_{t-1}. We measure the algorithm's performance on Z^n by its cumulative loss M(Z^n) = Σ_{t=1}^n ℓ(H_{t-1}(x_t), y_t).\n\nIn our analysis, we will write H_t(Z_1, ..., Z_t) when we want to stress the fact that the cumulative loss and the hypotheses of the on-line algorithm are functions of the random sample. In particular, throughout the paper H_0 will denote the (deterministic) initial hypothesis of an arbitrary on-line algorithm and, for each 1 ≤ t ≤ n, H_t = H_t(Z_1, ..., Z_t) will be a random variable denoting the t-th hypothesis of the on-line algorithm, such that the value of H_t does not change upon changes in the values of Z_{t+1}, ..., Z_n. Our goal is to relate the risk of the hypotheses produced by an on-line algorithm running on an i.i.d. sequence Z^n to the cumulative loss M(Z^n) of the algorithm on that sequence.\n\nThe cumulative loss M(Z^n) will be our key empirical (data-dependent) quantity. Via our analysis we will obtain bounds of the form\n\nP( risk(F(H_0, ..., H_n)) ≥ M(Z^n)/n + c √((1/n) ln(1/δ)) ) ≤ δ,\n\nwhere F is a specific function of the sequence of hypotheses H_0, ..., H_n produced by the algorithm, and c is a suitable positive constant. 
We will see that for specific on-line algorithms the ratio M(Z^n)/n can be further bounded in terms of meaningful empirical quantities.\n\nOur method centers on the following simple concentration lemma about bounded losses.\n\nLemma 1 Let ℓ be an arbitrary bounded loss satisfying 0 ≤ ℓ ≤ K. Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses H_0, ..., H_{n-1} when it is run on Z^n. Then for any 0 < δ ≤ 1 we have\n\nP( (1/n) Σ_{t=1}^n risk(H_{t-1}) ≥ M(Z^n)/n + K √((2/n) ln(1/δ)) ) ≤ δ.\n\nProof. For each t = 1, ..., n, set V_t = risk(H_{t-1}) − ℓ(H_{t-1}(X_t), Y_t). We have E[V_t | F_{t-1}] = 0, since risk(H_{t-1}) = E[ℓ(H_{t-1}(X_t), Y_t) | F_{t-1}], where F_{t-1} denotes the σ-algebra generated by Z_1, ..., Z_{t-1}. Furthermore, each V_t takes values in an interval of length 2K, since ℓ takes values in [0, K]. 
A direct application of the Hoeffding-Azuma inequality [3] to the bounded random variables V_1, ..., V_n proves the lemma.\n\n3 Concentration for convex losses\n\nIn this section we investigate the risk of the average hypothesis\n\nH̄ = (1/n) Σ_{t=1}^n H_{t-1},\n\nwhere H_0, ..., H_{n-1} are the hypotheses generated by some on-line algorithm run on n training examples.1 The average hypothesis generates valid predictions whenever the decision space D is convex.\n\nTheorem 2 Let D be convex and let the loss ℓ: D × Y → [0, K] be convex in the first argument. Let an arbitrary on-line algorithm for ℓ output (not necessarily distinct) hypotheses H_0, ..., H_{n-1} when the algorithm is run on Z^n. Then for any 0 < δ ≤ 1 the following holds:\n\nP( risk(H̄) ≥ M(Z^n)/n + K √((2/n) ln(1/δ)) ) ≤ δ.\n\n1Notice that the last hypothesis H_n is not used in this average.\n\nProof. Since ℓ is convex in the first argument, by Jensen's inequality we have\n\nℓ(H̄(x), y) ≤ (1/n) Σ_{t=1}^n ℓ(H_{t-1}(x), y).\n\nTaking expectation with respect to (X, Y) yields\n\nrisk(H̄) ≤ (1/n) Σ_{t=1}^n risk(H_{t-1}). 
Using the last inequality along with Lemma 1 yields the thesis.\n\nThis theorem, which can be viewed as the tail bound version of the expected bound in [11], implies that the risk of the average hypothesis is close to M(Z^n)/n for “most” samples Z^n. On the other hand, note that it is unlikely that M(Z^n)/n concentrates around its expectation, at least without making strong assumptions on the underlying on-line algorithm. An application of Theorem 2 will be shown in Section 5. Here we just note that by applying this theorem to the Weighted Majority algorithm [16], we can prove a version of [5, Theorem 4] for the absolute loss without resorting to sophisticated concentration inequalities (details in the full paper).\n\n4 Penalized risk estimation for general losses\n\nIf the loss function ℓ is nonconvex (such as the 0-1 loss) then the risk of the average hypothesis cannot be bounded in the way shown in the previous section. However, the risk of the best hypothesis, among those generated by the on-line algorithm, cannot be higher than the average risk of the same hypotheses. Hence, Lemma 1 immediately tells us that, under no conditions on the loss function other than boundedness, for most samples Z^n at least one of the hypotheses generated has risk close to M(Z^n)/n. In this section we give a technique (Lemma 4) that, using a penalized risk estimate, finds such a hypothesis with high probability. The argument used is a refinement of Littlestone's method [15]. Unlike Littlestone's, our technique does not require a cross validation set. Therefore we are able to obtain bounds on the risk whose main term is M(Z^n)/n, where n is the size of the whole set of examples available to the learning algorithm (i.e., training set plus validation set in Littlestone's paper). 
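The two conversions discussed so far can be sketched as follows. This is a minimal illustration in our own notation: the penalty used below, of the form sqrt(ln(n(n+1)/δ)/(2s)) for an unseen suffix of length s, is one admissible choice of order sqrt(log(n/δ)/s), not necessarily the exact constant used in our proofs, and all function names are ours.

```python
import math

def average_hypothesis(weight_vectors):
    # Section 3 (convex losses): for linear hypotheses, the plain average of
    # the first n hypotheses is itself a valid hypothesis whose risk is close
    # to M(Z^n)/n.  The last hypothesis is deliberately left out of the average.
    n = len(weight_vectors) - 1
    d = len(weight_vectors[0])
    return [sum(w[i] for w in weight_vectors[:n]) / n for i in range(d)]

def penalized_selection(hypotheses, examples, loss, delta):
    # Section 4 (general bounded losses): hypotheses[t] was generated after
    # seeing examples[:t] only, so examples[t:] is an unseen suffix on which
    # its risk can be estimated without a separate validation set.  Pick the
    # hypothesis minimizing (suffix average loss) + (penalty shrinking with
    # the suffix length).  Losses are assumed to lie in [0, 1].
    n = len(examples)

    def penalty(s):
        return math.sqrt(math.log(n * (n + 1) / delta) / (2 * s))

    best, best_score = None, float("inf")
    for t in range(n):                     # the final hypothesis is never tested
        suffix = examples[t:]
        estimate = sum(loss(hypotheses[t], z) for z in suffix) / len(suffix)
        score = estimate + penalty(len(suffix))
        if score < best_score:
            best, best_score = hypotheses[t], score
    return best
```

The trade-off the penalty encodes: late hypotheses have seen more data and tend to have smaller risk, but their risk estimates rest on short suffixes and are noisier.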
Similar observations are made in [4], though the analysis there actually refers only to randomized hypotheses with 0-1 loss (namely, to absolute loss).\n\nLet us define the penalized risk estimate of hypothesis H_t by\n\nM_t(Z_{t+1}, ..., Z_n)/(n − t) + c_δ(n − t),  where c_δ(s) = √((1/(2s)) ln(n(n + 1)/δ)),\n\nn − t is the length of the suffix Z_{t+1}, ..., Z_n of the training sequence that the on-line algorithm had not seen yet when H_t was generated, and M_t(Z_{t+1}, ..., Z_n) is the cumulative loss of H_t on that suffix. Our algorithm chooses the hypothesis Ĥ minimizing the penalized risk estimate over H_0, ..., H_{n-1}. For the sake of simplicity, we will restrict to losses ℓ with range [0, 1]. However, it should be clear that losses taking values in an arbitrary bounded real interval can be handled using techniques similar to those shown in Section 3. We prove the following result.\n\nTheorem 3 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses H_0, ..., H_{n-1} when it is run on Z^n. 
Then, for any 0 < δ ≤ 1, the hypothesis Ĥ chosen using the penalized risk estimate based on c_δ satisfies\n\nP( risk(Ĥ) ≥ M(Z^n)/n + 6 √((1/n) ln(2(n + 1)/δ)) ) ≤ δ.\n\nThe proof of this theorem is based on the two following technical lemmas.\n\nLemma 4 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses H_0, ..., H_{n-1} when it is run on Z^n. Then for any 0 < δ ≤ 1 the following holds:\n\nP( risk(Ĥ) ≥ min_{0 ≤ t ≤ n-1} ( risk(H_t) + 2 c_δ(n − t) ) ) ≤ δ.\n\nProof. Let T = argmin_{0 ≤ t ≤ n-1} ( risk(H_t) + 2 c_δ(n − t) ), and set for brevity ε_t = M_t(Z_{t+1}, ..., Z_n)/(n − t) + c_δ(n − t), the penalized risk estimate of H_t. Let T̂ be the index of the chosen hypothesis, Ĥ = H_{T̂}, so that ε_{T̂} ≤ ε_t for all t. Now, if risk(Ĥ) ≥ risk(H_T) + 2 c_δ(n − T) holds then, since ε_{T̂} ≤ ε_T, either risk(H_{T̂}) ≥ ε_{T̂} or ε_T ≥ risk(H_T) + 2 c_δ(n − T) must hold. Hence we can write\n\nP( risk(Ĥ) ≥ risk(H_T) + 2 c_δ(n − T) ) ≤ Σ_{t=0}^{n-1} P( risk(H_t) ≥ M_t(Z_{t+1}, ..., Z_n)/(n − t) + c_δ(n − t) ) + P( M_T(Z_{T+1}, ..., Z_n)/(n − T) ≥ risk(H_T) + c_δ(n − T) ) ≤ n δ/(n(n + 1)) + δ/(n(n + 1)) ≤ δ,\n\nwhere in the last two inequalities we applied Chernoff-Hoeffding bounds.\n\nLemma 5 Let an arbitrary on-line algorithm output (not necessarily distinct) hypotheses H_0, ..., H_{n-1} when it is run on Z^n. Then for any 0 < δ ≤ 1 the following holds:\n\nP( min_{0 ≤ t ≤ n-1} ( risk(H_t) + 2 c_δ(n − t) ) ≥ M(Z^n)/n + √((2/n) ln(1/δ)) + 2 √((2/n) ln(n(n + 1)/δ)) ) ≤ δ.\n\nProof. We have\n\nmin_{0 ≤ t ≤ n-1} ( risk(H_t) + 2 c_δ(n − t) ) ≤ (1/n) Σ_{t=0}^{n-1} risk(H_t) + (2/n) Σ_{s=1}^{n} c_δ(s) ≤ (1/n) Σ_{t=0}^{n-1} risk(H_t) + 2 √((2/n) ln(n(n + 1)/δ)),\n\nwhere the last inequality follows from Σ_{s=1}^{n} 1/√s ≤ 2√n. Therefore the probability in the statement is at most P( (1/n) Σ_{t=0}^{n-1} risk(H_t) ≥ M(Z^n)/n + √((2/n) ln(1/δ)) ) ≤ δ by Lemma 1 (with K = 1).\n\nProof (of Theorem 3). The proof follows by combining Lemma 4 and Lemma 5, each applied with δ/2 in place of δ, and by overapproximating the square root terms therein.\n\n5 Applications\n\nFor the sake of concreteness we now sketch two generalization bounds which can be obtained through a direct application of our techniques.\n\nThe p-norm Perceptron algorithm [10, 9] is a linear threshold algorithm which keeps in the t-th trial a weight vector w_t ∈ R^d. On instance x_t ∈ X = R^d the algorithm predicts by sign(w_t · x_t) ∈ {−1, +1}. 
If the algorithm's prediction is wrong (i.e., if sign(w_t · x_t) ≠ y_t) then the algorithm performs the weight update w_{t+1} = f(θ_t + y_t x_t), where θ_t = f^{-1}(w_t) and f = (f_1, ..., f_d) is the mapping with components f_j(θ) = sign(θ_j) |θ_j|^{p-1} / ‖θ‖_p^{p-2}; otherwise w_{t+1} = w_t. Setting p = 2 yields the classical Perceptron algorithm [17]. On the other hand, setting p of order ln d gets an algorithm which performs like a multiplicative algorithm, such as the Normalized Winnow algorithm [10]. Applying Theorem 3 to the bound on the number of mistakes for the p-norm Perceptron algorithm shown in [9], we immediately obtain that, with probability at least 1 − δ with respect to the draw of the training sample Z^n, the risk risk(Ĥ) of the penalized estimator Ĥ is at most 
\n\n(1/n) ( D_γ(u; Z^n)/γ + (p − 1) (X_p/γ)² + (X_p/γ) √((p − 1) D_γ(u; Z^n)/γ) ) + 6 √((1/n) ln(2(n + 1)/δ))   (4)\n\nfor any γ > 0 and any u ∈ R^d such that ‖u‖_q = 1, where q = p/(p − 1) is the dual norm index, X_p = max_{1 ≤ t ≤ n} ‖x_t‖_p, and D_γ(u; Z^n) = Σ_{t=1}^n max{0, γ − y_t u · x_t}. Notice that (4) holds for any such choice of γ and u. 
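The p-norm Perceptron update can be sketched as follows. This is a minimal rendering in our own notation, not the exact implementation of [10, 9]: the link function f is taken to be the gradient of ½‖θ‖_p², p = 2 recovers the classical Perceptron, and all names are ours.

```python
import math

def pnorm_link(theta, p):
    # Link function f = grad(0.5 * ||theta||_p^2): maps the additively
    # updated vector theta to the weights actually used for prediction.
    norm = sum(abs(v) ** p for v in theta) ** (1.0 / p)
    if norm == 0.0:
        return [0.0] * len(theta)
    return [math.copysign(abs(v) ** (p - 1), v) / norm ** (p - 2) for v in theta]

def pnorm_perceptron(examples, p):
    # One pass of a p-norm Perceptron: predict with sign(f(theta) . x);
    # on a mistake, add y * x to theta.  With p = 2 the link is the
    # identity and this is the classical Perceptron; a large p (around
    # 2 ln d) makes the update behave like a multiplicative algorithm.
    d = len(examples[0][0])
    theta = [0.0] * d
    mistakes = 0
    for x, y in examples:
        w = pnorm_link(theta, p)
        y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
        if y_hat != y:
            mistakes += 1
            theta = [ti + y * xi for ti, xi in zip(theta, x)]
    return theta, mistakes
```

The mistake count returned here is exactly the cumulative 0-1 loss M(Z^n) that enters the tail bound through Theorem 3.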
The margin-based quantity D_γ(u; Z^n) = Σ_{t=1}^n max{0, γ − y_t u · x_t} is called soft margin in [20] and accounts for the distribution of margin values achieved by the examples in Z^n. Traditional data-dependent bounds using uniform convergence methods (e.g., [19]) are typically expressed in terms of the sample margin, i.e., in terms of the fraction of training points whose margin is at most γ. The ratio D_γ(u; Z^n)/(γ n) occurring in (4) has a similar flavor, though the two ratios are, in general, incomparable. We remark that bound (4) does not have the extra log factors appearing in the analyses based on uniform convergence. Furthermore, it is significantly better than the bound in [20] whenever D_γ(u; Z^n)/(γ n) is constant, which typically occurs when the data sequence is not linearly separable.\n\nAs a second application, we consider the ridge regression algorithm [12] for square loss. Assume X = R^d and Y = [−Y, Y] for some Y > 0. 
This algorithm computes at the beginning of the t-th trial the vector w_t ∈ R^d which minimizes a ‖w‖² + Σ_{i < t} (y_i − w · x_i)², where a > 0 is a fixed parameter. On instance x_t the algorithm predicts with H_{t-1}(x_t) = clip(w_t · x_t), where the “clipping” function is defined by clip(v) = v if |v| ≤ Y and clip(v) = Y sign(v) otherwise. The losses (clip(w_t · x_t) − y_t)² are thus bounded by 4Y². We can apply Theorem 2 to the bound on the cumulative loss M(Z^n) for ridge regression (see [22, 2]) and obtain that, with probability at least 1 − δ with respect to the draw of the training sample Z^n, the risk of the average hypothesis estimator H̄ is at most\n\n(1/n) ( inf_u ( a ‖u‖² + Σ_{t=1}^n (y_t − u · x_t)² ) + 4Y² ln det(I + S Sᵀ/a) ) + 4Y² √((2/n) ln(1/δ)),   (5)\n\nwhere det denotes the determinant of a matrix, I is the d-dimensional identity matrix, S = [x_1 ... x_n] is the matrix whose columns are the data vectors, and Sᵀ is the transpose of S. Note that ln det(I + S Sᵀ/a) = Σ_i ln(1 + λ_i/a), where the λ_i's are the eigenvalues of S Sᵀ.2 Simple linear algebra shows that the nonzero eigenvalues of S Sᵀ are the same as the nonzero eigenvalues of the Gram matrix Sᵀ S. Risk bounds in terms of the eigenvalues of the Gram matrix were also derived in [23]; we defer to the full paper a comparison between these results and ours. Finally, our bound applies also to kernel ridge regression [18] by replacing the eigenvalues of S Sᵀ with the eigenvalues of the kernel Gram matrix, for the kernel being considered.\n\nReferences\n\n[1] Angluin, D. 
Queries and concept learning. Machine Learning, 2(4), 319–342, 1988.\n\n[2] K. Azoury and M. K. Warmuth. Relative loss bounds for on-line density estimation with the exponential family of distributions. Machine Learning, 43:211–246, 2001.\n\n[3] K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68, 357–367, 1967.\n\n2Using a slightly different linear regression algorithm, Forster and Warmuth [7] have proven a sharper bound on the expected relative loss. In particular, they have exhibited an algorithm computing hypotheses whose expected risk exceeds the risk of the best linear predictor by a quantity of order d/n.\n\n[4] A. Blum, A. Kalai, and J. Langford. Beating the hold-out: bounds for k-fold and progressive cross-validation. In 12th COLT, pages 203–208, 1999.\n\n[5] S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16, 277–292, 2000.\n\n[6] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. P. Helmbold, R. E. Schapire, and M. K. Warmuth. How to use expert advice. Journal of the ACM, 44(3), 427–485, 1997.\n\n[7] J. Forster and M. K. Warmuth. Relative expected instantaneous loss bounds. In 13th COLT, 90–99, 2000.\n\n[8] Y. Freund and R. Schapire. Large margin classification using the perceptron algorithm. Machine Learning, 37(3), 277–296, 1999.\n\n[9] C. Gentile. The robustness of the p-norm algorithms. Manuscript, 2001. An extended abstract (co-authored with N. Littlestone) appeared in 12th COLT, 1–11, 1999.\n\n[10] A. J. Grove, N. Littlestone, and D. Schuurmans. General convergence results for linear discriminant updates. Machine Learning, 43(3), 173–210, 2001.\n\n[11] D. Helmbold and M. K. Warmuth. On weak learning. JCSS, 50(3), 551–573, June 1995.\n\n[12] A. Hoerl and R. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12, 55–67, 1970.\n\n[13] J. Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1), 1–64, 1997.\n\n[14] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2, 285–318, 1988.\n\n[15] N. Littlestone. From on-line to batch learning. In 2nd COLT, 269–284, 1989.\n\n[16] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2), 212–261, 1994.\n\n[17] F. Rosenblatt. 
Principles of neurodynamics: Perceptrons and the theory of brain mechanisms. Spartan Books, Washington, D.C., 1962.\n\n[18] C. Saunders, A. Gammerman, and V. Vovk. Ridge regression learning algorithm in dual variables. In 15th ICML, 1998.\n\n[19] J. Shawe-Taylor, P. Bartlett, R. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. IT, 44, 1926–1940, 1998.\n\n[20] J. Shawe-Taylor and N. Cristianini. On the generalization of soft margin algorithms, 2000. NeuroCOLT2 Tech. Rep. 2000-082, http://www.neurocolt.org.\n\n[21] V. N. Vapnik. Statistical learning theory. J. Wiley and Sons, NY, 1998.\n\n[22] V. Vovk. Competitive on-line linear regression. In NIPS*10, 1998. Also: Tech. Rep. CSD-TR-97-13, Department of Computer Science, Royal Holloway, University of London, 1997.\n\n[23] R. C. Williamson, J. Shawe-Taylor, B. Schölkopf, and A. J. Smola. Sample based generalization bounds. IEEE Trans. IT, to appear.\n", "award": [], "sourceid": 2113, "authors": [{"given_name": "Nicol\u00f2", "family_name": "Cesa-bianchi", "institution": null}, {"given_name": "Alex", "family_name": "Conconi", "institution": null}, {"given_name": "Claudio", "family_name": "Gentile", "institution": null}]}