{"title": "Learning with Incremental Iterative Regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 1630, "page_last": 1638, "abstract": "Within a statistical learning setting, we propose and study an iterative regularization algorithm for least squares defined by an incremental gradient method. In particular, we show that, if all other parameters are fixed a priori, the number of passes over the data (epochs) acts as a regularization parameter, and prove strong universal consistency, i.e. almost sure convergence of the risk, as well as sharp finite sample bounds for the iterates. Our results are a step towards understanding the effect of multiple epochs in stochastic gradient techniques in machine learning and rely on integrating statistical and optimization results.", "full_text": "Learning with Incremental Iterative Regularization\n\nLorenzo Rosasco\n\nDIBRIS, Univ. Genova, ITALY\n\nLCSL, IIT & MIT, USA\nlrosasco@mit.edu\n\nSilvia Villa\n\nLCSL, IIT & MIT, USA\n\nSilvia.Villa@iit.it\n\nAbstract\n\nWithin a statistical learning setting, we propose and study an iterative regularization\nalgorithm for least squares defined by an incremental gradient method. In\nparticular, we show that, if all other parameters are fixed a priori, the number of\npasses over the data (epochs) acts as a regularization parameter, and prove strong\nuniversal consistency, i.e. almost sure convergence of the risk, as well as sharp\nfinite sample bounds for the iterates. Our results are a step towards understanding\nthe effect of multiple epochs in stochastic gradient techniques in machine learning\nand rely on integrating statistical and optimization results.\n\n1 Introduction\n\nMachine learning applications often require efficient statistical procedures to process potentially\nmassive amounts of high dimensional data. 
Motivated by such applications, the broad objective of\nour study is designing learning procedures with optimal statistical properties, and, at the same time,\ncomputational complexities proportional to the generalization properties allowed by the data, rather\nthan their raw amount [6]. We focus on iterative regularization as a viable approach towards this\ngoal. The key observation behind these techniques is that iterative optimization schemes applied to\nscattered, noisy data exhibit a self-regularizing property, in the sense that early termination (early-\nstop) of the iterative process has a regularizing effect [21, 24]. Indeed, iterative regularization algo-\nrithms are classical in inverse problems [15], and have been recently considered in machine learning\n[36, 34, 3, 5, 9, 26], where they have been proved to achieve optimal learning bounds, matching\nthose of variational regularization schemes such as Tikhonov [8, 31].\nIn this paper, we consider an iterative regularization algorithm for the square loss, based on a recur-\nsive procedure processing one training set point at each iteration. Methods of the latter form, often\nbroadly referred to as online learning algorithms, have become standard in the processing of large\ndata-sets, because of their low iteration cost and good practical performance. Theoretical studies\nfor this class of algorithms have been developed within different frameworks. In composite and\nstochastic optimization [19, 20, 29], in online learning, a.k.a. sequential prediction [11], and \ufb01nally,\nin statistical learning [10]. The latter is the setting of interest in this paper, where we aim at devel-\noping an analysis keeping into account simultaneously both statistical and computational aspects.\nTo place our contribution in context, it is useful to emphasize the role of regularization and different\nways in which it can be incorporated in online learning algorithms. 
The key idea of regularization\nis that controlling the complexity of a solution can help avoid overfitting and ensure stability and\ngeneralization [33]. Classically, regularization is achieved by penalizing the objective function with\nsome suitable functional, or by minimizing the risk on a restricted space of possible solutions [33].\nModel selection is then performed to determine the amount of regularization suitable for the data\nat hand. More recently, there has been an interest in alternative, possibly more efficient, ways to\nincorporate regularization. We mention in particular [1, 35, 32] where there is no explicit regularization\nby penalization, and the step-size of an iterative procedure is shown to act as a regularization\nparameter. Here, for each fixed step-size, each data point is processed once, but multiple passes are\ntypically needed to perform model selection (that is, to pick the best step-size). We also mention\n[22] where an interesting adaptive approach is proposed, which seemingly avoids model selection\nunder certain assumptions.\nIn this paper, we consider a different regularization strategy, widely used in practice. Namely, we\nconsider no explicit penalization, fix the step size a priori, and analyze the effect of the number of\npasses over the data, which becomes the only free parameter to avoid overfitting, i.e. regularize.\nThe associated regularization strategy, that we dub incremental iterative regularization, is hence\nbased on early stopping. The latter is a well known \u201ctrick\u201d, for example in training large neural\nnetworks [18], and is known to perform very well in practice [16]. Interestingly, early stopping\nwith the square loss has been shown to be related to boosting [7], see also [2, 17, 36]. Our goal\nhere is to provide a theoretical understanding of the generalization property of the above heuristic\nfor incremental/online techniques. 
Towards this end, we analyze the behavior of both the excess\nrisk and the iterates themselves. For the latter we obtain sharp \ufb01nite sample bounds matching those\nfor Tikhonov regularization in the same setting. Universal consistency and \ufb01nite sample bounds for\nthe excess risk can then be easily derived, albeit possibly suboptimal. Our results are developed\nin a capacity independent setting [12, 30], that is under no conditions on the covering or entropy\nnumbers [30].\nIn this sense our analysis is worst case and dimension free. To the best of our\nknowledge the analysis in the paper is the \ufb01rst theoretical study of regularization by early stopping\nin incremental/online algorithms, and thus a \ufb01rst step towards understanding the effect of multiple\npasses of stochastic gradient for risk minimization.\nThe rest of the paper is organized as follows. In Section 2 we describe the setting and the main\nassumptions, and in Section 3 we state the main results, discuss them and provide the main elements\nof the proof, which is deferred to the supplementary material. In Section 4 we present some experi-\nmental results on real and synthetic datasets.\nNotation We denote by R+ = [0, +1[ , R++ = ]0, +1[ , and N\u21e4 = N \\{0}. Given a normed\nspace B and linear operators (Ai)1\uf8ffi\uf8ffm, Ai : B!B for every i, their composition Am \u00b7\u00b7\u00b7 A1\nwill be denoted asQm\ni=j Ai = I, where I is the identity\nof B. The operator norm will be denoted by k\u00b7k and the Hilbert-Schmidt norm by k\u00b7k HS. Also, if\nj > m, we setPm\n\ni=1 Ai. By convention, if j > m, we setQm\n\ni=j Ai = 0.\n\n2 Setting and Assumptions\n\nWe \ufb01rst describe the setting we consider, and then introduce and discuss the main assumptions that\nwill hold throughout the paper. We build on ideas proposed in [13, 27] and further developed in a\nseries of follow up works [8, 3, 28, 9]. 
Unlike these papers where a reproducing kernel Hilbert space\n(RKHS) setting is considered, here we consider a formulation within an abstract Hilbert space. As\ndiscussed in Appendix A, results in a RKHS can be recovered as a special case. The formulation\nwe consider is close to the setting of functional regression [25] and reduces to standard linear\nregression if $H$ is finite dimensional, see Appendix A.\nLet $H$ be a separable Hilbert space with inner product and norm denoted by $\\langle \\cdot, \\cdot \\rangle_H$ and $\\|\\cdot\\|_H$. Let\n$(X, Y)$ be a pair of random variables on a probability space $(\\Omega, \\mathfrak{S}, \\mathbb{P})$, with values in $H$ and $\\mathbb{R}$,\nrespectively. Denote by $\\rho$ the distribution of $(X, Y)$, by $\\rho_X$ the marginal measure on $H$, and by\n$\\rho(\\cdot|x)$ the conditional measure on $\\mathbb{R}$ given $x \\in H$. Considering the square loss function, the problem\nunder study is the minimization of the risk,\n\n$$\\inf_{w \\in H} \\mathcal{E}(w), \\qquad \\mathcal{E}(w) = \\int_{H \\times \\mathbb{R}} (\\langle w, x \\rangle_H - y)^2 \\, d\\rho(x, y), \\tag{1}$$\n\nprovided the distribution $\\rho$ is fixed but known only through a training set\n$z = \\{(x_1, y_1), \\dots, (x_n, y_n)\\}$, that is a realization of $n \\in \\mathbb{N}^*$ independent identical copies of $(X, Y)$.\nIn the following, we measure the quality of an approximate solution $\\hat{w} \\in H$ (an estimator) considering\nthe excess risk\n\n$$\\mathcal{E}(\\hat{w}) - \\inf_H \\mathcal{E}. \\tag{2}$$\n\nIf the set of solutions of Problem (1) is non empty, that is $\\mathcal{O} = \\operatorname{argmin}_H \\mathcal{E} \\neq \\varnothing$, we also consider\n\n$$\\|\\hat{w} - w^\\dagger\\|_H, \\quad \\text{where } w^\\dagger = \\operatorname{argmin}_{w \\in \\mathcal{O}} \\|w\\|_H. \\tag{3}$$\n\nMore precisely we are interested in deriving almost sure convergence results and finite sample\nbounds on the above error measures. This requires making some assumptions that we discuss next.\nWe make throughout the following basic assumption.\nAssumption 1. There exist $M \\in {]0, +\\infty[}$ and $\\kappa \\in {]0, +\\infty[}$ such that $|y| \\leq M$ $\\rho$-almost surely, and\n$\\|x\\|_H^2 \\leq \\kappa$ $\\rho_X$-almost surely.\nThe above assumption is fairly standard. 
The boundedness assumption on the output is satisfied in\nclassification, see Appendix A, and can be easily relaxed, see e.g. [8]. The boundedness assumption\non the input can also be relaxed, but the resulting analysis is more involved. We omit these developments\nfor the sake of clarity. It is well known that (see e.g. [14]), under Assumption 1, the risk is a\nconvex and continuous functional on $L^2(H, \\rho_X)$, the space of square-integrable functions with norm\n$\\|f\\|_\\rho^2 = \\int_H |f(x)|^2 \\, d\\rho_X(x)$. The minimizer of the risk on $L^2(H, \\rho_X)$ is the regression function\n$f_\\rho(x) = \\int y \\, d\\rho(y|x)$ for $\\rho_X$-almost every $x \\in H$. By considering Problem (1) we are restricting\nthe search for a solution to linear functions. Note that, since $H$ is in general infinite dimensional,\nthe minimum in (1) might not be achieved. Indeed, bounds on the error measures in (2) and (3)\ndepend on if, and how well, the regression function can be linearly approximated. The following\nassumption quantifies in a precise way such a requirement.\nAssumption 2. Consider the space $\\mathcal{L}_\\rho = \\{f : H \\to \\mathbb{R} \\mid \\exists w \\in H \\text{ with } f(x) = \\langle w, x \\rangle \\ \\rho_X\\text{-a.s.}\\}$,\nand let $\\overline{\\mathcal{L}_\\rho}$ be its closure in $L^2(H, \\rho_X)$. Moreover, consider the operator\n\n$$L : L^2(H, \\rho_X) \\to L^2(H, \\rho_X), \\qquad Lf(x) = \\int_H \\langle x, x' \\rangle f(x') \\, d\\rho_X(x'), \\quad \\forall f \\in L^2(H, \\rho_X). \\tag{4}$$\n\nDefine $g_\\rho = \\operatorname{argmin}_{g \\in \\overline{\\mathcal{L}_\\rho}} \\|f_\\rho - g\\|_\\rho$. Let $r \\in [0, +\\infty[$, and assume that\n\n$$(\\exists g \\in L^2(H, \\rho_X)) \\quad \\text{such that} \\quad g_\\rho = L^r g. \\tag{5}$$\n\nThe above assumption is standard in the context of RKHS [8]. Since its statement is somewhat\ntechnical, and we provide a formulation in a Hilbert space with respect to the usual RKHS setting,\nwe further comment on its interpretation. We begin noting that $\\mathcal{L}_\\rho$ is the space of linear functions\nindexed by $H$ and is a proper subspace of $L^2(H, \\rho_X)$ \u2013 if Assumption 1 holds. 
Moreover, under\nthe same assumption, it is easy to see that the operator $L$ is linear, self-adjoint, positive definite and\ntrace class, hence compact, so that its fractional power in (4) is well defined. Most importantly, the\nfollowing equality, which is analogous to Mercer\u2019s theorem [30], can be shown fairly easily:\n\n$$\\mathcal{L}_\\rho = L^{1/2}\\big(L^2(H, \\rho_X)\\big). \\tag{6}$$\n\nThis last observation allows us to provide an interpretation of Condition (5). Indeed, given (6), for\n$r = 1/2$, Condition (5) states that $g_\\rho$ belongs to $\\mathcal{L}_\\rho$, rather than its closure. In this case, Problem (1)\nhas at least one solution, and the set $\\mathcal{O}$ in (3) is not empty. Vice versa, if $\\mathcal{O} \\neq \\varnothing$ then $g_\\rho \\in \\mathcal{L}_\\rho$,\nand $w^\\dagger$ is well-defined. If $r > 1/2$ the condition is stronger than for $r = 1/2$, for the subspaces\n$L^r(L^2(H, \\rho_X))$ are nested subspaces of $L^2(H, \\rho_X)$ for increasing $r$.\u00b9\n\n\u00b9If $r < 1/2$ then the regression function does not have a best linear approximation since $g_\\rho \\notin \\mathcal{L}_\\rho$, and in\nparticular, for $r = 0$ we are making no assumption. Intuitively, for $0 < r < 1/2$, the condition quantifies how\nfar $g_\\rho$ is from $\\mathcal{L}_\\rho$, that is, from being well approximated by a linear function.\n\n2.1 Iterative Incremental Regularized Learning\n\nThe learning algorithm we consider is defined by the following iteration.\n\nLet $\\hat{w}_0 \\in H$ and $\\gamma \\in \\mathbb{R}_{++}$. Consider the sequence $(\\hat{w}_t)_{t \\in \\mathbb{N}}$ generated through the following\nprocedure: given $t \\in \\mathbb{N}$, define\n\n$$\\hat{w}_{t+1} = \\hat{u}_t^n, \\tag{7}$$\n\nwhere $\\hat{u}_t^n$ is obtained at the end of one cycle, namely as the last step of the recursion\n\n$$\\hat{u}_t^0 = \\hat{w}_t; \\qquad \\hat{u}_t^i = \\hat{u}_t^{i-1} - \\frac{\\gamma}{n} \\big(\\langle \\hat{u}_t^{i-1}, x_i \\rangle_H - y_i\\big) x_i, \\quad i = 1, \\dots, n. \\tag{8}$$\n\nEach cycle, called an epoch, corresponds to one pass over data. 
The above iteration can be seen as\nthe incremental gradient method [4, 19] for the minimization of the empirical risk corresponding to\n$z$, that is the functional\n\n$$\\hat{\\mathcal{E}}(w) = \\frac{1}{n} \\sum_{i=1}^n (\\langle w, x_i \\rangle_H - y_i)^2 \\tag{9}$$\n\n(see also Section B.2). Indeed, there is a vast literature on how the iterations (7), (8) can be used to\nminimize the empirical risk [4, 19]. Unlike these studies, in this paper we are interested in how the\niterations (7), (8) can be used to approximately minimize the risk $\\mathcal{E}$. The key idea is that while $\\hat{w}_t$ is\nclose to a minimizer of the empirical risk when $t$ is sufficiently large, a good approximate solution\nof Problem (1) can be found by terminating the iterations earlier (early stopping). The analysis in\nthe next few sections grounds theoretically this latter intuition.\nRemark 1 (Representer theorem). Let $H$ be a RKHS of functions from $\\mathcal{X}$ to $\\mathcal{Y}$ defined by a kernel\n$K : \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$. Let $\\hat{w}_0 = 0$, then the iteration after $t$ epochs of the algorithm in (7)-(8) can\nbe written as $\\hat{w}_t(\\cdot) = \\sum_{k=1}^n (\\alpha_t)_k K_{x_k}$, for suitable coefficients $\\alpha_t = ((\\alpha_t)_1, \\dots, (\\alpha_t)_n) \\in \\mathbb{R}^n$,\nupdated as follows:\n\n$$\\alpha_{t+1} = c_t^n, \\qquad c_t^0 = \\alpha_t, \\qquad (c_t^i)_k = \\begin{cases} (c_t^{i-1})_k - \\frac{\\gamma}{n} \\Big( \\sum_{j=1}^n K(x_i, x_j) (c_t^{i-1})_j - y_i \\Big), & k = i \\\\ (c_t^{i-1})_k, & k \\neq i \\end{cases}$$\n\n3 Early stopping for incremental iterative regularization\n\nIn this section, we present and discuss the main results of the paper, together with a sketch of the\nproof. The complete proofs can be found in Appendix B. We first present convergence results and\nthen finite sample bounds for the quantities in (2) and (3).\n\nTheorem 1. In the setting of Section 2, let Assumption 1 hold. Let $\\gamma \\in {]0, \\kappa^{-1}]}$. Then the following\nhold:\n\n(i) If we choose a stopping rule $t^* : \\mathbb{N}^* \\to \\mathbb{N}^*$ such that\n\n$$\\lim_{n \\to +\\infty} t^*(n) = +\\infty \\quad \\text{and} \\quad \\lim_{n \\to +\\infty} \\frac{t^*(n)^3 \\log n}{n} = 0 \\tag{10}$$\n\nthen\n\n$$\\lim_{n \\to +\\infty} \\mathcal{E}(\\hat{w}_{t^*(n)}) - \\inf_{w \\in H} \\mathcal{E}(w) = 0 \\quad \\mathbb{P}\\text{-almost surely}. \\tag{11}$$\n\n(ii) Suppose additionally that the set $\\mathcal{O}$ of minimizers of (1) is nonempty and let $w^\\dagger$ be defined\nas in (3). If we choose a stopping rule $t^* : \\mathbb{N}^* \\to \\mathbb{N}^*$ satisfying the conditions in (10) then\n\n$$\\|\\hat{w}_{t^*(n)} - w^\\dagger\\|_H \\to 0 \\quad \\mathbb{P}\\text{-almost surely}. \\tag{12}$$\n\nThe above result shows that for an a priori fixed step-size, consistency is achieved by computing a\nsuitable number $t^*(n)$ of iterations of algorithm (7)-(8) given $n$ points. The number of required\niterations tends to infinity as the number of available training points increases. Condition (10) can\nbe interpreted as an early stopping rule, since it requires the number of epochs not to grow too fast.\nIn particular, this excludes the choice $t^*(n) = 1$ for all $n \\in \\mathbb{N}^*$, namely considering only one pass\nover the data. In the following remark we show that, if we let $\\gamma = \\gamma(n)$ depend on the length of\none epoch, convergence is recovered also for one pass.\nRemark 2 (Recovering Stochastic Gradient descent). Algorithm in (7)-(8) for $t = 1$ is a stochastic\ngradient descent (one pass over a sequence of i.i.d. data) with stepsize $\\gamma/n$. Choosing\n$\\gamma(n) = \\kappa^{-1} n^\\alpha$, with $\\alpha < 1/5$ in Algorithm (7)-(8), we can derive almost sure convergence of $\\mathcal{E}(\\hat{w}_1) - \\inf_H \\mathcal{E}$\nas $n \\to +\\infty$ relying on a similar proof to that of Theorem 1.\nTo derive finite sample bounds further assumptions are needed. Indeed, we will see that the behavior\nof the bias of the estimator depends on the smoothness Assumption 2. We are in position to state\nour main result, giving a finite sample bound.\n\nTheorem 2 (Finite sample bounds in H). In the setting of Section 2, let $\\gamma \\in {]0, \\kappa^{-1}]}$ for every $t \\in \\mathbb{N}$.\nSuppose that Assumption 2 is satisfied for some $r \\in {]1/2, +\\infty[}$. 
Then the set O of minimizers of (1)\nis nonempty, and w\u2020 in (3) is well de\ufb01ned. Moreover, the following hold:\n(i) There exists c 2 ]0, +1[ such that, for every t 2 N\u21e4, with probability greater than 1  ,\nk \u02c6wt  w\u2020kH \uf8ff\n(ii) For the stopping rule t\u21e4 : N\u21e4 ! N\u21e4 : t\u21e4(n) =\u2303n\nk \u02c6wt\u21e4(n)  w\u2020kH \uf8ff2432 log\n\n2r+1\u2325, with probability greater than 1  ,\nkgk\u21e235 n\n\n2\u2318 t +\u2713r  1\n \u25c6r 1\n2+\u2713 r  1\n\n M\uf8ff1/2+ 2M 2\uf8ff1+ 3kgk\u21e2\uf8ffr 3\n\n\u21e3M\uf8ff1/2 + 2M 2\uf8ff1 + 3kgk\u21e2\uf8ffr 3\n\n \u25c6r 1\n\n32 log 16\npn\n\n2r. (13)\n\nkgk\u21e2t\n\nr 1\n2r+1 .\n\n2\n\n(14)\n\n16\n\n2\n\n1\n\n2\n\n2\n\n2\n\n1\n\nThe dependence on \uf8ff suggests that a big \uf8ff, which corresponds to a small , helps in decreasing the\nsample error, but increases the approximation error. Next we present the result for the excess risk.\nWe consider only the attainable case, that is the case r > 1/2 in Assumption 2. The case r \uf8ff 1/2\nis deferred to Appendix A, since both the proof and the statement are conceptually similar to the\nattainable case.\nTheorem 3 (Finite sample bounds for the risk \u2013 attainable case). In the setting of Section 2, let\n\nAssumptions 1 holds, and let  2\u21e40,\uf8ff 1\u21e4. Let Assumption 2 be satis\ufb01ed for some r 2 ]1/2, +1].\nThen the following hold:\n(i) For every t 2 N\u21e4, with probability greater than 1  ,\n\nE( \u02c6wt)  inf\n\nH E\uf8ff\n\n232 log(16/)2\n\nn\n\n(ii) For the stopping rule t\u21e4 : N\u21e4 ! 
N\u21e4 : t\u21e4(n) =\u2303n\n\nE( \u02c6wt\u21e4(n))  inf\n\nH E\uf8ff \"8\u271332 log\n\nhM + 2M 2\uf8ff1/2 + 3\uf8ffrkgk\u21e2i2\n\nt2 + 2\u2713 r\nt\u25c62r\n\n1\n\n\u21e2\n\n(15)\n\nkgk2\n2(1+r)\u2325, with probability greater than 1  ,\n\u21e2# nr/(r+1)\nkgk2\n\n+ 2\u2713 r\n\u25c62r\n\n16\n\n \u25c62\u21e3M + 2M 2\uf8ff1/2 + 3\uf8ffrkgk\u21e2\u23182\n\n(16)\n\nEquations (13) and (15) arise from a form of bias-variance (sample-approximation) decomposition\nof the error. Choosing the number of epochs that optimize the bounds in (13) and (15) we derive\na priori stopping rules and corresponding bounds (14) and (16). Again, these results con\ufb01rm that\nthe number of epochs acts as a regularization parameter and the best choices following from equa-\ntions (13) and (15) suggest multiple passes over the data to be bene\ufb01cial. In both cases, the stopping\nrule depends on the smoothness parameter r which is typically unknown, and hold-out cross vali-\ndation is often used in practice. Following [9], it is possible to show that this procedure allows to\nadaptively achieve the same convergence rate as in (16).\n\n3.1 Discussion\n\nIn Theorem 2, the obtained bound can be compared to known lower bounds, as well as to pre-\nvious results for least squares algorithms obtained under Assumption 2. Minimax lower bounds\nand individual lower bounds [8, 31], suggest that, for r > 1/2, O(n(r1/2)/(2r+1)) is the optimal\ncapacity-independent bound for the H norm2. In this sense, Theorem 2 provides sharp bounds on\nthe iterates. Bounds can be improved only under stronger assumptions, e.g. on the covering num-\nbers or on the eigenvalues of L [30]. This question is left for future work. The lower bounds for\nthe excess risk [8, 31] are of the form O(n2r/(2r+1)) and in this case the results in Theorems 3\nand 7 are not sharp. 
Our results can be contrasted with online learning algorithms that use step-size\nas regularization parameter. Optimal capacity independent bounds are obtained in [35], see also\n[32] and indeed such results can be further improved considering capacity assumptions, see [1] and\nreferences therein. Interestingly, our results can also be contrasted with non incremental iterative\nregularization approaches [36, 34, 3, 5, 9, 26]. Our results show that incremental iterative regularization,\nwith distribution independent step-size, behaves as a batch gradient descent, at least in terms\nof iterates convergence. Proving advantages of incremental regularization over the batch one is an\ninteresting future research direction. Finally, we note that optimal capacity independent and dependent\nbounds are known for several least squares algorithms, including Tikhonov regularization, see\ne.g. [31], and spectral filtering methods [3, 9]. These algorithms are essentially equivalent from a\nstatistical perspective but different from a computational perspective.\n\n\u00b2In a recent manuscript, it has been proved that this is indeed the minimax lower bound (G. Blanchard,\npersonal communication).\n\n3.2 Elements of the proof\n\nThe proofs of the main results are based on a suitable decomposition of the error to be estimated as\nthe sum of two quantities that can be interpreted as a sample and an approximation error, respectively.\nBounds on these two terms are then provided. The main technical contribution of the paper is\nthe sample error bound. The difficulty in proving this result is due to multiple passes over the data,\nwhich induce statistical dependencies in the iterates.\n\nError decomposition. We consider an auxiliary iteration $(w_t)_{t \\in \\mathbb{N}}$ which is the expectation of the\niterations (7) and (8), starting from $w_0 \\in H$ with step-size $\\gamma \\in \\mathbb{R}_{++}$. 
More explicitly, the considered\niteration generates $w_{t+1}$ according to\n\n$$w_{t+1} = u_t^n, \\tag{17}$$\n\nwhere $u_t^n$ is given by\n\n$$u_t^0 = w_t; \\qquad u_t^i = u_t^{i-1} - \\frac{\\gamma}{n} \\int_{H \\times \\mathbb{R}} \\big(\\langle u_t^{i-1}, x \\rangle_H - y\\big) x \\, d\\rho(x, y). \\tag{18}$$\n\nIf we let $S : H \\to L^2(H, \\rho_X)$ be the linear map $w \\mapsto \\langle w, \\cdot \\rangle_H$, which is bounded by $\\sqrt{\\kappa}$ under\nAssumption 1, then it is well-known that [13]\n\n$$(\\forall t \\in \\mathbb{N}) \\quad \\mathcal{E}(\\hat{w}_t) - \\inf_H \\mathcal{E} = \\|S\\hat{w}_t - g_\\rho\\|_\\rho^2 \\leq 2\\|S\\hat{w}_t - Sw_t\\|_\\rho^2 + 2\\|Sw_t - g_\\rho\\|_\\rho^2 \\leq 2\\kappa\\|\\hat{w}_t - w_t\\|_H^2 + 2(\\mathcal{E}(w_t) - \\inf_H \\mathcal{E}). \\tag{19}$$\n\nIn this paper, we refer to the gap between the empirical and the expected iterates $\\|\\hat{w}_t - w_t\\|_H$ as the\nsample error, and to $A(t, \\gamma, n) = \\mathcal{E}(w_t) - \\inf_H \\mathcal{E}$ as the approximation error. Similarly, if $w^\\dagger$ (as\ndefined in (3)) exists, using the triangle inequality, we obtain\n\n$$\\|\\hat{w}_t - w^\\dagger\\|_H \\leq \\|\\hat{w}_t - w_t\\|_H + \\|w_t - w^\\dagger\\|_H. \\tag{20}$$\n\nProof main steps. In the setting of Section 2, we summarize the key steps to derive a general\nbound for the sample error (the proof of the behavior of the approximation error is more standard).\nThe bound on the sample error is derived through many technical lemmas and uses concentration\ninequalities applied to martingales (the crucial point is the inequality in STEP 5 below). Its complete\nderivation is reported in Appendix B.2. We introduce the additional linear operators: $T : H \\to H$,\n$T = S^* S$, and, for every $x \\in \\mathcal{X}$, $S_x : H \\to \\mathbb{R}$, $S_x w = \\langle w, x \\rangle$, and $T_x : H \\to H$, $T_x = S_x^* S_x$.\nMoreover, set $\\hat{T} = \\sum_{i=1}^n T_{x_i}/n$. 
We are now ready to state the main steps of the proof.\nSample error bound (STEP 1 to 5)\nSTEP 1 (see Proposition 1): Find equivalent formulations for the sequences \u02c6wt and wt:\n\n\u02c6wt+1 = (I   \u02c6T ) \u02c6wt + \u2713 1\nwt+1 = (I  T ) wt + S\u21e4g\u21e2 + 2(Awt  b),\n\nS\u21e4xj yj\u25c6 + 2\u21e3 \u02c6A \u02c6wt  \u02c6b\u2318\n\nnXj=1\n\nn\n\n6\n\n\fwhere\n\n\u02c6A =\n\nA =\n\n1\nn2\n\n1\nn2\n\nnXk=2\" nYi=k+1\u21e3I \nnXk=2\" nYi=k+1\u21e3I \n\n\nn\n\n\nn\n\nTxi\u2318# Txk\nT\u2318# T\nk1Xj=1\n\nk1Xj=1\n\nT,\n\nTxj , \u02c6b =\n\nb =\n\n1\nn2\n\n1\nn2\n\nnXk=2\" nYi=k+1\u21e3I \nnXk=2\" nYi=k+1\u21e3I \n\n\nn\n\n\nn\n\nTxi\u2318# Txk\nT\u2318# T\nk1Xj=1\n\nS\u21e4xj yj.\n\nk1Xj=1\n\nS\u21e4g\u21e2.\n\nSTEP 2 (see Lemma 5): Use the formulation obtained in STEP 1 to derive the following recursive\ninequality,\n\ni=1\n\n\u21e3k\n\n( \u02c6w0  w0) + \n\nt1Xk=0\u21e3I   \u02c6T +  \u02c6A\u2318tk+1\n\n\u02c6wt  wt =\u21e3I   \u02c6T + 2 \u02c6A\u2318t\n\u02c6S\u21e4xiyi  S\u21e4g\u21e2 + (b  \u02c6b).\nwith \u21e3k = (T  \u02c6T )wk + ( \u02c6A  A)wk + 1\nnPn\nSTEP 3 (see Lemmas 6 and 7): Initialize \u02c6w0 = w0 = 0, prove that kI   \u02c6T +  \u02c6Ak \uf8ff 1, and derive\nfrom STEP 2 that,\nkwkkH + t\u21e3 1\nk \u02c6wt  wtkH \uf8ff kT  \u02c6Tk + k \u02c6A  Ak t1Xk=0\nnXi=1\nkwtkH \uf8ff\u21e2max{\uf8ffr1/2, (t)1/2r}kgk\u21e2\n\nSTEP 4 (see Lemma 8): Let Assumption 2 hold for some r 2 R+ and g 2 L2(H,\u21e2 X). 
Prove that\n\n\u02c6S\u21e4xiyi  S\u21e4g\u21e2 + kb  \u02c6bk\u2318.\n\nif r 2 [0, 1/2[,\nif r 2 [1/2, +1[\n\nSTEP 5 (see Lemma 9 and Proposition 2: Prove that with probability greater than 1   the\nfollowing inequalities hold:\n\n\uf8ffr1/2kgk\u21e2\n\n(8t 2H )\n\nn\n\n32\uf8ff2\n3pn\n16\uf8ff\n3pn\n\nlog\n\nlog\n\n4\n\n\n2\n\n\n,\n\n,\n\nk \u02c6A  AkHS \uf8ff\n\n \u02c6T  THS \uf8ff\n\n32\uf8ffM 2\n3pn\n\nlog\n\n4\n\n\n,\n\nk\u02c6b  bkH \uf8ff\nnXi=1\n1\nn\n\n\n\nS\u21e4xiyi  S\u21e4g\u21e2H \uf8ff\n\n16p\uf8ffM\n3pn\n\nlog\n\n2\n\n\n.\n\nSTEP 6 (approximation error bound, see Theorem 6): Prove that, if Assumption 2 holds for\nsome r 2 ]0, +1[, then E(wt)  infH E\uf8ff r/t2rkgk2\n\u21e2. Moreover, if Assumption 2 holds with\nr = 1/2, then kwt  w\u2020kH ! 0, and if Assumption 2 holds for some r 2 ]1/2, +1[, then\nkwt  w\u2020kH \uf8ff r1/2\n\nSTEP 7: Plug the sample and approximation error bounds obtained in STEP 1-5 and STEP 6 in\n(19) and (20), respectively.\n\nt r1/2kgk\u21e2.\n\n4 Experiments\n\nSynthetic data. We consider a scalar linear regression problem with random design. The input\npoints (xi)1\uf8ffi\uf8ffn are uniformly distributed in [0, 1] and the output points are obtained as yi =\nhw\u21e4, (xi)i + Ni, where Ni is a Gaussian noise with zero mean and standard deviation 1 and =\n('k)1\uf8ffk\uf8ffd is a dictionary of functions whose k-th element is 'k(x) = cos((k1)x)+sin((k1)x).\nIn Figure 1, we plot the test error for d = 5 (with n = 80 in (a) and 800 in (b)). The plots show\nthat the number of the epochs acts as a regularization parameter, and that early stopping is bene\ufb01cial\nto achieve a better test error. Moreover, according to the theory, the experiments suggest that the\nnumber of performed epochs increases if the number of available training points increases.\nReal data. 
We tested the kernelized version of our algorithm (see Remark 1 and Appendix A)\non the cpuSmall\u00b3, Adult and Breast Cancer Wisconsin (Diagnostic)\u2074 real-world\ndatasets. We considered a subset of Adult, with n = 1600. The results are shown in Figure 2. A\ncomparison of the test errors obtained with the kernelized version of the method proposed in this\npaper (Kernel Incremental Iterative Regularization (KIIR)), Kernel Iterative Regularization (KIR),\nthat is the kernelized version of gradient descent, and Kernel Ridge Regression (KRR) is reported in\nTable 1. The results show that the test error of KIIR is comparable to that of KIR and KRR.\n\n\u00b3Available at http://www.cs.toronto.edu/~delve/data/comp-activ/desc.html\n\u2074Adult and Breast Cancer Wisconsin (Diagnostic), UCI repository, 2013.\n\nFigure 1: Test error as a function of the number of iterations. In (a), n = 80, and total number of\niterations of IIR is 8000, corresponding to 100 epochs. In (b), n = 800 and the total number of\nepochs is 400. The best test error is obtained for 9 epochs in (a) and for 31 epochs in (b).\n\nFigure 2: Training (orange) and validation (blue) classification errors obtained by KIIR on the\nBreast Cancer dataset as a function of the number of iterations. The test error increases after a\ncertain number of iterations, while the training error is \u201cdecreasing\u201d with the number of iterations.\n\nTable 1: Test error comparison on real datasets. Median values over 5 trials.\n\nDataset        ntr   d    Error Measure  KIIR    KRR     KIR\ncpuSmall       5243  12   RMSE           5.9125  3.6841  5.4665\nAdult          1600  123  Class. Err.    0.167   0.164   0.154\nBreast Cancer  400   30   Class. Err.    0.0118  0.0118  0.0237\n\nAcknowledgments\nThis material is based upon work supported by CBMM, funded by NSF STC award CCF-1231216,\nand by the MIUR FIRB project RBFR12M3AC. S. Villa is a member of GNAMPA of the Istituto\nNazionale di Alta Matematica (INdAM).\n\nReferences\n[1] F. Bach and A. Dieuleveut. Non-parametric stochastic approximation with large step sizes.\n\narXiv:1408.0361, 2014.\n\n[2] P. Bartlett and M. Traskin. Adaboost is consistent. J. Mach. Learn. Res., 8:2347\u20132368, 2007.\n[3] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. J. Complexity,\n\n23(1):52\u201372, 2007.\n\n[4] D. P. Bertsekas. A new class of incremental gradient methods for least squares problems. SIAM J. Optim.,\n\n7(4):913\u2013926, 1997.\n\n[5] G. Blanchard and N. Kr\u00e4mer. Optimal learning rates for kernel conjugate gradient regression. In Advances\n\nin Neural Inf. Proc. Systems (NIPS), pages 226\u2013234, 2010.\n\n[6] L. Bottou and O. Bousquet. The tradeoffs of large scale learning. In Suvrit Sra, Sebastian Nowozin, and\n\nStephen J. Wright, editors, Optimization for Machine Learning, pages 351\u2013368. MIT Press, 2011.\n\n[7] P. Buhlmann and B. Yu. Boosting with the l2 loss: Regression and classification. J. Amer. Stat. Assoc.,\n\n98:324\u2013339, 2003.\n\n[8] A. Caponnetto and E. De Vito. Optimal rates for regularized least-squares algorithm. Found. Comput.\n\nMath., 2006.\n\n[9] A. Caponnetto and Y. Yao. Cross-validation based adaptation for regularization operators in learning\n\ntheory. Anal. Appl., 08:161\u2013183, 2010.\n\n[10] N. Cesa-Bianchi, A. Conconi, and C. Gentile. On the generalization ability of on-line learning algorithms.\n\nIEEE Trans. Information Theory, 50(9):2050\u20132057, 2004.\n\n[11] N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.\n[12] F. Cucker and D. X. Zhou. 
Learning Theory: An Approximation Theory Viewpoint. Cambridge University\n\nPress, 2007.\n\n[13] E. De Vito, L. Rosasco, A. Caponnetto, U. De Giovannini, and F. Odone. Learning from examples as an\n\ninverse problem. J.Mach. Learn. Res., 6:883\u2013904, 2005.\n\n[14] E. De Vito, L. Rosasco, A. Caponnetto, M. Piana, and A. Verri. Some properties of regularized kernel\n\nmethods. Journal of Machine Learning Research, 5:1363\u20131390, 2004.\n\n[15] H. W. Engl, M. Hanke, and A. Neubauer. Regularization of inverse problems. Kluwer, 1996.\n[16] P.-S. Huang, H. Avron, T. Sainath, V. Sindhwani, and B. Ramabhadran. Kernel methods match deep\n\nneural networks on timit. In IEEE ICASSP, 2014.\n\n[17] W. Jiang. Process consistency for adaboost. Ann. Stat., 32:13\u201329, 2004.\n[18] Y. LeCun, L. Bottou, G. Orr, and K. Muller. Ef\ufb01cient backprop. In G. Orr and Muller K., editors, Neural\n\nNetworks: Tricks of the trade. Springer, 1998.\n\n[19] A. Nedic and D. P Bertsekas. Incremental subgradient methods for nondifferentiable optimization. SIAM\n\nJournal on Optimization, 12(1):109\u2013138, 2001.\n\n[20] A. Nemirovski, A. Juditsky, G. Lan, and A. Shapiro. Robust stochastic approximation approach to\n\nstochastic programming. SIAM J. Optim., 19(4):1574\u20131609, 2008.\n\n[21] A. Nemirovskii. The regularization properties of adjoint gradient method in ill-posed problems. USSR\n\nComputational Mathematics and Mathematical Physics, 26(2):7\u201316, 1986.\n\n[22] F. Orabona. Simultaneous model selection and optimization through parameter-free stochastic learning.\n\nNIPS Proceedings, 2014.\n\n[23] I. Pinelis. Optimum bounds for the distributions of martingales in Banach spaces. Ann. Probab.,\n\n22(4):1679\u20131706, 1994.\n\n[24] B. Polyak. Introduction to Optimization. Optimization Software, New York, 1987.\n[25] J. Ramsay and B. Silverman. Functional Data Analysis. Springer-Verlag, New York, 2005.\n[26] G. Raskutti, M. Wainwright, and B. Yu. 
Early stopping for non-parametric regression: An optimal\n\ndata-dependent stopping rule. In 49th Annual Allerton Conference, pages 1318\u20131325. IEEE, 2011.\n\n[27] S. Smale and D. Zhou. Shannon sampling II: Connections to learning theory. Appl. Comput. Harmon.\n\nAnal., 19(3):285\u2013302, November 2005.\n\n[28] S. Smale and D.-X. Zhou. Learning theory estimates via integral operators and their approximations.\n\nConstr. Approx., 26(2):153\u2013172, 2007.\n\n[29] N. Srebro, K. Sridharan, and A. Tewari. Optimistic rates for learning with a smooth loss. arXiv:1009.3896,\n\n2012.\n\n[30] I. Steinwart and A. Christmann. Support Vector Machines. Springer, 2008.\n[31] I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT,\n\n2009.\n\n[32] P. Tarr\u00e8s and Y. Yao. Online learning as stochastic approximation of regularization paths: optimality and\n\nalmost-sure convergence. IEEE Trans. Inform. Theory, 60(9):5716\u20135735, 2014.\n\n[33] V. Vapnik. Statistical learning theory. Wiley, New York, 1998.\n[34] Y. Yao, L. Rosasco, and A. Caponnetto. On early stopping in gradient descent learning. Constr. Approx.,\n\n26:289\u2013315, 2007.\n\n[35] Y. Ying and M. Pontil. Online gradient descent learning algorithms. Found. Comput. Math., 8:561\u2013596,\n\n2008.\n\n[36] T. Zhang and B. Yu. Boosting with early stopping: Convergence and consistency. Annals of Statistics,\n\npages 1538\u20131579, 2005.\n", "award": [], "sourceid": 1008, "authors": [{"given_name": "Lorenzo", "family_name": "Rosasco", "institution": "University of Genova"}, {"given_name": "Silvia", "family_name": "Villa", "institution": "IIT-MIT"}]}