{"title": "Scrambled Objects for Least-Squares Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 1549, "page_last": 1557, "abstract": "We consider least-squares regression using a randomly generated subspace G_P\\subset F of finite dimension P, where F is a function space of infinite dimension, e.g.~L_2([0,1]^d).  G_P is defined as the span of P random features  that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d.~coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function,  such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called {\\em scrambled wavelets} and we show that they enable to approximate functions in Sobolev spaces H^s([0,1]^d). As a result, given N data, the least-squares estimate \\hat g built from P scrambled wavelets has excess risk ||f^* - \\hat g||_\\P^2 = O(||f^*||^2_{H^s([0,1]^d)}(\\log N)/P + P(\\log N )/N) for target functions f^*\\in H^s([0,1]^d) of smoothness order s>d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution \\P from which the data are generated, which is important in a statistical regression setting considered here. Randomization enables to adapt to any possible distribution.   We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \\tilde O(2^d N^{3/2}\\log N + N^2), where d is the dimension of the input space.", "full_text": "Scrambled Objects for Least-Squares Regression\n\nOdalric-Ambrym Maillard and R\u00b4emi Munos\n\nSequeL Project, INRIA Lille - Nord Europe, France\n\n{odalric.maillard, remi.munos}@inria.fr\n\nAbstract\n\nWe consider least-squares regression using a randomly generated subspace GP \u2282\nF of \ufb01nite dimension P , where F is a function space of in\ufb01nite dimension,\ne.g. L2([0, 1]d). 
G_P is defined as the span of P random features that are linear combinations of the basis functions of F weighted by random Gaussian i.i.d. coefficients. In particular, we consider multi-resolution random combinations at all scales of a given mother function, such as a hat function or a wavelet. In this latter case, the resulting Gaussian objects are called scrambled wavelets and we show that they make it possible to approximate functions in Sobolev spaces H^s([0,1]^d). As a result, given N data, the least-squares estimate \\hat g built from P scrambled wavelets has excess risk ||f^* - \\hat g||_{\\mathcal{P}}^2 = O(||f^*||^2_{H^s([0,1]^d)} (\\log N)/P + P (\\log N)/N) for target functions f^* \\in H^s([0,1]^d) of smoothness order s > d/2. An interesting aspect of the resulting bounds is that they do not depend on the distribution \\mathcal{P} from which the data are generated, which is important in the statistical regression setting considered here. Randomization makes it possible to adapt to any possible distribution. We conclude by describing an efficient numerical implementation using lazy expansions with numerical complexity \\tilde O(2^d N^{3/2} \\log N + N^2), where d is the dimension of the input space.\n\n1 Introduction\n\nWe consider ordinary least-squares regression using randomly generated feature spaces. Let us first describe the general regression problem: we observe data D_N = (\\{x_n, y_n\\}_{1 \\le n \\le N}) (with x_n \\in X, a compact subset of R^d, and y_n \\in R), assumed to be independently and identically distributed (i.i.d.) with x_n \\sim \\mathcal{P} and\n\n    y_n = f^*(x_n) + \\eta_n,\n\nwhere f^* is the (unknown) target function, such that ||f^*||_\\infty \\le L, and \\eta_n is a centered, independent noise of variance bounded by \\sigma^2. We assume that L and \\sigma are known.\n\nNow, for a given class of functions F, and f \\in F, we define the empirical \\ell_2-error\n\n    L_N(f) \\stackrel{\\rm def}{=} \\frac{1}{N} \\sum_{n=1}^N [y_n - f(x_n)]^2,\n\nand the generalization error\n\n    L(f) \\stackrel{\\rm def}{=} E_{X,Y}[(Y - f(X))^2].\n\nThe goal is to return a regression function \\hat f \\in F with lowest possible generalization error L(\\hat f). The excess risk L(\\hat f) - L(f^*) = ||f^* - \\hat f||_{\\mathcal{P}}^2 (where ||g||_{\\mathcal{P}}^2 = E_{X \\sim \\mathcal{P}}[g(X)^2]) measures the closeness to optimality.\n\nIn this paper we consider infinite dimensional spaces F that are generated by a denumerable family of functions \\{\\varphi_i\\}_{i \\ge 1}, called initial features (such as wavelets). We will assume that f^* \\in F.\n\nSince F is an infinite dimensional space, the empirical risk minimizer in F is certainly subject to overfitting. Traditional methods to circumvent this problem have considered penalization, i.e. one searches for a function in F which minimizes the empirical error plus a penalty term, for example \\hat f = \\arg\\min_{f \\in F} L_N(f) + \\lambda ||f||_p^p for p = 1 or 2, where \\lambda is a parameter and usual choices for the norm are \\ell_2 (ridge-regression [17]) and \\ell_1 (LASSO [16]).\n\nIn this paper we follow an alternative approach introduced in [10], called Compressed Least Squares Regression, which considers generating randomly a subspace G_P (of finite dimension P) of F, and then returning the empirical risk minimizer in G_P, i.e. \\arg\\min_{g \\in G_P} L_N(g). This previous work considered the case when F is of finite dimension. Here we consider specific cases of infinite dimensional spaces F and provide a characterization of the resulting approximation spaces.\n\n2 Regression with random spaces\n\nLet us briefly recall the method described in [10] and extend it to the case of infinite dimensional spaces F. 
In this paper we assume that the features (\\varphi_i)_{i \\ge 1} are continuous and are such that\n\n    \\sup_{x \\in X} ||\\varphi(x)||^2 < \\infty, where ||\\varphi(x)||^2 \\stackrel{\\rm def}{=} \\sum_{i \\ge 1} \\varphi_i(x)^2.    (1)\n\nExamples of feature spaces satisfying this property include rescaled wavelets and will be described in Section 3.\n\nThe random subspace G_P is generated by building a set of P random features (\\psi_p)_{1 \\le p \\le P} defined as linear combinations of the initial features \\{\\varphi_i\\}_{i \\ge 1} weighted by random coefficients:\n\n    \\psi_p(x) \\stackrel{\\rm def}{=} \\sum_{i \\ge 1} A_{p,i} \\varphi_i(x), for 1 \\le p \\le P,    (2)\n\nwhere the (infinitely many) coefficients A_{p,i} are drawn i.i.d. from a centered distribution with variance 1/P. Here we explicitly choose a Gaussian distribution N(0, 1/P). Such a definition of the features \\psi_p as an infinite sum of random variables is not obvious (this is called an expansion of a Gaussian object) and we refer to [11] for elements of theory about Gaussian objects and their expansions. It is shown there that under assumption (1), the random features are well defined. Actually, they are random samples of a centered Gaussian process indexed by the space X with covariance structure given by \\frac{1}{P} \\langle \\varphi(x), \\varphi(x') \\rangle, where we use the notation \\langle u, v \\rangle = \\sum_i u_i v_i for two square-summable sequences u and v. Indeed, E_{A_p}[\\psi_p(x)] = 0, and\n\n    Cov_{A_p}(\\psi_p(x), \\psi_p(x')) = E_{A_p}[\\psi_p(x) \\psi_p(x')] = \\frac{1}{P} \\sum_{i \\ge 1} \\varphi_i(x) \\varphi_i(x') = \\frac{1}{P} \\langle \\varphi(x), \\varphi(x') \\rangle.\n\nThe continuity of the initial features (\\varphi_i) guarantees that there exists a continuous version of the process \\psi_p, which is thus a Gaussian process.\n\nThen we define G_P \\subset F to be the (random) vector space spanned by those features, i.e.\n\n    G_P \\stackrel{\\rm def}{=} \\{ g_\\beta(x) \\stackrel{\\rm def}{=} \\sum_{p=1}^P \\beta_p \\psi_p(x), \\beta \\in R^P \\}.\n\nNow, the least-squares estimate g_{\\hat\\beta} \\in G_P is the function in G_P with minimal empirical error, i.e.\n\n    g_{\\hat\\beta} = \\arg\\min_{g_\\beta \\in G_P} L_N(g_\\beta),    (3)\n\nand is the solution of a least-squares regression problem, i.e. \\hat\\beta = \\Psi^\\dagger Y \\in R^P, where \\Psi is the N \\times P matrix composed of the elements \\Psi_{n,p} \\stackrel{\\rm def}{=} \\psi_p(x_n), and \\Psi^\\dagger is the Moore-Penrose pseudo-inverse of \\Psi (in the full rank case when N \\ge P, \\Psi^\\dagger = (\\Psi^T \\Psi)^{-1} \\Psi^T). The final prediction function \\hat g(x) is the truncation (to the threshold \\pm L) of g_{\\hat\\beta}, i.e. \\hat g(x) \\stackrel{\\rm def}{=} T_L[g_{\\hat\\beta}(x)], where\n\n    T_L(u) \\stackrel{\\rm def}{=} u if |u| \\le L, and L sign(u) otherwise.\n\nNext, we provide bounds on the approximation error of f^* in G_P and deduce excess risk bounds.\n\n2.1 Approximation error\n\nWe now extend the result of [10] and derive approximation error bounds both in expectation and in high probability. We restrict the set of target functions to belong to the approximation space K \\subset F (also identified with the kernel space associated to the expansion of a Gaussian object):\n\n    K \\stackrel{\\rm def}{=} \\{ f_\\alpha \\in F, ||\\alpha||^2 \\stackrel{\\rm def}{=} \\sum_{i \\ge 1} \\alpha_i^2 < \\infty \\}.    (4)\n\nRemark 1. 
This space may be seen from two equivalent points of view: either as a set of functions that are random linear combinations of the initial features, or as a set of functions that are the expectation of some random processes (interpretation in terms of kernel space). We will not develop the related theory of Gaussian processes here, but we refer the reader interested in the construction of kernel spaces to [11].\n\nLet f_\\alpha = \\sum_i \\alpha_i \\varphi_i \\in K. Write g^* for the projection of f_\\alpha onto G_P w.r.t. the norm ||\\cdot||_{\\mathcal{P}}, i.e. g^* = \\arg\\min_{g \\in G_P} ||f_\\alpha - g||_{\\mathcal{P}}, and \\bar g^* = T_L g^* for its truncation at the threshold L \\ge ||f_\\alpha||_\\infty. Notice that due to the randomness of the features (\\psi_p)_{1 \\le p \\le P} of G_P, the space G_P is also random, and so is \\bar g^*. The following result provides bounds for the approximation error ||f_\\alpha - \\bar g^*||_{\\mathcal{P}} both in expectation and in high probability.\n\nTheorem 1. For any \\eta > 0, whenever P \\ge c_1 \\log(P \\gamma^2 \\sqrt{\\log(1/\\eta)}/\\eta), we have with probability 1 - \\eta (w.r.t. the choice of the random subspace G_P),\n\n    \\inf_{g \\in G_P} ||f_\\alpha - T_L(g)||_{\\mathcal{P}}^2 \\le c_2 \\frac{||\\alpha||^2 \\sup_x ||\\varphi(x)||^2}{P} ( 1 + \\log(P \\gamma^2 \\sqrt{\\log(1/\\eta)}/\\eta) ),\n\nwhere \\gamma = \\frac{L}{||\\alpha|| \\sup_x ||\\varphi(x)||} and c_1, c_2 are some universal constants (see [11]). A similar result holds in expectation.\n\nThis result relies on the property that \\inf_{g \\in G_P} ||f_\\alpha - g||_{\\mathcal{P}} \\le ||f_\\alpha - g_{A\\alpha}||_{\\mathcal{P}} and that g_{A\\alpha}, considered as a random variable w.r.t. the choice of the random elements A, concentrates around f_\\alpha (in ||\\cdot||_{\\mathcal{P}}-norm) when P increases. Indeed, g_{A\\alpha}(x) = (A\\alpha) \\cdot \\psi(x) = (A\\alpha) \\cdot (A\\varphi(x)), which is close to \\alpha \\cdot \\varphi(x) = f_\\alpha(x), since inner products are approximately preserved through random projections (from a variant of the Johnson-Lindenstrauss (JL) Lemma). The proof of Theorem 1 (provided in the Appendix of [11]) relies on generating auxiliary samples X'_1, ..., X'_J from \\mathcal{P}, applying the JL Lemma at those points, and combining it with a Chernoff-Hoeffding bound to generalize the result to hold in ||\\cdot||_{\\mathcal{P}}-norm.\n\nRemark 2. An interesting property of this result is that the bound does not depend on the distribution \\mathcal{P}. This distribution is used in the definition of the norm ||\\cdot||_{\\mathcal{P}} to assess how well a function space G_P can approximate a function f_\\alpha. It is thus surprising that the measure \\mathcal{P} does not appear in the bound. Actually, the fact that G_P is random enables it to be close to f_\\alpha (in high probability or in expectation) whatever the measure \\mathcal{P} is. This is especially interesting in a regression setting where the distribution \\mathcal{P} from which the data are generated is not known in advance.\n\n2.2 Excess risk bounds\n\nWe now combine the approximation error bound from Theorem 1 with usual estimation error bounds for linear spaces (see e.g. [7]). Let us consider a target function f^* = \\sum_i \\alpha^*_i \\varphi_i \\in K. Remember that our prediction function \\hat g is the truncation \\hat g \\stackrel{\\rm def}{=} T_L[g_{\\hat\\beta}] of the (ordinary) least-squares estimate g_{\\hat\\beta} (empirical risk minimizer in the random space G_P) defined by (3).\n\nWe now provide upper bounds (both in expectation and in high probability) on the excess risk for the least-squares estimate using random subspaces (the proof is given in [11]).\n\nTheorem 2. Whenever P \\ge c_3 \\log N, we have the following bound in expectation (w.r.t. all sources of randomness, i.e. input data, noise, and the choice of the random features):\n\n    E_{G_P,X,Y} ||f^* - \\hat g||_{\\mathcal{P}}^2 \\le c_4 ( \\sigma^2 \\frac{P}{N} + L^2 \\frac{P \\log N}{N} + \\frac{\\log N}{P} ||\\alpha^*||^2 \\sup_x ||\\varphi(x)||^2 ).    (5)\n\nNow, for any \\eta > 0, whenever P \\ge c_5 \\log(N/\\eta), we have the following bound in high probability (w.r.t. the choice of the random features), where c_3, c_4, c_5, c_6 are universal constants (see [11]):\n\n    E_{X,Y} ||f^* - \\hat g||_{\\mathcal{P}}^2 \\le c_6 ( \\sigma^2 \\frac{P}{N} + L^2 \\frac{P \\log N}{N} + \\frac{\\log(N/\\eta)}{P} ||\\alpha^*||^2 \\sup_x ||\\varphi(x)||^2 ).    (6)\n\nThe results of Theorems 1 and 2 say that if the term ||\\alpha^*||^2 \\sup_x ||\\varphi(x)||^2 is small, then the least-squares estimate in the random subspace G_P has low excess risk. The question we wish to address now is whether we can define spaces for which this is the case. In the next section we provide two examples of feature spaces and characterize the space of functions for which this term is controlled.\n\n3 Regression with Scrambled Objects\n\nIn the two examples provided below we consider (infinitely many) initial features that are translations and rescalings of a given mother function (which is assumed to be continuous) at all scales. Thus each random feature \\psi_p is a Gaussian object based on a multi-scale scheme built from an object (the mother function), and will be called a \u201cscrambled object\u201d, to refer to the disorderly construction of this multi-resolution random process.\n\nWe thus propose to solve the regression problem by ordinary least squares on the (random) approximation space defined by the span of P such scrambled objects. In the next sections we provide two examples. The first one considers the case when the mother function is a hat function and we show that the corresponding scrambled objects are Brownian motions. The second example considers wavelets. The proof of bounds (7) and (8) can be found in [11].\n\n3.1 Brownian motions and Brownian Sheets\n\nDimension 1: We start with the 1-dimensional case where X = [0, 1]. 
Let us choose as object (mother function) the hat function \\Lambda(x) = x I_{[0,1/2[} + (1 - x) I_{[1/2,1[}. We define the (infinite) set of initial features as translated and rescaled hat functions: \\Lambda_{j,l}(x) = 2^{-j/2} \\Lambda(2^j x - l) for any scale j \\ge 1 and translation index 0 \\le l \\le 2^j - 1. We also write \\Lambda_{0,0}(x) = x. This defines a basis of the space of continuous functions C^0([0,1]) equal to 0 at 0 (introduced by Faber in 1910, and known as the Schauder basis, see [8] for an interesting overview). Those functions are indexed by the scale j and translation index l, but all functions may be equivalently indexed by a unique index i \\ge 1.\n\nWe have the property that the random features \\psi_p(x), defined as linear combinations of those hat functions weighted by Gaussian i.i.d. random numbers, are Brownian motions (see Example 1 of [11] for the proof). In addition, we can characterize the corresponding kernel space K, which is the Sobolev space H^1([0,1]) of order 1 (space of functions which have a weak derivative in L_2([0,1])).\n\nDimension d: For the extension to dimension d, we define the initial features as the tensor product \\varphi_{j,l} of one-dimensional hat functions (thus j and l are multi-indices). The random features \\psi_p(x) are Brownian sheets (extensions of Brownian motions to several dimensions) and the corresponding kernel space K is the so-called Cameron-Martin space [9], endowed with the norm ||f||_K = ||\\frac{\\partial^d f}{\\partial x_1 ... \\partial x_d}||_{L_2([0,1]^d)} (see also Example 1 of [11] for the proof). One may interpret this space as the set of functions which have a d-th order crossed (weak) derivative \\frac{\\partial^d f}{\\partial x_1 ... \\partial x_d} in L_2([0,1]^d), vanishing on the \u201cleft\u201d boundary (edges containing 0) of the unit d-dimensional cube. Note that in dimension d > 1, this space differs from the Sobolev space H^1.\n\nRegression with Brownian Sheets: When one uses Brownian sheets for regression with a target function f^* = \\sum_i \\alpha^*_i \\varphi_i that lies in the Cameron-Martin space K defined previously (i.e. such that ||\\alpha^*|| < \\infty), then the term ||\\alpha^*||^2 \\sup_{x \\in X} ||\\varphi(x)||^2 that appears in Theorems 1 and 2 is bounded as:\n\n    ||\\alpha^*||^2 \\sup_{x \\in X} ||\\varphi(x)||^2 \\le 2^{-d} ||f^*||_K^2.\n\nThus, from Theorem 2, ordinary least-squares performed on random subspaces spanned by P Brownian sheets has an expected excess risk\n\n    E_{G_P,X,Y} ||f^* - \\hat g||_{\\mathcal{P}}^2 = O( \\frac{\\log N}{N} P + \\frac{\\log N}{P} ||f^*||_K^2 ),    (7)\n\n(and a similar bound holds in high probability).\n\n3.2 Scrambled Wavelets in [0,1]^d\n\nWe now introduce a second example built from a family of orthogonal wavelets (\\tilde\\varphi_{\\varepsilon,j,l}) \\in C^q([0,1]^d) (where \\varepsilon \\in \\{0,1\\}^d is a multi-index, j is a scale index, l a multi-index, see [2, 12] for details of the notations) with at least q > d/2 vanishing moments. Now for s \\in (d/2, q), we define the initial features (\\varphi_{\\varepsilon,j,l}) as the rescaled wavelets (\\tilde\\varphi_{\\varepsilon,j,l}), i.e. \\varphi_{\\varepsilon,j,l} \\stackrel{\\rm def}{=} 2^{-js} \\frac{\\tilde\\varphi_{\\varepsilon,j,l}}{||\\tilde\\varphi_{\\varepsilon,j,l}||_2}. Again, the initial features may equivalently be indexed by a unique index i \\ge 1. The random features \\psi_p defined from (2) are called \u201cscrambled wavelets\u201d. It can be shown that the resulting approximation space K (i.e. \\{ f_\\alpha = \\sum_i \\alpha_i \\varphi_i, ||\\alpha|| < \\infty \\}) is the Sobolev space H^s([0,1]^d).\n\nRegression with Scrambled Wavelets: Assume that the mother wavelet \\tilde\\varphi has compact support [0,1]^d and is bounded by \\lambda, and assume that the target function f^* = \\sum_i \\alpha^*_i \\varphi_i lies in the Sobolev space H^s([0,1]^d) with s > d/2 (i.e. such that ||\\alpha^*|| < \\infty). Then we have\n\n    ||\\alpha^*||^2 \\sup_{x \\in X} ||\\varphi(x)||^2 \\le \\frac{\\lambda^{2d} (2^d - 1)}{1 - 2^{-2(s - d/2)}} ||f^*||^2_{H^s([0,1]^d)}.\n\nThus from Theorem 2, ordinary least-squares performed on random subspaces spanned by P scrambled wavelets has an expected excess risk\n\n    E_{G_P,X,Y} ||f^* - \\hat g||_{\\mathcal{P}}^2 = O( \\frac{\\log N}{N} P + \\frac{\\log N}{P} ||f^*||^2_{H^s([0,1]^d)} ),    (8)\n\n(and a similar bound holds in high probability).\n\nIn both examples, by choosing P of order \\sqrt{N} ||f^*||_K, one deduces the excess risk\n\n    E ||f^* - \\hat g||_{\\mathcal{P}}^2 = O( \\frac{||f^*||_K \\log N}{\\sqrt{N}} ).    (9)\n\n3.3 Remark about randomized spaces\n\nNote that the bounds on the excess risk obtained in (7), (8), and (9) do not depend on the distribution \\mathcal{P} under which the data are generated. This is crucial in our setting since \\mathcal{P} is usually unknown. It should be noticed that this property does not hold when one considers non-randomized approximation spaces. Indeed, it is relatively easy to exhibit a particularly well-chosen set of features \\varphi_i that will approximate functions in a given class using a particular measure \\mathcal{P}. For example when \\mathcal{P} = \\lambda, the Lebesgue measure, and f^* \\in H^s([0,1]^d) (with s > d/2), then linear regression using wavelets (with at least d/2 vanishing moments), which form an orthonormal basis of L_{2,\\lambda}([0,1]^d), achieves a bound similar to (8). 
However, this is no longer the case when \\mathcal{P} is not the Lebesgue measure, and it seems difficult to modify the features \\varphi_i in order to recover the same bound, even when \\mathcal{P} is known. This seems to be even harder when \\mathcal{P} is arbitrary and not known in advance.\n\nRandomization makes it possible to define approximation spaces such that the approximation error (either in expectation or in high probability on the choice of the random space) is controlled, whatever the measure \\mathcal{P} used to assess the performance (even when \\mathcal{P} is unknown).\n\nFor illustration, consider a very peaky (a spot) distribution \\mathcal{P} in a high-dimensional space X. Regular linear approximation, say with wavelets (see e.g. [6]), will most probably miss the specific characteristics of f^* at the spot, since the first wavelets have large support. On the contrary, scrambled wavelets, which are functions that contain (random combinations of) all wavelets, will be able to detect correlations between the data and some high frequency wavelets, and thus discover relevant features of f^* at the spot. This is illustrated in the numerical experiment below.\n\nHere \\mathcal{P} is a very peaky Gaussian distribution and f^* is a 1-dimensional periodic function. We consider as initial features (\\varphi_i)_{i \\ge 1} the set of hat functions defined in Section 3.1. Figure 1 shows the target function f^*, the distribution \\mathcal{P}, and the data (x_n, y_n)_{1 \\le n \\le 100} (left plots). The middle plots represent the least-squares estimate \\hat g using P = 40 scrambled objects (\\psi_p)_{1 \\le p \\le 40} (here Brownian motions). The right plots show the least-squares estimate using the initial features (\\varphi_i)_{1 \\le i \\le 40}. The top figures represent a high level view of the whole domain [0, 1]. No method is able to learn f^* on the whole space (this is normal since the available data are only generated from a peaky distribution). The bottom figures show a zoom on [0.45, 0.51] around the data. Least-squares regression using scrambled objects is able to learn the structure of f^* in terms of the measure \\mathcal{P}.\n\nFigure 1: LS estimate of f^* using N = 100 data generated from a peaky distribution \\mathcal{P} (left plots), using 40 Brownian motions (\\psi_p) (middle plots) and 40 hat functions (\\varphi_i) (right plots). The bottom row shows a zoom around the data.\n\n4 Discussion\n\nMinimax optimality: Note that although the rate \\tilde O(N^{-1/2}) deduced in (9) does not depend on the dimension d of the input data X, it does not contradict the known minimax lower bounds, which are \\Omega(N^{-2s/(2s+d)}) for functions defined over [0,1]^d that possess s degrees of smoothness (e.g. that are s-times differentiable), see e.g. Chapter 3 of [7]. Indeed, the kernel space K is composed of functions whose order of smoothness may depend on d. For illustration, in the case of scrambled wavelets, the kernel space is the Sobolev space H^s([0,1]^d) with s > d/2. Thus 2s/(2s + d) > 1/2. Notice that if one considers wavelets with q vanishing moments, where q > d/2, then one may choose s (such that q > s > d/2) arbitrarily close to d/2, and deduce that the excess risk rate \\tilde O(N^{-1/2}) deduced from Theorem 2 is arbitrarily close to the minimax lower rate. Thus regression using scrambled wavelets is minimax optimal (up to logarithmic factors).\n\nNow, concerning Brownian sheets, we are not aware of minimax lower bounds for Cameron-Martin spaces, thus we do not know whether regression using Brownian sheets is minimax optimal or not.\n\nLinks with RKHS Theory: There are strong links between the kernel space of Gaussian objects (see eq. (4)) and Reproducing Kernel Hilbert Spaces (RKHS). We now recall two properties that illustrate those links:\n\n\u2022 Kernel spaces of Gaussian objects can be built using a Carleman operator, i.e. 
a linear injective mapping J : H \\to S (where H is a Hilbert space) such that J(h)(t) = \\int \\Gamma_t(s) h(s) ds, where (\\Gamma_t)_t is a collection of functions of H. There is a bijection between Carleman operators and the set of RKHSs [4, 15].\n\n\u2022 Expansion of a Mercer kernel. The expansion of a Mercer kernel k (i.e. when X is compact Hausdorff and k is a continuous kernel) is given by k(x, y) = \\sum_{i=1}^\\infty \\lambda_i e_i(x) e_i(y), where (\\lambda_i)_i and (e_i)_i are the eigenvalues and eigenfunctions of the integral operator L_k : L_{2,\\mu}(X) \\to L_{2,\\mu}(X) defined by (L_k(f))(x) = \\int_X k(x, y) f(y) d\\mu(y). The associated RKHS is K = \\{ f = \\sum_i \\alpha_i \\varphi_i; \\sum_i \\alpha_i^2 < \\infty \\}, where \\varphi_i = \\sqrt{\\lambda_i} e_i, endowed with the inner product \\langle f_\\alpha, f_\\beta \\rangle = \\langle \\alpha, \\beta \\rangle_{l_2}. This space is thus also the kernel space of the Gaussian object as defined by (4).\n\n[Figure 1 panels, top row over [0, 1] and bottom row zoomed on [0.45, 0.51]: Target function; Predicted function: BLSR_Hat; Predicted function: LSR_Hat.]\n\nThe expansion of a Mercer kernel gives an explicit construction of the functions of the RKHS. However it may not be straightforward to compute the eigenvalues and eigenfunctions of the integral operator L_k, and thus the basis functions \\varphi_i, in the general case.\n\nThe approach described in this paper makes it possible to choose the initial basis functions explicitly and build the corresponding kernel space. For example we have presented expansions using multi-resolution bases (such as hat functions and wavelets), which are not easy to obtain from the Mercer expansion. 
This is interesting because from the choice of the initial basis, we can characterize the corresponding approximation spaces (e.g. Sobolev space in the case of wavelets). Another more practical benefit is that by using multi-resolution bases (with compact mother function), we can derive efficient numerical implementations, as described in Section 5.\n\nRelated works: In [14, 13], the authors consider, for a given parameterized function \\Phi : X \\times \\Theta \\to R bounded by 1, and a probability measure \\mu over \\Theta, the space F of functions f(x) = \\int_\\Theta \\alpha(\\theta) \\Phi(x, \\theta) d\\theta such that ||f||_\\mu = \\sup_\\theta |\\alpha(\\theta)/\\mu(\\theta)| < \\infty. They show that this is a dense subset of the RKHS with kernel k(x, y) = \\int_\\Theta \\mu(\\theta) \\Phi(x, \\theta) \\Phi(y, \\theta) d\\theta, and that if f \\in F, then with high probability over (\\theta_p)_{p \\le P} drawn i.i.d. from \\mu, there exist coefficients (c_p)_{p \\le P} such that \\hat f(x) = \\sum_{p=1}^P c_p \\Phi(x, \\theta_p) satisfies ||\\hat f - f||_2^2 \\le O(||f||_\\mu / \\sqrt{P}). The method is analogous to the construction of the empirical estimates g_{A\\alpha} \\in G_P of function f_\\alpha \\in K in our setting. Indeed we may formally identify \\Phi(x, \\theta_p) with \\psi_p(x) = \\sum_i A_{p,i} \\varphi_i(x), \\theta_p with the sequence (A_{p,i})_i, and the law \\mu with the law of this infinite sequence. However, in our setting we do not require the condition \\sup_{x,\\theta} \\Phi(x, \\theta) \\le 1 to hold, and the fact that \\Theta is a set of infinite sequences makes the identification tedious without the Gaussian random functions theory used here. Nevertheless, we believe that this link provides a better mutual understanding of both approaches (i.e. [14] and this paper).\n\nIn the work [1], the authors provide excess risk bounds for greedy algorithms (i.e. in a non-linear approximation setting). 
The bound derived in their Theorem 3.1 is similar to the result stated in our Theorem 2. The main difference is that their bound makes use of the l_1 norm of the coefficients \\alpha^* instead of the l_2 norm in our setting. It would be interesting to further investigate whether this difference is a consequence of the non-linear aspect of their approximation or if it results from the different assumptions made about the approximation spaces, in terms of rate of decrease of the coefficients.\n\n5 Efficient implementation using a lazy multi-resolution expansion\n\nIn practice, in order to build the least-squares estimate, one needs to compute the values of the random features (\\psi_p)_{1 \\le p \\le P} at the data points (x_n)_{1 \\le n \\le N}, i.e. the matrix \\Psi = (\\psi_p(x_n))_{p \\le P, n \\le N}. Due to the finite memory and precision of computers, numerical implementations can only handle a finite number F of initial features (\\varphi_i)_{1 \\le i \\le F}. In [10] it was mentioned that the computation of \\Psi, which makes use of the random matrix A = (A_{p,i})_{p \\le P, i \\le F}, has a complexity O(F P N). However, in the multi-resolution schemes described here, provided that the mother function has compact support (such as the hat functions or the Daubechies wavelets), we can significantly speed up the computation of the matrix \\Psi by using a tree-based lazy expansion, i.e. where the expansion of the random features (\\psi_p)_{p \\le P} is built only when needed for the evaluation at the points (x_n)_n.\n\nConsider the example of the scrambled wavelets. In dimension 1, using a wavelet dyadic-tree of depth H (i.e. F = 2^{H+1}), the numerical cost for computing \\Psi is O(H P N) (using one tree per random feature). Now, in dimension d the classical extension of one-dimensional wavelets uses a family of 2^d - 1 wavelets, and thus requires 2^d - 1 trees, each one having 2^{dH} nodes. 
While the resulting number of initial features F is of order 2^{d(H+1)}, thanks to the lazy evaluation (notice that one never computes all the initial features), one needs to expand at most one path of length H per training point, and the resulting complexity to compute \\Psi is O(2^d H P N).\n\nNote that one may alternatively use the so-called sparse-grids instead of wavelet trees, which have been introduced by Griebel and Zenger (see [18, 3]). The main result is that one can reduce significantly the total number of features to F = O(2^H H^d) (while preserving a good approximation for sufficiently smooth functions). Similar lazy evaluation techniques can be applied to sparse-grids.\n\nNow, using a finite F introduces an additional approximation (squared) error term in the final excess risk bounds of order O(F^{-2s/d}) for a wavelet basis adapted to H^s([0,1]^d). This additional error (due to the numerical approximation) can be made arbitrarily small, e.g. o(N^{-1/2}), whenever H \\ge \\frac{\\log N}{d}. Thus, using P = O(\\sqrt{N}) random features, we deduce that the complexity of building the matrix \\Psi is O(2^d N^{3/2} \\log N). Then in order to solve the least squares system, one has to compute \\Psi^T \\Psi, which has numerical cost O(P^2 N), and then solve the system by inversion, which has numerical cost O(P^{2.376}) by [5]. Thus, the overall cost of the algorithm is O(2^d N^{3/2} \\log N + N^2).\n\n6 Conclusion and future works\n\nWe analyzed least-squares regression using subspaces G_P that are generated by P random linear combinations of infinitely many initial features. We showed that the approximation space K = \\{ f_\\alpha, ||\\alpha|| < \\infty \\} (which is also the kernel space of the related Gaussian object) provides a characterization of the set of target functions f^* for which this random regression works. We illustrated the approach on two examples for which the approximation space is a known functional space, namely a Cameron-Martin space when the random features are Brownian sheets (generated by random combinations at all scales of a hat function), and a Sobolev space in the case of scrambled wavelets. We derived a general approximation error result from which we deduced excess risk bounds of order O( \\frac{\\log N}{N} P + \\frac{\\log N}{P} ||f^*||_K^2 ).\n\nWe showed that least-squares regression with scrambled wavelets provides rates that are arbitrarily close to minimax optimality. However in the case of regression with Brownian sheets, we are not aware of minimax lower bounds for Cameron-Martin spaces in dimension d > 1.\n\nWe discussed a key aspect of randomized approximation spaces, which is that the approximation error can be controlled independently of the measure \\mathcal{P} used to assess the performance. This is essential in a regression setting where \\mathcal{P} is unknown, and excess risk rates independent of \\mathcal{P} are obtained.\n\nWe concluded by mentioning a nice property of multiscale objects like Brownian sheets and scrambled wavelets (with compact mother wavelet), namely that they can be implemented efficiently. We described a lazy expansion approach for computing the regression function which has a numerical complexity O(N^2 + 2^d N^{3/2} \\log N).\n\nA limitation of the current scrambled wavelets is that, so far, we did not consider a refined analysis for spaces H^s with large smoothness s \\gg d/2. Possible directions for better handling such spaces may involve refined covering number bounds, which will be the object of future works.\n\nAcknowledgment\n\nThis work has been supported by French National Research Agency (ANR) through COSINUS program (project EXPLO-RA number ANR-08-COSI-004).\n\nReferences\n\n[1] Andrew Barron, Albert Cohen, Wolfgang Dahmen, and Ronald DeVore. Approximation and learning by greedy algorithms. 36:1:64\u201394, 2008.\n\n[2] Gerard Bourdaud. 
Ondelettes et espaces de Besov. Rev. Mat. Iberoamericana, 11:3:477\u2013512, 1995.\n[3] Hans-Joachim Bungartz and Michael Griebel. Sparse grids. In Arieh Iserles, editor, Acta Numerica, volume 13. University of Cambridge, 2004.\n[4] St\u00e9phane Canu, Xavier Mary, and Alain Rakotomamonjy. Functional learning through kernel. arXiv, 2009, October.\n[5] D. Coppersmith and S. Winograd. Matrix multiplication via arithmetic progressions. In STOC \u201987: Proceedings of the nineteenth annual ACM symposium on Theory of computing, pages 1\u20136, New York, NY, USA, 1987. ACM.\n[6] R. DeVore. Nonlinear Approximation. Acta Numerica, 1997.\n[7] L. Gy\u00f6rfi, M. Kohler, A. Krzy\u017cak, and H. Walk. A distribution-free theory of nonparametric regression. Springer-Verlag, 2002.\n[8] St\u00e9phane Jaffard. D\u00e9compositions en ondelettes. In Development of mathematics 1950\u20132000, pages 609\u2013634. Birkh\u00e4user, Basel, 2000.\n[9] Svante Janson. Gaussian Hilbert spaces. Cambridge University Press, Cambridge, UK, 1997.\n[10] Odalric-Ambrym Maillard and R\u00e9mi Munos. Compressed Least-Squares Regression. In NIPS 2009, Vancouver, Canada, 2009.\n[11] Odalric-Ambrym Maillard and R\u00e9mi Munos. Linear regression with random projections. Technical report, Hal INRIA: http://hal.archives-ouvertes.fr/inria-00483014/, 2010.\n[12] Stephane Mallat. A Wavelet Tour of Signal Processing. Academic Press, 1999.\n[13] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In John C. Platt, Daphne Koller, Yoram Singer, and Sam T. Roweis, editors, NIPS. MIT Press, 2007.\n[14] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. 2008.\n[15] S. Saitoh. Theory of reproducing Kernels and its applications. Longman Scientific & Technical, Harlow, UK, 1988.\n[16] Robert Tibshirani. 
Regression shrinkage and selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267\u2013288, 1994.\n[17] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math Dokl 4, pages 1035\u20131038, 1963.\n[18] C. Zenger. Sparse grids. In W. Hackbusch, editor, Parallel Algorithms for Partial Differential Equations, Proceedings of the Sixth GAMM-Seminar, volume 31 of Notes on Num. Fluid Mech., Kiel, 1990. Vieweg-Verlag.\n", "award": [], "sourceid": 823, "authors": [{"given_name": "Odalric", "family_name": "Maillard", "institution": null}, {"given_name": "R\u00e9mi", "family_name": "Munos", "institution": null}]}