{"title": "Nystr\u00f6m Method vs Random Fourier Features: A Theoretical and Empirical Comparison", "book": "Advances in Neural Information Processing Systems", "page_first": 476, "page_last": 484, "abstract": "Both random Fourier features and the Nystr\u00f6m method have been successfully applied to efficient kernel learning. In this work, we investigate the fundamental difference between these two approaches, and how the difference could affect their generalization performances. Unlike approaches based on random Fourier features  where the basis functions (i.e., cosine and sine functions) are sampled from a distribution  {\\it independent} from the training data, basis functions used by the Nystr\u00f6m method are randomly sampled from the training examples and are therefore {\\it data dependent}. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based the Nystr\u00f6m method can yield  impressively  better generalization error bound than random Fourier features based approach. We empirically verify our theoretical findings on a wide range of large data sets.", "full_text": "Nystr\u00a8om Method vs Random Fourier Features:\n\nA Theoretical and Empirical Comparison\n\nTianbao Yang\u2020, Yu-Feng Li\u2021, Mehrdad Mahdavi(cid:92), Rong Jin(cid:92), Zhi-Hua Zhou\u2021\n\n\u2020Machine Learning Lab, GE Global Research, San Ramon, CA 94583\n\n\u2021National Key Laboratory for Novel Software Technology, Nanjing University, 210023, China\ntyang@ge.com,mahdavim,rongjin@msu.edu,liyf,zhouzh@lamda.nju.edu.cn\n\n(cid:92)Michigan State University, East Lansing, MI 48824\n\nAbstract\n\nBoth random Fourier features and the Nystr\u00a8om method have been successfully\napplied to ef\ufb01cient kernel learning. In this work, we investigate the fundamental\ndifference between these two approaches, and how the difference could affect\ntheir generalization performances. 
Unlike approaches based on random Fourier features, where the basis functions (i.e., cosine and sine functions) are sampled from a distribution independent of the training data, basis functions used by the Nyström method are randomly sampled from the training examples and are therefore data dependent. By exploring this difference, we show that when there is a large gap in the eigen-spectrum of the kernel matrix, approaches based on the Nyström method can yield an impressively better generalization error bound than approaches based on random Fourier features. We empirically verify our theoretical findings on a wide range of large data sets.\n\n1 Introduction\n\nKernel methods [16], such as support vector machines, are among the most effective learning methods. These methods project data points into a high-dimensional or even infinite-dimensional feature space and find the optimal hyperplane in that feature space with strong generalization performance. One limitation of kernel methods is their high computational cost, which is at least quadratic in the number of training examples, due to the calculation of the kernel matrix. Although low-rank decomposition approaches (e.g., incomplete Cholesky decomposition [3]) have been used to alleviate the computational challenge of kernel methods, they still require computing the kernel matrix. Other approaches such as online learning [9] and budget learning [7] have also been developed for large-scale kernel learning, but they tend to yield worse performance than batch learning.\nTo avoid computing the kernel matrix, one common approach is to approximate a kernel learning problem with a linear prediction problem. This is often achieved by generating a vector representation of the data that approximates the kernel similarity between any two data points. 
The most well-known approaches in this category are random Fourier features [13, 14] and the Nyström method [20, 8]. Although both approaches have been found effective, it is not clear what their essential differences are, and which method is preferable in which situations. The objective of this work is to understand the difference between these two approaches, both theoretically and empirically.\nThe theoretical foundation for random Fourier features is that a shift-invariant kernel is the Fourier transform of a non-negative measure [15]. Using this property, the authors of [13] proposed to represent each data point by random Fourier features. The analysis in [14] shows that the generalization error bound for kernel learning based on random Fourier features is O(N^{-1/2} + m^{-1/2}), where N is the number of training examples and m is the number of sampled Fourier components.\nAn alternative approach for large-scale kernel classification is the Nyström method [20, 8], which approximates the kernel matrix by a low-rank matrix. It randomly samples a subset of training examples and computes a kernel matrix K̂ for the random samples. It then represents each data point by a vector based on its kernel similarity to the random samples and the sampled kernel matrix K̂. Most analyses of the Nyström method follow [8] and bound the error in approximating the kernel matrix. According to [8], the approximation error of the Nyström method, measured in spectral norm^1, is O(m^{-1/2}), where m is the number of sampled training examples. Using the arguments in [6], we would expect an additional error of O(m^{-1/2}) in the generalization performance caused by the approximation of the Nyström method, similar to random Fourier features.\n\nContributions. In this work, we first establish a unified framework for both methods from the viewpoint of functional approximation. 
This is important because random Fourier features and the Nyström method address large-scale kernel learning very differently: random Fourier features aim to approximate the kernel function directly, while the Nyström method is designed to approximate the kernel matrix. The unified framework allows us to see a fundamental difference between the two methods: the basis functions used by random Fourier features are randomly sampled from a distribution independent of the training data, leading to a data-independent vector representation; in contrast, the Nyström method randomly selects a subset of training examples to form its basis functions, leading to a data-dependent vector representation. By exploring this difference, we show that the additional error caused by the Nyström method in the generalization performance can be improved to O(1/m) when there is a large gap in the eigen-spectrum of the kernel matrix. Empirical studies on a synthetic data set and a broad range of real data sets verify our analysis.\n\n2 A Unified Framework for Approximate Large-Scale Kernel Learning\n\nLet D = {(x_1, y_1), ..., (x_N, y_N)} be a collection of N training examples, where x_i ∈ X ⊆ R^d and y_i ∈ Y. Let κ(·,·) be a kernel function, let H_κ denote the endowed Reproducing Kernel Hilbert Space, and let K = [κ(x_i, x_j)]_{N×N} be the kernel matrix for the samples in D. Without loss of generality, we assume κ(x, x) ≤ 1 for all x ∈ X. Let (λ_i, v_i), i = 1, ..., N be the eigenvalues and eigenvectors of K, ranked in descending order of the eigenvalues. Let V = [V_{ij}]_{N×N} = (v_1, ..., v_N) denote the eigenvector matrix. For the Nyström method, let D̂ = {x̂_1, ..., x̂_m} denote the randomly sampled examples and K̂ = [κ(x̂_i, x̂_j)]_{m×m} denote the corresponding kernel matrix. 
Similarly, let {(λ̂_i, v̂_i), i ∈ [m]} denote the eigenpairs of K̂, ranked in descending order of the eigenvalues, and let V̂ = [V̂_{ij}]_{m×m} = (v̂_1, ..., v̂_m). We introduce two linear operators induced by the examples in D and D̂, i.e.,\n\nL_N[f] = (1/N) Σ_{i=1}^N κ(x_i, ·) f(x_i),   L_m[f] = (1/m) Σ_{i=1}^m κ(x̂_i, ·) f(x̂_i).   (1)\n\nIt can be shown that both L_N and L_m are self-adjoint operators. According to [18], the eigenvalues of L_N and L_m are λ_i/N, i ∈ [N] and λ̂_i/m, i ∈ [m], respectively, and their corresponding normalized eigenfunctions φ_j, j ∈ [N] and φ̂_j, j ∈ [m] are given by\n\nφ_j(·) = (1/√λ_j) Σ_{i=1}^N V_{i,j} κ(x_i, ·), j ∈ [N],   φ̂_j(·) = (1/√λ̂_j) Σ_{i=1}^m V̂_{i,j} κ(x̂_i, ·), j ∈ [m].   (2)\n\nTo make our discussion concrete, we focus on the RBF kernel^2, i.e., κ(x, x̄) = exp(-‖x - x̄‖_2^2 / (2σ^2)), whose inverse Fourier transform is given by the Gaussian distribution p(u) = N(0, σ^{-2} I) [15].\n\n^1 We choose the bound based on the spectral norm according to the discussion in [6].\n^2 The improved bound obtained in this paper for the Nyström method is valid for any kernel matrix that satisfies the eigengap condition.\n\nOur goal is to efficiently learn a kernel prediction function by solving the following optimization problem:\n\nmin_{f ∈ H_D}  (λ/2) ‖f‖_{H_κ}^2 + (1/N) Σ_{i=1}^N ℓ(f(x_i), y_i),   (3)\n\nwhere H_D = span(κ(x_1, ·), ..., κ(x_N, ·)) is the span over all the training examples^3, and ℓ(z, y) is a convex loss function with respect to z. 
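The inverse-Fourier-transform property of the RBF kernel used above can be checked numerically. The following sketch (our illustration, not part of the paper; all variable names are ours) draws u ~ N(0, σ^{-2} I) and verifies that the Monte Carlo average of cos(u^⊤(x - x̄)) recovers κ(x, x̄):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, d = 1.5, 5
x = rng.normal(size=d)
xbar = rng.normal(size=d)

# Exact RBF kernel value kappa(x, xbar) = exp(-||x - xbar||^2 / (2 sigma^2))
k_exact = np.exp(-np.linalg.norm(x - xbar) ** 2 / (2 * sigma ** 2))

# Bochner's theorem: kappa(x, xbar) = E_u[cos(u^T (x - xbar))], u ~ N(0, sigma^{-2} I)
m = 200_000
U = rng.normal(scale=1.0 / sigma, size=(m, d))
k_mc = np.cos(U @ (x - xbar)).mean()

print(k_exact, k_mc)  # agree up to Monte Carlo error of order 1/sqrt(m)
```

This is exactly the expectation that random Fourier features approximate with a finite sample of m Fourier components.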
To facilitate our analysis, we assume max_{y ∈ Y} ℓ(0, y) ≤ 1 and that ℓ(z, y) has a bounded gradient, |∇_z ℓ(z, y)| ≤ C. The high computational cost of kernel learning arises from the fact that we have to search for an optimal classifier f(·) in a large space H_D.\nGiven this observation, to alleviate the computational cost of kernel classification, we can reduce the space H_D to a smaller space H_a and only search for the solution f(·) ∈ H_a. The main challenge is how to construct such a space H_a. On the one hand, H_a should be small enough to make efficient computation possible; on the other hand, H_a should be rich enough to provide a good approximation for most bounded functions in H_D. Below we show that the difference between random Fourier features and the Nyström method lies in the construction of the approximate space H_a. For each method, we begin with a description of the vector representation of the data, and then connect the vector representation to the approximate large-scale kernel machine by functional approximation.\n\nRandom Fourier Features. The random Fourier features are constructed by first sampling Fourier components u_1, ..., u_m from p(u), projecting each example x onto u_1, ..., u_m separately, and then passing the projections through the sine and cosine functions, i.e., z_f(x) = (sin(u_1^⊤ x), ..., sin(u_m^⊤ x), cos(u_1^⊤ x), ..., cos(u_m^⊤ x)). Given the random Fourier features, we then learn a linear machine f(x) = w^⊤ z_f(x) by solving the following optimization problem:\n\nmin_{w ∈ R^{2m}}  (λ/2) ‖w‖_2^2 + (1/N) Σ_{i=1}^N ℓ(w^⊤ z_f(x_i), y_i).   (4)\n\nTo connect the linear machine (4) to the kernel machine in (3) by a functional approximation, we can construct a functional space H_a^f = span(s_1(·), c_1(·), . . . 
, s_m(·), c_m(·)), where s_k(x) = sin(u_k^⊤ x) and c_k(x) = cos(u_k^⊤ x). If we approximate H_D in (3) by H_a^f, we have\n\nmin_{f ∈ H_a^f}  (λ/2) ‖f‖_{H_κ}^2 + (1/N) Σ_{i=1}^N ℓ(f(x_i), y_i).   (5)\n\nThe following proposition connects the approximate kernel machine in (5) to the linear machine in (4). Proofs can be found in the supplementary file.\n\nProposition 1. The approximate kernel machine in (5) is equivalent to the following linear machine\n\nmin_{w ∈ R^{2m}}  (λ/2) w^⊤ (w ∘ γ) + (1/N) Σ_{i=1}^N ℓ(w^⊤ z_f(x_i), y_i),   (6)\n\nwhere γ = (γ_1^s, γ_1^c, ..., γ_m^s, γ_m^c)^⊤ and γ_i^{s/c} = exp(σ^2 ‖u_i‖_2^2 / 2).\n\nComparing (6) to the linear machine based on random Fourier features in (4), we can see that, other than the weights {γ_i^{s/c}}_{i=1}^m, random Fourier features can be viewed as approximating (3) by restricting the solution f(·) to H_a^f.\n\nThe Nyström Method. The Nyström method approximates the full kernel matrix K by first sampling m examples, denoted by x̂_1, ..., x̂_m, and then constructing a low-rank matrix K̂_r = K_b K̂^† K_b^⊤, where K_b = [κ(x_i, x̂_j)]_{N×m}, K̂ = [κ(x̂_i, x̂_j)]_{m×m}, K̂^† is the pseudo-inverse of K̂, and r denotes the rank of K̂. In order to train a linear machine, we can derive a vector representation of the data by z_n(x) = D̂_r^{-1/2} V̂_r^⊤ (κ(x, x̂_1), ..., κ(x, x̂_m))^⊤, where D̂_r = diag(λ̂_1, ..., λ̂_r) and V̂_r = (v̂_1, ..., v̂_r). It is straightforward to verify that z_n(x_i)^⊤ z_n(x_j) = [K̂_r]_{ij}. Given the vector 
representation z_n(x), we then learn a linear machine f(x) = w^⊤ z_n(x) by solving the following optimization problem:\n\nmin_{w ∈ R^r}  (λ/2) ‖w‖_2^2 + (1/N) Σ_{i=1}^N ℓ(w^⊤ z_n(x_i), y_i).   (7)\n\n^3 We use H_D instead of H_κ in (3) owing to the representer theorem [16].\n\nIn order to see how the Nyström method can be cast into the unified framework of approximating the large-scale kernel machine by functional approximation, we construct the functional space H_a^n = span(φ̂_1, ..., φ̂_r), where φ̂_1, ..., φ̂_r are the first r normalized eigenfunctions of the operator L_m. The following proposition shows that the linear machine in (7), using the vector representation of the Nyström method, is equivalent to the approximate kernel machine in (3) with the solution f(·) restricted to the approximate functional space H_a^n.\n\nProposition 2. The linear machine in (7) is equivalent to the following approximate kernel machine\n\nmin_{f ∈ H_a^n}  (λ/2) ‖f‖_{H_κ}^2 + (1/N) Σ_{i=1}^N ℓ(f(x_i), y_i).   (8)\n\nAlthough both random Fourier features and the Nyström method can be viewed as variants of the unified framework, they differ significantly in the construction of the approximate functional space H_a. In particular, the basis functions used by random Fourier features are sampled from a Gaussian distribution that is independent of the training examples. In contrast, the basis functions used by the Nyström method are sampled from the training examples and are therefore data dependent. This difference, although subtle, can have a significant impact on the classification performance. 
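To make the two vector representations concrete, the following NumPy sketch (our illustration, not the authors' code; the uniform-sampling scheme and all names are assumptions) constructs both z_f and z_n for a toy data set and checks the identity z_n(x_i)^⊤ z_n(x_j) = [K̂_r]_{ij} stated above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, m, sigma = 300, 4, 50, 1.0
X = rng.normal(size=(N, d))

def rbf(A, B, sigma):
    # pairwise RBF kernel matrix between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

# Random Fourier features z_f: data-independent basis (random sin/cos projections)
U = rng.normal(scale=1.0 / sigma, size=(m, d))
P = X @ U.T
Zf = np.hstack([np.sin(P), np.cos(P)]) / np.sqrt(m)  # Zf @ Zf.T approximates K in expectation

# Nystrom features z_n: basis functions built from m sampled training points
idx = rng.choice(N, size=m, replace=False)
Kb = rbf(X, X[idx], sigma)            # N x m
Khat = rbf(X[idx], X[idx], sigma)     # m x m
lam, V = np.linalg.eigh(Khat)         # eigenvalues in ascending order
r = int((lam > 1e-10).sum())          # numerical rank of Khat
Vr, lam_r = V[:, -r:], lam[-r:]
Zn = Kb @ Vr / np.sqrt(lam_r)         # rows are z_n(x_i) = D_r^{-1/2} V_r^T (kappa(x_i, xhat_1), ...)

# z_n reproduces the Nystrom low-rank matrix exactly: Zn Zn^T = Kb Khat^+ Kb^T
Kr = Kb @ (Vr / lam_r) @ Vr.T @ Kb.T
assert np.allclose(Zn @ Zn.T, Kr, atol=1e-6)
```

Note the contrast: the Gram matrix of the data-independent features Zf only approximates K in expectation, whereas Zn reproduces the Nyström low-rank matrix by construction.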
In the case of a large eigengap, i.e., when the first few eigenvalues of the full kernel matrix are much larger than the remaining eigenvalues, the classification performance is mostly determined by the top eigenvectors. Since the Nyström method uses a data-dependent sampling method, it is able to discover the subspace spanned by the top eigenvectors using a small number of samples. In contrast, since random Fourier features are drawn from a distribution independent of the training data, a large number of samples may be required before this subspace is discovered. As a result, we expect a significantly lower generalization error for the Nyström method.\nTo illustrate this point, we generate a synthetic data set consisting of two balanced classes with a total of N = 10,000 data points generated from uniform distributions in two balls of radius 0.5 centered at (-0.5, 0.5) and (0.5, 0.5), respectively. The σ value in the RBF kernel is chosen by cross-validation and is set to 6 for the synthetic data. To avoid a trivial task, 100 redundant features, each drawn from a uniform distribution on the unit interval, are added to each example. The data points in the first two dimensions are plotted in Figure 1(a)^4, and the eigenvalue distribution is shown in Figure 1(b). According to the results shown in Figure 1(c), it is clear that the Nyström method performs significantly better than random Fourier features. By using only 100 samples, the Nyström method is able to make perfect predictions, while the decisions made by the method based on random Fourier features are close to random guessing. 
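A smaller-scale variant of this synthetic setup can be reproduced with a few lines of NumPy (our sketch, not the paper's code; we use 1,000 points rather than 10,000, so the constants differ from the paper's figures) to inspect the eigen-spectrum of the kernel matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500  # points per class (the paper uses 5,000 per class)

def ball(center, n):
    # uniform samples in a 2-D ball of radius 0.5 via rejection sampling
    pts = rng.uniform(-0.5, 0.5, size=(8 * n, 2))
    pts = pts[np.linalg.norm(pts, axis=1) <= 0.5][:n]
    return center + pts

X2 = np.vstack([ball(np.array([-0.5, 0.5]), n), ball(np.array([0.5, 0.5]), n)])
X = np.hstack([X2, rng.uniform(0, 1, size=(2 * n, 100))])  # add 100 redundant features

sigma = 6.0
sq = (X ** 2).sum(axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))
lam = np.sort(np.linalg.eigvalsh(K))[::-1]

print(lam[:5] / lam[0])  # the drop after the second eigenvalue reflects the two clusters
```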
To evaluate the approximation error of the functional space, we plot in Figures 1(e) and 1(f), respectively, the first two eigenvectors of the approximate kernel matrix computed by the Nyström method and by random Fourier features using 100 samples. Compared to the eigenvectors computed from the full kernel matrix (Figure 1(d)), we can see that the Nyström method achieves a significantly better approximation of the first two eigenvectors than random Fourier features.\nFinally, we note that although the concept of the eigengap has been exploited in many studies of kernel learning [2, 12, 1, 17], to the best of our knowledge, this is the first time it has been incorporated in the analysis of approximate large-scale kernel learning.\n\n3 Main Theoretical Result\n\nLet f*_m be the optimal solution to the approximate kernel learning problem in (8), and let f*_N be the optimal solution to the full version of kernel learning in (3). Let f* be the optimal solution to\n\nmin_{f ∈ H_κ}  F(f) = (λ/2) ‖f‖_{H_κ}^2 + E[ℓ(f(x), y)],\n\nwhere E[·] takes the expectation over the joint distribution P(x, y). Following [10], we define the excess risk of any classifier f ∈ H_κ as\n\nΛ(f) = F(f) - F(f*).   (9)\n\n^4 Note that the scales of the two axes in Figure 1(a) are different.\n\n(a) Synthetic data: the first two dimensions. (b) Eigenvalues (in logarithmic scale) vs. rank. 
N is the total number of data points. (c) Classification accuracy vs. the number of samples. (d) The first two eigenvectors of the full kernel matrix. (e) The first two eigenvectors computed by the Nyström method. (f) The first two eigenvectors computed by random Fourier features.\n\nFigure 1: An illustrative example.\n\nUnlike [6], in this work we aim to bound the generalization performance of f*_m by the generalization performance of f*_N, which better reflects the impact of approximating H_D by H_a^n.\nIn order to obtain a tight bound, we exploit the local Rademacher complexity [10]. Define ψ(δ) = ((2/N) Σ_{i=1}^N min(δ^2, λ_i))^{1/2}. Let ε̃ be the solution to ε̃^2 = ψ(ε̃), where the existence and uniqueness of ε̃ are determined by the sub-root property of ψ(δ) [4], and let ε = max(ε̃, √(6 ln N / N)). According to [10], we have ε^2 = O(N^{-1/2}), and when the eigenvalues of the kernel function follow a p-power law, this is improved to ε^2 = O(N^{-p/(p+1)}). The following theorem bounds Λ(f*_m) by Λ(f*_N). Section 4 will be devoted to the proof of this theorem.\n\nTheorem 1. Assume 16ε^2 e^{-2N} ≤ λ ≤ 1, λ_{r+1} = O(N/m), and\n\n(λ_r - λ_{r+1})/N = Ω(1) ≥ 3 (2 ln(2N^3)/m + √(2 ln(2N^3)/m)).\n\nThen, with probability 1 - 3N^{-3}, we have\n\nΛ(f*_m) ≤ 3Λ(f*_N) + Õ((1/λ)(ε^2 + 1/m)),\n\nwhere Õ(·) suppresses polynomial terms in ln N.\n\nTheorem 1 shows that the additional error caused by the approximation of the Nyström method is improved to O(1/m) when there is a large gap between λ_r and λ_{r+1}. 
Note that the improvement from O(1/√m) to O(1/m) is very significant from the theoretical viewpoint, because it is well known that the generalization error for kernel learning is O(N^{-1/2}) [4]^5. As a result, to achieve a performance similar to that of standard kernel learning, the number of required samples has to be O(N) if the additional error caused by the kernel approximation is bounded by O(1/√m), leading to a high computational cost. On the other hand, with an O(1/m) bound for the additional error caused by the kernel approximation, the number of required samples is reduced to √N, making it more practical for large-scale kernel learning.\nWe also note that the improvement made for the Nyström method relies on the property that H_a^n ⊂ H_D and therefore requires data-dependent basis functions. As a result, it does not carry over to random Fourier features.\n\n^5 It is possible to achieve a better generalization error bound of O(N^{-p/(p+1)}) by assuming that the eigenvalues of the kernel matrix follow a p-power law [10]. However, a large eigengap does not immediately imply a power-law distribution for the eigenvalues, and consequently does not imply a better generalization error bound.\n\n4 Analysis\n\nIn this section, we present the analysis that leads to Theorem 1. 
Most of the proofs can be found in the supplementary materials. We first present a theorem showing that the excess risk bound of f*_m is related to the matrix approximation error ‖K - K̂_r‖_2.\n\nTheorem 2. For 16ε^2 e^{-2N} ≤ λ ≤ 1, with probability 1 - 2N^{-3}, we have\n\nΛ(f*_m) ≤ 3Λ(f*_N) + C_2 (ε^2/λ + ‖K - K̂_r‖_2/(Nλ) + e^{-N}),\n\nwhere C_2 is a numerical constant.\n\nIn the sequel, we let K_r be the best rank-r approximation matrix for K. By the triangle inequality, ‖K - K̂_r‖_2 ≤ ‖K - K_r‖_2 + ‖K_r - K̂_r‖_2 ≤ λ_{r+1} + ‖K_r - K̂_r‖_2; we thus proceed to bound ‖K_r - K̂_r‖_2. Using the eigenfunctions of L_m and L_N, we define two linear operators H_r and Ĥ_r as\n\nH_r[f](·) = Σ_{i=1}^r φ_i(·) ⟨φ_i, f⟩_{H_κ},   Ĥ_r[f](·) = Σ_{i=1}^r φ̂_i(·) ⟨φ̂_i, f⟩_{H_κ},\n\nwhere f ∈ H_κ. The following theorem shows that ‖K_r - K̂_r‖_2 is related to the linear operator ΔH = H_r - Ĥ_r.\n\nTheorem 3. For λ̂_r > 0 and λ_r > 0, we have\n\n‖K̂_r - K_r‖_2 ≤ N ‖L_N^{1/2} ΔH L_N^{1/2}‖_2,   (10)\n\nwhere ‖L‖_2 stands for the spectral norm of a linear operator L.\n\nGiven the result in Theorem 3, we move on to bound the spectral norm of L_N^{1/2} ΔH L_N^{1/2}. To this end, we assume a sufficiently large eigengap Δ = (λ_r - λ_{r+1})/N. 
The theorem below bounds ‖L_N^{1/2} ΔH L_N^{1/2}‖_2 using matrix perturbation theory [19].\n\nTheorem 4. For Δ = (λ_r - λ_{r+1})/N > 3‖L_N - L_m‖_{HS}, we have\n\n‖L_N^{1/2} ΔH L_N^{1/2}‖_2 ≤ η · 4‖L_N - L_m‖_{HS} / (Δ - ‖L_N - L_m‖_{HS}),\n\nwhere η = max(√(λ_{r+1}/N), 2‖L_N - L_m‖_{HS} / (Δ - ‖L_N - L_m‖_{HS})).\n\nRemark. To utilize the result in Theorem 4, we consider the case where λ_{r+1} = O(N/m) and Δ = Ω(1). We have\n\n‖L_N^{1/2} ΔH L_N^{1/2}‖_2 ≤ O(max((1/√m) ‖L_N - L_m‖_{HS}, ‖L_N - L_m‖_{HS}^2)).\n\nObviously, in order to achieve an O(1/m) bound for ‖L_N^{1/2} ΔH L_N^{1/2}‖_2, we need an O(1/√m) bound for ‖L_N - L_m‖_{HS}, which is given by the following theorem.\n\nTheorem 5. For κ(x, x) ≤ 1, ∀x ∈ X, with probability 1 - N^{-3}, we have\n\n‖L_N - L_m‖_{HS} ≤ 2 ln(2N^3)/m + √(2 ln(2N^3)/m).\n\nTheorem 5 directly follows from Lemma 2 of [18]. Therefore, by assuming the conditions in Theorem 1 and combining the results of Theorems 3, 4, and 5, we immediately have ‖K - K̂_r‖_2 ≤ O(N/m). Combining this bound with the result in Theorem 2 and using the union bound, we have, with probability 1 - 3N^{-3}, Λ(f*_m) ≤ 3Λ(f*_N) + (C/λ)(ε^2 + 1/m + e^{-N}). We complete the proof of Theorem 1 by using the fact that e^{-N} < 1/N ≤ 1/m.\n\n5 Empirical Studies\n\nTo verify our theoretical findings, we evaluate the empirical performance of the Nyström method and random Fourier features for large-scale kernel learning. 
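As a small self-contained warm-up (our illustration on an arbitrary random data set, not the paper's experimental code), the spectral-norm error ‖K - K̂_r‖_2 bounded in the analysis above can be computed directly for an increasing number of sampled columns m:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, sigma = 400, 3, 1.0
X = rng.normal(size=(N, d))
sq = (X ** 2).sum(axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / (2 * sigma ** 2))

def nystrom_error(m):
    # spectral-norm error ||K - Kb Khat^+ Kb^T||_2 with m uniformly sampled columns
    idx = rng.choice(N, size=m, replace=False)
    Kb, Khat = K[:, idx], K[np.ix_(idx, idx)]
    Kr = Kb @ np.linalg.pinv(Khat, rcond=1e-8) @ Kb.T
    return np.linalg.norm(K - Kr, 2)

errs = {m: nystrom_error(m) for m in (10, 40, 160)}
print(errs)  # the error shrinks as more columns are sampled
```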
Table 1 summarizes the statistics of the six data sets used in our study, including two for regression and four for classification. Note that the data sets CPU, CENSUS, ADULT, and FOREST were originally used in [13] to verify the effectiveness of random Fourier features. We evaluate classification performance by accuracy, and regression performance by the mean square error on the testing data.\nWe use uniform sampling in the Nyström method owing to its simplicity. We note that the empirical performance of the Nyström method may be improved by using a different implementation [21, 11]. We downloaded the code for random Fourier features from http://berkeley.intel-research.net/arahimi/c/random-features. An RBF kernel is used for both methods and for all the data sets. A ridge regression package from [13] is used for the two regression tasks, and LIBSVM [5] is used for the classification tasks. All parameters are selected by 5-fold cross-validation. All experiments are repeated ten times, and the prediction performance averaged over the ten trials is reported.\nFigure 2 shows the performance of both methods with a varied number of random samples. Note that for the large data sets (i.e., COVTYPE and FOREST), we restrict the maximum number of random samples to 200 because of the high computational cost. We observe that for all the data sets, the Nyström method outperforms random Fourier features^6. Moreover, except for COVTYPE with 10 random samples, the Nyström method performs significantly better than random Fourier features according to t-tests at the 95% significance level. We finally evaluate whether the large-eigengap condition, the key assumption for our main theoretical result, holds for these data sets. Due to their large size, for all data sets except CPU we compute the eigenvalues of the kernel matrix based on 10,000 randomly selected examples from each data set. 
As shown in Figure 3 (eigenvalues are on a logarithmic scale), we observe that the eigenvalues drop very quickly as the rank increases, leading to a significant gap between the top eigenvalues and the remaining eigenvalues.\n\n^6 We note that the classification performance on the ADULT data set reported in Figure 2 does not match the performance reported in [13]. Given that we use the code provided by [13] and follow the same cross-validation procedure, we believe our result is correct. We did not use the KDDCup data set because of the problem of oversampling, as pointed out in [13].\n\nTable 1: Statistics of data sets\n\nDATA      TASK    # TRAIN   # TEST   # Attr.\nCPU       Reg.      6,554      819        21\nCENSUS    Reg.     18,186    2,273       119\nADULT     Class.   32,561   16,281       123\nCOD-RNA   Class.   59,535  271,617         8\nCOVTYPE   Class.  464,810  116,202        54\nFOREST    Class.  522,910   58,102        54\n\nFigure 2: Comparison of the Nyström method and random Fourier features. For regression tasks, the mean square error (with std.) is reported, and for classification tasks, accuracy (with std.) is reported.\n\nFigure 3: The eigenvalue distributions of kernel matrices. N is the number of examples used to compute the eigenvalues.\n\n6 Conclusion and Discussion\n\nWe study two methods for large-scale kernel learning, i.e., the Nyström method and random Fourier features. One key difference between these two approaches is that the Nyström method uses data-dependent basis functions while random Fourier features introduce data-independent basis functions. This difference leads to an improved analysis for kernel learning approaches based on the Nyström method. We show that when there is a large eigengap in the kernel matrix, the approximation error of the Nyström method can be improved to O(1/m), leading to a significantly better generalization performance than random Fourier features. 
We verify this claim by an empirical study.\nAs implied by our study, it is important to develop data-dependent basis functions for large-scale kernel learning. One direction we plan to explore is to improve random Fourier features by making the sampling data dependent. This can be achieved by introducing a rejection procedure that rejects sampled Fourier components when they do not align well with the top eigenfunctions estimated from the sampled data.\n\nAcknowledgments\n\nThis work was partially supported by ONR Award N00014-09-1-0663, NSF IIS-0643494, NSFC (61073097), and 973 Program (2010CB327903).\n\nReferences\n\n[1] A. Azran and Z. Ghahramani. Spectral methods for automatic multiscale data clustering. In CVPR, pages 190–197, 2006.\n\n[2] F. R. Bach and M. I. Jordan. Learning spectral clustering. 
Technical Report UCB/CSD-03-1249, EECS Department, University of California, Berkeley, 2003.\n\n[3] F. R. Bach and M. I. Jordan. Predictive low-rank decomposition for kernel methods. In ICML, pages 33–40, 2005.\n\n[4] P. L. Bartlett, O. Bousquet, and S. Mendelson. Local Rademacher complexities. Annals of Statistics, pages 44–58, 2002.\n\n[5] C. Chang and C. Lin. LIBSVM: a library for support vector machines. TIST, 2(3):27, 2011.\n\n[6] C. Cortes, M. Mohri, and A. Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, pages 113–120, 2010.\n\n[7] O. Dekel, S. Shalev-Shwartz, and Y. Singer. The forgetron: A kernel-based perceptron on a fixed budget. In NIPS, 2005.\n\n[8] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.\n\n[9] J. Kivinen, A. J. Smola, and R. C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, pages 2165–2176, 2004.\n\n[10] V. Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems. Springer, 2011.\n\n[11] S. Kumar, M. Mohri, and A. Talwalkar. Ensemble Nyström method. In NIPS, pages 1060–1068, 2009.\n\n[12] U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.\n\n[13] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.\n\n[14] A. Rahimi and B. Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In NIPS, pages 1313–1320, 2009.\n\n[15] W. Rudin. Fourier Analysis on Groups. Wiley-Interscience, 1990.\n\n[16] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.\n\n[17] T. Shi, M. Belkin, and B. Yu. 
Data spectroscopy: eigenspace of convolution operators and clustering. The Annals of Statistics, 37(6B):3960–3984, 2009.\n\n[18] S. Smale and D.-X. Zhou. Geometry on probability spaces. Constructive Approximation, 30(3):311–323, 2009.\n\n[19] G. W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.\n\n[20] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.\n\n[21] K. Zhang, I. W. Tsang, and J. T. Kwok. Improved Nyström low-rank approximation and error analysis. In ICML, pages 1232–1239, 2008.\n", "award": [], "sourceid": 248, "authors": [{"given_name": "Tianbao", "family_name": "Yang", "institution": null}, {"given_name": "Yu-feng", "family_name": "Li", "institution": null}, {"given_name": "Mehrdad", "family_name": "Mahdavi", "institution": null}, {"given_name": "Rong", "family_name": "Jin", "institution": null}, {"given_name": "Zhi-Hua", "family_name": "Zhou", "institution": null}]}