{"title": "Fast Krylov Methods for N-Body Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 251, "page_last": 258, "abstract": null, "full_text": "Fast Krylov Methods for N-Body Learning\n\nNando de Freitas Department of Computer Science University of British Columbia nando@cs.ubc.ca Maryam Mahdaviani Department of Computer Science University of British Columbia maryam@cs.ubc.ca\n\nYang Wang School of Computing Science Simon Fraser University ywang12@cs.sfu.ca Dustin Lang Department of Computer Science University of Toronto dalang@cs.ubc.ca\n\nAbstract\nThis paper addresses the issue of numerical computation in machine learning domains based on similarity metrics, such as kernel methods, spectral techniques and Gaussian processes. It presents a general solution strategy based on Krylov subspace iteration and fast N-body learning methods. The experiments show significant gains in computation and storage on datasets arising in image segmentation, object detection and dimensionality reduction. The paper also presents theoretical bounds on the stability of these methods.\n\n1 Introduction\nMachine learning techniques based on similarity metrics have gained wide acceptance over the last few years. Spectral clustering [1] is a typical example. Here one forms a Laplacian matrix $L = D^{-1/2} W D^{-1/2}$, where the entries of $W$ measure the similarity between data points $x_i \\in X$, $i = 1, \\ldots, N$. For example, a popular choice is to set the entries of $W$ to $w_{ij} = e^{-\\frac{1}{\\sigma} \\|x_i - x_j\\|^2}$, where $\\sigma$ is a user-specified parameter. $D$ is a normalizing diagonal matrix with entries $d_i = \\sum_j w_{ij}$. The clusters can be found by running, say, K-means on the eigenvectors of $L$. K-means generates better clusters on this nonlinear embedding of the data provided one adopts a suitable similarity metric. 
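As a point of reference for the costs discussed next, the construction above can be sketched directly with dense linear algebra. This is only an illustrative baseline under our own assumptions: the function name spectral_embedding and its arguments are hypothetical, the diagonal of W is kept as is, and the dense eigensolver is exactly the cubic-cost step that the rest of the paper aims to avoid.

```python
import numpy as np

def spectral_embedding(X, sigma, k):
    # Hypothetical helper: dense baseline for the spectral embedding.
    # Pairwise squared distances ||x_i - x_j||^2, clipped at zero for safety.
    sq = np.sum(X * X, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (X @ X.T), 0.0)
    # Gaussian similarities w_ij = exp(-||x_i - x_j||^2 / sigma).
    W = np.exp(-d2 / sigma)
    # Degrees d_i = sum_j w_ij and the normalized matrix D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    L = W * s[:, None] * s[None, :]
    # Dense symmetric eigensolver; eigenvalues are returned in ascending order.
    vals, vecs = np.linalg.eigh(L)
    # Keep the k leading eigenpairs; the rows of vecs are the embedded points.
    return vals[-k:], vecs[:, -k:]
```

The rows of the returned eigenvector matrix would then be handed to K-means. Storing W already takes quadratic memory, which motivates the matrix-free methods below.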
The list of machine learning domains where one forms a covariance or similarity matrix (be it $W$, $D^{-1} W$ or $D - W$) is vast and includes ranking on nonlinear manifolds [2], semi-supervised and active learning [3], Gaussian processes [4], Laplacian eigenmaps [5], stochastic neighbor embedding [6], multi-dimensional scaling, kernels on graphs [7] and many other kernel methods for dimensionality reduction, feature extraction, regression and classification. In these settings, one is interested in either inverting the similarity matrix or finding some of its eigenvectors. The computational cost of both of these operations is $O(N^3)$ while the storage requirement is $O(N^2)$. These costs are prohibitively large in applications where one encounters massive quantities of data points or where one is interested in real-time solutions such as spectral image segmentation for mobile robots [8]. In this paper, we present general numerical techniques for reducing the computational cost to $O(N \\log N)$, or even $O(N)$ in specific cases, and the storage cost to $O(N)$. These reductions are achieved by combining Krylov subspace iterative solvers (such as Arnoldi, Lanczos, GMRES and conjugate gradients) with fast kernel density estimation (KDE) techniques (such as fast multipole expansions, the fast Gauss transform and dual tree recursions [9, 10, 11]). Specific Krylov methods have been applied to kernel problems. For example, [12] uses Lanczos for spectral clustering and [4] uses conjugate gradients for Gaussian processes. However, the use of fast KDE methods, in particular fast multipole methods, to further accelerate these techniques has only appeared in the context of interpolation [13] and our paper on semi-supervised learning [8]. Here, we give a more general exposition and present several new examples, such as fast nonlinear embeddings and fast Gaussian processes. More importantly, we attack the issue of stability of these methods. 
Fast KDE techniques have guaranteed error bounds. However, if these techniques are used inside iterative schemes based on orthogonalization of the Krylov subspace, there is a danger that the errors might grow over iterations. In practice, good behaviour has been observed. In Section 4, we present theoretical results that explain these observations and shed light on the behaviour of these algorithms. Before doing so, we begin with a very brief review of Krylov solvers and fast KDE methods.\n\n2 Krylov subspace iteration\nThis section is a compressed overview of Krylov subspace iteration. The main message is that Krylov methods are very efficient algorithms for solving linear systems and eigenvalue problems, but they require a matrix-vector multiplication at each iteration. In the next section, we replace this expensive matrix-vector multiplication with a call to fast KDE routines. Readers happy with this message and familiar with Krylov methods, such as conjugate gradients and Lanczos, can skip the rest of this section. For ease of presentation, let the similarity matrix be simply $A = W \\in \\mathbb{R}^{N \\times N}$, with entries $a_{ij} = a(x_i, x_j)$. (One can easily handle other cases, such as $A = D^{-1} W$ and $A = D - W$.) Typical measures of similarity include polynomial $a(x_i, x_j) = (x_i x_j^T + b)^p$, Gaussian $a(x_i, x_j) = e^{-\\frac{1}{\\sigma} (x_i - x_j)(x_i - x_j)^T}$ and sigmoid $a(x_i, x_j) = \\tanh(\\alpha x_i x_j^T - \\beta)$ kernels, where $x_i x_j^T$ denotes a scalar inner product. Our goal is to solve linear systems $Ax = b$ and (possibly generalized) eigenvalue problems $Ax = \\lambda x$. The former arise, for example, in semi-supervised learning and Gaussian processes, while the latter arise in spectral clustering and dimensionality reduction. One could attack these problems with naive iterative methods such as the power method, Jacobi and Gauss-Seidel [14]. The problem with these strategies is that the estimate $x^{(t)}$, at iteration $t$, only depends on the previous estimate $x^{(t-1)}$. 
Hence, these methods typically take too many iterations to converge. It is well accepted in the numerical computation field that Krylov methods [14, 15], which make use of the entire history of solutions $\\{x^{(1)}, \\ldots, x^{(t-1)}\\}$, converge at a faster rate. The intuition behind Krylov subspace methods is to use the history of the solutions we have already computed. We formulate this intuition in terms of projecting an $N$-dimensional problem onto a lower dimensional subspace. Given a matrix $A$ and a vector $b$, the associated Krylov matrix is: $K = [b \\;\\; Ab \\;\\; A^2 b \\;\\; \\ldots]$. The Krylov subspaces are the spaces spanned by the column vectors of this matrix. In order to find a new estimate of $x^{(t)}$ we could project onto the Krylov subspace. However, $K$ is a poorly conditioned matrix. (As in the power method, $A^t b$ is converging to the eigenvector corresponding to the largest eigenvalue of $A$.) We therefore need to construct a well-conditioned orthogonal matrix $Q^{(t)} = [q^{(1)} \\cdots q^{(t)}]$, with $q^{(i)} \\in \\mathbb{R}^N$, that spans the Krylov space. That is, the leading $t$ columns of $K$ and $Q$ span the same space. This is easily done using the QR-decomposition of $K$ [14], yielding the following Arnoldi relation (augmented Schur factorization):\n\n$$A Q^{(t)} = Q^{(t+1)} H^{(t)},$$\n\nwhere $H^{(t)}$ is the augmented Hessenberg matrix:\n\n$$H^{(t)} = \\begin{bmatrix} h_{1,1} & h_{1,2} & h_{1,3} & \\cdots & h_{1,t} \\\\ h_{2,1} & h_{2,2} & h_{2,3} & \\cdots & h_{2,t} \\\\ 0 & h_{3,2} & h_{3,3} & \\cdots & h_{3,t} \\\\ \\vdots & \\ddots & \\ddots & \\ddots & \\vdots \\\\ 0 & \\cdots & 0 & h_{t,t-1} & h_{t,t} \\\\ 0 & \\cdots & 0 & 0 & h_{t+1,t} \\end{bmatrix}.$$\n\nThe eigenvalues of the smaller $(t + 1) \\times t$ Hessenberg matrix approximate the eigenvalues of $A$ as $t$ increases. These eigenvalues can be computed efficiently by applying the Arnoldi relation recursively as shown in Figure 1. (If $A$ is symmetric, then $H$ is tridiagonal and we obtain the Lanczos algorithm.) Notice that the matrix-vector multiplication $v = Aq$ is the expensive step in the Arnoldi algorithm. Most Krylov algorithms resemble the Arnoldi algorithm in this. 
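The Arnoldi recursion can be sketched in a few lines. The version below is a minimal illustration, not the implementation used in the experiments: the name arnoldi and the callable matvec are our own hypothetical choices, and orthogonalization is done with modified Gram-Schmidt. Passing the product as a callable is what later allows an exact multiplication to be swapped for a fast approximate one.

```python
import numpy as np

def arnoldi(matvec, b, t):
    # Hypothetical helper: t steps of Arnoldi started from b.
    # matvec(q) must return A @ q; this is the only place A is touched.
    n = b.shape[0]
    Q = np.zeros((n, t + 1))
    H = np.zeros((t + 1, t))
    Q[:, 0] = b / np.linalg.norm(b)
    for k in range(t):
        v = matvec(Q[:, k])                 # expensive matrix-vector product
        for j in range(k + 1):              # orthogonalize against q(1..k+1)
            H[j, k] = Q[:, j] @ v
            v = v - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(v)
        if H[k + 1, k] < 1e-12:             # invariant subspace found: stop early
            return Q[:, :k + 2], H[:k + 2, :k + 1]
        Q[:, k + 1] = v / H[k + 1, k]
    return Q, H
```

On return, the columns of Q are orthonormal and A times the first t columns of Q equals Q times H up to rounding, which is the Arnoldi relation stated above.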
To solve systems of equations, we can minimize either the residual $r^{(t)} = b - A x^{(t)}$, leading to the GMRES and MINRES algorithms, or the A-norm, leading to conjugate gradients (CG) [14]. GMRES, MINRES and CG apply to general, symmetric, and spd matrices respectively.\n\nArnoldi:\n  Initialization: $b$ = arbitrary, $q^{(1)} = b / \\|b\\|$\n  FOR $t = 1, 2, 3, \\ldots$\n    $v = A q^{(t)}$\n    FOR $j = 1, \\ldots, t$\n      $h_{j,t} = q^{(j)T} v$\n      $v = v - h_{j,t} q^{(j)}$\n    $h_{t+1,t} = \\|v\\|$\n    $q^{(t+1)} = v / h_{t+1,t}$\n\nGMRES:\n  Initialization: $q^{(1)} = b / \\|b\\|$\n  FOR $t = 1, 2, 3, \\ldots$\n    Perform step $t$ of the Arnoldi algorithm\n    $y^{(t)} = \\arg\\min_y \\|H^{(t)} y - \\|b\\| i\\|$\n    Set $x^{(t)} = Q^{(t)} y^{(t)}$\n\nFigure 1: The Arnoldi (top) and GMRES (bottom) algorithms.\n\nFor ease of presentation, we focus on the GMRES algorithm. At step $t$ of GMRES, we approximate the solution by the vector in the Krylov subspace $x^{(t)} \\in \\mathcal{K}^{(t)}$ that minimizes the norm of the residual. Since $x^{(t)}$ is in the Krylov subspace, it can be written as a linear combination of the columns of the Krylov matrix $K^{(t)}$. Our problem therefore reduces to finding the vector $y \\in \\mathbb{R}^t$ that minimizes $\\|A K^{(t)} y - b\\|$. As before, stability considerations force us to use the QR decomposition of $K^{(t)}$. That is, instead of using a linear combination of the columns of $K^{(t)}$, we use a linear combination of the columns of $Q^{(t)}$. So our least squares problem becomes $y^{(t)} = \\arg\\min_y \\|A Q^{(t)} y - b\\|$. Since $A Q^{(t)} = Q^{(t+1)} H^{(t)}$, we only need to solve a problem of dimension $(t + 1) \\times t$: $y^{(t)} = \\arg\\min_y \\|Q^{(t+1)} H^{(t)} y - b\\|$. Keeping in mind that the columns of the projection matrix $Q$ are orthonormal, we can rewrite this least squares problem as $\\min_y \\|H^{(t)} y - Q^{(t+1)T} b\\|$. We start the iterations with $q^{(1)} = b / \\|b\\|$ and hence $Q^{(t+1)T} b = \\|b\\| i$, where $i$ is the unit vector with a 1 in the first entry. The final form of our least squares problem at iteration $t$ is:\n\n$$y^{(t)} = \\arg\\min_y \\|H^{(t)} y - \\|b\\| i\\|,$$\n\nwith solution $x^{(t)} = Q^{(t)} y^{(t)}$. The algorithm is shown in Figure 1. The least squares problem of size $(t + 1) \\times t$ to compute $y^{(t)}$ can be solved in $O(t)$ steps using Givens rotations [14]. 
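The GMRES iteration described above can be sketched as follows. This is a simplified illustration under stated assumptions: the function name gmres and its parameters are hypothetical, and the small least squares problem is solved from scratch with a generic solver at each step instead of being updated with Givens rotations, so the per-step cost is higher than the O(t) quoted in the text.

```python
import numpy as np

def gmres(matvec, b, max_iter=50, tol=1e-8):
    # Hypothetical helper: unrestarted GMRES with a zero initial guess.
    # matvec(q) returns A @ q, exactly or approximately.
    n = b.shape[0]
    beta = np.linalg.norm(b)
    x = np.zeros(n)
    if beta == 0.0:
        return x
    Q = np.zeros((n, max_iter + 1))
    H = np.zeros((max_iter + 1, max_iter))
    Q[:, 0] = b / beta
    for t in range(1, max_iter + 1):
        k = t - 1
        v = matvec(Q[:, k])                      # expensive step
        for j in range(t):                       # Arnoldi step t
            H[j, k] = Q[:, j] @ v
            v = v - H[j, k] * Q[:, j]
        H[t, k] = np.linalg.norm(v)
        rhs = np.zeros(t + 1)
        rhs[0] = beta                            # ||b|| times the first unit vector
        Ht = H[:t + 1, :t]
        y = np.linalg.lstsq(Ht, rhs, rcond=None)[0]
        x = Q[:, :t] @ y
        measured = np.linalg.norm(Ht @ y - rhs)  # measured residual norm
        if measured <= tol * beta or H[t, k] < 1e-14:
            break
        Q[:, t] = v / H[t, k]
    return x
```

With an exact product, for example gmres(lambda q: A @ q, b), the measured residual coincides with the true one up to rounding. With an approximate product the two can drift apart, which is the subject of the stability analysis.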
Notice again that the expensive step in each iteration is the matrix-vector product $v = Aq$. This is true also of CG and other Krylov methods. One important property of the Arnoldi relation is that the residuals are orthogonal to the space spanned by the columns of $V = Q^{(t+1)} H^{(t)}$. That is,\n\n$$V^T r^{(t)} = H^{(t)T} Q^{(t+1)T} (b - Q^{(t+1)} H^{(t)} y^{(t)}) = H^{(t)T} \\|b\\| i - H^{(t)T} H^{(t)} y^{(t)} = 0.$$\n\nIn the following section, we introduce methods to speed up the matrix-vector product $v = Aq$. These methods will incur, at most, a pre-specified (tolerance) error $e^{(t)}$ at iteration $t$. Later, we present theoretical bounds on how these errors affect the residuals and the orthogonality of the Krylov subspace.\n\n3 Fast KDE\nThe expensive step in Krylov methods is the operation $v = A q^{(t)}$. This step requires that we solve two $O(N^2)$ kernel estimates:\n\n$$v_i = \\sum_{j=1}^{N} q_j^{(t)} a(x_i, x_j), \\quad i = 1, 2, \\ldots, N.$$\n\nIt is possible to reduce the storage and computational cost to $O(N)$ at the expense of a small specified error tolerance $\\epsilon$, say $10^{-6}$, using the fast Gauss transform (FGT) algorithm [16, 17]. This algorithm is an instance of more general fast multipole methods for solving $N$-body interactions [9]. The FGT applies when the problem is low dimensional, say $x_k \\in \\mathbb{R}^3$. However, to attack larger dimensions one can adopt clustering-based partitions as in the improved fast Gauss transform (IFGT) [10]. Fast multipole methods tend to work only in low dimensions and are specific to the choice of similarity metric. Dual tree recursions based on KD-trees and ball trees [11, 18] overcome these difficulties, but on average cost $O(N \\log N)$. Due to space constraints, we can only mention these techniques here, but refer the reader to [18] for a thorough comparison.\n\n4 Stability results\nThe problem with approximating the matrix-vector multiplication at each iteration of the Krylov methods is that we do not know how the errors accumulate over successive iterations. 
In this section, we will derive bounds that describe what factors influence these errors. In particular, the bounds will state what properties of the similarity metric and measurable quantities affect the residuals and the orthogonality of the Krylov subspaces. Several papers have addressed the issue of Krylov subspace stability [19, 20, 21]. Our approach follows from [21]. For presentation purposes, we focus on the GMRES algorithm. Let $e^{(t)}$ denote the errors introduced in the approximate matrix-vector multiplication at each iteration of Arnoldi. For the purposes of upper-bounding, this is the tolerance of the fast KDE methods. Then, the fast KDE methods change the Arnoldi relation to:\n\n$$A Q^{(t)} + E^{(t)} = [A q^{(1)} + e^{(1)}, \\ldots, A q^{(t)} + e^{(t)}] = Q^{(t+1)} H^{(t)},$$\n\nwhere $E^{(t)} = [e^{(1)}, \\ldots, e^{(t)}]$. The new true residuals are therefore:\n\n$$r^{(t)} = b - A x^{(t)} = b - A Q^{(t)} y^{(t)} = b - Q^{(t+1)} H^{(t)} y^{(t)} + E^{(t)} y^{(t)} = \\tilde{r}^{(t)} + E^{(t)} y^{(t)},$$\n\nand $\\tilde{r}^{(t)} = b - Q^{(t+1)} H^{(t)} y^{(t)}$ are the measured residuals.\n\nWe need to ensure two bounds when using fast KDE methods in Krylov iterations. First, the measured residuals $\\tilde{r}^{(t)}$ should not deviate too far from the true residuals $r^{(t)}$. Second, deviations from orthogonality should be upper-bounded. Let us address the first question. The deviation in residuals is given by $\\|r^{(t)} - \\tilde{r}^{(t)}\\| = \\|E^{(t)} y^{(t)}\\|$. Let $y^{(t)} = [y_1, \\ldots, y_t]^T$. Then, this deviation satisfies:\n\n$$\\|r^{(t)} - \\tilde{r}^{(t)}\\| = \\left\\| \\sum_{k=1}^{t} y_k e^{(k)} \\right\\| \\leq \\sum_{k=1}^{t} |y_k| \\|e^{(k)}\\|. \\quad (1)$$\n\nThe deviation from orthogonality can be upper-bounded in a similar fashion:\n\n$$\\|V^T r^{(t)}\\| = \\|H^{(t)T} Q^{(t+1)T} (\\tilde{r}^{(t)} + E^{(t)} y^{(t)})\\| = \\|H^{(t)T} Q^{(t+1)T} E^{(t)} y^{(t)}\\| \\leq \\|H^{(t)}\\| \\sum_{k=1}^{t} |y_k| \\|e^{(k)}\\|. \\quad (2)$$\n\nThe following lemma provides a relation between the $y_k$ and the measured residuals $\\tilde{r}^{(k-1)}$.\n\nLemma 1. [21, Lemma 5.1] Assume that $t$ iterations of the inexact Arnoldi method have been carried out. Then, for any $k = 1, \\ldots, t$,\n\n$$|y_k| \\leq \\frac{1}{\\sigma_t(H^{(t)})} \\|\\tilde{r}^{(k-1)}\\|. \\quad (3)$$\n\nThe proof of the lemma follows from the QR decomposition of $H^{(t)}$, see [15, 21]. 
This lemma, in conjunction with equations (1) and (2), allows us to establish the main theoretical result of this section:\n\nProposition 1. Let $\\epsilon > 0$. If for every $k \\leq t$ we have\n\n$$\\|e^{(k)}\\| < \\frac{\\sigma_t(H^{(t)})}{t} \\frac{1}{\\|\\tilde{r}^{(k-1)}\\|} \\epsilon,$$\n\nthen $\\|r^{(t)} - \\tilde{r}^{(t)}\\| < \\epsilon$. Moreover, if\n\n$$\\|e^{(k)}\\| < \\frac{\\sigma_t(H^{(t)})}{t \\|H^{(t)}\\|} \\frac{1}{\\|\\tilde{r}^{(k-1)}\\|} \\epsilon,$$\n\nthen $\\|V^T r^{(t)}\\| < \\epsilon$. Here $\\sigma_t(H^{(t)})$ denotes the $t$-th singular value of $H^{(t)}$ and $\\tilde{r}^{(k-1)}$ denotes the measured residual at iteration $k-1$.\n\nProof: First, we have\n\n$$\\|r^{(t)} - \\tilde{r}^{(t)}\\| \\leq \\sum_{k=1}^{t} |y_k| \\|e^{(k)}\\| < \\sum_{k=1}^{t} \\frac{\\sigma_t(H^{(t)})}{t} \\frac{\\epsilon}{\\|\\tilde{r}^{(k-1)}\\|} \\frac{1}{\\sigma_t(H^{(t)})} \\|\\tilde{r}^{(k-1)}\\| = \\epsilon,$$\n\nand similarly,\n\n$$\\|V^T r^{(t)}\\| \\leq \\|H^{(t)}\\| \\sum_{k=1}^{t} |y_k| \\|e^{(k)}\\| < \\|H^{(t)}\\| \\sum_{k=1}^{t} \\frac{\\sigma_t(H^{(t)})}{t \\|H^{(t)}\\|} \\frac{\\epsilon}{\\|\\tilde{r}^{(k-1)}\\|} \\frac{1}{\\sigma_t(H^{(t)})} \\|\\tilde{r}^{(k-1)}\\| = \\epsilon.$$\n\nProposition 1 tells us that in order to keep the residuals bounded while ensuring bounded deviations from orthogonality at iteration $k$, we need to monitor the singular values of $H^{(t)}$ and the measured residuals $\\tilde{r}^{(k-1)}$. Of course, we have no access to $H^{(t)}$. However, monitoring the residuals is of practical value. If the residuals decrease, we can increase the tolerance of the fast KDE algorithms and vice versa. The bounds do lead to a natural way of constructing adaptive algorithms for setting the tolerance of the fast KDE algorithms.\n\n[Figure 2: image panels (a)-(d) and a timing plot. Plot title: Time Comparison; x-axis: Data Set Size (Number of Features); y-axis: Time (Seconds); curves: NAIVE, CG, CG-DT.]\n\nFigure 2: Figure (a) shows a test image from the PASCAL database. Figure (b) shows the SIFT features extracted from the image. Figure (c) shows the positive feature predictions for the label \"car\". Figure (d) shows the centroid of the positive features as a black dot. The plot on the right shows the computational gains obtained by using fast Krylov methods.\n\n5 Experimental results\nThe results of this section demonstrate that significant computational gains may be obtained by combining fast KDE methods with Krylov iterations. We present results in three domains: spectral clustering and image segmentation [1, 12], Gaussian process regression [4] and stochastic neighbor embedding [6]. 
\n\n5.1 Gaussian processes with large dimensional features\nIn this experiment we use Gaussian processes to predict the labels of 128-dimensional SIFT features [22] for the purposes of object detection and localization as shown in Figure 2. There are typically thousands of features per image, so it is of paramount importance to generate fast predictions. The hard computational task here involves inverting the covariance matrix of the Gaussian process. The figure shows that it is possible to do this efficiently, under the same ROC error, by combining conjugate gradients [4] with dual trees.\n\n5.2 Spectral clustering and image segmentation\nWe applied spectral clustering to color image segmentation, a generalized eigenvalue problem. The types of segmentations obtained are shown in Figure 3. There are no perceptible differences between them. We observed that fast Krylov methods run approximately twice as fast as the Nystrom method. One should note that the result of Nystrom depends on the quality of sampling, while fast N-body methods enable us to work directly with the full matrix, so the solution is less sensitive. Once again, fast KDE methods lead to significant computational improvements over Krylov algorithms (Lanczos in this case).\n\n5.3 Stochastic neighbor embedding\nOur final example is again a generalized eigenvalue problem arising in dimensionality reduction. We use the stochastic neighbor embedding algorithm of [6] to project two 3-D structures to 2-D, as shown in Figure 4. 
Again, we observe significant computational improvements.\n\n[Figure 3 plot: x-axis: N; y-axis: Running time (seconds), logarithmic scale; curves: Lanczos, IFGT, Dual Tree.]\n\nFigure 3: (left) Segmentation results (order: original image, IFGT, dual trees and Nystrom) and (right) computational improvements obtained in spectral clustering.\n\n6 Conclusions\nWe presented a general approach for combining Krylov solvers and fast KDE methods to accelerate machine learning techniques based on similarity metrics. We demonstrated some of the methods on several datasets and presented results that shed light on the stability and convergence properties of these methods. One important point to make is that these methods work better when there is structure in the data. There is no computational gain if there is no statistical information in the data. This is a fascinating relation between computation and statistical information, which we believe deserves further research and understanding. One question is how we can design pre-conditioners in order to improve the convergence behavior of these algorithms. Another important avenue for further research is the application of the bounds presented in this paper in the design of adaptive algorithms.\n\nAcknowledgments\nWe would like to thank Arnaud Doucet, Firas Hamze, Greg Mori and Changjiang Yang.\n\nReferences\n[1] A Y Ng, M I Jordan, and Y Weiss. On spectral clustering: Analysis and algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2001.\n[2] D Zhou, J Weston, A Gretton, O Bousquet, and B Scholkopf. Ranking on data manifolds. In Advances in Neural Information Processing Systems, 2004.\n[3] X Zhu, J Lafferty, and Z Ghahramani. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning, pages 912–919, 2003.\n[4] M N Gibbs. 
Bayesian Gaussian processes for regression and classification. PhD thesis, University of Cambridge, 1997.\n[5] M Belkin and P Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.\n[6] G Hinton and S Roweis. Stochastic neighbor embedding. In Advances in Neural Information Processing Systems, pages 833–840, 2002.\n[7] A Smola and R Kondor. Kernels and regularization of graphs. In Computational Learning Theory, pages 144–158, 2003.\n\n[Figure 4 panels: True manifold; Sampled data; Embedding of SNE; Embedding of SNE with IFGT. Timing plots for S-curve and Swiss roll: x-axis: N; y-axis: Running time (seconds), logarithmic scale; curves: SNE, SNE with IFGT.]\n\nFigure 4: Examples of embedding on S-curve and Swiss-roll datasets.\n\n[8] M Mahdaviani, N de Freitas, B Fraser, and F Hamze. Fast computational methods for visually guided robots. In IEEE International Conference on Robotics and Automation, 2004.\n[9] L Greengard and V Rokhlin. A fast algorithm for particle simulations. Journal of Computational Physics, 73:325–348, 1987.\n[10] C Yang, R Duraiswami, N A Gumerov, and L S Davis. Improved fast Gauss transform and efficient kernel density estimation. In International Conference on Computer Vision, Nice, 2003.\n[11] A Gray and A Moore. Rapid evaluation of multiple density models. In Artificial Intelligence and Statistics, 2003.\n[12] J Shi and J Malik. Normalized cuts and image segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 731–737, 1997.\n[13] R K Beatson, J B Cherrie, and C T Mouat. Fast fitting of radial basis functions: Methods based on preconditioned GMRES iteration. Advances in Computational Mathematics, 11:253–270, 1999.\n[14] J W Demmel. Applied Numerical Linear Algebra. SIAM, 1997.\n[15] Y Saad. 
Iterative Methods for Sparse Linear Systems. The PWS Publishing Company, 1996.\n[16] L Greengard and J Strain. The fast Gauss transform. SIAM Journal on Scientific and Statistical Computing, 12(1):79–94, 1991.\n[17] B J C Baxter and G Roussos. A new error estimate of the fast Gauss transform. SIAM Journal on Scientific Computing, 24(1):257–259, 2002.\n[18] D Lang, M Klaas, and N de Freitas. Empirical testing of fast kernel density estimation algorithms. Technical Report TR-2005-03, Department of Computer Science, UBC, 2005.\n[19] G H Golub and Q Ye. Inexact preconditioned conjugate gradient method with inner-outer iteration. SIAM Journal on Scientific Computing, 21:1305–1320, 1999.\n[20] G W Stewart. Backward error bounds for approximate Krylov subspaces. Linear Algebra and its Applications, 340:81–86, 2002.\n[21] V Simoncini and D B Szyld. Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM Journal on Scientific Computing, 25:454–477, 2003.\n[22] D G Lowe. Object recognition from local scale-invariant features. In ICCV, 1999.\n", "award": [], "sourceid": 2865, "authors": [{"given_name": "Nando", "family_name": "Freitas", "institution": null}, {"given_name": "Yang", "family_name": "Wang", "institution": null}, {"given_name": "Maryam", "family_name": "Mahdaviani", "institution": null}, {"given_name": "Dustin", "family_name": "Lang", "institution": null}]}