#### Authors

Yirong Shen, Matthias Seeger, Andrew Ng

#### Abstract

1 Introduction

We consider (regression) estimation of a function x u(x ) from noisy observations. If the data-generating process is not well understood, simple parametric learning algorithms, for example ones from the generalized linear model (GLM) family, may be hard to apply because of the difficulty of choosing good features. In contrast, the nonparametric Gaussian process (GP) model [19] offers a flexible and powerful alternative. However, a major drawback of GP models is that the computational cost of learning is about O(n 3 ), and the cost of making a single prediction is O(n), where n is the number of training examples. This high computational complexity severely limits its scalability to large problems, and we believe has proved a significant barrier to the wider adoption of the GP model. In this paper, we address the scaling issue by recognizing that learning and predictions with a GP regression (GPR) model can be implemented using the matrix-vector multiplication (MVM) primitive z K z . Here, K Rn,n is the kernel matrix, and z Rn is an arbitrary vector. For the wide class of so-called isotropic kernels, MVM can be approximated efficiently by arranging the dataset in a tree-type multiresolution data structure such as kd-trees [13], ball trees [11], or cover trees [1]. This approximation can sometimes be made orders of magnitude faster than the direct computation, without sacrificing much in terms of accuracy. Further, the storage requirements for the tree is O(n), while a direct storage of the kernel matrix would require O(n2 ) spare. We demonstrate the efficiency of the tree approach on several large datasets. In the sequel, for the sake of simplicity we will focus on kd-trees (even though it is known that kd-trees do not scale well to high dimensional data). However, it is also completely straightforward to apply the ideas in this paper to other tree-type data structures, for example ball trees and cover trees, which typically scale significantly better to high dimensional data.

2 The Gaussian Process Regression Model Suppose that we observe some data D = {(xi , yi ) | i = 1, . . . , n}, xi X , yi R, sampled independently and identically distributed (i.i.d.) from some unknown distribution.

Our goal is to predict the response y on future test points x with small mean-squared error under the data distribution. Our model consists of a latent (unobserved) function x u so that yi = ui + i , where ui = u(xi ), and the i are independent Gaussian noise variables with zero mean and variance 2 > 0. Following the Bayesian paradigm, we place a prior distribution P (u()) on the function u() and use the posterior distribution P (u()|D) N (y |u , 2 I )P (u()) in order to predict y on new points x . Here, y = [y1 , . . . , yn ]T and u = [u1 , . . . , un ]T are vectors in Rn , and N (|, ) is the density of a Gaussian with mean and covariance . For a GPR model, the prior distribution is a (zero-mean) Gaussian process defined in terms of a positive definite kernel (or covariance) function K : X 2 R. For the purposes of this paper, a GP can be thought of as a mapping from arbitrary finite subsets ~ {x i } X of points, to corresponding zero-mean Gaussian distributions with covariance ~ ~ ~~ matrix K = (K (x i , x j ))i,j . (This notation indicates that K is a matrix whose (i, j )~~ element is K (x i , x j ).) In this paper, we focus on the problem of speeding up GPR under the assumption that the kernel is monotonic isotropic. A kernel function K (x , x ) is called isotropic if it depends only on the Euclidean distance r = x - x 2 between the points, and it is monotonic isotropic if it can be written as a monotonic function of r.

3 Fast GPR predictions Since u(x1 ), u(x2 ), . . . , u(xn ) and u(x ) are jointly Gaussian, it is easy to see that the predictive (posterior) distribution P (u |D), u = u(x ) is given by , u (1) P (u |D) = N | kT M -1 y , K (x , x ) - kT M -1 k where k = [K (x , x1 ), . . . , K (x , xn )]T Rn , and M = K + 2 I , K = (K (xi , xj ))i,j . Therefore, if p = M -1 y , the optimal prediction under the model is u = kT p , and the predictive variance (of P (u |D)) can be used to quantify our uncer^ tainty in the prediction. Details can be found in [19]. ([16] also provides a tutorial on GPs.) Once p is determined, making a prediction now requires that we compute in in kT p = K (x , xi )pi = wi p i (2) which is O(n) since it requires scanning through the entire training set and computing K (x , xi ) for each xi in the training set. When the training set is very large, this becomes prohibitively slow. In such situations, it is desirable to use a fast approximation instead of the exact direct implementation. 3.1 Weighted Sum Approximation The computations in Equation 2 can be thought of as a weighted sum, where w i = K (x , xi ) is the weight on the i-th summand pi . We observe that if the dataset is divided into groups where all data points in a group have similar weights, then it is possible to compute a fast approximation to the above weighted sum. For example, let G be a set of data points that all have weights near some value w. The contribution to the weighted sum by points in G is i i i i i wi p i = w pi + (wi - w)pi = w pi + ipi i i i i i i i Where i = wi - w. Assuming that :xi G pi is known in advance, w i :xi G pi can tihen be computed in constant time and used as an approximation to :xi G wi pi if ipi is small. :xi G