{"title": "Transductive and Inductive Methods for Approximate Gaussian Process Regression", "book": "Advances in Neural Information Processing Systems", "page_first": 977, "page_last": 984, "abstract": null, "full_text": "Transductive and Inductive Methods for\nApproximate Gaussian Process Regression\n\nAnton Schwaighofer1\n\n2\n\n1 TU Graz, Institute for Theoretical Computer Science\n\nInffeldgasse 16b, 8010 Graz, Austria\n\nhttp://www.igi.tugraz.at/aschwaig\n\nVolker Tresp2\n\n2 Siemens Corporate Technology CT IC4\n\nOtto-Hahn-Ring 6, 81739 Munich, Germany\n\nhttp://www.tresp.org\n\nAbstract\n\nGaussian process regression allows a simple analytical treatment of ex-\nact Bayesian inference and has been found to provide good performance,\nyet scales badly with the number of training data. In this paper we com-\npare several approaches towards scaling Gaussian processes regression\nto large data sets: the subset of representers method, the reduced rank\napproximation, online Gaussian processes, and the Bayesian commit-\ntee machine. Furthermore we provide theoretical insight into some of\nour experimental results. We found that subset of representers methods\ncan give good and particularly fast predictions for data sets with high\nand medium noise levels. On complex low noise data sets, the Bayesian\ncommittee machine achieves signi\ufb01cantly better accuracy, yet at a higher\ncomputational cost.\n\n1 Introduction\n\nGaussian process regression (GPR) has demonstrated excellent performance in a number\nof applications. One unpleasant aspect of GPR is its scaling behavior with the size of the\ntraining data set N. In direct implementations, training time increases as O \u0001 N 3\n\u0002 , with a\nmemory footprint of O \u0001 N 2\n\u0002 . 
The subset of representers method (SRM), the reduced rank approximation (RRA), online Gaussian processes (OGP) and the Bayesian committee machine (BCM) are approaches to solving the scaling problem, based on a finite dimensional approximation to the typically infinite dimensional Gaussian process.\n\nThe focus of this paper is on providing a unifying view of these methods and analyzing their differences, both from an experimental and a theoretical point of view. For all of the discussed methods, we also examine asymptotic and actual runtime and investigate the accuracy versus speed trade-off. A major difference between the methods discussed here is that the BCM performs transductive learning, whereas the RRA, SRM and OGP methods perform induction style learning. By transduction¹ we mean that a particular method computes a test set dependent model, i.e. it exploits knowledge about the location of the test data in its approximation. As a consequence, the BCM approximation is calculated only once the inputs of the test data are known. In contrast, inductive methods (RRA, OGP, SRM) build a model solely on the basis of information from the training data.\n\nIn Sec. 1.1 we briefly introduce Gaussian process regression (GPR). Sec. 2 presents the various inductive approaches to scaling GPR to large data, Sec. 3 follows with transductive approaches. In Sec. 4 we give an experimental comparison of all methods and an analysis of the results. Conclusions are given in Sec. 5.\n\n1.1 Gaussian Process Regression\n\nWe consider Gaussian process regression (GPR) on a set of training data D = {(x_i, y_i)}_{i=1}^N, where targets are generated from an unknown function f via y_i = f(x_i) + e_i with independent Gaussian noise e_i of variance σ². 
We assume a Gaussian process prior on f(x_i), meaning that the functional values f(x_i) on the points {x_i}_{i=1}^N are jointly Gaussian distributed, with zero mean and covariance matrix (or Gram matrix) K_N. K_N itself is given by the kernel (or covariance) function k(·,·), with (K_N)_{ij} = k(x_i, x_j). The Bayes optimal estimator f̂(x) = E[f(x) | D] takes on the form of a weighted combination of kernel functions [4] on the training points x_i:\n\n    f̂(x) = Σ_{i=1}^N w_i k(x, x_i).   (1)\n\nThe weight vector w = (w_1, ..., w_N)^T is the solution of the system of linear equations\n\n    (K_N + σ²1) w = y,   (2)\n\nwhere 1 denotes a unit matrix and y = (y_1, ..., y_N)^T. Mean and covariance of the GP prediction f* on a set of test points x*_1, ..., x*_T can be written conveniently as\n\n    E[f* | D] = K*_N (K_N + σ²1)^{-1} y = K*_N w  and  cov(f* | D) = K* − K*_N (K_N + σ²1)^{-1} (K*_N)^T,   (3)\n\nwith (K*_N)_{ij} = k(x*_i, x_j) and K* the covariance matrix of the test points. Eq. (2) shows clearly what problem we may expect with large training data sets: the solution of a system of N linear equations requires O(N³) operations, and the size of the Gram matrix K_N may easily exceed the memory capacity of an average workstation.\n\n2 Inductive Methods for Approximate GPR\n\n2.1 Reduced Rank Approximation (RRA)\n\nReduced rank approximations focus on ways of efficiently solving the system of linear equations Eq. (2), by replacing the kernel matrix K_N with some approximation K̃_N. Williams and Seeger [12] use the Nyström method to calculate an approximation to the first B eigenvalues and eigenvectors of K_N. 
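For reference in what follows, the exact GPR computations of Eqs. (1)-(3), whose O(N³) cost the methods below try to avoid, can be sketched in a few lines of NumPy. This is a minimal toy illustration, not the authors' code; the data, kernel width and noise level are made up:

```python
import numpy as np

def kernel(A, B, d=1.0):
    """Squared exponential kernel matrix between row-wise point sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * d ** 2))

def gpr_exact(X, y, Xstar, noise_var=0.1):
    """Exact GPR mean and covariance, Eqs. (2)-(3): O(N^3) time, O(N^2) memory."""
    N = len(X)
    KN = kernel(X, X)
    w = np.linalg.solve(KN + noise_var * np.eye(N), y)       # Eq. (2)
    KstarN = kernel(Xstar, X)
    mean = KstarN @ w                                        # Eq. (3), mean
    cov = kernel(Xstar, Xstar) - KstarN @ np.linalg.solve(
        KN + noise_var * np.eye(N), KstarN.T)                # Eq. (3), covariance
    return mean, cov

# Toy data: noisy samples of sin(x) (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mean, cov = gpr_exact(X, y, np.array([[0.0]]))
```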
Essentially, the Nyström method performs an eigendecomposition of the B × B covariance matrix K_B, obtained from a set of B basis points selected at random out of the training data. Based on the eigendecomposition of K_B, one can compute approximate eigenvalues and eigenvectors of K_N. In a special case, this reduces to\n\n    K_N ≈ K̃_N = K_NB (K_B)^{-1} (K_NB)^T,   (4)\n\nwhere K_B is the kernel matrix for the set of basis points, and K_NB is the matrix of kernel evaluations between training and basis points. Subsequently, this can be used to obtain an approximate solution w̃ of Eq. (2) via the matrix inversion lemma in O(NB²) instead of O(N³).\n\n¹Originally, the differences between transductive and inductive learning were pointed out in statistical learning theory [10]. Inductive methods minimize the expected loss over all possible test sets, whereas transductive methods minimize the expected loss for one particular test set.\n\n2.2 Subset of Representers Method (SRM)\n\nSubset of representers methods replace Eq. (1) by a linear combination of kernel functions on a set of B basis points, leading to an approximate predictor\n\n    f̃(x) = Σ_{i=1}^B β_i k(x, x_i)   (5)\n\nwith an optimal weight vector\n\n    β = (K_NB^T K_NB + σ² K_B)^{-1} K_NB^T y.   (6)\n\nNote that Eq. (5) becomes exact if the kernel function allows a decomposition of the form k(x_i, x_j) = K_i^B (K_B)^{-1} (K_j^B)^T, where K_i^B denotes the vector of kernel evaluations between x_i and the basis points. In practical implementations, one may expect different performance depending on the choice of the B basis points x_1, ..., x_B. Different approaches for basis selection have been used in the literature; we will discuss them in turn.\n\nObviously, one may select the basis points at random (SRM Random) out of the training set. While this produces no computational overhead, the prediction outcome may be suboptimal.\n\nIn the sparse greedy matrix approximation (SRM SGMA, [6]) a subset of B basis kernel functions is selected such that all kernel functions on the training data can be well approximated by linear combinations of the selected basis kernels². If proximity in the associated reproducing kernel Hilbert space (RKHS) is chosen as the approximation criterion, the optimal linear combination (for a given basis set) can be computed analytically. Smola and Schölkopf [6] introduce a greedy algorithm that finds a near optimal set of basis functions, where the algorithm has the same asymptotic complexity O(NB²) as the SRM Random method.\n\nWhereas the SGMA basis selection focuses only on the representation power of the kernel functions, one can also design a basis selection scheme that takes into account the full likelihood model of the Gaussian process. The underlying idea of the greedy posterior approximation algorithm (SRM PostApp, [7]) is to compare the log posterior of the subset of representers method with the full Gaussian process log posterior. One thus can select basis functions in such a fashion that the SRM log posterior best approximates³ the full GP log posterior, while keeping the total number of basis functions B minimal. 
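In code, the SRM predictor of Eqs. (5)-(6) only ever forms the N × B and B × B kernel blocks, which is where the O(NB²) cost comes from. A minimal sketch with randomly chosen basis points (i.e. SRM Random); the toy data, kernel and hyperparameters are made up for illustration:

```python
import numpy as np

def kernel(A, B, d=1.0):
    """Squared exponential kernel matrix between row-wise point sets A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * d ** 2))

def srm_fit(X, y, basis, noise_var=0.1):
    """Optimal SRM weights, Eq. (6): O(N B^2) time, O(N B) memory."""
    KNB = kernel(X, basis)        # N x B block
    KB = kernel(basis, basis)     # B x B block
    return np.linalg.solve(KNB.T @ KNB + noise_var * KB, KNB.T @ y)

def srm_predict(Xstar, basis, beta):
    """Approximate predictor of Eq. (5): O(B) per test point."""
    return kernel(Xstar, basis) @ beta

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
basis = X[rng.choice(200, size=20, replace=False)]   # SRM Random: B = 20
beta = srm_fit(X, y, basis)
pred = srm_predict(np.array([[0.0]]), basis, beta)
```

The same fit/predict routines cover SRM SGMA, SRM PostApp and SRM Trans; only the way `basis` is chosen differs.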
As for the case of SGMA, this algorithm can be formulated such that its asymptotic computational complexity is O(NB²), where B is the total number of basis functions selected.\n\n²This method was not developed particularly for GPR, yet we expect this basis selection scheme to be superior to a purely random choice.\n\n³However, Rasmussen [5] noted that Smola and Bartlett [7] falsely assume that the additive constant terms in the log likelihood remain constant during basis selection.\n\n2.3 Online Gaussian Processes\n\nCsató and Opper [2] present an online learning scheme that focuses on a sparse model of the posterior process that arises from combining a Gaussian process prior with a general likelihood model of the data. The posterior process is assumed to be Gaussian and is modeled by a set of basis vectors. Upon arrival of a new data point, the updated (possibly non-Gaussian) posterior process is projected to the closest (in a KL-divergence sense) Gaussian posterior. If this projection induces an error above a certain threshold, the newly arrived data point is included in the set of basis vectors. Similarly, basis vectors with minimum contribution to the posterior process may be removed from the basis set.\n\n3 Transductive Methods for Approximate GPR\n\nIn order to derive a transductive kernel classifier, we rewrite the Bayes optimal prediction Eq. 
(3) as follows:\n\n    E[f* | D] = K* (K* + K*_N cov(y | f*)^{-1} (K*_N)^T)^{-1} K*_N cov(y | f*)^{-1} y.   (7)\n\nHere, cov(y | f*) is the covariance obtained when predicting the training observations y given the functional values f* at the test points:\n\n    cov(y | f*) = K_N + σ²1 − (K*_N)^T (K*)^{-1} K*_N.   (8)\n\nMind that this matrix can be written down without actual knowledge of f*.\n\nExamining Eq. (7) reveals that the Bayes optimal prediction of Eq. (3) can be expressed as a weighted sum of kernel functions on the test points. In Eq. (7), the term cov(y | f*)^{-1} y gives a weighting of the training observations y: training points which cannot be predicted well from the functional values of the test points are given a lower weight. Data points which are \"closer\" to the test points (in the sense that they can be predicted better) obtain a higher weight than data which are remote from the test points.\n\nEq. (7) still involves the inversion of the N × N matrix cov(y | f*) and thus does not make a practical method. By using different approximations for cov(y | f*)^{-1}, we obtain different transductive methods, which we shall discuss in the next sections.\n\nNote that in a Bayesian framework, transductive and inductive methods are equivalent if we consider matching models (the true model for the data is in the family of models we consider for learning). Large data sets reveal more of the structure of the true model, but for computational reasons, we may have to limit ourselves to models with lower complexity. In this case, transductive methods allow us to focus on the actual region of interest, i.e. 
we can build models that are particularly accurate in the region where the test data lies.\n\n3.1 Transductive SRM\n\nFor large sets of test data, we may assume cov(y | f*) to be a diagonal matrix, cov(y | f*) ≈ σ²1, meaning that the test values f* allow a perfect prediction of the training observations (up to noise). With this approximation, Eq. (7) reduces to the prediction of a subset of representers method (see Sec. 2.2) where the test points are used as the set of basis points (SRM Trans).\n\n3.2 Bayesian Committee Machine (BCM)\n\nFor a smaller number of test data, assuming a diagonal matrix for cov(y | f*) (as for the transductive SRM method) seems unreasonable. Instead, we can use the less stringent assumption of cov(y | f*) being block diagonal. After some matrix manipulations, we obtain the following approximation for Eq. (7) with block diagonal cov(y | f*):\n\n    Ê[f* | D] = C^{-1} Σ_{i=1}^M cov(f* | D_i)^{-1} E[f* | D_i],   (9)\n\n    C = cov(f* | D)^{-1} = −(M − 1)(K*)^{-1} + Σ_{i=1}^M cov(f* | D_i)^{-1}.   (10)\n\nThis is equivalent to the Bayesian committee machine (BCM) approach [8]. In the BCM, the training data D are partitioned into M disjoint sets D_1, ..., D_M of approximately the same size (\"modules\"), and M GPR predictors are trained on these subsets. In the prediction stage, the BCM calculates the unknown responses f* at a set of test points x*_1, ..., x*_T at once. The prediction E[f* | D_i] of GPR module i is weighted by the inverse covariance of its prediction. 
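The BCM combination of Eqs. (9)-(10) can be sketched as follows, assuming each module returns its exact GP posterior over the joint test set. This is a toy illustration, not the authors' implementation; the partitioning here is a simple interleaved split rather than the clustering discussed below, and the small jitter terms are added only for numerical stability:

```python
import numpy as np

def kernel(A, B, d=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * d ** 2))

def gp_posterior(X, y, Xstar, noise_var=0.1):
    """Per-module exact GP posterior over the test set, Eq. (3)."""
    KN = kernel(X, X) + noise_var * np.eye(len(X))
    KsN = kernel(Xstar, X)
    mean = KsN @ np.linalg.solve(KN, y)
    cov = kernel(Xstar, Xstar) - KsN @ np.linalg.solve(KN, KsN.T)
    return mean, cov

def bcm_predict(modules, Xstar, jitter=1e-8):
    """Combine module predictions weighted by inverse covariance, Eqs. (9)-(10)."""
    T = len(Xstar)
    Kstar_inv = np.linalg.inv(kernel(Xstar, Xstar) + jitter * np.eye(T))
    M = len(modules)
    C = -(M - 1) * Kstar_inv            # prior correction term of Eq. (10)
    weighted = np.zeros(T)
    for X, y in modules:
        mean_i, cov_i = gp_posterior(X, y, Xstar)
        prec_i = np.linalg.inv(cov_i + jitter * np.eye(T))
        C += prec_i                      # Eq. (10)
        weighted += prec_i @ mean_i      # numerator of Eq. (9)
    return np.linalg.solve(C, weighted)  # Eq. (9)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(300)
modules = [(X[i::3], y[i::3]) for i in range(3)]   # M = 3 modules of 100 points
fstar = bcm_predict(modules, np.array([[0.0], [1.0]]))
```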
An intuitively appealing effect of this weighting scheme is that modules which are uncertain about their predictions are automatically weighted less than modules that are certain about their predictions.\n\nVery good results were obtained with the BCM with random partitioning [8] into subsets D_i. The block diagonal approximation of cov(y | f*) becomes particularly accurate if each D_i contains data that is spatially separated from the other training data. This can be achieved by pre-processing the training data with a simple k-means clustering algorithm, resulting in an often drastic reduction of the BCM's error rates. In this article, we always use the BCM with clustered data.\n\n4 Experimental Comparison\n\nIn this section we present an evaluation of the different approximation methods discussed in Sec. 2 and 3 on four data sets. In the ABALONE data set [1] with 4177 examples, the goal is to predict the age of abalones based on 8 inputs. The KIN8NM data set⁴ represents the forward dynamics of an 8 link all-revolute robot arm, based on 8192 examples. The goal is to predict the distance of the end-effector from a target, given the twist angles of the 8 links as features. KIN40K represents the same task, yet has a lower noise level than KIN8NM and contains 40,000 examples. Data set ART with 50,000 examples was used extensively in [8] and describes a nonlinear map with 5 inputs with a small amount of additive Gaussian noise.\n\nFor all data sets, we used a squared exponential kernel of the form k(x_i, x_j) = exp(−(1/(2d²)) ‖x_i − x_j‖²), where the kernel parameter d was optimized individually for each method. To allow a fair comparison, the subset selection methods SRM SGMA and SRM PostApp were forced to select a given number B of basis functions (instead of using the stopping criteria proposed by the authors of the respective methods). 
Thus, all methods form their predictions as a linear combination of exactly B basis functions.\n\nTable 1 shows the average remaining variance⁵ in a 10-fold cross validation procedure on all data sets. For each of the methods, we have run experiments with different kernel widths d. In Table 1 we list only the results obtained with the optimal d for each method.\n\nMethod | Abalone 200 | Abalone 1000 | KIN8NM 200 | KIN8NM 1000 | KIN40K 200 | KIN40K 1000 | ART 200 | ART 1000\nSRM PostApp | 42.81 | 42.81 | 13.79 | 2.36 | 9.49 | 7.84 | 3.91 | 1.12\nSRM SGMA | 42.83 | 42.81 | 21.84 | 4.25 | 18.32 | 8.70 | 5.62 | 1.79\nSRM Random | 42.86 | 42.82 | 22.34 | 4.39 | 18.77 | 9.01 | 5.87 | 1.79\nRRA Nyström | 42.98 | 41.10 | N/A | N/A | N/A | N/A | N/A | N/A\nOnline GP | 42.87 | N/A | 16.49 | N/A | 10.36 | N/A | 5.37 | N/A\nBCM | 42.86 | 42.81 | 10.32 | 9.79 | 2.81 | 0.83 | 0.27 | 0.20\nSRM Trans | 42.93 | 42.79 | 21.95 | 4.25 | 16.47 | 8.31 | 5.15 | 1.64\n\nTable 1: Remaining variance, obtained with different GPR approximation methods on four data sets, with different numbers of basis functions selected (200 or 1000). Remaining variance is given in per cent, averaged over 10-fold cross validation. Marked in bold are results that are significantly better (with a significance level of 99% or above in a paired t-test) than any of the other methods.\n\nOn the ABALONE data set (very high level of noise), all of the tested methods achieved almost identical performance, both with B = 200 and B = 1000 basis functions. For all other data sets, significant performance differences were observed. Out of the inductive methods (SRM SGMA, SRM Random, SRM PostApp, RRA Nyström), best performance was always achieved with SRM PostApp. Using the results in a paired t-test showed that this was significant at a level of 99% or above. Online Gaussian processes⁶ typically performed slightly worse than SRM PostApp. Furthermore, we observed certain problems with the RRA Nyström method. On all but the ABALONE data set, the weights w̃ took on values in the range of 10³ or above, leading to poor performance. For this reason, the results for RRA Nyström were omitted from Table 1. Further comments on these problems will be given in Sec. 4.2.\n\nComparing induction and transduction methods, we see that the BCM performs significantly better than any inductive method in most cases. Here, the average MSE obtained with the BCM was only a fraction (25-30%) of the average MSE of the best inductive method. By a paired t-test we confirmed that the BCM is significantly better than all other methods on the KIN40K and ART data sets, with a significance level of 99% or above. On the KIN8NM data set (medium noise level) we observed a case where SRM PostApp performed best. We attribute this to the fact that k-means clustering was not able to find well separated clusters. This reduces the performance of the BCM, since the block diagonal approximation of Eq. (8) becomes less accurate (see Sec. 3.2).\n\n⁴From the DELVE archive http://www.cs.toronto.edu/~delve/\n⁵Remaining variance = 100 · MSE_model / MSE_mean, where MSE_mean is the MSE obtained from using the mean of the training targets as the prediction for all test data. This gives a measure of performance that is independent of data scaling. 
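The remaining-variance measure of footnote 5 is straightforward to compute; a minimal sketch (the variable names and toy values are hypothetical):

```python
import numpy as np

def remaining_variance(y_train, y_true, y_pred):
    """Remaining variance in per cent: 100 * MSE_model / MSE_mean, where the
    baseline predicts the training-target mean for every test point."""
    mse_model = np.mean((y_true - y_pred) ** 2)
    mse_mean = np.mean((y_true - np.mean(y_train)) ** 2)
    return 100.0 * mse_model / mse_mean

y_train = np.array([0.0, 2.0, 4.0])    # training-target mean = 2.0
y_true = np.array([1.0, 3.0])          # test targets
y_pred = np.array([1.5, 2.5])          # MSE_model = 0.25, MSE_mean = 1.0
print(remaining_variance(y_train, y_true, y_pred))  # → 25.0
```

A value of 100 means the model is no better than predicting the training mean; lower is better.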
Mind that all transductive methods necessarily lose their advantage over inductive methods when the allowed model complexity (that is, the number of basis functions) is increased.\n\nWe further noticed that, on the KIN40K and ART data sets, SRM Trans consistently outperformed SRM Random, despite SRM Trans being the most simplistic transductive method. The difference in performance was only small, yet significant at a level of 99%.\n\nAs mentioned above, we did not make use of the stopping criterion proposed for the SRM PostApp method, namely the relative gap between the SRM log posterior and the log posterior of the full Gaussian process model. In [7], the authors suggest that the gap is indicative of the generalization performance of the SRM model and use a gap of 2.5% in their experiments. In contrast, we did not observe any correlation between the gap and the generalization performance in our experiments. For example, selecting 200 basis points out of the KIN40K data set gave a gap of 1%, indicating a good fit. As shown in Table 1, a significantly better error was achieved with 1000 basis functions (giving a gap of 3.5 · 10^{-4}). Thus, it remains open how one can automatically choose an appropriate basis set size B.\n\n⁶Due to the numerically demanding approximations, the runtime of the OGP method for B = 1000 is rather long. We thus only list results for B = 200 basis functions.\n\nMethod | Memory: Initialization | Memory: Prediction | Cost: Initialization | Cost: Prediction | Runtime KIN40K\nExact GPR | O(N²) | O(N) | O(N³) | O(N) | N/A\nRRA Nyström | O(NB) | O(N) | O(NB²) | O(N) | 4 min\nSRM Random | O(NB) | O(B) | O(NB²) | O(B) | 3 min\nSRM Trans | O(NB) | O(B) | O(NB²) | O(B) | 3 min\nSRM SGMA | O(NB) | O(B) | O(NB²) | O(B) | 7 h\nSRM PostApp | O(NB) | O(B) | O(NB²) | O(B) | 11 h\nOnline GP | O(B²) | O(B) | O(NB²) | O(B) | est. 150 h\nBCM | — | O(N + B²) | — | O(NB) | 30 min\n\nTable 2: Memory consumption, asymptotic computational cost and actual runtime for different GP approximation methods with N training data points and B basis points, B ≪ N. For the BCM, we assume here that training and test data are partitioned into modules of size B. Asymptotic cost for predictions shows the cost per test point. The actual runtime is given for the KIN40K data set, with 36,000 training examples, 4000 test patterns and B = 1000 basis functions for each method.\n\n4.1 Computational Cost\n\nTable 2 shows the asymptotic computational cost for all approximation methods we have described in Sec. 2 and 3. The subset of representers methods (SRM) show the most favorable cost for the prediction stage, since the resulting model consists only of B basis functions with their associated weight vector. Table 2 also lists the actual runtime⁷ for one (out of 10) cross validation runs on the KIN40K data set. Here, methods with the same asymptotic complexity exhibit runtimes ranging from 3 minutes to 150 hours. For the SRM methods, most of this time is spent on basis selection (SRM PostApp and SRM SGMA). 
We thus consider the slow basis selection as the bottleneck for SRM methods when working with larger numbers of basis functions or larger data sets.\n\n4.2 Problems with RRA Nyström\n\nAs mentioned in Sec. 4, we observed that the weights w̃ in RRA Nyström take on values in the range of 10³ or above on the data sets KIN8NM, KIN40K and ART. This can be explained by considering the perturbation of linear systems. RRA Nyström solves Eq. (2) with an approximate K̃_N instead of K_N, thus calculating an approximate w̃ instead of the true w. Using matrix perturbation theory, we can show that the relative error of the approximate w̃ is bounded by\n\n    ‖w̃ − w‖ / ‖w‖ ≤ max_i |λ̃_i − λ_i| / (λ̃_i + σ²),   (11)\n\nwhere λ_i and λ̃_i denote the eigenvalues of K_N resp. K̃_N. A closer look at the Nyström approximation [11] revealed that already for moderately complex data sets, such as KIN8NM, it tends to underestimate the eigenvalues of the Gram matrix, unless a very high number of basis points is used. If in addition a rather low noise variance is assumed, we obtain a very high value for the error bound in Eq. (11), confirming our observations in the experiments. Methods to overcome the problems associated with the Nyström approximation are currently being investigated [11].\n\n⁷Runtime was logged on Linux PCs with AMD Athlon 1 GHz CPUs, with all methods implemented in Matlab and optimized with the Matlab profiler.\n\n5 Conclusions\n\nOur results indicate that, depending on the computational resources and the desired accuracy, one may select methods as follows: If the major concern is speed of prediction, one is well advised to use the subset of representers method with basis selection by greedy posterior approximation. 
This method may be expected to give results that are significantly better than those of other (inductive) methods. While being painfully slow during basis selection, the resulting models are compact, easy to use and accurate. Online Gaussian processes achieve a slightly worse accuracy, yet they are the only (inductive) method that can easily be adapted to general likelihood models, such as classification and regression with non-Gaussian noise. A generalization of the BCM to non-Gaussian likelihood models has been presented in [9].\n\nOn the other hand, if accurate predictions are the major concern, one may expect best results with the Bayesian committee machine. On complex low noise data sets (such as KIN40K and ART) we observed significant advantages in terms of prediction accuracy, giving an average mean squared error that was only a fraction (25-30%) of the error achieved by the best inductive method. For the BCM, one must take into account that it is a transduction scheme, thus prediction time and memory consumption are larger than those of SRM methods.\n\nAlthough all discussed approaches scale linearly in the number of training data, they exhibit significantly different runtimes in practice. For the experiments done in this paper (running 10-fold cross validation on the given data), the Bayesian committee machine is about one order of magnitude slower than an SRM method with a randomly chosen basis; SRM with greedy posterior approximation is again an order of magnitude slower than the BCM.\n\nAcknowledgements. Anton Schwaighofer gratefully acknowledges support through an Ernst-von-Siemens scholarship.\n\nReferences\n\n[1] Blake, C. and Merz, C. UCI repository of machine learning databases. 1998.\n[2] Csató, L. and Opper, M. Sparse online Gaussian processes. Neural Computation, 14(3):641-668, 2002.\n[3] Leen, T. K., Dietterich, T. G., and Tresp, V., eds. Advances in Neural Information Processing Systems 13. 
MIT Press, 2001.\n[4] MacKay, D. J. Introduction to Gaussian processes. In C. M. Bishop, ed., Neural Networks and Machine Learning, vol. 168 of NATO ASI Series F: Computer and Systems Sciences. Springer Verlag, 1998.\n[5] Rasmussen, C. E. Reduced rank Gaussian process learning, 2002. Unpublished manuscript.\n[6] Smola, A. and Schölkopf, B. Sparse greedy matrix approximation for machine learning. In P. Langley, ed., Proceedings of ICML 2000. Morgan Kaufmann, 2000.\n[7] Smola, A. J. and Bartlett, P. Sparse greedy Gaussian process regression. In [3], pp. 619-625.\n[8] Tresp, V. A Bayesian committee machine. Neural Computation, 12(11):2719-2741, 2000.\n[9] Tresp, V. The generalized Bayesian committee machine. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 130-139. Boston, MA, USA, 2000.\n[10] Vapnik, V. N. The Nature of Statistical Learning Theory. Springer Verlag, 1995.\n[11] Williams, C. K., Rasmussen, C. E., Schwaighofer, A., and Tresp, V. Observations on the Nyström method for Gaussian process prediction. Tech. rep., available from the authors' web pages, 2002.\n[12] Williams, C. K. I. and Seeger, M. Using the Nyström method to speed up kernel machines. In [3], pp. 682-688.", "award": [], "sourceid": 2230, "authors": [{"given_name": "Anton", "family_name": "Schwaighofer", "institution": null}, {"given_name": "Volker", "family_name": "Tresp", "institution": null}]}