{"title": "Fast Sparse Gaussian Process Methods: The Informative Vector Machine", "book": "Advances in Neural Information Processing Systems", "page_first": 625, "page_last": 632, "abstract": null, "full_text": "Fast Sparse Gaussian Process Methods:\n\nThe Informative Vector Machine\n\nNeil Lawrence\nUniversity of Sheffield\n211 Portobello Street\nSheffield, S1 4DP\nneil@dcs.shef.ac.uk\n\nMatthias Seeger\nUniversity of Edinburgh\n5 Forrest Hill\nEdinburgh, EH1 2QL\nseeger@dai.ed.ac.uk\n\nRalf Herbrich\nMicrosoft Research Ltd\n7 J J Thomson Avenue\nCambridge, CB3 0FB\nrherb@microsoft.com\n\nAbstract\n\nWe present a framework for sparse Gaussian process (GP) methods which uses forward selection with criteria based on information-theoretic principles, previously suggested for active learning. Our goal is not only to learn d-sparse predictors (which can be evaluated in O(d) rather than O(n), where d << n and n is the number of training points), but also to perform training under strong restrictions on time and memory requirements. The scaling of our method is at most O(n·d^2), and in large real-world classification experiments we show that it can match the prediction performance of the popular support vector machine (SVM), yet can be significantly faster in training. In contrast to the SVM, our approximation produces estimates of predictive probabilities ('error bars'), allows for Bayesian model selection, and is less complex in implementation.\n\n1 Introduction\n\nGaussian process (GP) models are powerful non-parametric tools for approximate Bayesian inference and learning. In comparison with other popular nonlinear architectures, such as multi-layer perceptrons, their behavior is conceptually simpler to understand, and model fitting can be achieved without resorting to non-convex optimization routines. 
However, their training time scaling of O(n^3) and memory scaling of O(n^2), where n is the number of training points, has hindered their more widespread use. The related, yet non-probabilistic, support vector machine (SVM) classifier often renders results that are comparable to GP classifiers w.r.t. prediction error at a fraction of the training cost. This is possible because many tasks can be solved satisfactorily using sparse representations of the data set. The SVM is steered towards finding such representations through the use of a particular loss function^1 that encourages some degree of sparsity, i.e. the final predictor depends only on a fraction of the training points, those crucial for good discrimination on the task. Here, we call these utilized points the active set of the sparse predictor. In the case of SVM classification, the active set contains the support vectors: the points closest to the decision boundary and the misclassified ones. If the active set size d is much smaller than n, an SVM classifier can be trained in average-case running time between O(n·d^2) and O(n^2·d), with memory requirements significantly less than n^2. Note, however, that without any restrictions on the data distribution, d can rise to n.\n\n^1 An SVM classifier is trained by minimizing a regularized loss functional, a process which cannot be interpreted as an approximation to Bayesian inference.\n\nIn an effort to overcome these scaling problems, a range of sparse GP approximations have been proposed [1, 8, 9, 10, 11]. However, none of these has fully achieved the goals of being a nontrivial approximation to a non-sparse GP model and matching the SVM w.r.t. both prediction performance and run time. The algorithm proposed here accomplishes these objectives and, as our experiments show, can even be significantly faster in training than the SVM. 
Furthermore, time and memory requirements may be restricted a priori. The potential benefits of retaining the probabilistic characteristics of the method are numerous, since hard problems, e.g. feature and model selection, can be dealt with using standard techniques from Bayesian learning.\n\nOur approach builds on earlier work of Lawrence and Herbrich [2], which we extend here by considering randomized greedy selections and by focusing on an alternative representation of the GP model which facilitates generalizations to settings such as regression and multi-class classification. In the next section we introduce the GP classification model and a method for approximate inference. Section 3 then contains the derivation of our fast greedy approximation and a description of the associated algorithm. In Section 4, we present large-scale experiments on the MNIST database, comparing our method directly against the SVM. Finally, we close with a discussion in Section 5.\n\nWe denote vectors g = (g_i)_i and matrices G = (g_{i,j})_{i,j} in bold face^2. If I, J are sets of row and column indices respectively, we denote the corresponding sub-matrix of G ∈ R^{p,q} by G_{I,J}; furthermore, we write G_{I,·} for G_{I,1:q}, G_{I,j} for G_{I,{j}}, G_I for G_{I,I}, etc. The density of the Gaussian distribution with mean μ and covariance matrix Σ is denoted by N(·|μ, Σ). Finally, we use diag(·) to represent an 'overloaded' operator which either extracts the diagonal elements of a matrix as a vector, or produces a square matrix with the given vector as its diagonal and all other elements 0.\n\n2 Gaussian Process Classification\n\nAssume we are given a sample S := ((x_1, y_1), ..., (x_n, y_n)), x_i ∈ X, y_i ∈ {-1, +1}, drawn independently and identically distributed (i.i.d.) from an unknown data distribution^3 P(x, y). Our goal is to estimate P(y|x) for typical x or, less ambitiously, to learn a predictor x → 
y with small error on future data. To model this situation, we introduce a latent variable u ∈ R separating x and y, and a classification noise model P(y|u) := Φ(y·(u + b)), where Φ is the cumulative distribution function of the standard Gaussian N(0, 1) and b ∈ R is a bias parameter. From the Bayesian viewpoint, the relationship x → u is a random process u(·), which, in a Gaussian process (GP) model, is given a GP prior with mean function 0 and covariance kernel k(·,·). This prior encodes the belief that (before observing any data) for any finite set X = {x̃_1, ..., x̃_p} ⊂ X, the corresponding latent outputs (u(x̃_1), ..., u(x̃_p))^T are jointly Gaussian with mean 0 ∈ R^p and covariance matrix (k(x̃_i, x̃_j))_{i,j} ∈ R^{p,p}. GP models are non-parametric; that is, there is in general no finite-dimensional parametric representation for u(·). It is possible to write u(·) as a linear function in some feature space F associated with k, i.e. u(x) = w^T·φ(x), w ∈ F, in the sense that a Gaussian prior on w induces a GP distribution on the linear function u(·). Here, φ is a feature map from X into F, and the covariance function can be written k(x, x') = φ(x)^T·φ(x'). This linear function view, under which predictors become separating hyper-planes in F, is frequently used in the SVM community. However, F is, in general, infinite-dimensional and not uniquely determined by the kernel function k.\n\n^2 Whenever we use a bold symbol g or G for a vector or matrix, we denote its components by the corresponding normal symbols g_i and g_{i,j}.\n\n^3 We focus on binary classification, but our framework can be applied straightforwardly to regression estimation and multi-class classification.\n\n
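To make the GP prior concrete, the following sketch (our own illustration in numpy, not code from the paper; the generic parameters C and gamma stand in for the paper's hyper-parameters) builds the covariance matrix K for a small input set and draws the jointly Gaussian latent outputs:

```python
import numpy as np

def rbf_kernel(X1, X2, C=1.0, gamma=1.0):
    # k(x, x') = C * exp(-(gamma / 2) * ||x - x'||^2)
    sq = (np.sum(X1**2, axis=1)[:, None]
          + np.sum(X2**2, axis=1)[None, :]
          - 2.0 * X1 @ X2.T)
    return C * np.exp(-0.5 * gamma * np.maximum(sq, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))                  # five inputs in R^2
K = rbf_kernel(X, X)                         # prior covariance of the latent outputs
u = rng.multivariate_normal(np.zeros(5), K)  # one draw of u ~ N(0, K)
```

Because k has constant diagonal C, every point has the same prior variance, which is why the first inclusion of the greedy scheme in Section 3 must be chosen at random.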
We denote the sequence of latent outputs at the training points by u := (u(x_1), ..., u(x_n))^T ∈ R^n and the covariance or kernel matrix by K := (k(x_i, x_j))_{i,j} ∈ R^{n,n}.\n\nThe Bayesian posterior process for u(·) can be computed in principle using Bayes' formula. However, if the noise model P(y|u) is non-Gaussian (as is the case for binary classification), it cannot be handled tractably and is usually approximated by another Gaussian process, which should ideally preserve the mean and covariance function of the former. It is easy to show that this is equivalent to fitting the moments between the finite-dimensional (marginal) posterior P(u|S) over the training points and a Gaussian approximation Q(u), because the conditional posterior P(u(x*)|u, S) for some non-training point x* is identical to the conditional prior P(u(x*)|u). In general, computing Q is also infeasible, but several authors have proposed to approximate the global moment matching by iterative schemes which locally focus on one training pattern at a time [1, 4]. These schemes (at least in their simplest forms) result in a parametric form for the approximating Gaussian:\n\nQ(u) ∝ P(u) · ∏_{i=1}^n exp( -(p_i/2)·(u_i - m_i)^2 ).   (1)\n\nThis may be compared with the form of the true posterior P(u|S) ∝ P(u)·∏_{i=1}^n P(y_i|u_i), and shows that Q(u) is obtained from P(u|S) by a likelihood approximation. Borrowing from graphical models vocabulary, the factors in (1) are called sites. Initially, all p_i, m_i are 0, thus Q(u) = P(u). In order to update the parameters for a site i, we replace it in Q(u) by the corresponding true likelihood factor P(y_i|u_i), resulting in a non-Gaussian distribution whose mean and covariance matrix can still be computed. This allows us to approximate it by a Gaussian Q_new(u) using moment matching. The site update is called the inclusion of i into the active set I. 
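The inclusion step above reduces, for each site, to matching the first two moments of a one-dimensional Gaussian marginal multiplied by a probit factor. The following sketch (our own illustration using scipy, not the authors' code) computes these tilted moments via the standard Gaussian-probit identities, the same quantities α and ν that reappear in eq. (2):

```python
import numpy as np
from scipy.stats import norm

def probit_moments(h, a, y, b=0.0):
    """Mean and variance of the tilted distribution
    proportional to N(u | h, a) * Phi(y * (u + b)).

    Moment matching fits the new Gaussian marginal Q_new(u_i)
    to exactly these two moments."""
    z = y * (h + b) / np.sqrt(1.0 + a)
    alpha = y * norm.pdf(z) / (norm.cdf(z) * np.sqrt(1.0 + a))
    nu = alpha * (alpha + (h + b) / (1.0 + a))
    return h + a * alpha, a - a**2 * nu
```

For h = 0, a = 1, y = +1 the tilted mean moves towards the observed label and the variance shrinks below its prior value, which is exactly the behavior the moment-matching update transfers to Q_new.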
The factorized form of the likelihood implies that the new and old Q differ only in the parameters p_i, m_i of site i. This is a useful locality property of the scheme, which is referred to as assumed density filtering (ADF) (e.g. [4]). The special case of ADF^4 for GP models has been proposed in [5].\n\n3 Sparse Gaussian Process Classification\n\nThe simplest way to obtain a sparse Gaussian process classification (GPC) approximation from the ADF scheme is to leave most of the site parameters at 0, i.e. p_i = 0, m_i = 0 for all i ∉ I, where I ⊂ {1, ..., n} is the active set, |I| =: d < n. For this to succeed, it is important to choose I so that the decision boundary between classes is represented essentially as accurately as if we used the whole training set. An exhaustive search over all possible subsets I is, of course, intractable. Here, we follow a greedy approach suggested in [2], including new patterns one at a time into I. The selection of a pattern to include is made by computing a score function for\n\n^4 A generalization of ADF, expectation propagation (EP) [4], allows for several iterations over the data. 
In the context of sparse approximations, it allows us to remove points from I or exchange them against points outside I, although we do not consider such moves here.\n\nAlgorithm 1 Informative vector machine algorithm\nRequire: A desired sparsity d << n.\n  I = {}; m = 0; Π = diag(0); diag(A) = diag(K); h = 0; J = {1, ..., n}.\n  repeat\n    for j ∈ J do\n      Compute Δ_j according to (4).\n    end for\n    i = argmax_{j ∈ J} Δ_j.\n    Do updates for p_i and m_i according to (2).\n    Update matrices L, M, diag(A) and h according to (3).\n    I ← I ∪ {i}; J ← J ∖ {i}.\n  until |I| = d\n\nall points in J = {1, ..., n} ∖ I (or a subset thereof), and then picking the winner. The heuristic we implement has also been considered in the context of active learning (see chapter 5 of [3]): score an example (x_i, y_i) by the decrease in entropy of Q(·) upon its inclusion. As a result of the locality property of ADF and the fact that Q is Gaussian, it is easy to see that the entropy difference H[Q_new] - H[Q] is proportional to the log ratio between the variances of the marginals Q_new(u_i) and Q(u_i). Thus, our heuristic (referred to as the differential entropy score) favors points whose inclusion leads to a large reduction in predictive (posterior) variance at the corresponding site. Whilst other selection heuristics can be argued for and utilized, it turns out that the differential entropy score together with the simple likelihood approximation in (1) leads to an extremely efficient and competitive algorithm.\n\nIn the remainder of this section, we describe our method and give a schematic algorithm. A detailed derivation and discussion of some extensions can be found in [7]. From (1) we have Q(·) = N(·|h, A), A := (K^{-1} + Π)^{-1}, h := A·Π·m and Π := diag(p). 
If I is the current active set, then all components of p and m not in I are zero, and some algebra using the Woodbury formula gives\n\nA = K - M^T·M,   M = L^{-1}·Π_I^{1/2}·K_{I,·} ∈ R^{d,n},\n\nwhere L is the lower-triangular Cholesky factor of\n\nB = I + Π_I^{1/2}·K_I·Π_I^{1/2} ∈ R^{d,d}.\n\nIn order to compute the differential entropy score for a point j ∉ I, we have to know a_{j,j} and h_j. Thus, when including i into the active set I, we need to update diag(A) and h accordingly, which in turn requires the matrices L and M to be kept up-to-date. The update equations for p_i, m_i are\n\np_i = ν_i / (1 - a_{i,i}·ν_i),   m_i = h_i + α_i / ν_i,   where\n\nz_i = y_i·(h_i + b) / sqrt(1 + a_{i,i}),   α_i = y_i·N(z_i|0, 1) / (Φ(z_i)·sqrt(1 + a_{i,i})),   ν_i = α_i·(α_i + (h_i + b) / (1 + a_{i,i})).   (2)\n\nWe then update L → L_new by appending the row (l^T, λ) and M → M_new by appending the row μ^T, where\n\nl = sqrt(p_i)·M_{·,i},   λ = sqrt(1 + p_i·K_{i,i} - l^T·l),   μ = λ^{-1}·(sqrt(p_i)·K_{·,i} - M^T·l).   (3)\n\nFinally, diag(A_new) ← diag(A) - (μ_j^2)_j and h_new ← h + α_i·λ·p_i^{-1/2}·μ. The differential entropy score for j ∉ I can be computed based on the variables in (2) (with i → j) as\n\nΔ_j = -(1/2)·log(1 - a_{j,j}·ν_j),   (4)\n\nwhich can be computed in O(1), given h_j and a_{j,j}. In Algorithm 1 we give an algorithmic version of this scheme.\n\nEach inclusion costs O(n·d), dominated by the computation of μ, apart from the computation of the kernel matrix column K_{·,i}. Thus the total time complexity is O(n·d^2). The storage requirement is O(n·d), dominated by the buffer for M. Given diag(A) and h, the error or the expected log likelihood of the current predictor on the remaining points J can be computed in O(n). 
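The quantities in eqs. (2)-(4) and the loop of Algorithm 1 fit into a short numpy/scipy sketch. This is our own condensed illustration, not the authors' implementation: lam plays the role of the scalar λ appended to the Cholesky factor, ties in the score are broken by index order rather than at random, and no selection index or delayed updating is used:

```python
import numpy as np
from scipy.stats import norm

def ivm_greedy(K, y, d, b=0.0):
    """Full greedy IVM selection; returns the active set I
    and the site parameters (p, m)."""
    n = K.shape[0]
    h = np.zeros(n)             # current posterior means
    a = np.diag(K).copy()       # current marginal variances diag(A)
    M = np.zeros((0, n))        # M = L^{-1} Pi_I^{1/2} K_{I,.}
    p = np.zeros(n)
    m = np.zeros(n)
    I = []
    for _ in range(d):
        # quantities of eq. (2), evaluated for every candidate
        z = y * (h + b) / np.sqrt(1.0 + a)
        alpha = y * norm.pdf(z) / (norm.cdf(z) * np.sqrt(1.0 + a))
        nu = alpha * (alpha + (h + b) / (1.0 + a))
        # differential entropy score, eq. (4)
        delta = -0.5 * np.log(1.0 - a * nu)
        delta[I] = -np.inf      # never re-include active points
        i = int(np.argmax(delta))
        # site updates, eq. (2)
        p[i] = nu[i] / (1.0 - a[i] * nu[i])
        m[i] = h[i] + alpha[i] / nu[i]
        # matrix updates, eq. (3): append a row to M, refresh diag(A) and h
        l = np.sqrt(p[i]) * M[:, i]
        lam = np.sqrt(1.0 + p[i] * K[i, i] - l @ l)
        mu = (np.sqrt(p[i]) * K[:, i] - M.T @ l) / lam
        M = np.vstack([M, mu])
        a = a - mu**2
        h = h + alpha[i] * lam * mu / np.sqrt(p[i])
        I.append(i)
    return I, p, m
```

Each pass over the candidates costs O(n·d) because of μ, reproducing the O(n·d^2) total complexity stated above.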
These scores can be used in order to decide how many points to include into the final I. For kernel functions with constant diagonal, our selection heuristic is constant over patterns if I = {}, so the first inclusion candidate (or the first few) is chosen at random. After training is complete, we can predict on test points x* by evaluating the approximate predictive distribution Q(u*|x*, S) = ∫ P(u*|u)·Q(u) du = N(u*|μ(x*), σ^2(x*)), where\n\nμ(x*) = β^T·k(x*),   σ^2(x*) = k(x*, x*) - k(x*)^T·Π_I^{1/2}·B^{-1}·Π_I^{1/2}·k(x*),   (5)\n\nwith β := Π_I^{1/2}·B^{-1}·Π_I^{1/2}·m_I and k(x*) := (k(x_i, x*))_{i∈I}. We may compute σ^2(x*) using one back-substitution with the factor L. The approximate predictive distribution over y* can be obtained by averaging the noise model over the Gaussian. The optimal predictor for the approximation is sgn(μ(x*) + b), which is independent of the variance σ^2(x*).\n\nThe simple scheme above employs full greedy selection over all remaining points to find the inclusion candidate. This is sensible during early inclusions, but computationally wasteful during later ones, and an important extension of the basic scheme of [2] allows for randomized greedy selections. To this end, we maintain a selection index J ⊂ {1, ..., n} with J ∩ I = {} at all times. Having included i into I, we modify the selection index J. This means that only the components J of diag(A) and h have to be updated, which requires only the columns M_{·,J}. Hence, if J exhibits some inertia while moving over {1, ..., n} ∖ I, many of the columns of M will not have to be kept up-to-date. 
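Prediction with eq. (5) needs only the active set: one Cholesky factorization of B followed by triangular solves. The sketch below is our own illustration with scipy (the names and the packaging of arguments are ours); the final line averages the probit noise model over the Gaussian using the standard identity ∫ Φ(u + b)·N(u|μ, σ^2) du = Φ((μ + b)/sqrt(1 + σ^2)):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.stats import norm

def ivm_predict(K_I, k_star, k_ss, p_I, m_I, b=0.0):
    """Predictive mean/variance of eq. (5) plus the averaged noise model.

    K_I: d x d kernel matrix of the active set; k_star: vector k(x*);
    k_ss: k(x*, x*); p_I, m_I: site parameters of the active points."""
    sp = np.sqrt(p_I)
    B = np.eye(len(p_I)) + sp[:, None] * K_I * sp[None, :]
    L = cholesky(B, lower=True)
    # sigma^2(x*) = k(x*, x*) - ||L^{-1} Pi_I^{1/2} k(x*)||^2
    v = solve_triangular(L, sp * k_star, lower=True)
    var = k_ss - v @ v
    # beta = Pi_I^{1/2} B^{-1} Pi_I^{1/2} m_I, so mu(x*) = beta^T k(x*)
    w = solve_triangular(L.T, solve_triangular(L, sp * m_I, lower=True),
                         lower=False)
    mu = (sp * w) @ k_star
    # P(y* = +1 | x*): average the probit noise model over the Gaussian
    prob = norm.cdf((mu + b) / np.sqrt(1.0 + var))
    return mu, var, prob
```

As the text notes, sgn(μ(x*) + b) alone fixes the hard decision; the variance matters only for the probability estimate.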
In our implementation, we employ a simple delayed updating scheme for the columns of M which avoids double computations (see [7] for details). After a number of initial inclusions are done using full greedy selection, we use a J of fixed size m together with the following modification rule: for a fraction τ ∈ (0, 1), retain the τ·m best-scoring points in J, then fill it up to size m by drawing at random from {1, ..., n} ∖ (I ∪ J).\n\n4 Experiments\n\nWe now present results of experiments on the MNIST handwritten digits database^5, comparing our method against the SVM algorithm. We considered binary tasks of the form 'c-against-rest', c ∈ {0, ..., 9}, where c is mapped to +1 and all others to -1. We down-sampled the bitmaps to size 13×13 and split the MNIST training set into a (new) training set of size n = 59000 and a validation set of size 1000; the test set size is 10000. A run consisted of model selection, training and testing, and all results are averaged over 10 runs. We employed the RBF kernel k(x, x') = C·exp(-(γ/(2·169))·||x - x'||^2), x ∈ R^169, with hyper-parameters C > 0 (process variance) and γ > 0 (inverse squared length-scale). Model selection was done by minimizing validation set error, training on random training set subsets of size 5000.^6\n\n^5 Available online at http://www.research.att.com/~yann/exdb/mnist/index.html.\n\n^6 The model selection training set for a run i is the same across tested methods. 
The list of kernel parameters considered for selection has the same size across methods.\n\nTable 1: Test error rates (gen, %) and training times (time, s) on binary MNIST tasks. SVM: support vector machine (SMO); d: average number of SVs. IVM: sparse GPC, randomized greedy selections; d: final active set size. Figures are means over 10 runs.\n\nc | SVM gen | SVM d | SVM time | IVM gen | IVM d | IVM time\n0 | 0.22 | 1247 | 1281 | 0.18 | 1130 | 627\n1 | 0.20 | 798 | 864 | 0.26 | 820 | 427\n2 | 0.40 | 2240 | 2977 | 0.40 | 2150 | 1690\n3 | 0.41 | 2610 | 3687 | 0.39 | 2500 | 2191\n4 | 0.40 | 1826 | 2442 | 0.33 | 1740 | 1210\n5 | 0.29 | 2306 | 2771 | 0.32 | 2200 | 1758\n6 | 0.28 | 1331 | 1520 | 0.29 | 1270 | 765\n7 | 0.54 | 1759 | 2251 | 0.51 | 1660 | 1110\n8 | 0.50 | 2636 | 3909 | 0.53 | 2470 | 2024\n9 | 0.58 | 2731 | 3469 | 0.55 | 2740 | 2444\n\nOur goal was to compare the methods not only w.r.t. performance, but also running time. For the SVM, we chose the SMO algorithm [6] together with a fast, elaborate kernel matrix cache (see [7] for details). For the IVM, we employed randomized greedy selections with fairly conservative settings.^7 Since each binary digit classification task is very unbalanced, the bias parameter b in the GPC model was chosen to be non-zero. We simply fixed b = Φ^{-1}(r), where r is the ratio between +1 and -1 patterns in the training set, and added a constant v_b = 1/10 to the kernel k to account for the variance of the bias hyper-parameter. Ideally, both b and v_b should be chosen by model selection, but initial experiments with different values for (b, v_b) exhibited no significant fluctuations in validation errors. To ensure a fair comparison, we did initial SVM runs and initialized the active set size d with the average number (over 10 runs) of SVs found, independently for each c. We then re-ran the SVM experiments, allowing for O(d·n) cache space. 
Table 1 shows the results.\n\nNote that the IVM shows comparable performance to the SVM, while achieving significantly lower training times. For less conservative settings of the randomized selection parameters, further speed-ups might be realizable. We also registered (not shown here) significant fluctuations in training time for the SVM runs, while this figure is stable and a priori predictable for the IVM. Within the IVM, we can obtain estimates of predictive probabilities for test points, quantifying prediction uncertainties. In Figure 1, which was produced for the hardest task c = 9, we reject fractions of test set examples based on the size of |P(y* = +1) - 1/2|. For the SVM, the size of the discriminant output is often used to quantify predictive uncertainty heuristically. For c = 9, the latter is clearly inferior (although the difference is less pronounced for the simpler binary tasks).\n\nIn the SVM community it is common to combine the 'c-against-rest' classifiers to obtain a multi-class discriminant^8 as follows: for a test point x*, decide for the class whose associated classifier has the highest real-valued output. For the IVM, the\n\n^7 First 2 selections at random, then 198 using full greedy selection; after that, a selection index of size 500 and a retained fraction τ = 1/2.\n\n^8 Although much recent work has looked into more powerful combination schemes, e.g. based on error-correcting codes.\n\n[Figure 1 appears here: test error rate (log scale, 10^-4 to 10^-2) plotted against the rejected fraction (0 to 0.2), with one curve each for the SVM and the IVM.]\n\nFigure 1: Plot of test error rate against increasing rejection rate for the SVM (dashed) and the IVM (solid), for the task c = 9 against the rest. For the SVM, we reject based on 'distance' from the separating plane; for the IVM, based on estimates of predictive probabilities. 
The IVM line runs below the SVM line, exhibiting lower classification errors for identical rejection rates.\n\nequivalent would be to compare the estimates of log P(y* = +1) from each c-predictor and pick the maximizing c. This is suboptimal, because the different predictors have not been trained jointly.^9 However, the estimates of log P(y* = +1) do depend on predictive variances, i.e. a measure of uncertainty about the predictive mean, which cannot be properly obtained within the SVM framework. This combination scheme results in test errors of 1.54% (±0.0417%) for the IVM and 1.62% (±0.0316%) for the SVM. When comparing these results to others in the literature, recall that our experiments were based on images sub-sampled to size 13×13 rather than the usual 28×28.\n\n5 Discussion\n\nWe have demonstrated that sparse Gaussian process classifiers can be constructed efficiently using greedy selection with a simple, fast selection criterion. Although we focused on the change in differential entropy in our experiments here, the simple likelihood approximation at the basis of our method allows for other equally efficient criteria, such as information gain [3]. Our method retains many of the benefits of probabilistic GP models (error bars, model combination, interpretability, etc.) while being much faster and more memory-efficient both in training and prediction. In comparison with non-probabilistic SVM classification, our method enjoys the further advantages of being simpler to implement and having strictly predictable time requirements. Our method can also be significantly faster^10 than the SVM with the SMO algorithm. This is due to the fact that SMO's active set typically fluctuates heavily across the training set, thus a large fraction of the full kernel matrix must be evaluated. 
In contrast, the IVM requires only a fraction d/n of K.\n\n^9 It is straightforward to obtain the IVM for a joint GP classification model; however, the training costs rise by a factor of c^2. Whether this factor can be reduced to c using further sensible approximations is an open question.\n\n^10 We would expect SVMs to catch up with IVMs on tasks which require fairly large active sets, and for which very simple and fast covariance functions are appropriate (e.g. sparse input patterns).\n\nAmong the many proposed sparse GP approximations [1, 8, 9, 10, 11], our method is most closely related to [1]. The latter is a sparse Bayesian online scheme which does not employ greedy selections and uses a more accurate likelihood approximation than we do, at the expense of slightly worse training time scaling, especially when compared with our randomized version. It also requires the specification of a rejection threshold and is dependent on the ordering in which the training points are presented. It incorporates steps to remove points from I, which can also be done straightforwardly in our scheme; however, such moves are likely to create numerical stability problems. Smola and Bartlett [8] use a likelihood approximation different from both the IVM and the scheme of [1] for GP regression, together with greedy selections, but in contrast to our work they use a very expensive selection heuristic (O(n·d) per score computation) and are forced to use randomized greedy selection over small selection indexes. The differential entropy score has previously been suggested in the context of active learning (e.g. [3]), but applies more directly to our problem. 
In active learning, the label y_i is not known at the time x_i has to be scored, and expected rather than actual entropy changes have to be considered. Furthermore, MacKay [3] applies the selection to multi-layer perceptron (MLP) models, for which Gaussian posterior approximations over the weights can be very poor.\n\nAcknowledgments\n\nWe thank Chris Williams, David MacKay, Manfred Opper and Lehel Csató for helpful discussions. MS gratefully acknowledges support through a research studentship from Microsoft Research Ltd.\n\nReferences\n\n[1] Lehel Csató and Manfred Opper. Sparse online Gaussian processes. Neural Computation, 14:641-668, 2002.\n\n[2] Neil D. Lawrence and Ralf Herbrich. A sparse Bayesian compression scheme - the informative vector machine. Presented at the NIPS 2001 Workshop on Kernel Methods, 2001.\n\n[3] David MacKay. Bayesian Methods for Adaptive Models. PhD thesis, California Institute of Technology, 1991.\n\n[4] Thomas Minka. A Family of Algorithms for Approximate Bayesian Inference. PhD thesis, MIT, January 2001.\n\n[5] Manfred Opper and Ole Winther. Gaussian processes for classification: Mean field algorithms. Neural Computation, 12(11):2655-2684, 2000.\n\n[6] John C. Platt. Fast training of support vector machines using sequential minimal optimization. In Schölkopf et al., editors, Advances in Kernel Methods, pages 185-208, 1998.\n\n[7] Matthias Seeger, Neil D. Lawrence, and Ralf Herbrich. Sparse Bayesian learning: The informative vector machine. Technical report, Department of Computer Science, Sheffield, UK, 2002. See www.dcs.shef.ac.uk/~neil/papers/.\n\n[8] Alex Smola and Peter Bartlett. Sparse greedy Gaussian process regression. In Advances in NIPS 13, pages 619-625, 2001.\n\n[9] Michael Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.\n\n[10] Volker Tresp. A Bayesian committee machine. Neural 
Computation, 12(11):2719-2741, 2000.\n\n[11] Christopher K. I. Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Advances in NIPS 13, pages 682-688, 2001.\n"}, "award": [], "sourceid": 2240, "authors": [{"given_name": "Ralf", "family_name": "Herbrich", "institution": null}, {"given_name": "Neil", "family_name": "Lawrence", "institution": null}, {"given_name": "Matthias", "family_name": "Seeger", "institution": null}]}