{"title": "The Generalized FITC Approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 1057, "page_last": 1064, "abstract": null, "full_text": "The Generalized FITC Approximation\n\nAndrew Naish-Guzman & Sean Holden\n\nComputer Laboratory\nUniversity of Cambridge\nCambridge, CB3 0FD. United Kingdom\n{agpn2,sbh11}@cl.cam.ac.uk\n\nAbstract\n\nWe present an efficient generalization of the sparse pseudo-input Gaussian process\n(SPGP) model developed by Snelson and Ghahramani [1], applying it to\nbinary classification problems. By taking advantage of the SPGP prior covariance\nstructure, we derive a numerically stable algorithm with $O(NM^2)$ training\ncomplexity\u2014asymptotically the same as related sparse methods such as the\ninformative vector machine [2], but which more faithfully represents the posterior.\nWe present experimental results for several benchmark problems showing that\nin many cases this allows an exceptional degree of sparsity without compromising\naccuracy. Following [1], we locate pseudo-inputs by gradient ascent on the\nmarginal likelihood, but exhibit occasions when this is likely to fail, for which we\nsuggest alternative solutions.\n\n1 Introduction\n\nGaussian processes are a flexible and popular approach to non-parametric modelling. Their\nconceptually simple architecture is allied with a sound Bayesian foundation, so that not only does their\npredictive power rival state-of-the-art discriminative methods such as the support vector machine,\nbut they also have the additional benefit of providing an estimate of variance, giving an error bar for\ntheir prediction. 
However, there is a computational price to pay for this robust framework: the time\nfor training scales as $N^3$ for $N$ data points, and the cost of prediction is $O(N^2)$ per test case.\nRecently, there has been great interest in finding sparse approximations to the full Gaussian process\n(GP) in order to accelerate training and prediction times respectively to $O(NM^2)$ and $O(M^2)$,\nwhere $M \\ll N$ is the size of an auxiliary set, often a subset of the training data, termed variously\nthe inducing inputs, pseudo-inputs or the active set [3, 4, 5, 2, 6, 7, 1]; in this paper, we use the\nterms interchangeably. Qui\u00f1onero-Candela and Rasmussen [8] demonstrated how many of these\nschemes are related through different approximations to the joint prior over training and test points.\nIn this paper we consider the \u201cfully independent training conditional\u201d or FITC approximation, which\nappeared originally in Snelson and Ghahramani [1] as the sparse pseudo-input GP (SPGP).\n\nRestricted to a Gaussian noise model, the FITC approximation is entirely tractable; however, for\nmany problems, the Gaussian assumption is inappropriate. In this paper, we describe an extension\nfor non-Gaussian likelihoods, considering as an example probit noise for binary classification. This\nis not only a common problem, but our results bear out the intuition that sparse methods are\nwell-suited: many data sets enjoy the property that class label does not fluctuate rapidly in the input space,\noften allowing large regions to be summarized with very few inducing inputs. Contrast this with\nregression problems, where higher frequency components in the latent signal demand that the\npseudo-inputs appear in much higher density.\n\nThe informative vector machine (IVM) of Lawrence et al. [2] is another sparse GP method that has\nbeen extended to non-Gaussian noise models. 
It is a subset-of-data method in which the active set\nis grown incrementally from the training data using a fast information gain heuristic to find at each\nstage the optimal inclusion. When a threshold number of points have been added, the algorithm\nterminates: only data accumulated into the active set are relevant for prediction; remaining points\ninfluence the model only in the weak sense of guiding previous steps of the algorithm. Our method is\nan improvement in three regards: firstly, the FITC approximation makes use of all the data, yielding\nfor the same active set a closer approximation to the posterior distribution. Secondly, unlike the\nstandard IVM approach, we fit a stable posterior at each iteration, providing more accurate marginal\nlikelihood estimates, and derivatives thereof, to allow more reliable model selection. Finally, we\nargue with experimental justification that the ability to locate inducing inputs independently of the\ntraining data, as compared with the greedy approach that drives the IVM, can be a great advantage\nin finding the sparsest solutions. 
We discuss these points and other related work in greater detail in\nsection 6.\n\nThe structure of this paper is as follows: in section 2 we describe the FITC approximation; this is\nfollowed in section 3 by a detailed description of its representation for a non-Gaussian noise model;\nsection 4 provides a brief account of the procedure for model selection; experimental results appear\nin section 5, which we discuss in section 6; our concluding remarks are in section 7.\n\n2 The FITC approximation\n\nGiven a domain $\\mathcal{X}$ and covariance function $K(\\cdot, \\cdot) \\in \\mathcal{X} \\times \\mathcal{X} \\to \\mathbb{R}$, a Gaussian process (GP) over\nthe space of real-valued functions of $\\mathcal{X}$ specifies the joint distribution at any finite set $X \\subset \\mathcal{X}$:\n\n$$p(f|X) = \\mathcal{N}(f; 0, K_{ff}),$$\n\nwhere the $f = \\{f_n\\}_{n=1}^N$ are (latent) values associated with each $x_n \\in X$, and $K_{ff}$ is the Gram\nmatrix, the evaluation of the covariance function at all pairs $(x_i, x_j)$. We apply Bayes\u2019 rule to obtain\nthe posterior distribution over the $f$, given the observed $X$ and $y$, which with the assumption of\ni.i.d. Gaussian corrupted observations is also normally distributed. Predictions at $X_\\star$ are made by\nmarginalizing over $f$ in the (Gaussian) joint $p(f, f_\\star|X, y, X_\\star)$. See [9] for a thorough introduction.\nIn order to derive the FITC approximation, we follow [8] and introduce a set of $M$ inducing inputs\n$\\bar{X} = \\{\\bar{x}_1, \\bar{x}_2, \\ldots, \\bar{x}_M\\}$ with associated latent values $u$. By the consistency of GPs, we have\n\n$$p(f, f_\\star|X, X_\\star, \\bar{X}) = \\int p(f, f_\\star|u, X, X_\\star)\\, p(u|\\bar{X})\\, du \\approx \\int q(f|u, X)\\, q(f_\\star|u, X_\\star)\\, p(u|\\bar{X})\\, du,$$\n\nwhere $p(u|\\bar{X}) = \\mathcal{N}(u; 0, K_{uu})$. In the final expression we make the critical approximation by\nimposing a conditional independence assumption on the joint prior over training and test cases:\ncommunication between them must pass through the bottleneck of the inducing inputs. 
The FITC\napproximation follows by letting\n\n$$q(f|u, X) = \\mathcal{N}\\big(f; K_{fu} K_{uu}^{-1} u, \\operatorname{diag}(K_{ff} - Q_{ff})\\big), \\qquad (1)$$\n$$q(f_\\star|u, X_\\star) = \\mathcal{N}\\big(f_\\star; K_{\\star u} K_{uu}^{-1} u, \\operatorname{diag}(K_{\\star\\star} - Q_{\\star\\star})\\big), \\qquad (2)$$\n\nwhere $Q_{ab} \\doteq K_{au} K_{uu}^{-1} K_{ub}$. Of interest for predictions is the posterior distribution over the inducing\ninputs; this is most efficiently obtained via Bayes\u2019 rule after inferring the distribution over $f$.1\nUsing (1) and marginalizing over the exact prior on $u$ we obtain the approximate prior on $f$\n\n$$q(f|X) = \\int \\mathcal{N}\\big(f; K_{fu} K_{uu}^{-1} u, \\operatorname{diag}(K_{ff} - Q_{ff})\\big)\\, \\mathcal{N}(u; 0, K_{uu})\\, du = \\mathcal{N}\\big(f; 0, Q_{ff} + \\operatorname{diag}(K_{ff} - Q_{ff})\\big). \\qquad (3)$$\n\nIn the original paper, Snelson and Ghahramani placed the pseudo-inputs randomly and learned their\nlocations by non-linear optimization of the marginal likelihood. We have adopted the idea in this\npaper, but as emphasized in [8], the FITC approximation is applicable regardless of how the inducing\ninputs are obtained, and other schemes for their initialization could equally well be married with our\nalgorithm.\n\n1We could also infer the posterior over $u$ directly, rather than marginalizing over the inducing inputs as here.\nRunning EP in this setting, each site maintains a belief about the full $M \\times M$ covariance, and we obtain a slower\n$O(NM^3)$ algorithm. Furthermore, calculations to evaluate the derivatives of the log marginal likelihood with\nrespect to inducing inputs $\\bar{x}_m$ are significantly complicated by their presence in both prior and likelihood.\n\nIn the case of classification, a sigmoidal function assigns class labels $y_n \\in \\{\\pm 1\\}$ with a probability\nthat increases monotonically with the latent $f_n$. 
We use the probit with bias $\\beta$,\n\n$$p(y_n|f_n, \\beta) = \\sigma(y_n(f_n + \\beta)) \\doteq \\int_{-\\infty}^{y_n(f_n+\\beta)} \\mathcal{N}(z; 0, 1)\\, dz. \\qquad (4)$$\n\nThe posterior distribution $p(f|X, y)$ is only tractable for Gaussian likelihoods, hence we must resort\nto a further approximation, either by generating Monte Carlo samples from it or fitting deterministically\na Gaussian approximation. Of the latter methods, expectation propagation is possibly the most\naccurate (at least for GP classification; see [10]), and it is the approach we follow below.\n\n3 Inference\n\nWe begin with a very brief account of expectation propagation (EP); for more details, see [11,\n12]. Suppose we have an intractable distribution over $f$ whose unnormalized form factorizes into\na product of terms, such as a dense Gaussian prior $t_0(f)$ and a series of independent likelihoods\n$\\{t_n(y_n|f_n)\\}_{n=1}^N$. EP constructs the approximate posterior as a product of scaled site functions $\\tilde{t}_n$.\nFor computational tractability, these sites are usually chosen from an exponential family with natural\nparameters $\\theta$, since in this case their product retains the same functional form as its components.\nThe Gaussian $(\\mu, \\Sigma)$ has a natural parameterization $(b, \\Pi) = (\\Sigma^{-1}\\mu, -\\frac{1}{2}\\Sigma^{-1})$. If the prior is of\nthis form, its site function is exact:\n\n$$p(f|y) = \\frac{1}{Z}\\, t_0(f) \\prod_{n=1}^N t_n(y_n|f_n) \\approx q(f; \\theta) = t_0(f) \\prod_{n=1}^N z_n \\tilde{t}_n(f_n; \\theta_n), \\qquad (5)$$\n\nwhere $Z$ is the marginal likelihood and $z_n$ are the scale parameters. Ideally, we would choose $\\theta$ at\nthe global minimum of some divergence measure $d(p\\|q)$, but the necessary optimization is usually\nintractable. EP is an iterative procedure that finds a minimizer of $\\mathrm{KL}\\big(p(f|y)\\,\\|\\,q(f; \\theta)\\big)$ on a pointwise\nbasis: at each iteration, we select a new site $n$, and from the product of the cavity distribution formed\nby the current marginal with the omission of that site, and the true likelihood term $t_n$, we obtain the\nso-called tilted distribution $q_n(f_n; \\theta^{\\backslash n})$. A simpler optimization $\\min_{\\theta_n} \\mathrm{KL}\\big(q_n(f_n; \\theta^{\\backslash n})\\,\\|\\,q(f_n; \\theta)\\big)$\nthen fits only the parameters $\\theta_n$: this is equivalent to moment matching between the two distributions,\nwith scale $z_n$ chosen to match the zeroth-order moments. After each site update, the moments at the\nremaining sites are liable to change, and several iterations may be required before convergence.\n\nIn the discussion below we omit the moment calculations for the probit model, since they correspond\nto those of traditional GP classification (for more details, consult [9]). Of greater interest is how the\nmean and covariance structure of the approximate posterior is preserved. Examining the form of the\nprior (3), we see the covariance consists of a diagonal component $D_0$ and a rank-$M$ term $P_0 M_0 P_0^T$,\nwhere $P_0 = K_{fu}$ and $M_0 = K_{uu}^{-1}$ (zero subscripts refer to these initial values; the matrices are\nupdated during the course of the EP iterations). Since the observations $y_n$ are generated i.i.d., we\ncan expect this decomposition to persist in the posterior.\nEP requires efficient operations for marginalization to obtain $p(f_n)$, and for updating the posterior\ndistribution after refining a site, as well as for refreshing the posterior to avoid loss of numerical\nprecision. Decomposing $M = R^T R$ into its Cholesky factor,2 we represent the posterior covariance\n$A$ and mean $h$ by\n\n$$A = D + P R^T R P^T, \\qquad h = \\nu + P\\gamma,$$\n\nwhere $D$ is diagonal, $\\nu$ is $N \\times 1$ and $\\gamma$ is $M \\times 1$. Writing $p_n^T = P_{(n,\\cdot)}$ and $d_n = D_{nn}$,\n\n$$A_{nn} = d_n + \\|R p_n\\|^2, \\qquad h_n = \\nu_n + p_n^T \\gamma,$$\n\nobtaining marginals in $O(M^2)$.\n\n2Care must be taken that the factors share the correct orientation. When our environment offers only upper\nCholesky factors $R^T R$, the initialization of $R_0 = \\operatorname{chol}(K_{uu}^{-1})$ can be achieved without computing the explicit\ninverse via the following matrix rotations:\n\n$$R_0 := \\operatorname{rot180}\\big(\\operatorname{chol}(\\operatorname{rot180}(K_{uu}))^T \\backslash I\\big).$$\n\nNow consider a change in the precision at site $n$ by $\\pi_n$. Define the vector $e$ of length $N$ such that\n$e_n = 1$ and all other elements are zero. The new covariance $A_{\\mathrm{new}}$ is obtained by inverting the sum\nof the old precision matrix and the change in precision. If we let $E = D^{-1} + \\pi_n e e^T$, so that\n\n$$E^{-1} = D - \\frac{\\pi_n d_n^2}{1 + \\pi_n d_n}\\, e e^T \\quad\\text{and}\\quad (DED)^{-1} = D^{-1} - \\frac{\\pi_n}{1 + \\pi_n d_n}\\, e e^T,$$\n\nthen from the matrix inversion lemma, $A^{-1} = D^{-1} - D^{-1} P R^T (R P^T D^{-1} P R^T + I)^{-1} R P^T D^{-1}$,\nand incorporating the update to site $n$,\n\n$$A_{\\mathrm{new}} = E^{-1} - E^{-1} D^{-1} P R^T \\big(R P^T (DED)^{-1} P R^T - I - R P^T D^{-1} P R^T\\big)^{-1} R P^T D^{-1} E^{-1} = D_{\\mathrm{new}} + P_{\\mathrm{new}} R_{\\mathrm{new}}^T R_{\\mathrm{new}} P_{\\mathrm{new}}^T,$$\n\nwhere we expand the inversion to obtain a rank-1 downdate to the Cholesky factor $R$;3 in summary\n\n$$D_{\\mathrm{new}} = D - \\frac{\\pi_n d_n^2}{1 + \\pi_n d_n}\\, e e^T \\quad (O(1) \\text{ update}),$$\n$$P_{\\mathrm{new}} = P - \\frac{\\pi_n d_n}{1 + \\pi_n d_n}\\, e p_n^T \\quad (O(M) \\text{ update}),$$\n$$R_{\\mathrm{new}} = \\operatorname{chol}{\\downarrow}\\Big(R^T \\Big(I - R p_n \\frac{\\pi_n}{1 + \\pi_n A_{nn}} p_n^T R^T\\Big) R\\Big) \\quad (O(M^2) \\text{ update}).$$\n\nIf the second site parameter, corresponding to precision times mean, is changed by $b_n$, then\n\n$$A_{\\mathrm{new}}^{-1} h_{\\mathrm{new}} = A^{-1} h + b_n e \\;\\Longrightarrow\\; h_{\\mathrm{new}} = A_{\\mathrm{new}}\\big(A_{\\mathrm{new}}^{-1} - \\pi_n e e^T\\big) h + A_{\\mathrm{new}} b_n e = \\nu_{\\mathrm{new}} + P_{\\mathrm{new}} \\gamma_{\\mathrm{new}},$$\n\nwhere\n\n$$\\nu_{\\mathrm{new}} = \\nu + \\frac{(b_n - \\pi_n \\nu_n) d_n}{1 + \\pi_n d_n}\\, e \\quad (O(1)), \\qquad \\gamma_{\\mathrm{new}} = \\gamma + \\frac{b_n - \\pi_n h_n}{1 + \\pi_n d_n}\\, R_{\\mathrm{new}}^T R_{\\mathrm{new}} p_n \\quad (O(M^2)).$$\n\nIt is necessary to refresh the covariance and mean every complete EP cycle to avoid loss of precision.\n\n$$D_{\\mathrm{new}} = (I + D_0 \\Pi)^{-1} D_0 \\quad (O(N)),$$\n$$P_{\\mathrm{new}} = (I + D_0 \\Pi)^{-1} P_0 \\quad (O(NM)),$$\n$$R_{\\mathrm{new}} = \\operatorname{rot180}\\Big(\\operatorname{chol}\\big(\\operatorname{rot180}(I + R_0 P_0^T \\Pi (I + D_0 \\Pi)^{-1} P_0 R_0^T)\\big)^T\\Big) \\backslash R_0 \\quad (O(NM^2)),$$\n\nwhere $R_{\\mathrm{new}}$ is obtained being careful to ensure the orientations of the factorizations are not mixed.\nFinally, the mean is refreshed using\n\n$$\\nu_{\\mathrm{new}} = D_{\\mathrm{new}} b \\text{ in } O(N), \\qquad \\gamma_{\\mathrm{new}} = R_{\\mathrm{new}}^T R_{\\mathrm{new}} P_{\\mathrm{new}}^T b \\text{ in } O(NM),$$\n\nwhere we have assumed $h_0 = 0$.\nReviewing the algorithm above, we see that EP costs are dominated by the $O(M^2)$ Cholesky downdate\nat each site inclusion. After visiting each of the $N$ sites, we are advised to perform a full refresh,\nwhich is $O(NM^2)$, together leading to asymptotic complexity of $O(NM^2)$.\n\n3If the factor $\\frac{\\pi_n}{1 + \\pi_n A_{nn}}$ is negative, we make a rank-1 update, guaranteed to preserve the positive definite\nproperty. Note that on rare occasions, loss of precision can cause a downdate to result in a non-positive definite\ncovariance matrix. If this occurs, we should abort the update and refresh the posterior from scratch. In any\ncase, to improve conditioning, it is recommended to add a small multiple of the identity to the prior $M_0$.\n\n3.1 Predictions\n\nTo make predictions, we marginalize out $u$ from (2). Initially, Bayes\u2019 theorem is used to find the\nposterior distribution over $u$ from the inferred posterior over $f$:\n\n$$p(u|f) \\propto p(f|u)\\, p(u) = \\mathcal{N}(u \\,|\\, R_0^{-1} c,\\; R_0^{-1} C R_0^{-T}),$$\n\nwhere $c = C R_0 P_0^T D_0^{-1} f$ and $C^{-1} = I + R_0 P_0^T D_0^{-1} P_0 R_0^T$.\nLet our posterior approximation be $q(f|y) = \\mathcal{N}(f; h, A)$. 
Hence\n\n$$p(u|y) \\approx \\int p(u|f)\\, q(f|y)\\, df = \\mathcal{N}(u \\,|\\, R_0^{-1}\\mu,\\; R_0^{-1} \\Sigma R_0^{-T}),$$\n\nwhere $\\mu = C R_0 P_0^T D_0^{-1} h$ and $\\Sigma = C + C R_0 P_0^T D_0^{-1} A D_0^{-1} P_0 R_0^T C$.\n\nObtaining these terms is $O(NM^2)$ if we take advantage of the structure of $A$; the most stable\nmethod is via the Cholesky factorization of $C^{-1}$, rather than forming the explicit inverse. At $x_\\star$,\n\n$$p(f_\\star|x_\\star, y) = \\int p(f_\\star|u)\\, p(u|y)\\, du = \\mathcal{N}(f_\\star \\,|\\, \\mu_\\star, \\sigma_\\star^2);$$\n\nafter precomputations, $\\mu_\\star = k_\\star^T R_0^T \\mu$ is $O(M)$, and $\\sigma_\\star^2 = k_{\\star\\star} + k_\\star^T R_0^T (\\Sigma - I) R_0 k_\\star$ is $O(M^2)$.\nIn the classification domain, we will usually be interested in\n\n$$p(y_\\star|x_\\star, y) = \\int p(y_\\star|f_\\star)\\, p(f_\\star|x_\\star, y)\\, df_\\star = \\sigma\\!\\left(\\frac{y_\\star \\mu_\\star}{\\sqrt{1 + \\sigma_\\star^2}}\\right).$$\n\n4 Model selection\n\nEP provides an estimate of the log evidence by matching the 0th-order moments $z_n$ at each inclusion.\nWhen our posterior approximation is exponential family, Seeger [12] shows the estimate to be\n\n$$L = \\sum_{n=1}^N \\log C_n + \\Phi(\\theta^{\\mathrm{post}}) - \\Phi(\\theta^{\\mathrm{prior}}), \\quad\\text{where}\\quad \\log C_n = \\log z_n - \\Phi(\\theta^{\\mathrm{post}}) + \\Phi(\\theta^{\\backslash n}),$$\n\nwhere $\\Phi(\\cdot)$ denotes the log partition function and $\\theta$ are again the natural parameters, with superscripts\nindicating prior, posterior and cavity. Of interest for model selection are derivatives of the\nmarginal likelihood with respect to hyperparameters $\\{\\xi, \\bar{X}, \\beta\\}$, respectively the kernel parameters,\npseudo-input locations, and noise model parameters. 
When the EP fixed point conditions hold (that\nis, the moments of the tilted distributions match the marginals up to second order for all sites),\n\n$$\\nabla_{\\theta^{\\mathrm{prior}}} L = \\eta^{\\mathrm{post}} - \\eta^{\\mathrm{prior}} \\quad\\text{and}\\quad \\nabla_{\\beta_n} L = \\log z_n,$$\n\nwhere $\\eta$ denotes the moment parameters of the exponential family (for the Gaussian, these are\n$(\\mu, \\Sigma + \\mu\\mu^T)$) and $\\beta_n$ is a parameter of site $n$ (and does not feature in the prior). Finally, we need\nderivatives $\\nabla_\\xi \\theta^{\\mathrm{prior}}$ and $\\nabla_{\\bar{X}} \\theta^{\\mathrm{prior}}$. The long-winded details are omitted, but by careful consideration\nof the covariance structure, it is again possible to limit the complexity to $O(NM^2)$.\nSince we run EP until convergence, our estimates for the marginal likelihood and its derivatives are\naccurate, allowing us reliably to fit a model that maximizes the evidence. This is in contrast to the\nIVM, in which sites excluded from the active set have parameters clamped to zero, and where those\nincluded are not iterated to convergence, such that the necessary fixed point conditions do not hold.\nA particular problem, suffered also by the similar algorithm in [13], is that derivative calculations\nmust be interleaved with site inclusions, and the latter operation tends to disrupt gradient information\ngained from the previous step. These complications are all sidestepped in our SPGP implementation.\n\n5 Experiments\n\nWe conducted tests on a variety of data, including two small sets from [14]4 and the benchmark\nsuite of R\u00e4tsch.5 The dimensionality of these classification problems ranges from two to sixty, and\nthe size of the training sets is of the order of 400 to 1000. Results are presented in table 1. For\ncrabs and the R\u00e4tsch sets, we average over ten folds of the data; for the synth problem, Ripley has\nalready divided the data into training and test partitions. 
Comparisons are made with the full GP\nclassifier, and the SVM, a widely-used discriminative model which in practice is found to yield\nrelatively sparse solutions; we consider also the IVM, a popular framework for building sparse\nlinear models. In all cases, we employed the isotropic squared exponential kernel, avoiding here the\nanisotropic version primarily to allow comparison with the SVM: lacking a probabilistic foundation,\nits kernel parameters and regularization constant must be set by cross-validation. For the IVM,\nhyperparameter optimization is interleaved with active set selection as described in [2], while for the\nother GP models, we fit hyperparameters by gradient ascent on the estimated marginal likelihood,\nlimiting the process to twenty conjugate gradient iterations; we retained for testing that of three\nto five randomly initialized models which the evidence most favoured. Results on the R\u00e4tsch data\nfor the semi-parametric radial basis function network are omitted for lack of space, but available at\nthe site given in footnote 5. In comparison with that model, SPGP tends to give sparser and more\naccurate results (with the benefit of a sound Bayesian framework).\n\n4Available from http://www.stats.ox.ac.uk/pub/PRNN/.\n5Available from http://ida.first.fhg.de/projects/bench/benchmarks.htm.\n\nTable 1: Test errors and predictive accuracy (smaller is better) for the GP classifier, the support\nvector machine, the informative vector machine, and the sparse pseudo-input GP classifier.\n\nData set                       GPC           SVM           IVM                  SPGPC\nname           train:test dim  err    nlp    err    #sv    err    nlp    M      err    nlp    M\nsynth          250:1000   2    0.097  0.227  0.098  98     0.096  0.235  150    0.087  0.234  4\ncrabs          80:120     5    0.039  0.096  0.168  67     0.066  0.134  60     0.043  0.105  10\nbanana         400:4900   2    0.105  0.237  0.106  151    0.105  0.242  200    0.107  0.261  20\nbreast-cancer  200:77     9    0.288  0.558  0.277  122    0.307  0.691  120    0.281  0.557  2\ndiabetes       468:300    8    0.231  0.475  0.226  271    0.230  0.486  400    0.230  0.485  2\nflare-solar    666:400    9    0.346  0.570  0.331  556    0.340  0.628  550    0.338  0.569  3\ngerman         700:300    20   0.230  0.482  0.247  461    0.290  0.658  450    0.236  0.491  4\nheart          170:100    13   0.178  0.423  0.166  92     0.203  0.455  120    0.172  0.414  2\nimage          1300:1010  18   0.027  0.078  0.040  462    0.028  0.082  400    0.031  0.087  200\nringnorm       400:7000   20   0.016  0.071  0.016  157    0.016  0.101  100    0.014  0.089  2\nsplice         1000:2175  60   0.115  0.281  0.102  698    0.225  0.403  700    0.126  0.306  200\nthyroid        140:75     5    0.043  0.093  0.056  61     0.041  0.120  40     0.037  0.128  6\ntitanic        150:2051   3    0.221  0.514  0.223  118    0.242  0.578  100    0.231  0.520  2\ntwonorm        400:7000   20   0.031  0.085  0.027  220    0.031  0.085  300    0.026  0.086  2\nwaveform       400:4600   21   0.100  0.229  0.107  148    0.100  0.232  250    0.099  0.228  10\n\nIdentical tests were run for a range of active set sizes on the IVM and SPGP classifier, and we have\nattempted to present the large body of results in its most comprehensible form: we list only the\nsparsest competitive solution obtained. This means that using $M$ smaller than shown tends to cause\na deterioration in performance, but not that there is no advantage in increasing the value. After all,\nas $M \\to N$ we expect error rates to match those of the full model (at least for the IVM, which\nuses a subset of the training data).6 However, we believe that in exploring the behaviour of a sparse\nmodel, the essential question is: what is the greatest sparsity we can achieve without compromising\nperformance? (since if sparsity were not an issue, we would simply revert to the original GP).\nSmall values of $M$ for the FITC approximation were found to give remarkably low error rates, and\nincremented singly would often give an improved approximation. 
In contrast, the IVM predictions\nwere no better than random guesses for even moderate $M$\u2014it usually failed if the active set was\nsmaller than a threshold around $N/3$, where it was simply discarding too much information\u2014and\ngreater step sizes were required for noticeable improvements in performance. With a few exceptions\nthen, for FITC we explored small $M$, while for the IVM we used larger values, more widely spread.\nMore challenging is the task of discriminating 4s from non-4s in the USPS digit database: the data\nare 256-dimensional, and there are 7291 training and 2007 test points. With 200 pseudo-inputs (and\n51,200 parameters for optimization), error rates for SPGPC are 1.94%, with an average negative log\nprobability of 0.051 nats. These figures improve when the allocation is raised to 400 pseudo-inputs,\nto 1.79% and 0.048 nats. When provided with only 200 points, the IVM figures are 9.97% and 0.421\nnats\u2014this can be regarded as a failure to generalize, since it corresponds to labelling all test inputs\nas \u201cnot 4\u201d\u2014but given an active set of 400 it reaches error rates of 1.54% and NLP of 0.085 nats.\n\n6Note that the evidence is a poor metric for choosing $M$ since it tends to increase monotonically as the\nexplicative power of the full GP is restored.\n\n6 Discussion\n\nA sparse approximation closely related to FITC is the \u201cdeterministic training conditional\u201d (DTC),\nwhose covariance consists solely of the low-rank term $LML^T$; it has appeared elsewhere under\nthe name projected latent variables [13]. In generative terms, DTC first obtains a posterior process\nby conditioning on the inducing inputs; observations $y$ are then drawn as noisy samples of the\nmean of this process. 
FITC is similar, but the draws are noisy samples from the posterior process\nitself\u2014hence, while the noise component for DTC is a constant corruption $\\sigma^2$, for FITC it grows\naway from the inducing inputs to $K_{nn} + \\sigma^2$. In comparing their SPGP model with DTC, Snelson and\nGhahramani [1] suggest that it is for this reason (i.e. due to the diagonal component in the covariance\nin FITC) that the optimization of pseudo-inputs by gradient ascent on the marginal likelihood can\nsucceed: without the noise reduction afforded locally by relocating pseudo-inputs, DTC does not\nprovide a sufficiently large gradient for them to move, and the optimization gets stuck. We believe\nthe same mechanism operates in general for non-Gaussian noise.\n\nThis difficulty would not be significant if alternative heuristics for building the active set greedily\nwere effective. We hypothesize however that the most informative vectors in the greedy sense of\nthe IVM tend to be those which lie close to the decision boundary. Such points will have a relatively\nstrong influence on its shape since the effect of the kernel falls off exponentially in distance\nsquared. A preferable solution may be that empirically found to occur with Tipping\u2019s relevance\nvector machine (RVM) [15], a degenerate GP where a particular prior on weights means only a few\nbasis functions survive an evidence maximization procedure to form the model;7 there, the classifier\nwas often parameterized by points distant from the decision boundary, suggested to be more\n\u201crepresentative\u201d of the data.\n\nWe illustrate with a simple example that, provided the optimization is feasible, very sparse solutions\nmay more easily be found if the inducing inputs can be positioned independently of the data. This\nallows the size of the active set to grow with the complexity of the problem, rather than with $N$, the\nnumber of training points. 
We drew samples from a two-dimensional \u201cxor\u201d problem, consisting of\nfour unit-variance Gaussian clusters at $(\\pm 1.5, \\pm 1.5)$ with a small overlap, giving an optimal error\nrate of around 13% and in loose terms a complexity which requires an active set of size four. By\nincreasing the size of the training set $N$ in increments from 40 to 400, we obtained the learning\ncurves of figure 1 for the IVM and FITC models: plotted against $N$ is the size of active set required\nfor the error rate to fall below 15%. Whereas the FITC model requires a constant four points to\nexplain the data, the demands of the IVM appear to increase almost linearly with $N$.\nEvidently, the FITC model is able to capture salient details more readily than the IVM, but we\nmay object that it is also a richer likelihood. We therefore show learning curves for the FITC\napproximation run using the IVM active set and, generously, optimal kernel parameters. With a\nrelatively simple and low-dimensional problem, the benefit of the adaptable active set that FITC\noffers is clearly less significant than that of the improved approximation itself\u2014although there is\na factor of 2 difference, and we believe the effects will be more pronounced for more complex\ndata. However, a sensible compromise where optimization of all pseudo-inputs is computationally\ninfeasible is to run the IVM to obtain an initial active set, but then switch to the FITC approximation\nand optimize only kernel parameters, or just a small selection of the pseudo-inputs. Another option,\nexplored by Snelson and Ghahramani [17] for this model in the case of regression, is to learn a\nlow dimensional projection of the data\u2014advantageous, since in this setting the pseudo-inputs only\noperate under projection and can be treated as low-dimensional, potentially reducing significantly\nthe scale of the optimization problem. 
We will report results of this extension in future work.\n\n7We have not compared our model with the RVM since that approximation suffers from nonsensical variance\nestimates away from the data. Rasmussen and Qui\u00f1onero-Candela [16] show how it can be \u201chealed\u201d through\naugmentation, but the resulting model is no longer sparse in the sense of providing $O(M^2)$ predictions.\n\n[Figure 1 plots. Left panel: size of active set M against size of training set N, with curves for FITC, IVM and IVM/FITC. Right panel: contour plot over the two-dimensional input space.]\n\nFigure 1: Left: learning curves for the toy problem described in the text. Right: contours of posterior\nprobability for FITC in ten CG iterations from a random initialization of pseudo-inputs (black dots).\n\n7 Conclusions\n\nWe have presented an efficient and numerically stable way of implementing the sparse FITC model\nin Gaussian processes. By way of example we considered binary classification in which extra data\npoints are introduced to form a continuously adaptable active set. We have demonstrated that the\nlocations of these pseudo-inputs can be fit synchronously with parameters of the kernel, and that\nthis procedure allows for very sparse solutions. Certain data sets, particularly those of very high\ndimensionality, are not amenable to this approach since the number of hyperparameters is unfeasibly\nlarge for non-linear optimization. In this case, we suggest resorting to a greedy approach, using a\nfast heuristic like the IVM to build the active set, but adopting the FITC approximation thereafter.\nAn alternative which deserves investigation is to attempt an initial round of k-means clustering.\n\nReferences\n\n[1] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances\nin Neural Information Processing Systems 18. MIT Press, 2005.\n[2] Neil Lawrence, Matthias Seeger, and Ralf Herbrich. 
Fast sparse Gaussian process methods: the informative\nvector machine. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.\n[3] Manfred Opper and Ole Winther. Gaussian processes for classification: mean field methods. Neural\nComputation, 12(11):2655\u20132684, 2000.\n[4] Volker Tresp. A Bayesian committee machine. Neural Computation, 12(11):2719\u20132741, 2000.\n[5] Alex Smola and Peter Bartlett. Sparse greedy Gaussian process regression. In Advances in Neural\nInformation Processing Systems 13. MIT Press, 2001.\n[6] Lehel Csat\u00f3. Gaussian processes: iterative sparse approximations. PhD thesis, Aston University, 2002.\n[7] Matthias Seeger. Bayesian Gaussian process models: PAC-Bayesian generalisation error bounds and\nsparse approximations. PhD thesis, University of Edinburgh, 2003.\n[8] Joaquin Qui\u00f1onero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian\nprocess regression. Journal of Machine Learning Research, 6(12):1939\u20131959, 2005.\n[9] Carl Rasmussen and Christopher Williams. Gaussian processes for machine learning. MIT Press, 2006.\n[10] Malte Kuss and Carl Edward Rasmussen. Assessing approximations for Gaussian process classification.\nIn Advances in Neural Information Processing Systems 18. MIT Press, 2005.\n[11] Thomas Minka. A family of algorithms for approximate Bayesian inference. PhD thesis, Massachusetts\nInstitute of Technology, 2001.\n[12] Matthias Seeger. Expectation propagation for exponential families, 2005. Available from\nhttp://www.cs.berkeley.edu/~mseeger/papers/epexpfam.ps.gz.\n[13] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse\nGaussian process regression. In Proceedings of the 9th International Workshop on AI Stats. Society for\nArtificial Intelligence and Statistics, 2003.\n[14] Brian Ripley. Pattern recognition and neural networks. 
Cambridge University Press, 1996.\n[15] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine\nLearning Research, 1:211\u2013244, 2001.\n[16] Carl Edward Rasmussen and Joaquin Qui\u00f1onero-Candela. Healing the relevance vector machine through\naugmentation. In Proceedings of 22nd ICML. ACM Press, 2005.\n[17] Edward Snelson and Zoubin Ghahramani. Variable noise and dimensionality reduction for sparse Gaussian\nprocesses. In Proceedings of the 22nd Annual Conference on Uncertainty in AI. AUAI Press, 2006.\n", "award": [], "sourceid": 552, "authors": [{"given_name": "Andrew", "family_name": "Naish-guzman", "institution": null}, {"given_name": "Sean", "family_name": "Holden", "institution": null}]}