{"title": "Scalable Log Determinants for Gaussian Process Kernel Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 6327, "page_last": 6337, "abstract": "For applications as varied as Bayesian neural networks, determinantal point processes, elliptical graphical models, and kernel learning for Gaussian processes (GPs), one must compute a log determinant of an n by n positive definite matrix, and its derivatives---leading to prohibitive O(n^3) computations.  We propose novel O(n) approaches to estimating these quantities from only fast matrix vector multiplications (MVMs). These stochastic approximations are based on Chebyshev, Lanczos, and surrogate models, and converge quickly even for kernel matrices that have challenging spectra.  We leverage these approximations to develop a scalable Gaussian process approach to kernel learning. We find that Lanczos is generally superior to Chebyshev for kernel learning, and that a surrogate approach can be highly efficient and accurate with popular kernels.", "full_text": "Scalable Log Determinants for Gaussian Process\n\nKernel Learning\n\nKun Dong 1, David Eriksson 1, Hannes Nickisch 2, David Bindel 1, Andrew Gordon Wilson 1\n\n1 Cornell University, 2 Phillips Research Hamburg\n\nAbstract\n\nFor applications as varied as Bayesian neural networks, determinantal point pro-\ncesses, elliptical graphical models, and kernel learning for Gaussian processes\n(GPs), one must compute a log determinant of an n \u00d7 n positive de\ufb01nite matrix,\nand its derivatives \u2013 leading to prohibitive O(n3) computations. We propose novel\nO(n) approaches to estimating these quantities from only fast matrix vector mul-\ntiplications (MVMs). These stochastic approximations are based on Chebyshev,\nLanczos, and surrogate models, and converge quickly even for kernel matrices that\nhave challenging spectra. We leverage these approximations to develop a scalable\nGaussian process approach to kernel learning. 
We find that Lanczos is generally superior to Chebyshev for kernel learning, and that a surrogate approach can be highly efficient and accurate with popular kernels.

1 Introduction

There is a pressing need for scalable machine learning approaches to extract rich statistical structure from large datasets. A common bottleneck, arising in determinantal point processes [1], Bayesian neural networks [2], model comparison [3], graphical models [4], and Gaussian process kernel learning [5], is computing a log determinant over a large positive definite matrix. While we can approximate log determinants by existing stochastic expansions relying on matrix-vector multiplications (MVMs), these approaches make assumptions, such as near-uniform eigenspectra [6], which are unsuitable in machine learning contexts. For example, the popular RBF kernel gives rise to rapidly decaying eigenvalues. Moreover, while standard approaches, such as stochastic power series, have reasonable asymptotic complexity in the rank of the matrix, they require too many terms (MVMs) for the precision necessary in machine learning applications.

Gaussian processes (GPs) provide a principled probabilistic kernel learning framework, for which a log determinant is of foundational importance. Specifically, the marginal likelihood of a Gaussian process is the probability of the data given only kernel hyperparameters. This utility function for kernel learning compartmentalizes into automatically calibrated model fit and complexity terms, called automatic Occam's razor, such that the simplest models which explain the data are automatically favoured [7, 5], without the need for approaches such as cross-validation or regularization, which can be costly, heuristic, and involve substantial hand-tuning and human intervention.
The automatic complexity penalty, called the Occam's factor [3], is a log determinant of a kernel (covariance) matrix, related to the volume of solutions that can be expressed by the Gaussian process.

Many current approaches to scalable Gaussian processes [e.g., 8-10] focus on inference assuming a fixed kernel, or use approximations that do not allow for very flexible kernel learning [11], due to poor scaling with the number of basis functions or inducing points. Alternatively, approaches which exploit algebraic structure in kernel matrices can provide highly expressive kernel learning [12], but are essentially limited to grid-structured data.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Recently, Wilson and Nickisch [13] proposed the structured kernel interpolation (SKI) framework, which generalizes structure-exploiting methods to arbitrarily located data. SKI works by providing accurate and fast matrix-vector multiplies (MVMs) with kernel matrices, which can then be used in iterative solvers such as linear conjugate gradients for scalable GP inference. However, evaluating the marginal likelihood and its derivatives, for kernel learning, has followed a scaled eigenvalue approach [12, 13] instead of iterative MVM approaches. This approach can be inaccurate, and relies on a fast eigendecomposition of a structured matrix, which is not available in many consequential situations where fast MVMs are available, including: (i) additive covariance functions, (ii) multi-task learning, (iii) change-points [14], and (iv) diagonal corrections to kernel approximations [15]. Fiedler [16] and Weyl [17] bounds have been used to extend the scaled eigenvalue approach [18, 14], but are similarly limited.
These extensions are often very approximate, and do not apply beyond sums of two or three matrices, where each matrix in the sum must have a fast eigendecomposition.

In machine learning there has recently been renewed interest in MVM-based approaches to approximating log determinants, such as the Chebyshev [19] and Lanczos [20] based methods, although these approaches go back at least two decades in quantum chemistry computations [21]. Independently, several authors have proposed various methods to compute derivatives of log determinants [22, 23]. But both the log determinant and its derivatives are needed for efficient GP marginal likelihood learning: the derivatives are required for gradient-based optimization, while the log determinant itself is needed for model comparison, comparisons between the likelihoods at local maximizers, and fast and effective choices of starting points and step sizes in a gradient-based optimization algorithm.

In this paper, we develop novel scalable and general-purpose Chebyshev, Lanczos, and surrogate approaches for efficiently and accurately computing both the log determinant and its derivatives simultaneously. Our methods use only fast MVMs, and re-use the same MVMs for both computations. In particular:

• We derive fast methods for simultaneously computing the log determinant and its derivatives by stochastic Chebyshev, stochastic Lanczos, and surrogate models, from MVMs alone.
We also perform an error analysis and extend these approaches to higher-order derivatives.

• These methods enable fast GP kernel learning whenever fast MVMs are possible, including applications where alternatives such as scaled eigenvalue methods (which rely on fast eigendecompositions) are not, such as for (i) diagonal corrections for better kernel approximations, (ii) additive covariances, (iii) multi-task approaches, and (iv) non-Gaussian likelihoods.

• We illustrate the performance of our approach on several large, multi-dimensional datasets, including a consequential crime prediction problem, and a precipitation problem with n = 528,474 training points. We consider a variety of kernels, including deep kernels [24], diagonal corrections, and both Gaussian and non-Gaussian likelihoods.

• We have released code and tutorials as an extension to the GPML library [25] at https://github.com/kd383/GPML_SLD. A Python implementation of our approach is also available through the GPyTorch library: https://github.com/jrg365/gpytorch.

When using our approach in conjunction with SKI [13] for fast MVMs, GP kernel learning is O(n + g(m)), for m inducing points and n training points, where g(m) ≤ m log m. With algebraic approaches such as SKI we also do not need to worry about quadratic storage in inducing points, since symmetric Toeplitz and Kronecker matrices can be stored with at most linear cost, without needing to explicitly construct a matrix.

Although we here use SKI for fast MVMs, we emphasize that the proposed iterative approaches are generally applicable, and can easily be used in conjunction with any method that admits fast MVMs, including classical inducing point methods [8], finite basis expansions [9], and the popular stochastic variational approaches [10].
Moreover, stochastic variational approaches can naturally be combined with SKI to further accelerate MVMs [26].

We start in §2 with an introduction to GPs and kernel approximations. In §3 we introduce stochastic trace estimation and Chebyshev (§3.1) and Lanczos (§3.2) approximations. In §4, we describe the different sources of error in our approximations. In §5 we consider experiments on several large real-world data sets. We conclude in §6. The supplementary materials also contain several additional experiments and details.

2 Background

A Gaussian process (GP) is a collection of random variables, any finite number of which have a joint Gaussian distribution [e.g., 5]. A GP can be used to define a distribution over functions f(x) ~ GP(µ(x), k(x, x′)), where each function value is a random variable indexed by x ∈ Rᵈ, and µ : Rᵈ → R and k : Rᵈ × Rᵈ → R are the mean and covariance functions of the process. The covariance function is often chosen to be an RBF or Matérn kernel (see the supplementary material for more details). We denote any kernel hyperparameters by the vector θ. To be concise, we will generally not explicitly denote the dependence of k and associated matrices on θ.

For any locations X = {x₁, ..., xₙ} ⊂ Rᵈ, f_X ~ N(µ_X, K_XX), where f_X and µ_X represent the vectors of function values for f and µ evaluated at each of the xᵢ ∈ X, and K_XX is the matrix whose (i, j) entry is k(xᵢ, xⱼ). Suppose we have a vector of corresponding function values y ∈ Rⁿ, where each entry is contaminated by independent Gaussian noise with variance σ².
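To make the exact costs discussed in this section concrete, here is a minimal numpy sketch (our illustration, not the authors' GPML or GPyTorch code) of the standard O(n³) Cholesky-based evaluation of the GP log marginal likelihood under this observation model, the baseline that the paper's MVM-based methods are designed to avoid:

```python
import numpy as np

def exact_log_marginal_likelihood(K, y, sigma2, mu=None):
    """O(n^3) baseline: evaluate the GP log marginal likelihood
    -1/2 [ (y - mu)^T alpha + log|K~| + n log 2*pi ],
    with K~ = K + sigma2*I and alpha = K~^{-1}(y - mu),
    via a Cholesky factorization K~ = L L^T."""
    n = y.shape[0]
    r = y if mu is None else y - mu
    K_tilde = K + sigma2 * np.eye(n)
    L = np.linalg.cholesky(K_tilde)
    # alpha = K~^{-1} r via two triangular solves
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    # log|K~| comes for free from the Cholesky diagonal
    logdet = 2.0 * np.sum(np.log(np.diag(L)))
    return -0.5 * (r @ alpha + logdet + n * np.log(2.0 * np.pi))
```

Both the factorization and its storage are cubic and quadratic in n respectively, which is exactly why the paper replaces this computation with stochastic MVM-based estimators.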
Under a Gaussian process prior depending on the covariance hyperparameters θ, the log marginal likelihood is given by

$$\mathcal{L}(\theta \mid y) = -\frac{1}{2}\left[ (y - \mu_X)^T \alpha + \log|\tilde{K}_{XX}| + n \log 2\pi \right], \tag{1}$$

where $\alpha = \tilde{K}_{XX}^{-1}(y - \mu_X)$ and $\tilde{K}_{XX} = K_{XX} + \sigma^2 I$. Optimization of (1) is expensive, since the cheapest way of evaluating log|K̃_XX| and its derivatives without taking advantage of the structure of K̃_XX involves computing the O(n³) Cholesky factorization of K̃_XX. O(n³) computations are too expensive for inference and learning beyond even just a few thousand points.

A popular approach to GP scalability is to replace the exact kernel k(x, z) by an approximate kernel that admits fast computations [8]. Several methods approximate k(x, z) via inducing points U = {u_j}, j = 1, ..., m, with U ⊂ Rᵈ. An example is the subset of regressors (SoR) kernel:

$$k_{\mathrm{SoR}}(x, z) = K_{xU} K_{UU}^{-1} K_{Uz},$$

which is a low-rank approximation [27]. The SoR matrix $K^{\mathrm{SoR}}_{XX} \in \mathbb{R}^{n \times n}$ has rank at most m, allowing us to solve linear systems involving $\tilde{K}^{\mathrm{SoR}}_{XX} = K^{\mathrm{SoR}}_{XX} + \sigma^2 I$ and to compute $\log|\tilde{K}^{\mathrm{SoR}}_{XX}|$ in O(m²n + m³) time. Another popular kernel approximation is the fully independent training conditional (FITC), which is a diagonal correction of SoR so that the diagonal is the same as for the original kernel [15]. Thus kernel matrices from FITC have low-rank plus diagonal structure.
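The O(m²n + m³) log-determinant cost quoted above for low-rank-plus-noise matrices follows from the matrix determinant lemma. A small illustrative sketch (ours, using a generic low-rank factor V rather than the paper's SoR matrices):

```python
import numpy as np

def lowrank_logdet(V, sigma2):
    """log|V V^T + sigma2 * I_n| in O(n m^2 + m^3) for an n-by-m
    factor V with m << n (e.g. V could be K_xU chol(K_UU)^{-T} for
    SoR).  Since V V^T has at most m nonzero eigenvalues,
    |V V^T + sigma2 I_n| = sigma2^{n-m} * |sigma2 I_m + V^T V|."""
    n, m = V.shape
    M = sigma2 * np.eye(m) + V.T @ V        # small m-by-m "capacitance" matrix
    sign, logdet_M = np.linalg.slogdet(M)
    assert sign > 0                          # M is SPD for sigma2 > 0
    return (n - m) * np.log(sigma2) + logdet_M
```

The same identity (with a diagonal matrix in place of σ²I) underlies fast computations with FITC's low-rank-plus-diagonal matrices.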
This modification has had exceptional practical significance, leading to improved point predictions and much more realistic predictive uncertainty [8, 28], making FITC arguably the most popular approach for scalable Gaussian processes.

Wilson and Nickisch [13] provide a mechanism for fast MVMs by proposing the structured kernel interpolation (SKI) approximation,

$$K_{XX} \approx W K_{UU} W^T, \tag{2}$$

where W is an n-by-m matrix of interpolation weights; the authors of [13] use local cubic interpolation so that W is sparse. The sparsity in W makes it possible to naturally exploit algebraic structure (such as Kronecker or Toeplitz structure) in K_UU when the inducing points U are on a grid, for extremely fast matrix-vector multiplications with the approximate K_XX, even if the data inputs X are arbitrarily located. For instance, if K_UU is Toeplitz, then each MVM with the approximate K_XX costs only O(n + m log m). By contrast, placing the inducing points U on a grid for classical inducing point methods, such as SoR or FITC, does not result in substantial performance gains, due to the costly cross-covariance matrices K_xU and K_Uz.

3 Methods

Our goal is to estimate, for a symmetric positive definite matrix K̃,

$$\log|\tilde{K}| = \mathrm{tr}(\log(\tilde{K})) \qquad \text{and} \qquad \frac{\partial}{\partial \theta_i}\left[\log|\tilde{K}|\right] = \mathrm{tr}\!\left(\tilde{K}^{-1} \frac{\partial \tilde{K}}{\partial \theta_i}\right),$$

where log is the matrix logarithm [29]. We compute the traces involved in both the log determinant and its derivative via stochastic trace estimators [30], which approximate the trace of a matrix using only matrix-vector products.

The key idea is that for a given matrix A and a random probe vector z with independent entries of mean zero and variance one, tr(A) = E[zᵀAz]; a common choice is to let the entries of the probe vectors be Rademacher random variables.
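This stochastic trace estimator is simple to implement from MVMs alone. A minimal sketch (our illustration, assuming only a black-box `matvec` function):

```python
import numpy as np

def hutchinson_trace(matvec, n, n_probes=30, rng=None):
    """Hutchinson-style estimate of tr(A) using only matrix-vector
    products: tr(A) = E[z^T A z] for probe vectors z with i.i.d.
    mean-zero, unit-variance entries; here z has Rademacher (+/-1)
    entries."""
    rng = np.random.default_rng(rng)
    est = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)   # Rademacher probe
        est += z @ matvec(z)
    return est / n_probes                      # sample mean over probes
```

For a diagonal matrix the Rademacher estimator is exact for every probe (since zᵢ² = 1); more generally its variance depends only on the off-diagonal mass of A, which is why few probes often suffice for the matrices considered here.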
In practice, we estimate the trace by the sample mean over n_z independent probe vectors. Often surprisingly few probe vectors suffice.

To estimate tr(log(K̃)), we need to multiply log(K̃) by probe vectors. We consider two ways to estimate log(K̃)z: by a polynomial approximation of log, or by using the connection between the Gaussian quadrature rule and the Lanczos method [19, 20]. In both cases, we show how to re-use the same probe vectors for an inexpensive coupled estimator of the derivatives. In addition, we may use standard radial basis function interpolation of the log determinant, evaluated at a few systematically chosen points in the hyperparameter space, as an inexpensive surrogate for the log determinant.

3.1 Chebyshev

Chebyshev polynomials are defined by the recursion

$$T_0(x) = 1, \quad T_1(x) = x, \quad T_{j+1}(x) = 2x T_j(x) - T_{j-1}(x) \ \text{ for } j \ge 1.$$

For f : [-1, 1] → R, the Chebyshev interpolant of degree m is

$$f(x) \approx p_m(x) := \sum_{j=0}^{m} c_j T_j(x), \qquad \text{where} \quad c_j = \frac{2 - \delta_{j0}}{m+1} \sum_{k=0}^{m} f(x_k) T_j(x_k),$$

where δ_{j0} is the Kronecker delta and x_k = cos(π(k + 1/2)/(m + 1)) for k = 0, 1, 2, ..., m; see [31]. Using the Chebyshev interpolant of log(1 + αx), we approximate log|K̃| by

$$\log|\tilde{K}| - n \log \beta = \log|I + \alpha B| \approx \sum_{j=0}^{m} c_j \, \mathrm{tr}(T_j(B))$$

when B = (K̃/β - I)/α has eigenvalues λᵢ ∈ (-1, 1).

For stochastic estimation of tr(T_j(B)), we only need to compute zᵀT_j(B)z for each given probe vector z.
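The pieces above, a Chebyshev interpolant of the logarithm plus stochastic trace estimation through the three-term recurrence, combine into a complete log-determinant estimator. A compact numpy sketch (our illustration; it assumes known eigenvalue bounds and interpolates f(x) = log(c + rx) directly, which is equivalent to the shifted form in the text up to the affine change of variables):

```python
import numpy as np

def chebyshev_logdet(matvec, n, lam_min, lam_max, deg=60, n_probes=20, rng=None):
    """Stochastic Chebyshev estimate of log|K| from MVMs alone, given
    bounds 0 < lam_min <= eig(K) <= lam_max.  We interpolate
    f(x) = log(c + r*x) on [-1, 1], so that f(B) = log(K) for
    B = (K - c I)/r, and estimate tr(p_deg(B)) with Rademacher probes
    run through the Chebyshev recurrence w_{j+1} = 2 B w_j - w_{j-1}.
    Assumes a moderate condition number so that c - r > 0."""
    rng = np.random.default_rng(rng)
    c = 0.5 * (lam_max + lam_min)
    r = 0.5 * (lam_max - lam_min) / 0.99       # slight stretch keeps eig(B) inside (-1, 1)
    # Chebyshev interpolation coefficients of f at the Chebyshev nodes x_k
    k = np.arange(deg + 1)
    xk = np.cos(np.pi * (k + 0.5) / (deg + 1))
    fk = np.log(c + r * xk)
    coef = np.array([(2.0 - (j == 0)) / (deg + 1)
                     * np.sum(fk * np.cos(j * np.arccos(xk)))
                     for j in range(deg + 1)])
    Bmv = lambda v: (matvec(v) - c * v) / r    # MVM with B = (K - c I)/r
    est = 0.0
    for _ in range(n_probes):
        z = rng.choice([-1.0, 1.0], size=n)
        w_prev, w = z, Bmv(z)                  # w_j = T_j(B) z
        acc = coef[0] * (z @ w_prev) + coef[1] * (z @ w)
        for j in range(2, deg + 1):
            w_prev, w = w, 2.0 * Bmv(w) - w_prev
            acc += coef[j] * (z @ w)
        est += acc
    return est / n_probes
```

The coupled derivative recurrences described next reuse exactly these vectors w_j, which is what makes the derivative estimate cheap.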
We compute the vectors w_j = T_j(B)z and ∂w_j/∂θ_i via the coupled recurrences

$$w_0 = z, \quad w_1 = Bz, \quad w_{j+1} = 2Bw_j - w_{j-1} \ \text{ for } j \ge 1,$$

$$\frac{\partial w_0}{\partial \theta_i} = 0, \quad \frac{\partial w_1}{\partial \theta_i} = \frac{\partial B}{\partial \theta_i} z, \quad \frac{\partial w_{j+1}}{\partial \theta_i} = 2\left(\frac{\partial B}{\partial \theta_i} w_j + B \frac{\partial w_j}{\partial \theta_i}\right) - \frac{\partial w_{j-1}}{\partial \theta_i} \ \text{ for } j \ge 1.$$

This gives the estimators

$$\log|\tilde{K}| \approx n \log \beta + \mathbb{E}\left[\sum_{j=0}^{m} c_j z^T w_j\right] \qquad \text{and} \qquad \frac{\partial}{\partial \theta_i} \log|\tilde{K}| \approx \mathbb{E}\left[\sum_{j=0}^{m} c_j z^T \frac{\partial w_j}{\partial \theta_i}\right].$$

Thus, each derivative of the approximation costs two extra MVMs per term.

3.2 Lanczos

We can also approximate zᵀ log(K̃)z via a Lanczos decomposition; see [32] for a discussion of Lanczos-based computation of zᵀf(K̃)z, and [20, 21] for stochastic Lanczos estimation of log determinants. We run m steps of the Lanczos algorithm, which computes the decomposition

$$\tilde{K} Q_m = Q_m T + \beta_m q_{m+1} e_m^T,$$

where Q_m = [q₁, q₂, ..., q_m] ∈ R^{n×m} is a matrix with orthonormal columns such that q₁ = z/‖z‖, T ∈ R^{m×m} is tridiagonal, β_m is the residual, and e_m is the m-th Cartesian unit vector. We estimate

$$z^T f(\tilde{K}) z \approx \|z\|^2 \, e_1^T f(T) e_1, \tag{3}$$

where e₁ is the first column of the identity. The Lanczos algorithm is numerically unstable; several practical implementations resolve this issue [33, 34]. The approximation (3) corresponds to a Gauss quadrature rule for the Riemann-Stieltjes integral of the measure associated with the eigenvalue distribution of K̃. It is exact when f is a polynomial of degree up to 2m - 1.
This approximation is also exact when K̃ has at most m distinct eigenvalues, which is particularly relevant to Gaussian process regression, since the kernel matrices frequently have only a small number of eigenvalues that are not close to zero.

The Lanczos decomposition also allows us to estimate derivatives of the log determinant at minimal cost. Via the Lanczos decomposition, we have

$$\hat{g} = Q_m (T^{-1} e_1 \|z\|) \approx \tilde{K}^{-1} z.$$

This approximation requires no additional matrix-vector multiplications beyond those used to compute the Lanczos decomposition, which we already used to estimate log(K̃)z; in exact arithmetic, this is equivalent to m steps of CG. Computing ĝ in this way takes O(mn) additional time; subsequently, we only need one matrix-vector multiply by ∂K̃/∂θ_i for each probe vector to estimate tr(K̃⁻¹(∂K̃/∂θ_i)) = E[(K̃⁻¹z)ᵀ(∂K̃/∂θ_i)z].

3.3 Diagonal correction to SKI

The SKI approximation may provide a poor estimate of the diagonal entries of the original kernel matrix for kernels with limited smoothness, such as the Matérn kernel. In general, diagonal corrections to scalable kernel approximations can lead to great performance gains. Indeed, the popular FITC method [15] is exactly a diagonal correction of the subset of regressors (SoR).

We thus modify the SKI approximation to add a diagonal matrix D,

$$K_{XX} \approx W K_{UU} W^T + D, \tag{4}$$

such that the diagonal of the approximated K_XX is exact. In other words, D subtracts the diagonal of W K_UU Wᵀ and adds the true diagonal of K_XX. This modification is not possible for the scaled eigenvalue method for approximating log determinants in [13], since adding a diagonal matrix makes it impossible to approximate the eigenvalues of K_XX from the eigenvalues of K_UU. However, Eq.
(4) still admits fast MVMs and thus works with our approach for estimating the log determinant and its derivatives. Computing D with SKI costs only O(n) flops, since W is sparse for local cubic interpolation: each row Wᵀeᵢ has a constant number of nonzeros, so each term (Wᵀeᵢ)ᵀ K_UU (Wᵀeᵢ) can be computed in O(1) flops.

3.4 Estimating higher derivatives

We have already described how to use stochastic estimators to compute the log marginal likelihood and its first derivatives. The same approach applies to computing higher-order derivatives for a Newton-like iteration, to understand the sensitivity of the maximum likelihood parameters, or for similar tasks. The first derivatives of the full log marginal likelihood are

$$\frac{\partial \mathcal{L}}{\partial \theta_i} = -\frac{1}{2}\left[ \mathrm{tr}\!\left(\tilde{K}^{-1} \frac{\partial \tilde{K}}{\partial \theta_i}\right) - \alpha^T \frac{\partial \tilde{K}}{\partial \theta_i} \alpha \right],$$

and the second derivatives of the two terms are

$$\frac{\partial^2}{\partial \theta_i \partial \theta_j}\left[\log|\tilde{K}|\right] = \mathrm{tr}\!\left(\tilde{K}^{-1} \frac{\partial^2 \tilde{K}}{\partial \theta_i \partial \theta_j} - \tilde{K}^{-1} \frac{\partial \tilde{K}}{\partial \theta_i} \tilde{K}^{-1} \frac{\partial \tilde{K}}{\partial \theta_j}\right),$$

$$\frac{\partial^2}{\partial \theta_i \partial \theta_j}\left[(y - \mu_X)^T \alpha\right] = 2\,\alpha^T \frac{\partial \tilde{K}}{\partial \theta_i} \tilde{K}^{-1} \frac{\partial \tilde{K}}{\partial \theta_j} \alpha - \alpha^T \frac{\partial^2 \tilde{K}}{\partial \theta_i \partial \theta_j} \alpha.$$

Superficially, evaluating the second derivatives would appear to require several additional solves beyond those used to estimate the first derivatives of the log determinant. In fact, we can get an unbiased estimator for the second derivatives with no additional solves, but only fast products with the derivatives of the kernel matrices. Let z and w be independent probe vectors, and define g = K̃⁻¹z and h = K̃⁻¹w.
Then

$$\frac{\partial^2}{\partial \theta_i \partial \theta_j}\left[\log|\tilde{K}|\right] = \mathbb{E}\left[ g^T \frac{\partial^2 \tilde{K}}{\partial \theta_i \partial \theta_j} z - \left(g^T \frac{\partial \tilde{K}}{\partial \theta_i} w\right)\left(h^T \frac{\partial \tilde{K}}{\partial \theta_j} z\right) \right],$$

$$\frac{\partial^2}{\partial \theta_i \partial \theta_j}\left[(y - \mu_X)^T \alpha\right] = 2\,\mathbb{E}\left[\left(z^T \frac{\partial \tilde{K}}{\partial \theta_i} \alpha\right)\left(g^T \frac{\partial \tilde{K}}{\partial \theta_j} \alpha\right)\right] - \alpha^T \frac{\partial^2 \tilde{K}}{\partial \theta_i \partial \theta_j} \alpha.$$

Hence, if we use the stochastic Lanczos method to compute the log determinant and its derivatives, the additional work required to obtain a second-derivative estimate is one MVM by each second partial of the kernel for each probe vector and for α, one MVM of each first partial of the kernel with α, and a few dot products.

3.5 Radial basis functions

Another way to deal with the log determinant and its derivatives is to evaluate the log determinant term at a few systematically chosen points in the space of hyperparameters and fit an interpolation approximation to these values. This is particularly useful when the kernel depends on a modest number of hyperparameters (e.g., half a dozen), so that the number of points we need to precompute is relatively small. We refer to this method as a surrogate, since it provides an inexpensive substitute for the log determinant and its derivatives. For our surrogate approach, we use radial basis function (RBF) interpolation with a cubic kernel and a linear tail. See e.g.
[35-38] and the supplementary material for more details on RBF interpolation.

4 Error properties

In addition to the usual errors from sources such as solver termination criteria and floating-point arithmetic, our approach to kernel learning involves several additional sources of error: we approximate the true kernel with one that enables fast MVMs, we approximate traces using stochastic estimation, and we approximate the actions of log(K̃) and K̃⁻¹ on probe vectors.

We can compute first-order estimates of the sensitivity of the log likelihood to perturbations in the kernel using the same stochastic estimators we use for the derivatives with respect to hyperparameters. For example, if L_ref is the likelihood for a reference kernel K̃_ref = K̃ + E, then

$$\mathcal{L}_{\mathrm{ref}}(\theta \mid y) = \mathcal{L}(\theta \mid y) - \frac{1}{2}\left( \mathbb{E}\left[g^T E z\right] - \alpha^T E \alpha \right) + O(\|E\|^2),$$

and we can bound the change in likelihood at first order by $\frac{1}{2}\|E\|\left(\|g\|\|z\| + \|\alpha\|^2\right)$. Given bounds on the norms of ∂E/∂θ_i, we can similarly estimate changes in the gradient of the likelihood, allowing us to bound how the marginal likelihood hyperparameter estimates depend on kernel approximations.

If K̃ = UΛUᵀ + σ²I, the Hutchinson trace estimator has known variance [39]

$$\mathrm{Var}\left[z^T \log(\tilde{K}) z\right] = 2 \sum_{i \ne j} [\log(\tilde{K})]_{ij}^2 \le 2 \sum_{i=1}^{n} \log(1 + \lambda_i/\sigma^2)^2.$$

If the eigenvalues of the kernel matrix without noise decay rapidly enough compared to σ, the variance will be small compared to the magnitude of $\mathrm{tr}(\log \tilde{K}) = 2n \log \sigma + \sum_{i=1}^{n} \log(1 + \lambda_i/\sigma^2)$. Hence, we need fewer probe vectors to obtain reasonable accuracy than one would expect from bounds that are blind to the matrix structure.
In our experiments, we typically use only 5-10 probes, and we use the sample variance across these probes to estimate a posteriori the stochastic component of the error in the log likelihood computation. If we are willing to estimate the Hessian of the log likelihood, we can increase the rate of convergence for finding kernel hyperparameters.

The Chebyshev approximation scheme requires O(√κ log(κ/ε)) steps to obtain an O(ε) approximation error in computing zᵀ log(K̃)z, where κ = λ_max/λ_min is the condition number of K̃ [19]. This behavior is independent of the distribution of eigenvalues within the interval [λ_min, λ_max], and is close to optimal when eigenvalues are spread quasi-uniformly across the interval. Nonetheless, when the condition number is large, convergence may be quite slow. The Lanczos approach converges at least twice as fast as Chebyshev in general [20, Remark 1], and converges much more rapidly when the eigenvalues are not uniform within the interval, as is the case with log determinants of many kernel matrices. Hence, we recommend the Lanczos approach over the Chebyshev approach in general.
In all of our experiments, the error associated with approximating zᵀ log(K̃)z by Lanczos was dominated by other sources of error.

5 Experiments

We test our stochastic trace estimator with both Chebyshev and Lanczos approximation schemes on: (1) a sound time series with missing data, using a GP with an RBF kernel; (2) a three-dimensional space-time precipitation data set with over half a million training points, using a GP with an RBF kernel; (3) a two-dimensional tree growth data set, using a log-Gaussian Cox process model with an RBF kernel; (4) a three-dimensional space-time crime dataset, using a log-Gaussian Cox model with Matérn-3/2 and spectral mixture kernels; and (5) a high-dimensional feature space, using the deep kernel learning framework [24]. In the supplementary material we also include several additional experiments to illustrate particular aspects of our approach, including kernel hyperparameter recovery, diagonal corrections (Section 3.3), and surrogate methods (Section 3.5). Throughout, we use the SKI method [13] of Eq. (2) for fast MVMs. We find that the Lanczos and surrogate methods are able to do kernel recovery and inference significantly faster and more accurately than competing methods.

5.1 Natural sound modeling

Here we consider the natural sound benchmark in [13], shown in Figure 1(a). Our goal is to recover contiguous missing regions in a waveform with n = 59,306 training points. We exploit Toeplitz structure in the K_UU matrix of our SKI approximate kernel for accelerated MVMs.

The experiment in [13] only considered scalable inference and prediction, but not hyperparameter learning, since the scaled eigenvalue approach requires all the eigenvalues of an m × m Toeplitz matrix, which can be computationally prohibitive, with cost O(m²).
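The accelerated Toeplitz MVMs used here come from the standard circulant-embedding trick: a symmetric Toeplitz matrix embeds in a circulant matrix of twice the size, and circulant matrices diagonalize under the FFT. A minimal 1-D numpy sketch (our illustration, not the GPML_SLD implementation):

```python
import numpy as np

def toeplitz_matvec(col, v):
    """Multiply a symmetric Toeplitz matrix (first column `col`) by v
    in O(m log m): embed it in a 2m-by-2m circulant matrix whose first
    column is [c_0..c_{m-1}, 0, c_{m-1}..c_1], then use the FFT, since
    a circulant matvec is a circular convolution."""
    m = col.shape[0]
    circ = np.concatenate([col, [0.0], col[-1:0:-1]])
    prod = np.fft.ifft(np.fft.fft(circ) * np.fft.fft(v, 2 * m))
    return prod[:m].real   # first m entries give the Toeplitz product
```

Only the first column is ever stored, which is the linear-storage property of SKI's grid kernels mentioned in the introduction; Kronecker-structured grids in higher dimensions factor similarly, dimension by dimension.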
However, evaluating the marginal likelihood on this training set is not an obstacle for Lanczos and Chebyshev, since we can use fast MVMs with the SKI approximation at a cost of O(n + m log m).

In Figure 1(b), we show how the Lanczos, Chebyshev, and surrogate approaches scale with the number of inducing points m, compared to the scaled eigenvalue method and FITC. We use 5 probe vectors and 25 iterations for Lanczos, both when building the surrogate and for hyperparameter learning with Lanczos. We also use 5 probe vectors for Chebyshev, with 100 moments. Figure 1(b) shows the runtime of the hyperparameter learning phase for different numbers of inducing points m, where Lanczos and the surrogate are clearly more efficient than scaled eigenvalues and Chebyshev. For hyperparameter learning, FITC took several hours to run, compared to minutes for the alternatives; we therefore exclude FITC from Figure 1(b). Figure 1(c) shows the time to do inference on the 691 test points, while 1(d) shows the standardized mean absolute error (SMAE) on the same test points. As expected, Lanczos and the surrogate make accurate predictions much faster than Chebyshev, scaled eigenvalues, and FITC. In short, Lanczos and the surrogate approach are much faster than the alternatives for hyperparameter learning with large numbers of inducing points and training points.

[Figure 1 panels: (a) Sound data, (b) Recovery time, (c) Inference time, (d) SMAE.]

Figure 1: Sound modeling using 59,306 training points and 691 test points. The intensity of the time series can be seen in (a). Train time for RBF kernel hyperparameters is in (b), and the time for inference is in (c). The standardized mean absolute error (SMAE) as a function of time for an evaluation of the marginal likelihood and all derivatives is shown in (d).
Surrogate is (——), Lanczos is (- - -), Chebyshev is (— ♦ —), scaled eigenvalues is (— + —), and FITC is (— o —).

5.2 Daily precipitation prediction

This experiment involves precipitation data from the year 2010, collected from around 5500 weather stations in the US.¹ The hourly precipitation data is preprocessed into daily data if full information for the day is available. The dataset has 628,474 entries in terms of precipitation per day, given the date, longitude, and latitude. We randomly select 100,000 data points as test points and use the remaining points for training. We then perform hyperparameter learning and prediction with the RBF kernel, using Lanczos, scaled eigenvalues, and exact methods.

For Lanczos and scaled eigenvalues, we optimize the hyperparameters on the subset of data for January 2010, with an induced grid of 100 points per spatial dimension and 300 in the temporal dimension. Due to memory constraints, we only use a subset of 12,000 entries for training with the exact method.

¹ https://catalog.data.gov/dataset/u-s-hourly-precipitation-data
While scaled eigenvalues can perform well when fast eigendecompositions are possible, as in this experiment, Lanczos nonetheless still runs faster and with slightly lower MSE.

Table 1: Prediction comparison for the daily precipitation data, showing the number of training points n, the number of induced grid points m, the mean squared error, and the inference time.

Method             | n    | m  | MSE   | Time [min]
-------------------|------|----|-------|-----------
Lanczos            | 528k | 3M | 0.613 | 14.3
Scaled eigenvalues | 528k | 3M | 0.621 | 15.9
Exact              | 12k  | -  | 0.903 | 11.8

Incidentally, we are able to use 3 million inducing points for Lanczos and scaled eigenvalues, enabled by the SKI representation [13] of covariance matrices, for a very accurate approximation. This number of inducing points m is unprecedented for typical alternatives, which scale as O(m³).

5.3 Hickory data

In this experiment, we apply Lanczos to the log-Gaussian Cox process model with a Laplace approximation for the posterior distribution. We use the RBF kernel and the Poisson likelihood in our model. The scaled eigenvalue method does not apply directly to non-Gaussian likelihoods; we thus applied the scaled eigenvalue method in [13] in conjunction with the Fiedler bound in [18] for the scaled eigenvalue comparison. Indeed, a key advantage of the Lanczos approach is that it can be applied whenever fast MVMs are available, which means no additional approximations, such as the Fiedler bound, are required for non-Gaussian likelihoods.

This dataset, which comes from the R package spatstat, is a point pattern of 703 hickory trees in a forest in Michigan. We discretize the area into a 60 × 60 grid and fit our model with the exact, scaled eigenvalue, and Lanczos methods. We see in Table 2 that Lanczos recovers hyperparameters that are much closer to the exact values than the scaled eigenvalue approach.
Figure 2 shows that the predictions by Lanczos are also indistinguishable from the exact computation.

Method               sf      ℓ1      ℓ2      −log p(y|θ)   Time [s]
Exact                0.696   0.063   0.085   1827.56       465.9
Lanczos              0.693   0.066   0.096   1828.07       21.4
Scaled eigenvalues   0.543   0.237   0.112   1851.69       2.5

Table 2: Hyperparameters recovered on the Hickory dataset.

(a) Point pattern data. (b) Prediction by exact. (c) Scaled eigenvalues. (d) Lanczos.
Figure 2: Predictions by exact, scaled eigenvalues, and Lanczos on the Hickory dataset.

5.4 Crime prediction

In this experiment, we apply Lanczos with the spectral mixture kernel to the crime forecasting problem considered in [18]. This dataset consists of 233,088 incidents of assault in Chicago from January 1, 2004 to December 31, 2013. We use the first 8 years for training and attempt to predict the crime rate for the last 2 years. For the spatial dimensions, we use the log-Gaussian Cox process model, with the Matérn-5/2 kernel, the negative binomial likelihood, and the Laplace approximation for the posterior. We use a spectral mixture kernel with 20 components and an extra constant component for the temporal dimension. We discretize the data into a 17 × 26 spatial grid corresponding to 1-by-1 mile grid cells. In the temporal dimension we sum our data by weeks for a total of 522 weeks. After removing the cells that are outside Chicago, we have a total of 157,644 observations.
The results for Lanczos and scaled eigenvalues (in conjunction with the Fiedler bound due to the non-Gaussian likelihood) can be seen in Table 3. The Lanczos method used 5 Hutchinson probe vectors and 30 Lanczos steps. For both methods we allow 100 iterations of LBFGS to recover hyperparameters and we often observe early convergence.
While the RMSEs for Lanczos and scaled eigenvalues happen to be close on this example, the recovered hyperparameters using scaled eigenvalues are very different from those for Lanczos. For example, the scaled eigenvalue method learns a much larger σ² than Lanczos, indicating model misspecification. In general, as the data become increasingly non-Gaussian, the Fiedler bound (used for fast scaled eigenvalues on non-Gaussian likelihoods) will become increasingly misspecified, while Lanczos will be unaffected.

Method               ℓ1     ℓ2     σ²       T_recovery [s]   T_prediction [s]   RMSE_train   RMSE_test
Lanczos              0.65   0.67   69.72    264              10.30              1.17         1.33
Scaled eigenvalues   0.32   0.10   191.17   67               3.75               1.19         1.36

Table 3: Hyperparameters recovered, recovery time, and RMSE for Lanczos and scaled eigenvalues on the Chicago assault data. Here ℓ1 and ℓ2 are the length scales in the spatial dimensions and σ² is the noise level. T_recovery is the time for recovering hyperparameters. T_prediction is the time for prediction at all 157,644 observations (including training and testing).

5.5 Deep kernel learning

To handle high-dimensional datasets, we bring our methods into the deep kernel learning framework [24] by replacing the final layer of a pre-trained deep neural network (DNN) with a GP. This experiment uses the gas sensor dataset from the UCI machine learning repository. It has 2565 instances with 128 dimensions. We pre-train a DNN, then attach a Gaussian process with an RBF kernel to the two-dimensional output of the second-to-last layer. We then further train all parameters of the resulting kernel, including the weights of the DNN, through the GP marginal likelihood. In this example, Lanczos and the scaled eigenvalue approach perform similarly well.
Nonetheless, we see that Lanczos can effectively be used with SKI on a high-dimensional problem to train hundreds of thousands of kernel parameters.

Method               RMSE              Time [s]
DNN                  0.1366 ± 0.0387   0.4438
Lanczos              0.1053 ± 0.0248   2.0680
Scaled eigenvalues   0.1045 ± 0.0228   1.6320

Table 4: Prediction RMSE and per-training-iteration runtime.

6 Discussion

There are many cases in which fast MVMs can be achieved, but it is difficult or impossible to efficiently compute a log determinant. We have developed a framework for scalable and accurate estimates of a log determinant and its derivatives, relying only on MVMs. We particularly consider scalable kernel learning, showing the promise of stochastic Lanczos estimation combined with a pre-computed surrogate model. We have shown the scalability and flexibility of our approach through experiments with kernel learning for several real-world datasets, using both Gaussian and non-Gaussian likelihoods, and highly parametrized deep kernels.
Iterative MVM approaches have great promise for future exploration. We have only begun to explore their significant generality. In addition to log determinants, the methods presented here could be adapted to fast posterior sampling, diagonal estimation, matrix square roots, and many other standard operations. The proposed methods only depend on fast MVMs, and the structure necessary for fast MVMs often exists, or can be readily created. We have here made use of SKI [13] to create such structure, but other approaches, such as stochastic variational methods [10], could be used or combined with SKI for fast MVMs, as in [26]. Moreover, iterative MVM methods naturally harmonize with GPU acceleration, and are therefore likely to increase in their future applicability and popularity.
Finally, one could explore the ideas presented here for scalable higher-order derivatives, making use of Hessian methods for faster convergence.

References

[1] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2-3):123-286, 2012.
[2] David J. C. MacKay. Bayesian methods for adaptive models. PhD thesis, California Institute of Technology, 1992.
[3] David J. C. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
[4] Håvard Rue and Leonhard Held. Gaussian Markov Random Fields: Theory and Applications. CRC Press, 2005.
[5] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. The MIT Press, 2006.
[6] Christos Boutsidis, Petros Drineas, Prabhanjan Kambadur, Eugenia-Maria Kontopoulou, and Anastasios Zouzias. A randomized algorithm for approximating the log determinant of a symmetric positive definite matrix. arXiv preprint arXiv:1503.00374, 2015.
[7] Carl Edward Rasmussen and Zoubin Ghahramani. Occam's razor. In Neural Information Processing Systems (NIPS), 2001.
[8] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939-1959, 2005.
[9] Q. Le, T. Sarlos, and A. Smola. Fastfood: computing Hilbert space expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, pages 244-252, 2013.
[10] J. Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence (UAI). AUAI Press, 2013.
[11] Andrew Gordon Wilson. Covariance kernels for fast automatic pattern discovery and extrapolation with Gaussian processes.
PhD thesis, University of Cambridge, 2014.
[12] Andrew Gordon Wilson, Elad Gilboa, Arye Nehorai, and John P. Cunningham. Fast kernel learning for multidimensional pattern extrapolation. In Advances in Neural Information Processing Systems, pages 3626-3634, 2014.
[13] Andrew Gordon Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning (ICML), 2015.
[14] William Herlands, Andrew Wilson, Hannes Nickisch, Seth Flaxman, Daniel Neill, Wilbert Van Panhuis, and Eric Xing. Scalable Gaussian processes for characterizing multidimensional change surfaces. In Artificial Intelligence and Statistics, 2016.
[15] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems (NIPS), volume 18, page 1257. MIT Press, 2006.
[16] M. Fiedler. Hankel and Loewner matrices. Linear Algebra and Its Applications, 58:75-95, 1984.
[17] Hermann Weyl. Das asymptotische Verteilungsgesetz der Eigenwerte linearer partieller Differentialgleichungen (mit einer Anwendung auf die Theorie der Hohlraumstrahlung). Mathematische Annalen, 71(4):441-479, 1912.
[18] Seth Flaxman, Andrew Wilson, Daniel Neill, Hannes Nickisch, and Alex Smola. Fast Kronecker inference in Gaussian processes with non-Gaussian likelihoods. In International Conference on Machine Learning, pages 607-616, 2015.
[19] Insu Han, Dmitry Malioutov, and Jinwoo Shin. Large-scale log-determinant computation through stochastic Chebyshev expansions. In ICML, pages 908-917, 2015.
[20] Shashanka Ubaru, Jie Chen, and Yousef Saad. Fast estimation of tr(F(A)) via stochastic Lanczos quadrature.
[21] Zhaojun Bai, Mark Fahey, Gene H. Golub, M. Menon, and E. Richter. Computing partial eigenvalue sums in electronic structure calculations. Technical
Report SCCM-98-03, Stanford University, 1998.
[22] D. MacKay and M. N. Gibbs. Efficient implementation of Gaussian processes. Neural Computation, 1997.
[23] Michael L. Stein, Jie Chen, Mihai Anitescu, et al. Stochastic approximation of score functions for Gaussian processes. The Annals of Applied Statistics, 7(2):1162-1191, 2013.
[24] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, pages 370-378, 2016.
[25] Carl Edward Rasmussen and Hannes Nickisch. Gaussian processes for machine learning (GPML) toolbox. Journal of Machine Learning Research (JMLR), 11:3011-3015, Nov 2010.
[26] Andrew G. Wilson, Zhiting Hu, Ruslan R. Salakhutdinov, and Eric P. Xing. Stochastic variational deep kernel learning. In Advances in Neural Information Processing Systems, pages 2586-2594, 2016.
[27] Bernhard W. Silverman. Some aspects of the spline smoothing approach to non-parametric regression curve fitting. Journal of the Royal Statistical Society, Series B (Methodological), pages 1-52, 1985.
[28] Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Christopher K. I. Williams. Approximation methods for Gaussian process regression. Large-Scale Kernel Machines, pages 203-223, 2007.
[29] Nicholas J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008.
[30] Michael F. Hutchinson. A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines. Communications in Statistics - Simulation and Computation, 19(2):433-450, 1990.
[31] Amparo Gil, Javier Segura, and Nico Temme. Numerical Methods for Special Functions. SIAM, 2007.
[32] Gene Golub and Gérard Meurant. Matrices, Moments and Quadrature with Applications. Princeton University Press, 2010.
[33] Jane K. Cullum and Ralph A. Willoughby.
Lanczos Algorithms for Large Symmetric Eigenvalue Computations: Vol. I: Theory. SIAM, 2002.
[34] Youcef Saad. Numerical Methods for Large Eigenvalue Problems. Manchester University Press, 1992.
[35] Martin Dietrich Buhmann. Radial basis functions. Acta Numerica 2000, 9:1-38, 2000.
[36] Gregory E. Fasshauer. Meshfree Approximation Methods with MATLAB, volume 6. World Scientific, 2007.
[37] Robert Schaback and Holger Wendland. Kernel techniques: from machine learning to meshless methods. Acta Numerica, 15:543-639, 2006.
[38] Holger Wendland. Scattered Data Approximation, volume 17. Cambridge University Press, 2004.
[39] Haim Avron and Sivan Toledo. Randomized algorithms for estimating the trace of an implicit symmetric positive semi-definite matrix. Journal of the ACM, 58(2):8:1-8:34, 2011. doi: 10.1145/1944345.1944349. URL http://dx.doi.org/10.1145/1944345.1944349.