{"title": "Variational Inference for Gaussian Process Models with Linear Complexity", "book": "Advances in Neural Information Processing Systems", "page_first": 5184, "page_last": 5194, "abstract": "Large-scale Gaussian process inference has long faced practical challenges due to time and space complexity that is superlinear in dataset size. While sparse variational Gaussian process models are capable of learning from large-scale data, standard strategies for sparsifying the model can prevent the approximation of complex functions. In this work, we propose a novel variational Gaussian process model that decouples the representation of mean and covariance functions in reproducing kernel Hilbert space. We show that this new parametrization generalizes previous models. Furthermore, it yields a variational inference problem that can be solved by stochastic gradient ascent with time and space complexity that is only linear in the number of mean function parameters, regardless of the choice of kernels, likelihoods, and inducing points. This strategy makes the adoption of large-scale expressive Gaussian process models possible. We run several experiments on regression tasks and show that this decoupled approach greatly outperforms previous sparse variational Gaussian process inference procedures.", "full_text": "Variational Inference for Gaussian Process Models\n\nwith Linear Complexity\n\nChing-An Cheng\n\nByron Boots\n\nInstitute for Robotics and Intelligent Machines\n\nInstitute for Robotics and Intelligent Machines\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\ncacheng@gatech.edu\n\nGeorgia Institute of Technology\n\nAtlanta, GA 30332\n\nbboots@cc.gatech.edu\n\nAbstract\n\nLarge-scale Gaussian process inference has long faced practical challenges due\nto time and space complexity that is superlinear in dataset size. 
While sparse variational Gaussian process models are capable of learning from large-scale data, standard strategies for sparsifying the model can prevent the approximation of complex functions. In this work, we propose a novel variational Gaussian process model that decouples the representation of mean and covariance functions in reproducing kernel Hilbert space. We show that this new parametrization generalizes previous models. Furthermore, it yields a variational inference problem that can be solved by stochastic gradient ascent with time and space complexity that is only linear in the number of mean function parameters, regardless of the choice of kernels, likelihoods, and inducing points. This strategy makes the adoption of large-scale expressive Gaussian process models possible. We run several experiments on regression tasks and show that this decoupled approach greatly outperforms previous sparse variational Gaussian process inference procedures.

1 Introduction

Gaussian process (GP) inference is a popular nonparametric framework for reasoning about functions under uncertainty. However, the expressiveness of GPs comes at a price: solving (approximate) inference for a GP with N data instances has time and space complexity in Θ(N³) and Θ(N²), respectively. Therefore, GPs have traditionally been viewed as a tool for problems with small- or medium-sized datasets.

Recently, the concept of inducing points has been used to scale GPs to larger datasets. The idea is to summarize a full GP model with statistics on a sparse set of M ≪ N fictitious observations [18, 24]. By representing a GP with these inducing points, the time and space complexities are reduced to O(NMβ² + M³) and O(NM + M²), respectively.
To further process datasets that are too large to fit into memory, stochastic approximations have been proposed for regression [10] and classification [11]. These methods have similar complexity bounds, but with N replaced by the size of a mini-batch Nm.

Despite the success of sparse models, the scalability issues of GP inference are far from resolved. The major obstruction is that the cubic complexity in M in the aforementioned upper bound is also a lower bound, which results from the inversion of an M-by-M covariance matrix defined on the inducing points. As a consequence, these models can only afford a small set of M basis functions, limiting the expressiveness of GPs for prediction.

In this work, we show that superlinear complexity is not completely necessary. Inspired by the reproducing kernel Hilbert space (RKHS) representation of GPs [2], we propose a generalized variational GP model, called DGPs (Decoupled Gaussian Processes), which decouples the bases

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

          a, B    α, β    θ      α = β    N ≠ M    Time                      Space
SVDGP     SGA     SGA     SGA    FALSE    TRUE     O(DNMα + NMβ² + Mβ³)     O(NMα + Mβ²)
SVI       SNGA    SGA     SGA    TRUE     TRUE     O(DNM + NM² + M³)        O(NM + M²)
iVSGPR    SMA     SMA     SGA    TRUE     TRUE     O(DNM + NM² + M³)        O(NM + M²)
VSGPR     CG      CG      CG     TRUE     TRUE     O(DNM + NM² + M³)        O(NM + M²)
GPR       CG      CG      CG     TRUE     FALSE    O(DN² + N³)              O(N²)

Table 1: Comparison between SVDGP and variational GPR algorithms: SVI [10], iVSGPR [2], VSGPR [24], and GPR [19], where N is the number of observations/the size of a mini-batch, M, Mα, Mβ are the numbers of basis functions, and D is the input dimension.
Here it is assumed Mα ≥ Mβ.¹

(a) M = 10  (b) Mα = 100, Mβ = 10  (c) M = 100
Figure 1: Comparison between models with shared and decoupled basis. (a)(c) denote the models with shared basis of size M. (b) denotes the model of decoupled basis with size (Mα, Mβ). In each figure, the red line denotes the ground truth; the blue circles denote the observations; the black line and the gray area denote the mean and variance in prediction, respectively.

for the mean and the covariance functions. Specifically, let Mα and Mβ be the numbers of basis functions used to model the mean and the covariance functions, respectively. Assume Mα ≥ Mβ. We show that, when DGPs are used as a variational posterior [24], the associated variational inference problem can be solved by stochastic gradient ascent with space complexity O(NmMα + Mβ²) and time complexity O(DNmMα + NmMβ² + Mβ³), where D is the input dimension. We name this algorithm SVDGP. As a result, we can choose Mα ≫ Mβ, which allows us to keep the time and space complexity similar to previous methods (by choosing Mβ = M) while greatly increasing accuracy. To the best of our knowledge, this is the first variational GP algorithm that admits linear complexity in Mα, without any assumption on the choice of kernel and likelihood.

While we design SVDGP for general likelihoods, in this paper we study its effectiveness on Gaussian process regression (GPR) tasks. We consider this to be without loss of generality, as most sparse variational GPR algorithms in the literature can be modified to handle general likelihoods by introducing additional approximations (e.g. in Hensman et al. [11] and Sheth et al. [22]).
Our experimental results show that SVDGP significantly outperforms the existing techniques, achieving higher variational lower bounds and lower prediction errors when evaluated on held-out test sets.

1.1 Related Work

Our framework is based on the variational inference problem proposed by Titsias [24], which treats the inducing points as variational parameters to allow direct approximation of the true posterior. This is in contrast to Seeger et al. [21], Snelson and Ghahramani [23], Quiñonero-Candela and Rasmussen [18], and Lázaro-Gredilla et al. [15], which all use inducing points as hyper-parameters of a degenerate prior. While both approaches have the same time and space complexity, the latter additionally introduces a large set of unregularized hyper-parameters and, therefore, is more likely to suffer from over-fitting [1].

In Table 1, we compare SVDGP with recent GPR algorithms in terms of the assumptions made and the time and space complexity. Each algorithm can be viewed as a special way to solve the maximization of the variational lower bound (5), presented in Section 3.2. Our algorithm SVDGP generalizes the previous approaches by allowing the basis functions for the mean and the covariance to be decoupled, so an approximate solution can be found by stochastic gradient ascent in linear complexity.

¹The first three columns show the algorithms used to update the parameters: SGA/SNGA/SMA denotes stochastic gradient/natural gradient/mirror ascent, and CG denotes batch nonlinear conjugate gradient ascent. The 4th and the 5th columns indicate whether the bases for the mean and covariance are strictly shared, and whether a variational posterior can be used. The last two columns list the time and space complexity.

To illustrate the idea, we consider a toy GPR example in Figure 1. The dataset contains 500 noisy observations of a sinc function.
Given the same training data, we conduct experiments with three different GP models. Figures 1(a) and 1(c) show the results with the traditional coupled basis, which can be produced by any of the variational algorithms listed in Table 1, and Figure 1(b) shows the result of the decoupled approach SVDGP. The sizes of the basis and the observation set are selected to emulate a large-dataset scenario. We can observe that SVDGP achieves a nice trade-off between prediction performance and complexity: it achieves almost the same prediction accuracy as the full-scale model in Figure 1(c) and preserves the overall shape of the predictive variance.

In addition to the sparse algorithms above, some recent attempts aim to revive the nonparametric property of GPs through structured covariance functions. For example, Wilson and Nickisch [27] propose to place the inducing points on a multidimensional lattice, so that the time and space complexities of using a product kernel become O(N + DM^(1+1/D)) and O(N + DM^(1+2/D)), respectively. However, because M = c^D, where c is the number of grid points per dimension, the overall complexity is exponential in D and infeasible for high-dimensional data. Another interesting approach by Hensman et al. [12] combines variational inference [24] with a sparse spectral approximation [15]. By equally spacing inducing points on the spectrum, they show that the covariance matrix on the inducing points has diagonal-plus-low-rank structure. With MCMC, the algorithm can achieve complexity O(DNM). However, the proposed structure in [12] does not help to reduce the complexity when an approximate Gaussian posterior is favored or when the kernel hyper-parameters need to be updated.

Other kernel methods with linear complexity have been proposed using functional gradient descent [14, 5]. However, because these methods use a model strictly the same size as the entire dataset, they fail to estimate the predictive covariance, which requires Ω(N²) space complexity.
Moreover, they cannot learn hyper-parameters online. The latter drawback also applies to greedy algorithms based on rank-one updates, e.g. the algorithm of Csató and Opper [4]. In contrast to these previous methods, our algorithm applies to all choices of inducing points, likelihoods, and kernels, and we allow both the variational parameters and the hyper-parameters to adapt online as more data are encountered.

2 Preliminaries

In this section, we briefly review inference for GPs and the variational framework proposed by Titsias [24]. For now, we focus on GPR for simplicity of exposition. We discuss the case of general likelihoods in the next section, when we introduce our framework, DGPs.

2.1 Inference for GPs

Let f : X → R be a latent function defined on a compact domain X ⊂ R^D. Here we assume a priori that f is distributed according to a Gaussian process GP(m, k). That is, ∀x, x′ ∈ X, E[f(x)] = m(x) and C[f(x), f(x′)] = k(x, x′). In short, we write f ∼ GP(m, k).

A GP probabilistic model is composed of a likelihood p(y|f(x)) and a GP prior GP(m, k); in GPR, the likelihood is assumed to be Gaussian, i.e. p(y|f(x)) = N(y|f(x), σ²) with variance σ². Usually, the likelihood and the GP prior are parameterized by some hyper-parameters, which we summarize as θ. This includes, for example, the variance σ² and the parameters implicitly involved in defining k(x, x′). For notational convenience, and without loss of generality, we assume m(x) = 0 in the prior distribution and omit explicitly writing the dependence of distributions on θ.

Assume we are given a dataset D = {(x_n, y_n)}_{n=1}^N, in which x_n ∈ X and y_n ∼ p(y|f(x_n)). Let² X = {x_n}_{n=1}^N and y = (y_n)_{n=1}^N. Inference for GPs involves solving for the posterior p_{θ*}(f(x)|y) for any new input x ∈ X, where θ* = arg max_θ log p_θ(y).
For example, in GPR, because the likelihood is Gaussian, the predictive posterior is also Gaussian, with mean and covariance

m|y(x) = k_{x,X}(K_X + σ²I)⁻¹ y,   k|y(x, x′) = k_{x,x′} − k_{x,X}(K_X + σ²I)⁻¹ k_{X,x′},   (1)

and the hyper-parameter θ* can be found by nonlinear conjugate gradient ascent [19]:

max_θ log p_θ(y) = max_θ log N(y|0, K_X + σ²I),   (2)

where k_{·,·}, k_{·,·}, and K_{·,·} denote the covariances between the sets in the subscript.³ One can show that these two functions, m|y(x) and k|y(x, x′), define a valid GP. Therefore, given observations y, we say f ∼ GP(m|y, k|y).

Although theoretically GPs are nonparametric and can model any function as N → ∞, in practice this is difficult. As inference has time complexity Ω(N³) and space complexity Ω(N²), applying vanilla GPs to large datasets is infeasible.

²In notation, we use boldface to distinguish finite-dimensional vectors (lower-case) and matrices (upper-case) that are used in computation from scalar and abstract mathematical objects.

2.2 Variational Inference with Sparse GPs

To scale GPs to large datasets, Titsias [24] introduced a scheme to compactly approximate the true posterior with a sparse GP, GP(m̂|y, k̂|y), defined by the statistics of M ≪ N function values {L_m f(x̃_m)}_{m=1}^M, where L_m is a bounded linear operator⁴ and x̃_m ∈ X. L_m f(·) is called an inducing function and x̃_m an inducing point. Common choices of L_m include the identity map (as used originally by Titsias [24]) and integrals to achieve better approximation or to consider multi-domain information [26, 7, 3].
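For reference, the exact GPR posterior in (1) reduces to a few lines of a standard Cholesky-based computation (a generic textbook sketch with our own kernel and variable names); the O(N³) factorization below is exactly the bottleneck that the sparse methods of this section avoid.

```python
import numpy as np

def rbf(A, B, ell=0.2):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def gpr_posterior(X, y, x_star, sigma2=1e-4):
    # Exact GPR predictive mean and covariance, eq. (1):
    # O(N^3) time for the Cholesky factorization, O(N^2) memory.
    K = rbf(X, X) + sigma2 * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = rbf(x_star, X)
    mean = Ks @ alpha
    V = np.linalg.solve(L, Ks.T)
    cov = rbf(x_star, x_star) - V.T @ V
    return mean, cov

# Dense, nearly noise-free observations of sin(2*pi*x) on [0, 1].
X = np.linspace(0.0, 1.0, 50)[:, None]
y = np.sin(2 * np.pi * X[:, 0])
mean, cov = gpr_posterior(X, y, np.array([[0.25]]))
```

With 50 well-spread training points the posterior mean at x = 0.25 is close to sin(π/2) = 1 and the posterior variance is small, illustrating the shrinkage toward the data described above.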
Intuitively, we can think of {L_m f(x̃_m)}_{m=1}^M as a set of potentially indirect observations that capture salient information about the unknown function f.

Titsias [24] solves for GP(m̂|y, k̂|y) by variational inference. Let X̃ = {x̃_m}_{m=1}^M, and let f_X ∈ R^N and f_X̃ ∈ R^M be the (inducing) function values defined on X and X̃, respectively. Let p(f_X̃) be the prior given by GP(m, k) and define q(f_X̃) = N(f_X̃|m̃, S̃) to be its variational posterior, where m̃ ∈ R^M and S̃ ∈ R^{M×M} are the mean and the covariance of the approximate posterior of f_X̃. Titsias [24] proposes to use q(f_X, f_X̃) = p(f_X|f_X̃) q(f_X̃) as the variational posterior to approximate p(f_X, f_X̃|y) and to solve for q(f_X̃) together with the hyper-parameter θ through

max_{θ, X̃, m̃, S̃} L_θ(X̃, m̃, S̃) = max_{θ, X̃, m̃, S̃} ∫ q(f_X, f_X̃) log [p(y|f_X) p(f_X|f_X̃) p(f_X̃) / q(f_X, f_X̃)] df_X df_X̃,   (3)

where L_θ is a variational lower bound of log p_θ(y), p(f_X|f_X̃) = N(f_X | K_{X,X̃} K_X̃⁻¹ f_X̃, K_X − K̂_X) is the conditional probability given by GP(m, k), and K̂_X = K_{X,X̃} K_X̃⁻¹ K_{X̃,X}.

At first glance, the specific choice of variational posterior q(f_X, f_X̃) may seem heuristic. However, although finitely parameterized, it resembles a full-fledged GP, GP(m̂|y, k̂|y):

m̂|y(x) = k_{x,X̃} K_X̃⁻¹ m̃,   k̂|y(x, x′) = k_{x,x′} + k_{x,X̃} K_X̃⁻¹ (S̃ − K_X̃) K_X̃⁻¹ k_{X̃,x′}.   (4)

This result is further studied in Matthews et al.
[16] and Cheng and Boots [2], where it is shown that (3) is indeed minimizing a proper KL-divergence between Gaussian processes/measures.

By comparing (2) and (3), one can show that the time and space complexities now reduce to O(DNM + M²N + M³) and O(M² + MN), respectively, due to the low-rank structure of K̂_X̃ [24]. To further reduce complexity, stochastic optimization, such as stochastic natural gradient ascent [10] or stochastic mirror descent [2], can be applied. In this case, N in the above asymptotic bounds is replaced by the size of a mini-batch Nm. The above results can be modified to consider general likelihoods as in [22, 11].

3 Variational Inference with Decoupled Gaussian Processes

Despite the success of sparse GPs, the scalability issues of GPs persist. Although parameterizing a GP with inducing points/functions enables learning from large datasets, it also restricts the expressiveness of the model. As the time and space complexities still scale in Ω(M³) and Ω(M²), we cannot learn or use a complex model with large M.

In this work, we show that these two complexity bounds, which have long accompanied GP models, are not strictly necessary, but are due to the tangled representation canonically used in the GP literature. To elucidate this, we adopt the dual representation of Cheng and Boots [2], which treats GPs as linear operators in an RKHS. But, unlike Cheng and Boots [2], we show how to decouple the basis representation of the mean and covariance functions of a GP and derive a new variational problem, which can be viewed as a generalization of (3).

³If the two sets are the same, only one is listed.
⁴Here we use the notation L_m f loosely for compactness of writing. Rigorously, L_m is a bounded linear operator acting on m and k, not necessarily on all sample paths f.
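The sparse predictive equations (4) are simple to transcribe from (m̃, S̃) (a minimal illustration with our own kernel and names); a useful sanity check, used below, is that setting m̃ = 0 and S̃ = K_X̃ must recover the prior exactly.

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def sparse_predict(x_star, Z, m_tilde, S_tilde):
    # Predictive mean/covariance of the variational posterior, eq. (4).
    Kzz = rbf(Z, Z) + 1e-10 * np.eye(len(Z))
    A = np.linalg.solve(Kzz, rbf(Z, x_star)).T       # k_{x,Z} Kzz^{-1}
    mean = A @ m_tilde
    cov = rbf(x_star, x_star) + A @ (S_tilde - Kzz) @ A.T
    return mean, cov

# Prior-recovery check: m_tilde = 0 and S_tilde = Kzz give back the prior.
Z = np.linspace(-1.0, 1.0, 5)[:, None]
x_star = np.array([[0.3], [-0.2]])
Kzz = rbf(Z, Z) + 1e-10 * np.eye(5)
mean, cov = sparse_predict(x_star, Z, np.zeros(5), Kzz)
```

This non-degeneracy (the approximation never collapses below the prior) is the property of the Titsias framework that SVDGP inherits in Section 3.4.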
We show that this problem—with arbitrary likelihoods and kernels—can be solved by stochastic gradient ascent with linear complexity in Mα, the number of parameters used to specify the mean function for prediction.

In the following, we first review the results in [2]. We next introduce the decoupled representation, DGPs, and its variational inference problem. Finally, we present SVDGP and discuss the case of general likelihoods.

3.1 Gaussian Processes as Gaussian Measures

Let an RKHS H be a Hilbert space of functions with the reproducing property: ∀x ∈ X, ∃φ_x ∈ H such that ∀f ∈ H, f(x) = φ_x^T f.⁵ A Gaussian process GP(m, k) is equivalent to a Gaussian measure ν on a Banach space B which possesses an RKHS H [2]:⁶ there is a mean functional µ ∈ H and a bounded positive semi-definite linear operator Σ : H → H such that, for any x, x′ ∈ X, ∃φ_x, φ_{x′} ∈ H, we can write m(x) = φ_x^T µ and k(x, x′) = φ_x^T Σ φ_{x′}. The triple (B, ν, H) is known as an abstract Wiener space [9, 6], in which H is also called the Cameron-Martin space. Here the restriction that µ, Σ are RKHS objects is necessary, so that the variational inference problem in the next section is well-defined.

We call this the dual representation of a GP in the RKHS H (the mean function m and the covariance function k are realized as linear operators µ and Σ defined on H). With abuse of notation, we write N(f|µ, Σ) in short. This notation does not mean that a GP has a Gaussian distribution in H, nor does it imply that the sample paths of GP(m, k) are necessarily in H. Precisely, B contains the sample paths of GP(m, k), and H is dense in B. In most applications of GP models, B is the Banach space of continuous functions C(X; Y) and H is the span of the covariance function.
As a special case, if H is finite-dimensional, B and H coincide and ν becomes equivalent to a Gaussian distribution in a Euclidean space.

In relation to our previous notation in Section 2.1: suppose k(x, x′) = φ_x^T φ_{x′} and φ_x : X → H is a feature map to some Hilbert space H. Then we have assumed a priori that GP(m, k) = N(f|0, I) is a normal Gaussian measure; that is, GP(m, k) samples functions f of the form f(x) = Σ_{l=1}^{dim H} φ_l(x)^T ε_l, where ε_l ∼ N(0, 1) are independent. Note that if dim H = ∞, with probability one f is not in H, but fortunately H is large enough for us to approximate the sampled functions. In particular, it can be shown that the posterior GP(m|y, k|y) in GPR has a dual RKHS representation in the same RKHS as the prior GP [2].

3.2 Variational Inference in Gaussian Measures

Cheng and Boots [2] propose a dual formulation of (3) in terms of Gaussian measures⁷:

max_{q(f),θ} L_θ(q(f)) = max_{q(f),θ} ∫ q(f) log [p_θ(y|f) p(f) / q(f)] df = max_{q(f),θ} E_q[log p_θ(y|f)] − KL[q||p],   (5)

where q(f) = N(f|µ̃, Σ̃) is a variational Gaussian measure and p(f) = N(f|0, I) is the normal prior. Its connection to the inducing points/functions in (3) can be summarized as follows [2, 3]: Define a linear operator Ψ_X̃ : R^M → H as a ↦ Σ_{m=1}^M a_m ψ_{x̃_m}, where ψ_{x̃_m} ∈ H is defined such that ψ_{x̃_m}^T µ = E[L_m f(x̃_m)]. Then (3) and (5) are equivalent if q(f) has a subspace parametrization,

µ̃ = Ψ_X̃ a,   Σ̃ = I + Ψ_X̃ A Ψ_X̃^T,   (6)

with a ∈ R^M and A ∈ R^{M×M} satisfying m̃ = K_X̃ a and S̃ = K_X̃ + K_X̃ A K_X̃. In other words, the variational inference algorithms in the literature all use a variational Gaussian measure in which µ̃ and Σ̃ are parametrized by the same basis {ψ_{x̃_m} | x̃_m ∈ X̃}_{m=1}^M.

⁵To simplify the notation, we write φ_x^T f for ⟨f, φ_x⟩_H, and f^T L g for ⟨f, Lg⟩_H, where f, g ∈ H and L : H → H, even if H is infinite-dimensional.
⁶Such an H w.l.o.g. can be identified as the natural RKHS of the covariance function of a zero-mean prior GP.
⁷We assume q(f) is absolutely continuous w.r.t. p(f), which is true as p(f) is non-degenerate. The integral denotes the expectation of log p_θ(y|f) + log [p(f)/q(f)] over q(f), and q(f)/p(f) denotes the Radon-Nikodym derivative.

Compared with (3), the formulation in (5) is neater: it follows the definition of the very basic variational inference problem. This is not surprising, since GPs can be viewed as Bayesian linear models in an infinite-dimensional space. Moreover, in (5) all hyper-parameters are isolated in the likelihood p_θ(y|f), because the prior is fixed as a normal Gaussian measure.

3.3 Disentangling the GP Representation with DGPs

While Cheng and Boots [2] treat (5) as an equivalent form of (3), here we show that it is a generalization. By further inspecting (5), it is apparent that sharing the basis Ψ_X̃ between µ̃ and Σ̃ in (6) is not strictly necessary, since (5) seeks to optimize two linear operators, µ̃ and Σ̃. With this in mind, we propose a new parametrization that decouples the bases for µ̃ and Σ̃:

µ̃ = Ψ_α a,   Σ̃ = (I + Ψ_β B Ψ_β^T)⁻¹,   (7)

where Ψ_α : R^{Mα} → H and Ψ_β : R^{Mβ} → H denote linear operators defined similarly to Ψ_X̃, and B ⪰ 0 ∈ R^{Mβ×Mβ}. Compared with (6), here we parametrize Σ̃ through its inversion with B, so the condition Σ̃ ⪰ 0 can be easily realized as B ⪰ 0. This form agrees with the posterior covariance in GPR [2] and gives a posterior that is strictly less uncertain than the prior. Note that the choice of decoupled parametrization is not unique. In particular, the bases can be partially shared, or (a, B) can be further parametrized (e.g. B can be parametrized using the canonical form in (4)) to improve the numerical convergence rate. Please refer to Appendix A for a discussion.⁸

The decoupled subspace parametrization (7) corresponds to a DGP, GP(m̂_α|y, k̂_β|y), with mean and covariance functions⁹

m̂_α|y(x) = k_{x,α} a,   k̂_β|y(x, x′) = k_{x,x′} − k_{x,β}(B⁻¹ + K_β)⁻¹ k_{β,x′}.   (8)

While the structure of (8) looks similar to (4), directly replacing the basis X̃ in (4) with α and β is not trivial. Because the equations in (4) are derived from the traditional viewpoint of GPs as statistics on function values, the original optimization problem (3) is not defined if α ≠ β, and therefore it is not clear how to learn a decoupled representation traditionally.
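Equation (8) transcribes directly once the two bases are separate (an illustration of ours: we use an explicit matrix inverse for readability, whereas footnote 9 gives the stable Cholesky route; kernel, bases, and names are hypothetical):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def dgp_predict(x_star, alpha, a, beta, B):
    # DGP predictive mean/covariance, eq. (8): the mean uses the (large)
    # basis alpha, the covariance only the (small) basis beta.
    mean = rbf(x_star, alpha) @ a
    Kb = rbf(beta, beta)
    Ksb = rbf(x_star, beta)
    Minv = np.linalg.inv(np.linalg.inv(B) + Kb)      # (B^{-1} + K_beta)^{-1}
    cov = rbf(x_star, x_star) - Ksb @ Minv @ Ksb.T
    return mean, cov

alpha = np.linspace(-2.0, 2.0, 40)[:, None]          # M_alpha = 40
beta = np.linspace(-2.0, 2.0, 5)[:, None]            # M_beta = 5 << M_alpha
a = np.zeros(40)
x_star = np.array([[0.1]])
prior_var = rbf(x_star, x_star)[0, 0]

# As B -> 0 the predictive covariance falls back to the prior, never below it.
_, cov = dgp_predict(x_star, alpha, a, beta, 1e-12 * np.eye(5))
```

Note how the covariance cost is governed entirely by Mβ, while the mean is free to use a much larger basis; this is the asymmetry exploited by SVDGP.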
Conversely, by using the dual RKHS representation, the objective function for learning (8) follows naturally from (5), as we will show next.

3.4 SVDGP: Algorithm and Analysis

Substituting the decoupled subspace parametrization (7) into the variational inference problem (5) results in a numerical optimization problem, max_{q(f),θ} E_q[log p_θ(y|f)] − KL[q||p], with

KL[q||p] = (1/2) a^T K_α a + (1/2) log|I + K_β B| − (1/2) tr(K_β (B⁻¹ + K_β)⁻¹),   (9)

E_q[log p_θ(y|f)] = Σ_{n=1}^N E_{q(f(x_n))}[log p_θ(y_n|f(x_n))],   (10)

where each expectation is over a scalar Gaussian q(f(x_n)) given by (8) as a function of (a, α) and (B, β). Our objective function contains [11] as a special case, which assumes α = β = X̃. In addition, we note that Hensman et al. [11] indirectly parametrize the posterior by m̃ and S̃ = LL^T, whereas we parametrize directly by (6) with a for scalability and B = LL^T for better stability (which always reduces the uncertainty of the posterior compared with the prior).

We notice that (a, α) and (B, β) are completely decoupled in (9) and potentially combined again in (10). In particular, if p_θ(y_n|f(x_n)) is Gaussian as in GPR, we have an additional decoupling, i.e. L_θ(a, B, α, β) = F_θ(a, α) + G_θ(B, β) for some F_θ(a, α) and G_θ(B, β). Intuitively, the optimization

⁸Appendix A is partially based on a discussion with Hugh Salimbeni at the NIPS conference. Here we adopt the fully decoupled, directly parametrized form in (7) to demonstrate the idea. We leave the full comparison of different decoupled parametrizations to future work.
⁹In practice, we can parametrize B = LL^T with a Cholesky factor L ∈ R^{Mβ×Mβ} so that the problem is unconstrained.
The required terms in (8) and later in (9) can be stably computed as (B⁻¹ + K_β)⁻¹ = L H⁻¹ L^T and log|I + K_β B| = log|H|, where H = I + L^T K_β L.

Algorithm 1 Online Learning with DGPs
Parameters: Mα, Mβ, Nm, N∆
Input: M(a, B, α, β, θ), D
1: θ₀ ← initializeHyperparameters( sampleMinibatch(D, Nm) )
2: for t = 1 . . . T do
3:   Dt ← sampleMinibatch(D, Nm)
4:   M.addBasis(Dt, N∆, Mα, Mβ)
5:   M.updateModel(Dt, t)
6: end for

over (a, α) aims to minimize the fitting error, and the optimization over (B, β) aims to memorize the samples encountered so far; the mean and covariance functions only interact indirectly through the optimization of the hyper-parameter θ.

One salient feature of SVDGP is that it tends to overestimate, rather than underestimate, the variance when we select Mβ ≤ Mα. This is inherited from the non-degeneracy property of the variational framework [24] and can be seen in the toy example in Figure 1. In the extreme case when Mβ = 0, the covariance in (8) becomes the same as the prior; moreover, the objective function of SVDGP becomes similar to that of kernel methods (exactly the same as kernel ridge regression when the likelihood is Gaussian). The additional inclusion of expected log-likelihoods allows SVDGP to learn the hyper-parameters in a unified framework, as its objective function can be viewed as minimizing a generalization upper bound in PAC-Bayes learning [8].

SVDGP solves the above optimization problem by stochastic gradient ascent. Here we purposefully ignore the specific details of p_θ(y|f) to emphasize that SVDGP can be applied to general likelihoods, as it only requires unbiased first-order information, which e.g. can be found in [22].
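The identities in footnote 9 are easy to verify numerically (a self-contained check of ours with random matrices; the point is that the stable route never inverts B):

```python
import numpy as np

rng = np.random.default_rng(1)
Mb = 5

# A PSD K_beta and a Cholesky-parametrized B = L L^T.
A = rng.standard_normal((Mb, Mb))
Kb = A @ A.T + 1e-6 * np.eye(Mb)
L = np.tril(rng.standard_normal((Mb, Mb)))
np.fill_diagonal(L, np.abs(np.diag(L)) + 0.5)    # keep L well-conditioned
B = L @ L.T

# Stable route: H = I + L^T K_beta L; no inversion of B is required.
H = np.eye(Mb) + L.T @ Kb @ L
stable_inv = L @ np.linalg.solve(H, L.T)          # = (B^{-1} + K_beta)^{-1}
stable_logdet = np.linalg.slogdet(H)[1]           # = log|I + K_beta B|

# Naive route, for comparison only: explicitly inverts B.
naive_inv = np.linalg.inv(np.linalg.inv(B) + Kb)
naive_logdet = np.log(np.linalg.det(np.eye(Mb) + Kb @ B))
```

The determinant identity is Sylvester's |I + K_β L L^T| = |I + L^T K_β L|, which is why only the Mβ-by-Mβ matrix H ever needs factoring.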
In addition to having a more adaptive representation, the main benefit of SVDGP is that the computation of an unbiased gradient requires only linear complexity in Mα, as shown below (see Appendix B for details).

KL-Divergence: Assume |α| = O(DMα) and |β| = O(DMβ). By (9), one can show that ∇_a KL[q||p] = K_α a and ∇_B KL[q||p] = (1/2)(I + K_β B)⁻¹ K_β B K_β (I + B K_β)⁻¹. Therefore, the time complexity to compute ∇_a KL[q||p] can be reduced to O(NmMα) if we sample over the columns of K_α with a mini-batch of size Nm. By contrast, the time complexity to compute ∇_B KL[q||p] will always be Θ(Mβ³) and cannot be further reduced, regardless of the parametrization of B.¹⁰ The gradients with respect to α and β can be derived similarly and have time complexity O(DNmMα) and O(DMβ² + Mβ³), respectively.

Expected Log-Likelihood: Let m̂(a, α) ∈ R^N and ŝ(B, β) ∈ R^N be the vectors of means and covariances of the scalar Gaussians q(f(x_n)) for n ∈ {1, . . . , N}. As (10) is a sum over N terms, by sampling with a mini-batch of size Nm, an unbiased gradient of (10) with respect to (θ, m̂, ŝ) can be computed in O(Nm). To compute the full gradient with respect to (a, B, α, β), we compute the derivatives of m̂ and ŝ with respect to (a, B, α, β) and then apply the chain rule. These steps take O(DNmMα) and O(DNmMβ + NmMβ² + Mβ³) for (a, α) and (B, β), respectively.

The above analysis shows that the curse of dimensionality in GPs originates in the covariance function. For space complexity, the decoupled parametrization (7) requires memory in O(NmMα + Mβ²); for time complexity, an unbiased gradient with respect to (a, α) can be computed in O(DNmMα), but that with respect to (B, β) has time complexity Ω(DNmMβ + NmMβ²).
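The column-sampling estimator for ∇_a KL[q||p] = K_α a described above can be sketched as follows (our own illustration; a convenient check of unbiasedness is that averaging the estimator over a disjoint partition of the columns must recover the exact gradient):

```python
import numpy as np

def rbf(A, B, ell=1.0):
    # Squared-exponential kernel matrix between the rows of A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell ** 2)

def kl_grad_a_exact(alpha, a):
    # Full gradient K_alpha a: O(M_alpha^2) time.
    return rbf(alpha, alpha) @ a

def kl_grad_a_minibatch(alpha, a, idx):
    # Unbiased estimate from a column mini-batch: only an
    # M_alpha x Nm kernel block is formed, so O(Nm * M_alpha) time.
    M_alpha, Nm = len(a), len(idx)
    cols = rbf(alpha, alpha[idx])
    return (M_alpha / Nm) * cols @ a[idx]

rng = np.random.default_rng(2)
alpha = rng.standard_normal((8, 2))              # M_alpha = 8 basis points
a = rng.standard_normal(8)

exact = kl_grad_a_exact(alpha, a)
# Average over a disjoint partition of the columns (Nm = 2 per batch).
batches = [kl_grad_a_minibatch(alpha, a, np.arange(i, i + 2))
           for i in range(0, 8, 2)]
avg = np.mean(batches, axis=0)
```

Each batch rescales its partial sum by Mα/Nm, so the estimator's expectation under uniform column sampling is exactly K_α a.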
This motivates choosing Mβ = O(M) and Mα in O(Mβ²) or O(Mβ³), which maintains the same complexity as previous variational techniques but greatly improves the prediction performance.

4 Experimental Results

We compare our new algorithm, SVDGP, with the state-of-the-art incremental algorithms for sparse variational GPR, SVI [10] and iVSGPR [2], as well as the classical GPR and the batch algorithm VSGPR [24]. As discussed in Section 1.1, these methods can be viewed as different ways to optimize (5). Therefore, in addition to the normalized mean square error (nMSE) [19] in prediction, we report

¹⁰Due to Kβ, the complexity would remain O(Mβ³) even if B were constrained to be diagonal.

KUKA1 - Variational Lower Bound (10⁵)
        SVDGP    SVI      iVSGPR   VSGPR    GPR
mean    1.262    0.472    0.649    0.391    -5.335
std     0.195    0.265    0.201    0.076    7.777

KUKA1 - Prediction Error (nMSE)
        SVDGP    SVI      iVSGPR   VSGPR    GPR
mean    0.037    0.169    0.128    0.139    0.231
std     0.013    0.025    0.033    0.026    0.045

MUJOCO1 - Variational Lower Bound (10⁵)
        SVDGP    SVI      iVSGPR   VSGPR    GPR
mean    6.007    2.822    4.543    2.178    -10312.727
std     0.673    0.871    0.898    0.692    22679.778

MUJOCO1 - Prediction Error (nMSE)
        SVDGP    SVI      iVSGPR   VSGPR    GPR
mean    0.072    0.163    0.099    0.118    0.213
std     0.013    0.053    0.026    0.016    0.061

Table 2: Experimental results of KUKA1 and MUJOCO1 after 2,000 iterations.

the performance in the variational lower bound (VLB) (5), which also captures the quality of the predictive variance and of hyper-parameter learning.¹¹ These two metrics are evaluated on held-out test sets in all of our experimental domains.

Algorithm 1 summarizes the online learning procedure used by all stochastic algorithms,¹²
where each learner has to optimize all the parameters on-the-fly using i.i.d. data. The hyper-parameters are first initialized heuristically by the median trick using the first mini-batch. We incrementally build up the variational posterior by including N∆ ≤ Nm observations from each mini-batch as the initialization of new variational basis functions. Then all the hyper-parameters and the variational parameters are updated online. These steps are repeated for T iterations.

For all the algorithms, we assume the prior covariance is defined by the SE-ARD kernel [19], and we use the generalized SE-ARD kernel [2] as the inducing functions in the variational posterior (see Appendix C for details). We note that all algorithms in comparison use the same kernel and optimize both the variational parameters (including inducing points) and the hyper-parameters.

In particular, we implement SGA with ADAM [13] (with default parameters β1 = 0.9 and β2 = 0.999). The step size for each stochastic algorithm is scheduled according to γt = γ0(1 + 0.1√t)⁻¹, where γ0 ∈ {10⁻¹, 10⁻², 10⁻³} is selected manually for each algorithm to maximize the improvement in the objective function after the first 100 iterations. We test each stochastic algorithm for T = 2000 iterations with mini-batches of size Nm = 1024 and increment size N∆ = 128. Finally, the model sizes used in the experiments are as follows: Mα = 128² and Mβ = 128 for SVDGP; M = 1024 for SVI; M = 256 for iVSGPR; M = 1024 and N = 4096 for VSGPR; and N = 1024 for GPR. These settings share a similar order of time complexity in our current Matlab implementation.

4.1 Datasets

Inverse Dynamics of KUKA Robotic Arm This dataset records the inverse dynamics of a KUKA arm performing rhythmic motions at various speeds [17].
The original dataset consists of two parts, KUKA1 and KUKA2, each of which has 17,560 offline and 180,360 online data points with 28 attributes and 7 outputs. In the experiment, we mix the online and the offline data and then split them into 90% training data (178,128 instances) and 10% testing data (19,792 instances) to satisfy the i.i.d. assumption.

Walking MuJoCo MuJoCo (Multi-Joint dynamics with Contact) is a physics engine for research in robotics, graphics, and animation, created by [25]. In this experiment, we gather 1,000 walking trajectories by running TRPO [20]. In each time frame, the MuJoCo transition dynamics have a 23-dimensional input and a 17-dimensional output. We consider two regression problems to predict 9 of the 17 outputs from the input:13 MUJOCO1, which maps the input of the current frame (23 dimensions) to the output, and MUJOCO2, which maps the inputs of the current and the previous frames (46 dimensions) to the output. In each problem, we randomly select 90% of the data as training data (842,745 instances) and 10% as test data (93,608 instances).

11 The exact marginal likelihood is computationally infeasible to evaluate for our large model.
12 The algorithms differ only in whether the bases are shared and how the model is updated (see Table 1).
13 Because of the structure of the MuJoCo dynamics, the remaining 8 outputs can be trivially determined from the input.

Figure 2: An example of online learning results (the 9th output of the MUJOCO1 dataset): (a) sample complexity; (b) time complexity. The blue, red, and yellow lines denote SVDGP, SVI, and iVSGPR, respectively.

4.2 Results

We summarize part of the experimental results in Table 2, in terms of the prediction nMSE and the VLB. While each output is treated independently during learning, Table 2 presents the mean and the standard deviation over all the outputs, as the selected metrics are normalized.
For the complete experimental results, please refer to Appendix D.

We observe that SVDGP consistently outperforms the other approaches, achieving much higher VLBs and much lower prediction errors; SVDGP also has smaller standard deviations. These results validate our initial hypothesis that adopting a large set of basis functions for the mean can help when modeling complicated functions. iVSGPR achieves the next best results after SVDGP, despite using a basis of size 256, much smaller than the 1,024 bases used by SVI, VSGPR, and GPR. Similar to SVDGP, iVSGPR also generalizes better than the batch algorithms VSGPR and GPR, which only have access to a smaller set of training data and are more prone to over-fitting. By contrast, the performance of SVI is surprisingly worse than that of VSGPR. We conjecture this might be due to the fact that the hyper-parameters and the inducing points/functions are only crudely initialized in online learning. We additionally find that the stability of SVI is more sensitive to the choice of step size than the other methods. This might explain why, in [10, 2], batch data was used to initialize the hyper-parameters, and why the learning rate used to update the hyper-parameters was selected to be much smaller than that used for stochastic natural gradient ascent.

To further investigate the properties of the different stochastic approximations, we show the change of the VLB and the prediction error over iterations and time in Figure 2. Overall, whereas iVSGPR and SVI share a similar convergence rate, the behavior of SVDGP is different. iVSGPR converges the fastest, in both time and sample complexity; afterwards, SVDGP starts to improve faster and surpasses the other two methods. From Figure 2, we can also observe that although SVI initially converges similarly to iVSGPR, it slows down earlier and therefore achieves a worse result.
These phenomena are observed in multiple experiments.

5 Conclusion

We propose a novel, fully differentiable framework, Decoupled Gaussian Processes (DGPs), for large-scale GP problems. By decoupling the representation, we derive a variational inference problem that can be solved by stochastic gradient ascent with linear time and space complexity. Compared with existing algorithms, SVDGP can adopt a much larger set of basis functions to predict more accurately. Empirically, SVDGP significantly outperforms state-of-the-art variational sparse GPR algorithms in multiple regression tasks. These encouraging experimental results motivate the further application of SVDGP to end-to-end learning with neural networks in large-scale, complex real-world problems.

Acknowledgments

This work was supported in part by NSF NRI award 1637758. The authors additionally thank the reviewers and Hugh Salimbeni for productive discussions that improved the quality of the paper.

References

[1] Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems, pages 1525–1533, 2016.

[2] Ching-An Cheng and Byron Boots. Incremental variational sparse Gaussian process regression. In Advances in Neural Information Processing Systems, pages 4403–4411, 2016.

[3] Ching-An Cheng and Han-Pang Huang. Learn the Lagrangian: A vector-valued RKHS approach to identifying Lagrangian systems. IEEE Transactions on Cybernetics, 46(12):3247–3258, 2016.

[4] Lehel Csató and Manfred Opper. Sparse representation for Gaussian process models. In Advances in Neural Information Processing Systems, pages 444–450, 2001.

[5] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F. Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients.
In Advances in Neural Information Processing Systems, pages 3041–3049, 2014.

[6] Nathaniel Eldredge. Analysis and probability on infinite-dimensional spaces. arXiv preprint arXiv:1607.03591, 2016.

[7] Anibal Figueiras-Vidal and Miguel Lázaro-Gredilla. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, pages 1087–1095, 2009.

[8] Pascal Germain, Francis Bach, Alexandre Lacoste, and Simon Lacoste-Julien. PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pages 1884–1892, 2016.

[9] Leonard Gross. Abstract Wiener spaces. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Contributions to Probability Theory, Part 1, pages 31–42. University of California Press, 1967.

[10] James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian processes for big data. arXiv preprint arXiv:1309.6835, 2013.

[11] James Hensman, Alexander G. de G. Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, 2015.

[12] James Hensman, Nicolas Durrande, and Arno Solin. Variational Fourier features for Gaussian processes. arXiv preprint arXiv:1611.06740, 2016.

[13] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[14] Jyrki Kivinen, Alexander J. Smola, and Robert C. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[15] Miguel Lázaro-Gredilla, Joaquin Quiñonero-Candela, Carl Edward Rasmussen, and Aníbal R. Figueiras-Vidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(Jun):1865–1881, 2010.

[16] Alexander G. de G. Matthews, James Hensman, Richard E.
Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.

[17] Franziska Meier, Philipp Hennig, and Stefan Schaal. Incremental local Gaussian regression. In Advances in Neural Information Processing Systems, pages 972–980, 2014.

[18] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6:1939–1959, 2005.

[19] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.

[20] John Schulman, Sergey Levine, Pieter Abbeel, Michael I. Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, pages 1889–1897, 2015.

[21] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In Artificial Intelligence and Statistics 9, number EPFL-CONF-161318, 2003.

[22] Rishit Sheth, Yuyang Wang, and Roni Khardon. Sparse variational inference for generalized GP models. In Proceedings of the 32nd International Conference on Machine Learning, pages 1302–1311, 2015.

[23] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pages 1257–1264, 2005.

[24] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.

[25] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 5026–5033.
IEEE, 2012.

[26] Christian Walder, Kwang In Kim, and Bernhard Schölkopf. Sparse multiscale Gaussian process regression. In Proceedings of the 25th International Conference on Machine Learning, pages 1112–1119. ACM, 2008.

[27] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In Proceedings of the 32nd International Conference on Machine Learning, pages 1775–1784, 2015.