{"title": "Algorithmic Linearly Constrained Gaussian Processes", "book": "Advances in Neural Information Processing Systems", "page_first": 2137, "page_last": 2148, "abstract": "We algorithmically construct multi-output Gaussian process priors which satisfy linear differential equations. Our approach attempts to parametrize all solutions of the equations using Gr\u00f6bner bases. If successful, a push forward Gaussian process along the parametrization is the desired prior. We consider several examples from physics, geomathematics and control, among them the full inhomogeneous system of Maxwell's equations. By bringing together stochastic learning and computer algebra in a novel way, we combine noisy observations with precise algebraic computations.", "full_text": "Algorithmic Linearly Constrained Gaussian Processes\n\nMarkus Lange-Hegermann\n\nDepartment of Electrical Engineering and Computer Science\n\nOstwestfalen-Lippe University of Applied Sciences\n\nLemgo\n\nmarkus.lange-hegermann@hs-owl.de\n\nAbstract\n\nWe algorithmically construct multi-output Gaussian process priors which satisfy linear differential equations. Our approach attempts to parametrize all solutions of the equations using Gr\u00f6bner bases. If successful, a push forward Gaussian process along the parametrization is the desired prior. We consider several examples from physics, geomathematics and control, among them the full inhomogeneous system of Maxwell's equations. By bringing together stochastic learning and computer algebra in a novel way, we combine noisy observations with precise algebraic computations.\n\n1 Introduction\n\nIn recent years, Gaussian process regression has become a prime regression technique [37]. Roughly, a Gaussian process can be viewed as a suitable1 probability distribution on a set of functions, which we can condition on observations using Bayes' rule. 
The resulting mean function is used for regression. The strength of Gaussian process regression lies in avoiding overfitting while still finding functions complex enough to describe any behavior present in the given observations, even in noisy or unstructured data. Gaussian processes are usually applied when observations are rare or expensive to produce. Applications range, among many others, from robotics [9], biology [19], global optimization [33], and astrophysics [13] to engineering [47].\nIncorporating justified assumptions into the prior helps these applications: the full information content of the scarce observations can be utilized to create a more precise regression model. Examples of such assumptions are smooth or rough behavior, trends, homogeneous or heterogeneous noise, local or global behavior, and periodicity (cf. \u00a74 in [37], [11]). Such assumptions are usually incorporated in the covariance structure of the Gaussian process.\nEven certain physical laws, given by linear differential equations, can be incorporated into the covariance structures of Gaussian process priors. Thereby, despite their random nature, all realizations and the mean function of the posterior strictly adhere to these physical laws2. For example, [29, 41] constructed covariance structures for divergence-free and curl-free vector fields, which [50, 45] used to model electromagnetic phenomena.\nA first step towards systematizing this construction was achieved in [24]. In certain cases, a map into the solution set of the physical laws could be found by a computation that does not necessarily terminate. Having found such a map, one could assume a Gaussian process prior in its domain and push it forward. 
This results in a Gaussian process prior for the solutions of the physical laws.\n\n1They are the maximum entropy prior for finite mean and variance in the unknown behavior [22, 23].\n2For notational simplicity, we refrain from using the phrases "almost surely" and "up to equivalence" in this paper, e.g. by assuming separability.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fIn Section 2, we stress that the map from [24] into the solution set should be a parametrization, i.e., surjective. In Section 3, we combine this with an algorithm which computes this parametrization if it exists or reports failure if it does not exist.\nThis algorithm is a homological result in algebraic system theory (cf. \u00a77.(25) in [32]). Using Gr\u00f6bner bases, it is fully algorithmic and works for a wide variety of operator rings; among them the polynomial ring in the partial derivatives, which models linear systems of differential equations with constant coefficients; the (various) Weyl algebras, which model linear systems of differential equations with variable coefficients (cf. Example 4.2); and similar rings for difference equations and combined delay differential equations. To demonstrate the use of Gr\u00f6bner bases, Example 4.3 shows explicit computer algebra code.\nUsing the results of this paper, one can add information to Gaussian processes3 not only by\n\n(i) conditioning on observations (Bayes' rule), but also by\n(ii) restricting to solutions of linear operator matrices by constructing a suitable prior.\n\nSince these two constructions are compatible, we can combine strict, global information from equations with noisy, local information from observations. 
The author views this combination of techniques from homological algebra and machine learning as the main result of this paper, and the construction of covariance functions satisfying physical laws as a proof of concept.\nEven though Gaussian processes are a highly precise interpolation tool, they lack in two regards: missing extrapolation capabilities and high computation time, cubic in the number of observations. These problems have, to a certain degree, been addressed: more powerful covariance structures [25, 21, 51, 53, 6] and several fast approximations to Gaussian process regression [48, 18, 52, 20, 10] have been proposed. This paper addresses these two problems from a complementary angle. The linear differential equations allow extrapolation and reduce the required number of observations, which improves computation time.\nThe promises in this introduction are demonstrated in Example 4.1. It constructs a Gaussian process such that all of its realizations satisfy the inhomogeneous Maxwell equations of electromagnetism. Conditioning this Gaussian process on a single observation of electric current yields, as expected, a magnetic field circling around this electric current. This shows how to combine data (the electric current) with differential equations for a global model, which extrapolates away from the data.\n\n2 Differential Equations and Gaussian Processes\n\nThis section is mostly expository and summarizes Gaussian processes and how differential operators act on them. Subsection 2.1 summarizes Gaussian process regression. We then introduce differential (Subsection 2.2) and other operators (Subsection 2.3).\n\n2.1 Gaussian processes\n\nA Gaussian process g = GP(\u00b5, k) is a distribution on the set of functions R^d \u2192 R^\u2113 such that the function values g(x_1), . . . , g(x_n) at x_1, . . . , x_n \u2208 R^d have a joint Gaussian distribution. 
It is specified by a mean function\n\n\u00b5 : R^d \u2192 R^\u2113 : x \u21a6 E(g(x))\n\nand a positive semidefinite covariance function\n\nk : R^d \u2295 R^d \u2192 R^{\u2113\u00d7\u2113}_{\u2ab00} : (x, x') \u21a6 E( (g(x) \u2212 \u00b5(x)) (g(x') \u2212 \u00b5(x'))^T ) .\n\nAssume the regression model y_i = g(x_i) and condition on n observations\n\n{ (x_i, y_i) \u2208 R^{1\u00d7d} \u2295 R^{1\u00d7\u2113} | i = 1, . . . , n } .\n\nDenote by k(x, X) \u2208 R^{\u2113\u00d7\u2113n} resp. k(X, X) \u2208 R^{\u2113n\u00d7\u2113n}_{\u2ab00} the (covariance) matrices obtained by concatenating the matrices k(x, x_j) resp. the positive semidefinite block partitioned matrix with blocks k(x_i, x_j). Write \u00b5(X) \u2208 R^{\u2113\u00d7n} for the matrix obtained by concatenating the vectors \u00b5(x_i) and y \u2208 R^{1\u00d7\u2113n} for the row vector obtained by concatenating the rows y_i. The posterior\n\nGP( x \u21a6 \u00b5(x) + (y \u2212 \u00b5(X)) k(X, X)^{\u22121} k(x, X)^T , (x, x') \u21a6 k(x, x') \u2212 k(x, X) k(X, X)^{\u22121} k(x', X)^T )\n\nis again a Gaussian process and its mean function is used as regression model.\n\n3The construction of covariance functions is applicable to kernels more generally.\n\n2\n\n\f2.2 Differential equations\n\nRoughly speaking, Gaussian processes are the linear objects among stochastic processes. Hence, we find a rich interplay of Gaussian processes and linear operators.\nFor simplicity, let R = R[\u2202_{x_1}, . . . , \u2202_{x_d}] be the polynomial ring in the partial differential operators. For different or more general operator rings see Subsection 2.3. This ring models linear (partial) differential equations with constant coefficients, as it acts on the vector space F = C^\u221e(R^d, R) of smooth functions, where \u2202_{x_i} acts by partial derivative w.r.t. x_i. 
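To make the regression formulas of Subsection 2.1 concrete, here is a minimal numerical sketch; the squared exponential kernel, the noise level, and all names are my own illustrative choices, not part of the paper:

```python
import numpy as np

def se_cov(x, xp):
    # Squared exponential covariance for scalar outputs and 1-d inputs.
    return np.exp(-0.5 * (x[:, None] - xp[None, :]) ** 2)

def gp_posterior(X, y, Xs, noise=1e-8):
    # Posterior mean and covariance at test points Xs for a zero prior mean,
    # following the formulas of Subsection 2.1 (var(eps) added to k(X, X)).
    K = se_cov(X, X) + noise * np.eye(len(X))
    Ks = se_cov(Xs, X)
    mean = Ks @ np.linalg.solve(K, y)
    cov = se_cov(Xs, Xs) - Ks @ np.linalg.solve(K, Ks.T)
    return mean, cov

X = np.array([-1.0, 0.0, 1.0])
y = np.sin(X)
mean, cov = gp_posterior(X, y, X)
# At the observation points the posterior mean reproduces the data and the
# posterior variance nearly vanishes.
```

The same formulas apply verbatim to multi-output covariance functions such as those constructed later in the paper; only the block structure of k(X, X) changes.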
The set of realizations of a Gaussian process with squared exponential covariance function is dense in F (cf. Thm. 12, Prop. 42 in [43]).\nThe class of Gaussian processes is closed under matrices B \u2208 R^{\u2113\u00d7\u2113''} of linear differential operators with constant coefficients. Let g = GP(\u00b5, k) be a Gaussian process with realizations in a space F^{\u2113''} of vectors with functions in F as entries. Define the Gaussian process B_*g as the Gaussian process induced by the pushforward measure under B of the Gaussian measure induced by g. It holds that\n\nB_*g = GP( B\u00b5(x), B k(x, x') (B')^T ) ,   (1)\n\nwhere B' denotes the operation of B on functions with argument x' \u2208 R^d [4, Thm. 9].\nThe covariance function k for such Gaussian processes B_*g as in (1) is often singular. This is to be expected, as B_*g is rarely dense in F^\u2113. For numerical stability, we tacitly assume the model y_i = g(x_i) + \u03b5 for a small Gaussian white noise term \u03b5 and adapt k by adding var(\u03b5) to k(x_i, x_i) for observations x_i.\nExample 2.1. Let g = GP(0, k(x, x')) be a scalar univariate Gaussian process with differentiable realizations. Then, the Gaussian process of derivatives of functions is given by\n\n[\u2202/\u2202x]_* g = GP( 0, \u2202^2/(\u2202x \u2202x') k(x, x') ) .\n\nOne can interpret this Gaussian process [\u2202/\u2202x]_* g as taking derivatives as measurement data and producing a regression model of derivatives.\nWe say that a Gaussian process is in a function space, if its realizations are contained in said space. For A \u2208 R^{\u2113'\u00d7\u2113} define the solution set\n\nsol_F(A) := { f \u2208 F^\u2113 | Af = 0 } .\n\nSuch solution sets and Gaussian processes are connected in an almost tautological way.\nLemma 2.2. Let g = GP(\u00b5, k) be a Gaussian process in F^{\u2113\u00d71}. Then g is also a Gaussian process in sol_F(A) for A \u2208 R^{\u2113'\u00d7\u2113} if and only if \u00b5 \u2208 sol_F(A) and A_*(g \u2212 \u00b5) is the constant zero process.\n\nProof. Assume that g is a Gaussian process in sol_F(A). Then, the mean function is a realization, thus \u00b5 \u2208 sol_F(A). Furthermore, for g\u0303 := (g \u2212 \u00b5) = GP(0, k) all realizations are annihilated by A, and hence A_*g\u0303 is the constant zero process.\nConversely, assume that \u00b5 \u2208 sol_F(A) and A_*(g \u2212 \u00b5) is the constant zero process. This implies 0 = A_*(g \u2212 \u00b5) = A_*g \u2212 A_*\u00b5 = A_*g, i.e. all realizations of g become zero after a pushforward by A. In particular, all realizations of g are contained in sol_F(A). \u25a1\n\n3\n\n\fThis lemma implies another advantage of choosing a zero mean function: it is always a solution of the linear differential equations.\nOur goal is to construct Gaussian processes with realizations dense in the solution set sol_F(A) of an operator matrix A \u2208 R^{\u2113'\u00d7\u2113}. The following remark, implicit in [24], is a first step towards an answer.\nRemark 2.3. Let A \u2208 R^{\u2113'\u00d7\u2113} and B \u2208 R^{\u2113\u00d7\u2113''} with AB = 0. Let g = GP(0, k) be a Gaussian process in F^{\u2113''}. Then, the set of realizations of B_*g is contained in sol_F(A).\nThis remark is implied by Lemma 2.2, as A_*(B_*g) = (AB)_*g = 0_*g = 0.\nWe call B \u2208 R^{\u2113\u00d7\u2113''} a parametrization of sol_F(A) if sol_F(A) = B F^{\u2113''}. Parametrizations yield the denseness of the realizations of a Gaussian process B_*g in sol_F(A).\nProposition 2.4. Let B \u2208 R^{\u2113\u00d7\u2113''} be a parametrization of sol_F(A) for A \u2208 R^{\u2113'\u00d7\u2113}. Let g = GP(0, k) be a Gaussian process dense in F^{\u2113''}. Then, the set of realizations of B_*g is dense in sol_F(A).\n\nThis proposition is a consequence of partial derivatives being bounded, and hence continuous, when F is equipped with the Fr\u00e9chet topology generated by the family of seminorms\n\n\u2016f\u2016_{a,b} := sup_{i \u2208 Z^d_{\u22650}, |i| \u2264 a} sup_{z \u2208 [\u2212b, b]^d} | \u2202^i/\u2202z^i f(z) |\n\nfor a, b \u2208 Z_{\u22650} (cf. \u00a710 in [49]). Now, the continuous surjective map induced by B maps a dense set to a dense set.\n\n2.3 Further operator rings\n\nThe theory presented for differential equations with constant coefficients also holds for other rings R of linear operators and function spaces F. The following three operator rings are prominent examples.\nThe polynomial ring R = R[x_1, . . . , x_d] models polynomial equations when it acts on the set F of smooth functions defined on a (Zariski-)open set in R^d.\nFor ordinary linear differential equations with rational4 coefficients consider the Weyl algebra R = R(t)\u27e8\u2202_t\u27e9, with the non-commutative relation \u2202_t t = t \u2202_t + 1 representing the product rule of differentiation. We consider solutions in the set F of smooth functions defined on a co-finite set.\nThe polynomial ring R = R[\u03c3_{x_1}, . . . , \u03c3_{x_d}] models linear shift equations with constant coefficients when it acts on the set F = R^{Z^d_{\u22650}} of d-dimensional sequences by translation of the arguments.\n\n3 Computing parametrizations\n\nBy the last section, constructing a parametrization B of sol_F(A) yields a Gaussian process dense in the solution set sol_F(A) of an operator matrix A \u2208 R^{\u2113'\u00d7\u2113}. 
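Before turning to the algebra, the pushforward covariance of Example 2.1 can be checked symbolically; the closed form asserted below is specific to the squared exponential covariance function and is my own worked-out illustration (a sympy sketch, names my own):

```python
import sympy as sp

x, xp = sp.symbols("x x'")
k = sp.exp(-sp.Rational(1, 2) * (x - xp) ** 2)  # squared exponential covariance

# Pushforward covariance B k(x, x') (B')^T for B = d/dx, as in Example 2.1:
k_deriv = sp.diff(k, x, xp)

# For this k, the derivative covariance has the closed form (1 - (x - x')^2) k:
assert sp.simplify(k_deriv - (1 - (x - xp) ** 2) * k) == 0
```

The same symbolic differentiation yields the mixed blocks cov(g, [d/dx]_* g), which is how the covariance functions of the later examples are assembled.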
Subsection 3.1 gives necessary and sufficient conditions for a parametrization to exist and Subsection 3.2 describes their computation.\n\n3.1 Existence of parametrizations\n\nIt turns out that we can decide whether a parametrization exists purely algebraically, only using operations over R that do not involve F.\nBy r-ker(A) we denote the right kernel of A \u2208 R^{\u2113'\u00d7\u2113}, i.e. r-ker(A) = { m \u2208 R^{\u2113\u00d71} | Am = 0 }. By l-ker(A) we denote the left kernel of A, i.e. l-ker(A) = { m \u2208 R^{1\u00d7\u2113'} | mA = 0 }. Abusing notation, we denote any matrix as left resp. right kernel if its rows resp. columns generate the kernel as an R-module.\nTheorem 3.1. Let A \u2208 R^{\u2113'\u00d7\u2113}. Define matrices B = r-ker(A) and A' = l-ker(B). Then sol_F(A') is the largest subset of sol_F(A) that is parametrizable and B parametrizes sol_F(A').\n\n4No major changes for polynomial, holonomic, or meromorphic coefficients.\n\n4\n\n\fA well-known special case of this theorem are finite dimensional vector spaces, with R = F a field. In that case, sol_F(A) can be found by solving the homogeneous system of linear equations Ab = 0 with the Gaussian algorithm and writing a basis of the solutions b into the columns of a matrix B. This matrix B is also called the (right) kernel of A. Now, we wonder whether there are additional equations satisfied by the above solutions, i.e., when does Ab = 0 imply A'b = 0. These equations A' are the (left) kernel of B. At least in the case of finite dimensional vector spaces5, there are no additional equations6. However, for general rings R, the left kernel A' of the right kernel B of A is not necessarily A (up to an equivalence). For example, the solution set sol_F(A') is the subset of controllable behaviors in sol_F(A).\nCorollary 3.2. 
In Theorem 3.1, sol_F(A) is parametrizable if and only if the rows of A and A' generate the same row module. Since AB = 0, this is the case if all rows of A' are contained in the row module generated by the rows of A. In this case, sol_F(A) is parametrized by B. Furthermore, a Gaussian process g with realizations dense in F^{\u2113''} leads to a Gaussian process B_*g with realizations dense in sol_F(A).\n\nFor a formal proof of this theorem and its corollary see [55, Thm. 2], [54, Thm. 3, Alg. 1, Lemma 1.2.3], or [32, \u00a77.(24)], and for additional characterizations, generalizations, and proofs using more homological machinery see [36, 35, 2, 42, 7, 40] and references therein.\nThe approach assigns a prior to the parametrizing functions and pushes this prior forward to a prior on the solution set sol_F(A). The parametrization is not canonical, and hence different parametrizations might lead to different priors.\n\n3.2 Algorithms\n\nSummarizing Theorem 3.1 and Corollary 3.2 algorithmically, we need to compute right kernels (of A), compute left kernels (of B), and decide whether rows (of A') are contained in a row module (generated by the rows of A). All these computations are an application of Gr\u00f6bner basis algorithms.\nIn recent decades, Gr\u00f6bner basis algorithms have become one of the core algorithms of computer algebra, with manifold applications in geometry, system theory, natural sciences, automatic theorem proving, post-quantum cryptography, and many others. Reduced Gr\u00f6bner bases generalize the reduced echelon form from linear systems to systems of polynomial (and hence linear operator) equations, by bringing them into a standard form7. 
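As a toy illustration of this standard form and the resulting membership test (using sympy's groebner; the two polynomials are my own illustrative choice, not operators from the paper):

```python
import sympy as sp

d1, d2, d3 = sp.symbols("d1 d2 d3")

# Reduced Groebner basis of the ideal generated by two polynomials:
G = sp.groebner([d1**2 - d2, d2**2 - d3], d1, d2, d3, order="lex")

# Membership test by reduction: remainder 0 means d1^4 - d3 lies in the
# ideal, i.e. the corresponding equation is implied by the generators.
_, remainder = sp.reduced(d1**4 - d3, G.exprs, d1, d2, d3, order="lex")
assert remainder == 0
```

Reduction modulo a reduced Gröbner basis always produces a unique remainder, which is what makes this membership test well defined.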
They are computed by Buchberger's algorithm, which is a generalization of the Gaussian and Euclidean algorithms and a special case of the Knuth-Bendix completion algorithm.\nSimilar to the reduced echelon form, Gr\u00f6bner bases allow one to compute all solutions over R (not F) of a homogeneous system and to compute, if it exists, a particular solution over R (not F) of an inhomogeneous system. Solving homogeneous systems is the same as computing right resp. left kernels. Solving inhomogeneous equations decides whether an element is contained in a module. Alternatively, the uniqueness of reduced Gr\u00f6bner bases also decides submodule equality.\nA formal description of Gr\u00f6bner bases would exceed the scope of this note. Instead, we refer to the excellent literature [46, 12, 1, 17, 14, 5]. Gr\u00f6bner basis algorithms exist for many rings R. They historically emerged for polynomial rings, and have since been generalized to the Weyl algebra, the shift algebra, and, more generally, G-algebras [26, 27] and Ore algebras [39, 38]. They are implemented in various computer algebra systems, of which Singular [8] and Macaulay2 [16] are two well-known examples. Even though the complexity of Gr\u00f6bner bases is in the vicinity of EXPSPACE-completeness (cf. [30, 31, 3]), the "average interesting example" (e.g. every example in this paper) usually terminates instantaneously. This holds in particular since the Gr\u00f6bner basis computations only involve the operator equations, but not the data in any way.\n\n5As finite dimensional vector spaces are reflexive, i.e. isomorphic to their bi-dual.\n6More precisely, A and A' have the same row space.\n7This standard form depends on choices, specifically a so-called monomial order.\n\n5\n\n\f3.3 Hyperparameters\n\nMany covariance functions8 incorporate hyperparameters and advanced methods specifically add more hyperparameters to Gaussian processes, see e.g. 
[44, 6, 51], for additional flexibility. The approach in this paper is the opposite: it restricts the Gaussian process prior to the solutions of an operator matrix. Of course, the prior for the parametrizing functions can still contain hyperparameters, which can be determined by maximizing the likelihood. Many important applications contain unknown parameters in the equations. Such parameters can also be estimated via the likelihood.\nConsider ordinary differential equations with constant resp. variable coefficients. The solution set of an operator matrix is a direct sum of a parametrizable set of functions and a finite dimensional set of functions, due to the Smith form resp. Jacobson form. In many cases, in particular the case of constant coefficients, the solution set of the finite dimensional summand can easily be computed. This paper additionally allows one to compute with the parametrizable summand of the solution set and to estimate parameters and hyperparameters of both summands together.\n\n4 Examples\n\nExample 4.1. Maxwell's equations of electromagnetism use curl and divergence operators as building blocks. It is a well-known result that the solutions of the inhomogeneous Maxwell equations are parametrized by the electric and magnetic potentials. We verify this and use the parametrization to construct a Gaussian process such that its realizations adhere to Maxwell's equations. In Figure 1, we condition this prior on a single observation of flowing electric current, which leads to the magnetic field circling around the current. 
This usage of differential equations shows an extrapolation away from the data point in space and into other components.\nThe inhomogeneous Maxwell equations are given by the operator matrix\n\nA :=\n[  0,   \u2212\u2202z,  \u2202y,   \u2202t,   0,    0,    0,   0,   0,   0 ]\n[  \u2202z,   0,  \u2212\u2202x,   0,    \u2202t,   0,    0,   0,   0,   0 ]\n[ \u2212\u2202y,  \u2202x,   0,    0,    0,    \u2202t,   0,   0,   0,   0 ]\n[  0,    0,    0,    \u2202x,   \u2202y,   \u2202z,   0,   0,   0,   0 ]\n[ \u2212\u2202t,   0,    0,    0,   \u2212\u2202z,   \u2202y,  \u22121,   0,   0,   0 ]\n[  0,   \u2212\u2202t,   0,    \u2202z,   0,   \u2212\u2202x,   0,  \u22121,   0,   0 ]\n[  0,    0,   \u2212\u2202t,  \u2212\u2202y,   \u2202x,   0,    0,   0,  \u22121,   0 ]\n[  \u2202x,   \u2202y,   \u2202z,   0,    0,    0,    0,   0,   0,  \u22121 ]\n\napplied to the three components of the electric field, three components of the magnetic (pseudo) field, three components of the electric current, and one component of the electric flux. We have set all constants to 1.\nUsing Gr\u00f6bner bases, one computes the right kernel\n\nB :=\n[  \u2202x,                  \u2202t,                   0,                    0                  ]\n[  \u2202y,                  0,                    \u2202t,                   0                  ]\n[  \u2202z,                  0,                    0,                    \u2202t                 ]\n[  0,                   0,                    \u2202z,                  \u2212\u2202y                 ]\n[  0,                  \u2212\u2202z,                   0,                    \u2202x                 ]\n[  0,                   \u2202y,                  \u2212\u2202x,                   0                  ]\n[ \u2212\u2202t\u2202x,   \u2202y^2 + \u2202z^2 \u2212 \u2202t^2,   \u2212\u2202y\u2202x,               \u2212\u2202z\u2202x              ]\n[ \u2212\u2202t\u2202y,  \u2212\u2202y\u2202x,               \u2202x^2 + \u2202z^2 \u2212 \u2202t^2,   \u2212\u2202z\u2202y              ]\n[ \u2212\u2202t\u2202z,  \u2212\u2202z\u2202x,              \u2212\u2202z\u2202y,                \u2202x^2 + \u2202y^2 \u2212 \u2202t^2  ]\n[  \u2202x^2 + \u2202y^2 + \u2202z^2,   \u2202t\u2202x,    \u2202t\u2202y,    \u2202t\u2202z  ]\n\nof A and verifies that it parametrizes the set of solutions of the inhomogeneous Maxwell equations. For the demonstration in Figure 
1 we assume squared exponential covariance functions and a zero mean function for four uncorrelated parametrizing functions (electric potential and magnetic potentials).\n\n8Sometimes even the mean function contains hyperparameters. These additional hyperparameters are usually not very expressive, compared to the non-parametric Gaussian process model.\n\n6\n\n\fFigure 1: We condition the prior on solutions of Maxwell's equations from Example 4.1 on an electric current in z-direction and zero electric flux at the origin x = y = z = t = 0. The diagram shows the mean posterior magnetic field in the (z, t) = (0, 0)-plane. As expected by the right hand rule, it circles around the point with electric current.\n\nExample 4.2. We consider the time-varying control system \u2202_t x(t) = t^3 u(t) from [34, Example 1.5.7] over the one-dimensional Weyl algebra R = R(t)\u27e8\u2202_t\u27e9. This system, given by the matrix A := [ \u2202_t  \u2212t^3 ], is parametrizable by\n\nB = [ 1         ]\n    [ (1/t^3) \u2202_t ] .\n\nFor a parametrizing function with squared exponential covariance function k(t1, t2) = exp(\u2212(1/2)(t1 \u2212 t2)^2) and a zero mean function, the covariance function for (x, u) is\n\n[ 1,                 (t1 \u2212 t2)/t2^3                          ]\n[ (t2 \u2212 t1)/t1^3,   (t2 \u2212 t1 \u2212 1)(t1 \u2212 t2 \u2212 1)/(t1^3 t2^3)  ] \u00b7 exp( \u2212(1/2)(t1 \u2212 t2)^2 ) .\n\nFor a demonstration of how to observe resp. control such a system see Figures 2 resp. 3.\n\nFigure 2: The state function x(t) of the system in Example 4.2 can be influenced by assigning an input function u(t). E.g., leaving the state x(t) unspecified except for the boundary condition x(1) = 0 and fixing the input u(t) = 1/(t^4 + 1) for t \u2208 {1, 11/10, 12/10, . . . , 5} leads to the above posterior means. This model yields x(5) \u2248 1.436537, close to \u222b_1^5 t^3/(t^4 + 1) dt \u2248 1.436551.\n\nExample 4.3. We reproduce the well-known fact that divergence-free (vector) fields can be parametrized by the curl operator. This has been used in connection with Gaussian processes to model electric and magnetic phenomena [29, 50, 45]. The same algebraic computation also constructs a prior for tangent fields of a sphere.\n\n7\n\n\fFigure 3: We control the system in Example 4.2 by specifying a desired behavior for the state x(t) and letting the Gaussian process construct a suitable input u(t), which is completely unspecified by us. Starting with x(1) = 1 we give u(t) one time step to control x(t) to zero, e.g., by setting x(t) = 0 for t \u2208 {20/10, 21/10, . . . , 5}.\n\nLet R = Q[x1, x2, x3] resp. R = Q[\u22021, \u22022, \u22023] be the polynomial ring in three indeterminates, which we can interpret either as the polynomial ring in the coordinates or in the differential operators. Consider the matrix A = [ x1  x2  x3 ] representing the normals of circles centered around the origin resp. the divergence. The right kernel of A is given by the operator\n\nB = [  0,    x3,  \u2212x2 ]\n    [ \u2212x3,   0,    x1 ]\n    [  x2,  \u2212x1,   0  ]\n\nrepresenting tangent spaces of circles centered around the origin resp. the curl, and these parametrize the solutions of A. 
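That B is indeed a right kernel of A can be double-checked by a plain matrix product over the (commutative) polynomial ring, mirroring the Macaulay2 session shown below (a sympy sketch, names my own):

```python
import sympy as sp

x1, x2, x3 = sp.symbols("x1 x2 x3")

A = sp.Matrix([[x1, x2, x3]])           # normals resp. divergence
B = sp.Matrix([[0, x3, -x2],
               [-x3, 0, x1],
               [x2, -x1, 0]])           # tangent fields resp. curl

# A right kernel must satisfy A * B = 0 over the polynomial ring:
assert (A * B).expand() == sp.zeros(1, 3)
```

Under the differential-operator reading of the same ring, this identity is the familiar fact that the divergence of a curl vanishes.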
A posterior mean field is demonstrated in Figure 4, assuming equal covariance functions k for 3 uncorrelated parametrizing functions; the covariance function for the tangent field is\n\n[ y1y2 + z1z2,  \u2212y1x2,         \u2212z1x2        ]\n[ \u2212x1y2,         x1x2 + z1z2,  \u2212z1y2        ]\n[ \u2212x1z2,        \u2212y1z2,          x1x2 + y1y2 ] \u00b7 k(x1, y1, z1, x2, y2, z2) .\n\nWe demonstrate how to compute B and A' for this example using Macaulay2 [16].\n\ni1 : R=QQ[d1,d2,d3]\no1 = R\no1 : PolynomialRing\ni2 : A=matrix{{d1,d2,d3}}\no2 = | d1 d2 d3 |\no2 : Matrix R^1 <--- R^3\ni3 : B = generators kernel A\no3 = {1} | -d2 0   -d3 |\n     {1} | d1  -d3 0   |\n     {1} | 0   d2  d1  |\no3 : Matrix R^3 <--- R^3\ni4 : A1 = transpose generators kernel transpose B\no4 = | d1 d2 d3 |\no4 : Matrix R^1 <--- R^3\n\nExample 4.4. We construct a prior for smooth tangent fields on the sphere without sources and sinks. We work in the third polynomial Weyl algebra R = R[x, y, z]\u27e8\u2202x, \u2202y, \u2202z\u27e9. I.e., we are interested in sol_F(A) = { v \u2208 C^\u221e(S^2, R^3) | Av = 0 } for\n\nA := [ x,    y,    z  ]\n     [ \u2202x,   \u2202y,   \u2202z ] .\n\n8\n\n\fThe right kernel\n\nB := [ \u2212z\u2202y + y\u2202z ]\n     [  z\u2202x \u2212 x\u2202z ]\n     [ \u2212y\u2202x + x\u2202y ]\n\ncan be checked to yield a parametrization of sol_F(A). Assuming a squared exponential covariance function k for the parametrizing function, a demonstration can be found in Figure 4.\n\nFigure 4: Taking the squared exponential covariance function for k in Example 4.3 yields the left smooth mean tangent field on the sphere after conditioning at 4 evenly distributed points on the equator with two opposite tangent vectors pointing north and south each. The two visible of these four vectors are displayed significantly bigger. 
Conditioning the prior in Example 4.4 at 2 opposite points on the equator with tangent vectors both pointing north (displayed bigger) yields the right mean field.\n\n5 Conclusion\n\nThe paper constructs multi-output Gaussian process priors which adhere to linear operator equations. With these priors, few observations yield a precise regression model with strong extrapolation capabilities (cf. Examples 4.1, 4.3, and 4.4). This construction is fully algorithmic and rather general, as it allows linear systems of differential equations with constant or variable coefficients, shift equations, or multiplications with variables. It can be applied to settings from physics (cf. Example 4.1), to geometric settings with potential applications in geomathematics and weather prediction (cf. Examples 4.1, 4.3, and 4.4), or to observing and controlling systems (cf. Example 4.2). The main restriction is that the solutions of the system of equations must be parametrizable.\nThe author hopes that the results can be generalized from parametrizable solution sets to the general case using a Monge parametrization (computable via the purity filtration [36, 35, 2]) and right hand sides [15]. It would also be interesting to apply the approach to parameter estimation (cf. Example 4.2) and boundary conditions [15], and to clarify the connection between the algebra, functional analysis, topology, and measure theory used in this paper. Finally, it would be interesting to determine experimentally which covariance function for the parametrizing functions is most suitable.\n\nAcknowledgments\n\nThe author thanks M. Barakat, S. Gutsche, C. Kaus, D. Moser, S. Posur, and O. Wittich for discussions concerning this paper, W. Plesken, A. Quadrat, D. Robertz, and E. Zerz for introducing him to the algebraic background of this paper, S. Thewes for introducing him to Gaussian processes, and the authors of [24] for providing the starting point of this work. 
This work owes much to\ncomments from anonymous reviewers.\n\n9\n\n\fReferences\n[1] William W. Adams and Philippe Loustaunau. An introduction to Gr\u00f6bner bases. Graduate\n\nStudies in Mathematics. American Mathematical Society, 1994.\n\n[2] Mohamed Barakat. Purity \ufb01ltration and the \ufb01ne structure of autonomy. In Proceedings of the\n19th International Symposium on Mathematical Theory of Networks and Systems - MTNS 2010,\npages 1657\u20131661, Budapest, Hungary, 2010.\n\n[3] David Bayer and Michael Stillman. On the complexity of computing syzygies. Journal of\n\nSymbolic Computation, 6(2-3):135\u2013147, 1988.\n\n[4] A. Bertinet and Thomas C. Agnan. Reproducing Kernel Hilbert Spaces in Probability and\n\nStatistics. Kluwer Academic Publishers, 2004.\n\n[5] Bruno Buchberger. An algorithm for \ufb01nding the basis elements of the residue class ring of a\nzero dimensional polynomial ideal. J. Symbolic Comput., 41(3-4):475\u2013511, 2006. Translated\nfrom the 1965 German original by Michael P. Abramson.\n\n[6] Roberto Calandra, Jan Peters, Carl E. Rasmussen, and Marc P. Deisenroth. Manifold Gaussian\nprocesses for regression. In International Joint Conference on Neural Networks, pages 3338\u2013\n3345, 2016.\n\n[7] Fr\u00e9d\u00e9ric Chyzak, Alban Quadrat, and Daniel Robertz. Effective algorithms for parametrizing\nlinear control systems over Ore algebras. Appl. Algebra Engrg. Comm. Comput., 16(5):319\u2013376,\n2005.\n\n[8] Wolfram Decker, Gert-Martin Greuel, Gerhard P\ufb01ster, and Hans Sch\u00f6nemann. SINGULAR\n4-1-0 \u2014 A computer algebra system for polynomial computations. http://www.singular.\nuni-kl.de, 2016.\n\n[9] Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian processes for data-\nef\ufb01cient learning in robotics and control. IEEE Trans. Pattern Anal. Mach. 
Intell., 37(2):408–423, 2015.

[10] Kun Dong, David Eriksson, Hannes Nickisch, David Bindel, and Andrew Gordon Wilson. Scalable log determinants for Gaussian process kernel learning. 2017. (arXiv:1711.03481).

[11] David Duvenaud. Automatic Model Construction with Gaussian Processes. PhD thesis, University of Cambridge, 2014.

[12] David Eisenbud. Commutative Algebra with a View Toward Algebraic Geometry, volume 150 of Graduate Texts in Mathematics. Springer-Verlag, 1995.

[13] Roman Garnett, Shirley Ho, and Jeff G. Schneider. Finding galaxies in the shadows of quasars with Gaussian processes. In Francis R. Bach and David M. Blei, editors, ICML, volume 37 of JMLR Workshop and Conference Proceedings, pages 1025–1033. JMLR.org, 2015.

[14] Vladimir P. Gerdt. Involutive algorithms for computing Gröbner bases. In Computational commutative and non-commutative algebraic geometry, volume 196 of NATO Sci. Ser. III Comput. Syst. Sci., pages 199–225. 2005.

[15] Thore Graepel. Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations. In Proceedings of the Twentieth International Conference on Machine Learning, ICML’03, pages 234–241. AAAI Press, 2003.

[16] Daniel R. Grayson and Michael E. Stillman. Macaulay2, a software system for research in algebraic geometry. http://www.math.uiuc.edu/Macaulay2/.

[17] G. Greuel and G. Pfister. A Singular introduction to commutative algebra. Springer-Verlag, 2002. With contributions by Olaf Bachmann, Christoph Lossen and Hans Schönemann.

[18] James Hensman, Nicoló Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence, 2013.

[19] Antti Honkela, Jaakko Peltonen, Hande Topa, Iryna Charapitsa, Filomena Matarese, Korbinian Grote, Hendrik G.
Stunnenberg, George Reid, Neil D. Lawrence, and Magnus Rattray. Genome-wide modeling of transcription kinetics reveals patterns of RNA production delays. Proceedings of the National Academy of Sciences, 112(42):13115–13120, 2015.

[20] Pavel Izmailov, Alexander Novikov, and Dmitry Kropotov. Scalable Gaussian processes with billions of inducing inputs via tensor train decomposition, 2017. (arXiv:1710.07324).

[21] Phillip A. Jang, Andrew Loeb, Matthew Davidow, and Andrew G. Wilson. Scalable Lévy process priors for spectral kernel learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3940–3949. 2017.

[22] Edwin T. Jaynes. Prior probabilities. IEEE Transactions on Systems Science and Cybernetics, 4(3):227–241, 1968.

[23] Edwin T. Jaynes and G. Larry Bretthorst. Probability Theory: The Logic of Science. Cambridge University Press, 2003.

[24] Carl Jidling, Niklas Wahlström, Adrian Wills, and Thomas B. Schön. Linearly constrained Gaussian processes. 2017. (arXiv:1703.00787).

[25] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes, 2017. (arXiv:1711.00165).

[26] Viktor Levandovskyy. Non-commutative Computer Algebra for polynomial algebras: Gröbner bases, applications and implementation. PhD thesis, University of Kaiserslautern, June 2005.

[27] Viktor Levandovskyy and Hans Schönemann. PLURAL—a computer algebra system for noncommutative polynomial algebras. In Proceedings of the 2003 International Symposium on Symbolic and Algebraic Computation, pages 176–183 (electronic). ACM, 2003.

[28] Ives Macêdo and Rener Castro. Learning divergence-free and curl-free vector fields with matrix-valued kernels.
Instituto Nacional de Matematica Pura e Aplicada, Brasil, Tech. Rep., 2008.

[29] Ernst Mayr. Membership in polynomial ideals over Q is exponential space complete. In STACS 89 (Paderborn, 1989), volume 349 of Lecture Notes in Comput. Sci., pages 400–406. Springer, Berlin, 1989.

[30] Ernst W. Mayr and Albert R. Meyer. The complexity of the word problems for commutative semigroups and polynomial ideals. Advances in Mathematics, 46(3):305–329, 1982.

[31] Ulrich Oberst. Multidimensional constant linear systems. Acta Appl. Math., 20(1-2):1–175, 1990.

[32] Michael A. Osborne, Roman Garnett, and Stephen J. Roberts. Gaussian processes for global optimization. In 3rd International Conference on Learning and Intelligent Optimization (LION3), pages 1–15, 2009.

[33] Alban Quadrat. An introduction to constructive algebraic analysis and its applications. In Journées Nationales de Calcul Formel, volume 1 of Les cours du CIRM, pages 279–469. CIRM, Luminy, 2010. (http://ccirm.cedram.org/ccirm-bin/fitem?id=CCIRM_2010__1_2_281_0).

[34] Alban Quadrat. Systèmes et Structures – Une approche de la théorie mathématique des systèmes par l’analyse algébrique constructive. Habilitation thesis, April 2010.

[35] Alban Quadrat. Grade filtration of linear functional systems. Acta Appl. Math., 127:27–86, 2013.

[36] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning (Adaptive Computation and Machine Learning). The MIT Press, 2006.

[37] Daniel Robertz. JanetOre: A Maple package to compute a Janet basis for modules over Ore algebras, 2003–2008.

[38] Daniel Robertz. Formal Computational Methods for Control Theory. PhD thesis, RWTH Aachen, 2006.

[39] Daniel Robertz. Recent progress in an algebraic analysis approach to linear systems. Multidimensional Syst.
Signal Process., 26(2):349–388, April 2015.

[40] Michael Scheuerer and Martin Schlather. Covariance models for divergence-free and curl-free random vector fields. Stochastic Models, 28(3):433–451, 2012.

[41] Werner M. Seiler and Eva Zerz. The inverse syzygy problem in algebraic systems theory. PAMM, 10(1):633–634, 2010.

[42] C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. Technical report, 2016. (arXiv:1604.05251).

[43] Edward Snelson, Carl Edward Rasmussen, and Zoubin Ghahramani. Warped Gaussian processes. In Sebastian Thrun, Lawrence K. Saul, and Bernhard Schölkopf, editors, NIPS, pages 337–344. MIT Press, 2003.

[44] Arno Solin, Manon Kok, Niklas Wahlström, Thomas B. Schön, and Simo Särkkä. Modeling and interpolation of the ambient magnetic field by Gaussian processes. 2015. (arXiv:1509.04634).

[45] Bernd Sturmfels. What is... a Gröbner basis? Notices of the AMS, 52(10):2–3, 2005.

[46] Silja Thewes, Markus Lange-Hegermann, Christoph Reuber, and Ralf Beck. Advanced Gaussian Process Modeling Techniques. In Design of Experiments (DoE) in Powertrain Development. Expert, 2015.

[47] Michalis K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics 12, pages 567–574, 2009.

[48] F. Treves. Topological Vector Spaces, Distributions and Kernels. Dover Books on Mathematics. Academic Press, 1967.

[49] Niklas Wahlström, Manon Kok, Thomas B. Schön, and Fredrik Gustafsson. Modeling magnetic fields using Gaussian processes. In Proceedings of the 38th International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2013.

[50] Andrew G. Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation.
In ICML (3), volume 28 of JMLR Workshop and Conference Proceedings, pages 1067–1075. JMLR.org, 2013.

[51] Andrew G. Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. 2015. (arXiv:1511.01870).

[52] Andrew G. Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P. Xing. Deep kernel learning. 2015. (arXiv:1511.02222).

[53] Eva Zerz. Topics in multidimensional linear systems theory, volume 256 of Lecture Notes in Control and Information Sciences. Springer, London, 2000.

[54] Eva Zerz, Werner M. Seiler, and Marcus Hausdorf. On the inverse syzygy problem. Communications in Algebra, 38(6):2037–2047, 2010.