{"title": "Deep Random Splines for Point Process Intensity Estimation of Neural Population Data", "book": "Advances in Neural Information Processing Systems", "page_first": 13346, "page_last": 13356, "abstract": "Gaussian processes are the leading class of distributions on random functions, but they suffer from well known issues including difficulty scaling and inflexibility with respect to certain shape constraints (such as nonnegativity). Here we propose Deep Random Splines, a flexible class of random functions obtained by transforming Gaussian noise through a deep neural network whose output are the parameters of a spline. Unlike Gaussian processes, Deep Random Splines allow us to readily enforce shape constraints while inheriting the richness and tractability of deep generative models. We also present an observational model for point process data which uses Deep Random Splines to model the intensity function of each point process and apply it to neural population data to obtain a low-dimensional representation of spiking activity. Inference is performed via a variational autoencoder that uses a novel recurrent encoder architecture that can handle multiple point processes as input. We use a newly collected dataset where a primate completes a pedaling task, and observe better dimensionality reduction with our model than with competing alternatives.", "full_text": "Deep Random Splines for Point Process Intensity\n\nEstimation of Neural Population Data\n\nGabriel Loaiza-Ganem\nDepartment of Statistics\n\nColumbia University\n\ngl2480@columbia.edu\n\nSean M. Perkins\n\nDepartment of Biomedical Engineering\n\nColumbia University\n\nsp3222@columbia.edu\n\nKaren E. Schroeder\n\nDepartment of Neuroscience\n\nColumbia University\n\nks3381@columbia.edu\n\nMark M. Churchland\n\nDepartment of Neuroscience\n\nColumbia University\n\nmc3502@columbia.edu\n\nJohn P. 
Cunningham
Department of Statistics
Columbia University
jpc2181@columbia.edu

Abstract

Gaussian processes are the leading class of distributions on random functions, but they suffer from well known issues including difficulty scaling and inflexibility with respect to certain shape constraints (such as nonnegativity). Here we propose Deep Random Splines, a flexible class of random functions obtained by transforming Gaussian noise through a deep neural network whose output are the parameters of a spline. Unlike Gaussian processes, Deep Random Splines allow us to readily enforce shape constraints while inheriting the richness and tractability of deep generative models. We also present an observational model for point process data which uses Deep Random Splines to model the intensity function of each point process and apply it to neural population data to obtain a low-dimensional representation of spiking activity. Inference is performed via a variational autoencoder that uses a novel recurrent encoder architecture that can handle multiple point processes as input. We use a newly collected dataset where a primate completes a pedaling task, and observe better dimensionality reduction with our model than with competing alternatives.

1 Introduction

Gaussian Processes (GPs) are one of the main tools for modeling random functions [30]. They allow control of the smoothness of the function by choosing an appropriate kernel, but have the disadvantage that, except in special cases (for example Gilboa et al. [16], Flaxman et al. [14]), inference in GP models scales poorly in both memory and runtime. Furthermore, GPs cannot easily handle shape constraints. It is often of interest to model a function under some shape constraint, for example nonnegativity, monotonicity or convexity/concavity [28, 32, 29, 25].
While some shape constraints can be enforced by transforming the GP or by enforcing them at a finite number of points, this is not always possible and usually makes inference harder; see for example Lin and Dunson [23].
Splines are another popular tool for modeling unknown functions [36]. When there are no shape constraints, frequentist inference is straightforward and can be performed using linear regression, by writing the spline as a linear combination of basis functions. Under shape constraints, the basis function expansion usually no longer applies, since the space of shape-constrained splines is not typically a vector space. However, the problem can usually still be written down as a tractable constrained optimization problem [32]. Furthermore, when using splines to model a random function, a distribution must be placed on the spline's parameters, so the inference problem becomes Bayesian.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

DiMatteo et al. [9] proposed a method to perform Bayesian inference in a setting without shape constraints, but the method relies on the basis function expansion and cannot be used in a shape-constrained setting. Furthermore, fairly simple distributions have to be placed on the spline parameters for their approximate posterior sampling algorithm to work adequately, which results in the splines having a restrictive and oversimplified distribution.
On the other hand, deep probabilistic models take advantage of the major progress in neural networks to fit rich, complex distributions to data in a tractable way [31, 27, 20, 15, 19]. However, their goal is not usually to model random functions.
In this paper, we introduce Deep Random Splines (DRS), an alternative to GPs for modeling random functions.
DRS are a deep probabilistic model in which standard Gaussian noise is transformed through a neural network to obtain the parameters of a spline, and the random function is then the corresponding spline. This combines the expressiveness of deep generative models with the ability of splines to enforce shape constraints.
We use DRS to model the nonnegative intensity functions of Poisson processes [21]. In order to ensure that the splines are nonnegative, we use a parameterization of nonnegative splines whose parameter set can be written as an intersection of convex sets, and then use the method of alternating projections [35] to obtain a point in that intersection (and differentiate through that procedure during learning). To perform scalable inference, we use a variational autoencoder [20] with a novel encoder architecture that takes multiple, truly continuous point processes as input (not discretized in bins, as is common).
Our contributions are: (i) introducing DRS, (ii) using the method of alternating projections to constrain splines, (iii) proposing a variational autoencoder model with a novel encoder architecture for point process data which uses DRS, and (iv) showing that our model outperforms commonly used alternatives on both simulated and real data.
The rest of the paper is organized as follows: we first explain DRS, how to parameterize them and how constraints can be enforced in section 2. We then present our model and how to do inference in section 3. We compare our model against competing alternatives on simulated data and on two real spiking activity datasets, one of which we collected, in section 4, and observe that our method outperforms the alternatives. Finally, we summarize our work in section 5.

2 Deep Random Splines

Throughout the paper we will consider functions on the interval [T_1, T_2) and will select I + 1 fixed knots T_1 = t_0 < \cdots < t_I = T_2.
We will refer to a function as a spline of degree d and smoothness s < d if the function is a degree-d polynomial in each interval [t_{i-1}, t_i) for i = 1, \ldots, I, is continuous, and is s times differentiable. We will denote the set of splines of degree d and smoothness s by G_{d,s} = \{g_\psi : \psi \in \Psi_{d,s}\}, where \Psi_{d,s} is the set of parameters of each polynomial in each interval. That is, every \psi \in \Psi_{d,s} contains the parameters of each of the I polynomial pieces (it does not contain the locations of the knots, as we take them to be fixed since we observed overfitting when not doing so). While the most natural ways to parameterize splines of degree d are a linear combination of basis functions or the d + 1 polynomial coefficients of each interval, these parameterizations do not lend themselves to easily enforcing constraints such as nonnegativity [32]. We will thus use a different parameterization, which we explain in detail in the next section. We will denote by \Psi \subseteq \Psi_{d,s} the subset of spline parameters that result in the splines having the shape constraint of interest, for example, nonnegativity.
DRS are a distribution over G_{d,s}. To sample from a DRS, a standard Gaussian random variable Z \in R^m is transformed through a neural network parameterized by \theta, f_\theta : R^m \to \Psi. The DRS is then given by g_{f_\theta(Z)} and inference on \theta can be performed through a variational autoencoder [20]. Note that f_\theta maps to \Psi, thus ensuring that the spline has the relevant shape constraint.

2.1 Constraining Splines

We now explain how we can enforce piecewise polynomials to form a nonnegative spline. We add the nonnegativity constraint to the spline as we will use it for our model in section 3, but constraints such as monotonicity and convexity/concavity can be enforced in an analogous way.
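To make the piecewise representation above concrete, here is a minimal numpy sketch (the helper name `eval_piecewise_poly` and the coefficient layout are ours, not the paper's) that evaluates a piecewise polynomial on fixed knots, one coefficient vector per interval:

```python
import numpy as np

def eval_piecewise_poly(t, knots, coeffs):
    """Evaluate a piecewise polynomial at points t.

    knots  : array of length I + 1, the fixed knot locations.
    coeffs : array of shape (I, d + 1); coeffs[i] holds the polynomial
             coefficients (constant term first) on [knots[i], knots[i+1]).
    """
    t = np.atleast_1d(t)
    # searchsorted finds which interval each evaluation point falls in.
    idx = np.clip(np.searchsorted(knots, t, side="right") - 1, 0, len(knots) - 2)
    powers = t[:, None] ** np.arange(coeffs.shape[1])[None, :]
    return np.sum(coeffs[idx] * powers, axis=1)

knots = np.linspace(0.0, 10.0, 11)   # I = 10 intervals, as used later in the paper
rng = np.random.default_rng(0)
coeffs = rng.normal(size=(10, 4))    # cubic (d = 3) pieces, unconstrained
vals = eval_piecewise_poly(np.array([0.5, 5.5, 9.9]), knots, coeffs)
```

With unconstrained per-interval coefficients the result is generally discontinuous at the knots; the constraint sets and projections below are what turn such piecewise polynomials into (nonnegative) splines.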
In order to achieve this, we use a parameterization of nonnegative splines that might seem overly complicated at first. However, it has the critical advantage that it decomposes into the intersection of convex sets that are easily characterized in terms of the parameters, which is not the case for the naive parameterization which only includes the d + 1 coefficients of every polynomial. We will see how to take advantage of this fact in the next section.
A beautiful but perhaps lesser known spline result (see Lasserre [22]) gives that a polynomial p(t) of degree d, where d = 2k + 1 for some k \in N, is nonnegative on the interval [l, u) if and only if it can be written as follows:

p(t) = (u - t)[t]^\top Q_1 [t] + (t - l)[t]^\top Q_2 [t]    (1)

where [t] = (1, t, t^2, \ldots, t^k)^\top and Q_1 and Q_2 are (k + 1) \times (k + 1) symmetric positive semidefinite matrices. It follows that a piecewise polynomial of degree d with knots t_0, \ldots, t_I, defined as p^{(i)}(t) for t \in [t_{i-1}, t_i) for i = 1, \ldots, I, is nonnegative if and only if it can be written as:

p^{(i)}(t) = (t_i - t)[t]^\top Q_1^{(i)} [t] + (t - t_{i-1})[t]^\top Q_2^{(i)} [t]    (2)

for i = 1, \ldots, I, where each Q_1^{(i)} and Q_2^{(i)} is a (k + 1) \times (k + 1) symmetric positive semidefinite matrix. We can thus parameterize every nonnegative piecewise polynomial on our I intervals with (Q_1^{(i)}, Q_2^{(i)})_{i=1}^I. If no constraints are added on these parameters, the resulting piecewise polynomial might not be smooth, so certain constraints have to be enforced in order to guarantee that we are parameterizing a nonnegative spline and not just a nonnegative piecewise polynomial. To that end, we define C_1 as the set of (Q_1^{(i)}, Q_2^{(i)})_{i=1}^I such that:

p^{(i)}(t_i) = p^{(i+1)}(t_i) for i = 1, \ldots, I - 1    (3)

that is, C_1 is the set of parameters whose resulting piecewise polynomial as in equation 2 is continuous. Analogously, let C_j for j = 2, 3, \ldots be the set of (Q_1^{(i)}, Q_2^{(i)})_{i=1}^I such that:

\partial^{j-1}/\partial t^{j-1} p^{(i)}(t_i) = \partial^{j-1}/\partial t^{j-1} p^{(i+1)}(t_i) for i = 1, \ldots, I - 1    (4)

so that C_j is the set of parameters whose corresponding piecewise polynomials have matching left and right (j - 1)-th derivatives. Let C_0 be the set of (Q_1^{(i)}, Q_2^{(i)})_{i=1}^I which are symmetric positive semidefinite. We can then parameterize the set of nonnegative splines on [T_1, T_2) by \Psi = \cap_{j=0}^{s+1} C_j. Note that the case where d is even can be treated analogously (see appendix 1).

2.2 The Method of Alternating Projections

In order to use a DRS, f_\theta has to map to \Psi; that is, we need a way for a neural network to map to the parameter set corresponding to nonnegative splines. We achieve this by taking f_\theta(z) = h(\tilde{f}_\theta(z)), where \tilde{f}_\theta is an arbitrary neural network and h is a surjective function onto \Psi. The most natural choice for h is the projection onto \Psi. However, while computing the projection onto \Psi (for \Psi as in section 2.1) can be done by solving a convex optimization problem, it cannot be done analytically. This is an issue because when we train the model, we will need to differentiate f_\theta with respect to \theta. Note that Amos and Kolter [3] propose a method to have an optimization problem as a layer in a neural network. One might hope to use their method for our problem, but it cannot be applied due to the semidefinite constraint on our matrices.
The method of alternating projections [35, 4] allows us to approximately compute such a function h analytically. If C_0, \ldots, C_{s+1} are closed, convex sets in R^D, then the sequence \psi^{(k)} = P_{k \bmod (s+2)}(\psi^{(k-1)}) converges to a point in \cap_{j=0}^{s+1} C_j for any starting \psi^{(0)}, where P_j is the projection onto C_j for j = 0, \ldots, s + 1. The method of alternating projections thus consists of iteratively projecting onto each set in a cyclic fashion. We call computing \psi^{(k)} from \psi^{(k-1)} the k-th iteration of the method of alternating projections. This method can be useful to obtain a point in the intersection if each P_j can be easily computed.
In our case, projecting onto C_0 can be done by computing eigenvalue decompositions of Q_1^{(i)} and Q_2^{(i)} and zeroing out negative elements in the diagonal matrices containing the eigenvalues. While this projection might seem computationally expensive, the matrices are small and this can be done efficiently. For example, for cubic splines (d = 3), there are 2I matrices, each of size 2 \times 2.
Projecting onto C_j for j = 1, \ldots, s + 1 can be done analytically as it can be formulated as a quadratic optimization problem with linear constraints. Furthermore, because of the local nature of the constraints, where every interval is only constrained by its neighboring intervals, this quadratic optimization problem can be reduced to solving a tridiagonal system of linear equations of size I - 1, which can be solved efficiently in O(I) time with simplified Gaussian elimination. We prove this fact, using the KKT conditions, in appendix 2.
By letting h be the first M iterations of the method of alternating projections, we can ensure that f_\theta maps (approximately) to \Psi, while still being able to compute \nabla_\theta f_\theta(z).
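The projection onto the positive semidefinite set and the cyclic iteration can be sketched as follows. This is a simplified numpy illustration, not the paper's implementation: a single toy linear constraint stands in for the continuity sets C_1, ..., C_{s+1}, and the function names are ours.

```python
import numpy as np

def project_psd(A):
    """Project a symmetric matrix onto the PSD cone by zeroing out
    negative eigenvalues, as done for the Q matrices above."""
    w, V = np.linalg.eigh((A + A.T) / 2)
    return (V * np.maximum(w, 0.0)) @ V.T

def project_affine(A):
    """Projection onto the toy affine set {A : A[0, 0] = 1}, standing in
    for the continuity/smoothness constraints (which are also linear)."""
    B = A.copy()
    B[0, 0] = 1.0
    return B

def alternating_projections(A0, n_iter=500):
    """Cycle through the projections; converges to a point in the
    intersection of the two closed convex sets."""
    A = A0
    for _ in range(n_iter):
        A = project_psd(project_affine(A))
    return A

rng = np.random.default_rng(1)
A = alternating_projections(rng.normal(size=(3, 3)))
# A is now (numerically) PSD and satisfies the linear constraint.
```

In the model itself, a fixed number M of such iterations defines h, and each step is differentiable, so gradients flow through the eigendecompositions and linear solves.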
Note that we could find such an h function using Dykstra's algorithm (not to be confused with Dijkstra's shortest path algorithm), which is a modification of the method of alternating projections that converges to the projection of \psi^{(0)} onto \cap_{j=0}^{s+1} C_j [13, 5, 34], but we found that the method of alternating projections was faster to differentiate when using reverse mode automatic differentiation packages [1].
Another way of finding such an h would be unrolling any iterative optimization method that solves the projection onto \Psi, such as gradient-based methods or Newton methods. We found the alternating projections method more convenient as it does not involve additional hyperparameters, such as the learning rate, that drastically affect performance. Furthermore, the method of alternating projections is known to have a linear convergence rate (as fast as gradient-based methods) that is independent of the starting point [4]. This last observation is important, as the starting point in our case is determined by the output of \tilde{f}_\theta, so that the convergence rate being independent of the starting point ensures that \tilde{f}_\theta cannot learn to ignore h, which is not the case for gradient-based and Newton methods (for a fixed number of iterations and learning rate, there might exist an initial point that is too far away to actually reach the projection). Finally, note that if we wanted to enforce, for example, that the spline be monotonic, we could parameterize its derivative and force it to be nonnegative or nonpositive. Convexity or concavity can be enforced analogously.

3 Deep Random Splines as Intensity Functions of Point Processes

Since we will use DRS as intensity functions for Poisson processes, we begin this section with a brief review of these processes.

3.1 Poisson Processes

An inhomogeneous Poisson process in a set S is a random subset of S.
The process can (for our purposes) be parameterized by an intensity function g : S \to R_+, and in our case, S = [T_1, T_2). We write S \sim PP_S(g) to denote that the random set S, whose elements we call events, follows a Poisson process on S with intensity g. If S = \{x_k\}_{k=1}^K \sim PP_S(g), then |S \cap A|, the number of events in any A \subseteq S, follows a Poisson distribution with parameter \int_A g(t) dt, and the log likelihood of S is given by:

\log p(\{x_k\}_{k=1}^K | g) = \sum_{k=1}^K \log g(x_k) - \int_S g(t) dt    (5)

Splines have the very important property that they can be analytically integrated (as the integral of polynomials can be computed in closed form), which allows us to exactly evaluate the log likelihood in equation 5 when g is a spline. As a consequence, fitting a DRS to observed events is more tractable than fitting models that use GPs to represent g, such as log-Gaussian Cox processes [28]. Inference in the latter type of models is very challenging, despite some efforts by Cunningham et al. [8], Adams et al. [2], Lloyd et al. [24]. Splines also vary smoothly, which incorporates the reasonable assumption that the expected number of events changes smoothly over time. These properties were our main motivations for choosing splines to model intensity functions.

3.2 Our Model

Suppose we observe N simultaneous point processes in [T_1, T_2) for a total of R repetitions (we will call each one of these repetitions/samples a trial). Let X_{r,n} denote the n-th point process of the r-th trial. Looking ahead to an application we study in the results, data of this type is a standard setup for microelectrode array data, where N neurons are measured from time T_1 to time T_2 for R repetitions, and each event in the point processes corresponds to a spike (the time at which the neuron "fired"). Each X_{r,n} is also called a spike train.
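The tractability of equation 5 for piecewise polynomial g can be sketched as follows (a numpy illustration with helper names of our choosing): both the log term and the integral term are exact, since each polynomial piece integrates in closed form.

```python
import numpy as np

def poly_eval(c, t):
    """Evaluate a polynomial with coefficients c (constant term first)."""
    return sum(ci * t**p for p, ci in enumerate(c))

def poly_integral(c, a, b):
    """Closed-form integral of the polynomial over [a, b]."""
    antider = [ci / (p + 1) for p, ci in enumerate(c)]
    F = lambda t: sum(ai * t**(p + 1) for p, ai in enumerate(antider))
    return F(b) - F(a)

def pp_log_likelihood(events, knots, coeffs):
    """Equation 5: sum_k log g(x_k) minus the integral of g over [T1, T2),
    for a piecewise polynomial intensity g (coeffs[i] on interval i)."""
    idx = np.clip(np.searchsorted(knots, events, side="right") - 1,
                  0, len(knots) - 2)
    log_term = sum(np.log(poly_eval(coeffs[i], x)) for i, x in zip(idx, events))
    integral = sum(poly_integral(coeffs[i], knots[i], knots[i + 1])
                   for i in range(len(knots) - 1))
    return log_term - integral
```

For instance, with a constant intensity g = 2 on [0, 10) and events {1, 4, 7}, the log likelihood is 3 log 2 - 20.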
The model we propose, which we call DRS-VAE, is as follows:

Z_r \sim N(0, I_m) for r = 1, \ldots, R
\psi_{r,n} = f_\theta^{(n)}(Z_r) for n = 1, \ldots, N    (6)
X_{r,n} | \psi_{r,n} \sim PP_{[T_1, T_2)}(g_{\psi_{r,n}})

Figure 1: Encoder architecture.

where each f_\theta^{(n)} : R^m \to \Psi is obtained as described in section 2.2. The hidden state Z_r for the r-th trial X_r := (X_{r,1}, \ldots, X_{r,N}) can be thought of as a low-dimensional representation of X_r. Note that while the intensity function of every point process and every trial is a DRS, the latent state Z_r of each trial is shared among the N point processes. Note also that the data we are modeling can be thought of as R marked point processes [21], where the mark of the event x_{r,n,k} (the k-th event of the n-th point process of the r-th trial) is n. In this setting, g_{\psi_{r,n}} corresponds to the conditional (on Z_r and on the mark being n) intensity of the process for the r-th trial.
Once again, one might think that our parameterization of nonnegative splines is unnecessarily complicated, and that having f_\theta^{(n)} in equation 6 be a simpler parameterization of an arbitrary spline (e.g. basis coefficients) and using \tau(g_{\psi_{r,n}}) instead of g_{\psi_{r,n}}, where \tau is a nonnegative function, might be a better solution to enforcing nonnegativity constraints. The function \tau would have to be chosen in such a way that the integral of equation 5 can still be computed analytically, making \tau(t) = t^2 a natural choice. While this procedure would avoid having to use the method of alternating projections, we found that squared splines perform very poorly as they oscillate too much. Alternatively, we also tried using a B-spline basis with nonnegative coefficients, resulting in nonnegative splines.
While the approximation error between a nonnegative smooth function and its B-spline approximation with nonnegative coefficients can be bounded [33], note that not every nonnegative spline can be written as a linear combination of B-splines with nonnegative coefficients. In practice we found the bound to be too loose, and we also obtained better performance with the method of alternating projections.

3.3 Inference

Autoencoding variational Bayes [20] is a technique to perform inference in the following type of model:

Z_r \sim N(0, I_m) for r = 1, \ldots, R
X_r | Z_r \sim p_\theta(x | z_r)    (7)

where X_r are the observables and Z_r \in R^m the corresponding latents. Since maximum likelihood is not usually tractable, the posterior p(z|x) is approximated with q_\phi(z|x), which is given by:

q_\phi(z|x) = \prod_{r=1}^R q_\phi(z_r|x_r), with q_\phi(z_r|x_r) = N(\mu_\phi(x_r), diag(\sigma^2_\phi(x_r)))    (8)

where the encoder (\mu_\phi, \sigma_\phi) is a neural network parameterized by \phi. The ELBO L, a lower bound of the log likelihood, is then maximized over both the generative parameters \theta and the variational parameters \phi:

L(\theta, \phi) = \sum_{r=1}^R -KL(q_\phi(z_r|x_r) || p(z_r)) + E_{q_\phi(z_r|x_r)}[\log p_\theta(x_r|z_r)]    (9)

Maximizing the ELBO with stochastic gradient methods is enabled by the use of the reparameterization trick. In order to perform inference in our model, we use autoencoding variational Bayes. Because of the point process nature of the data, \mu_\phi and \sigma_\phi require a recurrent architecture, since their input x_r = (x_{r,1}, x_{r,2}, \ldots, x_{r,N}) consists of N point processes.
This input does not consist of a single sequence, but of N sequences of different lengths (numbers of events), which requires a specialized architecture. We use N separate LSTMs [18], one per point process. Each LSTM takes as input the events of the corresponding point process. The final states of each LSTM are then concatenated and transformed through a dense layer (followed by an exponential activation in the case of \sigma_\phi to ensure positivity) in order to map to the hidden space R^m. We also tried bidirectional LSTMs [17] but found regular LSTMs to be faster while having similar performance. The architecture is depicted in figure 1. The ELBO for our model is then given by:

L(\theta, \phi) = \sum_{r=1}^R -KL(q_\phi(z_r|x_r) || p(z_r)) + E_{q_\phi(z_r|x_r)}\Big[\sum_{n=1}^N \Big(\sum_{k=1}^{K_{r,n}} \log g_{\psi_{r,n}}(x_{r,n,k}) - \int_S g_{\psi_{r,n}}(t) dt\Big)\Big]    (10)

where K_{r,n} is the number of events in the n-th point process of the r-th trial. Gao et al. [15] have a similar model, where a hidden Markov model is transformed through a neural network to obtain event counts on time bins. The hidden state for a trial in their model is then an entire hidden Markov chain, which will have significantly higher dimension than our hidden state. Also, their model can be recovered from ours if we change the standard Gaussian distribution of Z_r in equation 6 to reflect their Markovian structure and choose G to be piecewise constant, nonnegative functions. We also emphasize the fact that our model is very easy to extend: for example, it would be straightforward to extend it to multi-dimensional point processes (not neural data any more) by changing G and its parameterization. It is also straightforward to use a more complicated point process than the Poisson one by allowing the intensity to depend on previous event history.
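A one-sample reparameterized estimate of a trial's contribution to equation 10 can be sketched as below. This is a numpy illustration under our own naming: `log_lik` stands in for the bracketed point-process term, and the closed-form KL divergence between a diagonal Gaussian and N(0, I) is the standard one.

```python
import numpy as np

def kl_diag_gaussian(mu, sigma2):
    """KL( N(mu, diag(sigma2)) || N(0, I) ), in closed form."""
    return 0.5 * np.sum(sigma2 + mu**2 - 1.0 - np.log(sigma2))

def elbo_one_sample(mu, sigma2, log_lik, rng):
    """Single-sample ELBO estimate for one trial: negative KL plus the
    data term evaluated at a reparameterized draw z ~ q(z | x)."""
    eps = rng.normal(size=mu.shape)      # reparameterization trick
    z = mu + np.sqrt(sigma2) * eps       # differentiable in mu and sigma2
    return -kl_diag_gaussian(mu, sigma2) + log_lik(z)
```

Writing z as a deterministic function of (mu, sigma2) and noise eps is what lets stochastic gradients of the ELBO flow into the encoder parameters.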
Furthermore, DRS can be used in settings that require random functions, even if no point process is involved.
One of the advantages of our method is that it scales well (not cubically, like most GP methods) with respect to most of its parameters, such as the number of trials, number of knots, number of iterations of the alternating projections algorithm, hidden dimension and number of neurons. The only parameter with which our method does not scale as well is the number of spikes, since the LSTM-based encoder has to process every spike individually (not spike counts over time bins). However, this issue can be addressed by using a non-amortized inference approach (i.e. not having an encoder and having separate variational parameters for each trial). We found that the amortized approach using our proposed encoder was better for the datasets we analyzed, but even larger datasets might benefit from the non-amortized approach.

4 Experiments

4.1 Simulated Data

We simulated data with the following procedure: First, we set 2 different types of trials. For each type of trial, we sampled one true intensity function on [0, 10) for each of the N = 2 point processes by sampling from a GP and exponentiating the result. We then sampled 600 times from each type of trial, resulting in 1200 trials. We randomly selected 1000 trials for training and set aside the rest for testing. We then fit the model described in section 3.2 and compare against other methods that perform intensity estimation while recovering a low-dimensional representation of each trial: the PP-GPFA model [12], the PfLDS model [15] and the GPFA model [38]. The latter two models discretize time into B time bins and have a latent variable per time bin and per trial (as opposed to our model, which has one per trial), while the former recovers continuous latent trajectories.
They do this as a way of enforcing temporal smoothness by placing an appropriate prior over their latent trajectories, which we do not have to do as we implicitly enforce temporal smoothness by using splines to model intensity functions. Note that Du et al. [10], Yang et al. [37], Mei and Eisner [26] and Du et al. [11] all propose related methods in which the intensity of point processes is estimated. However, we do not compare against these, as the first two model dynamic networks, making a direct comparison difficult, and the last two do not use latent variables, which is one of the main advantages and goals of our method as a way to perform dimensionality reduction for neural population data.
We used a uniform grid with 11 knots (resulting in I = 10 intervals), d = 3 and s = 2. Since a twice-differentiable cubic spline on I intervals has I + 3 degrees of freedom, when discretizing time for PfLDS and GPFA we use B = I + 3 = 13 time bins. This way the distribution recovered by PfLDS also has B = 13 degrees of freedom, while the distribution recovered by GPFA has even more. We set the latent dimension m in our model to 2 and we also set the latent dimension per time bin in PfLDS and GPFA to 2, meaning that the overall latent dimension for an entire trial was 2B = 26. These two choices make the comparison conservative as they allow more flexibility for the two competing methods than for ours. For PP-GPFA we set the continuous latent trajectory to have dimension 2.

Figure 2: Posterior means of the hidden variables of DRS-VAE by type of trial on simulated data (top left panel), QQ-plot of time-rescaled intensities on simulated data (bottom left panel), comparison of posterior intensities of our method (DRS-VAE) against competing alternatives on simulated data (top right panel) and reaching data (bottom right panel).
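The simulated data generation described above can be sketched as follows (numpy only; the RBF kernel hyperparameters and the use of thinning to sample events are our choices for illustration, not stated in the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
grid = np.linspace(0.0, 10.0, 200)

# Draw a log intensity from a GP with an RBF kernel and exponentiate it,
# yielding a nonnegative intensity on [0, 10).
K = np.exp(-0.5 * (grid[:, None] - grid[None, :])**2 / 1.0**2)
log_g = rng.multivariate_normal(np.zeros_like(grid), K + 1e-8 * np.eye(len(grid)))
g = np.exp(log_g)

# Sample events by thinning (Lewis-Shedler style): propose homogeneous
# events at the max rate, keep each with probability g(t) / g_max.
g_max = g.max()
n_prop = rng.poisson(g_max * 10.0)
proposals = rng.uniform(0.0, 10.0, size=n_prop)
g_at = np.interp(proposals, grid, g)
events = np.sort(proposals[rng.uniform(size=n_prop) < g_at / g_max])
```

Repeating this per trial type and per point process, then sampling many trials per type, gives data of the kind used in the comparison.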
Our architecture and hyperparameter choices are included in appendix 3.
The top left panel of figure 2 shows the posterior means of the hidden variables in our model for each of the 200 test trials. Each posterior mean is colored according to its type of trial. We can see that different types of trials form separate clusters, meaning that our model successfully obtains low-dimensional representations of the trials. Note that the model is trained without having access to the type of each trial; colors are assigned in the figure post hoc. The top right panel shows the events (in black) for a particular point process on a particular trial, along with the true intensity (in green) that generated the events and posterior samples of the corresponding intensities from our model (in purple), PP-GPFA (in orange), PfLDS (in blue), and GPFA (in red). Note that since PfLDS and GPFA parameterize the number of counts on each time bin, they do not have a corresponding intensity. We instead plot a piecewise constant intensity on each time bin such that the expected number of events in each time bin is equal to the integral of the intensity. We can see that our method recovers a smooth function that is closer to the truth than the ones recovered with competing methods. The bottom left panel of figure 2 further illustrates this point with a QQ-plot (where time is rescaled as in [6]), and we can see once again that our method recovers intensities that are closer to the truth.
Table 1 shows the performance of our model compared against PP-GPFA, PfLDS and GPFA. The second column shows the per-trial ELBO on test data, and we can see that our model has a larger ELBO than the alternatives. While having a better ELBO does not imply that our log likelihood is better, it does suggest that it is.
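The QQ-plot relies on the time-rescaling theorem [6]: if the events truly come from a Poisson process with cumulative intensity Lambda, the rescaled interevent times are i.i.d. Exp(1), so their transforms 1 - exp(-delta) should look uniform on [0, 1]. A small sketch of the rescaling (function name and the constant-rate example are ours):

```python
import numpy as np

def rescaled_uniforms(events, cum_intensity):
    """Map sorted event times through the cumulative intensity Lambda and
    return sorted u_k = 1 - exp(-delta_k), which should be ~Uniform(0, 1)
    under the model being checked."""
    lam = np.asarray([cum_intensity(x) for x in events])
    deltas = np.diff(np.concatenate([[0.0], lam]))   # rescaled waiting times
    return np.sort(1.0 - np.exp(-deltas))            # observed quantiles

# Roughly a rate-50 homogeneous process on [0, 10): with the matching
# cumulative intensity Lambda(t) = 50 t, the observed quantiles fall near
# the theoretical uniform quantiles (k - 0.5) / K, i.e. near the diagonal.
rng = np.random.default_rng(3)
events = np.sort(rng.uniform(0.0, 10.0, size=500))
obs = rescaled_uniforms(events, lambda t: 50.0 * t)
```

Plotting `obs` against the theoretical uniform quantiles gives the kind of QQ-plot shown in the figure; deviations from the diagonal indicate a misspecified intensity.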
Since both PfLDS and GPFA put a distribution on event counts on time bins instead of a distribution on event times as our model does, the log likelihoods are not directly comparable. However, in the case of PfLDS, we can easily convert from the Poisson likelihood on time bins to the piecewise constant intensity Poisson process likelihood, so that the numbers become comparable. In order to get a quantitative comparison between our model and GPFA, we take advantage of the fact that we know the true intensity that generated the data and compare the average L2 distance, across point processes and trials, between posterior intensity samples and the actual intensity function. Once again, we can see that our method outperforms the alternatives. Table 1 also includes the standard deviation of these L2 distances. Since the standard deviations are somewhat large in comparison to the means, for each of the competing alternatives we carry out a two-sample t-test comparing the L2 distance means obtained with our method against the alternative. The p-values indicate that our method recovers intensity functions that are closer to the truth in a statistically significant way.

4.2 Real Data

4.2.1 Reaching Data

We also fit our model to the dataset collected by Churchland et al. [7]. The dataset, after preprocessing (see appendix 4 for details), consists of measurements of 20 neurons of a primate for 3590 trials on the interval [-100, 300) (in ms).
In each trial, the primate reaches with its arm to a specific location, which changes from trial to trial (we can think of the 40 locations as trial types), and time 0 corresponds to the beginning of the movement. We randomly split the data into a training set with 3000 trials and a test set with the remaining trials.
We used twice-differentiable cubic splines with 18 uniformly spaced knots (that is, 17 intervals). For the comparison against PfLDS, we split time into 20 bins, resulting in time bins of 20 ms (a standard length), once again making sure that the degrees of freedom are comparable. This again makes for a conservative comparison, as we fix the number of knots in our model so that its degrees of freedom match those of the already tuned alternative, instead of tuning the number of knots directly. Further architectural details are included in appendix 3. Since we do not have access to the ground truth, we do not compare against GPFA, as the L2 metric computed in the previous section cannot be used here. Again, we used a hidden dimension m = 2 for our model, resulting in hidden trajectories of dimension 40 for PfLDS and continuous trajectories of dimension 2 for PP-GPFA. We experimented with larger values of m but did not observe significant improvements in either model.
The bottom right panel of figure 2 shows the spike train (black) for a particular neuron on a particular trial, along with posterior samples of the corresponding intensity from our model (in purple), PP-GPFA (in orange) and PfLDS (in blue).
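As a sanity check on the degrees-of-freedom matching above (a minimal sketch of the count only; it ignores the nonnegativity constraint, which restricts the parameters without changing their number): a cubic spline with C² continuity over K intervals has 4K − 3(K − 1) = K + 3 free parameters, so 17 intervals give 20, the same as the 20 time bins per neuron used for PfLDS.

```python
def spline_dof(n_intervals, degree=3, smoothness=2):
    """Free parameters of a piecewise polynomial of the given degree whose
    value and first `smoothness` derivatives are continuous at the
    n_intervals - 1 interior knots."""
    coefficients = (degree + 1) * n_intervals          # per-interval polynomials
    constraints = (smoothness + 1) * (n_intervals - 1)  # continuity conditions
    return coefficients - constraints

# Reaching data: 18 knots -> 17 intervals; PfLDS uses 20 time bins.
print(spline_dof(17))  # 20
# Cycling data (section 4.2.2): 26 knots -> 25 intervals; PfLDS uses 28 bins.
print(spline_dof(25))  # 28
```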
We can see that the posterior samples from our method look more plausible and smoother than the other ones.
Table 1 also shows the per-trial ELBO on test data for our model and for the competing alternatives. Again, our model has a larger ELBO, even though PfLDS has access to 20 times more hidden dimensions: our method is more successful at producing low-dimensional representations of trials than PfLDS. The table also shows the percentage of correctly predicted test trial types when using 15-nearest neighbors on the posterior means of train data (the entire trajectories are used for PfLDS, and 20 uniformly spaced points along each dimension of the continuous trajectories of PP-GPFA, resulting in 40-dimensional latent representations). While 23.7% might seem small, it should be noted that it is significantly better than random guessing (which would have 2.5% accuracy) and that the model was not trained to minimize this objective. Regardless, we can see that our method outperforms both PP-GPFA and PfLDS in this metric, even when using a much lower-dimensional representation of each trial. The table also includes the percentage of explained variation when doing ANOVA on the test posterior means (denoted SSG/SST), using trial type as groups.

Table 1: Quantitative comparison of our method (DRS-VAE) against competing alternatives.

                  SIMULATED DATA               |      REACHING DATA        |      CYCLING DATA
METHOD     ELBO   L2            p-VALUE        |  ELBO     15-NN   SSG/SST |  ELBO   15-NN   SSG/SST
DRS-VAE    57.1   0.11 ± 0.09   −              |  −500.8   23.7%   73.9%   |  6372   55.9%   70.0%
PfLDS      52.3   0.21 ± 0.10   < 10^-44       |  −505.7   3.1%    6.2%    |  6532   11.7%   3.2%
GPFA       −      0.21 ± 0.10   < 10^-45       |  −        −       −       |  −      −       −
PP-GPFA    29.0   0.38 ± 0.24   < 10^-70       |  −523.2   14.1%   30.5%   |  6079   51.1%   14.6%
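The SSG/SST statistic is the usual one-way ANOVA ratio of between-group to total sum of squares. A minimal sketch of how it could be computed on the latent posterior means (the function name and array layout, trials by latent dimensions, are our own; this is not code from the paper):

```python
import numpy as np

def explained_variation(posterior_means, trial_types):
    """SSG/SST: between-group sum of squares over total sum of squares,
    with trial types as groups, summed over latent dimensions."""
    z = np.asarray(posterior_means, dtype=float)  # (n_trials, latent_dim)
    labels = np.asarray(trial_types)
    grand_mean = z.mean(axis=0)
    sst = ((z - grand_mean) ** 2).sum()           # total sum of squares
    ssg = sum(                                    # between-group sum of squares
        (labels == g).sum() * ((z[labels == g].mean(axis=0) - grand_mean) ** 2).sum()
        for g in np.unique(labels)
    )
    return ssg / sst

# Two perfectly separated trial types give SSG/SST = 1.
print(explained_variation([[0, 0], [0, 0], [10, 10], [10, 10]], [0, 0, 1, 1]))  # 1.0
```

A value near 1 indicates that trial type explains most of the variation in the latent posterior means.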
Once again, we can see that our model recovers a more meaningful representation of the trials.

4.2.2 Cycling Data

We also fit our model to our newly collected dataset. After preprocessing (see supplementary material), it consists of 1300 train and 188 test trials. During each trial, 20 neurons were recorded as the primate turned a hand-held pedal to navigate through a virtual environment. There are 8 trial types, based on whether the primate pedaled forward or backward and over what distance. We use the same hyperparameter settings as for the reaching data, except that we use 26 uniformly spaced knots (25 intervals) and 28 bins for PfLDS, as well as a hidden dimension m = 10, resulting in hidden trajectories of dimension 280 for PfLDS (analogously, we set PP-GPFA to have 10-dimensional continuous trajectories, and take 28 uniformly spaced points along each dimension to obtain 280-dimensional latent representations). Results are also summarized in table 1. We can see that while our ELBO is higher than that of PP-GPFA, it is actually lower than that of PfLDS, which we believe is caused by an artifact of preprocessing the data rather than any essential performance loss.
While the ELBO was better for PfLDS, the quality of our latent representations is significantly better, as shown by the accuracy of 15-nearest neighbors at predicting test trial types (random guessing would have 12.5% accuracy) and the ANOVA percentage of explained variation of the test posterior means, both of which are also better than for PP-GPFA. This is particularly impressive as our latent representations have 28 times fewer dimensions. We did experiment with different hyperparameter settings, and found that the ELBO of PfLDS increased slightly when using more time bins (at the cost of even higher-dimensional latent representations), whereas our ELBO remained the same when increasing the number of intervals.
However, even in this setting the accuracy of 15-nearest neighbors and the percentage of explained variation did not improve for PfLDS.

5 Conclusions

In this paper we introduced Deep Random Splines, an alternative to Gaussian processes for modeling random functions. Owing to our key modeling choices and use of results from the spline and optimization literatures, fitting DRS is tractable and allows one to enforce shape constraints on the random functions. While we only enforced nonnegativity and smoothness in this paper, it is straightforward to enforce constraints such as monotonicity (or convexity/concavity). We also proposed a variational autoencoder that takes advantage of DRS to accurately model and produce meaningful low-dimensional representations of neural activity.
Future work includes using DRS-VAE for multi-dimensional point processes, for example spatial point processes. While splines would become harder to use in such a setting, they could be replaced by any family of easily-integrable nonnegative functions, such as conic combinations of Gaussian kernels. Another line of future work involves using a more complicated point process than the Poisson, for example a Hawkes process, by allowing the parameters of the spline in a given interval to depend on the spiking history of previous intervals. Finally, DRS can be applied in settings more general than the one explored in this paper: they can be used wherever a random function is involved, and thus have many potential applications beyond what we analyzed here.

Acknowledgments

We thank the Simons Foundation, Sloan Foundation, McKnight Endowment Fund, NIH NINDS 5R01NS100066, NSF 1707398, and the Gatsby Charitable Foundation for support.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: a system for large-scale machine learning.
In OSDI, volume 16, pages 265–283, 2016.
[2] R. P. Adams, I. Murray, and D. J. MacKay. Tractable nonparametric bayesian inference in poisson processes with gaussian process intensities. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 9–16. ACM, 2009.
[3] B. Amos and J. Z. Kolter. Optnet: Differentiable optimization as a layer in neural networks. In International Conference on Machine Learning, pages 136–145, 2017.
[4] H. H. Bauschke and J. M. Borwein. On projection algorithms for solving convex feasibility problems. SIAM review, 38(3):367–426, 1996.
[5] J. P. Boyle and R. L. Dykstra. A method for finding projections onto the intersection of convex sets in hilbert spaces. In Advances in order restricted statistical inference, pages 28–47. Springer, 1986.
[6] E. N. Brown, R. Barbieri, V. Ventura, R. E. Kass, and L. M. Frank. The time-rescaling theorem and its application to neural spike train data analysis. Neural computation, 14(2):325–346, 2002.
[7] M. M. Churchland, J. P. Cunningham, M. T. Kaufman, J. D. Foster, P. Nuyujukian, S. I. Ryu, and K. V. Shenoy. Neural population dynamics during reaching. Nature, 487(7405):51, 2012.
[8] J. P. Cunningham, K. V. Shenoy, and M. Sahani. Fast gaussian process methods for point process intensity estimation. In Proceedings of the 25th international conference on Machine learning, pages 192–199. ACM, 2008.
[9] I. DiMatteo, C. R. Genovese, and R. E. Kass. Bayesian curve-fitting with free-knot splines. Biometrika, 88(4):1055–1071, 2001.
[10] N. Du, L. Song, M. Yuan, and A. J. Smola. Learning networks of heterogeneous influence. In Advances in Neural Information Processing Systems, pages 2780–2788, 2012.
[11] N. Du, H. Dai, R. Trivedi, U. Upadhyay, M. Gomez-Rodriguez, and L. Song. Recurrent marked temporal point processes: Embedding event history to vector.
In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1555–1564. ACM, 2016.
[12] L. Duncker and M. Sahani. Temporal alignment and latent gaussian process factor inference in population spike trains. In Advances in Neural Information Processing Systems, pages 10445–10455, 2018.
[13] R. L. Dykstra. An algorithm for restricted least squares regression. Journal of the American Statistical Association, 78(384):837–842, 1983.
[14] S. Flaxman, A. Wilson, D. Neill, H. Nickisch, and A. Smola. Fast kronecker inference in gaussian processes with non-gaussian likelihoods. In International Conference on Machine Learning, pages 607–616, 2015.
[15] Y. Gao, E. W. Archer, L. Paninski, and J. P. Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, pages 163–171, 2016.
[16] E. Gilboa, Y. Saatçi, and J. P. Cunningham. Scaling multidimensional inference for structured gaussian processes. IEEE transactions on pattern analysis and machine intelligence, 37(2):424–436, 2015.
[17] A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
[18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
[19] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in neural information processing systems, pages 2946–2954, 2016.
[20] D. P. Kingma and M. Welling. Auto-encoding variational bayes. In International Conference on Learning Representations, 2014.
[21] J. F. C. Kingman. Poisson processes, volume 3. Clarendon Press, 1992.
[22] J.-B. Lasserre.
Moments, positive polynomials and their applications, volume 1. World Scientific, 2010.
[23] L. Lin and D. B. Dunson. Bayesian monotone regression using gaussian process projection. Biometrika, 101(2):303–317, 2014.
[24] C. Lloyd, T. Gunter, M. Osborne, and S. Roberts. Variational inference for gaussian process modulated poisson processes. In International Conference on Machine Learning, pages 1814–1822, 2015.
[25] E. Mammen. Estimating a smooth monotone regression function. The Annals of Statistics, pages 724–740, 1991.
[26] H. Mei and J. M. Eisner. The neural hawkes process: A neurally self-modulating multivariate point process. In Advances in Neural Information Processing Systems, pages 6754–6764, 2017.
[27] S. Mohamed and B. Lakshminarayanan. Learning in implicit generative models. In International Conference on Learning Representations, 2017.
[28] J. Møller, A. R. Syversveen, and R. P. Waagepetersen. Log gaussian cox processes. Scandinavian journal of statistics, 25(3):451–482, 1998.
[29] J. O. Ramsay. Monotone regression splines in action. Statistical science, pages 425–441, 1988.
[30] C. E. Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
[31] D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286, 2014.
[32] J. W. Schmidt and W. Hess. Positivity of cubic polynomials on intervals and positive spline interpolation. BIT Numerical Mathematics, 28(2):340–352, 1988.
[33] W. Shen, S. Ghosal, et al. Adaptive bayesian density regression for high-dimensional data. Bernoulli, 22(1):396–420, 2016.
[34] R. J. Tibshirani. Dykstra's algorithm, admm, and coordinate descent: Connections, insights, and extensions.
In Advances in Neural Information Processing Systems, pages 517–528, 2017.
[35] J. von Neumann. The geometry of orthogonal spaces, functional operators-vol. ii. Annals of Math. Studies, 22, 1950.
[36] G. Wahba. Spline models for observational data, volume 59. Siam, 1990.
[37] J. Yang, V. Rao, and J. Neville. Decoupling homophily and reciprocity with latent space network models. In UAI, 2017.
[38] M. B. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In Advances in neural information processing systems, pages 1881–1888, 2009.
[39] M. Yuan and Y. Lin. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(1):49–67, 2006.