{"title": "Extracting low-dimensional dynamics from multiple large-scale neural population recordings by learning to predict correlations", "book": "Advances in Neural Information Processing Systems", "page_first": 5702, "page_last": 5712, "abstract": "A powerful approach for understanding neural population dynamics is to extract low-dimensional trajectories from population recordings using dimensionality reduction methods. Current approaches for dimensionality reduction on neural data are limited to single population recordings, and can not identify dynamics embedded across multiple measurements. We propose an approach for extracting low-dimensional dynamics from multiple, sequential recordings. Our algorithm scales to data comprising millions of observed dimensions, making it possible to access dynamics distributed across large populations or multiple brain areas. Building on subspace-identification approaches for dynamical systems, we perform parameter estimation by minimizing a moment-matching objective using a scalable stochastic gradient descent algorithm: The model is optimized to predict temporal covariations across neurons and across time. We show how this approach naturally handles missing data and multiple partial recordings, and can identify dynamics and predict correlations even in the presence of severe subsampling and small overlap between recordings. We demonstrate the effectiveness of the approach both on simulated data and a whole-brain larval zebrafish imaging dataset.", "full_text": "Extracting low-dimensional dynamics from\n\nmultiple large-scale neural population recordings\n\nby learning to predict correlations\n\nMarcel Nonnenmacher1, Srinivas C. Turaga2 and Jakob H. 
Macke1\u2217\n\n1research center caesar, an associate of the Max Planck Society, Bonn, Germany\n\njakob.macke@caesar.de\n\n2HHMI Janelia Research Campus, Ashburn, VA\n\nmarcel.nonnenmacher@caesar.de, turagas@janelia.hhmi.org\n\nAbstract\n\nA powerful approach for understanding neural population dynamics is to extract\nlow-dimensional trajectories from population recordings using dimensionality\nreduction methods. Current approaches for dimensionality reduction on neural\ndata are limited to single population recordings, and can not identify dynamics\nembedded across multiple measurements. We propose an approach for extracting\nlow-dimensional dynamics from multiple, sequential recordings. Our algorithm\nscales to data comprising millions of observed dimensions, making it possible\nto access dynamics distributed across large populations or multiple brain areas.\nBuilding on subspace-identi\ufb01cation approaches for dynamical systems, we perform\nparameter estimation by minimizing a moment-matching objective using a scalable\nstochastic gradient descent algorithm: The model is optimized to predict temporal\ncovariations across neurons and across time. We show how this approach naturally\nhandles missing data and multiple partial recordings, and can identify dynamics\nand predict correlations even in the presence of severe subsampling and small\noverlap between recordings. We demonstrate the effectiveness of the approach\nboth on simulated data and a whole-brain larval zebra\ufb01sh imaging dataset.\n\nIntroduction\n\n1\nDimensionality reduction methods based on state-space models [1, 2, 3, 4, 5] are useful for uncover-\ning low-dimensional dynamics hidden in high-dimensional data. These models exploit structured\ncorrelations in neural activity, both across neurons and over time [6]. 
This approach has been used to\nidentify neural activity trajectories that are informative about stimuli and behaviour and yield insights\ninto neural computations [7, 8, 9, 10, 11, 12, 13]. However, these methods are designed for analyzing\none population measurement at a time and are typically applied to population recordings of a few\ndozens of neurons, yielding a statistical description of the dynamics of a small sample of neurons\nwithin a brain area. How can we, from sparse recordings, gain insights into dynamics distributed\nacross entire circuits or multiple brain areas? One promising approach to scaling up the empirical\nstudy of neural dynamics is to sequentially record from multiple neural populations, for instance by\nmoving the \ufb01eld-of-view of a microscope [14]. Similarly, chronic multi-electrode recordings make it\npossible to record neural activity within a brain area over multiple days, but with neurons dropping\nin and out of the measurement over time [15]. While different neurons will be recorded in different\nsessions, we expect the underlying dynamics to be preserved across measurements.\nThe goal of this paper is to provide methods for extracting low-dimensional dynamics shared across\nmultiple, potentially overlapping recordings of neural population activity. Inferring dynamics from\n\n\u2217current primary af\ufb01liation: Centre for Cognitive Science, Technical University Darmstadt\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fsuch data can be interpreted as a missing-data problem in which data is missing in a structured manner\n(referred to as \u2019serial subset observations\u2019 [16], SSOs). Our methods allow us to capture the relevant\nsubspace and predict instantaneous and time-lagged correlations between all neurons, even when\nsubstantial blocks of data are missing. Our methods are highly scalable, and applicable to data sets\nwith millions of observed units. 
On both simulated and empirical data, we show that our methods extract low-dimensional dynamics and accurately predict temporal and cross-neuronal correlations.

Statistical approach: The standard approach for dimensionality reduction of neural dynamics is based on searching for a maximum of the log-likelihood via expectation-maximization (EM) [17, 18]. EM can be extended to missing data in a straightforward fashion, and SSOs allow for efficient implementations, as we will show below. However, we will also show that subsampled data can lead to slow convergence and high sensitivity to initial conditions. An alternative approach is given by subspace identification (SSID) [19, 20]. SSID algorithms are based on matching the moments of the model with those of the empirical data: The idea is to calculate the time-lagged covariances of the model as a function of the parameters. Then, spectral methods (e.g. singular value decompositions) are used to reconstruct parameters from empirically measured covariances. However, these methods scale poorly to high-dimensional datasets where it is impossible to even construct the time-lagged covariance matrix. Our approach is also based on moment-matching – rather than using spectral approaches, however, we use numerical optimization to directly minimize the squared error between empirical and reconstructed time-lagged covariances without ever explicitly constructing the full covariance matrix, yielding a subspace that captures both spatial and temporal correlations in activity. This approach readily generalizes to settings in which many data points are missing, as the corresponding entries of the covariance can simply be dropped from the cost function. In addition, it can also generalize to models in which the latent dynamics are nonlinear. Stochastic gradient methods make it possible to scale our approach to high-dimensional (p = 10^7) and long (T = 10^5) recordings.
We will show that use of temporal information (through time-lagged covariances) allows this approach to work in scenarios (low overlap between recordings) in which alternative approaches based on instantaneous correlations are not applicable [2, 21].

Related work: Several studies have addressed estimation of linear dynamical systems from subsampled data: Turaga et al. [22] used EM to learn high-dimensional linear dynamical models from multiple observations, an approach which they called 'stitching'. However, their model assumed high-dimensional dynamics, and is therefore limited to small population sizes (N ≈ 100). Bishop & Yu [23] studied the conditions under which a covariance matrix can be reconstructed from multiple partial measurements. However, their method and analysis were restricted to modelling time-instantaneous covariances, and did not include temporal activity correlations. In addition, their approach is not based on learning parameters jointly, but estimates the covariance in each observation-subset separately, and then aligns these estimates post-hoc. Thus, while this approach can be very effective and is important for theoretical analysis, it can perform sub-optimally when data is noisy. In the context of SSID methods, Markovsky [24, 25] derived conditions for the reconstruction of missing data from deterministic univariate linear time-invariant signals, and Liu et al. [26] use a nuclear norm-regularized SSID to reconstruct partially missing data vectors. Balzano et al. [21, 27] presented a scalable dimensionality reduction approach (GROUSE) for data with missing entries. This approach does not aim to capture temporal correlations, and is designed for data which is missing at random. Soudry et al.
[28] considered population subsampling from the perspective of inferring functional connectivity, but focused on observation schemes in which there are at least some simultaneous observations for each pair of variables.

2 Methods

2.1 Low-dimensional state-space models with linear observations

Model class: Our goal is to identify low-dimensional dynamics from multiple, partially overlapping recordings of a high-dimensional neural population, and to use them to predict neural correlations. We denote neural activity by Y = {y_t}_{t=1}^T, a length-T discrete-time sequence of p-dimensional vectors. We assume that the underlying n-dimensional dynamics x linearly modulate y,

y_t = C x_t + ε_t,   ε_t ∼ N(0, R)   (1)
x_{t+1} = f(x_t, η_t),   η_t ∼ p(η)   (2)

with diagonal observation noise covariance matrix R ∈ R^{p×p}. Thus, each observed variable y_t^{(i)}, i = 1, ..., p, is a noisy linear combination of the shared time-evolving latent modes x_t.

Figure 1: Identifying low-dimensional dynamics shared across neural recordings a) Different subsets of a large neural population are recorded sequentially (here: neurons 1 to 11, cyan, are recorded first, then neurons 10 to 20, green). b) Low-dimensional (n = 3) trajectories extracted from data in a: Our approach (orange) can extract the dynamics underlying the entire population, whereas an estimation on each of the two observed subsets separately will not be able to align dynamics across subsets. c) Subspace-maps (linear projection matrices C) inferred from each of the two observed subsets separately (and hence not aligned), and for the entire recording. d) Same information as in b, but as phase plots. e) Pairwise covariances – in this observation scheme, many covariances (red) are unobserved, but can be reconstructed using our approach. f) Recovery of unobserved pairwise covariances (red).
Our approach is able to recover the unobserved covariance across subsets.

We consider stable latent zero-mean dynamics on x with time-lagged covariances Π_s := Cov[x_{t+s}, x_t] ∈ R^{n×n} for time-lag s ∈ {0, ..., S}. Time-lagged observed covariances Λ(s) ∈ R^{p×p} can be computed from Π_s as

Λ(s) := C Π_s C^⊤ + δ_{s=0} R.   (3)

An important special case is the classical linear dynamical system (LDS) with f(x_t, η_t) = A x_t + η_t, with η_t ∼ N(0, Q) and Π_s = A^s Π_0. As we will see below, our SSID algorithm works directly on these time-lagged covariances, so it is also applicable to generative models with non-Markovian Gaussian latent dynamics, e.g. Gaussian Process Factor Analysis [2].

Partial observations and missing data: We treat multiple partial recordings as a missing-data problem – we use y_t to model all activity measurements across multiple experiments, and assume that at any time t, only some of them will be observed. As a consequence, the data-dimensionality p could now easily be comprised of thousands of neurons, even if only small subsets are observed at any given time. We use index sets Ω_t ⊆ {1, ..., p}, where i ∈ Ω_t indicates that variable i is observed at time point t. We obtain empirical estimates of time-lagged pairwise covariances for each variable pair (i, j) over all of those time points where the pair of variables is jointly observed with time-lag s. We define co-occurrence counts T^s_ij = |{t | i ∈ Ω_{t+s} ∧ j ∈ Ω_t}|.

In total there could be up to S p^2 many co-occurrence counts – however, for SSOs the number of unique counts is dramatically lower. To capitalize on this, we define co-occurrence groups F ⊆ {1, ..., p}, subsets of variables with identical observation patterns: ∀i, j ∈ F ∀t ≤ T : i ∈ Ω_t iff j ∈ Ω_t.
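The observation scheme above can be made concrete with a small sketch: a boolean mask encodes the index sets Ω_t, and the co-occurrence counts T^s_ij follow from a lagged matrix product. This is an illustration in our own notation, not the authors' implementation.

```python
import numpy as np

def cooccurrence_counts(obs_mask, max_lag):
    """Co-occurrence counts T^s_ij = |{t : i in Omega_{t+s} and j in Omega_t}|
    from a (T, p) boolean observation mask (rows are the index sets Omega_t)."""
    T, p = obs_mask.shape
    M = obs_mask.astype(float)
    counts = np.empty((max_lag + 1, p, p))
    for s in range(max_lag + 1):
        # entry (i, j): number of t with i observed at t+s and j observed at t
        counts[s] = M[s:T].T @ M[:T - s]
    return counts

# Two serial subset observations ("SSOs") over p = 6 variables, overlap {2, 3}
T, p = 10, 6
mask = np.zeros((T, p), dtype=bool)
mask[:5, :4] = True   # first recording: variables 0..3
mask[5:, 2:] = True   # second recording: variables 2..5
T_s = cooccurrence_counts(mask, max_lag=2)
print(T_s[0, 0, 0], T_s[0, 0, 5])  # 5.0 0.0 -- variables 0 and 5 are never co-observed
```

Here the two observation subsets form the co-occurrence groups {0, 1}, {2, 3}, {4, 5}: all pairs within a group share their counts, which is what makes the number of unique counts small for SSOs.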
All element pairs (i, j) ∈ F^2 share the same co-occurrence count T^s_ij per time-lag s. Co-occurrence groups are non-overlapping and together cover the whole range {1, ..., p}. There might be pairs (i, j) which are never observed, i.e. for which T^s_ij = 0 for each s. We collect variable pairs co-observed at least twice at time-lag s, Ω^s = {(i, j) | T^s_ij > 1}. For these pairs we can calculate an unbiased estimate of the s-lagged covariance,

Cov[y_{t+s}^{(i)}, y_t^{(j)}] ≈ (1 / (T^s_ij − 1)) Σ_t y_{t+s}^{(i)} y_t^{(j)} =: Λ̃(s)_{(ij)}.   (4)

2.2 Expectation maximization for stitching linear dynamical systems

EM can readily be extended to missing data by removing likelihood-terms corresponding to missing data [29]. In the E-step of our stitching-version of EM (sEM), we use the default Kalman filter and smoother equations with subindexed C_t = C_{(Ω_t,:)} and R_t = R_{(Ω_t,Ω_t)} parameters for each time point t. We speed up the E-step by tracking convergence of latent posterior covariances, and stop updating these when they have converged [30] – for long T, this can result in considerably faster smoothing. For the M-step, we adapt maximum likelihood estimates of parameters θ = {A, Q, C, R}. Dynamics parameters (A, Q) are unaffected by SSOs.
The update for C is given by

C_{(i,:)} = ( Σ_t y_t^{(i)} E[x_t]^⊤ − (1/|O_i|) (Σ_t y_t^{(i)}) (Σ_t E[x_t]^⊤) )
          × ( Σ_t E[x_t x_t^⊤] − (1/|O_i|) (Σ_t E[x_t]) (Σ_t E[x_t]^⊤) )^{−1},   (5)

where O_i = {t | i ∈ Ω_t} is the set of time points for which y_i is observed, and all sums are over t ∈ O_i. For SSOs, we use temporal structure in the observation patterns Ω_t to avoid unnecessary calculations of the inverse in (5): all elements i of a co-occurrence group share the same O_i.

2.3 Scalable subspace-identification with missing data via moment-matching

Subspace identification: Our algorithm (Stitching-SSID, S3ID) is based on moment-matching approaches for linear systems [31]. We will show that it provides robust initialisation for EM, and that it performs more robustly (in the sense of yielding samples which more closely capture empirically measured correlations, and predict missing ones) on non-Gaussian and nonlinear data. For fully observed linear dynamics, statistically consistent estimators for θ = {C, A, Π_0, R} can be obtained from {Λ̃(s)}_s [20] by applying an SVD to the pK × pL block Hankel matrix H with blocks H_{k,l} = Λ̃(k + l − 1). For our situation with large p and massively missing entries in Λ̃(s), we define an explicit loss function which penalizes the squared difference between empirically observed covariances and those predicted by the parametrised model (3),

L(C, {Π_s}, R) = (1/2) Σ_s r_s ||Λ(s) − Λ̃(s)||^2_{Ω^s},   (6)

where || · ||_Ω denotes the Frobenius norm applied to all elements in index set Ω. For linear dynamics, we constrain Π_s by setting Π_s = A^s Π_0 and optimize over A instead of over Π_s.
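A minimal sketch of the moment-matching loss (6) for the linear case, with Π_s = A^s Π_0 and Λ(s) = C Π_s C^⊤ + δ_{s0} diag(R); function and variable names are ours, and the explicit mask plays the role of the index sets Ω^s:

```python
import numpy as np

def s3id_loss(C, A, Pi0, R, Lam_emp, counts, r=None):
    """Moment-matching loss in the spirit of Eq. (6) for linear latent
    dynamics: Pi_s = A^s Pi_0 and Lambda(s) = C Pi_s C^T + delta_{s0} diag(R).

    Lam_emp: list of (p, p) empirical time-lagged covariance estimates.
    counts:  list of (p, p) co-occurrence counts; only pairs with
             counts[s] > 1 (the sets Omega^s) enter the loss.
    r:       optional per-lag weights r_s (default: all ones).
    """
    loss, Pi_s = 0.0, Pi0.copy()
    for s, (Lam_t, cnt) in enumerate(zip(Lam_emp, counts)):
        Lam_model = C @ Pi_s @ C.T + (np.diag(R) if s == 0 else 0.0)
        w = 1.0 if r is None else r[s]
        mask = cnt > 1
        loss += 0.5 * w * np.sum((Lam_model - Lam_t)[mask] ** 2)
        Pi_s = A @ Pi_s  # advance Pi_{s+1} = A Pi_s
    return loss

rng = np.random.default_rng(0)
p, n, S = 5, 2, 4
C, A, Pi0, R = rng.normal(size=(p, n)), 0.9 * np.eye(n), np.eye(n), 0.1 * np.ones(p)
# "Empirical" covariances generated from the model itself, all pairs co-observed
Lam_emp = [C @ np.linalg.matrix_power(A, s) @ Pi0 @ C.T + (np.diag(R) if s == 0 else 0)
           for s in range(S)]
counts = [np.full((p, p), 10) for _ in range(S)]
print(s3id_loss(C, A, Pi0, R, Lam_emp, counts) < 1e-20)      # True: moments match
print(s3id_loss(C + 0.1, A, Pi0, R, Lam_emp, counts) > 0.0)  # True: mismatch is penalized
```

Never-observed pairs simply drop out through the mask, which is the property that makes the objective applicable to multiple partial recordings.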
We refer to this algorithm as 'linear S3ID', and to the general one as 'nonlinear S3ID'. However, we emphasize that only the latent dynamics are (potentially) nonlinear; dimensionality reduction is linear in both cases.

Optimization via stochastic gradients: For large-scale applications, explicit computation and storage of the observed Λ̃(s) is prohibitive since they can scale as |Ω^s| ∼ p^2, which renders computation of the full loss L impractical. We note, however, that the gradients of L are linear in Λ̃(s)_{(i,j)} ∝ Σ_t y_{t+s}^{(i)} y_t^{(j)}. This allows us to obtain unbiased stochastic estimates of the gradients by uniformly subsampling time points t and corresponding pairs of data vectors y_{t+s}, y_t with time-lag s, without explicit calculation of the loss L. The batch-wise gradients are given by

∂L_{t,s}/∂C_{(i,:)} = (Λ(s)_{(i,:)} − y_{t+s}^{(i)} y_t^⊤) N^{i,t}_s C Π_s^⊤ + ([Λ(s)^⊤]_{(i,:)} − y_t^{(i)} y_{t+s}^⊤) N^{i,t+s}_s C Π_s   (7)

∂L_{t,s}/∂Π_s = Σ_{i∈Ω_{t+s}} C^⊤_{(i,:)} (Λ(s)_{(i,:)} − y_{t+s}^{(i)} y_t^⊤) N^{i,t}_s C   (8)

∂L_{t,s}/∂R_{ii} = (δ_{s0}/T^0_{ii}) (Λ(0)_{(i,i)} − (y_t^{(i)})^2),   (9)

where N^{i,t}_s ∈ R^{p×p} is a diagonal matrix with [N^{i,t}_s]_jj = 1/T^s_ij if j ∈ Ω_t, and 0 otherwise. Gradients scale linearly in p both in memory and computation and allow us to minimize L without explicit computation of the empirical time-lagged covariances, or L itself. To monitor performance and convergence for large systems, we compute the loss over a random subset of covariances. The computation of gradients for C and R can be fully vectorized over all elements i of a co-occurrence group, as these share the same matrices N^{i,t}_s.
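The gradient in Eq. (8) can be written in vectorized form as C^⊤ [W ⊙ (Λ(s) − y_{t+s} y_t^⊤)] C, where W collects the 1/T^s_ij weights over observed pairs. The following sketch (our notation, not the paper's code) checks this against a finite-difference gradient of the batch-wise squared error:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 6, 2
C = rng.normal(size=(p, n))
Pi = rng.normal(size=(n, n))                         # latent time-lagged covariance Pi_s (s > 0)
y_t, y_ts = rng.normal(size=p), rng.normal(size=p)   # data vectors y_t and y_{t+s}
obs_t = np.array([1, 1, 1, 1, 0, 0], bool)           # Omega_t
obs_ts = np.array([0, 0, 1, 1, 1, 1], bool)          # Omega_{t+s}
T_s = np.full((p, p), 5.0)                           # co-occurrence counts T^s_ij
W = np.outer(obs_ts, obs_t) / T_s                    # weights 1/T^s_ij on observed pairs

def batch_loss(Pi):
    Lam = C @ Pi @ C.T                               # model Lambda(s) for s > 0 (no R term)
    return 0.5 * np.sum(W * (Lam - np.outer(y_ts, y_t)) ** 2)

# Analytic batch gradient w.r.t. Pi_s: C^T [W * (Lambda(s) - y_{t+s} y_t^T)] C
grad = C.T @ (W * (C @ Pi @ C.T - np.outer(y_ts, y_t))) @ C

# Finite-difference check of the analytic gradient
num, eps = np.zeros_like(Pi), 1e-6
for a in range(n):
    for b in range(n):
        E = np.zeros_like(Pi)
        E[a, b] = eps
        num[a, b] = (batch_loss(Pi + E) - batch_loss(Pi - E)) / (2 * eps)
print(np.max(np.abs(grad - num)) < 1e-6)  # True: analytic and numerical gradients agree
```

Because the loss is quadratic in Π_s, the central difference is exact up to floating-point roundoff, so the agreement is tight.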
We use ADAM [32] for stochastic gradient descent, which combines momentum over subsequent gradients with individual self-adjusting step sizes for each parameter. By using momentum on the stochastic gradients, we effectively obtain a gradient that aggregates information from empirical time-lagged covariances across multiple gradient steps.

2.4 How temporal information helps for stitching

The key challenge in stitching is that the latent space inferred by an LDS is defined only up to choice of coordinate system (i.e. a linear transformation of C). Thus, stitching is successful if one can align the Cs corresponding to different subpopulations into a shared coordinate system for the latent space of all p neurons [23] (Fig. 1). In the noise-free regime and if one ignores temporal information, this can work only if the overlap between two sub-populations is at least as large as the latent dimensionality, as shown by [23]. However, dynamics (i.e. temporal correlations) provide additional constraints for the alignment which can allow stitching even without overlap:

Assume two subpopulations I_1, I_2 with parameters θ^1, θ^2, latent spaces x^1, x^2 and with overlap set J = I_1 ∩ I_2 and overlap o = |J|. The overlapping neurons y_t^{(J)} are represented by both the matrix rows C^1_{(J,:)} and C^2_{(J,:)}, each in their respective latent coordinate systems. To stitch, one needs to identify the base change matrix M aligning latent coordinate systems consistently across the two populations, i.e. such that M x^1 = x^2 satisfies the constraints C^1_{(J,:)} = C^2_{(J,:)} M^{−1}. When only considering time-instantaneous covariances, this yields o linear constraints, and thus the necessary condition that o ≥ n, i.e. the overlap has to be at least as large as the latent dimensionality [23]. Including temporal correlations yields additional constraints, as the time-lagged activities also have to be aligned, and these constraints can be combined in the observability matrix O_J:

O^1_J = [ C^1_{(J,:)} ; C^1_{(J,:)} A^1 ; ··· ; C^1_{(J,:)} (A^1)^{n−1} ] = [ C^2_{(J,:)} ; C^2_{(J,:)} A^2 ; ··· ; C^2_{(J,:)} (A^2)^{n−1} ] M^{−1} = O^2_J M^{−1}.

If both observability matrices O^1_J and O^2_J have full rank (i.e. rank n), then M is uniquely constrained, and this identifies the base change required to align the latent coordinate systems. To get consistent latent dynamics, the matrices A^1 and A^2 have to be similar, i.e. M A^1 M^{−1} = A^2, and correspondingly the time-lagged latent covariance matrices Π^1_s, Π^2_s satisfy Π^1_s = M Π^2_s M^⊤. These dynamics might yield additional constraints: For example, if both A^1 and A^2 have unique (and the same) eigenvalues (and we know that we have identified all latent dimensions), then one could align the latent dimensions of x which share the same eigenvalues, even in the absence of overlap.

2.5 Details of simulated and empirical data

Linear dynamical system: We simulate LDSs to test algorithms S3ID and sEM. For dynamics matrices A, we generate eigenvalues with absolute values linearly spanning the interval [0.9, 0.99] and complex angles independently von Mises-distributed with zero mean and concentration κ = 1000, resulting in smooth latent trajectories. To investigate stitching-performance on SSOs, we divided the entire population of size p = 1000 into two subsets I_1 = [1, ..., p_1], I_2 = [p_2, ..., p], p_2 ≤ p_1, with overlap o = p_1 − p_2. We simulate for T_m = 50k time points, m = 1, 2, for a total of T = 10^5 time points. We set the R_ii such that 50% of the variance of each variable is private noise.
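The alignment argument of Sec. 2.4 in its simplest, noise-free, instantaneous form can be sketched as follows: with overlap o ≥ n, the base change follows from a least-squares solve on the overlapping rows of the two emission matrices (names and setup are our own toy construction):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3                    # latent dimensionality
p1, p2, o = 8, 8, 4      # subpopulation sizes and overlap, with o >= n

# Global emission matrix and an arbitrary invertible base change M
C_full = rng.normal(size=(p1 + p2 - o, n))
M = rng.normal(size=(n, n)) + 2 * np.eye(n)

I1 = np.arange(p1)                     # first recording
I2 = np.arange(p1 - o, p1 + p2 - o)    # second recording; overlaps I1 in o rows
C1 = C_full[I1]                        # emission rows in coordinate system 1
C2 = C_full[I2] @ M                    # same rows, different latent coordinates

# Overlap rows satisfy C1[J1] = C2[J2] @ Minv; o >= n makes Minv identifiable
J1, J2 = np.arange(p1 - o, p1), np.arange(o)
Minv, *_ = np.linalg.lstsq(C2[J2], C1[J1], rcond=None)

print(np.allclose(C2 @ Minv, C_full[I2]))  # True: recording 2 aligned into system 1
```

With o < n the solve is underdetermined, which is exactly where the time-lagged (observability) constraints of Sec. 2.4 become necessary.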
Results are aggregated over 20 data sets for each simulation. For the scaling analysis in section 3.2, we simulate population sizes p = 10^3, 10^4, 10^5, at overlap o = 10%, for T_m = 15k and 10 data sets (different random initialisation for LDS parameters and noise) for each population size. We compute subspace projection errors between C and Ĉ as e(C, Ĉ) = ||(I − ĈĈ^⊤)C||_F / ||C||_F.

Simulated neural networks: We simulate a recurrent network of 1250 exponential integrate-and-fire neurons [33] (250 inhibitory and p = 1000 excitatory neurons) with clustered connectivity for T = 60k time points. The inhibitory neurons exhibit unspecific connectivity towards the excitatory units. Excitatory neurons are grouped into 10 clusters with high connectivity (30%) within cluster and low connectivity (10%) between clusters, resulting in low-dimensional dynamics with smooth, oscillating modes corresponding to the 10 clusters.

Larval-zebrafish imaging: We applied S3ID to a dataset obtained by light-sheet fluorescence imaging of the whole brain of the larval zebrafish [34]. For this data, every data vector y_t represents a 2048 × 1024 × 41 three-dimensional image stack of fluorescence activity recorded sequentially across 41 z-planes, over a total of T = 1200 time points of recording at 1.15 Hz scanning speed across all z-planes. We separate foreground from background voxels by thresholding per-voxel fluorescence activity variance and select p = 7,828,017 voxels of interest (≈ 9.55% of total) across all z-planes, and z-scored variances.

3 Results

3.1 Stitching on simulated data

Figure 2: Dimensionality reduction for multiple partial recordings. Simulated LDS with p = 1K neurons and n = 10 latent variables, two subpopulations, varying degrees of overlap o. a) Subspace estimation performance for S3ID, sEM and reference algorithms (GROUSE and naive FA).
Subspace projection errors averaged over 20 generated data sets, ±1 SEM. S3ID returns good subspace estimates across a wide range of overlaps. b) Estimation of dynamics: correlations between ground-truth and estimated time-lagged covariances for unobserved pairwise covariances. c) Subspace projection error for sEM as a function of iterations, for different overlaps. Errors per data set, and means (bold lines). Convergence of sEM slows down with decreasing overlap.

To test how well parameters of LDS models can be reconstructed from high-dimensional partial observations, we simulated an LDS and observed it through two overlapping subsets, parametrically varying the size of the overlap between them from o = 1% to o = 100%. As a simple baseline, we apply a 'naive' Factor Analysis, for which we impute missing data as 0. GROUSE [21], an algorithm designed for randomly missing data, recovers a consistent subspace for overlap o = 30% and greater, but fails for smaller overlaps. As sEM (maximum of 200 iterations) is prone to getting stuck in local optima, we randomly initialise it with 4 seeds per fit and report results with the highest log-likelihood. sEM worked well even for small overlaps, but with increasingly variable results (see Fig. 2c). Finally, we applied our SSID algorithm S3ID, which exhibited good performance even for small overlaps.

Figure 3: Choice of latent dimensionality Eigenvalue spectra of system matrices estimated from simulated LDS data with o = 5% overlap and different latent dimensionalities n. a) Eigenvalues of instantaneous covariance matrix Π_0. b) Eigenvalues of linear dynamics matrix A. Both spectra indicate an elbow at the real data dimensionality n = 10 when S3ID is run with n ≥ 10.

To quantify recovery of dynamics, we compare predictions for pairwise time-lagged covariances between variables not co-observed simultaneously (Fig. 2b).
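The subspace projection error e(C, Ĉ) = ||(I − ĈĈ^⊤)C||_F / ||C||_F used in these comparisons can be sketched as follows; we orthonormalize the estimate first via QR, since the formula assumes an orthonormal basis (implementation details are ours):

```python
import numpy as np

def subspace_projection_error(C, C_hat):
    """e(C, C_hat) = ||(I - Q Q^T) C||_F / ||C||_F, with Q an orthonormal
    basis for the column span of the estimate C_hat."""
    Q, _ = np.linalg.qr(C_hat)
    resid = C - Q @ (Q.T @ C)  # (I - Q Q^T) C, without forming the p x p projector
    return np.linalg.norm(resid) / np.linalg.norm(C)

rng = np.random.default_rng(0)
C = rng.normal(size=(100, 10))
e_same = subspace_projection_error(C, C @ rng.normal(size=(10, 10)))  # same span
e_rand = subspace_projection_error(C, rng.normal(size=(100, 10)))     # unrelated span
print(e_same < 1e-8, e_rand > 0.5)  # True True
```

The error is invariant to invertible remixing of the latent coordinates, which is the right property here: stitching can only recover C up to the base change M discussed in Sec. 2.4.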
Because GROUSE itself does not capture temporal correlations, we obtain estimated time-lagged correlations by projecting data y_t onto the obtained subspace and extracting linear dynamics from the estimated time-lagged latent covariances. S3ID is optimized to capture time-lagged covariances, and therefore outperforms alternative algorithms.

Figure 4: Comparison with post-hoc alignment of subspaces a) Multiple partial recordings with 20 sequentially recorded subpopulations. b) We apply S3ID to the full population, as well as factor analysis to each of these subpopulations. The latter gives 20 subspace estimates, which we sequentially align using subpopulation overlaps.

When we use a latent dimensionality (n = 20, 50) larger than the true one (n = 10), we observe 'elbows' in the eigen-spectra of the instantaneous covariance estimate Π_0 and the dynamics matrix A, located at the true dimensionality (Fig. 3). This observation suggests we can use standard techniques for choosing latent dimensionalities in applications where the real n is unknown. Choosing n too large or too small led to some decrease in prediction quality of unobserved (time-lagged) correlations. Importantly though, performance degraded gracefully when the dimensionality was chosen too big: For instance, at 5% overlap, the correlation between predicted and ground-truth unobserved instantaneous covariances was 0.99 for the true latent dimensionality n = 10 (Fig. 2b).
At smaller n = 5 and n = 8, correlations were 0.69 and 0.89, respectively, and for larger n = 20 and n = 50, they were 0.97 and 0.96. In practice, we recommend using n larger than the hypothesized latent dimensionality.

S3ID and sEM jointly estimate the subspace C across the entire population. An alternative approach would be to identify the subspaces for the different subpopulations via separate matrices C_{(I,:)} and subsequently align these estimates via their pairwise overlap [23]. This works very well on this example (as for each subset there is sufficient data to estimate each C_{(I,:)} individually). However, in Fig. 4 we show that this approach performs suboptimally in scenarios in which data is more noisy or comprised of many (here 20) subpopulations. In summary, S3ID can reliably stitch simulated data across a range of overlaps, even for very small overlaps.

3.2 Stitching for different population sizes: Combining S3ID with sEM works best

Figure 5: Initializing EM with SSID for fast and robust convergence LDS with p = 10^3, 10^4, 10^5 neurons and n = 10 latent variables, 10% overlap. a) Largest principal angles as a function of computation time. We compare randomly initialised sEM with sEM initialised from S3ID after a single pass over the data. b) Comparison of final subspace estimate. We can combine the high reliability of S3ID with the low final subspace angle of EM by initialising sEM with S3ID. c) Comparison of total run-times. Initialization by S3ID does not change overall runtime.

The above results were obtained for fixed population size p = 1000. To investigate how performance and computation time scale with population size, we simulate data from an LDS with fixed overlap o = 10% for different population sizes. We run S3ID with a single pass, and subsequently use its final parameter estimates to initialize sEM.
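The largest principal angle reported in Fig. 5 can be computed from orthonormal bases of the two subspaces: the cosines of the principal angles are the singular values of Q_1^⊤ Q_2. A minimal sketch (our implementation, not the paper's):

```python
import numpy as np

def largest_principal_angle(A, B):
    """Largest principal angle (radians) between the column spans of A and B."""
    Qa, _ = np.linalg.qr(A)
    Qb, _ = np.linalg.qr(B)
    sv = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    # cosines of the principal angles are the singular values;
    # the largest angle corresponds to the smallest singular value
    return float(np.arccos(np.clip(sv.min(), -1.0, 1.0)))

rng = np.random.default_rng(0)
C_true = rng.normal(size=(50, 5))
ang_same = largest_principal_angle(C_true, C_true @ rng.normal(size=(5, 5)))
ang_rand = largest_principal_angle(C_true, rng.normal(size=(50, 5)))
print(ang_same < 1e-6, ang_rand > 0.5)  # True True
```

Like the projection error above, this measure is invariant to the latent coordinate system, so it compares only the identifiable quantity: the subspace itself.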
We set the maximum number of iterations for sEM to 50, corresponding to approximately 1.5 h of training time for p = 10^5 observed variables. We quantify the subspace estimates by the largest principal angle between ground-truth and estimated subspaces. We find that the best performance is achieved by the combined algorithm (S3ID + sEM, Fig. 5a,b). In particular, S3ID reliably and quickly leads to a reduction in error (Fig. 5a), but (at least when capped at one pass over the data), further improvements can be achieved by letting sEM do further 'fine-tuning' of parameters from the initial estimate [35]. When starting sEM from random initializations, we find that it often gets stuck in local minima (potentially, shallow regions of the log-likelihood). While convergence issues for EM have been reported before, we remark that these issues seem to be much more severe for stitching. We hypothesize that the presence of two potential solutions (one for each observation subset) makes parameter inference more difficult.

Computation times for both stitching algorithms scale approximately linearly with observed population size p (Fig. 5c). When initializing sEM by S3ID, we found that the cost of S3ID is amortized by faster convergence of sEM. In summary, S3ID performs robustly across different population sizes, but can be further improved when used as an initializer for sEM.

3.3 Spiking neural networks

How well can our approach capture and predict correlations in spiking neural networks, from partial observations?
To answer this question, we applied S3ID to a network simulation of inhibitory and excitatory neurons (Fig. 6a), divided into 10 clusters with strong intra-cluster connectivity. We apply S3ID-initialised sEM with n = 20 latent dimensions to this data and find good recovery of time-instantaneous covariances (Fig. 6b), but poor recovery of long-range temporal interactions. Since sEM assumes linear latent dynamics, we test whether this is due to a violation of the linearity assumption by applying S3ID with nonlinear latent dynamics, i.e. by learning the latent covariances Π_s, s = 0, . . . , 39. This comes at the cost of learning 40 rather than 2 matrices of size n × n to characterise the latent space, but we note that this still amounts to only 76.2% of the number of parameters learned for C and R. We find that the nonlinear latent dynamics approach allows for markedly better predictions of time-lagged covariances (Fig. 6b).
We attempt to recover cluster membership for each of the neurons from the estimated emission matrix C using K-means clustering on the rows of C. Because the 10 clusters are distributed over both subpopulations, this will only be successful if the latent representations for the two subpopulations are sufficiently aligned. While we find that both approaches can assign most neurons correctly, only the nonlinear version of S3ID allows correct recovery for every neuron. Thus, the flexibility of S3ID allows more accurate reconstruction and prediction of correlations in data which violates the assumptions of linear Gaussian dynamics.
We also applied dynamics-agnostic S3ID when undersampling two out of the ten clusters. Prediction of unobserved covariances for the undersampled clusters was robust down to sampling only 50% of neurons from those clusters. For 50/40/30% sampling, we obtained correlations of instantaneous covariances of 0.97/0.80/0.32 for neurons in the undersampled clusters.
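The cluster-membership step used here (K-means applied to the rows of the estimated emission matrix C) can be sketched in a few lines. The following is a minimal, self-contained illustration in plain numpy with a deterministic farthest-point initialization; it is not the exact implementation used for the analyses above:

```python
import numpy as np

def kmeans_rows(C, k, n_iter=100):
    """Cluster the rows of an emission matrix C into k groups (Lloyd's algorithm).

    Initialization is deterministic farthest-point seeding: the first center is
    row 0, and each further center is the row farthest from existing centers.
    """
    # farthest-point initialization
    centers = [C[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(C - c, axis=1) for c in centers], axis=0)
        centers.append(C[d.argmax()])
    centers = np.array(centers, dtype=float)

    labels = np.zeros(len(C), dtype=int)
    for _ in range(n_iter):
        # assign each neuron (row of C) to its nearest center
        d = np.linalg.norm(C[:, None, :] - centers[None, :, :], axis=-1)
        new_labels = d.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged
        labels = new_labels
        # recompute each center as the mean of its assigned rows
        for j in range(k):
            if np.any(labels == j):
                centers[j] = C[labels == j].mean(axis=0)
    return labels
```

On well-separated loading patterns (as in the clustered network simulation), the recovered labels partition the neurons by cluster; recovery degrades once clusters are heavily undersampled, consistent with the results reported above.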
Correlation across all clusters remained above 0.97 throughout. K-means on the rows of the learned emission matrix C still perfectly identified the ten clusters at 40% sampling, whereas below that it fused the undersampled clusters.

Figure 6: Spiking network simulation. a) Spiking data for 100 example neurons from 10 clusters, and two observations with 10% overlap (clusters shuffled across observation subsets). b) Correlations between ground-truth and estimated time-lagged covariances for non-observed pairwise covariances, for S3ID with or without linearity assumption, as well as for sEM initialised with linear S3ID. c) Recovery of cluster membership, using K-means clustering on estimated C.

3.4 Zebrafish imaging data

Finally, we want to determine how well the approach works on real population imaging data, and test whether it can scale to millions of dimensions. To this end, we apply (both linear and nonlinear) S3ID

Figure 7: Zebrafish imaging data. Multiple partial recordings for p = 7,828,017-dimensional data from light-sheet fluorescence imaging of larval zebrafish. Data vectors represent volumetric frames from 41 planes. a) Simulated observation scheme: we assume the imaging data was recorded over two sessions with a single imaging plane in overlap. We apply S3ID with latent dimensionality n = 10 with linear and nonlinear latent dynamics. b) Quantification of covariance recovery. Comparison of held-out ground-truth and estimated instantaneous covariances, for 10^6 randomly selected voxel pairs not co-observed under the observation scheme in a.
We estimate covariances from two models learned from partially observed data (green: dynamics-agnostic; magenta: linear dynamics) and from a control fit to fully-observed data (orange, dynamics-agnostic). Left: Instantaneous covariances. Right: Prediction of time-lagged covariances. Correlation of covariances as a function of time-lag.

to volume scans of larval zebrafish brain activity obtained with light-sheet fluorescence microscopy, comprising p = 7,828,017 voxels. We assume an observation scheme in which the first 21 (out of 41) imaging planes are imaged in the first session, and the remaining 21 planes in the second, i.e. with only z-plane 21 (234,572 voxels) in overlap (Fig. 7a,b). We evaluate the performance by predicting (time-lagged) pairwise covariances for voxel pairs not co-observed under the assumed multiple partial recording, using eq. 3. We find that nonlinear S3ID is able to reconstruct correlations with high accuracy (Fig. 7c), and even outperforms linear S3ID applied to full observations. FA applied to each imaging session and aligned post-hoc (as in [23]) obtained a correlation of 0.71 for instantaneous covariances, and applying GROUSE to the observation scheme gave a correlation of 0.72.

4 Discussion

In order to understand how neural dynamics and computations are distributed across large neural circuits, we need methods for interpreting neural population recordings with many neurons, obtained in sufficiently rich and complex tasks [12]. Here, we provide methods for dimensionality reduction which dramatically expand the range of possible analyses. This makes it possible to identify dynamics in data with millions of dimensions, even if many observations are missing in a highly structured manner, e.g. because measurements have been obtained in multiple overlapping recordings.
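To make these covariance predictions concrete: under linear Gaussian latent dynamics, the model predicts cov(y_{t+s}, y_t) = C A^s Π C^T (plus observation noise on the diagonal at s = 0), where Π is the stationary latent covariance. These predicted entries are defined for all dimension pairs, including pairs never observed together, which is what enables stitching. A minimal numpy sketch follows; function names and the diagonal-noise convention are illustrative, not the paper's code:

```python
import numpy as np

def stationary_latent_cov(A, Q, n_iter=1000):
    """Stationary covariance Pi of x_{t+1} = A x_t + w_t, w_t ~ N(0, Q),
    via fixed-point iteration of the Lyapunov equation Pi = A Pi A^T + Q
    (converges for stable A)."""
    Pi = Q.copy()
    for _ in range(n_iter):
        Pi = A @ Pi @ A.T + Q
    return Pi

def predicted_cov(C, A, Pi, R_diag, lag=0):
    """Model-predicted cov(y_{t+lag}, y_t) = C A^lag Pi C^T, with the diagonal
    observation noise diag(R_diag) added at lag 0. Entry (i, j) is defined even
    if dimensions i and j were never recorded simultaneously."""
    S = C @ np.linalg.matrix_power(A, lag) @ Pi @ C.T
    if lag == 0:
        S = S + np.diag(R_diag)
    return S
```

For instance, with A = 0.5·I and Q = I the stationary covariance is Π = (4/3)·I, and the lag-1 prediction with C = I and no observation noise is (2/3)·I. In the moment-matching objective, these model-predicted covariances are compared against empirical ones on co-observed entries only.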
Our approach identifies parameters by matching model-predicted covariances with empirical ones; thus, it yields models which are optimized to be realistic generative models of neural activity. While maximum-likelihood approaches (i.e. EM) are also popular for fitting dynamical system models to data, they are not guaranteed to provide realistic samples when used as generative models, and empirically often yield worse fits to measured correlations, or even diverging firing rates.
Our approach readily permits several possible generalizations: First, using methods similar to [35], it could be generalized to nonlinear observation models, e.g. generalized linear models with Poisson observations. In this case, one could still use gradient descent to minimize the mismatch between model-predicted and empirical covariances. Second, one could impose non-negativity constraints on the entries of C to obtain more interpretable network models [36]. Third, one could generalize the latent dynamics to nonlinear or non-Markovian parametric models, and optimize the parameters of these nonlinear dynamics using stochastic gradient descent. For example, one could optimize the kernel function of GPFA directly by matching the GP-kernel to the latent covariances.

Acknowledgements We thank M. Ahrens for the larval zebrafish data. Our work was supported by the caesar foundation.

References

[1] J. P. Cunningham and M. Y. Byron, “Dimensionality reduction for large-scale neural recordings,” Nature Neuroscience, vol. 17, no. 11, pp. 1500–1509, 2014.

[2] M. Y. Byron, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M.
Sahani, “Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity,” in Advances in Neural Information Processing Systems, pp. 1881–1888, 2009.

[3] J. H. Macke, L. Buesing, J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani, “Empirical models of spiking in neural populations,” in Advances in Neural Information Processing Systems, pp. 1350–1358, 2011.

[4] D. Pfau, E. A. Pnevmatikakis, and L. Paninski, “Robust learning of low-dimensional dynamics from large neural ensembles,” in Advances in Neural Information Processing Systems, pp. 2391–2399, 2013.

[5] Y. Gao, L. Buesing, K. V. Shenoy, and J. P. Cunningham, “High-dimensional neural spike train analysis with generalized count linear dynamical systems,” in Advances in Neural Information Processing Systems, pp. 2044–2052, 2015.

[6] M. M. Churchland, J. P. Cunningham, M. T. Kaufman, J. D. Foster, P. Nuyujukian, S. I. Ryu, and K. V. Shenoy, “Neural population dynamics during reaching,” Nature, vol. 487, no. 7405, p. 51, 2012.

[7] O. Mazor and G. Laurent, “Transient dynamics versus fixed points in odor representations by locust antennal lobe projection neurons,” Neuron, vol. 48, no. 4, pp. 661–673, 2005.

[8] K. L. Briggman, H. D. I. Abarbanel, and W. B. Kristan, Jr, “Optical imaging of neuronal populations during decision-making,” Science, vol. 307, no. 5711, pp. 896–901, 2005.

[9] D. V. Buonomano and W. Maass, “State-dependent computations: spatiotemporal processing in cortical networks,” Nat Rev Neurosci, vol. 10, no. 2, pp. 113–125, 2009.

[10] K. V. Shenoy, M. Sahani, and M. M. Churchland, “Cortical control of arm movements: a dynamical systems perspective,” Annu Rev Neurosci, vol. 36, pp. 337–359, 2013.

[11] V. Mante, D. Sussillo, K. V. Shenoy, and W. T.
Newsome, “Context-dependent computation by recurrent dynamics in prefrontal cortex,” Nature, vol. 503, no. 7474, pp. 78–84, 2013.

[12] P. Gao and S. Ganguli, “On simplicity and complexity in the brave new world of large-scale neuroscience,” Curr Opin Neurobiol, vol. 32, pp. 148–155, 2015.

[13] N. Li, K. Daie, K. Svoboda, and S. Druckmann, “Robust neuronal dynamics in premotor cortex during motor planning,” Nature, vol. 532, no. 7600, pp. 459–464, 2016.

[14] N. J. Sofroniew, D. Flickinger, J. King, and K. Svoboda, “A large field of view two-photon mesoscope with subcellular resolution for in vivo imaging,” eLife, vol. 5, 2016.

[15] A. K. Dhawale, R. Poddar, S. B. Wolff, V. A. Normand, E. Kopelowitz, and B. P. Ölveczky, “Automated long-term recording and analysis of neural activity in behaving animals,” eLife, vol. 6, 2017.

[16] Q. J. Huys and L. Paninski, “Smoothing of, and parameter estimation from, noisy biophysical recordings,” PLoS Comput Biol, vol. 5, no. 5, p. e1000379, 2009.

[17] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, Series B (Methodological), pp. 1–38, 1977.

[18] Z. Ghahramani and G. E. Hinton, “Parameter estimation for linear dynamical systems,” Technical Report CRG-TR-96-2, University of Toronto, Dept. of Computer Science, 1996.

[19] P. Van Overschee and B. De Moor, Subspace Identification for Linear Systems: Theory—Implementation—Applications. Springer Science & Business Media, 2012.

[20] T. Katayama, Subspace Methods for System Identification. Springer Science & Business Media, 2006.

[21] L. Balzano, R. Nowak, and B.
Recht, “Online identification and tracking of subspaces from highly incomplete information,” in Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, pp. 704–711, IEEE, 2010.

[22] S. Turaga, L. Buesing, A. M. Packer, H. Dalgleish, N. Pettit, M. Hausser, and J. Macke, “Inferring neural population dynamics from multiple partial recordings of the same neural circuit,” in Advances in Neural Information Processing Systems, pp. 539–547, 2013.

[23] W. E. Bishop and B. M. Yu, “Deterministic symmetric positive semidefinite matrix completion,” in Advances in Neural Information Processing Systems, pp. 2762–2770, 2014.

[24] I. Markovsky, “The most powerful unfalsified model for data with missing values,” Systems & Control Letters, 2016.

[25] I. Markovsky, “A missing data approach to data-driven filtering and control,” IEEE Transactions on Automatic Control, 2016.

[26] Z. Liu, A. Hansson, and L. Vandenberghe, “Nuclear norm system identification with missing inputs and outputs,” Systems & Control Letters, vol. 62, no. 8, pp. 605–612, 2013.

[27] J. He, L. Balzano, and J. Lui, “Online robust subspace tracking from partial information,” arXiv preprint arXiv:1109.3827, 2011.

[28] D. Soudry, S. Keshri, P. Stinson, M.-h. Oh, G. Iyengar, and L. Paninski, “Efficient ‘shotgun’ inference of neural connectivity from highly sub-sampled activity data,” PLoS Comput Biol, vol. 11, no. 10, p. e1004464, 2015.

[29] S. C. Turaga, L. Buesing, A. Packer, H. Dalgleish, N. Pettit, M. Hausser, and J. H. Macke, “Inferring neural population dynamics from multiple partial recordings of the same neural circuit,” in Advances in Neural Information Processing Systems, pp. 539–547, 2013.

[30] E. A. Pnevmatikakis, K. R. Rad, J. Huggins, and L.
Paninski, “Fast Kalman filtering and forward–backward smoothing via a low-rank perturbative approach,” Journal of Computational and Graphical Statistics, vol. 23, no. 2, pp. 316–339, 2014.

[31] M. Aoki, State Space Modeling of Time Series. Springer Science & Business Media, 1990.

[32] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

[33] R. Brette and W. Gerstner, “Adaptive exponential integrate-and-fire model as an effective description of neuronal activity,” Journal of Neurophysiology, vol. 94, no. 5, pp. 3637–3642, 2005.

[34] M. B. Ahrens, M. B. Orger, D. N. Robson, J. M. Li, and P. J. Keller, “Whole-brain functional imaging at cellular resolution using light-sheet microscopy,” Nature Methods, vol. 10, no. 5, pp. 413–420, 2013.

[35] L. Buesing, J. H. Macke, and M. Sahani, “Spectral learning of linear dynamics from generalised-linear observations with application to neural population data,” in Advances in Neural Information Processing Systems, pp. 1682–1690, 2012.

[36] L. Buesing, T. A. Machado, J. P. Cunningham, and L. Paninski, “Clustered factor analysis of multineuronal spike data,” in Advances in Neural Information Processing Systems, pp. 3500–3508, 2014.