{"title": "Model-based targeted dimensionality reduction for neuronal population data", "book": "Advances in Neural Information Processing Systems", "page_first": 6690, "page_last": 6699, "abstract": "Summarizing high-dimensional data using a small number of parameters is a ubiquitous first step in the analysis of neuronal population activity. Recently developed methods use \"targeted\" approaches that work by identifying multiple, distinct low-dimensional subspaces of activity that capture the population response to individual experimental task variables, such as the value of a presented stimulus or the behavior of the animal. These methods have gained attention because they decompose total neural activity into what are ostensibly different parts of a neuronal computation. However, existing targeted methods have been developed outside of the confines of probabilistic modeling, making some aspects of the procedures ad hoc, or limited in flexibility or interpretability. Here we propose a new model-based method for targeted dimensionality reduction based on a probabilistic generative model of the population response data.  The low-dimensional structure of our model is expressed as a low-rank factorization of a linear regression model. We perform efficient inference using a combination of expectation maximization and direct maximization of the marginal likelihood. We also develop an efficient method for estimating the dimensionality of each subspace. We show that our approach outperforms alternative methods in both mean squared error of the parameter estimates, and in identifying the correct dimensionality of encoding using simulated data. We also show that our method provides more accurate inference of low-dimensional subspaces of activity than a competing algorithm, demixed PCA.", "full_text": "Model-based targeted dimensionality reduction for\n\nneuronal population data\n\nMikio C. 
Aoi\u2217\n\nPrinceton Neuroscience Institute\n\nPrinceton University,\nPrinceton, NJ 08544\n\nmaoi@princeton.edu\n\nJonathan W. Pillow \u2020\n\nPrinceton Neuroscience Institute\n\nPrinceton University,\nPrinceton, NJ 08544\n\npillow@princeton.edu\n\nAbstract\n\nSummarizing high-dimensional data using a small number of parameters is a\nubiquitous \ufb01rst step in the analysis of neuronal population activity. Recently\ndeveloped methods use \u201ctargeted\" approaches that work by identifying multiple,\ndistinct low-dimensional subspaces of activity that capture the population response\nto individual experimental task variables, such as the value of a presented stimulus\nor the behavior of the animal. These methods have gained attention because they\ndecompose total neural activity into what are ostensibly different parts of a neuronal\ncomputation. However, existing targeted methods have been developed outside of\nthe con\ufb01nes of probabilistic modeling, making some aspects of the procedures ad\nhoc, or limited in \ufb02exibility or interpretability. Here we propose a new model-based\nmethod for targeted dimensionality reduction based on a probabilistic generative\nmodel of the population response data. The low-dimensional structure of our model\nis expressed as a low-rank factorization of a linear regression model. We perform\nef\ufb01cient inference using a combination of expectation maximization and direct\nmaximization of the marginal likelihood. We also develop an ef\ufb01cient method\nfor estimating the dimensionality of each subspace. We show that our approach\noutperforms alternative methods in both mean squared error of the parameter\nestimates, and in identifying the correct dimensionality of encoding using simulated\ndata. 
We also show that our method provides more accurate inference of low-dimensional subspaces of activity than a competing algorithm, demixed PCA.\n\n1 Introduction\n\nNeuroscience has recently seen a massive expansion in the number of neurons that can be recorded from a single animal, largely due to transformative technological advancements in electrode design and two-photon imaging. One of the effects of our increased measurement capacity is an increased interest in the properties of the activity of groups of neurons (i.e., population activity), as opposed to analyzing the activity of single neurons independently [1]. One goal of analyzing population activity is to characterize the ways in which groups of neurons coordinate to perform task-relevant computations.\nDimensionality reduction is central to the analysis of population activity [1]. Concomitant with the broader use of classical dimensionality reduction methods like PCA and ICA comes the recognition that these methods often do not take full advantage of well-characterized properties of neuronal population data, such as tensor structure or temporal correlations in spike rates, and a number of recent data analysis techniques have been developed to improve our ability to meet such specific challenges [2–7].\n\n*http://mikioaoi.com\n†http://pillowlab.princeton.edu/jpillow/\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nOf particular interest have been methods of dimensionality reduction for population data that distinguish between the effects of various inputs and outputs, or “task variables,” such as a stimulus strength, an experimental context, or a behavioral outcome [8–11]. 
We will refer to these methods collectively as “targeted” methods.\nAlthough several targeted methods of dimensionality reduction exist, two recent methods stand out: demixed principal components analysis (dPCA) [8] and targeted dimensionality reduction (TDR) [9]. Both of these methods were developed for the analysis of neuronal population data that inherently have observations of neuronal activity structured as matrices (e.g., neurons as rows, time as columns), and both attempt to identify low-dimensional subspaces that best describe the population responses to an individual task variable.\nThe most recent version of dPCA [8] is a general method with relatively weak modeling assumptions, arbitrary dimensionality, and a fast estimation algorithm based on low-rank regression [12]. However, dPCA requires that all observed neurons display firing rates for all possible combinations of task variables, a condition that may be too strict to be applicable for complex experiments. In contrast, TDR [9] utilizes a linear regression-based approach that circumvents the need to have observed every neuron at every combination of task variables by imposing an explicit relationship between regressors and outputs. However, the TDR method is limited to a one-dimensional subspace per task variable, and it is not clear that one dimension is sufficient to describe the population activity associated with a given task variable. For example, sequential activation of neurons during decision making has been observed in rodents, where the precise ordering of activations depends on which decision the animal makes [13], and population code “morphing” has been observed in monkeys, where decision encoding changes over time [14]. These types of dynamic encoding schemes are inherently high-dimensional, and any method constrained to too few dimensions will fail to fully characterize such activity. 
Lastly, none of the existing methods has a principled approach to identifying the dimensionality of the data, making post hoc analysis particularly difficult.\nHere we propose a model-based method for targeted dimensionality reduction based on an extension of the framework proposed by [9]. Our approach overcomes a number of the drawbacks of existing methods. Using a probabilistic generative model of the data, we can infer the optimal dimensionality of the low-dimensional subspaces required to faithfully describe underlying latent structure in the data. In the following, we describe the model, which we call model-based targeted dimensionality reduction (MBTDR), its assumptions, and an efficient estimation procedure for model parameters and dimensionality. We then demonstrate the accuracy of our estimation algorithm against alternative methods of estimation.\n\n2 Explicitly low-dimensional model of population activity\n\n2.1 High-dimensional description\n\nOur model begins with a description of trial-by-trial neuronal activity in terms of a linear regression with respect to the task variables. We assume that the activity yi,k(t) of the ith neuron at time t on trial k can be described by a linear combination of P task variables x(p)k, p = 1, . . . , P (e.g., stimulus variables, behavioral outcomes, and nonlinear combinations), such that\n\nyi,k(t) = x(1)k βi,1(t) + x(2)k βi,2(t) + ··· + x(P)k βi,P(t) + εi,k(t),\n\nwhere the values of the P task variables x(p)k are known, the βi,p(t) are unknown coefficients, and εi,k(t) is noise. This basic model structure is identical to that of the regression model used in [9] and has been successfully employed in characterizing neuronal activity of single neurons [15].\nTo represent all neurons simultaneously, we simply concatenate all i = 1, . . . 
, n responses into a vector and write\n\nyk(t) = x(1)k β1(t) + x(2)k β2(t) + ··· + x(P)k βP(t) + εk(t),\n\nwhere yk(t) = (y1,k(t), . . . , yn,k(t))⊤, βp(t) = (β1,p(t), . . . , βn,p(t))⊤, and εk(t) = (ε1,k(t), . . . , εn,k(t))⊤.\nNeuronal recordings are often performed in experiments where trial epochs are of fixed duration. We can take advantage of this structure by regarding the observation on each trial to be a matrix, Yk = (yk(1), . . . , yk(T)), which is a linear combination of P coefficient matrices Bp = (βp(1), . . . , βp(T)), giving the observation model\n\nYk = x(1)k B1 + x(2)k B2 + ··· + x(P)k BP + Ek,    (1)\n\nwhere Ek = (εk(1), . . . , εk(T)). A schematic illustration of this basic setting is shown in Figure 1.\nIn general, not all neurons are observed simultaneously. Most often they are observed sequentially or in sequential blocks. Suppose we do not observe all rows of Yk on all trials but instead observe nk ≤ n neurons. If we let Yk be a latent matrix of all recorded neurons from all trials, then we can describe the observed neurons on any given trial by Zk = HkYk, where Hk is an nk × n matrix in which each row is a one-hot vector providing the index of an observed neuron.\n\n2.2 Low-dimensional description of observations\n\nWith no additional constraints, our observation model (1) is extremely high-dimensional and is effectively a separate linear regression for each neuron at every time point. This would only be a sensible model if we believed that neurons were not in fact coordinating their activity with each other or across time. 
We would like to be able to express the prior belief that there are correlations across the population, but that correlations in activity due to different values of stimuli are not necessarily the same as those due to the behavior of the animal.\nTo accomplish this we can describe each characteristic response matrix Bp by a low-rank factorization, i.e. Bp = WpSp, where Wp and Sp are n × rp and rp × T respectively, with rp = rank(Bp). Equivalently, we can say that rp is the dimensionality of the encoding of task variable p. This formulation has an intuitive interpretation, illustrated schematically in Figure 1A: the characteristic response β(p)i(t) of each neuron to the pth task variable can be expressed as a linear combination of rp weighted basis functions, β(p)i(t) = Σj=1..rp w(p)i,j s(p)j(t), where rp is the dimensionality of the encoding, {s(p)j(t)}j=1..rp are a common set of time-varying basis functions, and {w(p)i,j}j=1..rp are neuron-dependent mixing weights.\nThe example in Figure 1A displays a model with two task variables (x1, x2), where the x1 subspace is 1D and the x2 subspace is 2D. The columns of the Wp's weight each time-varying basis function differently for each neuron. Collectively, these weights define the subspace of activity that encodes task variable xp. For x1, the encoding is 1D because only one basis function is needed to describe the population response to x1. The x2 response is slightly more complex, with different responses at different times, requiring at least two basis functions.\n\n3 Model estimation\n\nThe goal of inference is to estimate the factors of Bp and the ranks rp. Our proposed estimation strategy is to estimate one set of factors ({Wp} or {Sp}) while integrating out the other. 
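As a concrete illustration, the generative model of equation (1) together with the low-rank factorization Bp = WpSp of Section 2.2 can be sketched in a few lines of numpy. This is a minimal sketch: the ranks, noise scale, and task-variable values below are illustrative choices, not values taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, P = 100, 15, 3            # neurons, time bins, task variables
ranks = [1, 2, 2]               # hypothetical dimensionalities r_p

# Low-rank factors: B_p = W_p @ S_p, with W_p (n x r_p) and S_p (r_p x T)
W = [rng.standard_normal((n, r)) for r in ranks]
S = [rng.standard_normal((r, T)) for r in ranks]
B = [Wp @ Sp for Wp, Sp in zip(W, S)]

def simulate_trial(x, noise_sd=1.0):
    """One trial of the observation model: Y_k = sum_p x_k^(p) B_p + E_k."""
    Y = sum(xp * Bp for xp, Bp in zip(x, B))
    return Y + noise_sd * rng.standard_normal((n, T))

Yk = simulate_trial(x=[1.0, -1.0, 2.0])
```

Each `B[p]` here has rank `ranks[p]` by construction, which is exactly the sense in which the encoding of task variable p is low-dimensional.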
For example, if we define a prior over the mixing weights {Wp}, denoted by p(W), and a data likelihood p(Z|W, S), then the marginal likelihood of the matrix of time-varying basis functions S can be obtained by\n\np(Z|S, λ) = ∫ p(Z|W, S, λ) p(W) dW.\n\nIn principle, either set of factors may be selected. In practice, however, the set of factors with lowest dimension should be selected to keep computational costs low. In this paper we focus on the case where T ≪ n, and we will therefore estimate {Sp} while integrating over {Wp}. The fact that either set of factors may be determined in this way means that there is a duality between rows and columns imposed by this model that is similar in principle to the duality between factors and latent states for probabilistic principal components analysis [16].\nIn practice, inference can be considerably simplified if we let the noise distribution and the prior distribution of W both be Gaussian, which permits closed-form expression of the marginal and posterior densities. We let all elements of W be independent standard normal (i.e., w ∼ N(0, I˜rn), where ˜r = Σp rank(Bp)). In addition, we let the noise covariance on all trials be given by Ek ∼ MN(0, D−1, IT), where MN(M, A, B) denotes the matrix normal distribution with row covariance A and column covariance B, and D ≡ diag(λ1, . . . , λn), where λi is the inverse noise variance of neuron i.\n\nFigure 1: A: Schematic illustration of the low-rank regression model. The n × T response matrix can be decomposed into two response matrices (β1, β2), each corresponding to one task variable (upper panel). Each response matrix can be factorized into a small number of row and column vectors, making the population response a linear combination of a small number of common basis functions weighted differently for each neuron. 
B: Results of simulation study evaluating parameter estimation accuracy for different estimation procedures. Legend indicates the method used. Abscissa indicates the number of trials used for the simulations. Error bars indicate 95% confidence intervals over 100 runs. C: Duration of computation for the methods and trial counts used in B.\n\nWe therefore assume that the weights are a priori independent and that the noise variance is independent across both neurons and time. In principle, our framework supports the application of more structured priors and noise covariances, but we will save the exploration of more elaborate models for future work.\n\n3.1 Marginal likelihood of timecourses S\n\nSince our model is linear and Gaussian, the marginalized density p(Z|S, λ) is also Gaussian and can be easily derived using standard Gaussian identities [17]. However, a naive derivation of the marginal likelihood requires the log determinant and inverse of a matrix which is ÑT × ÑT, where Ñ = Σi Ni and Ni is the number of observed trials for neuron i. Thus, if all neurons are observed on all trials, then the dimensions of the marginal covariance will be nNT × nNT, which can be prohibitively large for even moderately sized datasets, since the determinant and inverse will in general have computational complexity O(Ñ³T³). Luckily, the expression for the marginal likelihood can be dramatically reduced by exploiting the factorization of the regression parameters.\nIf we let S ≡ blkdiag(S1, . . . , SP) and λ = (λ1, . . . 
, λn), then we can derive (see the Supplementary Material for details) the following expression for the marginal likelihood in terms of S and λ:\n\nℓ(s, λ) = −(1/2)( ÑT log 2π + Σi=1..n ( −NiT log λi + λi yi⊤yi + log |Ci| − λi² Trace[Ri S⊤ Ci⁻¹ S] ) ),    (2)\n\nwhere the matrices Ri and Ci are defined by\n\nRi = (Xi⊤ ⊗ IT) yi yi⊤ (Xi ⊗ IT),    Ci = λi S (Ai ⊗ IT) S⊤ + I˜r,    (3)\n\nrespectively, where Xi is the Ni × P design matrix that only includes trials on which neuron i was observed, Ai = Xi⊤Xi, and yi = (yi1⊤, . . . , yiNi⊤)⊤, with yik being the length-T response of neuron i on trial k.\nThe expression in (2) reveals two things about the structure of dependencies within the model. First, we notice that the likelihood factorizes over neurons, making evaluation of the likelihood potentially highly parallelizable. Second, the trace term is reminiscent of the quadratic term of a matrix normal model, indicating that we can intuitively think of the posterior covariance Ci and the rank-1 matrix Ri as the neuron-dependent contributions to the row and column covariances of S, respectively.\nMaximum marginal likelihood (MML) estimates for S and λi can be obtained by directly maximizing (2) by gradient ascent.\n\n3.2 Posterior distribution of neural weights W\n\nOnce an estimate of S and λ is obtained, we can do posterior inference on W. Because our model is linear and Gaussian, the posterior density p(W|Z, S, λ) is also Gaussian and admits closed-form expressions for the posterior expectation and variance of W. 
Because of our low-rank model structure, the posterior of the weight matrices {Wp} factorizes over neurons, so we can estimate the weights W for each neuron separately and achieve computational savings relative to joint estimation over all neurons simultaneously.\nWe can define an ˜r × 1 vector ωi that contains all of the weights for neuron i. Collectively, the ωi can be expressed as\n\n(ω1⊤, . . . , ωn⊤)⊤ = vec( (W1, . . . , WP)⊤ ).\n\nThis notation allows us to do efficient posterior inference over ωi, where the posterior expectation and covariance of ωi are given by\n\nE[ωi | S, Z] = λi Ci⁻¹ S (Xi⊤ ⊗ IT) ζi,    Cov[ωi | S, Z] = Ci⁻¹,\n\nwhere Ci is defined as in (3).\n\n3.3 Decoding\n\nOnce estimates of Bp are obtained, we can decode new trials using the observation likelihood. This is a distinct feature of our method that is not available to dPCA and TDR: those methods estimate the encoding but must learn a separate decoder to decode task variables from the activity. Because of the probabilistic formulation of our model, we can do encoding and decoding within the same framework, allowing us to directly ask questions about how the structure of the encoding influences the ability of downstream populations to decode the information in the recorded population. While we do not pursue decoding further in this paper, we include a description of the optimal decoder in the Supplementary Material.\n\n4 ECME algorithm for parameter estimation\n\nIn general, maximization of the marginal likelihood (2) can be relatively slow when the number of parameters is large. 
We therefore derive an “expectation-conditional maximization, either” (ECME) algorithm [18] in which parameters are estimated block-wise, by either maximizing the conditional expectation of the complete-data log likelihood or the marginal likelihood. Our algorithm has closed-form updates for each parameter block.\nNote that, for Bayesian linear regression with a Gaussian likelihood and prior, an otherwise unstructured model with M parameters would have an ECME update with computational complexity O(M³). In contrast, due to the additional low-rank structure of our model, and despite each M-step updating ˜rT + n parameters, our M-step updates have computational complexity O(˜r³), where typically ˜r ≪ min{n, T}. This means that the actual computational cost of ECME is limited only by the underlying dimensionality of the data, and not by the total number of parameters per se.\nAs we demonstrate in Section 6.1, while our ECME algorithm provides parameter estimates that are only slightly worse in mean-squared error than maximizing the marginal likelihood directly, this small additional error has a serious impact on dimensionality estimation. We therefore use our ECME algorithm to provide fast, high-quality initialization for maximizing the marginal likelihood by gradient ascent.\n\n5 A greedy algorithm for rank estimation\n\nWhile our model can identify subspaces of any dimension up to Dmax = min{n, T}, the dimensionality of each subspace must be specified a priori. Although we may use standard model selection techniques to compare the goodness of fit between models with alternative configurations, an exhaustive search over all possible models would require searching over Dmax^P possible configurations. We therefore developed a greedy algorithm for estimating the optimal dimensionality. 
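A minimal sketch of such a greedy search is given below. It assumes an `aic_fn` callable that refits the model at the given per-variable ranks and returns its AIC; because refitting the full model is out of scope here, a separable toy criterion with a known minimum stands in for the model's AIC so the loop is runnable. The function name, `r_max` cap, and toy criterion are illustrative, not from the paper.

```python
def greedy_rank_search(aic_fn, P, r0=None, r_max=6):
    """Greedy dimensionality search: on each iteration, try adding one
    dimension to each task variable, keep the increment that most
    decreases the AIC, and stop when no increment decreases it."""
    r = list(r0) if r0 is not None else [1] * P
    best = aic_fn(tuple(r))
    while True:
        candidates = []
        for p in range(P):
            if r[p] < r_max:
                trial = tuple(r[q] + (q == p) for q in range(P))
                candidates.append((aic_fn(trial), p))
        if not candidates:
            break
        aic_new, p_star = min(candidates)
        if aic_new >= best:          # AIC can no longer be decreased
            break
        r[p_star] += 1
        best = aic_new
    return tuple(r), best

# Toy stand-in criterion with a unique minimum at ranks (2, 3, 1); since
# it is separable across variables, the greedy search recovers it exactly.
true_r = (2, 3, 1)
toy_aic = lambda r: sum((ri - ti) ** 2 for ri, ti in zip(r, true_r))
ranks, aic = greedy_rank_search(toy_aic, P=3)
```

With a real model, `aic_fn` would be the expensive step, and the loop performs at most P refits per added dimension rather than searching all Dmax^P configurations.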
A summary of the procedure is presented in Algorithm 1.\nRecall that the dimensionality of each task-variable encoding corresponds to the rank of each Bp. We begin the algorithm by first estimating the model parameters with rank rp = 1 for all p (although in principle we may start at rp = 0, denoting the null model for all elements of Bp), giving us a model with total dimensionality ˜r = Σp=1..P rp; at the first iteration, ˜r1 = P. At the jth iteration, we estimate the parameters of P models, where each model has the dimension of one of the task variables increased by 1 while keeping all other dimensionalities the same as in the previous iteration. We then have P models, each with total dimensionality ˜rj+1 = ˜rj + 1. We evaluate the AIC of each of these P models and keep, for the next iteration, the model that displays the greatest decrease in AIC relative to the previous iteration. In this way we grow the total dimensionality of the model by one on each iteration. The algorithm is formally outlined in Algorithm 1.³\n\nAlgorithm 1 Estimation of dimensionality\nLet r ≡ (r1, . . . , rP), let ep be the pth elemental vector, and let AIC(r) be the Akaike information criterion for a model with ranks r.\n1: procedure DIMEST(r0, Data)\n2:   r ← r0, AIC0 ← AIC(r0)    ▷ Initialize\n3:   repeat\n4:     for p = 1, . . . , P do\n5:       AICp ← AIC(r + ep)    ▷ Calculate AIC for +1 rank for each task variable\n6:     end for\n7:     if there is no p s.t. AICp < AIC0 then break\n8:     end if\n9:     p* ← arg min_p AICp, rp* ← rp* + 1    ▷ +1 rank for the variable that most decreases AIC\n10:    r0 ← r, AIC0 ← AICp*\n11:  until there is no p s.t. 
AICp < AIC0    ▷ Stop when the AIC can no longer be decreased\n12:  return r\n13: end procedure\n\n6 Simulation studies\n\n6.1 Evaluation of parameter estimation with simulated data\n\nWe applied our greedy algorithm on simulated data in order to determine if it could accurately recover the true ranks, using n = 100 neurons and T = 15 time points. For each run of our simulations we first selected a random dimensionality between 1 and 6 for each of P = 3 task variables (two graded variables with values drawn from {−2, −1, 0, 1, 2} and one binary task variable with values {−1, 1}). Using these dimensionalities, the elements of Wp and Sp were drawn independently from a N(0, 1) distribution. To give us heterogeneous noise variances, the noise variance for each neuron was drawn from an exponential distribution with mean parameter σ² = 50. The resulting average SNR for any one task variable was −0.26 (±0.75, log10 units). We then simulated observations according to our model with varying numbers of trials (N ∈ {50, 200, 500, 1000, 1500, 2000}). In order to simulate incomplete observations, we set the probability of observing any given neuron on any given trial to πobs = 0.4. While we conducted experiments with varying numbers of trials and observation probabilities, we generally found that decreased observation probabilities acted effectively as a decrease in sample size, with a concomitant decrease in estimation accuracy. The results were not particularly sensitive to the precise observation probability in this regime, and we report only the results for the settings listed above.\n\n³Demonstration code is available for download at the first author’s website at http://www.mikioaoi.com/samplecode/RDRdemo.zip\n\nFor each set of observations, we estimated the parameters of the model using the methods described below. We consider the following four methods for parameter optimization:\n1. 
Linear regression and SVD. The elements of Bp for all p were estimated by linear regression for each neuron and time point independently. Each estimate of the complete matrix Bp can then be expressed via its singular value decomposition (SVD) as Bp = UpDpVp⊤, where Dp is the n × T diagonal matrix of d = min{n, T} singular values. We then set the smallest d − rp singular values to zero, with the resulting matrix of rp nonzero singular values denoted by Dp(rp). The rank-rp estimates of Wp and Sp are then given by Wp(rp) = Up Dp(rp)^{1/2} and Sp(rp) = Dp(rp)^{1/2} Vp⊤, with the corresponding rank-rp estimate of Bp given by Bp(rp) = Wp(rp) Sp(rp). The corresponding likelihood is given by\n\nℓ(Bp | Z, H′, D̂) ∝ Σk Trace[ (Zk − Σp x(p)Bp)⊤ H′DH′⊤ (Zk − Σp x(p)Bp) ].    (4)\n\n2. Bilinear optimization. After initializing with the rank-rp estimates of Wp and Sp from the SVD method, the parameters can be further refined by bilinear regression. On each iteration, the values of the Wp's are fixed, which leads to closed-form updates for the conditional maximum likelihood estimates of the Sp's, and vice versa. The algorithm thus alternates between estimating the Wp's and the Sp's until convergence. The bilinear regression method uses the same likelihood as shown in (4).\n3. ECME. As described in the Supplementary Material.\n4. Maximum marginal likelihood (MML). After initializing with the ECME estimates of Wp and Sp, we estimate Sp by maximizing the marginal likelihood given by (2). No estimation of the Wp factors is required, since the marginal likelihood depends only on Sp.\nFor each setting of trial number N, we repeated this process 100 times and evaluated how well our algorithm estimated the true model parameters. 
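The SVD-truncation step of method 1 can be sketched as follows. This is a minimal sketch: the per-neuron regression that produces each Bp is assumed to have been done already, and the example matrix below is synthetic.

```python
import numpy as np

def truncate_rank(B, r):
    """Rank-r factorization of B via the SVD: B^(r) = W^(r) S^(r),
    with W^(r) = U_r D_r^{1/2} and S^(r) = D_r^{1/2} V_r^T."""
    U, d, Vt = np.linalg.svd(B, full_matrices=False)
    Wr = U[:, :r] * np.sqrt(d[:r])          # n x r, columns scaled by sqrt(d)
    Sr = np.sqrt(d[:r])[:, None] * Vt[:r]   # r x T, rows scaled by sqrt(d)
    return Wr, Sr

rng = np.random.default_rng(1)
B = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 15))  # rank 5
Wr, Sr = truncate_rank(B, r=5)
```

By the Eckart–Young theorem, `Wr @ Sr` is the best rank-r approximation of B in the least-squares sense, and it recovers B exactly when r matches the true rank.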
The results are presented in Figure 1B,C.\nWe found that ECME and MML both produced mean-squared error (MSE) that was substantially smaller at all sample sizes than either the SVD or bilinear methods. While the differences in MSE between the ECME algorithm and MML were small, Figure 1C shows that the ECME algorithm was substantially faster than either MML or bilinear regression.\n\n6.2 Evaluation of dimension estimation with simulated data\n\nFor each of the 100 runs of our simulation experiments, we also evaluated how well our algorithm estimated the dimension of the characteristic responses by evaluating the difference between the true and estimated dimension of each task variable and counting the number of times that difference was observed. The results are presented in Figure 2A.\nWe found that all four methods tended to underestimate the dimensionality as the number of trials decreased, but that this underestimation was less pronounced for the ECME and MML methods, for which the vast majority of estimates resulted in the correct ranks even in the case of N = 50. Note that not only is this half the number of trials as neurons, but since each neuron was only observed on about 40% of the trials, this gives an average of only 20 trials per neuron. Our procedure therefore recovers the true rank of the model the vast majority of the time, even under conditions of very small trial numbers relative to the size of the observations.\nWe were surprised that, despite the modest difference in MSE between the ECME and MML estimation algorithms, the dimension estimation was sensitive to these differences, with ECME performing worse than MML despite the fact that these methods are in theory maximizing the same objective function. Nevertheless, we propose that, due to the ECME algorithm's superior speed, ECME be used as an efficient initializer for MML estimation. 
We found that initializing the rank estimates this way limits the use of MML for rank estimation to just a few iterations.\nFor neuroscience applications, observed spike counts are better described by a Poisson distribution than by a Gaussian. We therefore evaluated the robustness of our algorithm to this type of model misspecification by performing the same dimensionality estimation experiment with 2000 trials, with observations drawn from a Poisson(y(t)) distribution at each time bin. Our results are virtually indistinguishable from experiments using Gaussian observations (Fig. 2A, dashed line).\n\nFigure 2: Simulation studies. A: Results of simulation study evaluating performance of Algorithm 1 for dimensionality estimation by means of different parameter estimation procedures. The legend indicates the sample size. Abscissa indicates the error in the dimensionality estimate. Ordinate gives the number of estimated subspaces that obtained the corresponding error. Dashed line indicates the model-mismatch experiment with Poisson observations and sample size 2000. B: Results of subspace estimation by our MML method compared with dPCA. The MML method outperforms dPCA at all but the highest SNR, where performance is similar.\n\n7 Comparison with dPCA\n\n7.1 Simulation experiments\n\nThe central goal of both our method and dPCA is to recover a basis that defines a set of low-dimensional subspaces that describe how the population varies with respect to each task variable (or pre-defined combination of task variables). In order to compare the quality of the subspaces identified by each method, we conducted a simple simulation study. The simulation setting was identical to those described in Section 6.1 using 100 trials per run, except that, to keep the simulations as simple as possible, we defined just two binary task variables that were drawn randomly on each trial. 
The experiment was repeated for 100 runs.\nWe performed both dPCA and parameter estimation using MML on each run, and then compared the percent mean-squared error between the true subspace and the estimated subspaces. We defined the true subspace based on the left singular vectors of the Bp matrices used for simulating the data. If U is the true subspace and Û is the estimated subspace, then the percent mean-squared error is given by ‖U − ÛÛ⊤U‖₂² / ‖U‖₂².\nThe basis for the subspace estimated by MBTDR can be obtained by first estimating each Bp = WpSp, where Sp was estimated by MML and Wp was estimated by its posterior mean. We then used the left singular vectors of the estimated Bp to define the estimated basis. For the dPCA estimate, the analogous subspace is defined by its “encoding” subspace [8]. For both methods we assumed the correct dimensionality. We used the version of dPCA that is for non-sequential estimation and uses cross-validated regularization parameters. The results are presented in Figure 2B.\nWhen the subspace is recoverable (i.e., the principal angle is significantly less than 90 degrees), our method is virtually always closer to the true subspace. It is notable that the principal angle is an extremely sensitive measure of errors between subspaces and that both methods provide reasonable results when checked by eye. It is also notable that any differences are observable at all, which gives us confidence that these results are quite strong.\n\n
[Figure 2 graphic: panel A, dimensionality-estimation error histograms for the SVD, Bilinear, ECME, and MML procedures; panel B, subspace MSE of MBTDR vs. dPCA as a function of mean SNR (dB).]

8 Concluding remarks

We have introduced a new, model-based method to identify low-dimensional subspaces of neuronal activity that describe the response of neuronal populations to variations in task variables. We have also introduced a procedure for estimating both the parameters of this model and the dimensionality of each of the corresponding subspaces of activity. We compared our method in simulations to dPCA and showed that our method better recovers the low-dimensional subspace of activity for noisy data.

There are a number of additional advantages to using a model-based method for dimensionality reduction. The first is that our modeling framework is general enough to accommodate even more structure in the model, such as structured prior covariances and noise covariances. Our modeling approach also allows us to answer otherwise elusive questions about which quantities of the data are important. For example, virtually all other targeted methods effectively use peri-stimulus time histograms (PSTHs) as the sufficient statistics for subspace estimation. One interesting revelation of our model is that the PSTHs are not sufficient statistics: the sufficient statistics of our model are (I_{R_i}, A_i, y_i⊤y_i, N_i), and these cannot be derived directly from the PSTHs. This suggests that methods relying solely on PSTHs may fail to capture important characteristics of the data.

Acknowledgments

This work was supported by grants from the Simons Foundation (SCGB AWD1004351 and AWD543027), the NIH (R01EY017366, R01NS104899) and a U19 NIH-NINDS BRAIN Initiative Award (NS104648-01).

References

[1] John P Cunningham and Byron M Yu.
Dimensionality reduction for large-scale neural recordings. Nature Neuroscience, 17(11):1500–1509, 2014.

[2] Jeffrey S Seely, Matthew T Kaufman, Stephen I Ryu, Krishna V Shenoy, John P Cunningham, and Mark M Churchland. Tensor analysis reveals distinct population structure that parallels the different computational roles of areas M1 and V1. PLoS Computational Biology, 12(11):e1005164, 2016.

[3] Ari S Morcos and Christopher D Harvey. History-dependent variability in population dynamics during evidence accumulation in cortex. Nature Neuroscience, 2016.

[4] B. M. Yu, J. P. Cunningham, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102(1):614, 2009.

[5] Yuan Zhao and Il Memming Park. Variational latent Gaussian process for recovering single-trial dynamics from population spike trains. arXiv preprint arXiv:1604.03053, 2016.

[6] Afsheen Afshar, Gopal Santhanam, Byron M Yu, Stephen I Ryu, Maneesh Sahani, and Krishna V Shenoy. Single-trial neural correlates of arm movement preparation. Neuron, 71(3):555–564, Aug 2011.

[7] Mark M. Churchland, Byron M. Yu, Maneesh Sahani, and Krishna V Shenoy. Techniques for extracting single-trial activity patterns from large-scale neural recordings. Current Opinion in Neurobiology, 17(5):609–618, Oct 2007.

[8] Dmitry Kobak, Wieland Brendel, Christos Constantinidis, Claudia E Feierstein, Adam Kepecs, Zachary F Mainen, Xue-Lian Qi, Ranulfo Romo, Naoshige Uchida, and Christian K Machens. Demixed principal component analysis of neural population data. eLife, 5:e10989, 2016.

[9] Valerio Mante, David Sussillo, Krishna V Shenoy, and William T Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, 2013.

[10] Christian K. Machens. Demixing population activity in higher cortical areas.
Frontiers in Computational Neuroscience, 4(0), 2010.

[11] C.K. Machens, R. Romo, and C.D. Brody. Functional, but not anatomical, separation of "what" and "when" in prefrontal cortex. The Journal of Neuroscience, 30(1):350–360, 2010.

[12] A.J. Izenman. Reduced-rank regression for the multivariate linear model. Journal of Multivariate Analysis, 5(2):248–264, 1975.

[13] Christopher D Harvey, Philip Coen, and David W Tank. Choice-specific sequences in parietal cortex during a virtual-navigation decision task. Nature, 484(7392):62–68, 2012.

[14] Aishwarya Parthasarathy, Roger Herikstad, Jit Hon Bong, Felipe Salvador Medina, Camilo Libedinsky, and Shih-Cheng Yen. Mixed selectivity morphs population codes in prefrontal cortex. Nature Neuroscience, 20(12):1770–1779, 2017.

[15] Carlos D Brody, Adrián Hernández, Antonio Zainos, and Ranulfo Romo. Timing and neural encoding of somatosensory parametric working memory in macaque prefrontal cortex. Cerebral Cortex, 13(11):1196–1207, 2003.

[16] N. Lawrence. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. The Journal of Machine Learning Research, 6:1816, 2005.

[17] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.

[18] Chuanhai Liu and Donald B Rubin. The ECME algorithm: a simple extension of EM and ECM with faster monotone convergence. Biometrika, 81(4):633–648, 1994.
", "award": [], "sourceid": 3364, "authors": [{"given_name": "Mikio", "family_name": "Aoi", "institution": "Princeton University"}, {"given_name": "Jonathan", "family_name": "Pillow", "institution": "Princeton University"}]}