{"title": "Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity", "book": "Advances in Neural Information Processing Systems", "page_first": 1881, "page_last": 1888, "abstract": "We consider the problem of extracting smooth low-dimensional ``neural trajectories'' that summarize the activity recorded simultaneously from tens to hundreds of neurons on individual experimental trials. Beyond the benefit of visualizing the high-dimensional noisy spiking activity in a compact denoised form, such trajectories can offer insight into the dynamics of the neural circuitry underlying the recorded activity. Current methods for extracting neural trajectories involve a two-stage process: the data are first ``denoised'' by smoothing over time, then a static dimensionality reduction technique is applied. We first describe extensions of the two-stage methods that allow the degree of smoothing to be chosen in a principled way, and account for spiking variability that may vary both across neurons and across time. We then present a novel method for extracting neural trajectories, Gaussian-process factor analysis (GPFA), which unifies the smoothing and dimensionality reduction operations in a common probabilistic framework. We applied these methods to the activity of 61 neurons recorded simultaneously in macaque premotor and motor cortices during reach planning and execution. By adopting a goodness-of-fit metric that measures how well the activity of each neuron can be predicted by all other recorded neurons, we found that GPFA provided a better characterization of the population activity than the two-stage methods. 
From the extracted single-trial neural trajectories, we directly observed a convergence in neural state during motor planning, an effect suggestive of attractor dynamics that was shown indirectly by previous studies.", "full_text": "Gaussian-process factor analysis for low-dimensional\n\nsingle-trial analysis of neural population activity\n\nByron M. Yu1,2,4, John P. Cunningham1, Gopal Santhanam1,\n\nStephen I. Ryu1,3, Krishna V. Shenoy1,2\n\n1Department of Electrical Engineering, 2Neurosciences Program,\n\n3Department of Neurosurgery, Stanford University, Stanford, CA 94305\n\n{byronyu,jcunnin,gopals,seoulman,shenoy}@stanford.edu\n\nManeesh Sahani4\n\n4Gatsby Computational Neuroscience Unit, UCL\n\nLondon, WC1N 3AR, UK\n\nmaneesh@gatsby.ucl.ac.uk\n\nAbstract\n\nWe consider the problem of extracting smooth, low-dimensional neural trajecto-\nries that summarize the activity recorded simultaneously from tens to hundreds of\nneurons on individual experimental trials. Current methods for extracting neural\ntrajectories involve a two-stage process: the data are \ufb01rst \u201cdenoised\u201d by smooth-\ning over time, then a static dimensionality reduction technique is applied. We \ufb01rst\ndescribe extensions of the two-stage methods that allow the degree of smoothing\nto be chosen in a principled way, and account for spiking variability that may vary\nboth across neurons and across time. We then present a novel method for extract-\ning neural trajectories, Gaussian-process factor analysis (GPFA), which uni\ufb01es\nthe smoothing and dimensionality reduction operations in a common probabilis-\ntic framework. We applied these methods to the activity of 61 neurons recorded\nsimultaneously in macaque premotor and motor cortices during reach planning\nand execution. 
By adopting a goodness-of-\ufb01t metric that measures how well the\nactivity of each neuron can be predicted by all other recorded neurons, we found\nthat GPFA provided a better characterization of the population activity than the\ntwo-stage methods.\n\n1\n\nIntroduction\n\nNeural responses are typically studied by averaging noisy spiking activity across multiple experi-\nmental trials to obtain \ufb01ring rates that vary smoothly over time. However, particularly in cognitive\ntasks (such as motor planning or decision making) where the neural responses are more a re\ufb02ection\nof internal processing rather than external stimulus drive, the timecourse of the neural responses may\ndiffer on nominally identical trials. In such settings, it is critical that the neural data not be averaged\nacross trials, but instead be analyzed on a trial-by-trial basis [1, 2, 3, 4].\n\nSingle-trial analyses can leverage the simultaneous monitoring of large populations of neurons in\nvivo, currently ranging from tens to hundreds in awake, behaving animals. The approach adopted\nby recent studies is to consider each neuron being recorded as a noisy sensor re\ufb02ecting the time-\nevolution of an underlying neural process [3, 5, 6, 7, 8, 9, 10]. The goal is to uncover this neural\nprocess by extracting a smooth, low-dimensional neural trajectory from the noisy, high-dimensional\nrecorded activity on a single-trial basis. 
The neural trajectory provides a compact representation of the high-dimensional recorded activity as it evolves over time, thereby facilitating data visualization and studies of neural dynamics under different experimental conditions.\n\nA common method to extract neural trajectories is to first estimate a smooth firing rate profile for each neuron on a single trial (e.g., by convolving each spike train with a Gaussian kernel), then apply a static dimensionality reduction technique (e.g., principal components analysis, PCA) [8, 11]. Smooth firing rate profiles may also be obtained by averaging across a small number of trials (if the neural timecourses are believed to be similar on different trials) [6, 7, 9, 10], or by applying more advanced statistical methods for estimating firing rate profiles from single spike trains [12, 13]. Numerous linear and non-linear dimensionality reduction techniques exist, but to our knowledge only PCA [8, 9, 11] and locally linear embedding (LLE) [6, 7, 10, 14] have been applied in this context to neural data.\n\nWhile this two-stage method of performing smoothing then dimensionality reduction has provided informative low-dimensional views of neural population activity, there are several aspects that can be improved. (i) For kernel smoothing, the degree of smoothness is often chosen in an ad hoc way. We would instead like to learn the appropriate degree of smoothness from the data. Because the operations of kernel smoothing, PCA, and LLE are all non-probabilistic, standard likelihood techniques for model selection are not applicable. Even if a probabilistic dimensionality reduction algorithm is used, the likelihoods would not be comparable because different smoothing kernels yield different smoothed data. (ii) The same kernel width is typically used for all spike trains, which implicitly assumes that the neural population activity evolves with a single timescale. 
We would instead like\nto allow for the possibility that the system operates under multiple timescales. (iii) PCA and LLE\nhave no explicit noise model and, therefore, have dif\ufb01culty distinguishing between spiking noise\n(whose variance may vary both across neurons and across time) and changes in the underlying low-\ndimensional neural state. (iv) Because the smoothing and dimensionality reduction are performed\nsequentially, there is no way for the dimensionality reduction algorithm to in\ufb02uence the degree or\nform of smoothing used. This is relevant both to the identi\ufb01cation of the low-dimensional space, as\nwell as to the extraction of single-trial neural trajectories.\n\nWe \ufb01rst brie\ufb02y describe relatively straightforward extensions of the two-stage methods that can help\nto address issues (i) and (iii) above. For (i), we adopt a goodness-of-\ufb01t metric that measures how\nwell the activity of each neuron can be predicted by the activity of all other recorded neurons, based\non data not used for model \ufb01tting. This metric can be used to compare different smoothing kernels\nand allows for the degree of smoothness to be chosen in a principled way. In Section 6, we will use\nthis as a common metric by which different methods for extracting neural trajectories are compared.\nFor (iii), we can apply the square-root transform to stabilize the spiking noise variance and factor\nanalysis (FA) [15] to explicitly model possibly different independent noise variances for different\nneurons. These extensions are detailed in Sections 2 and 3.\n\nNext, we introduce Gaussian-process factor analysis (GPFA), which uni\ufb01es the smoothing and di-\nmensionality reduction operations in a common probabilistic framework. GPFA takes steps toward\naddressing all of the issues (i)\u2013(iv) described above, and is shown in Section 6 to provide a bet-\nter characterization of the recorded population activity than the two-stage methods. 
Because GPFA\nperforms the smoothing and dimensionality reduction operations simultaneously rather than sequen-\ntially, the degree of smoothness and the relationship between the low-dimensional neural trajectory\nand the high-dimensional recorded activity can be jointly optimized. Different dimensions in the\nlow-dimensional space (within which the neural state evolves) can have different timescales, whose\noptimal values can be found automatically by \ufb01tting the GPFA model to the recorded activity. As in\nFA, GPFA speci\ufb01es an explicit noise model that allows different neurons to have different indepen-\ndent noise variances. The time series model involves Gaussian processes (GP), which only require\nthe speci\ufb01cation of the correlation structure of the neural state over time.\n\nA critical assumption when attempting to extract a low-dimensional neural trajectory is that the\nrecorded activity evolves within a low-dimensional manifold. Previous studies have typically as-\nsumed that the neural trajectories lie in a three-dimensional space for ease of visualization. In this\nwork, we will investigate whether this low-dimensional assumption is justi\ufb01ed in the context of\nmotor preparation and execution and, if so, attempt to identify the appropriate dimensionality. Sec-\ntions 2 and 3 detail GPFA and the goodness-of-\ufb01t metric, respectively. Section 4 relates GPFA to\ndynamical systems approaches. After describing the experimental setup in Section 5, we apply the\n\n2\n\n\fdeveloped methods to neural activity recorded in premotor and motor cortices during reach planning\nand execution in Section 6.\n\n2 Gaussian-process factor analysis\n\nThe motivation for GPFA can be traced back to the use of PCA for extracting informative low-\ndimensional views of high-dimensional neural data. Consider spike counts taken in non-overlapping\ntime bins. 
PCA (or its probabilistic form, PPCA [15]) attempts to \ufb01nd the directions in the high-\ndimensional data with greatest variance. This is problematic for neural data for two reasons. First,\nbecause neurons with higher mean counts are known to exhibit higher count variances, the directions\nfound by PCA tend to be dominated by the most active neurons. Second, PCA assumes that the\nspiking noise variance is time independent; however, neurons are known to change their \ufb01ring rates,\nand therefore noise variances, over time. A possible solution is to replace the Gaussian likelihood\nmodel of PPCA with a point-process [5] or Poisson [3] likelihood model. Here, we consider a\nsimpler approach that preserves computational tractability. The square-root transform is known to\nboth stabilize the variance of Poisson counts and allow Poisson counts to be more closely modeled\nby a Gaussian distribution, especially at low Poisson means [16]. Thus, the two issues above can\nbe largely resolved by applying PCA/PPCA to square-rooted spike counts, rather than raw spike\ncounts. However, the spiking noise can deviate from a Poisson distribution [17], in which case the\nnoise variance is not entirely stabilized. As will be shown in Section 6, the square-rooted counts\ncan be better characterized by further replacing PCA/PPCA with FA [15], which allows different\nneurons to have different independent noise variances.\n\nIn this work, we extend FA for use with time series data. PCA, PPCA, and FA are all static dimen-\nsionality reduction techniques. In other words, none of them take into account time labels when\napplied to time series data; the time series data are simply treated as a collection of data points.\nGPFA is an extension of FA that can leverage the time label information to provide more powerful\ndimensionality reduction. 
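The variance-stabilizing effect of the square-root transform mentioned above is easy to check numerically. The following sketch (ours, not the authors' code) uses NumPy to compare raw and square-rooted Poisson counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Raw Poisson counts have variance equal to their mean, so more active
# neurons are noisier. The variance of square-rooted counts approaches
# 1/4 as the mean grows, largely removing the mean-dependence.
for mean_rate in [2.0, 10.0, 50.0]:
    counts = rng.poisson(mean_rate, size=200_000)
    print(mean_rate, counts.var(), np.sqrt(counts).var())
```

The stabilization is less exact at very low means, which is one reason the text notes that FA's per-neuron noise variances remain useful even after the transform.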
The GPFA model is simply a set of factor analyzers (one per timepoint, each with identical parameters) that are linked together in the low-dimensional state space by a Gaussian process (GP) [18] prior. Introducing the GP allows for the specification of a correlation structure across the low-dimensional states at different timepoints. For example, if the system underlying the time series data is believed to evolve smoothly over time, we can specify that the system's state should be more similar between nearby timepoints than between faraway timepoints. Extracting a smooth, low-dimensional neural trajectory can therefore be viewed as a compromise between the low-dimensional projection of each data point found by FA and the desire to string them together using a smooth function over time. The GPFA model can also be obtained by letting time indices play the role of inputs in the semiparametric latent factor model [19].\n\nThe following is a mathematical description of GPFA. Let y:,t ∈ R^(q×1) be the high-dimensional vector of square-rooted spike counts recorded at timepoint t ∈ {1, . . . , T}, where q is the number of neurons being recorded simultaneously. We seek to extract a corresponding low-dimensional latent neural state x:,t ∈ R^(p×1) at each timepoint, where p is the dimensionality of the state space (p < q). For notational convenience, we group the neural states from all timepoints into a neural trajectory denoted by the matrix X = [x:,1, . . . , x:,T] ∈ R^(p×T). Similarly, the observations can be grouped into a matrix Y = [y:,1, . . . , y:,T] ∈ R^(q×T). We define a linear-Gaussian relationship between the observations y:,t and neural states x:,t\n\ny:,t | x:,t ∼ N(C x:,t + d, R),    (1)\n\nwhere C ∈ R^(q×p), d ∈ R^(q×1), and R ∈ R^(q×q) are model parameters to be learned. 
As in FA, we constrain the covariance matrix R to be diagonal, where the diagonal elements are the independent noise variances of each neuron. In general, different neurons can have different independent noise variances. Although a Gaussian is not strictly a distribution on square-rooted counts, its use in (1) preserves computational tractability.\n\nThe neural states x:,t at different timepoints are related through Gaussian processes, which embody the notion that the neural trajectories should be smooth. We define a separate GP for each dimension of the state space indexed by i ∈ {1, . . . , p}\n\nxi,: ∼ N(0, Ki),    (2)\n\nwhere xi,: ∈ R^(1×T) is the ith row of X and Ki ∈ R^(T×T) is the covariance matrix for the ith GP [20]. The form of the GP covariance can be chosen to provide different smoothing properties on the neural trajectories. In this work, we chose the commonly-used squared exponential (SE) covariance function\n\nKi(t1, t2) = σ²_f,i · exp(−(t1 − t2)² / (2 · τ_i²)) + σ²_n,i · δ_{t1,t2},    (3)\n\nwhere Ki(t1, t2) denotes the (t1, t2)th entry of Ki and t1, t2 ∈ {1, . . . , T}. The SE covariance is defined by its signal variance σ²_f,i ∈ R+, characteristic timescale τ_i ∈ R+, and noise variance σ²_n,i ∈ R+. Due to redundancy in the scale of X and C, we fix the scale of X and allow C to be learned unconstrained, without loss of generality. By direct analogy to FA, we defined the prior distribution of the neural state x:,t at each timepoint t to be N(0, I) by setting σ²_f,i = 1 − σ²_n,i, where 0 < σ²_n,i ≤ 1. Furthermore, because we seek to extract smooth neural trajectories, we set σ²_n,i to a small value (10^−3). Thus, the timescale τ_i is the only (hyper)parameter of the SE covariance that is learned. 
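As a concrete illustration (our sketch, not the authors' code), the SE covariance of (3) with the scale-fixing convention σ²_f,i = 1 − σ²_n,i can be built and sampled from in a few lines of NumPy; the timescale tau is the only free hyperparameter:

```python
import numpy as np

def se_covariance(T, tau, sigma_n2=1e-3):
    """SE covariance K_i of (3) with sigma_f^2 = 1 - sigma_n^2, so that the
    prior on each x_{i,t} is N(0, 1).  tau is the timescale (in time bins)."""
    t = np.arange(T)
    dt = t[:, None] - t[None, :]
    return (1.0 - sigma_n2) * np.exp(-dt**2 / (2.0 * tau**2)) + sigma_n2 * np.eye(T)

rng = np.random.default_rng(0)
T = 100
# Draws from the GP prior x_{i,:} ~ N(0, K_i): a longer timescale gives a
# smoother latent trajectory, a shorter one a wigglier trajectory.
x_fast = rng.multivariate_normal(np.zeros(T), se_covariance(T, tau=2.0))
x_slow = rng.multivariate_normal(np.zeros(T), se_covariance(T, tau=20.0))
```

Note that each K_i has unit diagonal, matching the N(0, I) prior on x:,t at every timepoint.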
The SE is an example of a stationary covariance; other stationary and non-stationary GP covariances [18] can be applied in a seamless way.\n\nThe parameters of the GPFA model can be learned in a straightforward way using the expectation-maximization (EM) algorithm. In the E-step, the Gaussian posterior distribution P(X | Y) can be computed exactly because the x:,t and y:,t across all timepoints are jointly Gaussian, by definition. In the M-step, the parameter updates for C, d, and R can be expressed in closed form. The characteristic timescales τ_i can be updated using any gradient optimization technique. Note that the degree of smoothness (defined by the timescales) and the relationship between the low-dimensional neural trajectory and the high-dimensional recorded activity (defined by C) are jointly optimized. Furthermore, a different timescale is learned for each state dimension indexed by i. For the results shown in Section 6, the parameters C, d, and R were initialized using FA, and the τ_i were initialized to 100 ms. Although the learned timescales were initialization-dependent, their distributions were similar for different initializations. In particular, most learned timescales were less than 150 ms, but there were usually one or two larger timescales around 300 and 500 ms.\n\nOnce the GPFA model is learned, we can apply a post-processing step to orthonormalize the columns of C. Applying the singular value decomposition, C x:,t can be rewritten as U_C (D_C V_C′ x:,t), where the columns of U_C ∈ R^(q×p) are orthonormal and x̃:,t = D_C V_C′ x:,t ∈ R^(p×1) is referred to as the orthonormalized neural state at timepoint t. While each dimension of x:,t possesses a single characteristic timescale, each dimension of x̃:,t represents a mixture of timescales defined by the columns of V_C. 
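The orthonormalization step described above can be sketched as follows; this is a minimal NumPy illustration with made-up C and X (not the authors' code), standing in for the learned loadings and the posterior mean E[X | Y]:

```python
import numpy as np

rng = np.random.default_rng(1)
q, p, T = 61, 10, 50               # neurons, state dims, timepoints (illustrative)
C = rng.standard_normal((q, p))     # stand-in for the learned loading matrix
X = rng.standard_normal((p, T))     # stand-in for the posterior mean E[X | Y]

# SVD of the loading matrix: C = U_C D_C V_C'
U, s, Vt = np.linalg.svd(C, full_matrices=False)
X_orth = np.diag(s) @ Vt @ X        # orthonormalized neural state x~, with
                                    # dimensions ordered by covariance explained

# The columns of U are orthonormal, and U @ X_orth recovers C @ X exactly,
# so the low-dimensional projection of the data is unchanged.
assert np.allclose(U.T @ U, np.eye(p))
assert np.allclose(U @ X_orth, C @ X)
```

Because U has orthonormal columns, each row of X_orth can be read directly as a direction in the high-dimensional space of recorded activity, in the same spirit as PCA.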
An advantage of considering \u02dcx:,t rather than x:,t is that the elements of \u02dcx:,t (and the corre-\nsponding columns of UC) are ordered by the amount of data covariance explained. In contrast, the\nelements of x:,t (and the corresponding columns of C) have no particular order. Especially when\nthe number of state dimensions p is large, the ordering facilitates the identi\ufb01cation and visualization\nof the dimensions of the orthonormalized neural trajectory that are most important for explaining\nthe recorded activity. Because the columns of UC are orthonormal, one can readily picture how the\nlow-dimensional trajectory relates to the high-dimensional space of recorded activity, in much the\nsame spirit as for PCA. This orthonormalization procedure is also applicable to PPCA and FA. In\nfact, it is through this orthonormalization procedure that the principal directions found by PPCA are\nequated to those found by PCA.\n\n3 Leave-neuron-out prediction error\n\nWe would like to directly compare GPFA to the two-stage methods described in Section 1. Neither\nthe classic approach of comparing cross-validated likelihoods nor the Bayesian approach of com-\nparing marginal likelihoods is applicable here, for the same reason that they cannot be used to select\nthe appropriate degree of smoothness in the two-stage methods. Namely, when the data are altered\nby different pre-smoothing operations (or the lack thereof in the case of GPFA), the likelihoods\nare no longer comparable. Instead, we adopted the goodness-of-\ufb01t metric mentioned in Section 1,\nwhereby a prediction error is computed based on trials not used for model \ufb01tting. The idea is to\nleave out one neuron at a time and ask how well each method is able to predict the activity of that\nneuron, given the activity of all other recorded neurons. 
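For the static models (PPCA and FA), this leave-neuron-out prediction is a standard conditional mean of a jointly Gaussian vector. The following is a self-contained sketch with made-up model parameters (our illustration under the FA marginal y ~ N(d, CC' + R), not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(2)
q, p = 8, 3                              # small sizes for illustration
C = rng.standard_normal((q, p))
d = rng.standard_normal(q)
R = np.diag(rng.uniform(0.1, 0.5, q))    # diagonal FA noise covariance
Sigma = C @ C.T + R                      # marginal covariance of y under FA

y = rng.multivariate_normal(d, Sigma)    # one observed square-rooted count vector

def predict_neuron(j, y):
    """E[y_j | y_{-j}] under the Gaussian FA marginal."""
    keep = np.arange(q) != j
    S_oo = Sigma[np.ix_(keep, keep)]
    S_jo = Sigma[j, keep]
    return d[j] + S_jo @ np.linalg.solve(S_oo, y[keep] - d[keep])

# Prediction error: sum of squared errors over neurons (and, in the paper,
# over timepoints and cross-validation folds).
prediction_error = sum((y[j] - predict_neuron(j, y)) ** 2 for j in range(q))
```

Because the metric is computed on held-out data in the observed space, it remains comparable across models fit to differently pre-smoothed inputs, which cross-validated likelihoods are not.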
For GPFA, the model prediction for neuron j is ŷj,: = E[yj,: | Y−j,:], where yj,: is the jth row of Y and Y−j,: ∈ R^((q−1)×T) represents all but the jth row of Y. The model prediction can be computed analytically because all variables in Y are jointly Gaussian, by definition. Model predictions using PPCA and FA are analogous, but each timepoint is considered individually. The prediction error is defined as the sum-of-squared errors between the model prediction and the observed square-rooted spike count across all neurons and timepoints.\n\nOne way to compute the GPFA model prediction is via the low-dimensional state space. One can first estimate the neural trajectory using all but the jth neuron P(X | Y−j,:), then map this estimate back out into the space of recorded activity for the jth neuron using (1) to obtain ŷj,:. Equivalently, one can convert P(X | Y−j,:) into its orthonormalized form before mapping it out into the space of recorded activity using the jth row of U_C. Because the orthonormalized dimensions are ordered, we can evaluate the prediction error using only the top p̃ orthonormalized dimensions of x̃:,t, where p̃ ∈ {1, . . . , p}. This reduced GPFA model can make use of a larger number p of timescales than its effective dimensionality p̃.\n\n4 Linear and non-linear dynamical systems\n\nAnother way to extract neural trajectories is by defining a parametric dynamical model that describes how the low-dimensional neural state evolves over time. A first-order linear auto-regressive (AR) model [5] captures linear Markovian dynamics. Such a model can be expressed as a Gaussian process, since the state variables are jointly Gaussian. This can be shown by defining a separate first-order AR model for each state dimension indexed by i ∈ {1, . . . , p}\n\nxi,t+1 | xi,t ∼ N(a_i xi,t, σ²_i).    (4)\n\nGiven enough time (t → ∞) and |a_i| < 1, the model will settle into a stationary state that is equivalent to (2) with\n\nKi(t1, t2) = (σ²_i / (1 − a²_i)) · a_i^|t1−t2|,    (5)\n\nas in [21]. Different covariance structures Ki can be obtained by going from a first-order to an nth-order AR model. One drawback of this approach is that it is usually not easy to construct an nth-order AR model with a specified covariance structure. In contrast, the GP approach described in Section 2 requires only the specification of the covariance structure, thus allowing different smoothing properties to be applied in a seamless way. AR models are generally less computationally demanding than those based on GP, but this advantage shrinks as the order of the AR model grows. Another difference is that (5) does not contain an independent noise term σ²_n,i · δ_{t1,t2} as in (3). The innovations noise σ²_i in (4) is involved in setting the smoothness of the time series, as shown in (5). Thus, (4) would need to be augmented to explicitly capture departures from the AR model.\n\nOne may also consider defining a non-linear dynamical model [3], which typically has a richer set of dynamical behaviors than linear models. The identification of the model parameters provides insight into the dynamical rules governing the time-evolution of the system under study. However, especially in exploratory data analyses, it may be unclear what form this model should take. Even if an appropriate non-linear model can be identified, learning such a model can be unstable and slow due to approximations required [3]. In contrast, learning the GPFA model is stable and approximation-free, as described in Section 2. 
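The equivalence between the stationary first-order AR model and the covariance in (5) can be checked directly: the matrix built from (5) satisfies the Markov recursion implied by (4). A short deterministic sketch (ours, with arbitrary illustrative values of a_i and σ²_i):

```python
import numpy as np

a, sigma2, T = 0.9, 0.5, 30       # illustrative AR(1) parameters, |a| < 1

# Stationary covariance of the AR(1) process, as in (5):
# K(t1, t2) = sigma2 / (1 - a^2) * a^|t1 - t2|
t = np.arange(T)
K = sigma2 / (1.0 - a**2) * a ** np.abs(t[:, None] - t[None, :])

# Consistency with x_{t+1} | x_t ~ N(a x_t, sigma2):
# Cov(x_{t+1}, x_t) = a Var(x_t), and Var(x_{t+1}) = a^2 Var(x_t) + sigma2.
assert np.allclose(K[1:, :-1].diagonal(), a * K.diagonal()[:-1])
assert np.allclose(K.diagonal()[1:], a**2 * K.diagonal()[:-1] + sigma2)
```

As the text notes, (5) has no independent noise term analogous to σ²_n,i in (3); the innovations variance sigma2 instead shapes the smoothness of the process itself.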
The use of GPFA can be viewed as a practical way of going beyond a first-order linear AR model without having to commit to a particular non-linear system, while retaining computational tractability.\n\n5 Behavioral task and neural recordings\n\nThe details of the neural recordings and behavioral task can be found elsewhere [22]. Briefly, a rhesus macaque performed delayed center-out reaches to visual targets presented on a fronto-parallel screen. On a given trial, the peripheral reach target was presented at one of 14 possible locations: two distances (60 and 100 mm) and seven directions (0, 45, 90, 135, 180, 225, 315°). Delay periods were randomly chosen between 200 and 700 ms. Neural activity was recorded using a 96-electrode array (Cyberkinetics, Foxborough, MA) in dorsal premotor and motor cortices. Only those units (61 single and multi-units, experiment G20040123) with robust delay period activity were included in our analyses.\n\n6 Results\n\nFigure 1: Prediction errors of two-stage methods (PPCA: red, FA: green), first-order AR model (blue), GPFA (dashed black), and reduced GPFA (solid black), computed using 4-fold cross-validation. Labels at right are standard deviations of Gaussian kernels (referred to as kernel widths) for the two-stage methods. For reduced GPFA, the horizontal axis corresponds to p̃ rather than p, where the prediction error is computed using only the top p̃ orthonormalized dimensions of a GPFA model fit with p = 15. Star indicates minimum of solid black curve. 
Analyses in this figure are based on 56 trials for the reach target at distance 60 mm and direction 135°.\n\nWe considered neural data for one reach target at a time, ranging from 200 ms before reach target onset to movement end. This period comprised the 200 ms pre-target time, the randomly chosen delay period (200–700 ms), the monkey's reaction time (mean ± s.d.: 293 ± 48 ms), and the duration of the monkey's reach (269 ± 40 ms). Spike counts were taken in non-overlapping 20 ms bins, then square-rooted. For the two-stage methods, these square-rooted counts were smoothed over time using a Gaussian kernel. We also considered smoothing spike trains directly, which yielded qualitatively similar results for the two-stage methods.\n\nUsing the goodness-of-fit metric described in Section 3, we can find the appropriate degree of smoothness for the two-stage methods. Fig. 1 shows the prediction error for PPCA (red) and FA (green) for different kernel widths and state dimensionalities. There are two primary findings. First, FA yielded lower prediction error than PPCA across a range of kernel widths and state dimensionalities. The reason is that FA allows different neurons to have different independent noise variances. Second, for these data, the optimal smoothing kernel width (s.d. of Gaussian kernel) is approximately 40 ms for both FA and PPCA. This was found using a denser sweep of the kernel width than shown in Fig. 1.\n\nIt is tempting to try to relate this optimal smoothing kernel width (40 ms) to the timescales τ_i learned by GPFA, since the SE covariance has the same shape as the Gaussian smoothing kernel. However, nearly all of the timescales learned by GPFA are greater than 40 ms. 
This apparent mismatch can be understood by considering the equivalent kernel of the SE covariance [23], which takes on a sinc-like shape whose main lobe is generally far narrower than a Gaussian kernel with the same width parameter. It is therefore reasonable that the timescales learned by GPFA are larger than the optimal smoothing kernel width.\n\nThe same goodness-of-fit metric can be used to compare the two-stage methods, parametric dynamical models, and GPFA. The parametric dynamical model considered in this work is a first-order AR model described by (2) and (5), coupled with the linear-Gaussian observation model (1). Note that a separate stationary, one-dimensional first-order AR model is defined for each of the p latent dimensions. As shown in Fig. 1, the first-order AR model (blue) yielded lower prediction error than the two-stage methods (PPCA: red, FA: green). Furthermore, GPFA (dashed black) performed as well or better than the two-stage methods and the first-order AR model, regardless of the state dimensionality or kernel width used. As described in Section 3, the prediction error can also be computed for a reduced GPFA model (solid black) using only the top p̃ orthonormalized dimensions, in this case based on a GPFA model fit with p = 15 state dimensions. By definition, the dashed and solid black lines coincide at p̃ = 15. The solid black curve reaches its minimum at p̃ = 10 (referred to as p*). Thus, removing the lowest five orthonormalized dimensions decreased the GPFA prediction error. Furthermore, this prediction error was lower than when fitting the GPFA model directly with p = 10 (dashed black).\n\nThese latter findings can be understood by examining the orthonormalized neural trajectories extracted by GPFA shown in Fig. 2. 
The traces plotted are the orthonormalized form of E[X | Y]. The panels are arranged in decreasing order of data covariance explained.\n\nFigure 2: Orthonormalized neural trajectories for GPFA with p = 15. Each panel corresponds to one of the 15 dimensions of the orthonormalized neural state, which is plotted versus time. The orthonormalized neural trajectory for one trial comprises one black trace from each panel. Dots indicate time of reach target onset (red), go cue (green), and movement onset (blue). Due to differing trial lengths, the traces on the left/right half of each panel are aligned on target/movement onset for clarity. However, the GPFA model was fit using entire trials with no gaps. Note that the polarity of these traces is arbitrary, as long as it is consistent with the polarity of U_C. Each trajectory corresponds to planning and executing a reach to the target at distance 60 mm and direction 135°. For clarity, only 10 trials with delay periods longer than 400 ms are plotted.\n\nThe top orthonormalized dimensions indicate fluctuations in the recorded population activity shortly after target onset (red dots) and again after the go cue (green dots). Furthermore, the neural trajectories around the time of the arm movement are well-aligned on movement onset. These observations are consistent with previous analyses of the same dataset [22], as well as other studies of neural activity collected during similar tasks in the same cortical areas. 
Whereas the top 10 orthonormalized dimensions (upper and middle rows) show repeatable temporal structure across trials, the bottom five dimensions (lower row) appear to be largely capturing noise. These "noise dimensions" could be limiting GPFA's predictive power. This is confirmed by Fig. 1: when the bottom five orthonormalized dimensions were removed, the GPFA prediction error decreased.\n\nIt still remains to be explained why the GPFA prediction error using only the top 10 orthonormalized dimensions is lower than that obtained by directly fitting a GPFA model with p = 10. Each panel in Fig. 2 represents a mixture of 15 characteristic timescales. Thus, the top 10 orthonormalized dimensions can make use of up to 15 timescales. However, a GPFA model fit with p = 10 can have at most 10 timescales. By fitting a GPFA model with a large number of state dimensions p (each with its own timescale) and taking only the top p̃ = p* orthonormalized dimensions, we can obtain neural trajectories whose effective dimensionality is smaller than the number of timescales at play.\n\nBased on the solid black line in Fig. 1 and Fig. 2, we consider the effective dimensionality of the recorded population activity to be p* = 10. In other words, the linear subspace within which the recorded activity evolved during reach planning and execution for this particular target was 10-dimensional. Across the 14 reach targets, the effective dimensionality ranged from 8 to 12. All major trends seen in Fig. 1 were preserved across all reach targets.\n\n7 Conclusion\n\nGPFA offers a flexible and intuitive framework for extracting neural trajectories, whose learning algorithm is stable, approximation-free, and simple to implement. 
Because only the GP covariance structure needs to be specified, GPFA is particularly attractive for exploratory data analyses, where the rules governing the dynamics of the system under study are unknown. Based on the trajectories obtained by GPFA, one can then attempt to define an appropriate dynamical model that describes how the neural state evolves over time.

Compared with two-stage methods, the choice of GP covariance allows for more explicit specification of the smoothing properties of the low-dimensional trajectories. This is important when investigating (possibly subtle) properties of the system dynamics. For example, one may wish to ask whether the system exhibits second-order dynamics by examining the extracted trajectories. In this case, it is critical that second-order effects not be built in by the smoothness assumptions used to extract the trajectories. With GPFA, it is possible to select a triangular GP covariance that assumes smoothness in position, but not in velocity. In contrast, it is unclear how to choose the shape of the smoothing kernel to achieve this in the two-stage methods.

In future work, we would like to couple the covariance structure of the one-dimensional GPs, which would allow for a richer description of the multi-dimensional neural state x:,t evolving over time. We also plan to apply non-stationary GP kernels, since the neural data collected during a behavioral task are usually non-stationary. In addition, we would like to extend GPFA by allowing for the discovery of non-linear manifolds and applying point-process likelihood models.

Acknowledgments

This work was supported by NIH-NINDS-CRCNS 5-R01-NS054283-03, NSF, NDSEGF, Gatsby, SGF, CDRF, BWF, ONR, Sloan, and Whitaker. We would like to thank Dr. Mark Churchland, Melissa Howard, Sandra Eisensee, and Drew Haven.

References

[1] K. L. Briggman, H. D. I. Abarbanel, and W. B. Kristan Jr. Science, 307(5711):896–901, Feb.
2005.
[2] K. L. Briggman, H. D. I. Abarbanel, and W. B. Kristan Jr. Curr Opin Neurobiol, 16(2):135–144, 2006.
[3] B. M. Yu, A. Afshar, G. Santhanam, S. I. Ryu, K. V. Shenoy, and M. Sahani. In Y. Weiss, B. Scholkopf, and J. Platt, eds., Adv Neural Info Processing Sys 18, pp. 1545–1552. MIT Press, 2006.
[4] M. M. Churchland, B. M. Yu, M. Sahani, and K. V. Shenoy. Curr Opin Neurobiol, 17(5):609–618, 2007.
[5] A. C. Smith and E. N. Brown. Neural Comput, 15(5):965–991, 2003.
[6] M. Stopfer, V. Jayaraman, and G. Laurent. Neuron, 39:991–1004, Sept. 2003.
[7] S. L. Brown, J. Joseph, and M. Stopfer. Nat Neurosci, 8(11):1568–1576, Nov. 2005.
[8] R. Levi, R. Varona, Y. I. Arshavsky, M. I. Rabinovich, and A. I. Selverston. J Neurosci, 25(42):9807–9815, Oct. 2005.
[9] O. Mazor and G. Laurent. Neuron, 48:661–673, Nov. 2005.
[10] B. M. Broome, V. Jayaraman, and G. Laurent. Neuron, 51:467–482, Aug. 2006.
[11] M. A. L. Nicolelis, L. A. Baccala, R. C. S. Lin, and J. K. Chapin. Science, 268(5215):1353–1358, 1995.
[12] I. DiMatteo, C. R. Genovese, and R. E. Kass. Biometrika, 88(4):1055–1071, 2001.
[13] J. P. Cunningham, B. M. Yu, K. V. Shenoy, and M. Sahani. In J. Platt, D. Koller, Y. Singer, and S. Roweis, eds., Adv Neural Info Processing Sys 20. MIT Press, 2008.
[14] S. T. Roweis and L. K. Saul. Science, 290(5500):2323–2326, Dec. 2000.
[15] S. Roweis and Z. Ghahramani. Neural Comput, 11(2):305–345, 1999.
[16] N. A. Thacker and P. A. Bromiley. The effects of a square root transform on a Poisson distributed quantity. Technical Report 2001-010, University of Manchester, 2001.
[17] D. J. Tolhurst, J. A. Movshon, and A. F. Dean. Vision Res, 23(8):775–785, 1983.
[18] C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
[19] Y. W. Teh, M. Seeger, and M. I. Jordan. In R. G. Cowell and Z. Ghahramani, eds., Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics (AISTATS). Society for Artificial Intelligence and Statistics, 2005.
[20] N. D. Lawrence and A. J. Moore. In Z. Ghahramani, ed., Proceedings of the 24th Annual International Conference on Machine Learning (ICML 2007), pp. 481–488. Omnipress, 2007.
[21] R. E. Turner and M. Sahani. Neural Comput, 19(4):1022–1038, 2007.
[22] M. M. Churchland, B. M. Yu, S. I. Ryu, G. Santhanam, and K. V. Shenoy. J Neurosci, 26(14):3697–3712, Apr. 2006.
[23] P. Sollich and C. K. I. Williams. In L. K. Saul, Y. Weiss, and L. Bottou, eds., Advances in Neural Information Processing Systems 17, pp. 1313–1320. MIT Press, 2005.