{"title": "Unsupervised Discovery of Temporal Structure in Noisy Data with Dynamical Components Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 14267, "page_last": 14278, "abstract": "Linear dimensionality reduction methods are commonly used to extract low-dimensional structure from high-dimensional data. However, popular methods disregard temporal structure, rendering them prone to extracting noise rather than meaningful dynamics when applied to time series data. At the same time, many successful unsupervised learning methods for temporal, sequential and spatial data extract features which are predictive of their surrounding context. Combining these approaches, we introduce Dynamical Components Analysis (DCA), a linear dimensionality reduction method which discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future. We test DCA on synthetic examples and demonstrate its superior ability to extract dynamical structure compared to commonly used linear methods. We also apply DCA to several real-world datasets, showing that the dimensions extracted by DCA are more useful than those extracted by other methods for predicting future states and decoding auxiliary variables. Overall, DCA robustly extracts dynamical structure in noisy, high-dimensional data while retaining the computational efficiency and geometric interpretability of linear dimensionality reduction methods.", "full_text": "Unsupervised Discovery of Temporal Structure in\nNoisy Data with Dynamical Components Analysis\n\nDavid G. Clark\u2217,1,2\n\nJesse A. Livezey\u2217,2,3 Kristofer E. 
Bouchard2,3,4\n\ndgc2138@cumc.columbia.edu\n\nkebouchard@lbl.gov\n\njlivezey@lbl.gov\n\n\u2217Equal contribution.\n\n1Center for Theoretical Neuroscience, Columbia University\n\n2Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory\n\n3Redwood Center for Theoretical Neuroscience, University of California, Berkeley\n\n4Helen Wills Neuroscience Institute, University of California, Berkeley\n\nAbstract\n\nLinear dimensionality reduction methods are commonly used to extract low-\ndimensional structure from high-dimensional data. However, popular methods\ndisregard temporal structure, rendering them prone to extracting noise rather than\nmeaningful dynamics when applied to time series data. At the same time, many\nsuccessful unsupervised learning methods for temporal, sequential and spatial data\nextract features which are predictive of their surrounding context. Combining these\napproaches, we introduce Dynamical Components Analysis (DCA), a linear dimen-\nsionality reduction method which discovers a subspace of high-dimensional time\nseries data with maximal predictive information, de\ufb01ned as the mutual information\nbetween the past and future. We test DCA on synthetic examples and demon-\nstrate its superior ability to extract dynamical structure compared to commonly\nused linear methods. We also apply DCA to several real-world datasets, showing\nthat the dimensions extracted by DCA are more useful than those extracted by\nother methods for predicting future states and decoding auxiliary variables. Over-\nall, DCA robustly extracts dynamical structure in noisy, high-dimensional data\nwhile retaining the computational ef\ufb01ciency and geometric interpretability of linear\ndimensionality reduction methods.\n\n1\n\nIntroduction\n\nExtracting meaningful structure from noisy, high-dimensional data in an unsupervised manner\nis a fundamental problem in many domains including neuroscience, physics, econometrics and\nclimatology. 
In the case of time series data, e.g., the spiking activity of a network of neurons or the time-varying prices of many stocks, one often wishes to extract features which capture the dynamics underlying the system which generated the data. Such dynamics are often expected to be low-dimensional, reflecting the fact that the system has fewer effective degrees of freedom than observed variables. For instance, in neuroscience, recordings of 100s of neurons during simple stimuli or behaviors generally contain only \u223c10 relevant dimensions [1]. In such cases, dimensionality reduction methods may be used to uncover the low-dimensional dynamical structure.\n\nLinear dimensionality reduction methods are popular since they are computationally efficient, often reducing to generalized eigenvalue or simple optimization problems, and geometrically interpretable, since the high- and low-dimensional variables are related by a simple change of basis [2]. Analyzing the new basis can provide insight into the relationship between the high- and low-dimensional variables [3]. However, many popular linear methods including Principal Components Analysis, Factor Analysis and Independent Components Analysis disregard temporal structure, treating data at different time steps as independent samples from a static distribution. Thus, these methods do not recover dynamical structure unless it happens to be associated with the static structure targeted by the chosen method.\n\nDCA code is available at: https://github.com/BouchardLab/DynamicalComponentsAnalysis\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nOn the other hand, several sophisticated unsupervised learning methods for temporal, sequential and spatial data have recently been proposed, many of them rooted in prediction. These prediction-based methods extract features which are predictive of the future (or surrounding sequential or spatial context) [4\u20139]. 
Predictive features form useful representations since they are generally linked to the dynamics, computation or other latent structure of the system which generated the data. Predictive features are also of interest to organisms, which must make internal estimates of the future of the world in order to guide behavior and compensate for latencies in sensory processing [10]. These ideas have been formalized mathematically [11, 12] and tested experimentally [13].\n\nWe introduce Dynamical Components Analysis (DCA), a novel method which combines the computational efficiency and ease of interpretation of linear dimensionality reduction methods with the temporal structure-discovery power of prediction-based methods. Specifically, DCA discovers a subspace of high-dimensional time series data with maximal predictive information, defined as the mutual information between the past and future [12]. To make the predictive information differentiable and accurately estimable, we employ a Gaussian approximation of the data; however, we show that maximizing this approximation can yield near-optima of the full information-theoretic objective. We compare and contrast DCA with several existing methods, including Principal Components Analysis and Slow Feature Analysis, and demonstrate the superior ability of DCA to extract dynamical structure in synthetic data. We apply DCA to several real-world datasets including neural population activity, multi-city weather data and human kinematics. In all cases, we show that DCA outperforms commonly used linear dimensionality reduction methods at predicting future states and decoding auxiliary variables. 
Altogether, our results establish that DCA is an ef\ufb01cient and robust\nlinear method for extracting dynamical structure embedded in noisy, high-dimensional time series.\n\n2 Dynamical Components Analysis\n\n2.1 Motivation\n\nDimensionality reduction methods that do not take time into account will miss dynamical structure\nthat is not associated with the static structure targeted by the chosen method. We demonstrate this\nconcretely in the context of Principal Components Analysis (PCA), whose static structure of interest is\nvariance [14, 15]. Variance arises in time series due to both dynamics and noise, and the dimensions\nof greatest variance, found by PCA, contain contributions from both sources in general. Thus, PCA\nis prone to extracting spatially structured noise rather than dynamics if the noise variance dominates,\nor is comparable to, the dynamics variance (Fig. 1A). We note that for applications in which generic\nshared variability due to both dynamics and spatially structured noise is of interest, static methods are\nwell-suited.\nTo further illustrate this failure mode of PCA, suppose we embed a low-dimensional dynamical\nsystem, e.g., a Lorenz attractor, in a higher-dimensional space via a random embedding (Fig. 1B,C).\nWe then add spatially anisotropic Gaussian white noise (Fig. 1D). We de\ufb01ne a signal-to-noise ratio\n(SNR) given by the ratio of the variances of the \ufb01rst principal components of the dynamics and\nnoise. When the SNR is small, the noise variance dominates the dynamics variance and PCA\nprimarily extracts noise, missing the dynamics. Only when the SNR becomes large does PCA extract\ndynamical structure (Fig. 1F,G, black). Rather than maximizing variance, DCA \ufb01nds a projection\nwhich maximizes the mutual information between past and future windows of length T (Fig. 
1E). As we will show, this mutual information is maximized precisely when the projected time series contains as much dynamical structure, and as little noise, as possible. As a result, DCA extracts dynamical structure even for small SNR values, and consistently outperforms PCA in terms of dynamics reconstruction performance as the SNR grows (Fig. 1F,G, red).\n\nFigure 1: DCA finds dynamics rather than variance. (A) Schematic of unit vectors found by PCA and DCA for three relative levels of dynamics and noise. The dimension of greatest variance, found by PCA, contains contributions from both sources while the dimension found by DCA is orthogonal to the noise. (B) Lorenz attractor in the chaotic regime. (C) Random orthogonal embedding of the Lorenz attractor into 30-dimensional space. (D) Embedded Lorenz attractor with spatially-structured white noise. (E) Random three-dimensional projection (top) and DCA projection (bottom) of the embedded Lorenz attractor. (F) Reconstructions of the Lorenz attractor given the three-dimensional projections found by DCA and PCA. (G) Lorenz reconstruction performance (R2) as a function of the SNR for both methods. See Appendix B for details of the noisy Lorenz embedding.\n\n2.2 Predictive information as an objective function\n\nThe goal of DCA is to extract a subspace with maximal dynamical structure. One fundamental characteristic of dynamics is predictability: in a system with dynamics, future uncertainty is reduced by knowledge of the past. This reduction in future uncertainty may be quantified using information theory. In particular, if we equate uncertainty with entropy, this reduction in future uncertainty is the mutual information between the past and future. This quantity was termed predictive information by Bialek et al. [12]. Formally, consider a discrete time series X = {x_t}, x_t \u2208 R^n, with a stationary (time translation-invariant) probability distribution P(X). 
Let Xpast and Xfuture denote consecutive length-T windows of X, i.e., Xpast = (x_{\u2212T+1}, . . . , x_0) and Xfuture = (x_1, . . . , x_T). Then, the predictive information Ipred_T(X) is defined as\n\nIpred_T(X) = H(Xfuture) \u2212 H(Xfuture | Xpast)\n           = H(Xpast) + H(Xfuture) \u2212 H(Xpast, Xfuture)\n           = 2H_X(T) \u2212 H_X(2T)    (1)\n\nwhere H_X(T) is the entropy of any length-T window of X, which is well-defined by virtue of the stationarity of X. Unlike entropy and related measures such as Kolmogorov complexity [16], predictive information is minimized, not maximized, by serially independent time series (white noise). This is because predictive information captures the sub-extensive component of the entropy of X. Specifically, if the data points that comprise X are mutually independent, then H_X(\u03b1T) = \u03b1H_X(T) for all \u03b1 and T, meaning that the entropy is perfectly extensive. On the other hand, if X has temporal structure, then H_X(\u03b1T) < \u03b1H_X(T) and the entropy has a sub-extensive component given by \u03b1H_X(T) \u2212 H_X(\u03b1T) > 0. Upon setting \u03b1 = 2, this sub-extensive component is the predictive information.\n\nBeyond simply being able to detect the presence of temporal structure in time series, predictive information discriminates between different types of structure. For example, consider two discrete-time Gaussian processes with autocovariance functions f_1(\u2206t) = exp(\u2212|\u2206t/\u03c4|) and f_2(\u2206t) = exp(\u2212\u2206t^2/\u03c4^2). For \u03c4 \u226b 1, the predictive information in these time series saturates as T \u2192 \u221e to c_1 (log \u03c4)/2 and c_2 \u03c4^4, respectively, where c_1 and c_2 are constants of order unity (see Appendix D for derivation).
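To make Eq. 1 concrete: for a stationary Gaussian process, H_X(T) reduces to the log-determinant of a Toeplitz window covariance (the additive entropy constants cancel in Eq. 1), so the predictive information can be evaluated in closed form. The sketch below is our own illustration, not code from the paper; it uses an AR(1) process, a discrete-time analog of the exponential-autocovariance example above, for which Ipred_T is independent of T by the Markov property.

```python
import numpy as np

def window_cov(autocov, T):
    # Toeplitz covariance matrix of a length-T window of a stationary scalar process.
    idx = np.arange(T)
    return autocov[np.abs(idx[:, None] - idx[None, :])]

def predictive_information(autocov, T):
    # Eq. 1 for a Gaussian process: I_T = 2 H_X(T) - H_X(2T); the additive
    # (2*pi*e)-terms of the entropies cancel, leaving log-determinants.
    logdet = lambda S: np.linalg.slogdet(S)[1]
    return logdet(window_cov(autocov, T)) - 0.5 * logdet(window_cov(autocov, 2 * T))

def ar1_autocov(phi, n_lags):
    # Autocovariance c(k) = phi^k / (1 - phi^2) of a unit-noise AR(1) process.
    return phi ** np.arange(n_lags) / (1.0 - phi ** 2)
```

For phi = 0.9 this returns \u2212(1/2) log(1 \u2212 0.9^2) \u2248 0.83 nats for every T, while a delta-function autocovariance (white noise) gives exactly zero, consistent with the discussion above.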
The disparity in the predictive information of these time series corresponds to differences in their underlying dynamics. In particular, f_1(\u2206t) describes Markovian dynamics, leading to small predictive information, whereas f_2(\u2206t) describes longer-timescale dependencies, leading to large predictive information. Finally, as discussed by Bialek et al. [12], the predictive information of many time series diverges with T. In these cases, different scaling behaviors of the predictive information correspond to different classes of time series. For one-dimensional time series, it was demonstrated that the divergent predictive information provides a unique complexity measure given simple requirements [12].\n\n2.3 The DCA method\n\nDCA takes as input samples x_t \u2208 R^n of a discrete time series X, as well as a target dimensionality d \u2264 n, and outputs a projection matrix V \u2208 R^{n\u00d7d} such that the projected data y_t = V^T x_t maximize an empirical estimate of Ipred_T(Y). In certain cases of theoretical interest, P(X) is known and Ipred_T(Y) may be computed exactly for a given projection V. Systems for which this is possible include linear dynamical systems with Gaussian noise and Gaussian processes more broadly. In practice, however, we must estimate Ipred_T(Y) from finitely many samples. Directly estimating mutual information from multidimensional data with continuous support is possible, and popular nonparametric methods include those based on binning [17, 18], kernel density estimation [19] and k-nearest neighbor (kNN) statistics [20]. However, many of these nonparametric methods are not differentiable (e.g., kNN-based methods involve counting data points), complicating optimization. 
Moreover, these methods are typically sensitive to the choice of hyperparameters [21] and suffer from the curse of dimensionality, requiring prohibitively many samples for accurate results [22].\n\nTo circumvent these challenges, we assume that X is a stationary (discrete-time) Gaussian process. It then follows that Y is stationary and Gaussian since Y is a linear projection of X. Under this assumption, Ipred_T(Y) may be computed from the second-order statistics of Y, which may in turn be computed from the second-order statistics of X given V. Crucially, this estimate of Ipred_T(Y) is differentiable in V. Toward expressing Ipred_T(Y) in terms of V, we define \u03a3_T(X), the spatiotemporal covariance matrix of X, which encodes all second-order statistics of X across T time steps. Assuming that \u27e8x_t\u27e9_t = 0, we have\n\n\u03a3_T(X) = [ C_0        C_1        . . .  C_{T\u22121} ;\n           C_1^T      C_0        . . .  C_{T\u22122} ;\n           . . .                 . . .  . . . ;\n           C_{T\u22121}^T  C_{T\u22122}^T  . . .  C_0 ]    where  C_\u2206t = \u27e8x_t x_{t+\u2206t}^T\u27e9_t.    (2)\n\nThen, the spatiotemporal covariance matrix of Y, \u03a3_T(Y), is given by sending C_\u2206t \u2192 V^T C_\u2206t V in \u03a3_T(X). Finally, Ipred_T(Y) is given by\n\nIpred_T(Y) = 2H_Y(T) \u2212 H_Y(2T) = log |\u03a3_T(Y)| \u2212 (1/2) log |\u03a3_{2T}(Y)|.    (3)\n\nTo run DCA on data, we first compute the 2T cross-covariance matrices C_0, . . . , C_{2T\u22121}, then maximize the expression for Ipred_T(Y) of Eq. 3 with respect to V (see Appendix A for implementation details). Note that Ipred_T(Y) is invariant under invertible linear transformations of the columns of V. Thus, DCA finds a subspace as opposed to an ordered sequence of one-dimensional projections.\n\nOf course, real data violate the assumptions of both stationarity and Gaussianity. 
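The procedure just described can be sketched compactly. The snippet below is an illustrative re-implementation (the reference implementation lives in the linked repository): it assembles the block-Toeplitz matrix of Eq. 2 from cross-covariances, projects each block through V, and evaluates the objective of Eq. 3, omitting the gradient-based optimization over V.

```python
import numpy as np

def block_toeplitz(blocks):
    # Sigma_T of Eq. 2: block (i, j) is C_{j-i} for j >= i and C_{i-j}^T otherwise.
    T, d = len(blocks), blocks[0].shape[0]
    S = np.empty((T * d, T * d))
    for i in range(T):
        for j in range(T):
            B = blocks[j - i] if j >= i else blocks[i - j].T
            S[i * d:(i + 1) * d, j * d:(j + 1) * d] = B
    return S

def gaussian_pi(cross_covs, V):
    # Eq. 3: I_T(Y) = log|Sigma_T(Y)| - (1/2) log|Sigma_2T(Y)|, where Sigma(Y)
    # follows from Sigma(X) by sending C_dt -> V^T C_dt V. Expects 2T blocks.
    projected = [V.T @ C @ V for C in cross_covs]
    T = len(cross_covs) // 2
    logdet = lambda S: np.linalg.slogdet(S)[1]
    return logdet(block_toeplitz(projected[:T])) - 0.5 * logdet(block_toeplitz(projected))
```

As a sanity check, embedding a scalar AR(1) process alongside an independent white-noise coordinate, the projection onto the dynamical coordinate yields \u2212(1/2) log(1 \u2212 \u03c6^2) nats of predictive information while the projection onto the noise coordinate yields zero, mirroring the schematic of Fig. 1A.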
Note that stationarity is a fundamental conceptual assumption of our method in the sense that predictive information is defined only for stationary processes, for which the entropy as a function of window length is well-defined. Nonetheless, extensions of DCA which take nonstationarity into account are possible (see Discussion). On the other hand, the Gaussian assumption makes optimization tractable, but is not required in theory. Note, however, that the Gaussian assumption is acceptable so long as the optima of the Gaussian objective are also near-optima of the full information-theoretic objective. This is a much weaker condition than agreement between the Gaussian and full objectives over all possible V. To probe whether the weak condition might hold in practice, we compared the Gaussian estimate of predictive information to a direct estimate obtained using the nonparametric kNN estimator of Kraskov et al. [20] for projections of non-Gaussian synthetic data. We refer to these two estimates of predictive information as the \u201cGaussian\u201d and \u201cfull\u201d estimates, respectively. For random one-dimensional projections of the three-dimensional Lorenz attractor, the Gaussian and full predictive information estimates are positively correlated, but show a complex, non-monotonic relationship (Fig. 2A,B). However, for one-dimensional projections of the 30-dimensional noisy Lorenz embedding of Fig. 1, we observe tight agreement between the two estimates for random projections (Fig. 2C, gray histogram). Running DCA, which by definition increases the Gaussian estimate of predictive information, also increases the full estimate (Fig. 2C, red trajectories). When we consider three-dimensional projections of the same system, random projections no longer efficiently sample the full range of predictive information, but running DCA nevertheless increases both the Gaussian and full estimates (Fig. 2D, trajectories). 
These results suggest that DCA finds good optima of the full, information-theoretic loss surface in this synthetic system despite only taking second-order statistics into account.\n\nFor a one-dimensional Gaussian time series Y, it is also possible to compute the predictive information using the Fourier transform of Y [23]. In particular, when the asymptotic predictive information Ipred_{T\u2192\u221e}(Y) is finite, we have Ipred_{T\u2192\u221e}(Y) = \u03a3_{k=1}^\u221e k b_k^2, where {b_k} are the so-called cepstrum coefficients of Y, which are related to the Fourier transform of Y (see Appendix C). When the Fourier transform of Y is estimated for length-2T windows in conjunction with a window function, this method computes a regularized estimate of Ipred_T(Y). We call this the \u201cfrequency-domain\u201d method of computing Gaussian predictive information (in contrast to the \u201ctime-domain\u201d method of Eq. 3). Like the time-domain method, the frequency-domain method is differentiable in V. Its primary advantage lies in leveraging the fast Fourier transform (FFT), which allows DCA to be run with much larger T than would be feasible using the time-domain method, which requires computing the log-determinant of a T-by-T matrix, an O(T^3) operation. By contrast, the FFT is O(T log T). However, the frequency-domain method is limited to finding one-dimensional projections. To find a multidimensional projection, one can greedily find one-dimensional projections and iteratively project them out of the problem, a technique called deflation. However, deflation is not guaranteed to find local optima of the DCA objective since correlations between the projected variables are ignored (Fig. 2E). For this reason, we use the time-domain implementation of DCA unless stated otherwise.\n\nFigure 2: Comparison of Gaussian vs. 
full predictive information estimates (A\u2013D) and the frequency-domain method (E). (A) Predictive information of one-dimensional projections of the three-dimensional Lorenz attractor as a function of the spherical coordinates (\u03b8, \u03c6) of the projection using Gaussian and full (kNN) estimates. (A\u2013D) all consider DCA with T = 1. (B) Histogram of the Gaussian and full estimates of predictive information from (A). (C) Histogram of the Gaussian and full estimates of predictive information of random one-dimensional projections of the 30-dimensional noisy Lorenz embedding of Fig. 1. Red trajectories correspond to five different runs of DCA. (D) Same as (C) but for three-dimensional projections of the same system. (E) Gaussian predictive information of subspaces found by different implementations of DCA when run on 109-dimensional motor cortical data (see Section 4). \u201cDCA\u201d directly optimizes Eq. 3, \u201cdeflation\u201d optimizes Eq. 3 to find one-dimensional projections in a deflational fashion and \u201cFFT deflation\u201d uses the frequency-domain method of computing Gaussian predictive information in a deflational fashion. T = 5 is used in all three cases.\n\n3 Related work\n\nThough less common than static methods, linear dimensionality reduction methods which take time into account, like DCA, are sometimes used. One popular method is Slow Feature Analysis (SFA), which we examine in some depth due to its resemblance to DCA [24, 25]. Given a discrete time series X, where x_t \u2208 R^n, SFA finds projected variables y_t = V^T x_t \u2208 R^d that have unit variance, mutually uncorrelated components and minimal mean-squared time derivatives. 
For a discrete one-dimensional time series with unit variance, minimizing the mean-squared time derivative is equivalent to maximizing the one-time step autocorrelation. Thus, SFA may be formulated as\n\nmaximize tr(V^T C_1^sym V)  subject to  V^T C_0 V = I    (4)\n\nwhere V \u2208 R^{n\u00d7d}, C_0 = \u27e8x_t x_t^T\u27e9_t, C_1 = \u27e8x_t x_{t+1}^T\u27e9_t and C_1^sym = (1/2)(C_1 + C_1^T). We assume that X has been temporally oversampled so that the one-time step autocorrelation of any one-dimensional projection is positive, which is equivalent to assuming that C_1^sym is positive-definite (see Appendix E for explanation). SFA is naturally compared to the T = 1 case of DCA. For one-dimensional projections (d = 1), the solutions of SFA and DCA coincide, since mutual information is monotonically related to correlation for Gaussian variables in the positive-correlation regime. For higher-dimensional projections (d > 1), the comparison becomes more subtle. SFA is solved by making the whitening transformation \u02dcV = C_0^{1/2} V and letting \u02dcV be the top-d orthonormal eigenvectors of M_SFA = C_0^{\u22121/2} C_1^sym C_0^{\u22121/2}. To understand the solution to DCA, it is helpful to consider the relaxed problem of maximizing I(U^T x_t; V^T x_{t+1}) where U need not equal V. The relaxed problem is solved by performing Canonical Correlation Analysis (CCA) on x_t and x_{t+1}, which entails making the whitening transformations \u02dcU = C_0^{1/2} U, \u02dcV = C_0^{1/2} V and letting \u02dcU and \u02dcV be the top-d left and right singular vectors, respectively, of M_CCA = C_0^{\u22121/2} C_1 C_0^{\u22121/2} [26, 27]. If X has time-reversal symmetry, then C_1^sym = C_1, so M_SFA = M_CCA and the projections found by SFA and DCA agree. For time-irreversible processes, C_1^sym \u2260 C_1, so M_SFA \u2260 M_CCA and the projections found by SFA and DCA disagree. 
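These closed-form solutions are easy to state in code. The sketch below is our own illustration with hypothetical covariance matrices (not data from the paper): SFA takes the top eigenvectors of M_SFA, the relaxed problem takes the top singular vectors of M_CCA, and the two matrices differ exactly when C_1 is non-symmetric.

```python
import numpy as np

def inv_sqrt(M):
    # Inverse matrix square root of a symmetric positive-definite matrix.
    w, U = np.linalg.eigh(M)
    return U @ np.diag(w ** -0.5) @ U.T

def sfa_projection(C0, C1, d):
    # Eq. 4 via whitening: top-d eigenvectors of M_SFA = C0^{-1/2} C1^sym C0^{-1/2},
    # mapped back through the whitening transformation.
    W = inv_sqrt(C0)
    M_sfa = W @ (0.5 * (C1 + C1.T)) @ W
    w, U = np.linalg.eigh(M_sfa)          # eigenvalues in ascending order
    return W @ U[:, np.argsort(w)[::-1][:d]]

def cca_pair(C0, C1, d):
    # Relaxed problem: top-d left/right singular vectors of M_CCA = C0^{-1/2} C1 C0^{-1/2}.
    W = inv_sqrt(C0)
    U, s, Vt = np.linalg.svd(W @ C1 @ W)  # singular values in descending order
    return W @ U[:, :d], W @ Vt.T[:, :d]
```

By construction, the returned SFA projection satisfies the constraint V^T C_0 V = I of Eq. 4.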
In particular, the SFA objective has no dependence on the off-diagonal elements of V^T C_1 V, while DCA takes these terms into account to maximize I(V^T x_t; V^T x_{t+1}). Additionally, for non-Markovian processes, SFA and DCA yield different subspaces for T > 1 for all d \u2265 1 since DCA captures longer-timescale dependencies than SFA (Fig. 3A). In summary, DCA is superior to SFA at capturing past-future mutual information for time-irreversible and/or non-Markovian processes. Note that most real-world systems including biological networks, stock markets and out-of-equilibrium physical systems are time-irreversible. Moreover, real-world systems are generally non-Markovian. Thus, when capturing past-future mutual information is of interest, DCA is superior to SFA for most realistic applications.\n\nWith regard to the relaxed problem solved by CCA, Tegmark [28] has suggested that, for time-irreversible processes X, the maximum of I(U^T x_t; V^T x_{t+1}) can be significantly reduced when U = V is enforced. This is because, in time-irreversible processes, predictive features are not necessarily predictable, and vice versa. However, because this work did not compare CCA (the optimal U \u2260 V method) to DCA (the optimal U = V method), the results are overly pessimistic. We repeated the analysis of [28] using both the noisy Lorenz embedding of Fig. 1 as well as a system of coupled oscillators that was used in [28]. For both systems, the single projection found by DCA captured almost as much past-future mutual information as the pair of projections found by CCA (Fig. 3B,C). This suggests that while predictive and predictable features are different in general, shared past and future features might suffice for capturing most of the past-future mutual information in certain systems. 
Identifying and characterizing this class of systems could have important implications for prediction-based unsupervised learning techniques [28, 9].\n\nIn addition to SFA, other time-based linear dimensionality reduction methods have been proposed. Maximum Autocorrelation Factors [29] is equivalent to the version of SFA described here. Complexity Pursuit [30] and Forecastable Components Analysis [31] each minimize the entropy of a nonlinear function of the projected variables. They are similar in spirit to the frequency-domain implementation of DCA, but do not maximize past-future mutual information. Several algorithms inspired by Independent Components Analysis that incorporate time have been proposed [32\u201334], but are designed to separate independent dimensions in time series rather than discover a dynamical subspace with potentially correlated dimensions. Like DCA, Predictable Feature Analysis [35, 36] is a linear dimensionality reduction method with a prediction-based objective. However, Predictable Feature Analysis requires explicitly specifying a prediction model, whereas DCA does not assume a particular model. Moreover, Predictable Feature Analysis requires alternating optimization updates of the prediction model and the projection matrix, whereas DCA is end-to-end differentiable. Finally, DCA is related to the Past-Future Information Bottleneck [37] (see Appendix F).\n\nFigure 3: Comparison of DCA with other methods. (A) Autocorrelation functions of one-dimensional DCA projections of motor cortical data (see Section 4) for T = 1, in which case DCA is equivalent to SFA, and T = 20. While the one-time step autocorrelation is larger for the T = 1 projection (inset), the T = 20 projection exhibits stronger oscillations apparent at longer timescales. (B) Performance of DCA, SFA, PCA and CCA at capturing past-future mutual information, I(U^T x_t; V^T x_{t+\u2206t}), where U = V for DCA, SFA and PCA and U \u2260 V for CCA. Following Tegmark [28], x_t comprises the position and momentum variables of 10 coupled oscillators and \u2206t = 10. (C) Same as (B), but using the 30-dimensional noisy Lorenz embedding of Fig. 1 with \u2206t = 2.\n\nWe have been made aware of two existing methods which share the name Dynamical Component(s) Analysis [38\u201340]. Thematically, they share the goal of uncovering low-dimensional dynamics from time series data. Thirion and Faugeras [38] perform a two-stage, temporal then kernelized spatial analysis. Seifert et al. [39] and Korn et al. [40] assume the observed dynamics are formed by low-dimensional latent variables with linear and nonlinear dynamics. To fit a linear approximation of the latent variables, they derive a generalized eigenvalue problem which is sensitive to same-time and one-time step correlations, i.e., the data and the approximation of its first derivative.\n\nAn alternative to objective function-based components analysis methods are generative models, which postulate a low-dimensional latent state that has been embedded in high-dimensional observation space. Generative models featuring latent states imbued with dynamics, such as the Kalman filter, Gaussian Process Factor Analysis and LFADS, have found widespread use in neuroscience (see Appendix I for comparisons of DCA with the KF and GPFA) [41\u201343]. The power of these methods lies in the fact that rich dynamical structure can be encouraged in the latent state through careful choice of priors and model structure. However, learning and inference in generative models tend to be computationally expensive, particularly in models featuring dynamics. 
In the case of deep learning-based methods such as LFADS, there are often many model and optimization hyperparameters that need to be tuned. In terms of computational efficiency and simplicity, DCA occupies attractive territory between linear methods like PCA and SFA, which are computationally efficient but extract relatively simple structure, and dynamical generative models like LFADS, which extract rich dynamical structure but are computationally demanding. As a components analysis method, DCA makes the desired properties of the learned features explicit through its objective function. Finally, the ability of DCA to yield a linear subspace in which dynamics unfold may be exploited for many analyses. For example, the loadings for DCA can be studied to examine the relationship between the high- and low-dimensional variables (Appendix J).\n\nLastly, while DCA does not produce an explicit description of the dynamics, this is a potentially attractive property. In particular, while dynamical generative models such as the KF provide descriptions of the dynamics, they also assume a particular form of dynamics, biasing the extracted components toward this form. By contrast, DCA is formulated in terms of spatiotemporal correlations and, as a result, can extract broad forms of (stationary) dynamics, be they linear or nonlinear. For example, the Lorenz attractor of Fig. 
1 is a nonlinear dynamical system.\n\n4 Applications to real data\n\nWe used DCA to extract dynamical subspaces in four high-dimensional time series datasets: (i) multi-neuronal spiking activity of 109 single units recorded in monkey primary motor cortex (M1) while the monkey performed a continuous grid-based reaching task [44]; (ii) multi-neuronal spiking activity of 55 single units recorded in rat hippocampal CA1 while the rat performed a reward-chasing task [45, 46]; (iii) multi-city temperature data from 30 cities over several years [47]; and (iv) 12 variables from an accelerometer, gyroscope, and gravity sensor recording human kinematics [48]. See Appendix B for details. For all results, three bins of projected data were used to predict one bin of response data. Data were split into five folds, and reported R2 values are averaged across folds.\n\nTo assess the performance of DCA, we noted that subspaces which capture dynamics should be more predictive of future states than those which capture static structure. Moreover, for the motor cortical and hippocampal datasets, subspaces which capture dynamics should be more predictive of behavioral variables (cursor kinematics and rat location, respectively) than subspaces which do not, since neural dynamics are believed to underlie or encode these variables [49, 50]. Thus, we compared the abilities of subspaces found by DCA, PCA and SFA to decode behavioral variables for the motor cortical and hippocampal datasets and to forecast future full-dimensional states for the temperature and accelerometer datasets.\n\nFor the motor cortical and hippocampal datasets, DCA outperformed PCA at predicting both current and future behavioral variables on held-out data (Fig. 4, top row). 
This re\ufb02ects the existence of\ndimensions which have substantial variance, but which do not capture as much dynamical structure\nas other, smaller-variance dimensions. Unlike PCA, DCA is not drawn to these noisy, high-variance\ndimensions. In addition to demonstrating that DCA captures more dynamical structure than PCA, this\nanalysis demonstrates the utility of DCA in a common task in neuroscience, namely, extracting low-\ndimensional representations of neural dynamics for visualization or further analysis (see Appendix\nH for forecasting results on the neural data and Appendix J for example latent trajectories and their\nrelationship to the original measurement variables) [27, 51]. For the temperature dataset, DCA\nand PCA performed similarly, and for the accelerometer dataset, DCA outperformed PCA for the\nlowest-dimensional projections. The narrower performance gap between DCA and PCA on the\ntemperature and accelerometer datasets suggests that the alignment between variance and dynamics\nis stronger in these datasets than in the neural data.\nAssuming Gaussianity, DCA is formally superior to SFA at capturing past-future mutual information\nin time series which are time-irreversible and/or non-Markovian (Section 3). All four of our datasets\npossess both of these properties, suggesting that subspaces extracted by DCA might offer superior\ndecoding and forecasting performance to those extracted by SFA. We found this to be the case across\nall four datasets (Fig. 4, bottom row). Moreover, the relative performance of DCA often became\nstronger as T (the past-future window size of DCA) was increased, highlighting the non-Markovian\nnature of the data (see Appendix G for absolute R2 values). 
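The role of the window size T can be illustrated by computing Gaussian predictive information for scalar AR processes. This is a minimal sketch, not the paper's implementation: the AR coefficients are arbitrary choices, and the mutual information is evaluated from exact Toeplitz autocovariance matrices.

```python
import numpy as np

def acov_ar1(a, s2, K):
    """Exact autocovariances gamma_0..gamma_{K-1} of a stationary AR(1) process."""
    g0 = s2 / (1 - a * a)
    return np.array([g0 * a**k for k in range(K)])

def acov_ar2(a1, a2, s2, K):
    """Exact autocovariances of a stationary AR(2) process via Yule-Walker."""
    g0 = s2 / (1 - a2 * a2 - a1 * a1 * (1 + a2) / (1 - a2))
    g = [g0, a1 * g0 / (1 - a2)]
    for _ in range(K - 2):
        g.append(a1 * g[-1] + a2 * g[-2])
    return np.array(g)

def gaussian_pi(gammas, T):
    """Predictive information I(past window; future window), each of length T, in nats.

    For a stationary Gaussian process, PI = log|Sigma_T| - 0.5 * log|Sigma_2T|,
    where Sigma_k is the Toeplitz covariance of a length-k window.
    """
    idx = np.abs(np.subtract.outer(np.arange(2 * T), np.arange(2 * T)))
    Sig = gammas[idx]
    return np.linalg.slogdet(Sig[:T, :T])[1] - 0.5 * np.linalg.slogdet(Sig)[1]

g1 = acov_ar1(0.9, 1.0, 8)
g2 = acov_ar2(0.3, 0.5, 1.0, 8)
pis_ar1 = [gaussian_pi(g1, T) for T in (1, 2, 3, 4)]
pis_ar2 = [gaussian_pi(g2, T) for T in (1, 2, 3, 4)]
# AR(1) is first-order Markov: PI is flat in T.
# AR(2) is not: PI grows from T = 1 to T = 2, then saturates.
print([round(p, 4) for p in pis_ar1])
print([round(p, 4) for p in pis_ar2])
```

For the first-order Markov AR(1) process, predictive information saturates immediately, so nothing is gained by increasing T; for the AR(2) process it grows until the window covers the order of the temporal dependence. Data with longer-range dependencies extend this pattern to larger T.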
This underscores the importance of leveraging spatiotemporal statistics across long timescales when extracting non-Markovian dynamical structure from data.

5 Discussion

DCA retains the geometric interpretability of linear dimensionality reduction methods while implementing an information-theoretic objective function that robustly extracts dynamical structure and minimizes noise. Indeed, the subspace found by DCA may be thought of as the result of a competition between aligning the subspace with dynamics and making the subspace orthogonal to noise, as in Fig. 1A. Applied to neural, weather and accelerometer datasets, DCA often outperforms PCA, indicating that noise variance often dominates or is comparable to dynamics variance in these datasets. Moreover, DCA often outperforms SFA, particularly when DCA integrates spatiotemporal statistics over long timescales, highlighting the non-Markovian statistical dependencies present in these datasets. Overall, our results show that DCA is well-suited for finding dynamical subspaces in time series with structural attributes characteristic of real-world data.

Many extensions of DCA are possible. Since real-world data generation processes are generally non-stationary, extending DCA for non-stationary data is a key direction for future work. For example, non-stationary data may be segmented into windows such that the data are approximately stationary within each window [52]. In general, the subspace found by DCA includes contributions from all of the original variables. For increased interpretability, DCA could be optimized with an ℓ1 penalty on the projection matrix V [53] to identify a small set of relevant features, e.g., individual neurons or stocks [3].
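An ℓ1-penalized variant can be sketched with a soft-thresholded power iteration. This is a hypothetical illustration rather than DCA itself: it maximizes a symmetrized lag-1 autocovariance (a stand-in for the predictive-information objective) over a single projection vector, and the data, threshold lam, and iteration count are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 2000, 10
z = np.zeros(T)
for t in range(1, T):                     # latent AR(1) dynamics
    z[t] = 0.95 * z[t - 1] + rng.normal()
X = 2.0 * rng.normal(size=(T, n))         # white noise in every channel
X[:, 0] += z                              # only channels 0 and 1 carry dynamics
X[:, 1] += 0.8 * z

Xc = X - X.mean(axis=0)
C1 = Xc[:-1].T @ Xc[1:] / (T - 1)
C1 = 0.5 * (C1 + C1.T)                    # symmetrized lag-1 covariance

lam = 0.02                                # l1 threshold (hyperparameter, arbitrary)
v = np.ones(n) / np.sqrt(n)
for _ in range(100):
    w = C1 @ v                            # ascent step on the lag-1 objective
    w /= np.linalg.norm(w)
    w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # soft-threshold: l1 proximal step
    v = w / np.linalg.norm(w)

print(np.round(v, 3))
```

With the threshold active, loadings on the pure-noise channels shrink to (near) zero while the two dynamical channels are retained, mimicking the feature-selection behavior an ℓ1 penalty on V would provide.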
Both the time- and frequency-domain implementations of DCA may be made differentiable in the input data, opening the door to extensions of DCA that learn nonlinear transformations of the input data, including kernel-like dimensionality expansion, or that use a nonlinear mapping from the high- to low-dimensional space, including deep architectures. Since DCA finds a linear projection, it can also be kernelized using the kernel trick. The DCA objective could also be used in recurrent neural networks to encourage rich dynamics. Finally, dimensionality reduction via DCA could serve as a preprocessing step for time series analysis methods which scale unfavorably in the dimensionality of the input data, allowing such techniques to be scaled to high-dimensional data.

Figure 4: DCA for prediction and forecasting. For all panels, color indicates the projected dimensionality. For the top row, marker type indicates the lag for prediction. The top row compares held-out R2 for DCA vs. PCA as a function of projection dimensionality and prediction lag. The bottom row shows the difference in held-out R2 for DCA vs. SFA as a function of T, the past-future window size parameter for DCA. (M1) Predicting cursor location from projected motor cortical data. (Hippocampus) Predicting animal location from projected hippocampal data. (Temperature) Forecasting future full-dimensional temperature states from projected temperature states. (Accelerometer) Forecasting future full-dimensional accelerometer states from projected states.

Acknowledgements

D.G.C. and K.E.B. were funded by LBNL Laboratory Directed Research and Development. We thank the Neural Systems and Data Science Lab and Laurenz Wiskott for helpful discussion.

References

[1] Peiran Gao and Surya Ganguli. On simplicity and complexity in the brave new world of large-scale neuroscience.
Current opinion in neurobiology, 32:148–155, 2015.

[2] John P Cunningham and Zoubin Ghahramani. Linear dimensionality reduction: Survey, insights, and generalizations. The Journal of Machine Learning Research, 16(1):2859–2900, 2015.

[3] Michael W Mahoney and Petros Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 106(3):697–702, 2009.

[4] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.

[5] Carl Doersch, Abhinav Gupta, and Alexei A Efros. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pages 1422–1430, 2015.

[6] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[7] Sarah E Marzen and James P Crutchfield. Nearly maximally predictive features and their dimensions. Physical Review E, 95(5):051301, 2017.

[8] David McAllester. Information Theoretic Co-Training. arXiv preprint arXiv:1802.07572, 2018.

[9] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748, 2018.

[10] Kristofer E Bouchard and Michael S Brainard.
Auditory-induced neural dynamics in sensory-\nmotor circuitry predict learned temporal and sequential statistics of birdsong. Proceedings of\nthe National Academy of Sciences, 113(34):9641\u20139646, 2016.\n\n[11] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method.\n\narXiv preprint physics/0004057, 2000.\n\n[12] William Bialek, Ilya Nemenman, and Naftali Tishby. Predictability, complexity, and learning.\n\nNeural computation, 13(11):2409\u20132463, 2001.\n\n[13] Stephanie E Palmer, Olivier Marre, Michael J Berry, and William Bialek. Predictive information\nin a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908\u20136913,\n2015.\n\n[14] Karl Pearson. On lines and planes of closest \ufb01t to systems of points in space. The London,\nEdinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559\u2013572, 1901.\n\n[15] Harold Hotelling. Analysis of a complex of statistical variables into principal components.\n\nJournal of educational psychology, 24(6):417, 1933.\n\n[16] Ming Li and Paul Vit\u00e1nyi. An introduction to Kolmogorov complexity and its applications.\n\nSpringer Science & Business Media, 2013.\n\n[17] Steven P Strong, Roland Koberle, Rob R de Ruyter van Steveninck, and William Bialek. Entropy\n\nand information in neural spike trains. Physical review letters, 80(1):197, 1998.\n\n[18] Liam Paninski. Estimation of entropy and mutual information. Neural computation, 15(6):\n\n1191\u20131253, 2003.\n\n[19] Artemy Kolchinsky and Brendan Tracey. Estimating mixture entropy with pairwise distances.\n\nEntropy, 19(7):361, 2017.\n\n[20] Alexander Kraskov, Harald St\u00f6gbauer, and Peter Grassberger. Estimating mutual information.\n\nPhysical review E, 69(6):066138, 2004.\n\n[21] Xianli Zeng, Yingcun Xia, and Howell Tong. Jackknife approach to the estimation of mutual\n\ninformation. 
Proceedings of the National Academy of Sciences, 115(40):9956–9961, 2018.

[22] David McAllester and Karl Statos. Formal limitations on the measurement of mutual information. arXiv preprint arXiv:1811.04251, 2018.

[23] Lei Li and Zhongjie Xie. Model selection and order determination for time series by information between the past and the future. Journal of time series analysis, 17(1):65–84, 1996.

[24] Laurenz Wiskott and Terrence J Sejnowski. Slow Feature Analysis: Unsupervised learning of invariances. Neural computation, 14(4):715–770, 2002.

[25] Matthias Bethge, Sebastian Gerwinn, and Jakob H Macke. Unsupervised learning of a steerable basis for invariant image representations. In Human Vision and Electronic Imaging XII, volume 6492, page 64920C. International Society for Optics and Photonics, 2007.

[26] Magnus Borga. Canonical correlation: a tutorial. Online tutorial, http://people.imt.liu.se/magnus/cca, 4(5), 2001.

[27] John P Cunningham and M Yu Byron. Dimensionality reduction for large-scale neural recordings. Nature neuroscience, 17(11):1500, 2014.

[28] Max Tegmark. Optimal latent representations: Distilling mutual information into principal pairs. arXiv preprint arXiv:1902.03364, 2019.

[29] Rasmus Larsen. Decomposition using Maximum Autocorrelation Factors. Journal of Chemometrics: A Journal of the Chemometrics Society, 16(8-10):427–435, 2002.

[30] Aapo Hyvärinen. Complexity pursuit: separating interesting components from time series. Neural computation, 13(4):883–898, 2001.

[31] Georg Goerg. Forecastable component analysis. In International Conference on Machine Learning, pages 64–72, 2013.

[32] Lang Tong, VC Soon, Yih-Fang Huang, and RALR Liu. AMUSE: a new blind identification algorithm. In IEEE international symposium on circuits and systems, pages 1784–1787. IEEE, 1990.

[33] Andreas Ziehe and Klaus-Robert Müller.
TDSEP\u2014an ef\ufb01cient algorithm for blind separation\nusing time structure. In International Conference on Arti\ufb01cial Neural Networks, pages 675\u2013680.\nSpringer, 1998.\n\n[34] Harald St\u00f6gbauer, Alexander Kraskov, Sergey A Astakhov, and Peter Grassberger. Least-\ndependent-component analysis based on mutual information. Physical Review E, 70(6):066123,\n2004.\n\n[35] Stefan Richthofer and Laurenz Wiskott. Predictable Feature Analysis. In 2015 IEEE 14th\nInternational Conference on Machine Learning and Applications (ICMLA), pages 190\u2013196.\nIEEE, 2015.\n\n[36] Bj\u00f6rn Weghenkel, Asja Fischer, and Laurenz Wiskott. Graph-based Predictable Feature Analysis.\n\nMachine Learning, 106(9-10):1359\u20131380, 2017.\n\n[37] Felix Creutzig, Amir Globerson, and Naftali Tishby. Past-future information bottleneck in\n\ndynamical systems. Physical Review E, 79(4):041925, 2009.\n\n[38] Bertrand Thirion and Olivier Faugeras. Dynamical components analysis of fMRI data through\n\nkernel PCA. NeuroImage, 20(1):34\u201349, 2003.\n\n[39] Bastian Seifert, Katharina Korn, Steffen Hartmann, and Christian Uhl. Dynamical Component\nAnalysis (DyCA): dimensionality reduction for high-dimensional deterministic time-series. In\n2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP),\npages 1\u20136. IEEE, 2018.\n\n[40] Katharina Korn, Bastian Seifert, and Christian Uhl. Dynamical Component Analysis (DyCA)\nand its application on epileptic EEG. In ICASSP 2019-2019 IEEE International Conference on\nAcoustics, Speech and Signal Processing (ICASSP), pages 1100\u20131104. IEEE, 2019.\n\n[41] Rudolph Emil Kalman. A new approach to linear \ufb01ltering and prediction problems. Journal of\n\nbasic Engineering, 82(1):35\u201345, 1960.\n\n[42] Byron M Yu, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy, and\nManeesh Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of\nneural population activity. 
J Neurophysiol, 102:614–635, 2009.

[43] Chethan Pandarinath, Daniel J O'Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D Stavisky, Jonathan C Kao, Eric M Trautmann, Matthew T Kaufman, Stephen I Ryu, Leigh R Hochberg, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature methods, page 1, 2018.

[44] Joseph E. O'Doherty, Mariana M. B. Cardoso, Joseph G. Makin, and Philip N. Sabes. Nonhuman Primate Reaching with Multichannel Sensorimotor Cortex Electrophysiology, May 2017. URL https://doi.org/10.5281/zenodo.583331.

[45] K Mizuseki, A Sirota, E Pastalkova, and G Buzsáki. Multi-unit recordings from the rat hippocampus made during open field foraging. Available online at: CRCNS.org, 2009.

[46] Joshua I Glaser, Raeed H Chowdhury, Matthew G Perich, Lee E Miller, and Konrad P Kording. Machine learning for neural decoding. arXiv preprint arXiv:1708.00909, 2017.

[47] Selfish Gene. Historical hourly weather data 2012-2017, Dec 2017. URL https://www.kaggle.com/selfishgene/historical-hourly-weather-data.

[48] Mohammad Malekzadeh, Richard G Clegg, Andrea Cavallaro, and Hamed Haddadi. Protecting sensory data against sensitive inferences. In Proceedings of the 1st Workshop on Privacy by Design in Distributed Systems, page 2. ACM, 2018.

[49] Mark M Churchland, John P Cunningham, Matthew T Kaufman, Justin D Foster, Paul Nuyujukian, Stephen I Ryu, and Krishna V Shenoy. Neural population dynamics during reaching. Nature, 487(7405):51, 2012.

[50] Matthew A Wilson and Bruce L McNaughton. Dynamics of the hippocampal ensemble code for space. Science, 261(5124):1055–1058, 1993.

[51] Matthew D Golub, Patrick T Sadtler, Emily R Oby, Kristin M Quick, Stephen I Ryu, Elizabeth C Tyler-Kabara, Aaron P Batista, Steven M Chase, and Byron M Yu. Learning by neural reassociation.
Nature neuroscience, 21(4):607–616, 2018.

[52] Antonio C Costa, Tosif Ahamed, and Greg J Stephens. Adaptive, locally linear models of complex dynamics. Proceedings of the National Academy of Sciences, 116(5):1501–1510, 2019.

[53] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.