{"title": "Enabling hyperparameter optimization in sequential autoencoders for spiking neural data", "book": "Advances in Neural Information Processing Systems", "page_first": 15937, "page_last": 15947, "abstract": "Continuing advances in neural interfaces have enabled simultaneous monitoring of spiking activity from hundreds to thousands of neurons. To interpret these large-scale data, several methods have been proposed to infer latent dynamic structure from high-dimensional datasets. One recent line of work uses recurrent neural networks in a sequential autoencoder (SAE) framework to uncover dynamics. SAEs are an appealing option for modeling nonlinear dynamical systems, and enable a precise link between neural activity and behavior on a single-trial basis. However, the very large parameter count and complexity of SAEs relative to other models has caused concern that SAEs may only perform well on very large training sets. We hypothesized that with a method to systematically optimize hyperparameters (HPs), SAEs might perform well even in cases of limited training data. Such a breakthrough would greatly extend their applicability. However, we find that SAEs applied to spiking neural data are prone to a particular form of overfitting that cannot be detected using standard validation metrics, which prevents standard HP searches. We develop and test two potential solutions: an alternate validation method (\u201csample validation\u201d) and a novel regularization method (\u201ccoordinated dropout\u201d). These innovations prevent overfitting quite effectively, and allow us to test whether SAEs can achieve good performance on limited data through large-scale HP optimization. When applied to data from motor cortex recorded while monkeys made reaches in various directions, large-scale HP optimization allowed SAEs to better maintain performance for small dataset sizes. 
Our results should greatly extend the applicability of SAEs in extracting latent dynamics from sparse, multidimensional data, such as neural population spiking activity.", "full_text": "Enabling hyperparameter optimization in sequential\n\nautoencoders for spiking neural data\n\nMohammad Reza Keshtkaran\n\nCoulter Dept. of Biomedical Engineering\n\nEmory University and Georgia Tech\n\nAtlanta, GA 30322\n\nmkeshtk@emory.edu\n\nChethan Pandarinath\n\nCoulter Dept. of Biomedical Engineering\n\nDept of Neurosurgery\n\nEmory University and Georgia Tech\n\nAtlanta, GA 30322\n\nchethan@gatech.edu\n\nAbstract\n\nContinuing advances in neural interfaces have enabled simultaneous monitoring of\nspiking activity from hundreds to thousands of neurons. To interpret these large-\nscale data, several methods have been proposed to infer latent dynamic structure\nfrom high-dimensional datasets. One recent line of work uses recurrent neural\nnetworks in a sequential autoencoder (SAE) framework to uncover dynamics. SAEs\nare an appealing option for modeling nonlinear dynamical systems, and enable a\nprecise link between neural activity and behavior on a single-trial basis. However,\nthe very large parameter count and complexity of SAEs relative to other models\nhas caused concern that SAEs may only perform well on very large training sets.\nWe hypothesized that with a method to systematically optimize hyperparameters\n(HPs), SAEs might perform well even in cases of limited training data. Such a\nbreakthrough would greatly extend their applicability. However, we \ufb01nd that SAEs\napplied to spiking neural data are prone to a particular form of over\ufb01tting that\ncannot be detected using standard validation metrics, which prevents standard HP\nsearches. We develop and test two potential solutions: an alternate validation\nmethod (\u201csample validation\u201d) and a novel regularization method (\u201ccoordinated\ndropout\u201d). 
These innovations prevent overfitting quite effectively, and allow us to test whether SAEs can achieve good performance on limited data through large-scale HP optimization. When applied to data from motor cortex recorded while monkeys made reaches in various directions, large-scale HP optimization allowed SAEs to better maintain performance for small dataset sizes. Our results should greatly extend the applicability of SAEs in extracting latent dynamics from sparse, multidimensional data, such as neural population spiking activity.\n\n1 Introduction\n\nOver the past decade, our ability to monitor the simultaneous spiking activity of large populations of neurons has increased exponentially, promising new avenues for understanding the brain. These capabilities have motivated the development and application of numerous methods for uncovering dynamical structure underlying neural population spiking activity, such as linear or switched linear dynamical systems [1, 2, 3, 4, 5], Gaussian processes [6, 7, 8, 9], and nonlinear dynamical systems [10, 11, 12, 13]. With this rich space of models, several factors influence which model is most appropriate for any given application, such as whether the data can be well-modeled as an autonomous dynamical system, whether interpretability of the dynamics is desirable, whether it is important to link neural activity to behavioral or task variables, and simply the amount of data available.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nWe previously developed a method known as Latent Factor Analysis via Dynamical Systems (LFADS), which used recurrent neural networks in a modified sequential autoencoder (SAE) configuration to uncover estimates of latent, nonlinear dynamical structure from neural population spiking activity [10, 12]. 
LFADS inferred latent states that were predictive of animals\u2019 behaviors on single trials,\ninferred perturbations to dynamics that correlated with behavioral choices, linked spiking activity\nto oscillations present in local \ufb01eld potentials, and combined data from non-overlapping recording\nsessions that spanned months to improve inference of underlying dynamics. These features may\nbe useful for studying a wide range of questions in neuroscience. However, SAEs have very large\nparameter counts (tens to hundreds of thousands of parameters), and this complexity relative to\nother models has raised concerns that SAEs (and other neural network-based approaches) may\nonly perform well on very large training sets [8]. We hypothesized that properly adjusting model\nhyperparameters (HPs) might increase the performance of SAEs in cases of limited data, which\nwould greatly extend their applicability. However, when we attempted to test the adjustment of SAE\nHPs beyond their previous hand-tuned settings, we found that SAEs are susceptible to over\ufb01tting on\nspiking data. Importantly, this over\ufb01tting could not be detected through standard validation metrics.\nConceptually, one knows that the best possible autoencoder is a trivial identity transformation of the\ndata, and complex models with enough capacity can converge to this solution. Without knowing the\ndimensionality of the latent dynamic structure a priori, it is unclear how to constrain the autoencoder\nto avoid over\ufb01tting while still providing the capacity to best \ufb01t the data. 
Thus, while it may be possible to manually tune HPs and achieve better SAE performance (e.g., by visual inspection of the results), building a framework to optimize HPs in a principled fashion and without manual intervention remains a key challenge.\nThis paper is organized as follows: Section 2 demonstrates the tendency of SAEs to overfit on spiking data; Section 3 proposes two potential solutions to this problem and characterizes their performance on simulated datasets; Section 4 demonstrates the effectiveness of these solutions through large-scale HP optimization in applications to motor cortical population spiking activity.\n\n2 Sensitivity of SAEs to overfitting on spiking data\n\n2.1 The SAE architecture\n\nWe examine the LFADS architecture detailed in [10, 12]. The basic model is an instantiation of a variational autoencoder (VAE) [14, 15] extended to sequences, as in [16, 17, 18]. Briefly, an encoder RNN takes as input a data sequence xt, and produces as output a conditional distribution over a latent code z, Q(z|xt). In the VAE framework, an uninformative prior P(z) on this latent code serves as a regularizer, and divergence from the prior is discouraged via a training penalty that scales with DKL(Q(z|xt)||P(z)). A data sample \u02c6z is then drawn from Q(z|xt), which sets the initial state of a decoder RNN. This RNN attempts to create a reconstruction \u02c6rt of the original data via a low-dimensional set of factors \u02c6ft. Specifically, the data xt are assumed to be samples from an inhomogeneous Poisson process with underlying rates \u02c6rt. This basic sequential autoencoder is appropriate for neural data that is well-modeled as an autonomous dynamical system.\nOur previous work also demonstrated modifications of the SAE architecture for modeling input-driven dynamical systems (Figure 1(a), also detailed in [10, 12]). 
In this case, an additional controller RNN compares an encoding of the observed data with the output of the decoder RNN, and attempts to inject a time-varying input ut into the decoder to account for data that cannot be modeled by the decoder\u2019s autonomous dynamics alone. (As with the latent code z, the time-varying input is parameterized as a distribution Q(ut|xt), and the decoder network actually receives a sample \u02c6ut from this distribution.) These inputs are a critical extension, as autonomous dynamics are likely only a good model for motor areas of the brain, and for specific, tightly-controlled behavioral settings, such as pre-planned reaching movements [19]. Instead, most neural systems and behaviors of interest are expected to reflect both internal dynamics and inputs, to varying degrees. Therefore we focused on the extended model shown in Figure 1(a).\nThe LFADS objective function is defined as the log-likelihood of the data, $\sum_x \log P(x_{1:T})$, marginalized over all latent variables, which is optimized in the VAE setting by maximizing a variational lower bound, $\mathcal{L}$, on the marginal data log-likelihood:\n\n$$\log P(x_{1:T}) \geq \mathcal{L} = \mathcal{L}^x - \mathcal{L}^{KL}$$\n\nFigure 1: (a) The LFADS architecture for modeling input-driven dynamical systems. (b) Performance of LFADS in inferring firing rates from a synthetic RNN dataset for 200 models with randomly selected hyperparameters. (c) Ground truth (black) and inferred (blue) firing rates for two neurons from three example LFADS models corresponding to points in b. Actual spike times are indicated by black dots underneath the firing rate traces. 
Each plot shows 1 sec of data.\n\n$\mathcal{L}^x$ is the log-likelihood of the reconstruction of the data, given the inferred firing rates $\hat{r}_t$, and $\mathcal{L}^{KL}$ is a non-negative penalty that restricts the approximate posterior distributions from deviating too far from the (uninformative) prior distributions, defined as\n\n$$\mathcal{L}^x = \left\langle \sum_{t=1}^{T} \log \mathrm{Poisson}(x_t \mid \hat{r}_t) \right\rangle_{z,u}$$\n\n$$\mathcal{L}^{KL} = \left\langle D_{KL}\big(\mathcal{N}(g_0 \mid \mu^{g_0}, \sigma^{g_0}) \,\|\, P^{g_0}(g_0)\big) \right\rangle_{z} + \left\langle D_{KL}\big(\mathcal{N}(u_1 \mid \mu^{u}_{1}, \sigma^{u}_{1}) \,\|\, P^{u_1}(u_1)\big) \right\rangle_{z,u_1} + \left\langle \sum_{t=2}^{T} D_{KL}\big(\mathcal{N}(u_t \mid \mu^{u}_{t}, \sigma^{u}_{t}) \,\|\, P^{u}(u_t \mid u_{t-1})\big) \right\rangle_{z,u}$$\n\nHere, $g_0$ represents the initial conditions (ICs) produced by the IC encoder RNN, $\mu^{g_0}$ and $\sigma^{g_0}$ are the mean and variance of the IC prior, respectively, and $P^{u}(\cdot)$ is an autoregressive (AR) prior over inferred inputs [12] with $\mu^{u}_{t}$ and $\sigma^{u}_{t}$ as the prior means and variances for time $t$, respectively. A more detailed description of the LFADS model is given in Supplemental Section A.\n\n2.2 Overfitting on a synthetic RNN dataset\n\nAs the complexity and parameter count of neural network models increase, it might be expected that very large datasets are required to achieve good performance. In the face of this challenge, our aim was to test whether such models could be made to function even in cases of limited training data by applying principled HP optimization. For example, adjusting HPs to regularize the system might prevent overfitting, e.g., by limiting the complexity of the learned dynamics via L2 regularization of the recurrent weights of the RNNs, by limiting the amount of information passed through the probabilistic codes Q(z|xt) and Q(u|xt) by scaling the KL penalty, as in [20], or by applying dropout to the networks [21]. 
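To make the reconstruction term concrete, here is a minimal NumPy sketch (our own illustration, not the LFADS code) of the summed Poisson log-likelihood of binned spike counts under inferred rates; the reconstruction loss minimized during training is its negative:

```python
import math
import numpy as np

def poisson_log_likelihood(spikes, rates):
    """Summed log P(spikes | rates) for independent Poisson-distributed bins.

    spikes: non-negative integer counts, any shape (e.g. trials x time x neurons)
    rates:  inferred rates per bin, same shape, clipped away from zero
    """
    spikes = np.asarray(spikes, dtype=float)
    rates = np.clip(np.asarray(rates, dtype=float), 1e-10, None)  # avoid log(0)
    # log k! computed elementwise via lgamma(k + 1)
    log_fact = np.vectorize(lambda k: math.lgamma(k + 1.0))(spikes)
    return float(np.sum(spikes * np.log(rates) - rates - log_fact))
```

For example, a bin with 2 observed spikes and an inferred rate of 1 spike/bin contributes 2 log 1 - 1 - log 2! = -1 - log 2 to the total.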
Our aim is to regularize the system so it does not overfit, but to also use the least amount of regularization necessary, so that the model still has the capacity to capture as much structure of the data as possible. We first tested whether the model architecture itself was amenable to HP optimization by performing a random HP search. As we show below, the possibility of overfitting via the identity transformation makes such HP optimization a difficult problem.\nTo precisely measure the performance of LFADS in inferring firing rates, we needed a spiking dataset for which ground truth neural firing rates were known. Real neural data has no ground truth for direct comparison, as there is no \u201ctrue\u201d measurable firing rate. Common indirect validation measures, such as the likelihood of held-out neurons or behavioral decoding, are not adequate for precisely detecting overfitting. For example, the likelihood of held-out neurons is often a noisy measure and requires assumptions. Similarly, decoding behavior is only a coarse measure of the model\u2019s performance, as only a small fraction of neural activity directly correlates with behavior. In addition, behavioral dynamics are often much slower than neural dynamics, making behavioral measures inadequate for testing whether a model captures fine-timescale features of neural activity.\nTo provide a dataset with known neural firing rates, we created synthetic neural data by using an input-driven RNN as a model of a neural system, following [10], Sections 4.2-3. Details of the system used here are given in Supplemental Section B. We then tested the effect of varying model HPs on our ability to infer the synthetic neurons\u2019 underlying firing rates (Figure 1(b)). 
We trained 200 separate LFADS models in which the underlying model architecture was constant, but we randomly and independently chose the values of five HPs implemented in the publicly-available LFADS codepack. Two HPs were scalar multiples on the KL penalties applied to Q(z|xt) and Q(ut|xt), two HPs were L2 penalties on the recurrent weights of the generator and controller RNNs, and the last HP set the dropout probability for the input layer and the output of the RNNs.\nAs shown in Figure 1(b), varying HPs resulted in models whose performance in inferring firing rates (R2) spanned a wide range. Importantly, however, the measured validation loss did not always correspond to accuracy. Figure 1(c) shows ground truth and inferred firing rates for two artificial neurons with their corresponding spike times, for three models that spanned the range of validation losses. Both underfit and overfit models failed to capture the dynamics underlying the neurons\u2019 firing rates. Underfit models exhibited overly smooth inferred firing rates, resulting in poor R2 values and reconstruction loss. In contrast, overfit models showed a different failure mode. Rather than modeling the actual structure underlying the firing rates, the networks simply learned to pass spike times through the input channel Q(u|x), resulting in excellent reconstruction loss for the original, noisy data, but extremely poor inference of the underlying firing rates. Conceptually, the network learned a solution akin to applying an identity transformation to the spiking data. We suspect that this failure mode is more acute with spiking activity, where binned spike counts might be precisely reconstructed by nonlinear transformation of a low-dimensional, time-varying signal. It is worth noting that the AR prior described earlier might lessen, but cannot eliminate, this failure mode (all the results in Figure 1(b) are with the AR prior applied). 
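The random search over these five HPs can be mimicked as below; this is an illustrative sketch, with ranges borrowed from Table 1 rather than the exact ranges used for this experiment, and HP names of our own choosing:

```python
import math
import random

def sample_hp_config(rng):
    """Draw one random HP configuration, independently per HP.

    Scale-like HPs are drawn log-uniformly so the search covers several
    orders of magnitude evenly. Ranges follow Table 1 (illustrative here).
    """
    def log_uniform(lo, hi):
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    return {
        "kl_ic_scale": log_uniform(0.05, 5.0),   # KL penalty scale on Q(z|x_t)
        "kl_co_scale": log_uniform(0.05, 5.0),   # KL penalty scale on Q(u_t|x_t)
        "l2_gen_scale": log_uniform(5.0, 5e4),   # L2 on generator RNN weights
        "l2_con_scale": log_uniform(5.0, 5e4),   # L2 on controller RNN weights
        "dropout": rng.uniform(0.0, 0.7),        # input/RNN-output dropout prob
    }

rng = random.Random(0)
configs = [sample_hp_config(rng) for _ in range(200)]  # one per trained model
```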
The key problem is that the AR prior has a width parameter (described in Supplemental Section A.4), which is learnable. Minimizing this width allows the model to get better predictions by overfitting to spikes via inferred inputs. Forcing a minimum AR prior width might prevent overfitting, but might also prevent the model from capturing rapid changes.\nImportantly, this failure mode could not be detected through the standard method of validation, which is to hold out entire observations (trials), because those held-out trials are still shown to the network during the inference step and can be used in an identity transformation to achieve accurate reconstruction. Without a reliable validation metric, it is difficult to perform a broad HP search, because it is unclear how one should select amongst the trained models. Ideally one would use performance in inferring firing rates (R2) as a selection metric, but of course ground truth \"firing rates\" are unavailable for real datasets. One might expect that this failure mode (overfitting via the input channel Q(u|x)) could be sidestepped simply by limiting the capacity of the input channel, either via regularization or by limiting its dimensionality. However, the appropriate dimensionality of the input pathway may be heavily dependent on the dataset being tested (e.g., it may be different for different brain areas), and the susceptibility to overfitting may vary with dataset size, complexity of the underlying dynamics, and the number of neurons recorded. Without knowing a priori how to constrain the model, we need the ability to try models with larger capacity and either detect when they overfit, or prevent them from overfitting altogether.\nDenoising autoencoders: We tested whether existing regularization methods for autoencoders could prevent the overfitting problem described above. 
For this purpose, we applied two common denoising autoencoder (dAE) approaches [22]: \u2018Zero masking\u2019 (Figure 2(a)), and \u2018Salt and pepper noise\u2019 (Figure 2(b)). We repeated the same experiment presented in Figure 1(b), using both approaches and different input noise levels. Noise level is a critical free parameter, and it is not possible to know the optimal value a priori. As shown, depending on the noise level, dAEs could still show pathological overfitting, which again made standard validation cost an unreliable metric to assess model performance. Furthermore, it can be seen that the higher values of input noise reduced peak performance. Therefore, the remainder of this paper centers on searching for more generalizable solutions that avoid these limitations.\n\nFigure 2: Performance of denoising autoencoders with (a) Zero masking and (b) Salt and pepper noise.\n\n3 Validation and regularization methods to prevent overfitting\n\nWe developed two complementary approaches to counteract the failure mode of overfitting through identity transformations: 1) a different validation metric to detect overfitting (\"sample validation\"), and 2) a novel regularization strategy to force networks to model only structure that is shared between dimensions of the observed data (\"coordinated dropout\").\n\n3.1 Sample validation\n\nOur goal with sample validation (SV) was to develop a metric that detects when the networks simply pass data from input to output (e.g., via an identity transformation) rather than modeling underlying structure. 
Therefore, rather than the standard approach of holding out entire observations of xt to compute validation loss [10, 12], SV holds out individual samples randomly drawn from the [Neurons \u00d7 Time \u00d7 Trials] data matrix (Figure 3(a)). This approach, sometimes called a \"speckled holdout pattern\", is a recognized method for cross-validating principal components analysis [23] and has recently been applied to dimensionality reduction of neural data [24]. We modified our network training in two ways to integrate SV: first, at the network\u2019s input, we dropped out the held-out samples, i.e., we replaced them with zeros, and linearly scaled xt by a corresponding amount to compensate for the average decrease in input to the network (similar to [21]). Second, at the network\u2019s output, we prevented weight updating using held-out samples (or erroneous updating using the zeroed samples) by blocking backpropagation of the gradient at the specified samples. This prevents the held-out samples (or lack of samples) from inappropriately affecting network training. Finally, because the network still infers rates at the timepoints corresponding to the held-out samples, they can be used to calculate a measure of cross-validated reconstruction loss at a sample-by-sample level. The SV metric consisted of the reconstruction loss averaged over all held-out samples.\n\nFigure 3: (a) Illustration of sample validation (SV). (b) Illustration of coordinated dropout (CD) for a single training example.\n\n3.2 Coordinated dropout\n\nThe second strategy to avoid overfitting via the identity transformation, coordinated dropout (CD; Figure 3(b)), is based on the reasonable assumption that the observed activity lies in a lower-dimensional subspace. CD controls the flow of information through the network during training, in order to force the network to model only structure that is shared across input dimensions. 
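A minimal sketch of the sample-validation bookkeeping from Section 3.1 (our own illustration; in the real model the gradient at held-out samples is blocked inside the training graph, which plain NumPy cannot express, so only the masking and the metric are shown):

```python
import numpy as np

def make_sv_mask(shape, holdout_frac, rng):
    """Speckled holdout: True marks held-out entries of the
    (trials, time, neurons) data array."""
    return rng.random(shape) < holdout_frac

def sv_network_input(x, heldout, holdout_frac):
    """Zero the held-out samples and rescale the rest, compensating for the
    average decrease in input to the network (as in dropout)."""
    return np.where(heldout, 0.0, x) / (1.0 - holdout_frac)

def sv_metric(loss_per_sample, heldout):
    """SV loss: reconstruction loss averaged over held-out samples only."""
    return float(loss_per_sample[heldout].mean())
```

During training, the non-held-out samples drive the weight updates; the SV metric is then computed from the rates the network infers at the held-out entries.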
At each training step, a random mask is applied to drop out samples of the input. The complement of that mask is applied at the network\u2019s output to choose which samples should have their gradients blocked. Thus, for each training step, any data sample that is fed in as input is not used to compute the quality of the network\u2019s output. This simple strategy ensures the network cannot learn an identity transformation, because individual data samples are never used for self-reconstruction. To demonstrate the effectiveness of CD in preventing overfitting, we applied it to the simple case of uncovering latent structure from low-dimensional, noise-corrupted data using a linear autoencoding network (see Supplemental Section C).\n\n3.3 Application to the synthetic RNN dataset\n\nOur next aim was to test whether SV and CD could be used to help select LFADS models that had high performance in inferring the underlying spike rates (i.e., did not overfit to spikes), or to prevent the LFADS model from overfitting in the first place. We used the synthetic data described in Section 2.2, and again ran random HP searches (similar to Figure 1(b)). However, in this case, we applied either SV or CD to the LFADS models while leaving other HPs unchanged.\nIn the first experiment, we tested whether SV provided a more reliable metric of model performance than standard validation loss. When applying SV to the LFADS model, we held out 20% of samples from network training, as described in Section 3.1. Figure 4 shows the performance of 200 models in inferring firing rates (R2) against the sample validation loss, i.e., the average reconstruction loss over the held-out samples. 
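The coordinated-dropout masking from Section 3.2 can be sketched for one training step as follows (illustrative code, not the LFADS implementation; gradient blocking at the output is represented only by returning the complementary mask):

```python
import numpy as np

def coordinated_dropout_step(x, keep_ratio, rng):
    """Compute one CD training step's masks for a (trials, time, neurons) batch.

    Returns the rescaled network input and the complementary output mask:
    entries fed to the network (mask True) never contribute to the
    reconstruction gradient, so no sample is used to reconstruct itself,
    ruling out the identity solution.
    """
    input_mask = rng.random(x.shape) < keep_ratio
    network_input = np.where(input_mask, x, 0.0) / keep_ratio  # dropout-style rescale
    gradient_mask = ~input_mask  # only these samples' errors update the weights
    return network_input, gradient_mask
```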
With SV in place, we observed a clear correspondence between the SV loss and R2, which was in sharp contrast to the results when standard validation loss was used to evaluate LFADS models (Figure 1(b)). Models with lower SV loss generally had higher R2, which establishes SV loss as a candidate validation metric for performing model selection in HP searches.\nFor the second experiment, we tested whether CD could prevent LFADS models from overfitting. In this test we set the \"keep ratio\" to 0.7, i.e., at each training step the network only sees 70% of the input samples, and uses the complementary 30% of samples to perform backpropagation. Figure 4 shows performance with respect to the standard validation loss for the 200 models we trained with CD. Strikingly, we can see a clear correspondence between the standard validation loss and performance, indicating that CD successfully prevented overfitting during model training. Therefore, the standard validation loss becomes a reliable performance metric when models are trained with CD, and can be used to perform HP search.\nWhile SV and CD both sidestepped the overfitting problem, we found that models trained with CD had better correspondence between validation loss and performance. With SV, the best models had some variability in the relationship between SV loss and performance (R2; Figure 4(a), inset). With CD, the best models had a more direct correspondence between standard validation loss and performance (Figure 4(b), inset). Because CD produced a more reliable performance measure, we used it to train and evaluate models in the remainder of this manuscript.\nNote that while CD performed well in this test, there may be cases where it is advantageous to use SV in addition, or instead. CD acts as a strong regularizer to prevent overfitting, but it may also result in underfitting. 
By limiting data to only being seen as either input or output, but never both simultaneously, CD might prevent the model from learning all the structure present in the data. This may have a prominent effect when the number of observation dimensions (e.g., number of neurons) is small relative to the true dimensionality of the latent space. In those cases of limited information, not seeing all the observed data may severely limit the system\u2019s ability to uncover latent structure. We believe SV remains a useful validation metric, and future work will test whether SV is a good alternative HP search strategy when the dimensionality of the observed data is more limited.\n\nFigure 4: (a) Performance of 200 models with the same configuration as in Figure 1(b) plotted against sample validation loss. (b) Performance for 200 models plotted against standard validation loss, with CD used during training (blue) or CD off during training (grey, reproduced from Figure 1(b)).\n\n4 Large-scale HP search on neural population spiking activity\n\nOur results with synthetic data demonstrated that SV and CD are effective in detecting and preventing overfitting over a broad range of HPs. Next we aimed to apply these methods to real data to test whether large-scale HP search yields improvements over efforts that use fixed HPs, especially in the case of limited dataset sizes. It is important to note that, although dataset sizes vary substantially between experiments, neuroscientific datasets are often small (e.g., a few hundred samples/trials) in comparison to datasets in other fields where deep learning is applied (e.g., thousands or millions of samples). In this section we first describe the experimental data, and then lay out the details of the comparison between fixed-HP models and the HP-optimized models. 
Finally, we present the results of the performance comparison.\n\n4.1 Experimental data and evaluation framework\n\nTo characterize the effect of training dataset size on the performance of LFADS, we tested LFADS on two real neural datasets. The first dataset is the \"Monkey J Maze\" dataset used previously [12]. This dataset is considered exceptionally large (2296 trials in total), which allowed us to subsample the data and test the effect of dataset size on model performance. In this data, spiking activity was simultaneously recorded from 202 neurons in primary motor (M1) and dorsal premotor (PMd) cortices of a macaque monkey as it made 2-dimensional reaching movements with both curved and straight trajectories. A variable delay period allowed the monkey to prepare the movements before executing them. All trials were aligned by the time point at which movement was detectable (movement onset), and we analyzed the time period spanning 250 ms before and 450 ms after the movement onset. The spike trains were placed into 2 ms bins, resulting in 350 timepoints for each trial. We then randomly selected 150 neurons from the original 202 neurons recorded, and used those neurons for the entire subsequent analysis.\nThe second dataset we analyzed is publicly available (indy_20160426_01 [25]). In this dataset, a monkey made continuous, self-paced reaches to targets in a grid without any gaps or delays. Neural population activity from 181 (sorted) neurons was recorded from M1. There were a total of 715 trials, and we used the same bin size, trial length, and alignment to movement onset as described above for Monkey J Maze.\nTo evaluate model performance as a function of training dataset size, for each dataset, we trained models using 5%, 10%, 20%, and 100% of the full trial count (2296 for Monkey J Maze, 715 for the second dataset). 
For all sizes below the full trial count, we generated seven separate datasets by randomly sampling from the full dataset (except for the 368-trial count point in the random-target task (RTT) data, which we sampled 3 times). In all cases, 80% of trials were used for model training, while 20% were held out for validation. We then quantified the performance of each model by estimating the monkey\u2019s hand velocities from the model\u2019s output (i.e., inferred firing rates). Velocity was decoded using optimal linear estimation [26] with 5-fold cross-validation. We used R2 between the actual and estimated hand velocities as the metric of performance.\nFor each dataset we trained LFADS models in two scenarios: 1) when HPs are manually selected and fixed, and 2) when HP optimization is used.\n\n4.2 LFADS trained with fixed HPs\n\nWhen evaluating model performance as a function of dataset size, it is unclear how to select HPs a priori. In our previous work [12] we selected HPs using hand-tuning. We began our performance characterization of fixed-HP models using these previously-selected HPs ([12] Supp. Table 1, \"Monkey J Maze\"). However, we quickly found that performance collapsed for small dataset sizes. Though we did not fully characterize the reason for this failure, we suspect it occurred because the LFADS model previously applied to the Maze data did not attempt to infer inputs (i.e., it only modeled autonomous dynamics), and models without inputs are empirically more difficult to train. The difficulty in training may arise because the sequential autoencoder with no inputs must backpropagate the gradient through two RNNs that are each unrolled for the number of timesteps in the sequence (a very long path). 
In contrast, models that infer inputs can make progress in learning without backpropagating all the way through both unrolled RNNs, due to the input pathway. Regardless of the reason for this failure, we chose to compare performance for models with inputs, in order to give the fixed-HP models a fair chance and avoid a trivial result. Therefore, we switched to a second set of HPs from the previous work, which were used to train models with inputs on a separate task ([12] Supp. Table 1, "Monkey J CursorJump"). These HPs were previously applied to data recorded from the same monkey and the same brain region, but in a different task in which the monkey's arm was perturbed during reaches. We found that the "CursorJump" HPs maintained high performance on the full dataset size and also achieved better performance on smaller datasets than the hand-selected HPs for the Maze data. (Note that while this choice of HPs is also somewhat arbitrary, it illustrates the lack of a principled method for choosing fixed HPs when applying LFADS to different datasets.)

4.3 LFADS trained with HP optimization

To perform HP optimization, we integrated a recently developed framework for distributed optimization, Population Based Training (PBT) [27]. Briefly, PBT trains many models in parallel and uses evolutionary algorithms to select amongst the set of models and optimize HPs. PBT was shown to consistently outperform methods like random HP search on several neural network architectures, while requiring the same level of computational resources as random search [27]. Thus it seemed a more efficient framework for large-scale HP optimization than random HP search. We implemented the PBT framework based on [27] and integrated it with LFADS to perform HP optimization with a few tens of models on a local cluster.
We applied CD while training LFADS models, and used the standard validation loss as the performance metric for model selection in PBT. Two classes of model HPs might be adjusted: HPs that set the network architecture (e.g., the number of units in each RNN), and HPs that control regularization and other training parameters. In our HP optimization, we fixed the network architecture HPs to match the values used in the fixed-HPs scenario, and allowed the other HPs to vary. Specifically, we allowed PBT to optimize the learning rate, the keep ratio (i.e., the fraction of data samples passed into the model using CD), and five different regularizers: L2 penalties on the generator (L2 Gen scale) and controller (L2 Con scale) RNNs, scaling factors for the KL penalties applied to Q(z|xt) (KL IC scale) and Q(ut|xt) (KL CO scale), and the dropout probability (Table 1).

PBT enables different schedules for different HPs during training, and this results in better performance than random HP searches [27]. We generally observed that the learning rate and KL penalty scales began at the higher end of their ranges (Table 1) and decreased over the course of training. Conversely, the L2 penalties and keep ratio often increased, and dropout often remained low throughout training.

Table 1: List of HPs searched with PBT

    HP             | Value/Range   | Initialization
    L2 Gen scale   | (5, 5e4)      | log-uniform
    L2 Con scale   | (5, 5e4)      | log-uniform
    KL IC scale    | (0.05, 5)     | log-uniform
    KL CO scale    | (0.05, 5)     | log-uniform
    Dropout        | (0, 0.7)      | uniform
    Keep ratio     | (0.3, 0.99)   | 0.5
    Learning rate  | (1e-5, 0.02)  | 0.01

4.4 Results

For LFADS models trained with fixed HPs, we found that performance significantly worsened as smaller datasets were used for training (Figure 5(a),(d), red). This was expected, and illustrates previous concerns regarding applying deep learning methods with limited datasets [8].
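To make the exploit/explore mechanics of the PBT search described in Section 4.3 concrete, here is a minimal, self-contained sketch. A toy one-dimensional objective stands in for LFADS training, and only the learning rate is searched (with a log-uniform initialization, as in Table 1); this is a simplified illustration, not the authors' implementation.

```python
import math
import random

random.seed(0)

# Toy stand-in for LFADS training: "validation loss" is minimized at lr = 1e-3.
def train_step(hp, state):
    state["loss"] = (math.log10(hp["lr"]) + 3.0) ** 2 + random.gauss(0, 0.01)
    return state

def log_uniform(lo, hi):
    return 10 ** random.uniform(math.log10(lo), math.log10(hi))

# Population of models, each with its own HPs (log-uniform init, per Table 1).
population = [{"hp": {"lr": log_uniform(1e-5, 2e-2)}, "state": {}}
              for _ in range(8)]

for generation in range(20):
    for m in population:
        m["state"] = train_step(m["hp"], m["state"])
    population.sort(key=lambda m: m["state"]["loss"])
    half = len(population) // 2
    # Exploit: the bottom half copies the state and HPs of the top half.
    # Explore: the copied HPs are perturbed by a random factor.
    for loser, winner in zip(population[half:], population[:half]):
        loser["hp"] = {"lr": winner["hp"]["lr"] * random.choice([0.8, 1.25])}
        loser["state"] = dict(winner["state"])

best = min(population, key=lambda m: m["state"]["loss"])
print(best["hp"]["lr"])  # best learning rate found
```

In the real system, `train_step` would run a chunk of LFADS training, "exploit" would also copy network weights, and all seven HPs of Table 1 would be perturbed jointly.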
For the HP-optimized models (Figure 5(a),(d), black), performance with the largest dataset was comparable to the fixed-HP models, confirming that HP optimization is not critical when enough data is available. However, as the amount of training data decreased, models with optimized HPs maintained high performance even up to an impressive 10-fold reduction in training data size. (It is important to note that the Monkey J Maze and RTT datasets were chosen because they are larger than typical neuroscientific datasets, which often fall in the range where HP optimization appears to be critical.) We also quantified the average percentage performance improvement achieved by optimizing HPs, relative to fixed HPs (Figure 5(b),(e)). As the training data size decreased, HP search became critical to improving performance. To better illustrate this improvement, we compared the decoded position trajectories from a fixed-HP model trained using 10% of the data (184 trials; Figure 5(c), top) against trajectories decoded from an HP-optimized model (Figure 5(c), bottom).
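The decoding and trajectory reconstruction behind these comparisons can be sketched as follows. Synthetic rates and velocities stand in for real data, and "optimal linear estimation" is approximated here by ordinary least squares with a bias term (an assumption on our part); position trajectories are then obtained by integrating the decoded velocities, as in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: T timepoints of "inferred rates" from N neurons,
# generated from a 2-D hand velocity via a random linear map plus noise.
T, N = 2000, 30
vel = rng.standard_normal((T, 2)).cumsum(axis=0) * 0.01  # smooth-ish velocity
W_true = rng.standard_normal((2, N))
rates = vel @ W_true + 0.05 * rng.standard_normal((T, N))

# Linear estimation: least-squares map from rates (plus bias) to velocity.
X = np.hstack([rates, np.ones((T, 1))])
train = slice(0, 1600)          # one fold of a 5-fold split, for brevity
test = slice(1600, 2000)
W, *_ = np.linalg.lstsq(X[train], vel[train], rcond=None)
vel_hat = X[test] @ W

# R2 between actual and estimated velocity (the paper's metric).
ss_res = ((vel[test] - vel_hat) ** 2).sum()
ss_tot = ((vel[test] - vel[test].mean(axis=0)) ** 2).sum()
r2 = 1.0 - ss_res / ss_tot
print(round(r2, 3))

# Position trajectories: integrate decoded velocity over time (2 ms bins).
dt = 0.002
pos_hat = np.cumsum(vel_hat * dt, axis=0)
```

With real data the decoder would be fit to each model's inferred firing rates, cross-validated over five folds, and the R2 values averaged.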
The HP-optimized model led to significantly better estimation of reach trajectories.

These results demonstrate that with effective HP search, deep learning models can maintain high performance in inferring neural population dynamics even when limited data are available for training. This greatly extends the applicability of LFADS to scenarios previously thought to be challenging due to the limited availability of data for model training.

Beyond the application to spiking neural data demonstrated in this paper, the proposed techniques should be generally applicable to over-complete/sparse autoencoder architectures [28] or to forecasting time series from sparse data, especially when HP or architecture searches are important.

5 Conclusion

We demonstrated that a special case of overfitting occurs when applying SAEs to spiking neural data, which cannot be detected using standard validation metrics. This lack of a reliable validation metric prevented effective HP search in SAEs, and we demonstrated two solutions: sample validation (SV) and coordinated dropout (CD). As shown, SV can be used as an alternate validation metric that, unlike standard validation loss, is not susceptible to overfitting. CD is an effective regularization technique that prevents SAEs from overfitting to spiking data during training, which allows even the standard validation loss to be effective in evaluating model performance. We illustrated the effectiveness of SV and CD on synthetic datasets, and showed that CD leads to better correlation between the validation loss and the actual model performance. We then demonstrated the challenge

Figure 5 (left column: Maze task; right column: random target task): (a) Performance in decoding hand velocity after smoothing spikes with a Gaussian kernel (blue, σ = 60 ms), applying LFADS with fixed HPs (red), and applying LFADS with HP optimization (black).
Note that the dataset size decreases from left to right. Left and right panels show decoding of hand X- and Y-velocities, respectively. Lines and shading denote mean ± standard error across multiple models for the same dataset size (random draws of trials; note that only one sample exists for the full dataset). (b) Percentage improvement in performance from HP-optimized models, relative to fixed HPs, averaged across all models for each dataset size. (c), top: Examples of estimated (solid) and actual (dashed) reach trajectories for LFADS with fixed HPs (the model with median performance on 184 trials). (c), bottom: Reach trajectories when HP optimization was used. Trajectories were calculated by integrating the decoded hand velocities over time. (d), (e) Same results for the random target task (RTT) dataset.

of achieving good performance with the LFADS model when the training dataset size is small. With CD in place, effective HP search can greatly improve the performance of LFADS. Most importantly, we demonstrated that with effective HP search we could train SAE models that maintain high performance, even up to an impressive 10-fold reduction in training data size. Applications of SV and CD are not limited to the LFADS model; they should also be useful for other autoencoder structures applied to sparse, multidimensional data.

Acknowledgements

We thank Chris Rozell, Ali Farshchian, Raghav Tandon, Andrew Sedler, and Lahiru Wimalasena for their comments on the paper. This work has been supported by NSF NCS 1835364 and the DARPA Intelligent Neural Interfaces program.

References

[1] Jakob H Macke, Lars Buesing, John P Cunningham, Byron M Yu, Krishna V Shenoy, and Maneesh Sahani. Empirical models of spiking in neural populations. In Advances in Neural Information Processing Systems, pages 1350–1358, 2011.

[2] Lars Buesing, Jakob H Macke, and Maneesh Sahani.
Spectral learning of linear dynamics from generalised-linear observations with application to neural population data. In Advances in Neural Information Processing Systems, pages 1682–1690, 2012.

[3] Yuanjun Gao, Evan W Archer, Liam Paninski, and John P Cunningham. Linear dynamical neural population models through nonlinear embeddings. In Advances in Neural Information Processing Systems, pages 163–171, 2016.

[4] Scott Linderman, Matthew Johnson, Andrew Miller, Ryan Adams, David Blei, and Liam Paninski. Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pages 914–922, 2017.

[5] Josue Nassar, Scott W Linderman, Yuan Zhao, Mónica Bugallo, and Il Memming Park. Learning structured neural dynamics from single trial population recording. In 2018 52nd Asilomar Conference on Signals, Systems, and Computers, pages 666–670. IEEE, 2018.

[6] Byron M Yu, John P Cunningham, Gopal Santhanam, Stephen I Ryu, Krishna V Shenoy, and Maneesh Sahani. Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. In Advances in Neural Information Processing Systems, pages 1881–1888, 2009.

[7] Yuan Zhao and Il Memming Park. Variational latent Gaussian process for recovering single-trial dynamics from population spike trains.
Neural Computation, 29(5):1293–1316, 2017.

[8] Anqi Wu, Nicholas A Roy, Stephen Keeley, and Jonathan W Pillow. Gaussian process based nonlinear latent structure discovery in multivariate spike train data. In Advances in Neural Information Processing Systems, pages 3496–3505, 2017.

[9] Lea Duncker and Maneesh Sahani. Temporal alignment and latent Gaussian process factor inference in population spike trains. In Advances in Neural Information Processing Systems, pages 10445–10455, 2018.

[10] David Sussillo, Rafal Jozefowicz, LF Abbott, and Chethan Pandarinath. LFADS: latent factor analysis via dynamical systems. arXiv preprint arXiv:1608.06315, 2016.

[11] Yuan Zhao and Il Memming Park. Interpretable nonlinear dynamic modeling of neural trajectories. In Advances in Neural Information Processing Systems, pages 3333–3341, 2016.

[12] Chethan Pandarinath, Daniel J O'Shea, Jasmine Collins, Rafal Jozefowicz, Sergey D Stavisky, Jonathan C Kao, Eric M Trautmann, Matthew T Kaufman, Stephen I Ryu, Leigh R Hochberg, et al. Inferring single-trial neural population dynamics using sequential auto-encoders. Nature Methods, 2018.

[13] Lea Duncker, Gergo Bohner, Julien Boussard, and Maneesh Sahani. Learning interpretable continuous-time models of latent stochastic dynamical systems. arXiv preprint arXiv:1902.04420, 2019.

[14] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[15] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

[16] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[17] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt.
Deep variational Bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2016.

[18] Rahul G Krishnan, Uri Shalit, and David Sontag. Structured inference networks for nonlinear state space models. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.

[19] Mark M Churchland, John P Cunningham, Matthew T Kaufman, Justin D Foster, Paul Nuyujukian, Stephen I Ryu, and Krishna V Shenoy. Neural population dynamics during reaching. Nature, 487(7405):51, 2012.

[20] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3, 2017.

[21] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[22] Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371–3408, 2010.

[23] Svante Wold. Cross-validatory estimation of the number of components in factor and principal components models. Technometrics, 20(4):397–405, 1978.

[24] Alex H Williams, Tony Hyun Kim, Forea Wang, Saurabh Vyas, Stephen I Ryu, Krishna V Shenoy, Mark Schnitzer, Tamara G Kolda, and Surya Ganguli. Unsupervised discovery of demixed, low-dimensional neural dynamics across multiple timescales through tensor component analysis. Neuron, 98(6):1099–1115, 2018.

[25] Joseph E. O'Doherty, Mariana M. B. Cardoso, Joseph G. Makin, and Philip N. Sabes.
Nonhuman Primate Reaching with Multichannel Sensorimotor Cortex Electrophysiology [Data set], May 2017. Zenodo. http://doi.org/10.5281/zenodo.583331.

[26] Emilio Salinas and LF Abbott. Decoding vectorial information from firing rates. In The Neurobiology of Computation, pages 299–304. Springer, 1995.

[27] Max Jaderberg, Valentin Dalibard, Simon Osindero, Wojciech M Czarnecki, Jeff Donahue, Ali Razavi, Oriol Vinyals, Tim Green, Iain Dunning, Karen Simonyan, et al. Population based training of neural networks. arXiv preprint arXiv:1711.09846, 2017.

[28] Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011.