{"title": "A General Method for Amortizing Variational Filtering", "book": "Advances in Neural Information Processing Systems", "page_first": 7857, "page_last": 7868, "abstract": "We introduce the variational filtering EM algorithm, a simple, general-purpose method for performing variational inference in dynamical latent variable models using information from only past and present variables, i.e. filtering. The algorithm is derived from the variational objective in the filtering setting and consists of an optimization procedure at each time step. By performing each inference optimization procedure with an iterative amortized inference model, we obtain a computationally efficient implementation of the algorithm, which we call amortized variational filtering. We present experiments demonstrating that this general-purpose method improves inference performance across several recent deep dynamical latent variable models.", "full_text": "A General Method for Amortizing Variational Filtering\n\nJoseph Marino, Milan Cvitkovic, Yisong Yue\n\nCalifornia Institute of Technology\n\n{jmarino, mcvitkovic, yyue}@caltech.edu\n\nAbstract\n\nWe introduce the variational filtering EM algorithm, a simple, general-purpose\nmethod for performing variational inference in dynamical latent variable models\nusing information from only past and present variables, i.e. filtering. The algorithm\nis derived from the variational objective in the filtering setting and consists of an\noptimization procedure at each time step. By performing each inference optimization\nprocedure with an iterative amortized inference model, we obtain a computationally\nefficient implementation of the algorithm, which we call amortized variational\nfiltering. 
We present experiments demonstrating that this general-purpose method\nimproves performance across several deep dynamical latent variable models.\n\n1 Introduction\n\nComplex tasks with time-series data, like audio comprehension or robotic manipulation, must often\nbe performed online, where the model can only consider past and present information. Models for\nsuch tasks, e.g. Hidden Markov Models, frequently operate by inferring the hidden state of the world\nat each time-step. This type of online inference procedure is known as filtering. Learning filtering\nmodels purely through supervised labels or rewards can be impractical, requiring massive collections\nof labeled data or significant efforts at reward shaping. In contrast, generative models can learn\nand infer hidden structure and states directly from data. Deep latent variable models [18, 27, 37],\nin particular, offer a promising direction; they infer latent representations using expressive deep\nnetworks, commonly using variational methods to perform inference [24]. Recent works have\nextended deep latent variable models to the time-series setting, e.g. [7, 12]. However, inference\nprocedures for these dynamical models have been proposed on the basis of intuition rather than from\na rigorous inference optimization perspective, potentially limiting performance.\n\nWe introduce variational filtering EM, an algorithm for performing filtering variational inference and\nlearning that is rigorously derived from the variational objective. As detailed below, the variational\nobjective in the filtering setting results in a sequence of inference optimization objectives, with\none at each time-step. By initializing each of these inference optimization procedures from the\ncorresponding prior distribution, a classic Bayesian prediction-update loop naturally emerges. 
This\ncontrasts with existing filtering approaches for deep dynamical models, which use inference models\nthat do not explicitly account for prior predictions during inference. However, using iterative inference\nmodels [32], which overcome this limitation, we develop a computationally efficient implementation\nof the variational filtering EM algorithm, which we refer to as amortized variational filtering (AVF).\n\nThe main contributions of this paper are the variational filtering EM algorithm and its amortized\nimplementation, AVF. This general-purpose filtering algorithm is widely applicable to dynamical\nlatent variable models, as we demonstrate in our experiments. Moreover, the variational filtering EM\nalgorithm is derived from the filtering variational objective, providing a solid theoretical framework\nfor filtering inference. By precisely specifying the inference optimization procedure, this method\ntakes a simple form compared to previous hand-designed methods. Using several deep dynamical\nlatent variable models, we demonstrate that this filtering approach compares favorably against current\nmethods across a variety of benchmark sequence data sets.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n2 Background\n\nSection 2.1 provides the general form of a dynamical latent variable model. Section 2.2 covers\nvariational inference. Deep latent variable models are often trained efficiently by amortizing inference\noptimization (Section 2.3). 
Applying this technique to dynamical models is non-trivial, leading many\nprior works to use hand-designed amortized inference methods (Section 2.4).\n\n2.1 Dynamical latent variable models\n\nA sequence of T observations, x\u2264T, can be modeled using a dynamical latent variable model,\np\u03b8(x\u2264T, z\u2264T), which models the joint distribution between x\u2264T and a sequence of latent variables,\nz\u2264T, with parameters \u03b8. It is typically assumed that p\u03b8(x\u2264T, z\u2264T) can be factorized into conditional\njoint distributions at each step, p\u03b8(xt, zt|x<t, z<t), which are conditioned on preceding variables.\nThis results in the following auto-regressive formulation:\n\n    p_\\theta(x_{\\leq T}, z_{\\leq T}) = \\prod_{t=1}^{T} p_\\theta(x_t, z_t | x_{<t}, z_{<t}) = \\prod_{t=1}^{T} p_\\theta(x_t | x_{<t}, z_{\\leq t}) \\, p_\\theta(z_t | x_{<t}, z_{<t}).    (1)\n\np\u03b8(xt|x<t, z\u2264t) is the observation model, and p\u03b8(zt|x<t, z<t) is the dynamics model, both of which\ncan be arbitrary functions of their conditioning variables. However, while Eq. 1 provides the general\nform of a dynamical latent variable model, further assumptions about the dependency structure, e.g.\nMarkov, or functional forms, e.g. linear, are often necessary for tractability.\n\n2.2 Variational inference\n\nGiven a model and a set of observations, we typically want to infer the posterior for each sequence,\np(z\u2264T|x\u2264T), and learn the model parameters, \u03b8. Inference can be performed online or offline through\nBayesian filtering or smoothing respectively [38], and learning can be performed through maximum\nlikelihood estimation. Unfortunately, inference and learning are intractable for all but the simplest\nmodel classes. For non-linear functions, which are present in deep latent variable models, we must\nresort to approximate inference. 
Variational inference [24] reformulates inference as optimization by\nintroducing an approximate posterior, q(z\u2264T|x\u2264T), then minimizing the KL-divergence to the true\nposterior, p(z\u2264T|x\u2264T). To avoid evaluating p(z\u2264T|x\u2264T), one can express the KL-divergence as\n\n    D_{KL}(q(z_{\\leq T} | x_{\\leq T}) \\| p(z_{\\leq T} | x_{\\leq T})) = \\log p_\\theta(x_{\\leq T}) + F,    (2)\n\nwhere F is the variational free energy, also referred to as the (negative) evidence lower bound or\nELBO, defined as\n\n    F \\equiv -\\mathbb{E}_{q(z_{\\leq T} | x_{\\leq T})} \\left[ \\log \\frac{p_\\theta(x_{\\leq T}, z_{\\leq T})}{q(z_{\\leq T} | x_{\\leq T})} \\right].    (3)\n\nIn Eq. 2, log p\u03b8(x\u2264T) is independent of q(z\u2264T|x\u2264T), so one can minimize the KL-divergence to\nthe true posterior, thereby performing approximate inference, by minimizing F w.r.t. q(z\u2264T|x\u2264T).\nFurther, as KL-divergence is non-negative, Eq. 2 implies that free energy upper bounds the negative\nlog likelihood. Therefore, upon minimizing F w.r.t. q(z\u2264T|x\u2264T), one can use the gradient \u2207\u03b8F to\nlearn the model parameters. These two optimization procedures are respectively the expectation and\nmaximization steps of the variational EM algorithm [34], which alternate until convergence. To scale\nthis algorithm, stochastic gradients can be used for both inference [35] and learning [21].\n\n2.3 Amortized variational inference\n\nPerforming inference optimization using conventional stochastic gradient descent techniques can be\ncomputationally demanding, potentially requiring many inference iterations. To increase efficiency, a\nseparate inference model can learn to map data examples to approximate posterior estimates [8, 18,\n27, 37], thereby amortizing inference across examples [16]. Denoting the distribution parameters of\nq as \u03bbq (e.g. 
Gaussian mean and variance), standard inference models take the form\n\n    \\lambda^q \\leftarrow f_\\phi(x),    (4)\n\nwhere the inference model is denoted as f with parameters \u03c6. 
These models, though efficient, have\nlimitations. Notably, because these models only receive the data as input, they are unable to account\nfor empirical priors, which occur from one latent variable to another. Such priors arise in the dynamics\nof dynamical models, forming priors across time steps, as well as in hierarchical models, forming\npriors across levels. Previous works have neglected to include empirical priors during inference,\nattempting to overcome this limitation through heuristics, like \u201ctop-down\u201d inference in hierarchical\nmodels [40] and recurrent inference models in dynamical models, e.g. [7].\nIterative inference models [32] directly account for these priors, instead performing inference\noptimization by iteratively encoding approximate posterior estimates and gradients:\n\n    \\lambda^q \\leftarrow f_\\phi(\\lambda^q, \\nabla_{\\lambda^q} F).    (5)\n\nThe gradients, \u2207\u03bbqF, can be estimated through black box methods [35] or the reparameterization\ntrick [27, 37] when applicable. Analogously to learning to learn [1], iterative inference models learn\nto perform inference optimization, thereby learning to infer. Eq. 5 provides a viable encoding form\nfor an iterative inference model, but other forms, such as additionally encoding the data, x, can\npotentially lead to faster inference convergence. Empirically, iterative inference models have also\nbeen shown to yield improved modeling performance over comparable standard models [32].\n\n2.4 Related work\n\nMany deterministic deep dynamical latent variable models have been proposed for sequential data\n[6, 41, 30, 10]. While these models often capture many aspects of the data, they cannot account for the\nuncertainty inherent in many domains, typically arising from partial observability of the environment.\nBy averaging over multi-modal distributions, these models often produce samples in regions of low\nprobability, e.g. blurry video frames. 
This inadequacy necessitates moving to probabilistic models,\nwhich can explicitly model uncertainty to accurately capture the distribution of possible sequences.\nAmortized variational inference [27, 37] has enabled many recently proposed probabilistic deep\ndynamical latent variable models, with applications to video [42, 26, 43, 23, 15, 11, 3, 9, 29, 19],\nspeech [7, 12, 17, 22, 29], handwriting [7], music [12], etc. While these models differ in their\nfunctional mappings, most fall within the general form of Eq. 1. Crucially, simply encoding the\nobservation at each step is insufficient to accurately perform approximate inference, as the prior\ncan vary across steps. Thus, with each model, a hand-crafted amortized inference procedure has\nbeen proposed. For instance, many filtering inference methods re-use various components of the\ngenerative model [7, 12, 15, 9], while some methods introduce separate recurrent neural networks\ninto the filtering procedure [4, 9] or encode the previous latent sample [26]. Specifying a filtering\nmethod has been an engineering effort, as we have lacked a theoretical framework.\nThe variational filtering EM algorithm precisely specifies the inference optimization procedure\nimplied by the filtering variational objective. The main insight from this analysis is that, having drawn\napproximate posterior samples at previous steps, inference becomes a local optimization, depending\nonly on the current prior and observation. 
This suggests one unified approach that explicitly performs\ninference optimization at each step, replacing the current collection of custom filtering methods.\nWhen the approximate posterior at each step is initialized at the corresponding prior, this approach\nentails a Bayesian prediction-update loop, with the update composed of a gradient (error) signal.\nPerhaps the closest technique in the probabilistic modeling literature is the \u201cresidual\u201d inference\nmethod from Fraccaro et al. [12], which updates the approximate posterior mean from the prior.\nSimilar ideas have been proposed on an empirical basis for deterministic models [30, 20]. PredNet\n[30] is a deterministic model that encodes prediction errors to perform inference. This approach is\ninspired by predictive coding [36, 13], a theory from neuroscience that postulates that feedforward\npathways in sensory processing areas of the brain use prediction errors to update state estimates from\nprior predictions. In turn, this theory is motivated by classical Bayesian filtering [38], which updates\nthe posterior from the prior using the likelihood of the prediction. For linear Gaussian models, this\nmanifests as the Kalman filter [25], which uses prediction errors to perform exact inference.\nFinally, several recent works have used particle filtering in conjunction with amortized inference to\nprovide a tighter lower bound on the log likelihood for sequential data [31, 33, 28]. The techniques\ndeveloped here can also be applied to this tighter bound.\n\nFigure 1: Variational filtering EM. The diagram shows filtering inference within a dynamical latent\nvariable model, as outlined in Algorithm 1. The central gray region depicts inference optimization\nof the approximate posterior, q(zt|x\u2264t, z<t), at step t, which can be initialized at or near the\ncorresponding prior, p\u03b8(zt|x<t, z<t). 
Sampling from the approximate posterior generates the conditional\nlikelihood, p\u03b8(xt|x<t, z\u2264t), which is evaluated at the observation, xt, to calculate the reconstruction\nerror. This term is combined with the KL divergence between the approximate posterior and prior,\nyielding the step free energy, Ft (Eq. 9). Inference optimization (E-step) involves finding the\napproximate posterior that minimizes the step free energy terms. Learning (M-step), which is not shown,\ncorresponds to updating the model parameters, \u03b8, to minimize the total free energy, F.\n\n3 Variational filtering\n\nSection 3.1 describes variational filtering EM (Algorithm 1), a general algorithm for performing\nfiltering variational inference in dynamical latent variable models. In Section 3.2, we introduce a\nmethod for amortizing inference optimization using iterative inference models.\n\n3.1 Variational filtering expectation maximization (EM)\n\nIn the filtering setting, the approximate posterior at each step is conditioned only on information\nfrom past and present variables, enabling online approximate inference. This implies a structured\napproximate posterior, in which q(z\u2264T|x\u2264T) factorizes across steps as\n\n    q(z_{\\leq T} | x_{\\leq T}) = \\prod_{t=1}^{T} q(z_t | x_{\\leq t}, z_{<t}).    (6)\n\nNote that the conditioning variables in each term of q denote an indirect dependence that arises\nthrough free energy minimization and does not necessarily constitute a direct functional mapping.\n\nAlgorithm 1 Variational Filtering Expectation Maximization\n1: Input: observation sequence x1:T, model p\u03b8(x1:T, z1:T)\n2: \u2207\u03b8F = 0  \u25b7 parameter gradient\n3: for t = 1 to T do\n4:     initialize q(zt|x\u2264t, z<t)  \u25b7 at/near p\u03b8(zt|x<t, z<t)\n5:     \u02dcFt := E_{q(z<t|x<t, z<t\u22121)} [Ft]\n6:     q(zt|x\u2264t, z<t) = arg min_q \u02dcFt  \u25b7 inference (E-Step)\n7:     \u2207\u03b8F = \u2207\u03b8F + \u2207\u03b8 \u02dcFt\n8: end for\n9: \u03b8 = \u03b8 \u2212 \u03b1\u2207\u03b8F  \u25b7 learning (M-Step)\n\nUnder a filtering approximate posterior, the free energy (Eq. 
3) can be expressed as\n\n    F = \\sum_{t=1}^{T} \\mathbb{E}_{\\prod_{\\tau=1}^{t-1} q(z_\\tau | x_{\\leq \\tau}, z_{<\\tau})} [F_t] = \\sum_{t=1}^{T} \\tilde{F}_t,    (7)\n\n(see Appendix A for the derivation) where Ft is the step free energy, defined as\n\n    F_t \\equiv -\\mathbb{E}_{q(z_t | x_{\\leq t}, z_{<t})} \\left[ \\log \\frac{p_\\theta(x_t, z_t | x_{<t}, z_{<t})}{q(z_t | x_{\\leq t}, z_{<t})} \\right],    (8)\n\nand we have also defined \u02dcFt as the tth term in the summation. Note that with a single step, the filtering\nfree energy reduces to the first step free energy, thereby recovering the static case. As in this setting,\nthe step free energy can be re-expressed as a reconstruction term and a KL-divergence term:\n\n    F_t = -\\mathbb{E}_{q(z_t | x_{\\leq t}, z_{<t})} [\\log p_\\theta(x_t | x_{<t}, z_{\\leq t})] + D_{KL}(q(z_t | x_{\\leq t}, z_{<t}) \\| p_\\theta(z_t | x_{<t}, z_{<t})).    (9)\n\nThe filtering free energy in Eq. 7 is the sum of these step free energy terms, each of which is evaluated\naccording to expectations over past latent sequences. 
To perform filtering variational inference, we\nmust find the set of T terms in q(z\u2264T|x\u2264T) that minimize the filtering free energy summation.\nWe now describe the variational filtering EM algorithm, given in Algorithm 1 and depicted in Figure\n1, which optimizes Eq. 7. This algorithm sequentially optimizes each of the approximate posterior\nterms to perform filtering inference. Consider the approximate posterior at step t, q(zt|x\u2264t, z<t).\nThis term appears in F, either directly or in expectations, in terms t through T of the summation:\n\n    F = \\tilde{F}_1 + \\tilde{F}_2 + \\cdots + \\tilde{F}_{t-1} + \\tilde{F}_t + \\tilde{F}_{t+1} + \\cdots + \\tilde{F}_{T-1} + \\tilde{F}_T.    (10)\n\nIn Eq. 10, terms 1 through t are the steps on which q(zt|x\u2264t, z<t) depends, and terms t through T are\nthe terms in which q(zt|x\u2264t, z<t) appears.\n\nHowever, the filtering setting dictates that the optimization of the approximate posterior at each step\ncan only condition on past and present variables, i.e. steps 1 through t. Therefore, of the T terms in\nF, the only term through which we can optimize q(zt|x\u2264t, z<t) is the tth term:\n\n    q^*(z_t | x_{\\leq t}, z_{<t}) = \\arg\\min_{q(z_t | x_{\\leq t}, z_{<t})} \\tilde{F}_t.    (11)\n\nOptimizing \u02dcFt requires evaluating expectations over previous approximate posteriors. Again, because\napproximate posterior estimates cannot be influenced by future variables, these past expectations\nremain fixed through the future. Thus, variational filtering (the variational E-step) can be performed\nby sequentially minimizing each Ft w.r.t. q(zt|x\u2264t, z<t), holding the expectations over past variables\nfixed. 
Conveniently, once the past expectations have been evaluated, inference optimization is entirely\ndefined by the free energy at that step.\nFor simple models, such as linear Gaussian models, these expectations may be computed exactly.\nHowever, in general, the expectations must be estimated through Monte Carlo samples from q,\nwith inference optimization carried out using stochastic gradients [35]. As in the static setting,\nwe can initialize q(zt|x\u2264t, z<t) at (or near) the prior, p\u03b8(zt|x<t, z<t). This yields a simple\ninterpretation: starting with q at the prior, we generate a prediction of the data through the likelihood,\np\u03b8(xt|x<t, z\u2264t), to evaluate the current step free energy. Using the approximate posterior gradient,\nwe then perform an inference update to the estimate of q. This resembles classical Bayesian filtering,\nwhere the posterior is updated from the prior prediction according to the likelihood of observations.\nUnlike the classical setting, reconstruction and update steps are repeated until inference convergence.\nAfter inferring an optimal approximate posterior, learning (the variational M-step) can be performed\nby minimizing the total filtering free energy w.r.t. the model parameters, \u03b8. As Eq. 7 is a summation\nand differentiation is a linear operation, \u2207\u03b8F is the sum of contributions from each of these terms:\n\n    \\nabla_\\theta F = \\sum_{t=1}^{T} \\nabla_\\theta \\left[ \\mathbb{E}_{\\prod_{\\tau=1}^{t-1} q(z_\\tau | x_{\\leq \\tau}, z_{<\\tau})} [F_t] \\right].    (12)\n\nParameter gradients can be estimated online by accumulating the result from each term in the filtering\nfree energy. The parameters are then updated at the end of the sequence. 
For large data sets, stochastic\nestimates of parameter gradients can be obtained from a mini-batch of data examples [21].\n\n3.2 Amortized variational filtering\n\nPerforming approximate inference optimization (Algorithm 1, Line 6) with traditional techniques can\nbe computationally costly, requiring many iterations of gradient updates and hand-tuning of optimizer\nhyper-parameters. In online settings, with large models and data sets, this may be impractical. An\nalternative approach is to employ an amortized inference model, which can learn to minimize Ft w.r.t.\nq(zt|x\u2264t, z<t) more efficiently at each step. Note that Ft (Eq. 8) contains p\u03b8(xt, zt|x<t, z<t) =\np\u03b8(xt|x<t, z\u2264t)p\u03b8(zt|x<t, z<t). The prior, p\u03b8(zt|x<t, z<t), varies across steps, constituting the\nlatent dynamics. Standard inference models, which only encode xt, do not have access to the prior\nand therefore cannot properly optimize q(zt|x\u2264t, z<t). Many inference models in the sequential\nsetting attempt to account for this information by including hidden states, e.g. [7, 12, 9]. However,\ngiven the complexities of many generative models, it can be difficult to determine how to properly\nroute the necessary prior information into the inference model. As a result, each dynamical latent\nvariable model has been proposed with an accompanying custom inference model set-up.\nWe propose a simple and general alternative method for amortizing filtering inference that is agnostic\nto the particular form of the generative model. Iterative inference models [32] naturally account for\nthe changing prior through the approximate posterior gradients. These models are thus a natural\ncandidate for performing inference at each step. Similar to Eq. 
5, when q(zt|x\u2264t, z<t) is a parametric\ndistribution with parameters \u03bb^q_t, the inference update takes the form:\n\n    \\lambda^q_t \\leftarrow f_\\phi(\\lambda^q_t, \\nabla_{\\lambda^q_t} \\tilde{F}_t).    (13)\n\nWe refer to this set-up as amortized variational filtering (AVF). As in Eq. 5, we note that Eq. 13\noffers just one particular encoding form for an iterative inference model. For instance, xt could be\nadditionally encoded at each step. Marino et al. also note that in latent Gaussian models, precision-weighted\nerrors provide an alternative inference optimization signal [32]. There are two main benefits\nto using iterative inference models in the filtering setting:\n\n\u2022 The approximate posterior is updated from the prior, so model capacity is utilized for\n  inference corrections rather than re-estimating the approximate posterior at each step.\n\u2022 These inference models contain all of the terms necessary to perform inference optimization,\n  providing a simple model form that does not require any additional hidden states or inputs.\n\nIn practice, these advantages permit the use of relatively simple iterative inference models that can\nperform filtering inference efficiently and accurately. We demonstrate this in the following section.\n\n4 Experiments\n\nWe empirically evaluate amortized variational filtering using multiple deep dynamical latent Gaussian\nmodel architectures on a variety of sequence data sets. Specifically, we use AVF to train VRNN\n[7], SRNN [12], and SVG [9] on speech [14], music [5], and video [39] data. In each setting, we\ncompare AVF against the originally proposed filtering method for the model. Diagrams of the filtering\nmethods are shown in Figure 2. Implementations of the models are based on code provided by\nthe respective authors of VRNN\u00b9, SRNN\u00b2, and SVG\u00b3. 
Accompanying code can be found online at\ngithub.com/joelouismarino/amortized-variational-filtering.\n\n\u00b9 https://github.com/jych/nips2015_vrnn\n\u00b2 https://github.com/marcofraccaro/srnn\n\u00b3 https://github.com/edenton/svg\n\n4.1 Experiment set-up\n\nIterative inference models are implemented as specified in Eq. 13, encoding the approximate posterior\nparameters and their gradients at each inference iteration at each step. Following [32], we normalize\nthe inputs to the inference model using layer normalization [2]. The generative models that we\nevaluate contain non-spatial latent variables; thus, we use fully-connected layers to parameterize the\ninference models. Importantly, minimal effort went into engineering the inference model architectures:\nacross all models and data sets, we utilize the same inference model architecture for AVF. Further\ndetails are found in Appendix B.\n\n(a) VRNN    (b) SRNN    (c) SVG    (d) AVF\n\nFigure 2: Filtering inference models for VRNN, SRNN, SVG, and AVF. Each diagram shows the\ncomputational graph for inferring the approximate posterior parameters, \u03bbq, at step t. Previously\nproposed methods rely on hand-crafted architectures of observations, hidden states, and latent\nvariables. AVF is a simple, general filtering procedure that only requires the local inference gradient.\n\n4.1.1 Speech modeling\n\nModels For speech modeling, we use VRNN and SRNN, attempting to keep the model architectures\nconsistent with the original implementations. The most notable difference in our implementation\noccurs in SRNN, where we use an LSTM rather than a GRU as the recurrent module. As in [12], we\nanneal the KL divergence initially during training. In both models, we use a Gaussian output density. 
Unlike [7, 12, 17], which evaluate log densities, we evaluate and report log probabilities by integrating\nthe output density over the data discretization window, as in modeling image pixels. This permits\ncomparison across different output distributions.\n\nFigure 3: Test data (top), output predictions (middle), and reconstructions (bottom) for TIMIT using\nSRNN with AVF. Sequences run from left to right. The predictions made by the model already contain\nthe general structure of the data. AVF explicitly updates the approximate posterior from the prior\nprediction, focusing on inference corrections rather than re-estimation.\n\nData We train and evaluate on TIMIT [14], which consists of audio recordings of 6,300 sentences\nspoken by 630 individuals. As performed in [7], we sample the audio waveforms at 16 kHz, split\nthe training and validation sets into half-second clips, and group each sequence into bins of 200\nconsecutive samples. Thus, each training and validation sequence consists of 40 model steps.\nEvaluation is performed on the full duration of each test sequence, averaging roughly 3 seconds.\n\n4.1.2 Music modeling\n\nModel We model polyphonic music using SRNN. The generative model architecture is the same as\nin the speech modeling experiments, with changes in the number of layers and units to match [12].\nTo model the binary music notes, we use a Bernoulli output distribution. Again, we anneal the KL\ndivergence initially during training.\n\nData We use four data sets of polyphonic (MIDI) music [5]: Piano-midi.de, MuseData, JSB\nChorales, and Nottingham. Each data set contains between 100 and 1,000 songs, with each song\nbetween 100 and 4,000 steps. For training and validation, we break the sequences into clips of length\n25, and we test on the entire test sequences.\n\n(a) Free energy per step (nats) vs. training inference iterations    (b) Improvement in free energy (%) vs. inference iteration\n\nFigure 4: Improvement with inference iterations. 
Results are shown on the TIMIT validation set\nusing VRNN with AVF. (a) Average free energy per step with varying numbers of inference iterations\nduring training. 
Additional iterations tend to result in improved performance. (b) Average relative\nimprovement in free energy from the initial (prior) estimate at each inference iteration for a single\nmodel. Empirically, each successive iteration provides further, smaller improvements.\n\n4.1.3 Video modeling\n\nModel Our implementation of SVG differs from the original model in that we evaluate conditional\nlog-likelihood under a Gaussian output density rather than mean squared output error. All other\narchitecture details are identical to the original model. However, [9] down-weight the KL-divergence\nby a factor of 1e-6 at all steps. We instead remove this factor to use the free energy during training\nand evaluation. As is to be expected, this results in the model using the latent variables to a lesser extent.\nWe train and evaluate SVG using filtering inference at all steps, rather than predicting multiple steps\ninto the future, as in [9].\n\nData We train and evaluate SVG on KTH Actions [39], which contains 760 train / 768 val / 863\ntest videos of people performing various actions, each of which is roughly 50 to 150 frames long.\nFrames are re-sized to 64 \u00d7 64. For training and validation, we split the data into clips of 20 frames.\n\n4.2 Results\n\n4.2.1 Additional Inference Iterations\n\nThe variational filtering EM algorithm involves inference optimization at each step (Algorithm 1,\nLine 6). AVF optimizes each approximate posterior through a model that learns to perform iterative\nupdates (Eq. 13). Additional inference iterations may lead to further improvement in performance\n[32]. We explore this aspect on TIMIT using VRNN. 
In Figure 4a, we plot the average free energy\nper step on validation sequences for models trained with varying numbers of inference iterations.\nFigure 4b shows average relative improvement over the prior estimate for a single model trained with\n8 inference iterations. We observe that training with additional inference iterations empirically leads\nto improved performance (Figure 4a), with each iteration providing diminishing improvement during\ninference (Figure 4b). This aspect is distinct from many baseline filtering methods, which directly\noutput the approximate posterior at each step.\nWe can also directly visualize inference improvement through the model output. Figure 3 illustrates\nexample reconstructions over inference iterations, using SRNN on TIMIT. At the initial inference\niteration, the approximate posterior is initialized from the prior, resulting in an output prediction.\nThe iterative inference model then uses the approximate posterior gradients to update the estimate,\nimproving the output reconstruction.\n\nTable 1: Average free energy per step (in nats) on the TIMIT speech data set for SRNN and\nVRNN with the respective originally proposed filtering procedures (baselines) and with AVF.\n\nModel  Method    TIMIT\nVRNN   baseline  1,082\nVRNN   AVF       1,105\nSRNN   baseline  1,026\nSRNN   AVF       1,024\n\nTable 2: Average free energy per step (in nats) on the KTH Actions video data set for SVG\nwith the originally proposed filtering procedure (baseline) and with AVF.\n\nModel  Method    KTH Actions\nSVG    baseline  15,097\nSVG    AVF       11,714\n\nTable 3: Average free energy per step (in nats) on polyphonic music data sets for SRNN with and\nwithout AVF. Results from Fraccaro et al. 
[12] are provided for comparison; however, our model\nimplementation differs in several aspects (see Appendix B).\n\nModel  Method         Piano-midi.de  MuseData  JSB Chorales  Nottingham\nSRNN   baseline [12]  8.20           6.28      4.74          2.94\nSRNN   baseline       8.19           6.27      6.92          3.19\nSRNN   AVF            8.12           5.99      6.97          3.13\n\n4.2.2 Quantitative Comparison\n\nTables 1, 2, and 3 present quantitative comparisons of average \ufb01ltering free energy per step between\nAVF (with 1 inference iteration per step) and baseline \ufb01ltering methods for TIMIT, KTH Actions,\nand the polyphonic music data sets, respectively. On TIMIT, training with AVF performs comparably\nto the baseline methods for both VRNN and SRNN. We note that VRNN with AVF using 2 inference\niterations resulted in a \ufb01nal test performance of 1,071 nats per step, outperforming the baseline\nmethod. Similar results are also observed on each of the polyphonic music data sets. Again, increasing\nthe number of inference iterations to 5 for AVF on JSB Chorales resulted in a \ufb01nal test performance\nof 6.77 nats per step, compared with 6.92 for the baseline. AVF signi\ufb01cantly improves the performance\nof SVG on KTH Actions (11,714 vs. 15,097 nats per step). We attribute this most likely to the absence\nof the KL down-weighting factor in our training objective, in contrast to [9]; without that factor, the\nbaseline \ufb01ltering procedure appears to struggle more than AVF.\nTaken together, these results show that AVF is a general \ufb01ltering procedure that performs\nwell across multiple models and data sets, despite using a relatively simple inference model structure.\n\n5 Conclusion\n\nWe introduced the variational \ufb01ltering EM algorithm for \ufb01ltering in dynamical latent variable models.\nVariational \ufb01ltering inference can be expressed as a sequence of optimization objectives, linked\nacross steps through previous latent samples. Using iterative inference models to perform inference\noptimization, we arrived at an ef\ufb01cient implementation of the algorithm: amortized variational\n\ufb01ltering. 
This general \ufb01ltering algorithm scales to large models and data sets. Numerous methods\nhave been proposed for \ufb01ltering in deep dynamical latent variable models, with each method\nhand-designed for a particular model. The variational \ufb01ltering EM algorithm provides a single framework for\nanalyzing and constructing these methods. Amortized variational \ufb01ltering is a simple, theoretically\nmotivated, and general \ufb01ltering method that we have shown performs on par with or better than\nmultiple existing state-of-the-art methods.\n\nAcknowledgments\n\nWe would like to thank Matteo Ruggero Ronchi for helpful discussions. This work was supported by\nthe following grants: JPL PDF 1584398, NSF 1564330, and NSF 1637598.\n\nReferences\n\n[1] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, 2016.\n\n[2] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. In NIPS Deep Learning Symposium, 2016.\n\n[3] Mohammad Babaeizadeh, Chelsea Finn, Dumitru Erhan, Roy H Campbell, and Sergey Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018.\n\n[4] Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. In NIPS 2014 Workshop on Advances in Variational Inference, 2014.\n\n[5] Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In International Conference on Machine Learning, 2012.\n\n[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. 
arXiv preprint arXiv:1412.3555, 2014.\n\n[7] Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, 2015.\n\n[8] Peter Dayan, Geoffrey E Hinton, Radford M Neal, and Richard S Zemel. The Helmholtz machine. Neural Computation, 7(5):889\u2013904, 1995.\n\n[9] Emily Denton and Rob Fergus. Stochastic video generation with a learned prior. In International Conference on Machine Learning, 2018.\n\n[10] Chelsea Finn, Ian Goodfellow, and Sergey Levine. Unsupervised learning for physical interaction through video prediction. In Advances in Neural Information Processing Systems, 2016.\n\n[11] Marco Fraccaro, Simon Kamronn, Ulrich Paquet, and Ole Winther. A disentangled recognition and nonlinear dynamics model for unsupervised learning. In Advances in Neural Information Processing Systems, 2017.\n\n[12] Marco Fraccaro, S\u00f8ren Kaae S\u00f8nderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems, 2016.\n\n[13] Karl Friston. A theory of cortical responses. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 360(1456):815\u2013836, 2005.\n\n[14] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren. DARPA TIMIT acoustic phonetic continuous speech corpus, 1993.\n\n[15] Mevlana Gemici, Chia-Chun Hung, Adam Santoro, Greg Wayne, Shakir Mohamed, Danilo J Rezende, David Amos, and Timothy Lillicrap. Generative temporal models with memory. arXiv preprint arXiv:1702.04649, 2017.\n\n[16] Samuel Gershman and Noah Goodman. Amortized inference in probabilistic reasoning. In Cognitive Science Society, volume 36, 2014.\n\n[17] Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre C\u00f4t\u00e9, Nan Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems, 2017.\n\n[18] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning, 2014.\n\n[19] Jiawei He, Andreas Lehrmann, Joseph Marino, Greg Mori, and Leonid Sigal. Probabilistic video generation using holistic attribute control. In European Conference on Computer Vision, 2018.\n\n[20] Mikael Henaff, Junbo Zhao, and Yann LeCun. Prediction under uncertainty with error-encoding networks. arXiv preprint arXiv:1711.04994, 2017.\n\n[21] Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303\u20131347, 2013.\n\n[22] Wei-Ning Hsu, Yu Zhang, and James Glass. Unsupervised learning of disentangled and interpretable representations from sequential data. In Advances in Neural Information Processing Systems, 2017.\n\n[23] Matthew Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. Composing graphical models with neural networks for structured representations and fast inference. In Advances in Neural Information Processing Systems, 2016.\n\n[24] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. NATO ASI Series D Behavioural and Social Sciences, 89:105\u2013162, 1998.\n\n[25] Rudolph Emil Kalman et al. A new approach to linear \ufb01ltering and prediction problems. Journal of Basic Engineering, 82(1):35\u201345, 1960.\n\n[26] Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational Bayes \ufb01lters: Unsupervised learning of state space models from raw data. In International Conference on Learning Representations, 2017.\n\n[27] Diederik P Kingma and Max Welling. 
Stochastic gradient VB and the variational auto-encoder. In International Conference on Learning Representations, 2014.\n\n[28] Tuan Anh Le, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. Auto-encoding sequential Monte Carlo. In International Conference on Learning Representations, 2018.\n\n[29] Yingzhen Li and Stephan Mandt. A deep generative model for disentangled representations of sequential data. In International Conference on Machine Learning, 2018.\n\n[30] William Lotter, Gabriel Kreiman, and David Cox. Deep predictive coding networks for video prediction and unsupervised learning. In International Conference on Learning Representations, 2017.\n\n[31] Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, 2017.\n\n[32] Joseph Marino, Yisong Yue, and Stephan Mandt. Iterative amortized inference. In International Conference on Machine Learning, 2018.\n\n[33] Christian Naesseth, Scott Linderman, Rajesh Ranganath, and David Blei. Variational sequential Monte Carlo. In International Conference on Arti\ufb01cial Intelligence and Statistics, 2018.\n\n[34] Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justi\ufb01es incremental, sparse, and other variants. In Learning in Graphical Models, pages 355\u2013368. Springer, 1998.\n\n[35] Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In Arti\ufb01cial Intelligence and Statistics, 2014.\n\n[36] Rajesh PN Rao and Dana H Ballard. Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-\ufb01eld effects. Nature Neuroscience, 2(1), 1999.\n\n[37] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. 
In International Conference on Machine Learning, 2014.\n\n[38] Simo S\u00e4rkk\u00e4. Bayesian \ufb01ltering and smoothing, volume 3. Cambridge University Press, 2013.\n\n[39] Christian Schuldt, Ivan Laptev, and Barbara Caputo. Recognizing human actions: a local SVM approach. In International Conference on Pattern Recognition, 2004.\n\n[40] Casper Kaae S\u00f8nderby, Tapani Raiko, Lars Maal\u00f8e, S\u00f8ren Kaae S\u00f8nderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, 2016.\n\n[41] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhudinov. Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, 2015.\n\n[42] Jacob Walker, Carl Doersch, Abhinav Gupta, and Martial Hebert. An uncertain future: Forecasting from static images using variational autoencoders. In European Conference on Computer Vision, 2016.\n\n[43] Tianfan Xue, Jiajun Wu, Katherine Bouman, and Bill Freeman. Visual dynamics: Probabilistic future frame synthesis via cross convolutional networks. In Advances in Neural Information Processing Systems, 2016.\n", "award": [], "sourceid": 4890, "authors": [{"given_name": "Joseph", "family_name": "Marino", "institution": "California Institute of Technology"}, {"given_name": "Milan", "family_name": "Cvitkovic", "institution": "California Institute of Technology"}, {"given_name": "Yisong", "family_name": "Yue", "institution": "Caltech"}]}