{"title": "Data Generation as Sequential Decision Making", "book": "Advances in Neural Information Processing Systems", "page_first": 3249, "page_last": 3257, "abstract": "We connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation -- perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation as an MDP and develop models capable of representing effective policies for it. We construct the models using neural networks and train them using a form of guided policy search. Our models generate predictions through an iterative process of feedback and refinement. We show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets.", "full_text": "Data Generation as Sequential Decision Making\n\nPhilip Bachman\nMcGill University, School of Computer Science\nphil.bachman@gmail.com\n\nDoina Precup\nMcGill University, School of Computer Science\ndprecup@cs.mcgill.ca\n\nAbstract\n\nWe connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation \u2013 perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation as an MDP and develop models capable of representing effective policies for it. We construct the models using neural networks and train them using a form of guided policy search [9]. Our models generate predictions through an iterative process of feedback and refinement. 
We show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets.\n\n1 Introduction\n\nDirected generative models are naturally interpreted as specifying sequential procedures for generating data. We traditionally think of this process as sampling, but one could also view it as making sequences of decisions for how to set the variables at each node in a model, conditioned on the settings of its parents, thereby generating data from the model. The large body of existing work on reinforcement learning provides powerful tools for addressing such sequential decision making problems. We encourage the use of these tools to understand and improve the extended processes currently driving advances in generative modelling. We show how sequential decision making can be applied to general prediction tasks by developing models which construct predictions by iteratively refining a working hypothesis under guidance from exogenous input and endogenous feedback.\n\nWe begin this paper by reinterpreting several recent generative models as sequential decision making processes, and then show how changes inspired by this point of view can improve the performance of the LSTM-based model introduced in [3]. Next, we explore the connections between directed generative models and reinforcement learning more fully by developing an approach to training policies for sequential data imputation. We base our approach on formulating imputation as a finite-horizon Markov Decision Process which one can also interpret as a deep, directed graphical model. We propose two policy representations for the imputation MDP. One extends the model in [3] by inserting an explicit feedback loop into the generative process, and the other addresses the MDP more directly. We train our models/policies using techniques motivated by guided policy search [9, 10, 11, 8]. 
We examine their qualitative and quantitative performance across imputation problems covering a range of difficulties (i.e. different amounts of data to impute and different \u201cmissingness mechanisms\u201d), and across multiple datasets. Given the relative paucity of existing approaches to the general imputation problem, we compare our models to each other and to two simple baselines. We also test how our policies perform when they use fewer/more steps to refine their predictions.\n\nAs imputation encompasses both classification and standard (i.e. unconditional) generative modelling, our work suggests that further study of models for the general imputation problem is worthwhile. The performance of our models suggests that sequential stochastic construction of predictions, guided by both input and feedback, should prove useful for a wide range of problems. Training these models can be challenging, but lessons from reinforcement learning may bring some relief.\n\n2 Directed Generative Models as Sequential Decision Processes\n\nDirected generative models have grown in popularity relative to their undirected counterparts [6, 14, 12, 4, 5, 16, 15] (etc.). Reasons include: the development of efficient methods for training them, the ease of sampling from them, and the tractability of bounds on their log-likelihoods. Growth in available computing power compounds these benefits. One can interpret the (ancestral) sampling process in a directed model as repeatedly setting subsets of the latent variables to particular values, in a sequence of decisions conditioned on preceding decisions. Each subsequent decision restricts the set of potential outcomes for the overall sequence. Intuitively, these models encode stochastic procedures for constructing plausible observations. 
This section formally explores this perspective.\n\n2.1 Deep AutoRegressive Networks\n\nThe deep autoregressive networks investigated in [4] define distributions of the following form:\n\n$$p(x) = \\sum_{z} p(x|z)\\,p(z), \\quad \\text{with} \\quad p(z) = p_0(z_0) \\prod_{t=1}^{T} p_t(z_t | z_0, \\ldots, z_{t-1}) \\tag{1}$$\n\nin which x indicates a generated observation and z0, ..., zT represent latent variables in the model. The distribution p(x|z) may be factored similarly to p(z). The form of p(z) in Eqn. 1 can represent arbitrary distributions over the latent variables, and the work in [4] mainly concerned approaches to parameterizing the conditionals pt(zt|z0, ..., zt\u22121) that restricted representational power in exchange for computational tractability. To appreciate the generality of Eqn. 1, consider using zt that are univariate, multivariate, structured, etc. One can interpret any model based on this sequential factorization of p(z) as a non-stationary policy pt(zt|st) for selecting each action zt in a state st, with each st determined by all zt\u2032 for t\u2032 < t, and train it using some form of policy search.\n\n2.2 Generalized Guided Policy Search\n\nWe adopt a broader interpretation of guided policy search than one might initially take from, e.g., [9, 10, 11, 8]. We provide a review of guided policy search in the supplementary material. 
Our expanded definition of guided policy search includes any optimization of the general form:\n\n$$\\underset{p,\\,q}{\\text{minimize}} \\; \\mathbb{E}_{i_q \\sim \\mathcal{I}_q} \\, \\mathbb{E}_{i_p \\sim \\mathcal{I}_p(\\cdot|i_q)} \\left[ \\mathbb{E}_{\\tau \\sim q(\\tau|i_q,i_p)} \\left[ \\ell(\\tau, i_q, i_p) \\right] + \\lambda \\, \\text{div}\\left( q(\\tau|i_q,i_p), \\, p(\\tau|i_p) \\right) \\right] \\tag{2}$$\n\nin which p indicates the primary policy, q indicates the guide policy, Iq indicates a distribution over information available only to q, Ip indicates a distribution over information available to both p and q, \u2113(\u03c4, iq, ip) computes the cost of trajectory \u03c4 in the context of iq/ip, and div(q(\u03c4|iq, ip), p(\u03c4|ip)) measures dissimilarity between the trajectory distributions generated by p/q. As \u03bb > 0 goes to infinity, Eqn. 2 enforces the constraint p(\u03c4|ip) = q(\u03c4|iq, ip), \u2200\u03c4, ip, iq. Terms for controlling, e.g., the entropy of p/q can also be added. The power of the objective in Eqn. 2 stems from two main points: the guide policy q can use information iq that is unavailable to the primary policy p, and the primary policy need only be trained to minimize the dissimilarity term div(q(\u03c4|iq, ip), p(\u03c4|ip)).\n\nFor example, a directed model structured as in Eqn. 1 can be interpreted as specifying a policy for a finite-horizon MDP whose terminal state distribution encodes p(x). In this MDP, the state at time 1 \u2264 t \u2264 T + 1 is determined by {z0, ..., zt\u22121}. The policy picks an action zt \u2208 Zt at time 1 \u2264 t \u2264 T, and picks an action x \u2208 X at time t = T + 1. I.e., the policy can be written as pt(zt|z0, ..., zt\u22121) for 1 \u2264 t \u2264 T, and as p(x|z0, ..., zT) for t = T + 1. The initial state z0 \u2208 Z0 is drawn from p0(z0). 
Executing the policy for a single trial produces a trajectory \u03c4 \u225c {z0, ..., zT, x}, and the distribution over xs from these trajectories is just p(x) in the corresponding directed generative model.\n\nThe authors of [4] train deep autoregressive networks by maximizing a variational lower bound on the training set log-likelihood. 
To do this, they introduce a variational distribution q which provides q0(z0|x\u2217) and qt(zt|z0, ..., zt\u22121, x\u2217) for 1 \u2264 t \u2264 T, with the final step q(x|z0, ..., zT, x\u2217) given by a Dirac-delta at x\u2217. Given these definitions, the training in [4] can be interpreted as guided policy search for the MDP described in the previous paragraph. Specifically, the variational distribution q provides a guide policy q(\u03c4|x\u2217) over trajectories \u03c4 \u225c {z0, ..., zT, x\u2217}:\n\n$$q(\\tau|x^*) \\triangleq q(x|z_0, \\ldots, z_T, x^*) \\, q_0(z_0|x^*) \\prod_{t=1}^{T} q_t(z_t|z_0, \\ldots, z_{t-1}, x^*) \\tag{3}$$\n\nThe primary policy p generates trajectories distributed according to:\n\n$$p(\\tau) \\triangleq p(x|z_0, \\ldots, z_T) \\, p_0(z_0) \\prod_{t=1}^{T} p_t(z_t|z_0, \\ldots, z_{t-1}) \\tag{4}$$\n\nwhich does not depend on x\u2217. In this case, x\u2217 corresponds to the guide-only information iq \u223c Iq in Eqn. 2. We now rewrite the variational optimization as:\n\n$$\\underset{p,\\,q}{\\text{minimize}} \\; \\mathbb{E}_{x^* \\sim \\mathcal{D}_X} \\left[ \\mathbb{E}_{\\tau \\sim q(\\tau|x^*)} \\left[ \\ell(\\tau, x^*) \\right] + \\text{KL}\\left( q(\\tau|x^*) \\,||\\, p(\\tau) \\right) \\right] \\tag{5}$$\n\nwhere \u2113(\u03c4, x\u2217) \u225c 0 and DX indicates the target distribution for the terminal state of the primary policy p.\u00b9 When expanded, the KL term in Eqn. 5 becomes:\n\n$$\\text{KL}\\left( q(\\tau|x^*) \\,||\\, p(\\tau) \\right) = \\mathbb{E}_{\\tau \\sim q(\\tau|x^*)} \\left[ \\log \\frac{q_0(z_0|x^*)}{p_0(z_0)} + \\sum_{t=1}^{T} \\log \\frac{q_t(z_t|z_0, \\ldots, z_{t-1}, x^*)}{p_t(z_t|z_0, \\ldots, z_{t-1})} - \\log p(x^*|z_0, \\ldots, z_T) \\right] \\tag{6}$$\n\nThus, the variational approach used in [4] for training directed generative models can be interpreted as a form of generalized guided policy search. As the form in Eqn. 
1 can represent any finite directed generative model, the preceding derivation extends to all models we discuss in this paper.\u00b2\n\n2.3 Time-reversible Stochastic Processes\n\nOne can simplify Eqn. 1 by assuming suitable forms for X and Z0, ..., ZT. E.g., the authors of [16] proposed a model in which Zt \u2261 X for all t and p0(x0) was Gaussian. We can write their model as:\n\n$$p(x_T) = \\sum_{x_0, \\ldots, x_{T-1}} p_T(x_T|x_{T-1}) \\, p_0(x_0) \\prod_{t=1}^{T-1} p_t(x_t|x_{t-1}) \\tag{7}$$\n\nwhere p(xT) indicates the terminal state distribution of the non-stationary, finite-horizon Markov process determined by {p0(x0), p1(x1|x0), ..., pT(xT|xT\u22121)}. Note that, throughout this paper, we (ab)use sums over latent variables and trajectories which could/should be written as integrals.\n\nThe authors of [16] observed that, for any reasonably smooth target distribution DX and sufficiently large T, one can define a \u201creverse-time\u201d stochastic process qt(xt\u22121|xt) with simple, time-invariant dynamics that transforms q(xT) \u225c DX into the Gaussian distribution p0(x0). This q is given by:\n\n$$q_0(x_0) = \\sum_{x_1, \\ldots, x_T} q_1(x_0|x_1) \\, \\mathcal{D}_X(x_T) \\prod_{t=2}^{T} q_t(x_{t-1}|x_t) \\approx p_0(x_0) \\tag{8}$$\n\nNext, we define q(\u03c4) as the distribution over trajectories \u03c4 \u225c {x0, ..., xT} generated by the reverse-time process determined by {q1(x0|x1), ..., qT(xT\u22121|xT), DX(xT)}. We define p(\u03c4) as the distribution over trajectories generated by the \u201cforward-time\u201d process in Eqn. 7. The training in [16] is equivalent to guided policy search using guide trajectories sampled from q, i.e. 
it uses the objective:\n\n$$\\underset{p,\\,q}{\\text{minimize}} \\; \\mathbb{E}_{\\tau \\sim q(\\tau)} \\left[ \\log \\frac{q_1(x_0|x_1)}{p_0(x_0)} + \\sum_{t=1}^{T-1} \\log \\frac{q_{t+1}(x_t|x_{t+1})}{p_t(x_t|x_{t-1})} + \\log \\frac{\\mathcal{D}_X(x_T)}{p_T(x_T|x_{T-1})} \\right] \\tag{9}$$\n\nwhich corresponds to minimizing KL(q || p). If the log-densities in Eqn. 9 are tractable, then this minimization can be done using basic Monte-Carlo. If, as in [16], the reverse-time process q is not trained, then Eqn. 9 simplifies to: $\\text{minimize}_p \\; \\mathbb{E}_{q(\\tau)} \\left[ -\\log p_0(x_0) - \\sum_{t=1}^{T} \\log p_t(x_t|x_{t-1}) \\right]$.\n\nThis trick for generating guide trajectories exhibiting a particular distribution over terminal states xT \u2013 i.e. running dynamics backwards in time starting from xT \u223c DX \u2013 may prove useful in settings other than those considered in [16]. E.g., the LapGAN model in [1] learns to approximately invert a fixed (and information destroying) reverse-time process. The supplementary material expands on the content of this subsection, including a derivation of Eqn. 9 as a bound on Ex\u223cDX[\u2212 log p(x)].\n\n\u00b9We could pull the \u2212 log p(x\u2217|z0, ..., zT) term from the KL and put it in the cost \u2113(\u03c4, x\u2217), but we prefer the \u201cpath-wise KL\u201d formulation for its elegance. We abuse notation using KL(\u03b4(x = x\u2217) || p(x)) \u225c \u2212 log p(x\u2217).\n\n\u00b2This also includes all generative models implemented and executed on an actual computer.\n\n2.4 Learning Generative Stochastic Processes with LSTMs\n\nThe authors of [3] introduced a model for sequentially-deep generative processes. 
We interpret their model as a primary policy p which generates trajectories \u03c4 \u225c {z0, ..., zT, x} with distribution:\n\n$$p(\\tau) \\triangleq p(x|s_\\theta(\\tau_{<x})) \\, p_0(z_0) \\prod_{t=1}^{T} p_t(z_t), \\quad \\text{with} \\quad \\tau_{<x} \\triangleq \\{z_0, \\ldots, z_T\\} \\tag{10}$$\n\nin which \u03c4<x indicates a latent trajectory and s\u03b8(\u03c4<x) indicates a state trajectory {s0, ..., sT} computed recursively from \u03c4<x using the update st \u2190 f\u03b8(st\u22121, zt) for t \u2265 1. The initial state s0 is given by a trainable constant. Each state st \u225c [ht; vt] represents the joint hidden/visible state ht/vt of an LSTM and f\u03b8(state, input) computes a standard LSTM update.\u00b3 The authors of [3] defined all pt(zt) as isotropic Gaussians and defined the output distribution p(x|s\u03b8(\u03c4<x)) as p(x|cT), where $c_T \\triangleq c_0 + \\sum_{t=1}^{T} \\omega_\\theta(v_t)$. Here, c0 is a trainable constant and \u03c9\u03b8(vt) is, e.g., an affine transform of vt. Intuitively, \u03c9\u03b8(vt) transforms vt into a refinement of the \u201cworking hypothesis\u201d ct\u22121, which gets updated to ct = ct\u22121 + \u03c9\u03b8(vt). p is governed by parameters \u03b8 which affect f\u03b8, \u03c9\u03b8, s0, and c0. The supplementary material provides pseudo-code and an illustration for this model.\n\nTo train p, the authors of [3] introduced a guide policy q with trajectory distribution:\n\n$$q(\\tau|x^*) \\triangleq q(x|s_\\phi(\\tau_{<x}), x^*) \\, q_0(z_0|x^*) \\prod_{t=1}^{T} q_t(z_t|\\tilde{s}_t, x^*), \\quad \\text{with} \\quad \\tau_{<x} \\triangleq \\{z_0, \\ldots, z_T\\} \\tag{11}$$\n\nin which s\u03c6(\u03c4<x) indicates a state trajectory {\u02dcs0, ..., \u02dcsT} computed recursively from \u03c4<x using the guide policy\u2019s state update \u02dcst \u2190 f\u03c6(\u02dcst\u22121, g\u03c6(s\u03b8(\u03c4<t), x\u2217)). 
In this update \u02dcst\u22121 is the previous guide state and g\u03c6(s\u03b8(\u03c4<t), x\u2217) is a deterministic function of x\u2217 and the partial (primary) state trajectory s\u03b8(\u03c4<t) \u225c {s0, ..., st\u22121}, which is computed recursively from \u03c4<t \u225c {z0, ..., zt\u22121} using the state update st \u2190 f\u03b8(st\u22121, zt). The output distribution q(x|s\u03c6(\u03c4<x), x\u2217) is defined as a Dirac-delta at x\u2217.\u2074 Each qt(zt|\u02dcst, x\u2217) is a diagonal Gaussian distribution with means and log-variances given by an affine function L\u03c6(\u02dcvt) of \u02dcvt. q0(z0) is defined as identical to p0(z0). q is governed by parameters \u03c6 which affect the state updates f\u03c6(\u02dcst\u22121, g\u03c6(s\u03b8(\u03c4<t), x\u2217)) and the step distributions qt(zt|\u02dcst, x\u2217). g\u03c6(s\u03b8(\u03c4<t), x\u2217) corresponds to the \u201cread\u201d operation of the encoder network in [3].\n\nUsing our definitions for p/q, the training objective in [3] is given by:\n\n$$\\underset{p,\\,q}{\\text{minimize}} \\; \\mathbb{E}_{x^* \\sim \\mathcal{D}_X} \\, \\mathbb{E}_{\\tau \\sim q(\\tau|x^*)} \\left[ \\sum_{t=1}^{T} \\log \\frac{q_t(z_t|\\tilde{s}_t, x^*)}{p_t(z_t)} - \\log p(x^*|s(\\tau_{<x})) \\right] \\tag{12}$$\n\nwhich can be written more succinctly as Ex\u2217\u223cDX KL(q(\u03c4|x\u2217) || p(\u03c4)). This objective upper-bounds Ex\u2217\u223cDX[\u2212 log p(x\u2217)], where $p(x) \\triangleq \\sum_{\\tau_{<x}} p(x|s_\\theta(\\tau_{<x})) \\, p(\\tau_{<x})$.\n\n2.5 Extending the LSTM-based Generative Model\n\nWe propose changing p in Eqn. 10 to: $p(\\tau) \\triangleq p(x|s_\\theta(\\tau_{<x})) \\, p_0(z_0) \\prod_{t=1}^{T} p_t(z_t|s_{t-1})$. 
We define pt(zt|st\u22121) as a diagonal Gaussian distribution with means and log-variances given by an affine function L\u03b8(vt\u22121) of vt\u22121 (remember that st \u225c [ht; vt]), and we define p0(z0) as an isotropic Gaussian. We set s0 using s0 \u2190 f\u03b8(z0), where f\u03b8 is a trainable function (e.g. 
a neural network). Intuitively, our changes make the model more like a typical policy by conditioning its \u201caction\u201d zt on its state st\u22121, and upgrade the model to an infinite mixture by placing a distribution over its initial state s0. We also consider using ct \u225c L\u03b8(ht), which transforms the hidden part of the LSTM state st directly into an observation. This makes ht a working memory in which to construct an observation. The supplementary material provides pseudo-code and an illustration for this model.\n\n\u00b3For those unfamiliar with LSTMs, a good introduction can be found in [2]. We use LSTMs including input gates, forget gates, output gates, and peephole connections for all tests presented in this paper.\n\n\u2074It may be useful to relax this assumption.\n\nWe train this model by optimizing the objective:\n\n$$\\underset{p,\\,q}{\\text{minimize}} \\; \\mathbb{E}_{x^* \\sim \\mathcal{D}_X} \\, \\mathbb{E}_{\\tau \\sim q(\\tau|x^*)} \\left[ \\log \\frac{q_0(z_0|x^*)}{p_0(z_0)} + \\sum_{t=1}^{T} \\log \\frac{q_t(z_t|\\tilde{s}_t, x^*)}{p_t(z_t|s_{t-1})} - \\log p(x^*|s(\\tau_{<x})) \\right] \\tag{13}$$\n\nwhere we now have to deal with pt(zt|st\u22121), p0(z0), and q0(z0|x\u2217), which could be treated as constants in the model from [3]. We define q0(z0|x\u2217) as a diagonal Gaussian distribution whose means and log-variances are given by a trainable function g\u03c6(x\u2217).\n\nWhen trained for the binarized MNIST benchmark used in [3], our extended model scored a negative log-likelihood of 85.5 on the test set.\u2075 For comparison, the score reported in [3] was 87.4.\u2076 After fine-tuning the variational distribution (i.e. q) on the test set, our model\u2019s score improved to 84.8, which is quite strong considering it is an upper bound. For comparison, see the best upper bound reported for this benchmark in [15], which was 85.1. 
When the model used the alternate cT \u225c L\u03b8(hT), the raw/fine-tuned test scores were 85.9/85.3. Fig. 1 shows samples from the model. Model/test code is available at http://github.com/Philip-Bachman/Sequential-Generation.\n\nFigure 1: The left block shows \u03c3(ct) for t \u2208 {1, 3, 5, 9, 16}, for a policy p with $c_t \\triangleq c_0 + \\sum_{t'=1}^{t} L_\\theta(v_{t'})$. The right block is analogous, for a model using ct \u225c L\u03b8(ht).\n\n3 Developing Models for Sequential Imputation\n\nThe goal of imputation is to estimate p(xu|xk), where x \u225c [xu; xk] indicates a complete observation with known values xk and missing values xu. We define a mask m \u2208 M as a (disjoint) partition of x into xu/xk. By expanding xu to include all of x, one recovers standard generative modelling. By shrinking xu to include a single element of x, one recovers standard classification/regression. Given distribution DM over m \u2208 M and distribution DX over x \u2208 X, the objective for imputation is:\n\n$$\\underset{p}{\\text{minimize}} \\; \\mathbb{E}_{x \\sim \\mathcal{D}_X} \\, \\mathbb{E}_{m \\sim \\mathcal{D}_M} \\left[ -\\log p(x^u|x^k) \\right] \\tag{14}$$\n\nWe now describe a finite-horizon MDP for which guided policy search minimizes a bound on the objective in Eqn. 14. The MDP is defined by mask distribution DM, complete observation distribution DX, and the state spaces {Z0, ..., ZT} associated with each of T steps. Together, DM and DX define a joint distribution over initial states and rewards in the MDP. For the trial determined by x \u223c DX and m \u223c DM, the initial state z0 \u223c p(z0|xk) is selected by the policy p based on the known values xk. The cost \u2113(\u03c4, xu, xk) suffered by trajectory \u03c4 \u225c {z0, ..., zT} in the context (x, m) is given by \u2212 log p(xu|\u03c4, xk), i.e. 
the negative log-likelihood of p guessing the missing values xu after following trajectory \u03c4, while seeing the known values xk.\n\nWe consider a policy p with trajectory distribution $p(\\tau|x^k) \\triangleq p(z_0|x^k) \\prod_{t=1}^{T} p(z_t|z_0, \\ldots, z_{t-1}, x^k)$, where xk is determined by x/m for the current trial and p can\u2019t observe the missing values xu. With these definitions, we can find an approximately optimal imputation policy by solving:\n\n$$\\underset{p}{\\text{minimize}} \\; \\mathbb{E}_{x \\sim \\mathcal{D}_X} \\, \\mathbb{E}_{m \\sim \\mathcal{D}_M} \\, \\mathbb{E}_{\\tau \\sim p(\\tau|x^k)} \\left[ -\\log p(x^u|\\tau, x^k) \\right] \\tag{15}$$\n\nI.e., we minimize the expected negative log-likelihood of making a correct imputation on any given trial. This is a valid, but loose, upper bound on the imputation objective in Eqn. 14 (from Jensen\u2019s inequality). We can tighten the bound by introducing a guide policy (i.e. a variational distribution).\n\nAs with the unconditional generative models in Sec. 2, we train p to imitate a guide policy q shaped by additional information (here it\u2019s xu). This q generates trajectories with distribution $q(\\tau|x^u, x^k) \\triangleq q(z_0|x^u, x^k) \\prod_{t=1}^{T} q(z_t|z_0, \\ldots, z_{t-1}, x^u, x^k)$. 
Given this p and q, guided policy search solves:\n\n$$\\underset{p,\\,q}{\\text{minimize}} \\; \\mathbb{E}_{x \\sim \\mathcal{D}_X} \\, \\mathbb{E}_{m \\sim \\mathcal{D}_M} \\left[ \\mathbb{E}_{\\tau \\sim q(\\tau|i_q,i_p)} \\left[ -\\log q(x^u|\\tau, i_q, i_p) \\right] + \\text{KL}\\left( q(\\tau|i_q,i_p) \\,||\\, p(\\tau|i_p) \\right) \\right] \\tag{16}$$\n\nwhere we define iq \u225c xu, ip \u225c xk, and q(xu|\u03c4, iq, ip) \u225c p(xu|\u03c4, ip).\n\n\u2075Data splits from: http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist\n\n\u2076The model in [3] significantly improves its score to 80.97 when using an image-specific architecture.\n\n3.1 A Direct Representation for Sequential Imputation Policies\n\nWe define an imputation trajectory as c\u03c4 \u225c {c0, ..., cT}, where each partial imputation ct \u2208 X is computed from a partial step trajectory \u03c4<t \u225c {z1, ..., zt}. A partial imputation ct\u22121 encodes the policy\u2019s guess for the missing values xu immediately prior to selecting step zt, and cT gives the policy\u2019s final guess. At each step of iterative refinement, the policy selects a zt based on ct\u22121 and the known values xk, and then updates its guesses to ct based on ct\u22121 and zt. By iteratively refining its guesses based on feedback from earlier guesses and the known values, the policy can construct complexly structured distributions over its final guess cT after just a few steps. This happens naturally, without any post-hoc MRFs/CRFs (as in many approaches to structured prediction), and without sampling values in cT one at a time (as required by existing NADE-type models [7]). This property of our approach should prove useful for many tasks.\n\nWe consider two ways of updating the guesses in ct, mirroring those described in Sec. 2. The first way sets ct \u2190 ct\u22121 + \u03c9\u03b8(zt), where \u03c9\u03b8(zt) is a trainable function. 
We set $c_0 \\triangleq [c^u_0; c^k_0]$ using a trainable bias. The second way sets ct \u2190 \u03c9\u03b8(zt). 
We indicate models using the first type of update with the suffix -add, and models using the second type of update with -jump. Our primary policy p\u03b8 selects zt at each step 1 \u2264 t \u2264 T using p\u03b8(zt|ct\u22121, xk), which we restrict to be a diagonal Gaussian. This is a simple, stationary policy. Together, the step selector p\u03b8(zt|ct\u22121, xk) and the imputation constructor \u03c9\u03b8(zt) fully determine the behaviour of the primary policy. The supplementary material provides pseudo-code and an illustration for this model.\n\nWe construct a guide policy q similarly to p. The guide policy shares the imputation constructor \u03c9\u03b8(zt) with the primary policy. The guide policy incorporates additional information x \u225c [xu; xk], i.e. the complete observation for which the primary policy must reconstruct some missing values. The guide policy chooses steps using q\u03c6(zt|ct\u22121, x), which we restrict to be a diagonal Gaussian.\n\nWe train the primary/guide policy components \u03c9\u03b8, p\u03b8, and q\u03c6 simultaneously on the objective:\n\n$$\\underset{\\theta,\\,\\phi}{\\text{minimize}} \\; \\mathbb{E}_{x \\sim \\mathcal{D}_X} \\, \\mathbb{E}_{m \\sim \\mathcal{D}_M} \\left[ \\mathbb{E}_{\\tau \\sim q_\\phi(\\tau|x^u,x^k)} \\left[ -\\log q(x^u|c^u_T) \\right] + \\text{KL}\\left( q(\\tau|x^u,x^k) \\,||\\, p(\\tau|x^k) \\right) \\right] \\tag{17}$$\n\nwhere $q(x^u|c^u_T) \\triangleq p(x^u|c^u_T)$. We train our models using Monte-Carlo roll-outs of q, and stochastic backpropagation as in [6, 14]. Full implementations and test code are available from http://github.com/Philip-Bachman/Sequential-Generation.\n\n3.2 Representing Sequential Imputation Policies using LSTMs\n\nTo make it useful for imputation, which requires conditioning on the exogenous information xk, we modify the LSTM-based model from Sec. 
2.5 to include a \u201cread\u201d operation in its primary policy p. We incorporate a read operation by spreading p over two LSTMs, pr and pw, which respectively \u201cread\u201d and \u201cwrite\u201d an imputation trajectory c\u03c4 \u225c {c0, ..., cT}. Conveniently, the guide policy q for this model takes the same form as the primary policy\u2019s reader pr. This model also includes an \u201cinfinite mixture\u201d initialization step, as used in Sec. 2.5, but modified to incorporate conditioning on x and m. The supplementary material provides pseudo-code and an illustration for this model.\n\nFollowing the infinite mixture initialization step, a single full step of execution for p involves several substeps: first p updates the reader state using $s^r_t \\leftarrow f^r_\\theta(s^r_{t-1}, \\omega^r_\\theta(c_{t-1}, s^w_{t-1}, x^k))$, then p selects a step $z_t \\sim p_\\theta(z_t|v^r_t)$, then p updates the writer state using $s^w_t \\leftarrow f^w_\\theta(s^w_{t-1}, z_t)$, and finally p updates its guesses by setting $c_t \\leftarrow c_{t-1} + \\omega^w_\\theta(v^w_t)$ (or $c_t \\leftarrow \\omega^w_\\theta(h^w_t)$). In these updates, $s^{r,w}_t \\triangleq [h^{r,w}_t; v^{r,w}_t]$ refer to the states of the (r)reader and (w)writer LSTMs. The LSTM updates $f^{r,w}_\\theta$ and the read/write operations $\\omega^{r,w}_\\theta$ are governed by the policy parameters \u03b8.\n\nWe train p to imitate trajectories sampled from a guide policy q. The guide policy shares the primary policy\u2019s writer updates $f^w_\\theta$ and write operation $\\omega^w_\\theta$, but has its own reader updates $f^q_\\phi$ and read operation $\\omega^q_\\phi$. 
At each step, the guide policy: updates the guide state $s^q_t \\leftarrow f^q_\\phi(s^q_{t-1}, \\omega^q_\\phi(c_{t-1}, s^w_{t-1}, x))$, then selects $z_t \\sim q_\\phi(z_t|v^q_t)$, then updates the writer state $s^w_t \\leftarrow f^w_\\theta(s^w_{t-1}, z_t)$, and finally updates its guesses $c_t \\leftarrow c_{t-1} + \\omega^w_\\theta(v^w_t)$ (or $c_t \\leftarrow \\omega^w_\\theta(h^w_t)$). As in Sec. 3.1, the guide policy\u2019s read operation $\\omega^q_\\phi$ gets to see the complete observation x, while the primary policy only gets to see the known values xk. We restrict the step distributions p\u03b8/q\u03c6 to be diagonal Gaussians whose means and log-variances are affine functions of $v^r_t$/$v^q_t$. The training objective has the same form as Eqn. 17.\n\nFigure 2: (a) Comparing the performance of our imputation models against several baselines, using MNIST digits. The x-axis indicates the % of pixels which were dropped completely at random, and the scores are normalized by the number of imputed pixels. (b) A closer view of results from (a), just for our models. (c) The effect of increased iterative refinement steps for our GPSI models.\n\n4 Experiments\n\nWe tested the performance of our sequential imputation models on three datasets: MNIST (28x28), SVHN (cropped, 32x32) [13], and TFD (48x48) [17]. We converted images to grayscale and shifted/scaled them to be in the range [0, 1] prior to training/testing. We measured the imputation log-likelihood $\\log q(x^u|c^u_T)$ using the true missing values xu and the models\u2019 guesses given by $\\sigma(c^u_T)$. We report negative log-likelihoods, so lower scores are better in all of our tests. We refer to variants of the model from Sec. 3.1 as GPSI-add and GPSI-jump, and to variants of the model from Sec. 3.2 as LSTM-add and LSTM-jump. 
Except where noted, the GPSI models used 6 refinement steps and the LSTM models used 16.\u2077\n\nWe tested imputation under two types of data masking: missing completely at random (MCAR) and missing at random (MAR). In MCAR, we masked pixels uniformly at random from the source images, and indicate removal of d% of the pixels by MCAR-d. 
In MAR, we masked square regions,\nwith the occlusions located uniformly at random within the borders of the source image. We indicate\nocclusion of a d \u00d7 d square by MAR-d.\nOn MNIST, we tested MCAR-d for d \u2208 {50, 60, 70, 80, 90}. MCAR-100 corresponds to uncon-\nditional generation. On TFD and SVHN we tested MCAR-80. On MNIST, we tested MAR-d for\nd \u2208 {14, 16}. On TFD we tested MAR-25 and on SVHN we tested MAR-17. For test trials we\nsampled masks from the same distribution used in training, and we sampled complete observations\nfrom a held-out test set. Fig. 2 and Tab. 1 present quantitative results from these tests. Fig. 2(c)\nshows the behavior of our GPSI models when we allowed them fewer/more re\ufb01nement steps.\n\nMNIST\n\nTFD\n\nSVHN\n\nMAR-14 MAR-16 MCAR-80 MAR-25 MCAR-80 MAR-17\n568\n\u2013\n569\n572\n624\n\n1381\n\u2013\n1390\n1394\n1416\n\n1377\n\u2013\n1380\n1384\n1399\n\n170\n172\n177\n183\n374\n\n167\n169\n175\n177\n394\n\n525\n\u2013\n531\n540\n567\n\nLSTM-add\nLSTM-jump\nGPSI-add\nGPSI-jump\nVAE-imp\n\nTable 1: Imputation performance in various settings. Details of the tests are provided in the main\ntext. Lower scores are better. Due to time constraints, we did not test LSTM-jump on TFD or\nSVHN. These scores are normalized for the number of imputed pixels.\n\nWe tested our models against three baselines. The baselines were \u201cvariational auto-encoder impu-\ntation\u201d, honest template matching, and oracular template matching. VAE imputation ran multiple\nsteps of VAE reconstruction, with the known values held \ufb01xed and the missing values re-estimated\nwith each reconstruction step.8 After 16 re\ufb01nement steps, we scored the VAE based on its best\n\n7GPSI stands for \u201cGuided Policy Search Imputer\u201d. 
The tag \u201c-add\u201d refers to additive guess updates, and \u201c-jump\u201d refers to updates that fully replace the guesses.\n\n8 We discuss some deficiencies of VAE imputation in the supplementary material.\n\n[Figure 2 plots. (a), (b): \u201cImputation NLL vs. Available Information\u201d, x-axis: Mask Probability. (c): \u201cThe Effect of Increased Refinement Steps\u201d, x-axis: Refinement Steps. Series in (a): TM-orc, TM-hon, VAE-imp, GPSI-add, GPSI-jump, LSTM-add, LSTM-jump; in (b): GPSI-add, GPSI-jump, LSTM-add, LSTM-jump; in (c): GPSI-add, GPSI-jump.]\n\nFigure 3: This figure illustrates the policies learned by our models. (a): models trained for (MNIST,\nMAR-16). From top to bottom the models are: GPSI-add, GPSI-jump, LSTM-add, LSTM-jump.\n(b): models trained for (TFD, MAR-25), with models in the same order as (a), but without LSTM-jump.\n(c): models trained for (SVHN, MAR-17), with models arranged as for (b).\n\nguesses. Honest template matching guessed the missing values based on the training image which\nbest matched the test image\u2019s known values. Oracular template matching was like honest template\nmatching, but matched directly on the missing values.\nOur models significantly outperformed the baselines. In general, the LSTM-based models outperformed\nthe more direct GPSI models. We evaluated the log-likelihood of imputations produced by\nour models using the lower bounds provided by the variational objectives with respect to which they\nwere trained. Evaluating the template-based imputations was straightforward. For VAE imputation,\nwe used the expected log-likelihood of the imputations sampled from multiple runs of the 16-step\nimputation process. This provides a valid, but loose, lower bound on their log-likelihood.\nAs shown in Fig. 3, the imputations produced by our models appear promising. 
The imputations are\ngenerally of high quality, and the models are capable of capturing strongly multi-modal reconstruction\ndistributions (see subfigure (a)). The behavior of GPSI models changed intriguingly when we\nswapped the imputation constructor. Using the -jump imputation constructor, the imputation policy\nlearned by the direct model was rather inscrutable. Fig. 2(c) shows that additive guess updates\nextracted more value from using more refinement steps. When trained on the binarized MNIST\nbenchmark discussed in Sec. 2.5, i.e. with binarized images and subject to MCAR-100, the LSTM-add\nmodel produced raw/fine-tuned scores of 86.2/85.7. The LSTM-jump model scored 87.1/86.3.\nAnecdotally, on this task, these \u201cclosed-loop\u201d models seemed more prone to overfitting than the\n\u201copen-loop\u201d models in Sec. 2.5. The supplementary material provides further qualitative results.\n\n5 Discussion\n\nWe presented a point of view which links methods for training directed generative models with\npolicy search in reinforcement learning. We showed how our perspective can guide improvements\nto existing models. The importance of these connections will only grow as generative models rapidly\nincrease in structural complexity and effective decision depth.\nWe introduced the notion of imputation as a natural generalization of standard, unconditional generative\nmodelling. Depending on the relation between the data-to-generate and the available information,\nimputation spans from full unconditional generative modelling to classification/regression. We\nshowed how to successfully train sequential imputation policies comprising millions of parameters\nusing an approach based on guided policy search [9]. Our approach outperforms the baselines quantitatively\nand appears qualitatively promising. 
Incorporating, e.g., the local read/write mechanisms\nfrom [3] should provide further improvements.\n\nReferences\n\n[1] Emily L Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative models using a Laplacian pyramid of adversarial networks. arXiv:1506.05751 [cs.CV], 2015.\n\n[2] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE], 2013.\n\n[3] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning (ICML), 2015.\n\n[4] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning (ICML), 2014.\n\n[5] Diederik P Kingma, Danilo J Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[6] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.\n\n[7] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In International Conference on Machine Learning (ICML), 2011.\n\n[8] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), 2014.\n\n[9] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning (ICML), 2013.\n\n[10] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.\n\n[11] Sergey Levine and Vladlen Koltun. Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning (ICML), 2014.\n\n[12] Andriy Mnih and Karol Gregor. 
Neural variational inference and learning in belief networks. In International Conference on Machine Learning (ICML), 2014.\n\n[13] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.\n\n[14] Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), 2014.\n\n[15] Danilo J Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), 2015.\n\n[16] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.\n\n[17] Joshua Susskind, Adam Anderson, and Geoffrey E Hinton. The Toronto face database. 2010.\n", "award": [], "sourceid": 1811, "authors": [{"given_name": "Philip", "family_name": "Bachman", "institution": "McGill University"}, {"given_name": "Doina", "family_name": "Precup", "institution": "University of McGill"}]}