{"title": "Learning Plannable Representations with Causal InfoGAN", "book": "Advances in Neural Information Processing Systems", "page_first": 8733, "page_last": 8744, "abstract": "In recent years, deep generative models have been shown to 'imagine' convincing high-dimensional observations such as images, audio, and even video, learning directly from raw data. In this work, we ask how to imagine goal-directed visual plans -- a plausible sequence of observations that transition a dynamical system from its current configuration to a desired goal state, which can later be used as a reference trajectory for control. We focus on systems with high-dimensional observations, such as images, and propose an approach that naturally combines representation learning and planning. Our framework learns a generative model of sequential observations, where the generative process is induced by a transition in a low-dimensional planning model and additional noise. By maximizing the mutual information between the generated observations and the transition in the planning model, we obtain a low-dimensional representation that best explains the causal nature of the data. We structure the planning model to be compatible with efficient planning algorithms, and we propose several such models based on either discrete or continuous states. Finally, to generate a visual plan, we project the current and goal observations onto their respective states in the planning model, plan a trajectory, and then use the generative model to transform the trajectory to a sequence of observations.
We demonstrate our method on imagining plausible visual plans of rope manipulation.", "full_text": "Learning Plannable Representations with Causal InfoGAN

Thanard Kurutach∗1, Aviv Tamar∗1, Ge Yang2, Stuart Russell1, Pieter Abbeel1

Abstract

In recent years, deep generative models have been shown to 'imagine' convincing high-dimensional observations such as images, audio, and even video, learning directly from raw data. In this work, we ask how to imagine goal-directed visual plans – a plausible sequence of observations that transition a dynamical system from its current configuration to a desired goal state, which can later be used as a reference trajectory for control. We focus on systems with high-dimensional observations, such as images, and propose an approach that naturally combines representation learning and planning. Our framework learns a generative model of sequential observations, where the generative process is induced by a transition in a low-dimensional planning model and additional noise. By maximizing the mutual information between the generated observations and the transition in the planning model, we obtain a low-dimensional representation that best explains the causal nature of the data. We structure the planning model to be compatible with efficient planning algorithms, and we propose several such models based on either discrete or continuous states. Finally, to generate a visual plan, we project the current and goal observations onto their respective states in the planning model, plan a trajectory, and then use the generative model to transform the trajectory to a sequence of observations. We demonstrate our method on imagining plausible visual plans of rope manipulation.³

1 Introduction

For future robots to perform general tasks in unstructured environments such as homes or hospitals, they must be able to reason about their domain and plan their actions accordingly.
In the AI literature, this general problem has been investigated under two main paradigms – automated planning and scheduling [34] (henceforth, AI planning) and reinforcement learning [41] (RL).
Classical work in AI planning has drawn on the remarkable capability of humans to perform long-term reasoning and planning by using abstract representations of the world. For example, humans might think of "cup on table" as a state, rather than detailed coordinates or a precise image of such a scene. Interestingly, powerful classical planners exist that can reason very effectively with these kinds of representations, as demonstrated by results in the International Planning Competition [44]. However, such logical representations of the world can be difficult to specify correctly. As an example, consider designing a logical representation for the state of a deformable object such as a rope. Moreover, logical representations that are not grounded a priori in real-world observation require a perception module that can identify, for example, exactly when the cup is considered "on the table".
In RL, on the other hand, a task is solved directly through trial and error, guided by a manually provided reward signal. Recent advances in model-free RL (e.g., [28, 23]) have shown remarkable success in learning policies that act directly on high-dimensional observations, such as raw images. Designing a reward function that depends on such observations can be challenging, however, and most recent studies either relied on domains where the reward can be instrumented [28, 23, 33], or required successful demonstrations as guidance [13, 39].

1Berkeley AI Research, University of California, Berkeley
2Department of Physics, University of Chicago
³Code is available online at http://github.com/thanard/causal-infogan.

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.
Moreover, since RL is guided by the reward to solve a particular task, it does not automatically generalize to different tasks [43, 18].
In principle, model-based RL can solve the generalization problem by learning a dynamics model and planning with that model. However, applying model-based RL to domains with high-dimensional observations has been challenging [46, 14, 12]. Deep learning approaches to learning dynamics models (e.g., action-conditional video prediction models [30, 1, 12]) tend to get bogged down in pixel-level detail, tend to be computationally expensive, and are far from accurate over longer time scales. Moreover, the representations learned using such approaches are typically unstructured, high-dimensional continuous vectors, which cannot be used in efficient planning algorithms.
In this work, we aim to combine the merits of deep learning dynamics models and classical AI planning, and propose a framework for long-term reasoning and planning that is grounded in real-world perception. We present Causal InfoGAN (CIGAN), a method for learning plannable representations of dynamical systems with high-dimensional observations such as images. By plannable, we mean representations that are structured in such a way that makes them amenable to efficient search through AI planning tools. In particular, we focus on discrete and deterministic dynamics models, which can be used with graph search methods, and on continuous models where planning is done by linear interpolation, though our framework can be generalized to other model types.
In our framework, a generative adversarial net (GAN; [15]) is trained to generate sequential observation pairs from the dynamical system. The GAN generator takes as input both unstructured random noise and a structured pair of consecutive states from a low-dimensional, parametrized dynamical system termed the planning model.
The planning model is meant to capture the features that are most essential for representing the causal properties in the data, and are therefore important for planning future outcomes. To learn such a model, we follow the InfoGAN idea [5], and add to the GAN training loss a term that maximizes the mutual information between the observation pairs and the transitions that induced them.
The CIGAN model can be trained using random exploration data from the system. After learning, given an observation of an initial configuration and a goal configuration, it can generate a "walkthrough" sequence of feasible observations that lead from the initial state to the goal. This walkthrough can later be used as a reference signal for a controller to execute the task in the real system. We demonstrate convincing walkthrough generation on synthetic tasks and on real image data, collected by Nair et al. [29], of a robot randomly poking a rope.

2 Preliminaries and Problem Formulation

Let H denote the entropy of a random variable, and I denote the mutual information between two random variables [8].
GAN and InfoGAN: Deep generative models aim to generate samples from the real distribution of observations, P_data. In this work we build on the GAN framework [15], which is composed of a generator, G(z) = o, mapping a noise input z ∼ P_noise(z) to an observation o, and a discriminator, D(o), mapping an observation to the probability that it was sampled from the real data. The GAN training optimizes a game between the generator and discriminator,

min_G max_D V(G, D) = min_G max_D E_{o∼P_data}[log D(o)] + E_{z∼P_noise}[log(1 − D(G(z)))].

One can view the noise vector z in the GAN as a representation of the observation o. In GAN training, however, there is no incentive for this representation to display any structure at all, making it difficult to interpret or to use in a downstream task.
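As a toy illustration of this minimax game, the value V(G, D) can be estimated by Monte Carlo with scalar stand-ins for the generator and discriminator (all functions and constants below are illustrative, not the paper's architecture):

```python
import math
import random

def discriminator(o, w=2.0, b=0.0):
    # D(o): probability that observation o came from the real data
    return 1.0 / (1.0 + math.exp(-(w * o + b)))

def generator(z, shift=0.5):
    # G(z): map a noise sample to a fake 1-D "observation"
    return z + shift

def gan_value(real_samples, noise_samples):
    # Monte-Carlo estimate of
    # V(G, D) = E_{o~P_data}[log D(o)] + E_{z~P_noise}[log(1 - D(G(z)))]
    real_term = sum(math.log(discriminator(o)) for o in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - discriminator(generator(z))) for z in noise_samples) / len(noise_samples)
    return real_term + fake_term

rng = random.Random(0)
real = [rng.gauss(1.0, 0.1) for _ in range(1000)]
noise = [rng.gauss(0.0, 0.1) for _ in range(1000)]
v = gan_value(real, noise)
```

In practice both G and D are neural networks trained by alternating gradient steps on this value; the scalar version only makes the two expectations concrete.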
The InfoGAN method [5] aims to mitigate this issue. The idea in InfoGAN is to add to the generator input an additional 'state'⁴ component s ∼ P(s), and add to the GAN objective a loss that induces maximal mutual information between the generated observation and the state. The InfoGAN objective is given by:

min_G max_D V(G, D) − λ I(s; G(z, s)),   (1)

where λ > 0 is a weight parameter, and V(G, D) is the GAN loss above. Intuitively, this objective induces the state to capture the most salient properties of the observation. Optimizing the objective in (1) directly is difficult without access to the posterior distribution P(s|o), and a variational lower bound was proposed in [5]. Define an auxiliary distribution Q(s|o) to approximate the posterior P(s|o). Then, I(s; G(z, s)) ≥ E_{s∼P(s), o∼G(z,s)}[log Q(s|o)] + H(s). Using this bound, the InfoGAN objective (1) can be optimized using stochastic gradient descent.

⁴In [5], s is referred to as a code. Here we term it a state, to correspond with our subsequent development of structured GAN input from a dynamical system.

Problem Formulation: We consider a fully observable and deterministic dynamical system, o_{t+1} = f(o_t, u_t), where o_t and u_t denote the observation and action at time t, respectively. The function f is assumed to be unknown. We are provided with data D in the form of N trajectories of observations {o_1^i, u_1^i, ..., o_{T_i}^i}, i ∈ 1, ..., N, generated from f, where the actions are generated by an arbitrary exploration policy.⁵
We say that two observations o, o′ are h-reachable if there exists a sequence of actions that takes the system from o to o′ within h steps or less.
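As a sketch, h-reachability can be checked by breadth-first search over a graph of observed one-step transitions; building such a graph directly from D is our illustrative assumption, not a procedure prescribed by the paper:

```python
from collections import deque

def h_reachable(transitions, o, o_goal, h):
    """Check whether o_goal is reachable from o within h steps,
    given observed one-step transition pairs (o_t, o_{t+1})."""
    adj = {}
    for a, b in transitions:
        adj.setdefault(a, set()).add(b)
    frontier = deque([(o, 0)])   # (observation, depth) pairs
    seen = {o}
    while frontier:
        node, depth = frontier.popleft()
        if node == o_goal:
            return True
        if depth == h:
            continue  # do not expand beyond the horizon
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return False

# Toy transition data: a simple chain A -> B -> C -> D
trans = [("A", "B"), ("B", "C"), ("C", "D")]
```

For example, with the chain above, "C" is h-reachable from "A" for h = 2, while "D" is not.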
We consider the problem of generating a walkthrough – a sequence of reachable observations along a feasible path between the start and the goal:

Problem 1 (Walkthrough Planning): Given D, h, and two observations o_start, o_goal, generate a sequence of observations o_start, ..., o_goal such that every two consecutive observations in the sequence are h-reachable. If such a sequence does not exist, return ∅.

The motivation for solving Problem 1 is that it breaks the long-horizon planning problem (from o_start to o_goal) into a sequence of short h-horizon planning problems, which can later be solved effectively using other methods such as inverse dynamics or model-free RL [29]. This concept of temporal abstraction has been fundamental in AI planning (e.g., [11, 42]). Since we are searching for a sequence of waypoint observations, the actions are not relevant for our problem, and in the sequel we omit them from the discussion.

3 Causal InfoGAN

A natural approach for solving the walkthrough planning problem in Section 2 is to learn some model of the dynamics f from the data, and search for a plan within that model. This leads to a trade-off. On the one hand, we want to be expressive, and learn all the transitions possible from every o within a horizon h. When o is a high-dimensional image observation, this typically requires mapping the image to an extensive feature space [30, 12]. On the other hand, we want to plan efficiently, which generally requires either low-dimensional state spaces or well-structured representations. We approach this challenge by proposing Causal InfoGAN – an expressive generative model with a structured representation that is compatible with planning algorithms.
In this section we present the Causal InfoGAN generative model, and in Section 4 we explain how to use the model for planning.
Let o and o′ denote a pair of sequential observations from the dynamical system f, and let P_data(o, o′) denote their probability, as displayed in the data D. We posit that a generative model that can accurately learn P_data(o, o′) has to capture the features that are important for representing the causality in the data – which next observations o′ are reachable from the current observation o. Naturally, such features would be useful later for planning.
We build on the GAN framework [15]. Applied to our setting, a vanilla GAN would be composed of a generator, o, o′ = G(z), mapping a noise input z ∼ P_noise(z) to an observation pair, and a discriminator, D(o, o′), mapping an observation pair to the probability that it was sampled from the real data D and not from the generator. One can view the noise vector z in such a GAN as a feature vector, containing some representation of the transition from o to o′. The problem, however, is that the structure of this representation is not necessarily easy to decode and use for planning. Therefore, we propose to design a generator with a structured input that can later be used for planning. In particular, we propose a GAN generator that is driven by states sampled from a parametrized dynamical system.
Let M denote a dynamical system with state space S, which we term the set of abstract states, and a parametrized, stochastic transition function T_M(s′|s), where s, s′ ∈ S are a pair of consecutive abstract states. We denote by P_M(s) the prior probability of an abstract state s. We emphasize that the abstract state space S can be different from the space of real observations o.
For reasons that will become clear later on, we term M the latent planning system.
We propose to structure the generator as taking in a pair of consecutive abstract states s, s′ in addition to the noise vector z. The GAN objective in this case is therefore (cf. Section 2):

V(G, D) = E_{o,o′∼P_data}[log D(o, o′)] + E_{z∼P_noise, s∼P_M(s), s′∼T_M(s′|s)}[log(1 − D(G(z, s, s′)))].   (2)

The idea is that s and s′ would represent the abstract features that are important for understanding the causality in the data, while z would model variations that are less informative, such as pixel-level details. To learn such representations, we follow InfoGAN [5], and add to the GAN objective a term that maximizes mutual information between the generated pair of observations and the abstract states. We propose the Causal InfoGAN objective:

min_{M,G} max_D V(G, D) − λ I(s, s′; o, o′),   s.t. o, o′ ∼ G(z, s, s′); s ∼ P_M; s′ ∼ T_M(s′|s),   (3)

where λ > 0 is a weight parameter, and V(G, D) is given in (2). Intuitively, this objective induces the abstract model to capture the most salient possible changes that can be effected on the observation.
Optimizing the objective in (3) directly is difficult, since we do not have access to the posterior distribution P(s, s′|o, o′) when using an expressive generator function. Following InfoGAN [5], we optimize a variational lower bound of (3). Define an auxiliary distribution Q(s, s′|o, o′) to approximate the posterior P(s, s′|o, o′). We have, following a similar derivation to [5]:

I((s, s′); G(z, s, s′)) ≥ E_{s∼P_M, s′∼T_M(s′|s), o,o′∼G(z,s,s′)}[log Q(s, s′|o, o′)] + H(s, s′) =: I_VLB(G, Q).   (4)

To encourage the same mapping between s, o and s′, o′, we propose the disentangled posterior approximation, Q(s, s′|o, o′) = Q(s|o)Q(s′|o′) (see Appendix B).
We plug the lower bound (4) into (3) to obtain the following loss function:

min_{G,Q,M} max_D V(G, D) − λ I_VLB(G, Q),   (5)

where λ > 0 is a constant. The loss in (5) can be optimized effectively using stochastic gradient descent, and we provide a detailed algorithm in Appendix C.

⁵In this work, we do not concern ourselves with the problem of how to best generate the exploration data.

Figure 1: The Causal InfoGAN framework. (a) Generative model (cf. Section 3). First, an abstract state s is sampled from a prior P_M(s). Given s, the next state s′ is sampled using the transition model T_M(s′|s). The states s, s′ are fed, together with a random noise sample z, into the generator, which outputs o, o′. The discriminator D maps an observation pair to the probability of the pair being real. Finally, the approximate posterior Q maps each observation to the distribution of the state it is associated with. The Causal InfoGAN loss function in Equation (5) encourages Q to predict each state accurately from each observation. (b) Planning paradigm (cf. Section 4). Given start and goal observations, we first map them to abstract states, and then we apply planning algorithms using the model M to search for a path from s_start to s_goal. Finally, from the plan in abstract states, we generate back a sequence of observations.

4 Planning with Causal InfoGAN models

In this section, we discuss how to use the Causal InfoGAN model for planning goal-directed trajectories.
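Before turning to planning, the variational bound I_VLB of Eq. (4) can be checked numerically in a toy discrete case where the generator deterministically reveals the state pair and Q factors as the disentangled posterior Q(s|o)Q(s′|o′); all numbers below are illustrative:

```python
import math

# Toy latent planning system: 2 abstract states, uniform prior,
# mostly self-looping transitions.
prior = {0: 0.5, 1: 0.5}
trans = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}   # T_M(s'|s)

def joint():
    # Joint distribution over abstract state pairs (s, s')
    for s, ps in prior.items():
        for sp, pt in trans[s].items():
            yield s, sp, ps * pt

def entropy_joint():
    # H(s, s')
    return -sum(p * math.log(p) for _, _, p in joint())

def i_vlb(q):
    # I_VLB = E_{s,s'}[log Q(s|o) + log Q(s'|o')] + H(s, s'),
    # where o deterministically reveals s (so Q is indexed by the state).
    return sum(p * (math.log(q[s][s]) + math.log(q[sp][sp]))
               for s, sp, p in joint()) + entropy_joint()

perfect = {0: {0: 1.0 - 1e-12, 1: 1e-12}, 1: {0: 1e-12, 1: 1.0 - 1e-12}}
uniform = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}
```

With a perfect posterior the bound is tight at H(s, s′), which here equals the true mutual information since generation is deterministic; a weaker Q lowers the bound.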
We first present our general methodology, and then propose several model configurations for which (5) can be optimized efficiently and the latent planning system is compatible with efficient planning algorithms. We then describe how to combine these ideas for solving the walkthrough planning problem in various domains.

4.1 General Planning Paradigm

Our general paradigm for goal-directed planning is described in Figure 1b. We start by training a Causal InfoGAN model from the data, as described in the previous section. Then, we perform the following 3 steps, which are detailed in the rest of this section:

1. Given a pair of observations o_start, o_goal, we first encode them into a pair of corresponding states s_start, s_goal. This is described in Section 4.2.
2. Using the transition probabilities in the planning model M, we plan a feasible state trajectory from s_start to s_goal: s_start, s_1, ..., s_m, s_goal. This is described in Section 4.3.
3. Finally, we decode the state trajectory into a corresponding trajectory of observations o_start, o_1, ..., o_m, o_goal. This is described in Section 4.4.

Type               | Values s  | Prior P_M(s)  | Transition T_M(s′|s) | Planning algorithms
Discrete – one-hot | [N]       | U{1, ..., N}  | s′ ∼ Softmax(s⊤θ)    | Dijkstra
Discrete – binary  | {0, 1}^N  | U{0, 1}^N     | see Eq. (6)          | Dijkstra
Continuous         | R^N       | U(−1, 1)^N    | s′ ∼ N(s, Σ_θ(s))    | Linear interpolation

Table 1: Various latent planning systems. In all cases, N is the state dimension. The parameters θ of the transition T_M depend on the state type. In the one-hot case, θ is a matrix in R^{N×N}. In the binary case, θ denotes the parameters of a stochastic neural network; see Eq. (6). In the continuous case, θ represents the parameters of a neural network that controls the variance of the transition.

The specific method for each step in the planning paradigm can depend on the problem at hand. For example, some systems are naturally described by discrete abstract states, while others are better described by continuous states. In the following, we describe several models and methods that worked well for us, under the general planning paradigm described above. This list is by no means exhaustive. On the contrary, we believe that the Causal InfoGAN framework provides a basis for further investigation of deep generative models that are compatible with planning.

4.2 Encoding an Observation to a State

For mapping an observation to a state, we can simply use the disentangled posterior Q(s|o). We found this approach to work well in low-dimensional observation spaces. However, for high-dimensional image observations we found that the learned Q(s|o) was accurate in classifying generated observations (by the generator), but inaccurate for classifying real observations. This is explained by the fact that in Causal InfoGAN, Q is only trained on generated observations, and can overfit to generated images. For images, we therefore opted for a different approach. Following [45], we performed a search over the latent space to find the best latent state mapping s*(o): s*(o) = argmin_s min_{s′,z} ‖o − G(s, s′, z)‖². Another approach, which could scale to complex image observations, is to add to the GAN training an explicit encoding network [10, 49].
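A minimal sketch of this latent search, with a toy 1-D generator standing in for the learned G and random search over z (the candidate state grid and sample counts are illustrative choices):

```python
import random

def toy_generator(s, sp, z):
    # Illustrative stand-in for G(s, s', z): the observation pair
    # tracks the state pair plus a small noise contribution.
    return [s + 0.1 * z, sp + 0.1 * z]

def encode(o, states, n_z=50, seed=0):
    # s*(o) = argmin_s min_{s', z} ||o - G(s, s', z)||^2, with the inner
    # minimization over z approximated by random search.
    rng = random.Random(seed)
    zs = [rng.uniform(-1.0, 1.0) for _ in range(n_z)]
    best_s, best_err = None, float("inf")
    for s in states:
        for sp in states:
            for z in zs:
                err = (o - toy_generator(s, sp, z)[0]) ** 2
                if err < best_err:
                    best_s, best_err = s, err
    return best_s
```

In practice the minimization over z (and over continuous s) would be done by gradient descent rather than exhaustive search; the structure of the objective is the same.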
In our experiments, the simple search approach worked well and we did not require additional modifications to the GAN training.

4.3 Latent Planning Systems

We now present several latent planning systems that are compatible with efficient planning algorithms. Table 1 summarizes the different models. In all cases, optimizing the model parameters with respect to the expectation in the loss (4) is done using the reparametrization trick (following [20] for continuous states, and [16] for discrete states).
Discrete Abstract States – One-Hot Representation. We start from a simple abstract state representation, in which each s ∈ S is represented as an N-dimensional one-hot vector. We denote by θ ∈ R^{N×N} the model parameters, and compute transition probabilities as: T_M(s′|s) = Softmax(s⊤θ).
Discrete Abstract States – Binary Representation. We present a more expressive abstract state representation using binary states. Binary state representations are common in AI planning, where each binary element is known as a predicate, and corresponds to a particular object property being true or false [34]. Using Causal InfoGAN, we learn the predicates directly from data.
We propose a parametric transition model that is suitable for binary representations. Let s ∈ {0, 1}^N be an N-dimensional binary vector, drawn from P_M(s). We generate the next state s′ by first drawing a random action vector a ∈ {0, 1}^M with some probability P_M(a).⁶ Let l_i denote a feed-forward neural network with sigmoid output, parametrized by θ, that maps the state s and action a to the Bernoulli parameter of s′_i given s, a. Thus, the probability of the next state s′ is finally given by:

T_M(s′ = v|s) = E_a[∏_i T_M(s′_i = v_i|s, a)] = E_a[∏_i l_i(s, a)^{v_i} (1 − l_i(s, a))^{1−v_i}].   (6)

We emphasize that there is not necessarily any correspondence between the action vector a and the real actions that generated the observation pairs in the data. The action a is simply a means to induce stochasticity into the state transition network.

⁶See Appendix B.2 Binary States for experimental choices.

For planning with discrete models, we interpret the stochastic transition model T_M as providing the possible state transitions, i.e., for every s′ such that T_M(s′|s) > ε there exists a possible transition from s to s′. For planning, we require abstract state representations that are compatible with efficient AI planning algorithms. The one-hot and binary representations above can be directly plugged into graph-planning algorithms such as Dijkstra's shortest-path algorithm [34].
Continuous Abstract States. For some domains, such as the rope manipulation in our experiments, a continuous abstract state is more suitable. We consider a model where s ∈ S is an N-dimensional continuous vector. Planning in high-dimensional continuous domains, however, is hard in general. Here, we propose a simple and effective solution: we will learn a latent planning system such that linear interpolation between states makes for feasible plans. To bring about such a model, we consider transition probabilities T_M(s′|s) given as Gaussian perturbations of the state: s′ = s + δ, where δ ∼ N(0, Σ_θ(s)) and Σ_θ is a diagonal covariance matrix, represented by an MLP with parameters θ.
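A minimal sketch of this local Gaussian transition, with a constant diagonal covariance standing in for the learned MLP Σ_θ (names and constants are illustrative):

```python
import math
import random

def sigma(s, base_var=0.05):
    # Stand-in for the diagonal covariance Sigma_theta(s); in the model
    # proper this is the output of an MLP of s.
    return [base_var for _ in s]

def sample_next_state(s, rng):
    # s' = s + delta, delta ~ N(0, Sigma_theta(s)), applied per dimension
    return [si + rng.gauss(0.0, math.sqrt(var)) for si, var in zip(s, sigma(s))]

def covariance_penalty(states):
    # L_cont(M) = E_{s~P_M} ||Sigma_theta(s)||_2, pushing transitions
    # to stay local (estimated over a batch of sampled states)
    return sum(math.sqrt(sum(v * v for v in sigma(s))) for s in states) / len(states)

rng = random.Random(0)
s0 = [0.0, 0.0]
s1 = sample_next_state(s0, rng)
```

Keeping the covariance small is what later justifies planning by straight-line interpolation between abstract states.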
The key idea here is that, if only small local transitions are possible in the system, then a linear interpolation between two states s_start, s_goal has high probability, and therefore represents a feasible trajectory in the observation space. To encourage such small transitions, we add an L2 norm of the covariance matrix to the loss function (5): L_cont(M) = E_{s∼P_M}‖Σ_θ(s)‖₂. The prior probability P_M for each element of s is uniform in [−1, 1].

4.4 Decoding a State Trajectory to an Observation Walkthrough Trajectory

We now discuss how to generate a feasible sequence of observations from the planned state trajectory. Here, as before, we separate the discussion for systems with low-dimensional observations and systems with image observations, as we found that different practices work best for each.
For low-dimensional observations, we structure the GAN generator G to have an observation-conditional form: o = G1(z, s, s′), o′ = G2(z, o, s, s′). Using this generator form, we can sequentially generate observations from a state sequence s_1, ..., s_T. We first use G1 to generate o_1 from s_1, s_2, and then, for each 1 ≤ t < T, use G2 to generate o_{t+1} from s_t, s_{t+1}, and o_t.
For high-dimensional image observations, the sequential generator does not work well, since small errors in the image generation tend to accumulate when fed back into the generator. We therefore follow a different approach. To generate the i'th observation in the trajectory, o_i, we use the generator with the input s_i, s_{i+1}, and a noise z that is fixed throughout the whole trajectory. The generator actually outputs a pair of sequential images, but we discard the second image in the pair.
To further improve the planning result, we generate K random trajectories with different random noise z, and select the best trajectory by using a discriminator D to provide a confidence score for each trajectory.
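The continuous-state pipeline above (interpolate in latent space, decode each state pair with a fixed z, and keep the best of K noise draws) can be sketched as follows; the toy generator and scoring function stand in for the learned G and the discriminator:

```python
import random

def interpolate(s_start, s_goal, m):
    # m intermediate abstract states on the segment from s_start to s_goal
    return [[a + (b - a) * t / (m + 1) for a, b in zip(s_start, s_goal)]
            for t in range(m + 2)]

def toy_generator(s, sp, z):
    # Illustrative stand-in for G: returns a pair of "observations"
    return ([v + z for v in s], [v + z for v in sp])

def decode_trajectory(states, generator, z):
    # Decode each consecutive state pair with the same fixed noise z,
    # keeping only the first observation of each generated pair
    return [generator(states[i], states[i + 1], z)[0]
            for i in range(len(states) - 1)]

def best_of_k(states, generator, score, k=5, seed=0):
    # Generate K candidate trajectories with different z and keep the
    # highest-scoring one (score plays the discriminator's role here)
    rng = random.Random(seed)
    trajs = [decode_trajectory(states, generator, rng.uniform(-1.0, 1.0))
             for _ in range(k)]
    return max(trajs, key=score)
```

Fixing z along the trajectory keeps nuisance factors (lighting, texture) consistent across the generated frames, so only the abstract state changes from frame to frame.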
In the low-dimensional case, we use the GAN discriminator. In the high-dimensional case, however, we find that the discriminator tends to overfit to the generator. Therefore, we trained an auxiliary discriminator for novelty detection, as described in the experiments (Section 6.2).

5 Related Work

Combining deep generative models with structured dynamical systems has been explored in the context of variational autoencoders (VAEs), where the latent space was continuous [6, 17]. Watter et al. [46] used such models for planning, by learning latent linear dynamics and using a linear quadratic Gaussian control algorithm. Disentangled video prediction [9] separates object content and position, but has not been used for planning. Very recently, Corneil et al. [7] suggested Variational State Tabulation (VaST) – a VAE-based approach for learning latent dynamics over binary state representations, and planning in the latent space using prioritized sweeping to speed up RL. Causal InfoGAN shares several similarities with VaST, such as using Gumbel-Softmax to backpropagate through transitions of discrete binary states, and leveraging the structure of the binary states for planning. However, VaST is formulated to require the agent's actions, and is thus limited to single-time-step predictions. More generally, our work is developed under the GAN formulation, which, to date, has several benefits over VAEs, such as superior quality of image generation [19]. Causal InfoGAN can also be used with continuous abstract states.
The semiparametric topological memory (SPTM) [37] is another recent approach for solving problems such as Problem 1, by planning in a graph where every observation in the data is a node, and connectivity is decided using a learned similarity metric between pairs of observations. SPTM has shown impressive results on image-based navigation.
However, Causal InfoGAN's parametric approach of learning a compact model for planning has the potential to scale up to more complex problems, in which the increasing amount of data required would make the nonparametric SPTM approach difficult to apply.
Learning state aggregation and state representation has a long history in RL. Methods such as [27, 38] exploit the value function for measuring state similarity, and are therefore limited to the task defined by the reward. Methods for general state aggregation have also been proposed, based on spectral clustering [26, 40, 25, 24] and variants of K-means [3]. All these approaches rely in some form on the Euclidean distance as a metric between observation features. As we show in our experiments, the Euclidean distance can be unsuitable even in low-dimensional continuous domains.
Recent work in deep RL explored learning goal-conditioned value functions and policies [2, 32], and policies with an explicit planning computation [43, 31, 39]. These approaches require a reward signal for learning (or supervision from an expert [39]). In our work, we do not require a reward signal, and learn a general model of the dynamical system, which is used for goal-directed planning.
Our work is also related to learning models of intuitive physics. Previous work explored feedforward neural networks for predicting outcomes of physical experiments [22], neural networks for modelling relations between objects [47, 36], and prediction based on physics simulators [4, 48]. To the best of our knowledge, these approaches cannot be used for planning. However, related ideas would likely be required for scaling our method to more complex domains, such as manipulating several objects.
In the planning literature, most studies relied on manually designed state representations. In a recent work, Konidaris et al.
[21] automatically extracted state representations from raw observations, but relied on a prespecified set of skills for the task. In our work, we automatically extract state representations by learning salient features that describe the causal structure of the data.

6 Experiments

In our experiments, we aim to (1) visualize the abstract states and planning in CIGAN; (2) compare CIGAN with recent state-aggregation methods in the literature; (3) show that CIGAN can produce realistic visual plans in a complex dynamical system; and (4) show that CIGAN significantly outperforms baseline methods. We begin our investigation with a set of toy tasks, specifically designed to demonstrate the benefits of CIGAN, where we can also perform an extensive quantitative evaluation. We later present experiments on a real dataset of robotic rope manipulation.

6.1 Illustrative Experiments

In this section we evaluate CIGAN on a set of 2D navigation problems. These problems abstract away the challenges of learning visual features, allowing an informative comparison on the task of learning causal structure in data, and using it for planning. For details of the training data see Appendix ??.
Our toy domains involve a particle moving in a 2-dimensional continuous domain with impenetrable obstacles, as depicted in Figure 2, and Figure 5 in Appendix A.1. The observations are the (x, y) coordinates of the particle in the plane and, in the door-key domain, also a binary indicator for holding the key. We generate data trajectories by simulating a random motion of the particle, started from random initial points. We consider the following geometrical arrangements of the domain, chosen to demonstrate the properties of our method.

1.
Tunnels: the domain is partitioned into two unconnected rooms (top/bottom), where each room contains an obstacle, positioned such that transitioning between the left/right quadrants is only possible through a narrow tunnel.

2. Door-key: two rooms are connected by a door. The door can be traversed only if the agent holds the key, which is obtained by moving to the red-marked area in the top right corner of the upper room. Holding the key is represented as a binary 0/1 element in the observation.

3. Rescaled door-key: same as the door-key domain, but the key observation is rescaled to be a small ε when the agent is holding the key, and 0 otherwise.

Our domains are designed to distinguish when standard state aggregation methods, which rely on the Euclidean metric, can work well. In the tunnel domain, the Euclidean metric is not informative about the dynamics of the task: two points in different rooms can be very close in Euclidean distance but not connected, while points in the same room can be more distant but connected. In the door-key domain, the Euclidean distance is informative if observations with and without the key are very distant in Euclidean space, as in the 0/1 representation (compared to the domain size, which is in [-1, 1]). In the rescaled door-key domain, we make the Euclidean distance less informative by changing the key observation to 0/ε.

We compare CIGAN with several recent methods for aggregating observation features into states for planning. Note that in these simple 2D domains, feature extraction is not necessary, as the observations are already low-dimensional vectors. The simplest baseline is K-means, which relies on the Euclidean distance between observations. In [3], a variant of K-means for temporal data was proposed, using a window of consecutive observations to measure a smoothed Euclidean distance to cluster centroids. We refer to this method as temporal K-means.
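To make the failure mode concrete, the following sketch illustrates the Euclidean K-means baseline on a toy version of the tunnel domain. The wall geometry, data, and the minimal NumPy K-means are hypothetical stand-ins, not the paper's implementation:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Minimal K-means on raw observations (the Euclidean-distance baseline)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        # Assign each observation to its nearest centroid (Euclidean distance).
        labels = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=2), axis=1)
        # Recompute centroids; keep the old centroid if a cluster empties.
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    return labels, centroids

# Hypothetical tunnel-domain data: particle positions in [-1, 1]^2, wall at y = 0.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(500, 2))
labels, _ = kmeans(X, k=8)

# Two points straddling the wall: tiny Euclidean distance, yet unreachable
# from one another under the true dynamics -- K-means treats them as "near".
a, b = np.array([0.5, 0.02]), np.array([0.5, -0.02])
print(np.linalg.norm(a - b))
```

Because the assignment step sees only distances, nothing prevents a cluster from straddling the wall, which is exactly the behavior visible in Figure 2(c).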
In [26], and more recently in [40] and [25], spectral clustering (SC) was used to cluster observations. For continuous observations, SC requires a distance function to build a connectivity graph, and previous studies [26, 40, 25] relied on the Euclidean distance, using either nearest neighbors to connect nodes or exponentiated-distance weighted edges.

In Figure 2, we show the CIGAN classification of observations to abstract states, Q(s|o), and compare with the K-means baseline; the other baselines gave qualitatively similar results. Note that CIGAN learned a clustering that is related to the dynamical properties of the domain, while the baselines, which rely on a Euclidean distance, learned clusters that are not informative about the real possible transitions. As a result, CIGAN clearly separates abstract states within each room, while the K-means baseline clusters observations across the wall. This demonstrates the potential of CIGAN to learn meaningful state abstractions without requiring a distance function in observation space. Similar results for the door-key domains are shown in Appendix A.1.

To evaluate planning performance, we hand-coded an oracle function that evaluates whether an observation trajectory is feasible or not (e.g., does not cross obstacles, correctly reports ∅ when a trajectory does not exist). For CIGAN, we ran the planning algorithm described in Section 4. For the baselines, we calculated cluster transitions from the data, and generated planning trajectories in observation space by using the cluster centroids. We chose algorithm parameters and stopping criteria by measuring the average feasibility score on a validation set of start/goal observations, and report the average feasibility on a held-out test set of start/goal observations.
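The feasibility oracle above can be sketched as follows. The segment predicate shown here (a wall at y = 0 with a tunnel opening) is a hypothetical example; the actual oracle is domain-specific and hand-coded per task:

```python
import numpy as np

def feasible(traj, segment_ok):
    """Score a planned observation trajectory with a hand-coded oracle.

    traj: list of (x, y) observations; segment_ok: domain-specific predicate
    returning True iff the move between two consecutive observations is legal
    (e.g., does not cross a wall). An empty plan (the planner's report that no
    trajectory exists) is counted as infeasible here for simplicity.
    """
    if len(traj) == 0:
        return False
    return all(segment_ok(a, b) for a, b in zip(traj[:-1], traj[1:]))

# Hypothetical oracle for a wall along y = 0 with a tunnel at x in [0.4, 0.6]:
# crossing the wall is allowed only when both endpoints lie in the tunnel.
def segment_ok(a, b):
    crosses = (a[1] > 0) != (b[1] > 0)
    in_tunnel = 0.4 <= a[0] <= 0.6 and 0.4 <= b[0] <= 0.6
    return (not crosses) or in_tunnel

good_plan = [np.array([0.5, 0.3]), np.array([0.5, -0.3])]   # passes through the tunnel
bad_plan = [np.array([0.1, 0.3]), np.array([0.1, -0.3])]    # crosses the wall
print(feasible(good_plan, segment_ok), feasible(bad_plan, segment_ok))
```

The reported score is then the mean of this check over a test set of start/goal pairs, matching the "average feasibility" numbers in Table 2.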
Our results in Table 2 show that by learning more informative clusters, CIGAN achieved significantly better planning.

Figure 2: 2D particle results on the tunnel domain. (a) The domain: top/bottom rooms are not connected; left/right quadrants are connected through a narrow tunnel. Several example random-walk trajectories are shown. (b) Clustering found by CIGAN. (c) Clustering found by K-means. (d) Example walkthrough trajectories generated by CIGAN, from a point at the top right to five other locations on the map, marked by colored circles. For trajectories that were not found, only the target is shown. Note that CIGAN learned clusters that correspond to the possible dynamics of the particle in the task, and was therefore able to generate reasonable planning trajectories.

Table 2: Planning results for the 2D tasks. The table shows the average feasibility of plans (higher is better) generated by the different algorithms. Note that CIGAN significantly outperforms the baselines in domains where the Euclidean distance is not informative for planning.

                      Tunnels   Door-key   Rescaled door-key
CIGAN                  98%        98%           97%
K-means                12.25%     100%          0.0%
Temporal K-means       7.0%       100%          0.0%
Spectral clustering    8.75%      60%           20.0%

6.2 Rope Manipulation
In this section we demonstrate CIGAN on the task of generating realistic walkthroughs of robotic rope manipulation.
Then, we show that CIGAN generates significantly better trajectories than those generated by state-of-the-art generative model baselines, both visually and quantitatively.

The rope manipulation dataset [29] contains sequential images of a rope manipulated in a self-supervised manner, by a robot randomly choosing a point on the rope and perturbing it slightly. Given these data, the task is to manipulate the rope in a goal-oriented fashion, from one configuration to another, where a goal is represented as an image of the desired rope configuration. In the original study, Nair et al. [29] used the data to learn an inverse dynamics model for manipulating the rope between two images of similar rope configurations. Then, to solve long-horizon planning, Nair et al. required a human to provide the walkthrough sequence of rope poses, and used the learned controller to execute the short-horizon transitions within the plan.

Figure 3: Rope walkthroughs generated from CIGAN, InfoGAN, and DCGAN. Red crosses show unfeasible one-step transitions with respect to the data. See more plans in Appendix A.2.

Figure 4: Evaluation of walkthrough planning in the rope domain. We trained a classifier to predict whether two observations are sequential or not (1 = sequential, 0 = not sequential), and compare the average classification score for the different generative models. Note that CIGAN significantly outperforms the baselines, in alignment with the qualitative results of Figure 3.

In our experiment, we show that CIGAN can be used to generate walkthrough plans directly from data for long-horizon tasks, without requiring additional human guidance. We train a CIGAN model on the rope manipulation data of [29]. We pre-processed the data by removing the background and applying a grayscale transformation. We chose the continuous abstract state representation with the linear interpolation planner as described in Section 4.
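The linear interpolation planner can be sketched in a few lines. The `encoder` and `generator` names in the usage comments are assumed network interfaces for illustration, not the paper's exact API:

```python
import numpy as np

def interpolation_plan(s_start, s_goal, n_steps):
    """Linearly interpolate between two continuous abstract states.

    A sketch of the continuous-state planner: the plan is a straight line
    in latent space from the start state to the goal state.
    """
    alphas = np.linspace(0.0, 1.0, n_steps)
    return [(1.0 - a) * s_start + a * s_goal for a in alphas]

# Hypothetical usage: project the start/goal observations onto abstract
# states, plan in latent space, then decode each state to an image.
#   s_start, s_goal = encoder(o_start), encoder(o_goal)   # assumed networks
#   states = interpolation_plan(s_start, s_goal, n_steps=8)
#   walkthrough = [generator(s, z) for s in states]       # z: sampled noise

s0, s1 = np.zeros(16), np.ones(16)       # toy 16-dimensional abstract states
plan = interpolation_plan(s0, s1, 5)
print(len(plan), plan[2][0])
```

Each interpolated state is decoded by the generator into an observation, producing the visual walkthroughs shown in Figure 3; this is why a smooth latent space is what makes the interpolated plans plausible.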
Note that, as described in Section 4, the encoding, planning, and decoding methods in this case are not specific to CIGAN, and can be used with a GAN or InfoGAN generative model, allowing a fair comparison with alternative representation learning methods. In Figure 3, we show walkthroughs generated using CIGAN, InfoGAN, and DCGAN. Noticeably, Causal InfoGAN resulted in a smooth latent space where linear interpolation indeed corresponds to plausible trajectories.

To numerically evaluate planning performance, we propose a visual fidelity score inspired by the Inception score for evaluating GANs [35]: we train a binary classifier to predict whether two images are sequential in the data or not7. For an image pair, the classifier output therefore provides a score between 0 and 1 for the feasibility of the transition. We then compute the trajectory score: the average classifier score over image pairs in the trajectory. Note that this classifier is trained independently of the generative models, making for an impartial metric. For each start and goal, we pick the best trajectory score out of 400 samples of the noise variable z.8 As shown in Figure 4, Causal InfoGAN achieved a significantly higher trajectory score, averaged over 57 task configurations.

7 Conclusion
We presented Causal InfoGAN, a framework for learning deep generative models of sequential data with a structured latent space.
By choosing the latent space to be compatible with efficient planning algorithms, we developed a framework capable of generating goal-directed trajectories from high-dimensional dynamical systems.

Our results for generating realistic manipulation plans of rope suggest promising applications in robotics, where designing models and controllers for manipulating deformable objects is challenging. The binary latent models we explored provide a connection between deep representation learning and classical AI planning, where Causal InfoGAN can be seen as a method for learning object predicates directly from data. In future work we intend to investigate this direction further, and incorporate object-oriented models, which are a fundamental component in classical AI.

7The positive data are pairs of rope images that are 1 step apart, and the negative data are randomly chosen pairs from different runs, which are highly likely to be farther than 1 step apart.

8This selection process is applied the same way to the DCGAN and InfoGAN baselines.

Acknowledgement

This work was funded in part by ONR PECASE N000141612723 and Siemens. Thanard Kurutach and Aviv Tamar were supported by ONR PECASE N000141612723. Aviv Tamar was partially supported by the Technion Viterbi scholarship. The authors wish to thank Tom Zahavy and Aravind Srinivas for sharing their code for our comparisons with MDP state aggregation baselines, and Elinor Tamar for timely assistance in preprocessing the rope data.

References
[1] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, pages 5074–5082, 2016.

[2] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. P. Abbeel, and W. Zaremba.
Hindsight experience replay. In Advances in Neural Information Processing Systems, pages 5048–5058, 2017.

[3] N. Baram, T. Zahavy, and S. Mannor. Spatio-temporal abstractions in reinforcement learning through neural encoding. 2016.

[4] P. W. Battaglia, J. B. Hamrick, and J. B. Tenenbaum. Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 110(45):18327–18332, 2013.

[5] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172–2180, 2016.

[6] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville, and Y. Bengio. A recurrent latent variable model for sequential data. In Advances in Neural Information Processing Systems, pages 2980–2988, 2015.

[7] D. Corneil, W. Gerstner, and J. Brea. Efficient model-based deep reinforcement learning with variational state tabulation. arXiv preprint arXiv:1802.04325, 2018.

[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 2012.

[9] E. L. Denton and V. Birodkar. Unsupervised learning of disentangled representations from video. In Advances in Neural Information Processing Systems 30, pages 4414–4423, 2017.

[10] J. Donahue, P. Krähenbühl, and T. Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.

[11] R. E. Fikes, P. E. Hart, and N. J. Nilsson. Learning and executing generalized robot plans. In Readings in Artificial Intelligence, pages 231–249. Elsevier, 1981.

[12] C. Finn and S. Levine. Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2786–2793. IEEE, 2017.

[13] C. Finn, S. Levine, and P. Abbeel.
Guided cost learning: Deep inverse optimal control via policy optimization. In International Conference on Machine Learning, pages 49–58, 2016.

[14] C. Finn, X. Y. Tan, Y. Duan, T. Darrell, S. Levine, and P. Abbeel. Deep spatial autoencoders for visuomotor learning. In Robotics and Automation (ICRA), 2016 IEEE International Conference on, pages 512–519. IEEE, 2016.

[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[16] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. arXiv preprint arXiv:1611.01144, 2016.

[17] M. Johnson, D. K. Duvenaud, A. Wiltschko, R. P. Adams, and S. R. Datta. Composing graphical models with neural networks for structured representations and fast inference. In NIPS, pages 2946–2954, 2016.

[18] K. Kansky, T. Silver, D. A. Mély, M. Eldawy, M. Lázaro-Gredilla, X. Lou, N. Dorfman, S. Sidor, S. Phoenix, and D. George. Schema networks: Zero-shot transfer with a generative causal model of intuitive physics. In ICML, volume 70, pages 1809–1818, 2017.

[19] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196, 2017.

[20] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[21] G. Konidaris, L. P. Kaelbling, and T. Lozano-Perez. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61:215–289, 2018.

[22] A. Lerer, S. Gross, and R. Fergus. Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312, 2016.

[23] S. Levine, C. Finn, T. Darrell, and P. Abbeel.
End-to-end training of deep visuomotor policies. JMLR, 17, 2016.

[24] M. Liu, M. C. Machado, G. Tesauro, and M. Campbell. The eigenoption-critic framework. arXiv preprint arXiv:1712.04065, 2017.

[25] M. C. Machado, M. G. Bellemare, and M. Bowling. A Laplacian framework for option discovery in reinforcement learning. arXiv preprint arXiv:1703.00956, 2017.

[26] S. Mahadevan and M. Maggioni. Proto-value functions: A Laplacian framework for learning representation and control in Markov decision processes. Journal of Machine Learning Research, 8(Oct):2169–2231, 2007.

[27] S. Mannor, I. Menache, A. Hoze, and U. Klein. Dynamic abstraction in reinforcement learning via clustering. In Proceedings of the Twenty-First International Conference on Machine Learning, page 71. ACM, 2004.

[28] V. Mnih, K. Kavukcuoglu, D. Silver, A. Rusu, J. Veness, M. Bellemare, A. Graves, M. Riedmiller, A. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

[29] A. Nair, D. Chen, P. Agrawal, P. Isola, P. Abbeel, J. Malik, and S. Levine. Combining self-supervised learning and imitation for vision-based rope manipulation. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2146–2153. IEEE, 2017.

[30] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh. Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871, 2015.

[31] J. Oh, S. Singh, and H. Lee. Value prediction network. In Advances in Neural Information Processing Systems, pages 6118–6128, 2017.

[32] V. Pong, S. Gu, M. Dalal, and S. Levine. Temporal difference models: Model-free deep RL for model-based control. CoRR, abs/1802.09081, 2018.

[33] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. Van de Wiele, V. Mnih, N. Heess, and J. T. Springenberg.
Learning by playing: solving sparse reward tasks from scratch. arXiv preprint arXiv:1802.10567, 2018.

[34] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (3rd internat. ed.). Pearson Education, 2010.

[35] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, pages 2234–2242, 2016.

[36] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, pages 4974–4983, 2017.

[37] N. Savinov, A. Dosovitskiy, and V. Koltun. Semi-parametric topological memory for navigation. arXiv preprint arXiv:1803.00653, 2018.

[38] D. I. Simester, P. Sun, and J. N. Tsitsiklis. Dynamic catalog mailing policies. Management Science, 52(5):683–696, 2006.

[39] A. Srinivas, A. Jabri, P. Abbeel, S. Levine, and C. Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.

[40] A. Srinivas, R. Krishnamurthy, P. Kumar, and B. Ravindran. Option discovery in hierarchical reinforcement learning using spatio-temporal clustering. arXiv preprint arXiv:1605.05359, 2016.

[41] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction, volume 1. MIT Press, Cambridge, 1998.

[42] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

[43] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. In NIPS, pages 2146–2154, 2016.

[44] M. Vallati, L. Chrpa, M. Grześ, T. L. McCluskey, M. Roberts, S. Sanner, et al. The 2014 international planning competition: Progress and trends. AI Magazine, 36(3):90–98, 2015.

[45] W. Wang, A. Wang, A. Tamar, X.
Chen, and P. Abbeel. Safer classification by synthesis. arXiv preprint arXiv:1711.08534, 2017.

[46] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller. Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754, 2015.

[47] N. Watters, A. Tacchetti, T. Weber, R. Pascanu, P. Battaglia, and D. Zoran. Visual interaction networks. arXiv preprint arXiv:1706.01433, 2017.

[48] J. Wu, E. Lu, P. Kohli, B. Freeman, and J. Tenenbaum. Learning to see physics via visual de-animation. In Advances in Neural Information Processing Systems, pages 152–163, 2017.

[49] J.-Y. Zhu, R. Zhang, D. Pathak, T. Darrell, A. A. Efros, O. Wang, and E. Shechtman. Toward multimodal image-to-image translation. In Advances in Neural Information Processing Systems, pages 465–476, 2017.