{"title": "Adaptive Skip Intervals: Temporal Abstraction for Recurrent Dynamical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 9816, "page_last": 9826, "abstract": "We introduce a method which enables a recurrent dynamics model to be temporally abstract. Our approach, which we call Adaptive Skip Intervals (ASI), is based on the observation that in many sequential prediction tasks, the exact time at which events occur is irrelevant to the underlying objective. Moreover, in many situations, there exist prediction intervals which result in particularly easy-to-predict transitions. We show that there are prediction tasks for which we gain both computational efficiency and prediction accuracy by allowing the model to make predictions at a sampling rate which it can choose itself.", "full_text": "Adaptive Skip Intervals: Temporal Abstraction for\n\nRecurrent Dynamical Models\n\nAlexander Neitz1 3 Giambattista Parascandolo1 2 Stefan Bauer1 2 Bernhard Sch\u00f6lkopf1 2\n\n1Max Planck Institute for Intelligent Systems\n2Max Planck ETH Center for Learning Systems\n\n3aneitz@tue.mpg.de\n\nAbstract\n\nWe introduce a method which enables a recurrent dynamics model to be temporally\nabstract. Our approach, which we call Adaptive Skip Intervals (ASI), is based on the\nobservation that in many sequential prediction tasks, the exact time at which events\noccur is irrelevant to the underlying objective. Moreover, in many situations, there\nexist prediction intervals which result in particularly easy-to-predict transitions.\nWe show that there are prediction tasks for which we gain both computational\nef\ufb01ciency and prediction accuracy by allowing the model to make predictions at a\nsampling rate which it can choose itself.\n\n1\n\nIntroduction\n\nA core component of intelligent agents is the ability to predict certain properties of future states of\ntheir environments (Legg and Hutter, 2007). 
For example, model-based reinforcement learning (Daw, 2012; Arulkumaran et al., 2017) decomposes the task into the two components of learning a model and then using the learned model for planning ahead.
Despite significant recent advances, even relatively simple tasks such as pushing objects remain challenging for robots, and foresight for robot planning is still limited to relatively short-horizon tasks (Finn and Levine, 2017). This is partially due to the fact that errors from even the early stages of the prediction pipeline accumulate, especially when new or complex environments are considered.
Many dynamical systems have the property that long-term predictions of future states are easiest to learn if they are obtained by a sequence of incremental predictions. Our starting point is the hypothesis that at each instant of the evolution, there is an ideal temporal step length associated with those state transitions which are easiest to predict: intervals which are too long correspond to complicated mechanisms that could be simplified by breaking them down into a successive application of simpler mechanisms. On the other hand, intervals which are too short do not contain much change, which means that the predictor has to represent roughly the identity – this can lead to a situation where the model makes small absolute errors δs, but a large relative error δs/Δt, which is the rate at which the prediction error accumulates. This tradeoff is illustrated in Figure 1. An additional drawback of too-short prediction intervals is that they require many predictions, which can be computationally expensive.
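The tradeoff can be made concrete with a toy model. The following sketch is purely illustrative and not from the paper: we assume a hypothetical per-step error of the form δ(Δt) = c0 + c1·Δt², i.e. a small floor c0 even for near-identity transitions plus a term growing with the skip interval; under this assumption the accumulation rate δ(Δt)/Δt is minimized at an intermediate Δt, as hypothesized in Figure 1.

```python
# Toy illustration (not the paper's model): assume the one-step prediction
# error grows with the skip interval as delta(dt) = c0 + c1 * dt**2.
# The error accumulation *rate* is then delta(dt) / dt, which is
# minimized at an intermediate dt rather than at either extreme.
def accumulation_rate(dt, c0=0.1, c1=0.01):
    per_step_error = c0 + c1 * dt ** 2   # hypothetical error model
    return per_step_error / dt

rates = {dt: accumulation_rate(dt) for dt in range(1, 21)}
best_dt = min(rates, key=rates.get)
# With these (arbitrary) constants the minimum lies strictly between
# the shortest and the longest skip interval considered.
assert 1 < best_dt < 20
```

The constants c0 and c1 are arbitrary; the qualitative shape (a U-shaped rate with a minimum at intermediate Δt) is the only point of the sketch.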
Somewhere in-between the two extremes, there is an ideal step length corresponding to\ntransitions that are easiest to represent and learn.\nWe propose Adaptive Skip Intervals (ASI), a simple change to autoregressive environment simulators\n(Chiappa et al., 2017; Buesing et al., 2018) which can be applied to systems in which it is not\nnecessary to predict the exact time of events. While in the literature, abstractions are often considered\nwith respect to hierarchical components e.g. for locomotor control (Heess et al., 2016) or expanding\nthe dynamics in a latent space (Watter et al., 2015), our work focuses on temporal abstractions. Our\ngoal is to understand the dynamics of the environment in terms of robust causal mechanisms at the\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fright level of temporal granularity. This idea is closely related to causal inference (Peters et al., 2017)\nand the identi\ufb01cation of invariances (Pearl, 2009; Sch\u00f6lkopf et al., 2012; Peters et al., 2016) and\nmechanisms (Parascandolo et al., 2017).\nASI allows the model to dynamically adjust the temporal resolution\nat which predictions are made, based on the speci\ufb01c observed input.\nIn other words, the model has the option to converge to the easiest-to\npredict transitions, with prediction intervals \u2206t that are not constant\nover the whole trajectory, but situation-dependent. Moreover, the\nmodel is more robust to certain shifts in the evolution speed at\ntraining time, and also to shifts to datasets where the trajectories\nare partly corrupted. For example, when some frames are missing\nor extremely noisy, a frame-by-frame prediction method would be\nforced to model the noise, especially if it is not independent of\nthe state. 
Flexibly adjusting the time resolution of predictions also results in greater computational efficiency, as fewer steps need to be predicted where they are not necessary – a key requirement for real-time applications.
A type of prediction task which can especially profit from our proposed method is one which exhibits a property we call inconsequential chaos. To illustrate this, consider the following example: in Figure 2 we visualize the trajectories of a ball which falls into a funnel-shaped object at different initial horizontal velocities. The exact trajectories that are taken within the funnel depend sensitively on the initial state and are therefore difficult to predict ahead of time. On the other hand, predicting that the ball will hit the horizontal platform on the bottom is easy, because it only requires knowing that when the ball falls somewhere into the funnel, it will come out at the bottom end, irrespective of how long it bounces around. If we are only interested in predicting where the ball will ultimately land, we can skip the difficult parts, provided that they are inconsequential. Figure 3 explains another perspective to motivate our method.

Figure 1: Hypothesized relationship between skip interval Δt and error accumulation rate L/Δt.

2 Preliminaries

2.1 Problem statement

The machine learning problem we are considering is a classification problem where the labels are generated by a dynamical process, such as a Hidden Markov Model. As auxiliary data, we get access to observations of the system's internal state. The training data consists of observation sequences {x^(i)}_{i ∈ 1,...,N} and labels {y^(i)}_{i ∈ 1,...,N}. The trajectories x are ordered sequences of elements x_t from an observation space X. Typically, a trajectory x arises from repeatedly measuring the dynamical system's state at some fixed sampling rate.
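As a concrete picture of this data layout, the following sketch builds a toy dataset of N trajectories with one label each. All shapes and values here are hypothetical (the paper uses video frames as observations); only the structure – per-trajectory frame sequences plus a single categorical label – follows the setup above.

```python
# Hypothetical sketch of the data layout: N trajectories, each an
# ordered sequence of T frames (here H x W grids of floats standing in
# for video frames), plus one categorical label per trajectory.
import random

random.seed(0)
N, T, H, W = 4, 30, 32, 32      # illustrative sizes, not from the paper
num_classes = 5                  # |Y|, e.g. five platforms in Funnel board

trajectories = [
    [[[random.random() for _ in range(W)] for _ in range(H)] for _ in range(T)]
    for _ in range(N)
]                                                 # x^(i)
labels = [random.randrange(num_classes) for _ in range(N)]  # y^(i)

# At test time, only the first k + 1 observations are available
# (k = 0 in the fully observable case) and the label y must be predicted.
k = 0
test_input = trajectories[0][: k + 1]
assert len(test_input) == 1 and len(labels) == N
```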
To keep the scope limited, we assume the labels y^(i) to be categorical, i.e. belonging to a finite set Y. In our formulation, there is only a single label for each trajectory, which intuitively corresponds to the eventual "outcome" of the particular system evolution. At test time, we are only given some initial observations (x_0, x_1, ..., x_k), for some small k (e.g., k = 0 in the fully observable case), and have to predict the corresponding label y.
Note that the problem does not demand the prediction of any future observations x_t. As a performance measure we use the accuracy of the label predictions. The role of the classification task is to provide a way to measure performance, as the objective is to know how well the model is suited to predict the qualitative outcome of each instance. We explicitly do not care about the loss in pixel space. Since frames may be skipped, video-prediction metrics are not relevant for this task. In the future we would like to use our model in latent spaces as well.

Figure 2: Visualization of a ball which is dropped into a funnel at different initial horizontal velocities. The part of the trajectory within the funnel can be considered inconsequential chaos.

Figure 3: One way to motivate the need for adaptive skip intervals compared to a fixed temporal coarsening is to consider the complexity of the learned model. If the underlying true dynamics have recurring "mechanisms" which take different amounts of time, ASI enables the model to represent fewer distinct transition types, reducing the required model capacity and thus the amount of training data.

It is straightforward to generalize the classification task to a value prediction task in a (hierarchical) reinforcement learning setting, given a fixed policy (e.g. an option, as introduced in Sutton et al. (1999)).
However, in this work we focus on uncontrolled tasks only.

2.2 Environment simulators

Environment simulators are models which approximate the conditional probability distribution

P(X_{t+1}, R_{t+1} | X_t)    (1)

where X_t is a random variable with range X which describes the Markovian state of the system at time t, and R_t is the random variable over some real-valued cumulant which we want to track for our task. In order to simplify our experiments, in this paper we consider the special case of fully observable tasks. For this reason, we use the terms "observation" and "state" interchangeably. However, note that in realistic applications, it may be desirable to predict future states given past observations, which poses the additional challenge of state inference. As an additional simplification, we consider deterministic simulators, which put a probability point mass of one on a single future state. For a recent, more detailed investigation of several efficient state-space architectures, see Buesing et al. (2018).
Note that given a distribution over an initial state X_0, we can apply an environment simulator multiple times, yielding a probability distribution over trajectories and cumulants:

P(X_{0:N}, R_{0:N}) = P(X_0) ∏_{t=1}^{N} P(X_t, R_t | X_{t−1})    (2)

Temporally abstract environment simulators only need to represent a relaxed version of the above conditional probability distribution:

P(X_{t+τ}, R_t^τ | X_t)    (3)

where τ is some arbitrary time skip interval up to the end of the trajectory, which can be chosen by the model, and R_t^τ denotes the sum Σ_{k=t}^{t+τ} R_k. In other words, a temporally abstract environment simulator must only be able to predict some future state of the system and additionally provide the sum of the cumulants since the last step.
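A deterministic, temporally abstract simulator in the sense of Eq. (3) can be rolled out as follows. This is a minimal sketch: `f` is any function mapping a state to a (next state, summed cumulant) pair, and the toy dynamics used at the bottom are hypothetical placeholders.

```python
# Sketch of rolling out a deterministic, temporally abstract ("jumpy")
# environment simulator. `f` maps a state to (next_state, r_sum), where
# r_sum is the sum of the cumulants over the skipped interval.
def rollout(f, x0, num_steps):
    """Apply the jumpy model repeatedly, accumulating the cumulants."""
    states, total_cumulant = [x0], 0.0
    x = x0
    for _ in range(num_steps):
        x, r_sum = f(x)          # one abstract step, possibly many frames
        states.append(x)
        total_cumulant += r_sum
    return states, total_cumulant

# Toy dynamics (hypothetical): each abstract step jumps the scalar state
# by +2 and reports a summed cumulant of 1.0 for the skipped interval.
states, g = rollout(lambda x: (x + 2, 1.0), x0=0, num_steps=3)
assert states == [0, 2, 4, 6] and g == 3.0
```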
To address the classification problem defined in Section 2.1, we only consider tasks where the cumulant is zero everywhere except for the last state of the trajectory, which is a plausible restriction if the cumulant tracks some form of "outcome" of the trajectory.

Figure 4: Visualization of the first three steps of ASI with a horizon of H = 3. The blue lines represent loss components between the ground-truth frames x and predicted frames x̂. For simplicity, we do not consider scheduled sampling here, therefore f is always applied to the previous predicted state.

The dynamical models we consider in this paper consist at their core of a deep neural network f : X → X which is meant to represent the dynamical law of the environment. In order to learn to predict multiple time steps into the future, f is iterated multiple times, which makes the architecture a recurrent neural network. As the model predicts the new state at time t + 1, it needs to be conditioned on the state at the previous time step t. During training, there is a choice for the source of the next input frame for the model: either the ground-truth (observed) frame or the model's own previous prediction can be taken. The former provides more signal when f is weak, while the latter matches more accurately the conditions during inference, when the ground truth is not known. We found the technique of scheduled sampling (Bengio et al., 2015) to be a simple and effective curriculum to address the trade-off described above. Note that other works, such as Chiappa et al. (2017) and Oh et al. (2017), have addressed the issue in different ways.
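The input-source choice during training can be sketched in a few lines. This is a minimal, assumption-laden sketch of the scheduled-sampling idea, not the paper's code: with probability eps the ground-truth frame is fed to the model, otherwise its own previous prediction; the schedule typically anneals eps from 1.0 towards 0.0 over the course of training.

```python
# Minimal sketch of the scheduled-sampling input choice (Bengio et al.,
# 2015): with probability eps, feed the ground-truth frame; otherwise
# feed the model's own previous prediction.
import random

def next_input(ground_truth_frame, predicted_frame, eps):
    """Choose the next input to the dynamics model f."""
    return ground_truth_frame if random.random() < eps else predicted_frame

# Early in training (eps = 1.0) the ground truth is always chosen;
# late in training (eps = 0.0) the model always sees its own output.
assert next_input("gt", "pred", eps=1.0) == "gt"
assert next_input("gt", "pred", eps=0.0) == "pred"
```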
The exact way of dealing with this issue is orthogonal to the use of temporal abstraction.

3 Adaptive skip intervals for recurrent dynamical models

We now introduce a method to inject temporal abstraction into deterministic recurrent environment simulators.

Algorithm 1: Dynamical model learning with ASI
Input: i'th trajectory x^(i) = (x_1, x_2, ..., x_{T_i}) ∈ X^{T_i};
  differentiable model f : X → X with parameters θ;
  loss function L : X × X → R;
  matching horizon H ∈ N;
  exploration schedule µ : N → [0, 1];
  scheduled sampling temperatures ε : N → [0, 1]

t ← 1, u ← 1                               ▷ data time step t, abstract time step u
l ← 0                                      ▷ trajectory loss
p ← x_1                                    ▷ next input to the dynamics model f
while t < |x| do
    x̂_u ← f(p)
    T ← min(t + H, |x|)                    ▷ upper time-step limit
    if Bernoulli(µ(i)) = 0 then
        t ← argmin_{t' ∈ {t+1..T}} L(x̂_u, x_{t'})
    else
        t ∼ unif{t + 1, T}                 ▷ exploration
    end if
    l ← l + L(x̂_u, x_t)                   ▷ accumulate trajectory loss
    p ← binary_choice(x̂_u, x_t; p = ε(i)) ▷ scheduled sampling (Bengio et al., 2015)
    u ← u + 1
end while
θ ← gradient descent step on θ to reduce l

Training process The main idea of ASI is that the dynamical model f is not forced to predict every single time step in the sequence. Instead, it has the freedom to skip an arbitrary number of frames up to some pre-defined horizon H ∈ N. We train f in such a way that it has the incentive to focus on representing those transitions which allow it to predict extended sequences which remain accurate over many time steps into the future. Figure 4 visualizes the first three steps of the ASI training procedure with a horizon of H = 3.
At training time, we feed the first frame x_1 into a differentiable model f, producing the output x̂_1 := f(x_1). In contrast to classical autoregressive modeling, x̂_1 does not have to correspond to the next frame in the ground-truth sequence, x_2, but can be matched with any frame from x_2 to x_{2+H}. Importantly, f is not required to know how many frames it is going to skip – the temporal matching is performed by a "training supervisor" which takes f's prediction and selects the best-fitting ground-truth frame to compute the loss, which is later reduced using gradient-based optimization.
To soften the winner-takes-all mechanism, we use an exploration curriculum. At every step, a Bernoulli trial with probability µ decides whether an exploration or an exploitation step is executed: in an exploration step, the supervisor selects a future frame at random, with a frame-skip value between 1 and H; in an exploitation step, the supervisor takes the best-fitting ground-truth frame, i.e. the candidate frame minimizing the loss with respect to the current prediction, to provide the training signal. At the beginning of training, µ is high, such that exploration is encouraged. Over the course of several epochs, µ is gradually decreased such that f can converge to predicting sharp mechanisms. The goal of the exploration schedule is to avoid being caught in a local optimum early on during training. Over the course of the learning process, we gradually decrease the chance of picking a random frame, effectively transitioning to the winner-takes-all mechanism. We refer to this curriculum scheme as exploration of temporal matching.
The best-fitting frame is then fed into f again, iterating the same procedure as described above, but from a later starting point. At every step, we accumulate a loss l_x, leading to an overall prediction loss L_x which is simply the mean of all the step-losses.
We train the model f via gradient descent to reduce the prediction loss L_x.
In the example with the funnel, this could intuitively work as follows: the transition from the ball which falls into the funnel to the ball which is at the end of the funnel is the most robust one (let us call it the "robust transition") – it occurs virtually every time. All other positions within the funnel are visited less often. Therefore, f will tend to get the most training signal from the robust transition. Hence, f will begin to predict something that resembles the robust transition, which will subsequently be reinforced because it will often be the best-fitting transition which wins in the matching process.
Instead of using a greedy matching algorithm, it is conceivable to use a global optimization method which is applied to the whole sequence of iteratively predicted frames, which would then be aligned in the globally best possible way to the ground-truth data. However, in this case, we would not be able to alternate randomly between the input sources for f, as we currently do with scheduled sampling, because in order to know which ground-truth frame to take next, we already need to know the alignment.
Besides exploration of temporal matching, as mentioned in Section 2.2, we adopt another curriculum scheme, scheduled sampling (Bengio et al., 2015), which gradually shifts the training distribution from observation-dependent transitions towards prediction-dependent transitions.

Predicting the labels Since the learning procedure can choose to skip difficult-to-predict frames, the mean loss in pixel space would not be a fair metric to evaluate whether ASI serves a purpose. As explained in Section 2.1, one of our central assumptions is that we are dealing with environments which have a notion of a qualitative outcome, represented e.g. by the classification problem associated with the task.
Therefore, as a way to measure the learning success, we let a separate\nclassi\ufb01er \u03c8 : X \u2192 P(Y) predict the label of the underlying classi\ufb01cation task based on the frames\npredicted by f. At test time, f can unfold the dynamics over multiple steps and \u03c8 is applied to the\nresulting frames, allowing the combined model to predict the label from the initial frame.\nIn principle, the classi\ufb01er \u03c8 could be trained alongside the model f, or after convergence of f \u2013 the\ntwo training processes do not interfere with each other. For the experiments described in Section 4,\nwe hand-specify a classi\ufb01er \u03c8 ahead of time for each environment. Since our classi\ufb01cation tasks are\neasy, given the last frame of a trajectory, the classi\ufb01ers are simple functions which achieve perfect\naccuracy when fed the ground truth frames.\n\n4 Experiments\n\nWe demonstrate the ef\ufb01cacy of our approach by introducing two environments for which our\napproach is expected to perform well. Code to reproduce our experiments is available at\nhttps://github.com/neitzal/adaptive-skip-intervals.\n\n4.1 Domains\n\nRoom runner\nIn the Room runner task, an agent, represented by a green dot, moves through a\nrandomly generated map of rooms, which are observed in 2D from above. The agent follows the\npolicy of always trying to move towards and into the next room, until it reaches a dead end. Two\nrooms are colored \u2013 the actual dead end which the agent will reach and another room, which is a dead\nend for another path. One of these two rooms is red, the other one blue, but the assignment is chosen\nby a fair coin \ufb02ip. The underlying classi\ufb01cation task is to predict whether the agent will end up in the\nred room or in the blue one. Since there is always exactly one passage between two adjacent rooms,\nthe \ufb01nal room is always well-de\ufb01ned and there is no ambiguity in the outcome. 
We add noise to the\nrunner\u2019s acceleration at every step, simulating an imperfect controller \u2013 for example one which is still\ntaking exploratory actions in order to improve.\nFigure 5 shows examples for the \ufb01rst states and the resulting trajectories.\n\n5\n\n\f(a) Label: blue\n\n(b) Label: red\n\n(c) Label: red\n\nFigure 5: Examples of \ufb01rst states of the Room runner domain, along with the corresponding\ntrajectories which arise from evolving the environment dynamics and the agent\u2019s policy. Darker\nregions in the trajectory correspond to parts where the agent was moving more slowly.\n\nFunnel board In this task, a ball falls through a grid of obstacles onto one of \ufb01ve platforms. Every\nother row of obstacles consists of funnel-shaped objects, which are meant to capture the ball and\nrelease it at a well-de\ufb01ned exit position. Variety arises from the random rotations of the sliders, from\nthe random presence or absence of funnels in every layer except for the last one, and from slight\nperturbations in the funnel and slider positions. The courses are generated such that the ball is always\nguaranteed to hit exactly one of the platforms. Figure 6 shows three examples for the \ufb01rst states and\nthe ball\u2019s resulting paths. In order to simplify the problem, we make the states nearly fully observable\nby preprocessing the video frames such that they include a trace of the ball\u2019s position at the previous\nstep.\nThe underlying classi\ufb01cation task is to predict, given only access to the \ufb01rst frame, on which of the\n\ufb01ve platforms the ball will land eventually. Note that the task does not include predicting the time\nwhen the ball will reach its goal.\n\n(a) Label: 3\n\n(b) Label: 1\n\n(c) Label: 2\n\nFigure 6: Examples of \ufb01rst states of the Funnel board domain, along with the corresponding trajectories which\narise from evolving the environment dynamics. 
The trajectories are merged into one image for visualization purposes only – in the dataset, every frame is separate.

4.2 Experiment setup

The experiments are ablation studies of our method. We would like to investigate the efficacy of adaptive skip intervals and whether the exploration schedule is beneficial for obtaining good results. For each of our two environments, we compare four methods: (a) the recurrent dynamics model with adaptive skip intervals as described in Section 3 (ASI); (b) the dynamics model with adaptive skip intervals, but without any exploration phase, i.e. µ = 0 (ASI w/o exploration); (c) the dynamics model without adaptive skip intervals, such that it is forced to predict every step (fixed (Δt = 1)); (d) the dynamics model without adaptive skip intervals, such that it is forced to predict every second step (fixed (Δt = 2)). In each experiment we train with a training set of 500 trajectories, and we report validation metrics evaluated on a validation set of 500 trajectories. We perform validation steps four times per epoch in order to obtain a higher resolution in the training curves.
For our experiments, we use a neural network with seven convolutional layers as the dynamics model f. Architectural details, which are the same in all experiments, are described in the Appendix. Like Weber et al. (2017), we train f using a pixel-wise binary cross-entropy loss. Hyperparameter settings such as the learning rates are determined for each method individually by using the set of parameters which led to the best result (highest maximum achieved accuracy on the validation set) out of 9 runs each. We use the same search ranges for all experiments and methods.
The remaining hyperparameters, including search ranges, are provided in the Appendix. For instance, as a value for the horizon H in the ASI runs, our search yielded optimal results for values of around 20 in both experiments. After fixing the best hyperparameters, each method is evaluated 8 additional times with different random seeds, which we use to report the results. We additionally included baselines with Δt > 2, but to reduce the amount of computation we did not perform another hyperparameter search for them, instead taking the best parameters for the baseline "fixed (Δt = 2)".

Figure 7: Portion of a sequence from Room runner using ASI, with ground-truth frames on top and the predicted, temporally aligned sequence on the bottom.

Figure 8: Portion of a sequence from Funnel board using ASI, with ground-truth frames on top and the predicted, temporally aligned sequence on the bottom. Darker lines connecting a predicted frame to the ground-truth frames correspond to better matching in terms of pixel loss.

4.3 Results

We begin by visualizing how the network with adaptive skip intervals performs after training. In Figure 8 we show a portion of one trajectory from the Funnel board, as processed by the network. As shown, the network trained with ASI has learned to skip a variable number of frames, specifically avoiding the bouncing in the funnel and directly predicting the exiting ball. Similarly, Figure 7 shows a portion of a sequence from the Room runner domain. As the videos presented at http://tiny.cc/x2suwy demonstrate, ASI is able to produce sharp predictions over many time steps, while the fixed-skip baselines produce blurry predictions.

Quantitative results As shown in Figure 9, ASI outperforms the fixed-step baselines on both datasets.
On Funnel board, the networks equipped with adaptive skip intervals achieve higher accuracy, and in a shorter time, with exploration of temporal matching obtaining even better results. In the Room runner task, we observe a significant improvement of ASI with exploration over the version without exploration and over the baselines. Note that some of the baseline curves get worse after an initial improvement. This can be explained by the fact that the two training curricula, scheduled sampling and exploration of temporal matching, create a nonstationary distribution for the network. We observe that ASI appears more resilient to this effect.

Figure 9: Learning progress; curves show validation accuracies on the two tasks. For each task, we show on the horizontal axis the number of model evaluations and the epoch number. Curves show mean validation accuracy, evaluated on 500 trajectories. The training sets consist of 500 trajectories in each experiment. Shaded areas correspond to the interquartile range over all eight runs.

Computational efficiency Note that the x-axis in Figure 9 represents the number of forward passes through f, which loosely corresponds to the wall-clock time during the training process. Since the adaptive-skip-interval methods are allowed to skip frames, they need fewer model evaluations (and therefore fewer backpropagation passes at training time) than fixed-rate training schemes. In the tasks we considered, this gain in training speed not only comes at no cost in accuracy, but actually improves the overall performance.
Full-resolution timelines can be viewed at http://tiny.cc/x2suwy

Robustness w.r.t. perturbation of dynamics Another advantage we hypothesize for the temporally abstract model is that the training process is more stable when the dynamical system changes in certain ways. This is relevant because in real systems, the i.i.d. assumption is often violated. The same is true for reinforcement learning tasks, in which the distribution over observed transitions changes as the agent improves its policy or due to changes in the environment over time. As a test of our hypothesis, we prepare a second version of the Funnel board dataset with 500 trajectories of slightly altered physics: the bounciness of the funnel walls is reduced to zero. This leads to slightly different behavior in the funnels, but the final platforms are the same in the majority of trajectories. We start with the perturbed version, and before the start of the 75th epoch, we exchange it for the original one. Figure 10 shows the accuracy curves for this experiment. We observe that while the fixed-frame-rate baselines learn the correct classification better than in the more difficult original task, after the switch the validation accuracy quickly deteriorates. Note that freezing the network at epoch 75 would leave the validation accuracy almost unchanged, since both versions of the task have similar labels.

Figure 10: Up to epoch 75 we use a version of the Funnel board task where the funnels' bounciness is set to zero. At epoch 75 we switch the dataset for the standard one but otherwise keep the training procedure going.

5 Related work

The observation that every environment has an optimal sampling frequency has also been made for reinforcement learning. For instance, Braylan et al. (2000) investigate the effect of different frame-skip intervals on the performance of agents learning to play Atari 2600 games.
A constant frame-skip value of four frames is considered standard for deep RL agents (Machado et al., 2017). Focusing on spatio-temporal prediction problems, Oh et al. (2015) introduce a neural network architecture for action-conditional video prediction. Their approach benefits from using curriculum learning to stabilize the training of the network. Buesing et al. (2018) investigate action-conditional state-space models and explicitly consider "jumpy" models which skip a certain number of timesteps in order to be more computationally efficient. In contrast to our work, they do not use adaptive skip intervals, but skip at a fixed frame rate. Belzner (2016) introduces a time-adaptive version of model-based online planning in which the planner can optimize the step length adaptively. Their approach focuses on temporal abstraction in the space of actions and plans. Temporal abstraction in the planning space is also a motivation of the field of hierarchical reinforcement learning (Barto and Mahadevan, 2003), often in the framework of semi-MDPs – Markov Decision Processes with temporally extended actions (e.g. Puterman, 1994).
The idea of skipping time steps has also been investigated in Ke et al. (2017), where the authors present a way to attack the problem of long-term credit assignment in recurrent neural networks by only propagating errors through selected states instead of every single past timestep.
Closely related to our work is the Predictron (Silver et al., 2016), a deep neural network architecture set up to perform a sequence of temporally abstract lookahead steps in a latent space.
It can be trained end-to-end in order to approximate the values in a Markov Reward Process. In contrast to ASI, the outputs of the Predictron are regressed exclusively towards rewards and values, which circumvents the need for an explicit solution to the temporal alignment problem. However, by ignoring future states, the training process discards a large amount of dynamical information from the underlying system.
Similar in spirit to the Predictron, the value prediction network (VPN) (Oh et al., 2017) proposes a neural network architecture to learn a dynamics model whose abstract states make option-conditional predictions of future values rather than of future observations. Their temporal abstraction is "grounded" by using option termination as the skip interval.
Ebert et al. (2017) introduced temporal skip connections for self-supervised visual planning to keep track of objects through occlusion.
Pong et al. (2018) introduce temporal difference models (TDMs), which are dynamical models trained by temporal difference learning. Their approach starts with a temporally fine-grained dynamics model, which is represented with a goal-conditioned value function. The temporal resolution is successively coarsened so as to converge toward a model-free formulation.
Concurrently to our work, Jayaraman et al. (2018) propose a training framework with a similar motivation to ours. They further explore ways to generalize the objective and include experiments on hierarchical planning.

6 Conclusion

We presented a time-skipping framework for the problem of sequential predictions. Our approach builds on concepts from causal discovery (Peters et al., 2017; Parascandolo et al., 2017) and can be included in multiple fields where planning is important.
In cases where our approach fails, e.g. when the alignment of predicted and ground-truth frames is lost and the model does not have the capacity to restore it, more advanced optimization methods such as dynamic time warping (Müller, 2007) during the matching phase may help, at the cost of the simplicity and seamless integration of scheduled sampling, as described in Section 3.
An interesting direction for future work is the combination of temporal abstraction with abstractions in a latent space. As noted for instance by Oh et al. (2017), predicting future observations is too difficult a task for realistic environments due to the high dimensionality of typical observation spaces.
The idea of an optimal prediction skip interval should extend to the case of stochastic generative models, where instead of a deterministic mapping from the current state to the next, the model provides a probability distribution over next states. In this case, ASI should lead to simpler distributions, allowing for simpler models and greater data efficiency, just as in the deterministic case. The evaluation of this claim is left for future work.
Another line of investigation left to future work is to integrate ASI with action-conditional models. As mentioned in Section 2.1, the problem could be addressed by using a separate ASI dynamics model for each policy or option, which would allow for option-conditional planning. However, there may be a more interesting interplay between ideal skip intervals and switching points for options, which suggests that they should ideally be learned jointly.

Acknowledgements

This work is partially supported by the International Max Planck Research School for Intelligent Systems and the Max Planck ETH Center for Learning Systems.

References

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. (2017). A brief survey of deep reinforcement learning. arXiv preprint arXiv:1708.05866.

Barto, A. G.
and Mahadevan, S. (2003). Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(1-2):41–77.

Belzner, L. (2016). Time-adaptive cross entropy planning. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 254–259. ACM.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. (2015). Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179.

Braylan, A., Hollenbeck, M., Meyerson, E., and Miikkulainen, R. (2015). Frame skip is a powerful parameter for learning to play Atari. In AAAI Workshop on Learning for General Competency in Video Games.

Buesing, L., Weber, T., Racaniere, S., Eslami, S., Rezende, D., Reichert, D. P., Viola, F., Besse, F., Gregor, K., Hassabis, D., et al. (2018). Learning and querying fast generative models for reinforcement learning. arXiv preprint arXiv:1802.03006.

Chiappa, S., Racaniere, S., Wierstra, D., and Mohamed, S. (2017). Recurrent environment simulators. arXiv preprint arXiv:1704.02254.

Daw, N. D. (2012). Model-based reinforcement learning as cognitive search: neurocomputational theories. In Cognitive Search: Evolution, Algorithms, and the Brain. MIT Press.

Ebert, F., Finn, C., Lee, A. X., and Levine, S. (2017). Self-supervised visual planning with temporal skip connections. arXiv preprint arXiv:1710.05268.

Finn, C. and Levine, S. (2017). Deep visual foresight for planning robot motion. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2786–2793. IEEE.

Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 315–323.

He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification.
In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034.

Heess, N., Wayne, G., Tassa, Y., Lillicrap, T., Riedmiller, M., and Silver, D. (2016). Learning and transfer of modulated locomotor controllers. arXiv preprint arXiv:1610.05182.

Jayaraman, D., Ebert, F., Efros, A. A., and Levine, S. (2018). Time-agnostic prediction: Predicting predictable video frames. arXiv preprint arXiv:1808.07784.

Ke, N. R., Goyal, A., Bilaniuk, O., Binas, J., Charlin, L., Pal, C., and Bengio, Y. (2017). Sparse attentive backtracking: Long-range credit assignment in recurrent networks. arXiv preprint arXiv:1711.02326.

Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Legg, S. and Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444.

Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M., and Bowling, M. (2017). Revisiting the Arcade Learning Environment: Evaluation protocols and open problems for general agents. arXiv preprint arXiv:1709.06009.

Müller, M. (2007). Dynamic time warping. In Information Retrieval for Music and Motion, pages 69–84. Springer.

Oh, J., Guo, X., Lee, H., Lewis, R. L., and Singh, S. (2015). Action-conditional video prediction using deep networks in Atari games. In Advances in Neural Information Processing Systems, pages 2863–2871.

Oh, J., Singh, S., and Lee, H. (2017). Value prediction network. In Advances in Neural Information Processing Systems, pages 6120–6130.

Parascandolo, G., Rojas-Carulla, M., Kilbertus, N., and Schölkopf, B. (2017). Learning independent causal mechanisms. arXiv preprint arXiv:1712.00961.

Pearl, J. (2009). Causality. Cambridge University Press.

Peters, J., Bühlmann, P., and Meinshausen, N. (2016).
Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):947–1012.

Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.

Pong, V., Gu, S., Dalal, M., and Levine, S. (2018). Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081.

Puterman, M. L. (1994). Markov Decision Processes. John Wiley & Sons.

Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., and Mooij, J. (2012). On causal and anticausal learning. In Langford, J. and Pineau, J., editors, Proceedings of the 29th International Conference on Machine Learning (ICML), pages 1255–1262, New York, NY, USA. Omnipress.

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A., et al. (2016). The Predictron: End-to-end learning and planning. arXiv preprint arXiv:1612.08810.

Sutton, R. S., Precup, D., and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211.

Watter, M., Springenberg, J., Boedecker, J., and Riedmiller, M. (2015). Embed to control: A locally linear latent dynamics model for control from raw images. In Advances in Neural Information Processing Systems, pages 2746–2754.

Weber, T., Racanière, S., Reichert, D. P., Buesing, L., Guez, A., Rezende, D. J., Badia, A. P., Vinyals, O., Heess, N., Li, Y., et al. (2017).
Imagination-augmented agents for deep reinforcement learning. arXiv preprint arXiv:1707.06203.