{"title": "Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects", "book": "Advances in Neural Information Processing Systems", "page_first": 8606, "page_last": 8616, "abstract": "We present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for image sequences.\nIt can reliably discover and track objects through the sequence; it can also conditionally generate future frames, thereby simulating expected motion of objects.\nThis is achieved by explicitly encoding object numbers, locations and appearances in the latent variables of the model.\nSQAIR retains all strengths of its predecessor, Attend, Infer, Repeat (AIR, Eslami et al., 2016), including unsupervised learning, made possible by inductive biases present in the model structure.\nWe use a moving multi-\textsc{mnist} dataset to show limitations of AIR in detecting overlapping or partially occluded objects, and show how \textsc{sqair} overcomes them by leveraging temporal consistency of objects.\nFinally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns to reliably detect, track and generate walking pedestrians with no supervision.", "full_text": "Sequential Attend, Infer, Repeat: Generative Modelling of Moving Objects\n\nAdam R. Kosiorek\u2217 \u00a7 \u2020\n\nHyunjik Kim\u2020\n\nIngmar Posner\u00a7\n\nYee Whye Teh\u2020\n\n\u00a7 Applied Arti\ufb01cial Intelligence Lab\n\nOxford Robotics Institute\n\nUniversity of Oxford\n\n\u2020 Department of Statistics\n\nUniversity of Oxford\n\nAbstract\n\nWe present Sequential Attend, Infer, Repeat (SQAIR), an interpretable deep generative model for videos of moving objects. It can reliably discover and track objects throughout the sequence of frames, and can also generate future frames conditioning on the current frame, thereby simulating expected motion of objects. 
This is\nachieved by explicitly encoding object presence, locations and appearances in the\nlatent variables of the model. SQAIR retains all strengths of its predecessor, Attend,\nInfer, Repeat (AIR, Eslami et al., 2016), including learning in an unsupervised\nmanner, and addresses its shortcomings. We use a moving multi-MNIST dataset to\nshow limitations of AIR in detecting overlapping or partially occluded objects, and\nshow how SQAIR overcomes them by leveraging temporal consistency of objects.\nFinally, we also apply SQAIR to real-world pedestrian CCTV data, where it learns\nto reliably detect, track and generate walking pedestrians with no supervision.\n\n1\n\nIntroduction\n\nThe ability to identify objects in their environments and to understand relations between them is a\ncornerstone of human intelligence (Kemp and Tenenbaum, 2008). Arguably, in doing so we rely on\na notion of spatial and temporal consistency which gives rise to an expectation that objects do not\nappear out of thin air, nor do they spontaneously vanish, and that they can be described by properties\nsuch as location, appearance and some dynamic behaviour that explains their evolution over time.\nWe argue that this notion of consistency can be seen as an inductive bias that improves the ef\ufb01ciency\nof our learning. Equally, we posit that introducing such a bias towards spatio-temporal consistency\ninto our models should greatly reduce the amount of supervision required for learning.\n\nOne way of achieving such inductive biases is through model structure. While recent successes in deep\nlearning demonstrate that progress is possible without explicitly imbuing models with interpretable\nstructure (LeCun, Bengio, et al., 2015), recent works show that introducing such structure into deep\nmodels can indeed lead to favourable inductive biases improving performance e.g. in convolutional\nnetworks (LeCun, Boser, et al., 1989) or in tasks requiring relational reasoning (Santoro et al.,\n2017). 
Structure can also make neural networks useful in new contexts by signi\ufb01cantly improving\ngeneralization, data ef\ufb01ciency (Jacobsen et al., 2016) or extending their capabilities to unstructured\ninputs (Graves et al., 2016).\n\nAttend, Infer, Repeat (AIR), introduced by Eslami et al., 2016, is a notable example of such a structured\nprobabilistic model that relies on deep learning and admits ef\ufb01cient amortized inference. Trained\nwithout any supervision, AIR is able to decompose a visual scene into its constituent components and\nto generate a (learned) number of latent variables that explicitly encode the location and appearance\nof each object. While this approach is inspiring, its focus on modelling individual (and thereby\ninherently static) scenes leads to a number of limitations. For example, it often merges two objects\nthat are close together into one since no temporal context is available to distinguish between them.\n\n\u2217Corresponding author: adamk@robots.ox.ac.uk\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fSimilarly, we demonstrate that AIR struggles to identify partially occluded objects, e.g. when they\nextend beyond the boundaries of the scene frame (see Figure 7 in Section 4.1).\n\nOur contribution is to mitigate the shortcomings of AIR by introducing a sequential version that\nmodels sequences of frames, enabling it to discover and track objects over time as well as to\ngenerate convincing extrapolations of frames into the future. We achieve this by leveraging temporal\ninformation to learn a richer, more capable generative model. 
Speci\ufb01cally, we extend AIR into a spatio-temporal state-space model and train it on unlabelled image sequences of dynamic objects. We show that the resulting model, which we name Sequential AIR (SQAIR), retains the strengths of the original AIR formulation while outperforming it on moving MNIST digits.\n\nThe rest of this work is organised as follows. In Section 2, we describe the generative model and inference of AIR. In Section 3, we discuss its limitations and how it can be improved, thereby introducing Sequential Attend, Infer, Repeat (SQAIR), our extension of AIR to image sequences. In Section 4, we demonstrate the model on a dataset of multiple moving MNIST digits (Section 4.1) and compare it against AIR trained on each frame and the Variational Recurrent Neural Network (VRNN) of Chung et al., 2015 with convolutional architectures, and show the superior performance of SQAIR in terms of log marginal likelihood and interpretability of latent variables. We also investigate the utility of inferred latent variables of SQAIR in downstream tasks. In Section 4.2 we apply SQAIR to real-world pedestrian CCTV data, where SQAIR learns to reliably detect, track and generate walking pedestrians without any supervision. Code for the implementation on the MNIST dataset\u00b2 and the results video\u00b3 are available online.\n\n2 Attend, Infer, Repeat (AIR)\n\nAIR, introduced by Eslami et al., 2016, is a structured variational auto-encoder (VAE) capable of decomposing a static scene x into its constituent objects, where each object is represented as a separate triplet of continuous latent variables z = {z^{what,i}, z^{where,i}, z^{pres,i}}_{i=1}^{n}, n \u2208 N being the (random) number of objects in the scene. Each triplet of latent variables explicitly encodes position, appearance and presence of the respective object, and the model is able to infer the number of objects present in the scene. 
Hence it is able to count, locate and describe objects in the scene, all learnt in an unsupervised manner, made possible by the inductive bias introduced by the model structure.\n\nGenerative Model The generative model of AIR is de\ufb01ned as follows:\n\np\u03b8(n) = Geom(n | \u03b8),   p\u03b8(z^w | n) = \u220f_{i=1}^{n} p\u03b8(z^{w,i}) = \u220f_{i=1}^{n} N(z^{w,i} | 0, I),\np\u03b8(x | z) = N(x | y_t, \u03c3\u00b2_x I),   with y_t = \u2211_{i=1}^{n} h^{dec}_\u03b8(z^{what,i}, z^{where,i}),   (1)\n\nwhere z^{w,i} := (z^{what,i}, z^{where,i}), z^{pres,i} = 1 for i = 1 . . . n and h^{dec}_\u03b8 is the object decoder with parameters \u03b8. It is composed of a glimpse decoder f^{dec}_\u03b8 : g^i_t \u21a6 y^i_t, which constructs an image patch, and a spatial transformer (ST, Jaderberg et al., 2015), which scales and shifts it according to z^{where}; see Figure 1 for details.\n\nInference Eslami et al., 2016 use a sequential inference algorithm, where latent variables are inferred one at a time; see Figure 2. The number of inference steps n is given by z^{pres,1:n+1}, a random vector of n ones followed by a zero. The z^i are sampled sequentially from\n\nq\u03c6(z | x) = q\u03c6(z^{pres,n+1} = 0 | z^{w,1:n}, x) \u220f_{i=1}^{n} q\u03c6(z^{w,i}, z^{pres,i} = 1 | z^{1:i\u22121}, x),   (2)\n\nwhere q\u03c6 is implemented as a neural network with parameters \u03c6. To implement explaining away, e.g. to avoid encoding the same object twice, it is vital to capture the dependency of z^{w,i} and z^{pres,i} on z^{1:i\u22121} and x. This is done using a recurrent neural network (RNN) R\u03c6 with hidden state h^i, namely: \u03c9^i, h^i = R\u03c6(x, z^{i\u22121}, h^{i\u22121}). The outputs \u03c9^i, which are computed iteratively and depend on the previous latent variables (cf. Algorithm 3), parametrise q\u03c6(z^{w,i}, z^{pres,i} | z^{1:i\u22121}, x). 
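To make the structure of Eq. (1) concrete, here is a toy NumPy sketch of AIR's generative process. The glimpse "decoder" and the patch-placement step below are crude stand-ins for the learned networks (the real h^{dec}_\u03b8 is a neural network and the spatial transformer also rescales the patch), so this illustrates only the sampling logic, not the paper's architecture.

```python
import numpy as np

def toy_glimpse_decoder(z_what, patch=7):
    """Stand-in for the learned glimpse decoder: map z_what to a patch."""
    rng = np.random.default_rng(abs(int(z_what.sum() * 1e4)) % 2**32)
    return rng.uniform(0.0, 1.0, size=(patch, patch))

def place_glimpse(canvas, glimpse, z_where):
    """Crude stand-in for the spatial transformer: paste the patch at an
    integer location derived from z_where (shift only, no scaling)."""
    H, W = canvas.shape
    ph, pw = glimpse.shape
    r = int((z_where[0] + 1) / 2 * (H - ph))
    c = int((z_where[1] + 1) / 2 * (W - pw))
    canvas[r:r + ph, c:c + pw] += glimpse
    return canvas

def air_generate(rng, img_size=50, sigma_x=0.1, p_stop=0.5):
    """Sample one scene: n ~ Geom, z^{w,i} ~ N(0, I), x ~ N(y, sigma_x^2 I)."""
    y = np.zeros((img_size, img_size))
    n = 0
    while rng.uniform() > p_stop:              # z_pres chain: n ones, then a zero
        z_what = rng.normal(size=4)            # appearance code
        z_where = np.tanh(rng.normal(size=2))  # location, squashed to [-1, 1]
        y = place_glimpse(y, toy_glimpse_decoder(z_what), z_where)
        n += 1
    x = y + rng.normal(scale=sigma_x, size=y.shape)
    return x, n
```

With p_stop = 0.5 the object count follows a geometric distribution, mirroring p\u03b8(n) above.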
For simplicity the latter is assumed to factorise such that q\u03c6(z^w, z^pres | z^{1:i\u22121}, x) = q\u03c6(z^{pres,n+1} = 0 | \u03c9^{n+1}) \u220f_{i=1}^{n} q\u03c6(z^{w,i} | \u03c9^i) q\u03c6(z^{pres,i} = 1 | \u03c9^i).\n\n\u00b2 code: github.com/akosiorek/sqair\n\u00b3 video: youtu.be/-IUNQgSLE0c\n\n\fFigure 1: Left: Generation in AIR. The image mean y_t is generated by \ufb01rst using the glimpse decoder f^{dec}_\u03b8 to map the what variables into glimpses g_t, transforming them with the spatial transformer ST according to the where variables and summing up the results. Right: Generation in SQAIR. When new objects enter the frame, new latent variables (here, z^4_t) are sampled from the discovery prior. The temporal evolution of already present objects is governed by the propagation prior, which can choose to forget some variables (here, z^3_t and z^4_{t+1}) when the object moves out of the frame. The image generation process, which mimics the left-hand side of the \ufb01gure, is abstracted in the decoder block.\n\nFigure 2: Left: Inference in AIR. The pink RNN attends to the image sequentially and produces one latent variable z^i_t at a time. Here, it decides that two latent variables are enough to explain the image and z^3_t is not generated. Right: Inference in SQAIR starts with the Propagation (PROP) phase. PROP iterates over latent variables from the previous time-step t \u2212 1 and updates them based on the new observation x_t. The blue RNN runs forward in time to update the hidden state of each object, to model its change in appearance and location throughout time. The orange RNN runs across all current objects and models the relations between different objects. Here, when attending to z^1_{t\u22121}, it decides that the corresponding object has disappeared from the frame and forgets it. 
Next, the Discovery (DISC) phase detects new objects as in AIR, but in SQAIR it is also conditioned on the results of PROP, to prevent rediscovering objects. See Figure 3 for details of the colored RNNs.\n\n3 Sequential Attend-Infer-Repeat\n\nWhile capable of decomposing a scene into objects, AIR only describes single images. Should we want a similar decomposition of an image sequence, it would be desirable to do so in a temporally consistent manner. For example, we might want to detect objects of the scene as well as infer dynamics and track identities of any persistent objects. Thus, we introduce Sequential Attend, Infer, Repeat (SQAIR), whereby AIR is augmented with a state-space model (SSM) to achieve temporal consistency in the generated images of the sequence. The resulting probabilistic model is composed of two parts: Discovery (DISC), which is responsible for detecting (or introducing, in the case of generation) new objects at every time-step (essentially equivalent to AIR), and Propagation (PROP), responsible for updating (or forgetting) latent variables from the previous time-step given the new observation (image), effectively implementing the temporal SSM. We now formally introduce SQAIR by \ufb01rst describing its generative model and then the inference network.\n\nGenerative Model The model assumes that at every time-step, objects are \ufb01rst propagated from the previous time-step (PROP). Then, new objects are introduced (DISC). Let t \u2208 N be the current time-step. 
Let P_t be the set of objects propagated from the previous time-step, let D_t be the set of objects discovered at the current time-step, and let O_t = P_t \u222a D_t be the set of all objects present at time-step t. Consequently, at every time step, the model retains a set of latent variables z^{P_t}_t = {z^i_t}_{i\u2208P_t}, and generates a set of new latent variables z^{D_t}_t = {z^i_t}_{i\u2208D_t}. Together they form z_t := [z^{P_t}_t, z^{D_t}_t], where the representation of the ith object z^i_t := [z^{what,i}_t, z^{where,i}_t, z^{pres,i}_t] is composed of three components (as in AIR): z^{what,i}_t and z^{where,i}_t are real vector-valued variables representing appearance and location of the object, respectively, and z^{pres,i}_t is a binary variable representing whether the object is present at the given time-step or not.\n\nAt the \ufb01rst time-step (t = 1) there are no objects to propagate, so we sample D_1, the number of objects at t = 1, from the discovery prior p^D(D_1). Then for each object i \u2208 D_1, we sample latent variables z^{what,i}_1, z^{where,i}_1 from p^D(z^i_1 | D_1). At time t = 2, the propagation step models which objects from t = 1 are propagated to t = 2, and which objects disappear from the frame, using the binary random variables (z^{pres,i}_t)_{i\u2208P_t}. The discovery step at t = 2 models new objects that enter the frame, with a similar procedure to t = 1: sample D_2 (which depends on z^{P_2}_2) and then sample (z^{what,i}_2, z^{where,i}_2)_{i\u2208D_2}. This procedure of propagation and discovery recurs for t = 2, . . . , T. Once the z_t have been formed, we may generate images x_t using the exact same generative distribution p\u03b8(x_t | z_t) as in AIR (cf. Equation (1), Fig. 1, and Algorithm 1). In full, the generative model is:\n\np(x_{1:T}, z_{1:T}, D_{1:T}) = p^D(D_1, z^{D_1}_1) \u220f_{t=2}^{T} p^D(D_t, z^{D_t}_t | z^{P_t}_t) p^P(z^{P_t}_t | z_{t\u22121}) p\u03b8(x_t | z_t).   (3)\n\nThe discovery prior p^D(D_t, z^{D_t}_t | z^{P_t}_t) samples latent variables for new objects that enter the frame. The propagation prior p^P(z^{P_t}_t | z_{t\u22121}) samples latent variables for objects that persist in the frame and removes latents of objects that disappear from the frame, thereby modelling dynamics and appearance changes. Both priors are learned during training. The exact forms of the priors are given in Appendix B.\n\nInference Similarly to AIR, inference in SQAIR can capture the number of objects and the representation describing the location and appearance of each object that is necessary to explain every image in a sequence. As with generation, inference is divided into PROP and DISC. During PROP, the inference network achieves two tasks. Firstly, the latent variables from the previous time step are used to infer the current ones, modelling the change in location and appearance of the corresponding objects, thereby attaining temporal consistency. This is implemented by the temporal RNN R^T_\u03c6, with hidden states h^T_t (recurs in t). Crucially, it does not access the current image directly, but uses the output of the relation RNN (cf. Santoro et al., 2017). The relation RNN takes relations between objects into account, thereby implementing the explaining away phenomenon; it is essential for capturing any interactions between objects as well as occlusion (or overlap, if one object is occluded by another). See Figure 7 for an example. These two RNNs together decide whether to retain or to forget objects that have been propagated from the previous time step. During DISC, the network infers further latent variables that are needed to describe any new objects that have entered the frame. 
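As an illustration of the propagate-then-discover structure of the prior in Eq. (3), the following sketch rolls out object latents over time. The fixed Bernoulli presence probability, the Gaussian random-walk dynamics and the geometric-style discovery count below are toy stand-ins for the learned priors p^P and p^D, which are neural networks in the actual model.

```python
import numpy as np

def sqair_prior_rollout(rng, T=5, d=4, p_keep=0.8, p_new=0.3, max_new=3):
    """Sample object latents for T steps: PROP then DISC at every step."""
    objects = {}                  # id -> latent vector (what/where collapsed)
    next_id = 0
    history = []
    for t in range(T):
        # PROP: each object survives with probability p_keep (its z_pres),
        # and its latent drifts under a toy random-walk dynamics prior.
        for i in list(objects):
            if rng.uniform() < p_keep:
                objects[i] = objects[i] + 0.1 * rng.normal(size=d)
            else:
                del objects[i]    # object leaves the frame and is forgotten
        # DISC: a geometric-like number of new objects enters the frame.
        new = 0
        while new < max_new and rng.uniform() < p_new:
            objects[next_id] = rng.normal(size=d)   # fresh latent from N(0, I)
            next_id += 1
            new += 1
        history.append({i: z.copy() for i, z in objects.items()})
    return history
```

Because object identifiers persist across steps until the presence variable switches off, a rollout carries the temporal consistency that AIR, applied frame by frame, cannot provide.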
All latent variables remaining after PROP and DISC are passed on to the next time step.\n\nSee Figures 2 and 3 for the inference network structure. The full variational posterior is de\ufb01ned as\n\nq\u03c6(D_{1:T}, z_{1:T} | x_{1:T}) = \u220f_{t=1}^{T} q^D_\u03c6(D_t, z^{D_t}_t | x_t, z^{P_t}_t) \u220f_{i\u2208O_{t\u22121}} q^P_\u03c6(z^i_t | z^i_{t\u22121}, h^{T,i}_t, h^{R,i}_t).   (4)\n\nDiscovery, described by q^D_\u03c6, is very similar to the full posterior of AIR, cf. Equation (2). The only difference is the conditioning on z^{P_t}_t, which allows for a different number of discovered objects at each time-step and also for objects explained by PROP not to be explained again. The second term, or q^P_\u03c6, describes propagation. The detailed structures of q^D_\u03c6 and q^P_\u03c6 are shown in Figure 3, while all the pertinent algorithms and equations can be found in Appendices A and C, respectively.\n\nLearning We train SQAIR as an importance-weighted auto-encoder (IWAE) of Burda et al., 2016. Speci\ufb01cally, we maximise the importance-weighted evidence lower-bound L_IWAE, namely\n\nL_IWAE = E_{x_{1:T} \u223c p_data(x_{1:T})}[ E_q[ log (1/K) \u2211_{k=1}^{K} p\u03b8(x_{1:T}, z_{1:T}) / q\u03c6(z_{1:T} | x_{1:T}) ] ].   (5)\n\nTo optimise the above, we use RMSPROP, K = 5 and batch size of 32. We use the VIMCO gradient estimator of Mnih and Rezende, 2016 to backpropagate through the discrete latent variables z^pres, and use reparameterisation for the continuous ones (Kingma and Welling, 2013). We also tried to use NVIL of Mnih and Gregor, 2014 as in the original work on AIR, but found it very sensitive to hyper-parameters, fragile and generally under-performing.\n\n\fFigure 3: Left: Interaction between PROP and DISC in SQAIR. Firstly, objects are propagated to time t, and object i = 7 is dropped. Secondly, DISC tries to discover new objects. Here, it manages to \ufb01nd two objects: i = 9 and i = 10. The process recurs for all remaining time-steps. Blue arrows update the temporal hidden state, orange ones infer relations between objects, pink ones correspond to discovery. Bottom: Information \ufb02ow in a single discovery block (left) and propagation block (right). In DISC we \ufb01rst predict where and extract a glimpse. We then predict what and presence. PROP starts with extracting a glimpse at a candidate location and updating where. Then it follows a procedure similar to DISC, but takes the respective latent variables from the previous time-step into account. It is approximately two times more computationally expensive than DISC. For details, see Algorithms 2 and 3 in Appendix A.\n\n4 Experiments\n\nWe evaluate SQAIR on two datasets. Firstly, we perform an extensive evaluation on moving MNIST digits, where we show that it can learn to reliably detect, track and generate moving digits (Section 4.1). Moreover, we show that SQAIR can simulate moving objects into the future \u2014 an outcome it has not been trained for. We also study the utility of learned representations for a downstream task. Secondly, we apply SQAIR to real-world pedestrian CCTV data from static cameras (DukeMTMC, Ristani et al., 2016), where we perform background subtraction as pre-processing. In this experiment, we show that SQAIR learns to detect, track, predict and generate walking pedestrians without human supervision.\n\n4.1 Moving multi-MNIST\n\nThe dataset consists of sequences of length 10 of multiple moving MNIST digits. All images are of size 50 \u00d7 50 and there are zero, one or two digits in every frame (with equal probability). Sequences are generated such that no objects overlap in the \ufb01rst frame, and all objects are present through the sequence; the digits can move out of the frame, but always come back. See Appendix F for an experiment on a harder version of this dataset. 
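The paper does not publish its exact data generator, but a toy version of such a dataset can be produced as below: digits bounce off a box slightly larger than the canvas, so they may exit the frame partially and always come back. The margin parameter and the absence of a no-overlap check for the first frame are simplifications of ours.

```python
import numpy as np

def paste(frame, img, r, c):
    """Draw img onto frame at (r, c), clipping parts outside the canvas."""
    H, W = frame.shape
    h, w = img.shape
    r0, c0 = max(r, 0), max(c, 0)
    r1, c1 = min(r + h, H), min(c + w, W)
    if r0 < r1 and c0 < c1:
        frame[r0:r1, c0:c1] = np.maximum(
            frame[r0:r1, c0:c1], img[r0 - r:r1 - r, c0 - c:c1 - c])

def make_sequence(digits, T=10, canvas=50, margin=10, rng=None):
    """digits: list of (28, 28) arrays. Returns (T, canvas, canvas) frames.
    Digits bounce off an enlarged box, so they can leave partially but return."""
    rng = rng or np.random.default_rng()
    d = digits[0].shape[0] if digits else 28
    lo, hi = -margin, canvas - d + margin
    pos = rng.uniform(0, canvas - d, size=(len(digits), 2))
    vel = rng.uniform(-3, 3, size=(len(digits), 2))
    frames = np.zeros((T, canvas, canvas))
    for t in range(T):
        for k, img in enumerate(digits):
            paste(frames[t], img, int(pos[k, 0]), int(pos[k, 1]))
        pos += vel
        bounce = (pos < lo) | (pos > hi)       # reflect velocity at the walls
        vel[bounce] *= -1
        pos = np.clip(pos, lo, hi)
    return frames
```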
There are 60,000 training and 10,000 testing sequences created from the respective MNIST datasets. We train two variants of SQAIR: the MLP-SQAIR uses only fully-connected networks, while the CONV-SQAIR replaces the networks used to encode images and glimpses with convolutional ones; it also uses a subpixel-convolution network as the glimpse decoder (Shi et al., 2016). See Appendix D for details of the model architectures and the training procedure.\n\n\fFigure 4: Input images (top) and SQAIR reconstructions with marked glimpse locations (bottom). For more examples, see Figure 13 in Appendix H.\n\nFigure 5: Samples from SQAIR. Both motion and appearance are consistent through time, thanks to the propagation part of the model. For more examples, see Figure 15 in Appendix H.\n\nWe use AIR and VRNN (Chung et al., 2015) as baselines for comparison. VRNN can be thought of as a sequential VAE with an RNN as its deterministic backbone. Being similar to a VAE, its latent variables are not structured, nor easily interpretable. For a fair comparison, we control the latent dimensionality of VRNN and the number of learnable parameters. We provide implementation details in Appendix D.3.\n\nThe quantitative analysis consists of comparing all models in terms of the marginal log-likelihood log p\u03b8(x_{1:T}) evaluated as the L_IWAE bound with K = 1000 particles, reconstruction quality evaluated as a single-sample approximation of E_{q\u03c6}[log p\u03b8(x_{1:T} | z_{1:T})] and the KL-divergence between the approximate posterior and the prior (Table 1). Additionally, we measure the accuracy of the number of objects modelled by SQAIR and AIR. SQAIR achieves superior performance across a range of metrics \u2014 its convolutional variant outperforms both AIR and the corresponding VRNN in terms of model evidence and reconstruction performance. The KL divergence for SQAIR is almost twice as low as for VRNN and by a yet larger factor for AIR. 
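For reference, the L_IWAE quantity used above (K = 5 during training, K = 1000 for evaluation) is a log-mean-exp of importance weights. A numerically stable sketch, with the per-sample log-weights assumed to be precomputed from K posterior samples, is:

```python
import numpy as np

def iwae_bound(log_p_joint, log_q):
    """IWAE bound for one sequence from K importance samples.

    log_p_joint[k]: log p(x_{1:T}, z^k_{1:T}) for the k-th posterior sample;
    log_q[k]: log q(z^k_{1:T} | x_{1:T}).
    Returns log (1/K) sum_k exp(log_p_joint[k] - log_q[k]), computed stably
    by subtracting the maximum log-weight before exponentiating.
    """
    log_w = np.asarray(log_p_joint) - np.asarray(log_q)
    m = log_w.max()
    return float(m + np.log(np.exp(log_w - m).mean()))
```

With K = 1 this reduces to a single-sample ELBO estimate; the bound is non-decreasing in K (Burda et al., 2016).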
We can interpret KL values as an indicator of the ability to compress, and we can treat the SQAIR/AIR type of scheme as a version of run-length encoding. While VRNN has to use information to explicitly describe every part of the image, even if some parts are empty, SQAIR can explicitly allocate content information (z^what) to speci\ufb01c parts of the image (indicated by z^where). AIR exhibits the highest values of KL, but this is due to encoding every frame of the sequence independently \u2014 its prior cannot take what and where at the previous time-step into account, hence higher KL. The \ufb01fth column of Table 1 details the object counting accuracy, which is indicative of the quality of the approximate posterior. It is measured as the sum of z^pres for a given frame against the true number of objects in that frame. As there is no z^pres for VRNN, no score is provided. Perhaps surprisingly, this metric is much higher for SQAIR than for AIR. This is because AIR mistakenly infers overlapping objects as a single object. Since SQAIR can incorporate temporal information, it does not exhibit this failure mode (cf. Figure 7). Next, we gauge the utility of the learnt representations by using them to determine the sum of the digits present in the image (Table 1, column six). To do so, we train a 19-way classi\ufb01er (mapping from any combination of up to two digits in the range [0, 9] to their sum) on the extracted representations and use the summed labels of digits present in the frame as the target. Appendix D contains details of the experiment. SQAIR signi\ufb01cantly outperforms AIR and both variants of VRNN on this task. VRNN under-performs due to its inability to disentangle overlapping objects, while both VRNN and AIR suffer from low temporal consistency of learned representations, see Appendix H. Finally, we evaluate SQAIR qualitatively by analyzing reconstructions and samples produced by the model against reconstructions and samples from VRNN. We observe that samples and reconstructions from SQAIR are of better quality and, unlike VRNN, preserve motion and appearance consistently through time. See Appendix H for direct comparison and additional examples. Furthermore, we examine conditional generation, where we look at samples from the generative model of SQAIR conditioned on three images from a real sequence (see Figure 6). We see that the model can preserve appearance over time, and that the simulated objects follow similar trajectories, which hints at good learning of the motion model (see Appendix H for more examples). Figure 7 shows reconstructions and corresponding glimpses of AIR and SQAIR.\n\nFigure 6: The \ufb01rst three frames are input to SQAIR, which generated the rest conditional on the \ufb01rst frames.\n\n\fFigure 7: Inputs, reconstructions with marked glimpse locations and reconstructed glimpses for AIR (left) and SQAIR (right). SQAIR can model partially visible and heavily overlapping objects by aggregating temporal information.\n\nModel | log p\u03b8(x1:T) | log p\u03b8(x1:T | z1:T) | KL(q\u03c6 || p\u03b8) | Counting | Addition\nCONV-SQAIR | 6784.8 | 6923.8 | 134.6 | 0.9974 | 0.9990\nMLP-SQAIR | 6617.6 | 6786.5 | 164.5 | 0.9986 | 0.9998\nMLP-AIR | 6443.6 | 6830.6 | 352.6 | 0.9058 | 0.8644\nCONV-VRNN | 6561.9 | 6737.8 | 270.2 | n/a | 0.8536\nMLP-VRNN | 5959.3 | 6108.7 | 218.3 | n/a | 0.8059\n\nTable 1: SQAIR achieves higher performance than the baselines across a range of metrics. The third column refers to the Kullback-Leibler (KL) divergence between the approximate posterior and the prior. Counting refers to accuracy of the inferred number of objects present in the scene, while addition stands for the accuracy of a supervised digit addition experiment, where a classi\ufb01er is trained on the learned latent representations of each frame. 
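The counting metric described above (summing z^pres per frame and comparing to the ground-truth count) is straightforward to compute; a sketch with assumed array shapes:

```python
import numpy as np

def counting_accuracy(z_pres, true_counts):
    """z_pres: (n_frames, max_objects) binary presence indicators inferred
    for each frame; true_counts: (n_frames,) ground-truth object counts.
    A frame counts as correct when the summed z_pres matches the truth."""
    inferred = np.asarray(z_pres).sum(axis=1)
    return float((inferred == np.asarray(true_counts)).mean())
```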
Unlike SQAIR, AIR is unable to recognize objects from partial observations, nor can it distinguish strongly overlapping objects (it treats them as a single object; columns \ufb01ve and six in the \ufb01gure). We analyze failure cases of SQAIR in Appendix G.\n\n4.2 Generative Modelling of Walking Pedestrians\n\nTo evaluate the model in a more challenging, real-world setting, we turn to data from static CCTV cameras of the DukeMTMC dataset (Ristani et al., 2016). As part of pre-processing, we use standard background subtraction algorithms (Itseez, 2015). In this experiment, we use 3150 training and 350 validation sequences of length 5. For details of model architectures, training and data pre-processing, see Appendix E. We evaluate the model qualitatively by examining reconstructions, conditional samples (conditioned on the \ufb01rst four frames) and samples from the prior (Figure 8 and Appendix I). We see that the model learns to reliably detect and track walking pedestrians, even when they are close to each other.\n\nThere are some spurious detections and re-detections of the same objects, which is mostly caused by imperfections of the background subtraction pipeline \u2014 backgrounds are often noisy and there are sudden appearance changes when a part of a person is treated as background in the pre-processing pipeline. The object counting accuracy in this experiment is 0.5712 on the validation dataset, and we noticed that it does increase with the size of the training set. 
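The paper does not specify which OpenCV background-subtraction method was used. As a minimal self-contained stand-in with the same effect (suppressing the static background so that only moving pedestrians remain), one can use a per-pixel temporal median filter:

```python
import numpy as np

def subtract_background(frames, thresh=0.1):
    """frames: (T, H, W) grayscale video with a mostly static background.
    Estimate the background as the per-pixel temporal median and keep
    only pixels that deviate from it by more than thresh."""
    frames = np.asarray(frames, dtype=float)
    background = np.median(frames, axis=0)
    fg_mask = np.abs(frames - background) > thresh
    return frames * fg_mask
```

Production pipelines typically use adaptive mixture models (e.g. MOG2-style subtractors in OpenCV) that also update the background over time and handle shadows.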
We also had to use early stopping to prevent over\ufb01tting, and the model was trained for only 315k iterations (> 1M for MNIST experiments). Hence, we conjecture that accuracy and marginal likelihood can be further improved by using a bigger dataset.\n\n\fFigure 8: Inputs on the top, reconstructions in the second row, samples in the third row; rows four and \ufb01ve contain inputs and conditional generation: the \ufb01rst four frames in the last row are reconstructions, while the remaining ones are predicted by sampling from the prior. There is no ground-truth, since we used sequences of length \ufb01ve for training and validation.\n\n5 Related Work\n\nObject Tracking There have been many approaches to modelling objects in images and videos. Object detection and tracking are typically learned in a supervised manner, where object bounding boxes and often additional labels are part of the training data. Single-object trackers commonly use Siamese networks, which can be seen as an RNN unrolled over two time-steps (Valmadre et al., 2017). Recently, Kosiorek et al., 2017 used an RNN with an attention mechanism in the HART model to predict bounding boxes for single objects, while robustly modelling their motion and appearance. Multi-object tracking is typically attained by detecting objects and performing data association on bounding-boxes (Bewley et al., 2016). Schulter et al., 2017 used an end-to-end supervised approach that detects objects and performs data association. In the unsupervised setting, where the training data consists of only images or videos, the dominant approach is to distill the inductive bias of spatial consistency into a discriminative model. Cho et al., 2015 detect single objects and their parts in images, and Kwak et al., 2015; Xiao and Jae Lee, 2016 incorporate temporal consistency to better track single objects. 
SQAIR is unsupervised and hence does not rely on bounding boxes or additional labels for training, while being able to learn arbitrary motion and appearance models similarly to HART (Kosiorek et al., 2017). At the same time, it is inherently multi-object and performs data association implicitly (cf. Appendix A). Unlike the other unsupervised approaches, temporal consistency is baked into the model structure of SQAIR and further enforced by a lower KL divergence when an object is tracked.\n\nVideo Prediction Many works on video prediction learn a deterministic model conditioned on the current frame to predict the future ones (Ranzato et al., 2014; Srivastava et al., 2015). Since these models do not model uncertainty in the prediction, they can suffer from the multiple futures problem \u2014 since perfect prediction is impossible, the model produces blurry predictions which are a mean of possible outcomes. This is addressed in stochastic latent variable models trained using variational inference to generate multiple plausible videos given a sequence of images (Babaeizadeh et al., 2017; Denton and Fergus, 2018). Unlike SQAIR, these approaches do not model objects or their positions explicitly, thus the representations they learn are of limited interpretability.\n\nLearning Decomposed Representations of Images and Videos Learning decomposed representations of object appearance and position lies at the heart of our model. This problem can also be seen as perceptual grouping, which involves modelling pixels as spatial mixtures of entities. Greff, Rasmus, et al., 2016 and Greff, Steenkiste, et al., 2017 learn to decompose images into separate entities by iterative re\ufb01nement of spatial clusters using either learned updates or the Expectation Maximization algorithm; Ilin et al., 2017 and Steenkiste et al., 2018 extend these approaches to videos, achieving very similar results to SQAIR. 
Perhaps the most similar work to ours is the concurrently\ndeveloped model of Hsieh et al., 2018. The above approaches rely on iterative inference procedures,\n\n8\n\n\fbut do not exhibit the object-counting behaviour of SQAIR. For this reason, their computational\ncomplexities are proportional to the prede\ufb01ned maximum number of objects, while SQAIR can be\nmore computationally ef\ufb01cient by adapting to the number of objects currently present in an image.\nAnother interesting line of work is the GAN-based unsupervised video generation that decomposes\nmotion and content (Tulyakov et al., 2018; Denton and Birodkar, 2017). These methods learn\ninterpretable features of content and motion, but deal only with single objects and do not explicitly\nmodel their locations. Nonetheless, adversarial approaches to learning structured probabilistic models\nof objects offer a plausible alternative direction of research.\n\nBayesian Nonparametric Models To the best of our knowledge, Neiswanger and Wood, 2012 is the\nonly known approach that models pixels belonging to a variable number of objects in a video together\nwith their locations in the generative sense. This work uses a Bayesian nonparametric (BNP) model,\nwhich relies on mixtures of Dirichlet processes to cluster pixels belonging to an object. However,\nthe choice of the model necessitates complex inference algorithms involving Gibbs sampling and\nSequential Monte Carlo, to the extent that any sensible approximation of the marginal likelihood is\ninfeasible. It also uses a \ufb01xed likelihood function, while ours is learnable.\nThe object appearance-persistence-disappearance model in SQAIR is reminiscent of the Markov\nIndian buffet process (MIBP) of Gael et al., 2009, another BNP model. MIBP was used as a\nmodel for blind source separation, where multiple sources contribute toward an audio signal, and\ncan appear, persist, disappear and reappear independently. 
The prior in SQAIR is similar, but the crucial differences are that SQAIR combines the BNP prior with flexible neural network models for the dynamics and likelihood, as well as variational learning via amortized inference. The interface between deep learning and BNP, and graphical models in general, remains a fertile area of research.

6 Discussion

In this paper we proposed SQAIR, a probabilistic model that extends AIR to image sequences, and thereby achieves temporally consistent reconstructions and samples. In doing so, we enhanced AIR's capability of disentangling overlapping objects and identifying partially observed objects.

This work continues the thread of Greff, Steenkiste, et al., 2017, Steenkiste et al., 2018 and, together with Hsieh et al., 2018, presents unsupervised object detection & tracking with learnable likelihoods by means of generative modelling of objects. In particular, our work is the first one to explicitly model object presence, appearance and location through time. Being a generative model, SQAIR can be used for conditional generation, where it can extrapolate sequences into the future. As such, it would be interesting to use it in a reinforcement learning setting in conjunction with Imagination-Augmented Agents (Weber et al., 2017) or, more generally, as a world model (Ha and Schmidhuber, 2018), especially for settings with simple backgrounds, e.g. games like Montezuma's Revenge or Pacman.

The framework offers various avenues of further research; SQAIR leads to interpretable representations, but the interpretability of the 'what' variables can be further enhanced by using alternative objectives that disentangle factors of variation in the objects (Kim and Mnih, 2018). Moreover, in its current state, SQAIR can work only with simple backgrounds and static cameras.
In future work, we would like to address this shortcoming, as well as speed up the sequential inference process, whose complexity is linear in the number of objects. The generative model, which currently assumes additive image composition, can be further improved by e.g. autoregressive modelling (Oord et al., 2016). It can lead to higher fidelity of the model and improved handling of occluded objects. Finally, the SQAIR model is very complex, and it would be useful to perform a series of ablation studies to further investigate the roles of different components.

Acknowledgements

We would like to thank Ali Eslami for his help in implementing AIR, Alex Bewley and Martin Engelcke for discussions and valuable insights, and anonymous reviewers for their constructive feedback. Additionally, we acknowledge that HK and YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013), ERC grant agreement no. 617071.

References

Babaeizadeh, M., C. Finn, D. Erhan, R. H. Campbell, and S. Levine (2017). “Stochastic Variational Video Prediction”. In: CoRR. arXiv: 1710.11252.
Bewley, A., Z. Ge, L. Ott, F. T. Ramos, and B. Upcroft (2016). “Simple online and realtime tracking”. In: ICIP.
Burda, Y., R. Grosse, and R. Salakhutdinov (2016). “Importance Weighted Autoencoders”. In: ICLR. arXiv: 1509.00519.
Cho, M., S. Kwak, C. Schmid, and J. Ponce (2015). “Unsupervised object discovery and localization in the wild: Part-based matching with bottom-up region proposals”. In: CoRR. arXiv: 1501.06170.
Chung, J., K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio (2015). “A Recurrent Latent Variable Model for Sequential Data”. In: NIPS. arXiv: 1506.02216.
Clevert, D.-A., T. Unterthiner, and S. Hochreiter (2015).
\u201cFast and Accurate Deep Network Learning\n\nby Exponential Linear Units (ELUs)\u201d. In: CoRR. arXiv: 1511.07289.\n\nDenton, E. and V. Birodkar (2017). \u201cUnsupervised learning of disentangled representations from\n\nvideo\u201d. In: NIPS.\n\nDenton, E. and R. Fergus (2018). \u201cStochastic Video Generation with a Learned Prior\u201d. In: ICML.\nEslami, S. M. A., N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton\n(2016). \u201cAttend, Infer, Repeat: Fast Scene Understanding with Generative Models\u201d. In: NIPS.\narXiv: 1603.08575.\n\nGael, J. V., Y. W. Teh, and Z. Ghahramani (2009). \u201cThe In\ufb01nite Factorial Hidden Markov Model\u201d. In:\n\nNIPS.\n\nGraves, A., G. Wayne, M. Reynolds, T. Harley, I. Danihelka, A. Grabska-Barwi\u00b4nska, S. G. Col-\nmenarejo, E. Grefenstette, T. Ramalho, J. Agapiou, A. P. Badia, K. M. Hermann, Y. Zwols, G.\nOstrovski, A. Cain, H. King, C. Summer\ufb01eld, P. Blunsom, K. Kavukcuoglu, and D. Hassabis\n(2016). \u201cHybrid computing using a neural network with dynamic external memory\u201d. In: Nature\n538.7626.\n\nGreff, K., A. Rasmus, M. Berglund, T. H. Hao, H. Valpola, and J. Schmidhuber (2016). \u201cTagger:\n\nDeep Unsupervised Perceptual Grouping\u201d. In: NIPS.\n\nGreff, K., S. van Steenkiste, and J. Schmidhuber (2017). \u201cNeural Expectation Maximization\u201d. In:\n\nNIPS.\n\nGulrajani, I., K. Kumar, F. Ahmed, A. A. Taiga, F. Visin, D. Vazquez, and A. Courville (2016).\n\n\u201cPixelvae: A latent variable model for natural images\u201d. In: CoRR. arXiv: 1611.05013.\n\nHa, D. and J. Schmidhuber (2018). \u201cWorld Models\u201d. In: CoRR. arXiv: 1603.10122.\nHsieh, J.-T., B. Liu, D.-A. Huang, L. Fei-Fei, and J. C. Niebles (2018). \u201cLearning to Decompose and\n\nDisentangle Representations for Video Prediction\u201d. In: NIPS.\n\nIlin, A., I. Pr\u00e9mont-Schwarz, T. H. Hao, A. Rasmus, R. Boney, and H. Valpola (2017). \u201cRecurrent\n\nLadder Networks\u201d. 
In: NIPS.
Itseez (2015). Open Source Computer Vision Library. https://github.com/itseez/opencv.
Jacobsen, J.-H., J. Van Gemert, Z. Lou, and A. W. M. Smeulders (2016). “Structured Receptive Fields in CNNs”. In: CVPR.
Jaderberg, M., K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015). “Spatial Transformer Networks”. In: NIPS. arXiv: 1506.02025.
Kemp, C. and J. B. Tenenbaum (2008). “The discovery of structural form”. In: Proceedings of the National Academy of Sciences 105.31.
Kim, H. and A. Mnih (2018). “Disentangling by factorising”. In: ICML. arXiv: 1802.05983.
Kingma, D. P. and J. Ba (2015). “Adam: A Method for Stochastic Optimization”. In: ICLR. arXiv: 1412.6980.
Kingma, D. P. and M. Welling (2013). “Auto-encoding variational bayes”. In: CoRR. arXiv: 1312.6114.
Kosiorek, A. R., A. Bewley, and I. Posner (2017). “Hierarchical Attentive Recurrent Tracking”. In: NIPS. arXiv: 1706.09262.
Kwak, S., M. Cho, I. Laptev, J. Ponce, and C. Schmid (2015). “Unsupervised object discovery and tracking in video collections”. In: ICCV.
LeCun, Y., Y. Bengio, and G. Hinton (2015). “Deep learning”. In: Nature 521.7553.
LeCun, Y., B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel (1989). “Backpropagation applied to handwritten zip code recognition”. In: Neural Computation 1.4.
Maddison, C. J., J. Lawson, G. Tucker, N. Heess, M. Norouzi, A. Mnih, A. Doucet, and Y. Teh (2017). “Filtering Variational Objectives”. In: NIPS.
Mnih, A. and K. Gregor (2014). “Neural Variational Inference and Learning in Belief Networks”. In: ICML. arXiv: 1402.0030.
Mnih, A. and D. J. Rezende (2016). “Variational inference for Monte Carlo objectives”.
In: ICML. arXiv: 1602.06725.
Neiswanger, W. and F. Wood (2012). “Unsupervised Detection and Tracking of Arbitrary Objects with Dependent Dirichlet Process Mixtures”. In: CoRR. arXiv: 1210.3288.
Oord, A. van den, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu (2016). “Conditional Image Generation with PixelCNN Decoders”. In: NIPS. arXiv: 1606.05328.
Ranzato, M., A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra (2014). “Video (language) modeling: a baseline for generative models of natural videos”. In: CoRR. arXiv: 1412.6604.
Ristani, E., F. Solera, R. Zou, R. Cucchiara, and C. Tomasi (2016). “Performance measures and a data set for multi-target, multi-camera tracking”. In: ECCV.
Santoro, A., D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap (2017). “A simple neural network module for relational reasoning”. In: NIPS. arXiv: 1706.01427.
Schulter, S., P. Vernaza, W. Choi, and M. K. Chandraker (2017). “Deep Network Flow for Multi-object Tracking”. In: CVPR.
Shi, W., J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016). “Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network”. In: CVPR.
Srivastava, N., E. Mansimov, and R. Salakhutdinov (2015). “Unsupervised learning of video representations using LSTMs”. In: ICML.
Steenkiste, S. van, M. Chang, K. Greff, and J. Schmidhuber (2018). “Relational Neural Expectation Maximization: Unsupervised Discovery of Objects and their Interactions”. In: ICLR.
Tieleman, T. and G. Hinton (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Tulyakov, S., M.-Y. Liu, X. Yang, and J. Kautz (2018).
\u201cMocogan: Decomposing motion and content\n\nfor video generation\u201d. In: CVPR.\n\nValmadre, J., L. Bertinetto, J. F. Henriques, A. Vedaldi, and P. H. S. Torr (2017). \u201cEnd-to-end\n\nrepresentation learning for Correlation Filter based tracking\u201d. In: CVPR. arXiv: 1704.06036.\n\nWeber, T., S. Racani\u00e8re, D. P. Reichert, L. Buesing, A. Guez, D. J. Rezende, A. P. Badia, O. Vinyals,\nN. Heess, Y. Li, et al. (2017). \u201cImagination-augmented agents for deep reinforcement learning\u201d. In:\nNIPS.\n\nXiao, F. and Y. Jae Lee (2016). \u201cTrack and segment: An iterative unsupervised approach for video\n\nobject proposals\u201d. In: CVPR.\n\nZaheer, M., S. Kottur, S. Ravanbakhsh, B. P\u00f3czos, R. R. Salakhutdinov, and A. J. Smola (2017).\n\n\u201cDeep Sets\u201d. In: NIPS.\n\n11\n\n\f", "award": [], "sourceid": 5214, "authors": [{"given_name": "Adam", "family_name": "Kosiorek", "institution": "University of Oxford"}, {"given_name": "Hyunjik", "family_name": "Kim", "institution": null}, {"given_name": "Yee Whye", "family_name": "Teh", "institution": "University of Oxford, DeepMind"}, {"given_name": "Ingmar", "family_name": "Posner", "institution": "Oxford University"}]}