{"title": "Stacked Capsule Autoencoders", "book": "Advances in Neural Information Processing Systems", "page_first": 15512, "page_last": 15522, "abstract": "Objects are composed of a set of geometrically organized parts. We introduce an unsupervised capsule autoencoder (SCAE), which explicitly uses geometric relationships between parts to reason about objects.\n Since these relationships do not depend on the viewpoint, our model is robust to viewpoint changes.\nSCAE consists of two stages.\nIn the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates.\nIn the second stage, the SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses.\nInference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks.\nWe find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%).", "full_text": "Stacked Capsule Autoencoders\n\nAdam R. Kosiorek \u2217 \u2020 \u2021\n\nSara Sabour\u00a7\n\nYee Whye Teh\u2207\n\nGeoffrey E. Hinton\u00a7\n\nadamk@robots.ox.ac.uk\n\n\u2021 Applied AI Lab\n\n\u2020 Department of Statistics\n\n\u00a7 Google Brain\n\n\u2207 DeepMind\n\nOxford Robotics Institute\n\nUniversity of Oxford\n\nToronto\n\nLondon\n\nUniversity of Oxford\n\nAbstract\n\nObjects are composed of a set of geometrically organized parts. We introduce\nan unsupervised capsule autoencoder (SCAE), which explicitly uses geometric\nrelationships between parts to reason about objects. Since these relationships do\nnot depend on the viewpoint, our model is robust to viewpoint changes. SCAE\nconsists of two stages. 
In the first stage, the model predicts presences and poses of part templates directly from the image and tries to reconstruct the image by appropriately arranging the templates. In the second stage, SCAE predicts parameters of a few object capsules, which are then used to reconstruct part poses. Inference in this model is amortized and performed by off-the-shelf neural encoders, unlike in previous capsule networks. We find that object capsule presences are highly informative of the object class, which leads to state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%).

1 Introduction

Convolutional neural networks (CNNs) work better than networks without weight-sharing because of their inductive bias: if a local feature is useful in one image location, the same feature is likely to be useful in other locations. It is tempting to exploit other effects of viewpoint changes by replicating features across scale, orientation and other affine degrees of freedom, but this quickly leads to cumbersome, high-dimensional feature maps.

An alternative to replicating features across the non-translational degrees of freedom is to explicitly learn transformations between the natural coordinate frame of a whole object and the natural coordinate frames of each of its parts. Computer graphics relies on such object→part coordinate transformations to represent the geometry of an object in a viewpoint-invariant manner. Moreover, there is strong evidence that, unlike standard CNNs, human vision also relies on coordinate frames: imposing an unfamiliar coordinate frame on a familiar object makes it challenging to recognize the object or its geometry (Rock, 1973; G. E. Hinton, 1979).

A neural system can learn to reason about transformations between objects, their parts and the viewer, but each kind of transformation will likely need to be represented differently.
An object-part-relationship (OP) is viewpoint-invariant, approximately constant, and could easily be coded by learned weights. The relative coordinates of an object (or a part) with respect to the viewer change with the viewpoint (they are viewpoint-equivariant) and could easily be coded with neural activations².

∗ This work was done during an internship at Google Brain.

Figure 1: SCAEs learn to explain different object classes with separate object capsules, thereby doing unsupervised classification. Here, we show TSNE embeddings of object capsule presence probabilities for 10000 MNIST digits. Individual points are color-coded according to the corresponding digit class.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 2: Stacked Capsule Autoencoder (SCAE): (a) part capsules segment the input into parts and their poses. The poses are then used to reconstruct the input by affine-transforming learned templates. (b) object capsules try to arrange inferred poses into objects, thereby discovering underlying structure. SCAE is trained by maximizing image and part log-likelihoods subject to sparsity constraints.

With this representation, the pose of a single object is represented by its relationship to the viewer. Consequently, representing a single object does not necessitate replicating neural activations across space, unlike in CNNs. It is only processing two (or more) different instances of the same type of object in parallel that requires spatial replicas of both model parameters and neural activations.

In this paper we propose the Stacked Capsule Autoencoder (SCAE), which has two stages (Fig.
2). The first stage, the Part Capsule Autoencoder (PCAE), segments an image into constituent parts, infers their poses, and reconstructs the image by appropriately arranging affine-transformed part templates. The second stage, the Object Capsule Autoencoder (OCAE), tries to organize discovered parts and their poses into a smaller set of objects. These objects then try to reconstruct the part poses using a separate mixture of predictions for each part. Every object capsule contributes components to each of these mixtures by multiplying its pose (the object-viewer-relationship, OV) by the relevant object-part-relationship (OP).

Stacked Capsule Autoencoders (Section 2) capture spatial relationships between whole objects and their parts when trained on unlabelled data. The vectors of presence probabilities for the object capsules tend to form tight clusters (cf. Figure 1), and when we assign a class to each cluster we achieve state-of-the-art results for unsupervised classification on SVHN (55%) and MNIST (98.7%), which can be further improved to 67% and 99%, respectively, by learning fewer than 300 parameters (Section 3). We describe related work in Section 4 and discuss implications of our work and future directions in Section 5. The code is available at github.com/google-research/google-research/tree/master/stacked_capsule_autoencoders.

2 Stacked Capsule Autoencoders (SCAE)

Segmenting an image into parts is non-trivial, so we begin by abstracting away pixels and the part-discovery stage, and develop the Constellation Capsule Autoencoder (CCAE) (Section 2.1). It uses two-dimensional points as parts, and their coordinates are given as the input to the system. The CCAE learns to model sets of points as arrangements of familiar constellations, each of which has been transformed by an independent similarity transform.
The CCAE learns to assign individual points to their respective constellations, without knowing the number of constellations or their shapes in advance. Next, in Section 2.2, we develop the Part Capsule Autoencoder (PCAE), which learns to infer parts and their poses from images. Finally, we stack the Object Capsule Autoencoder (OCAE), which closely resembles the CCAE, on top of the PCAE to form the Stacked Capsule Autoencoder (SCAE).

2.1 Constellation Autoencoder (CCAE)

Let {x_m | m = 1, ..., M} be a set of two-dimensional input points, where every point belongs to a constellation as in Figure 3. We first encode all input points (which take the role of part capsules) with Set Transformer (Lee et al., 2019), a permutation-invariant encoder h^{caps} based on attention mechanisms, into K object capsules. An object capsule k consists of a capsule feature vector c_k, its presence probability a_k ∈ [0, 1] and a 3 × 3 object-viewer-relationship (OV) matrix, which represents the affine transformation between the object (constellation) and the viewer.

² This may explain why accessing perceptual knowledge about objects, when they are not visible, requires creating a mental image of the object with a specific viewpoint.

Figure 3: Unsupervised segmentation of points belonging to up to three constellations of squares and triangles at different positions, scales and orientations. The model is trained to reconstruct the points (top row) under the CCAE mixture model. The bottom row colours the points based on the parent with the highest posterior probability in the mixture model. The right-most column shows a failure case. Note that the model uses sets of points, not pixels, as its input; we use images only to visualize the constellation arrangements.
Note that each object capsule can represent only one object at a time. Every object capsule uses a separate multilayer perceptron (MLP) h_k^{part} to predict N ≤ M part candidates from the capsule feature vector c_k. Each candidate consists of the conditional probability a_{k,n} ∈ [0, 1] that a given candidate part exists, an associated scalar standard deviation λ_{k,n}, and a 3 × 3 object-part-relationship (OP) matrix, which represents the affine transformation between the object capsule and the candidate part³. Candidate predictions µ_{k,n} are given by the product of the object capsule OV and the candidate OP matrices. We model all input points as a single Gaussian mixture, where µ_{k,n} and λ_{k,n} are the centres and standard deviations of the isotropic Gaussian components. See Figures 2 and 6 for illustration; the formal description follows:

    OV_{1:K}, c_{1:K}, a_{1:K} = h^{caps}(x_{1:M})        predict object capsule parameters,      (1)
    OP_{k,1:N}, a_{k,1:N}, λ_{k,1:N} = h_k^{part}(c_k)    decode candidate parameters from c_k,   (2)
    V_{k,n} = OV_k OP_{k,n}                               decode a part pose candidate,           (3)
    p(x_m | k, n) = N(x_m | µ_{k,n}, λ_{k,n})             turn candidates into mixture components, (4)

    p(x_{1:M}) = ∏_{m=1}^{M} Σ_{k=1}^{K} Σ_{n=1}^{N} [a_k a_{k,n} / (Σ_i a_i Σ_j a_{i,j})] p(x_m | k, n) .    (5)

The model is trained without supervision by maximizing the likelihood of part capsules in Equation (5) subject to sparsity constraints, cf. Section 2.4 and Appendix C. The part capsule m can be assigned to the object capsule k⋆ by looking at the mixture component responsibility, that is k⋆ = arg max_k a_k a_{k,n} p(x_m | k, n)⁴. Empirical results show that this model is able to perform unsupervised instance-level segmentation of points belonging to different constellations, even in data which is difficult to interpret for humans.
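As a concrete numerical illustration of Equations (1)-(5), the mixture likelihood can be sketched as follows. This is a minimal sketch that assumes the encoder outputs (OV and OP matrices, presences and scales) are already given; all function and variable names are illustrative:

```python
import numpy as np

def ccae_log_likelihood(x, OV, OP, a, a_kn, lam):
    """Log-likelihood of 2D points under the CCAE mixture (cf. Eq. 5).

    x:    [M, 2]       input points
    OV:   [K, 3, 3]    object-viewer matrices
    OP:   [K, N, 3, 3] object-part matrices
    a:    [K]          object capsule presence probabilities
    a_kn: [K, N]       candidate presence probabilities
    lam:  [K, N]       isotropic standard deviations
    """
    M = x.shape[0]
    # Eq. (3): compose OV and OP to get the candidate predictions V_{k,n};
    # the candidate centre mu_{k,n} is the translation part of V_{k,n}.
    V = np.einsum('kij,knjl->knil', OV, OP)                 # [K, N, 3, 3]
    mu = V[:, :, :2, 2]                                     # [K, N, 2]

    # Eq. (4): isotropic 2D Gaussian density for every (point, candidate) pair.
    d2 = ((x[:, None, None, :] - mu[None]) ** 2).sum(-1)    # [M, K, N]
    gauss = np.exp(-d2 / (2 * lam ** 2)) / (2 * np.pi * lam ** 2)

    # Mixing proportions a_k a_{k,n} / (sum_i a_i sum_j a_{i,j}).
    mix = a[:, None] * a_kn
    mix = mix / mix.sum()

    # Eq. (5): sum over mixture components per point, product over points.
    p_point = (mix[None] * gauss).reshape(M, -1).sum(-1)    # [M]
    return np.log(p_point).sum()
```

With K = N = 1, identity OV and OP matrices and λ = 1, a single point at the origin has log-likelihood −log 2π, the log-density of a standard 2D Gaussian at its mean.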
See Figure 3 for an example and Section 3.1 for details.

2.2 Part Capsule Autoencoder (PCAE)

Explaining images as geometrical arrangements of parts requires 1) discovering what parts are there in an image and 2) inferring the relationships of the parts to the viewer (their pose). For the CCAE a part is just a 2D point (that is, an (x, y) coordinate), but for the PCAE each part capsule m has a six-dimensional pose x_m (two rotations, two translations, scale and shear), a presence variable d_m ∈ [0, 1] and a unique identity. We frame the part-discovery problem as auto-encoding: the encoder learns to infer the poses and presences of different part capsules, while the decoder learns an image template T_m for each part (Fig. 4), similar to Tieleman, 2014; Eslami et al., 2016. If a part exists (according to its presence variable), the corresponding template is affine-transformed with the inferred pose, giving T̂_m. Finally, the transformed templates are arranged into the image. The PCAE is followed by an Object Capsule Autoencoder (OCAE), which closely resembles the CCAE and is described in Section 2.3.

Let y ∈ [0, 1]^{h×w×c} be the image. We limit the maximum number of part capsules to M and use an encoder to infer their poses x_m, presence probabilities d_m, and special features z_m ∈ R^{c_z}, one per part capsule. Special features can be used to alter the templates in an input-dependent manner (we use them to predict colour, but more complicated mappings are possible). The special features also inform the OCAE about unique aspects of the corresponding part (e.g., occlusion or relation to other parts). Templates T_m ∈ [0, 1]^{h_t×w_t×(c+1)} are smaller than the image y, but have an additional alpha channel which allows occlusion by other templates. We use T^a_m to refer to the alpha channel and T^c_m to refer to its colours.

³ Deriving these matrices from capsule feature vectors allows for deformable objects; see Appendix D for details.
⁴ We treat parts as independent and evaluate their probability under the same mixture model. While there are no clear 1:1 connections between parts and predictions, this seems to work well in practice.

Figure 4: Stroke-like templates learned on MNIST (left) as well as Sobel-filtered SVHN (middle) and CIFAR10 (right). For SVHN they often take the form of double strokes due to Sobel filtering.

Figure 5: MNIST (a) images, (b) reconstructions from part capsules in red and object capsules in green, with overlapping regions in yellow. Only a few object capsules are activated for every input (c) a priori (left) and even fewer are needed to reconstruct it (right). The most active capsules (d) capture object identity and its appearance; (e) shows a few of the affine-transformed templates used for reconstruction.

We allow each part capsule to be used only once to reconstruct an image, which means that parts of the same type are not repeated⁵. To infer part capsule parameters we use a CNN-based encoder followed by attention-based pooling, which is described in more detail in Appendix E and whose effects on the model performance are analyzed in Section 3.3.

The image is modelled as a spatial Gaussian mixture, similarly to Greff et al., 2019; Burgess et al., 2019; Engelcke et al., 2019. Our approach differs in that we use pixels of the transformed templates (instead of component-wise reconstructions) as the centres of isotropic Gaussian components, and we also use constant variance. Mixing probabilities of different components are proportional to the product of presence probabilities of part capsules and the value of the learned alpha channel for every template.
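The per-pixel mixture just described can be sketched as follows. This is a minimal greyscale sketch which assumes the templates have already been affine-transformed into image coordinates; all names are illustrative:

```python
import numpy as np

def pcae_image_log_likelihood(y, alpha, colour, d, sigma_y=0.1):
    """Spatial Gaussian mixture over transformed templates (cf. Eqs. 9-10).

    y:      [H, W]     greyscale image
    alpha:  [M, H, W]  alpha channels of the transformed templates, T^a_m
    colour: [M, H, W]  colour channels of the transformed templates, T^c_m
    d:      [M]        part capsule presence probabilities
    """
    # Eq. (9): mixing probabilities are proportional to presence times alpha.
    weights = d[:, None, None] * alpha               # [M, H, W]
    mix = weights / weights.sum(0, keepdims=True)    # normalise over parts

    # Isotropic Gaussian likelihood of each pixel under each template.
    gauss = np.exp(-(y[None] - colour) ** 2 / (2 * sigma_y ** 2))
    gauss /= np.sqrt(2 * np.pi) * sigma_y

    # Eq. (10): mixture per pixel; the product over pixels becomes a sum of logs.
    return np.log((mix * gauss).sum(0)).sum()
```

A single fully-present template whose colours match the image exactly gives a log-likelihood of H · W · (−log(√(2π) σ_y)), the best this model can do at a fixed variance.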
More formally:

    x_{1:M}, d_{1:M}, z_{1:M} = h^{enc}(y)           predict part capsule parameters,            (6)
    c_m = MLP(z_m)                                   predict the colour of the m-th template,    (7)
    T̂_m = TransformImage(T_m, x_m)                   apply affine transforms to image templates, (8)
    p^y_{m,i,j} ∝ d_m T̂^a_{m,i,j}                    compute mixing probabilities,               (9)

    p(y) = ∏_{i,j} Σ_{m=1}^{M} p^y_{m,i,j} N(y_{i,j} | c_m · T̂^c_{m,i,j}; σ²_y)    calculate the image likelihood.    (10)

Training the PCAE results in learning templates for object parts, which resemble strokes in the case of MNIST; see Figure 4. This stage of the model is trained by maximizing the image likelihood of Equation (10).

2.3 Object Capsule Autoencoder (OCAE)

Having identified parts and their parameters, we would like to discover objects that could be composed of them⁶. To do so, we use the concatenated poses x_m, special features z_m and flattened templates T_m (which convey the identity of the part capsule) as an input to the OCAE, which differs from the CCAE in the following ways. Firstly, we feed part capsule presence probabilities d_m into the OCAE's encoder; these are used to bias the Set Transformer's attention mechanism not to take absent points into account. Secondly, the d_m are also used to weigh the part capsules' log-likelihood, so that we do not take the log-likelihood of absent points into account. This is implemented by raising the likelihood of the m-th part capsule to the power of d_m, cf. Equation (5). Additionally, we stop the gradient on all of the OCAE's inputs except the special features to improve training stability and avoid the problem of collapsing latent variables; see e.g., Rasmus et al., 2015. Finally, parts discovered by the PCAE have independent identities (templates and special features rather than 2D points). Therefore, every part-pose is explained as an independent mixture of predictions from object capsules, where every object capsule makes exactly M candidate predictions V_{k,1:M}, or exactly one candidate prediction per part. Consequently, the part-capsule likelihood is given by

    p(x_{1:M}, d_{1:M}) = ∏_{m=1}^{M} [ Σ_{k=1}^{K} a_k a_{k,m} / (Σ_i a_i Σ_j a_{i,j}) p(x_m | k, m) ]^{d_m} .    (11)

The OCAE is trained by maximising the part pose likelihood of Equation (11), and it learns to discover further structure in previously identified parts, leading to sparsely-activated object capsules; see Figure 5. Achieving this sparsity requires further regularization, however.

⁵ We could repeat parts by using multiple instances of the same part capsule.
⁶ Discovered objects are not used top-down to refine the presences or poses of the parts during inference. However, the derivatives backpropagated via the OCAE refine the lower-level encoder network that infers the parts.

2.4 Achieving Sparse and Diverse Capsule Presences

Stacked Capsule Autoencoders are trained to maximise pixel and part log-likelihoods (L_ll = log p(y) + log p(x_{1:M})). If not constrained, however, they tend to either use all of the part and object capsules to explain every data example or collapse onto always using the same subset of capsules, regardless of the input. We want the model to use different sets of part capsules for different input examples and to specialize object capsules to particular arrangements of parts. To encourage this, we impose sparsity and entropy constraints. We evaluate their importance in Section 3.3.

We first define prior and posterior object-capsule presence as follows.
For a minibatch of size B with K object capsules and M part capsules we define a minibatch of prior capsule presences a^{prior}_{1:K} with dimension [B, K] and posterior capsule presences a^{posterior}_{1:K,1:M} with dimension [B, K, M] as

    a^{prior}_k = a_k max_m a_{k,m} ,        a^{posterior}_{k,m} = a_k a_{k,m} p(x_m | k, m) ,    (12)

respectively; the former is the maximum presence probability among predictions from object capsule k, while the latter is the unnormalized mixing proportion used to explain part capsule m.

Prior sparsity. Let u_k = Σ_{b=1}^{B} a^{prior}_{b,k} be the sum of presence probabilities of object capsule k over different training examples, and û_b = Σ_{k=1}^{K} a^{prior}_{b,k} the sum of object capsule presence probabilities for a given example. If we assume that training examples contain objects from different classes uniformly at random and we would like to assign the same number of object capsules to every class, then each class would obtain K/C capsules. Moreover, if we assume that only one object is present in every image, then K/C object capsules should be present for every input example, which results in a sum of presence probabilities of B/C for every object capsule. To this end, we minimize

    L_prior = (1/B) Σ_{b=1}^{B} (û_b − K/C)² + (1/K) Σ_{k=1}^{K} (u_k − B/C)² .    (13)

Posterior sparsity. Similarly, we experimented with minimizing the within-example entropy of capsule posterior presence H(v_k) and maximizing its between-example entropy H(v̂_b), where H is the entropy, and v_k and v̂_b are the normalized versions of Σ_{b,m} a^{posterior}_{b,k,m} and Σ_{k,m} a^{posterior}_{b,k,m}, respectively. The final loss reads as

    L_posterior = (1/K) Σ_{k=1}^{K} H(v_k) − (1/B) Σ_{b=1}^{B} H(v̂_b) .    (14)

Our ablation study has shown, however, that the model can perform equally well without these posterior sparsity constraints, cf. Section 3.3.

Figure 6 shows the schematic architecture of the SCAE. We optimize a weighted sum of the image and part likelihoods and the auxiliary losses. The loss-weight selection process, as well as the values used for the experiments, are detailed in Appendix A.

In order to make the values of the presence probabilities (a_k, a_{k,m} and d_m) closer to binary, we inject uniform noise ∈ [−2, 2] into their logits, similar to Tieleman, 2014. This forces the model to predict logits that are far from zero to avoid stochasticity and makes the predicted presence probabilities close to binary. Interestingly, this tends to work better in our case than using the Concrete distribution (Maddison et al., 2017).

Figure 6: SCAE architecture.

3 Evaluation

The decoders in the SCAE use explicitly parameterised affine transformations that allow the encoders' inputs to be explained with a small set of transformed objects or parts. The following evaluations show how the embedded geometrical knowledge helps to discover patterns in data. Firstly, we show that the CCAE discovers underlying structures in sets of two-dimensional points, thereby performing instance-level segmentation. Secondly, we pair an OCAE with a PCAE and investigate whether the resulting SCAE can discover structure in real images. Finally, we present an ablation study that shows which components of the model contribute to the results.

3.1 Discovering Constellations

We create arrangements of constellations online, where every input example consists of up to 11 two-dimensional points belonging to up to three different constellations (two squares and a triangle) as well as binary variables indicating the presence of the points (points can be missing). Each constellation is included with probability 0.5 and undergoes a similarity transformation, whereby it is randomly scaled, rotated by up to 180° and shifted. Finally, every input example is normalised such that all points lie within [−1, 1]².
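The data-generation procedure above can be sketched as follows. This is a sketch with illustrative names; the exact template shapes and the scale and shift ranges are assumptions, while the point count (4 + 4 + 3 = 11), the 0.5 inclusion probability, the 180° rotation range and the final normalisation follow the description:

```python
import numpy as np

def make_constellations(rng):
    """Generate one training example: up to three similarity-transformed
    constellations (two squares and a triangle), normalised to [-1, 1]^2.
    Returns points [11, 2] and a binary presence vector [11]."""
    square = np.array([[-1., -1.], [-1., 1.], [1., 1.], [1., -1.]])
    triangle = np.array([[0., 1.], [-1., -1.], [1., -1.]])
    points, present = [], []
    for template in (square, square, triangle):
        keep = rng.random() < 0.5           # included with probability 0.5
        scale = rng.uniform(0.3, 1.0)       # assumed scale range
        theta = rng.uniform(-np.pi, np.pi)  # rotation by up to 180 degrees
        shift = rng.uniform(-2.0, 2.0, size=2)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        pts = scale * template @ R.T + shift
        points.append(pts)
        present.append(np.full(len(pts), keep))
    points = np.concatenate(points)         # 11 points in total
    present = np.concatenate(present).astype(float)
    # Normalise so that all points lie within [-1, 1]^2.
    points = points / np.abs(points).max()
    return points, present
```

Absent points are kept in the array with a zero presence flag, matching the binary presence variables described above.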
Note that we use sets of points, and not images, as inputs to our model.

We compare the CCAE against a baseline that uses the same encoder but a simpler decoder: the decoder uses the capsule parameter vector c_k to directly predict the location, precision and presence probability of each of the four points, as well as the presence probability of the whole corresponding constellation. Implementation details are listed in Appendix A.1.

Both models are trained unsupervised by maximising the part log-likelihood. We evaluate them by trying to assign each input point to one of the object capsules. To do so, we assign every input point to the object capsule with the highest posterior probability for this point, cf. Section 2.1, and compute segmentation accuracy (i.e., the true-positive rate).

The CCAE consistently achieves⁷ below 4% error, with the best model achieving 2.8%, while the best baseline achieved 26% error using the same budget for hyperparameter search. This shows that wiring in an inductive bias towards modelling geometric relationships can help to bring down the error by an order of magnitude, at least in a toy setup where each set of points is composed of familiar constellations that have been independently transformed.

3.2 Unsupervised Class Discovery in Images

We now turn to images in order to assess if our model can simultaneously learn to discover parts and group them into objects. To allow for multimodality in the appearance of objects of a specific class, we typically use more object capsules than the number of class labels. It turns out that the vectors of presence probabilities form tight clusters, as shown by their TSNE embeddings (Maaten and G. E. Hinton, 2008) in Figure 1; note the large separation between clusters corresponding to different digits, and that only a few data points are assigned to the wrong clusters.
Therefore, we expect object capsule presences to be highly informative of the class label. To test this hypothesis, we train SCAE on MNIST, SVHN⁸ and CIFAR10 and try to assign class labels to vectors of object capsule presences.

⁷ This result requires using an additional sparsity loss described in Appendix C; without it the CCAE achieves around 10% error.

Table 1: Unsupervised classification results in % with (standard deviation), averaged over 5 runs. Methods based on mutual information are shaded. Results marked with † use data augmentation, ∇ use IMAGENET-pretrained features instead of images, while § are taken from Ji et al., 2018. We highlight the best results and those that are within its 98% confidence interval according to a two-sided t-test.

    Method                          | MNIST       | CIFAR10     | SVHN
    KMEANS (Haeusser et al., 2018)  | 53.49       | 20.8        | 12.5
    AE (Bengio et al., 2007)§       | 81.2        | 31.4        | -
    GAN (Radford et al., 2016)§     | 82.8        | 31.5        | -
    IMSAT (Hu et al., 2017)†,∇      | 98.4 (0.4)  | 45.6 (0.8)  | 57.3 (3.9)
    IIC (Ji et al., 2018)§,†        | 98.4 (0.6)  | 57.6 (5.0)  | -
    ADC (Haeusser et al., 2018)†    | 98.7 (0.6)  | 29.3 (1.5)  | 38.6 (4.1)
    SCAE (LIN-MATCH)                | 98.7 (0.35) | 25.01 (1.0) | 55.33 (3.4)
    SCAE (LIN-PRED)                 | 99.0 (0.07) | 33.48 (0.3) | 67.27 (4.5)
This is done with one of the following methods. LIN-MATCH: after finding 10 clusters⁹ with KMEANS, we use bipartite graph matching (Kuhn, 1955) to find the permutation of cluster indices that minimizes the classification error; this is standard practice in unsupervised classification, see e.g., Ji et al., 2018. LIN-PRED: we train a linear classifier with supervision given the presence vectors; this learns K × 10 weights and 10 biases, where K is the number of object capsules, but it does not modify any parameters of the main model.

In agreement with previous work on unsupervised clustering (Ji et al., 2018; Hu et al., 2017; Hjelm et al., 2019; Haeusser et al., 2018), we train our models and report results on full datasets (TRAIN, VALID and TEST splits). The linear transformation used in the LIN-PRED variant of our method is trained on the TRAIN split of the respective datasets, while its performance on the TEST split is reported.

We used a PCAE with 24 single-channel 11 × 11 templates for MNIST, and 24 and 32 three-channel 14 × 14 templates for SVHN and CIFAR10, respectively. We used Sobel-filtered images as the reconstruction target for SVHN and CIFAR10, as in Jaiswal et al., 2018, while using the raw pixel intensities as the input to the PCAE. The OCAE used 24, 32 and 64 object capsules, respectively. Further details on model architectures and hyper-parameter tuning are available in Appendix A. All results are presented in Table 1. SCAE achieves state-of-the-art results in unsupervised object classification on MNIST and SVHN and under-performs on CIFAR10 due to the inability to model backgrounds, which is further discussed in Section 5.

3.3 Ablation study

SCAEs have many moving parts; an ablation study shows which model components are important and to what degree.
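The ablations below are scored with the LIN-MATCH metric from Section 3.2; its matching step can be sketched as follows. This is a sketch that assumes cluster assignments from KMEANS are already computed and uses the Hungarian algorithm to maximise agreement; all names are illustrative:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def lin_match_accuracy(cluster_ids, labels, n_classes=10):
    """LIN-MATCH: permute cluster indices to best agree with the true
    labels (bipartite matching, Kuhn 1955), then report accuracy.
    cluster_ids, labels: integer arrays of shape [N]."""
    # Contingency table: counts of (cluster, class) co-occurrences.
    C = np.zeros((n_classes, n_classes), dtype=int)
    for c, l in zip(cluster_ids, labels):
        C[c, l] += 1
    # The Hungarian algorithm minimises cost, so negate to maximise counts.
    rows, cols = linear_sum_assignment(-C)
    return C[rows, cols].sum() / len(labels)
```

For example, clusters [0, 0, 1, 1] against labels [3, 7, 7, 7] are best matched by 0→3, 1→7, giving an accuracy of 0.75.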
We train SCAE variants on MNIST as well as on a padded-and-translated 40 × 40 version of the dataset, where the original digits are translated by up to 6 pixels in each direction. Trained models are tested on the TEST splits of both datasets; additionally, we evaluate the model trained on 40 × 40 MNIST on the TEST split of the AFFNIST dataset. Testing on AFFNIST shows whether the model can generalise to unseen viewpoints. This task was used by Rawlinson et al., 2018 to evaluate Sparse Unsupervised Capsules, which achieved 90.12% accuracy. SCAE achieves 92.2 ± 0.59%, which indicates that it is better at viewpoint generalisation. We choose the LIN-MATCH performance metric, since it is the one favoured by the unsupervised classification community.

Results are split into several groups and shown in Table 2. We describe each group in turn. Group a) shows that the sparsity losses introduced in Section 2.4 increase model performance, but that the posterior loss might not be necessary. Group b) checks the influence of injecting noise into the logits for presence probabilities, cf. Section 2.4. Injecting noise into part capsules seems critical, while noise in object capsules seems unnecessary; the latter might be due to the sparsity losses. Group c) shows that using similarity (as opposed to affine) transforms in the decoder can be restrictive in some cases, while not allowing deformations hurts performance in every case.

Group d) evaluates the type of the part-capsule encoder. The LINEAR encoder entails a CNN followed by a fully-connected layer, while the CONV encoder predicts one feature map for every capsule parameter, followed by global-average pooling.
The choice of part-capsule encoder seems not to matter much for within-distribution performance; however, our attention-based pooling (cf. Appendix E) does achieve much higher classification accuracy when evaluated on a different dataset, showing better generalisation to novel viewpoints.

⁸ We note that we tie the values of the alpha channel T^a_m and the colour values T^c_m, which leads to better results in the SVHN experiments.
⁹ All considered datasets have 10 classes.

Table 2: Ablation study on MNIST. All used model components contribute to its final performance. AFFNIST results show out-of-distribution generalization properties and come from a model trained on 40 × 40 MNIST. Numbers represent average % and (standard deviation) over 10 runs. We highlight the best results and those that are within its 98% confidence interval according to a two-sided t-test.

    Method                           | MNIST        | 40 × 40 MNIST | AFFNIST
    full model                       | 95.3 (4.65)  | 98.7 (0.35)   | 92.2 (0.59)
    a) no posterior sparsity         | 97.5 (1.55)  | 95.0 (7.20)   | 85.3 (11.67)
    a) no prior sparsity             | 72.4 (22.39) | 88.2 (6.98)   | 71.3 (5.46)
    a) no prior/posterior sparsity   | 84.7 (3.01)  | 82.0 (5.46)   | 59.0 (5.66)
    b) no noise in object caps       | 96.7 (2.30)  | 98.5 (0.12)   | 93.5 (0.38)
    b) no noise in any caps          | 93.1 (5.09)  | 78.5 (22.69)  | 64.1 (26.74)
    b) no noise in part caps         | 93.9 (7.16)  | 82.8 (24.83)  | 70.7 (25.96)
    c) similarity transforms         | 97.5 (1.55)  | 95.9 (1.59)   | 88.9 (1.58)
    c) no deformations               | 87.3 (21.48) | 87.2 (18.54)  | 79.0 (22.44)
    d) LINEAR part enc               | 98.0 (0.52)  | 27.1 (9.03)   | 50.8 (26.46)
    d) CONV part enc                 | 97.6 (1.22)  | 90.7 (2.25)   | 81.6 (1.66)
    e) MLP enc for object caps       | 63.2 (31.47) | 36.3 (3.70)   | 25.29 (3.69)
    f) no special features           | 97.8 (0.98)  | 58.7 (31.60)  | 44.5 (21.71)

Additionally, e) using Set Transformer as the object-capsule encoder is essential.
We hypothesise that this is due to the natural tendency of Set Transformer to find clusters, as reported in Lee et al., 2019. Finally, f) shows that using special features z_m is no less important, presumably due to the effects the high-level capsules have on the representation learned by the primary encoder.

4 Related Work

Capsule Networks. Our work combines ideas from Transforming Autoencoders (G. E. Hinton, Krizhevsky, et al., 2011) and EM Capsules (G. E. Hinton, Sabour, et al., 2018). Transforming autoencoders discover affine-aware capsule instantiation parameters by training an autoencoder to reconstruct an affine-transformed version of the original image. This model uses an additional input that explicitly represents the transformation, which is a form of supervision. By contrast, our model does not need any input other than the image.

Both EM Capsules and the preceding Dynamic Capsules (Sabour et al., 2017) use the poses of parts and learned part→object relationships to vote for the poses of objects. When multiple parts cast very similar votes, the object is assumed to be present, which is facilitated by an iterative inference (routing) algorithm. Iterative routing is inefficient and has prompted further research (Wang and Liu, 2018; Zhang et al., 2018; Li et al., 2018). In contrast to prior work, we use objects to predict parts rather than vice versa; therefore we can dispense with iterative routing at inference time: every part is explained as a mixture of predictions from different objects, and can have only one parent. This regularizes the OCAE's encoder to respect the single-parent constraint when learning to group parts into objects.

Additionally, since it is the objects that predict parts, the part poses can have fewer degrees of freedom than object poses (as in the CCAE). Inference is still possible because the OCAE encoder makes object predictions based on all the parts.
This is in contrast to each individual part making its own prediction, as was the case in previous works on capsules.

A further advantage of our version of capsules is that it can perform unsupervised learning, whereas previous capsule networks used discriminative learning. Rawlinson et al., 2018 is a notable exception and used the reconstruction MLP introduced in Sabour et al., 2017 to train Dynamic Capsules without supervision. Their results show that unsupervised training for capsule-conditioned reconstruction helps with generalization to AFFNIST classification; we further improve on their results, cf. Section 3.3.

Unsupervised Classification There are two main approaches to unsupervised object category detection in computer vision. The first one is based on representation learning and typically requires discovering clusters or learning a classifier on top of the learned representation. Eslami et al., 2016; Kosiorek et al., 2018 use an iterative procedure to infer a variable number of latent variables, one for every object in a scene, that are highly informative of the object class, while Greff et al., 2019; Burgess et al., 2019 perform unsupervised instance-level segmentation in an iterative fashion. While similar to our work, these approaches cannot decompose objects into their constituent parts and do not provide an explicit description of object shape (e.g., templates and their poses in our model).

The second approach targets classification explicitly by minimizing mutual information (MI)-based losses and directly learning class-assignment probabilities. IIC (Ji et al., 2018) maximizes an exact estimator of MI between two discrete probability vectors describing (transformed) versions of the input image. DeepInfoMax (Hjelm et al., 2019) relies on negative samples and maximizes MI between the predicted probability vector and its input via noise-contrastive estimation (Gutmann and Hyvärinen, 2010).
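For concreteness, the discrete MI objective that IIC maximizes can be written in a few lines of numpy. This is a simplified reading of Ji et al., 2018 (a symmetrized empirical joint over cluster pairs), not the authors' implementation:

```python
import numpy as np

def discrete_mutual_information(p, p_t):
    """Exact MI between two batches of discrete cluster-assignment distributions.

    p, p_t: (N, C) softmax outputs for N images and their transformed versions.
    Returns the mutual information (in nats) of the empirical joint over the
    C x C cluster pairs.
    """
    joint = p.T @ p_t / p.shape[0]          # (C, C) empirical joint distribution
    joint = (joint + joint.T) / 2.0         # symmetrize
    joint = joint / joint.sum()             # renormalize
    pr = joint.sum(axis=1, keepdims=True)   # row marginal
    pc = joint.sum(axis=0, keepdims=True)   # column marginal
    eps = 1e-12                             # guard against log(0)
    return float((joint * (np.log(joint + eps)
                           - np.log(pr + eps)
                           - np.log(pc + eps))).sum())
```

Maximizing this quantity pushes paired assignments to agree (confident, transformation-invariant clusters), while the marginal-entropy terms keep the clusters balanced.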
This class of methods directly maximizes the amount of information contained in an assignment to discrete clusters, and holds state-of-the-art results on most unsupervised classification tasks. MI-based methods suffer from the typical drawbacks of mutual information estimation: they require massive data augmentation and large batch sizes. This is in contrast to our method, which achieves comparable performance with a batch size no bigger than 128 and with no data augmentation.

Geometrical Reasoning Other attempts at incorporating geometrical knowledge into neural networks include exploiting equivariance properties of group transformations (Cohen and Welling, 2016) or new types of convolutional filters (Oyallon and Mallat, 2015; Dieleman et al., 2016). Although they achieve significant parameter efficiency in handling rotations or reflections compared to standard CNNs, these methods cannot handle additional degrees of freedom of affine transformations—like scale. Lenssen et al., 2018 combined capsule networks with group convolutions to guarantee equivariance and invariance in capsule networks. Spatial Transformers (ST; Jaderberg et al., 2015) apply affine transformations to the image sampling grid, while steerable networks (Cohen and Welling, 2017; Jacobsen et al., 2017) dynamically change convolutional filters. These methods are similar to ours in that transformation parameters are predicted by a neural network, but differ in that ST uses global transformations applied to the whole image, while steerable networks use only local transformations.
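The global-versus-local distinction can be illustrated with homogeneous coordinates, in the spirit of the paper's object-viewer (OV) and object-part (OP) matrices: composing one global object pose with a local per-part pose yields that part's pose in viewer space. A hypothetical sketch restricted to 2-D similarity transforms (the parameterization and names are illustrative, not the paper's exact matrices):

```python
import numpy as np

def similarity_2d(theta=0.0, scale=1.0, tx=0.0, ty=0.0):
    """3x3 homogeneous matrix for a 2-D similarity transform."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[scale * c, -scale * s, tx],
                     [scale * s,  scale * c, ty],
                     [0.0,        0.0,       1.0]])

# A single global object->viewer transform...
object_to_viewer = similarity_2d(theta=np.pi / 4, scale=2.0, tx=1.0)
# ...composed with a local object->part transform gives the part's pose
# in viewer coordinates, so moving the object moves all of its parts.
part_in_object = similarity_2d(tx=0.5)             # part offset in object frame
part_in_viewer = object_to_viewer @ part_in_object
```

Changing only object_to_viewer moves every part consistently, which is what makes a part-based representation of this kind robust to viewpoint changes.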
Our approach can use different global transformations for every object as well as local transformations for each of their parts.

5 Discussion

The main contribution of our work is a novel method for representation learning, in which highly structured decoder networks are used to train one encoder network that can segment an image into parts and their poses and another encoder network that can compose the parts into coherent wholes. Even though our training objective is not concerned with classification or clustering, SCAE is the only method that achieves competitive results in unsupervised object classification without relying on mutual information (MI). This is significant since, unlike our method, MI-based methods require sophisticated data augmentation. It may be possible to further improve results by using an MI-based loss to train SCAE, where the vector of capsule probabilities could take the role of discrete probability vectors in IIC (Ji et al., 2018). SCAE under-performs on CIFAR10, which could be because of using fixed templates, which are not expressive enough to model real data. This might be fixed by building deeper hierarchies of capsule autoencoders (e.g., complicated scenes in computer graphics are modelled as deep trees of affine-transformed geometric primitives) as well as by using input-dependent shape functions instead of fixed templates—both of which are promising directions for future work. It may also be possible to make a much better PCAE for learning the primary capsules by using a differentiable renderer in the generative model that reconstructs pixels from the primary capsules.

Finally, the SCAE could be the ‘figure’ component of a mixture model that also includes a versatile ‘ground’ component that can be used to account for everything except the figure.
A complex image could then be analyzed using sequential attention to perceive one figure at a time.

6 Acknowledgements

We would like to thank Sandy H. Huang for help with editing the manuscript and making Figure 2. Additionally, we would like to thank S. M. Ali Eslami and Danijar Hafner for helpful discussions throughout the project. We also thank Hyunjik Kim, Martin Engelcke, Emilien Dupont and Simon Kornblith for feedback on initial versions of the manuscript.

References

J. Ba, J. Kiros, and G. E. Hinton (2016). “Layer Normalization”. In: CoRR abs/1607.06450.

Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle (2007). “Greedy Layer-wise Training of Deep Networks”. In: Advances in Neural Information Processing Systems.

C. P. Burgess, L. Matthey, N. Watters, R. Kabra, I. Higgins, M. Botvinick, and A. Lerchner (2019). “MONet: Unsupervised Scene Decomposition and Representation”. In: CoRR. arXiv: 1901.11390.

T. Cohen and M. Welling (2016). “Group Equivariant Convolutional Networks”. In: International Conference on Machine Learning.

T. Cohen and M. Welling (2017). “Steerable CNNs”. In: International Conference on Representation Learning.

S. Dieleman, J. De Fauw, and K. Kavukcuoglu (2016). “Exploiting Cyclic Symmetry in Convolutional Neural Networks”. In: CoRR. arXiv: 1602.02660.

M. Engelcke, A. R. Kosiorek, O. Parker Jones, and I. Posner (2019). “GENESIS: Generative Scene Inference and Sampling with Object-Centric Latent Representations”. In: CoRR. arXiv: 1907.13052.

S. M. A. Eslami, N. Heess, T. Weber, Y. Tassa, D. Szepesvari, K. Kavukcuoglu, and G. E. Hinton (2016). “Attend, Infer, Repeat: Fast Scene Understanding with Generative Models”. In: Advances in Neural Information Processing Systems. arXiv: 1603.08575.

K. Greff, R. L. Kaufmann, R. Kabra, N. Watters, C. Burgess, D. Zoran, L. Matthey, M. Botvinick, and A.
Lerchner (2019). \u201cMulti-Object Representation Learning with Iterative Variational Inference\u201d.\nIn: CoRR. arXiv: 1903.00450.\n\nM. Gutmann and A. Hyv\u00e4rinen (2010). \u201cNoise-contrastive Estimation: A New Estimation Principle\nfor Unnormalized Statistical Models\u201d. In: International Conference on Arti\ufb01cial Intelligence and\nStatistics.\n\nP. Haeusser, J. Plapp, V. Golkov, E. Aljalbout, and D. Cremers (2018). \u201cAssociative Deep Clustering:\nTraining a Classi\ufb01cation Network with No Labels\u201d. In: German Conference on Pattern Recognition.\n\nG. E. Hinton (1979). \u201cSome Demonstrations of the Effects of Structural Descriptions in Mental\n\nImagery\u201d. In: Cognitive Science 3.\n\nG. E. Hinton, A. Krizhevsky, and S. D. Wang (2011). \u201cTransforming Auto-Encoders\u201d. In: Interna-\n\ntional Conference on Arti\ufb01cal Neural Networks.\n\nG. E. Hinton, S. Sabour, and N. Frosst (2018). \u201cMatrix Capsules with EM routing\u201d. In: International\n\nConference on Learning Representations.\n\nR. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, A. Trischler, and Y. Bengio (2019).\n\u201cLearning Deep Representations by Mutual Information Estimation and Maximization\u201d. In: CoRR.\narXiv: 1808.06670.\n\nW. Hu, T. Miyato, S. Tokui, E. Matsumoto, and M. Sugiyama (2017). \u201cLearning Discrete Represen-\ntations via Information Maximizing Self-augmented Training\u201d. In: International Conference on\nMachine Learning.\n\nJ.-H. Jacobsen, B. De Brabandere, and A. W. Smeulders (2017). \u201cDynamic steerable blocks in deep\n\nresidual networks\u201d. In: CoRR. arXiv: 1706.00598.\n\nM. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu (2015). \u201cSpatial Transformer\n\nNetworks\u201d. In: Advances in Neural Information Processing Systems. arXiv: 1506.02025v1.\n\nA. Jaiswal, W. AbdAlmageed, Y. Wu, and P. Natarajan (2018). \u201cCapsulegan: Generative adversarial\n\ncapsule network\u201d. 
In: European Conference on Computer Vision.

X. Ji, J. F. Henriques, and A. Vedaldi (2018). “Invariant Information Distillation for Unsupervised Image Segmentation and Clustering”. In: CoRR. arXiv: 1807.06653.

A. Kosiorek, H. Kim, Y. W. Teh, and I. Posner (2018). “Sequential Attend, Infer, Repeat: Generative modelling of moving objects”. In: Advances in Neural Information Processing Systems. arXiv: 1806.01794.

H. W. Kuhn (1955). “The Hungarian Method for the Assignment Problem”. In: Naval Research Logistics Quarterly.

J. Lee, Y. Lee, J. Kim, A. R. Kosiorek, S. Choi, and Y. W. Teh (2019). “Set Transformer”. In: International Conference on Machine Learning. arXiv: 1810.00825.

J. E. Lenssen, M. Fey, and P. Libuschewski (2018). “Group Equivariant Capsule Networks”. In: Advances in Neural Information Processing Systems.

H. Li, X. Guo, B. Dai, W. Ouyang, and X. Wang (2018). “Neural Network Encapsulation”. In: CoRR. arXiv: 1808.03749.

L. van der Maaten and G. E. Hinton (2008). “Visualizing data using t-SNE”. In: Journal of Machine Learning Research.

C. J. Maddison, A. Mnih, and Y. W. Teh (2017). “The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables”. In: International Conference on Learning Representations.

E. Oyallon and S. Mallat (2015). “Deep Roto-Translation Scattering for Object Classification”. In: IEEE Conference on Computer Vision and Pattern Recognition.

A. Radford, L. Metz, and S. Chintala (2016). “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks”. In: International Conference on Learning Representations.

A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko (2015). “Semi-supervised Learning with Ladder Networks”.
In: Advances in Neural Information Processing Systems.

D. Rawlinson, A. Ahmed, and G. Kowadlo (2018). “Sparse Unsupervised Capsules Generalize Better”. In: CoRR. arXiv: 1804.06094.

I. Rock (1973). Orientation and form. Academic Press.

S. Sabour, N. Frosst, and G. E. Hinton (2017). “Dynamic Routing Between Capsules”. In: Advances in Neural Information Processing Systems.

T. Tieleman and G. Hinton (2012). Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.

T. Tieleman (2014). Optimizing Neural Networks That Generate Images. University of Toronto, Canada.

D. Wang and Q. Liu (2018). “An Optimization View on Dynamic Routing Between Capsules”. In: International Conference on Learning Representations Workshop.

S. Zhang, Q. Zhou, and X. Wu (2018). “Fast Dynamic Routing Based on Weighted Kernel Density Estimation”. In: International Symposium on Artificial Intelligence and Robotics.