{"title": "Dynamic Routing Between Capsules", "book": "Advances in Neural Information Processing Systems", "page_first": 3856, "page_last": 3866, "abstract": "A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: A lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.", "full_text": "Dynamic Routing Between Capsules\n\nSara Sabour\n\nNicholas Frosst\n\nGeoffrey E. Hinton\n\n{sasabour, frosst, geoffhinton}@google.com\n\nGoogle Brain\n\nToronto\n\nAbstract\n\nA capsule is a group of neurons whose activity vector represents the instantiation\nparameters of a specific type of entity such as an object or an object part. We use\nthe length of the activity vector to represent the probability that the entity exists and\nits orientation to represent the instantiation parameters. Active capsules at one level\nmake predictions, via transformation matrices, for the instantiation parameters of\nhigher-level capsules. When multiple predictions agree, a higher-level capsule\nbecomes active. 
We show that a discriminatively trained, multi-layer capsule\nsystem achieves state-of-the-art performance on MNIST and is considerably better\nthan a convolutional net at recognizing highly overlapping digits. To achieve these\nresults we use an iterative routing-by-agreement mechanism: A lower-level capsule\nprefers to send its output to higher-level capsules whose activity vectors have a big\nscalar product with the prediction coming from the lower-level capsule.\n\n1 Introduction\n\nHuman vision ignores irrelevant details by using a carefully determined sequence of fixation points\nto ensure that only a tiny fraction of the optic array is ever processed at the highest resolution.\nIntrospection is a poor guide to understanding how much of our knowledge of a scene comes from\nthe sequence of fixations and how much we glean from a single fixation, but in this paper we will\nassume that a single fixation gives us much more than just a single identified object and its properties.\nWe assume that our multi-layer visual system creates a parse tree-like structure on each fixation, and\nwe ignore the issue of how these single-fixation parse trees are coordinated over multiple fixations.\nParse trees are generally constructed on the fly by dynamically allocating memory. Following Hinton\net al. [2000], however, we shall assume that, for a single fixation, a parse tree is carved out of a fixed\nmultilayer neural network like a sculpture is carved from a rock. Each layer will be divided into many\nsmall groups of neurons called \u201ccapsules\u201d (Hinton et al. [2011]) and each node in the parse tree will\ncorrespond to an active capsule. Using an iterative routing process, each active capsule will choose a\ncapsule in the layer above to be its parent in the tree. 
For the higher levels of a visual system, this\niterative process will be solving the problem of assigning parts to wholes.\nThe activities of the neurons within an active capsule represent the various properties of a particular\nentity that is present in the image. These properties can include many different types of instantiation\nparameter such as pose (position, size, orientation), deformation, velocity, albedo, hue, texture, etc.\nOne very special property is the existence of the instantiated entity in the image. An obvious way to\nrepresent existence is by using a separate logistic unit whose output is the probability that the entity\nexists. In this paper we explore an interesting alternative which is to use the overall length of the\nvector of instantiation parameters to represent the existence of the entity and to force the orientation\nof the vector to represent the properties of the entity (footnote 1). We ensure that the length of the vector output\nof a capsule cannot exceed 1 by applying a non-linearity that leaves the orientation of the vector\nunchanged but scales down its magnitude.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nThe fact that the output of a capsule is a vector makes it possible to use a powerful dynamic routing\nmechanism to ensure that the output of the capsule gets sent to an appropriate parent in the layer\nabove. Initially, the output is routed to all possible parents but is scaled down by coupling coefficients\nthat sum to 1. For each possible parent, the capsule computes a \u201cprediction vector\u201d by multiplying its\nown output by a weight matrix. If this prediction vector has a large scalar product with the output of\na possible parent, there is top-down feedback which increases the coupling coefficient for that parent\nand decreases it for other parents. 
This increases the contribution that the capsule makes to that\nparent thus further increasing the scalar product of the capsule\u2019s prediction with the parent\u2019s output.\nThis type of \u201crouting-by-agreement\u201d should be far more effective than the very primitive form of\nrouting implemented by max-pooling, which allows neurons in one layer to ignore all but the most\nactive feature detector in a local pool in the layer below. We demonstrate that our dynamic routing\nmechanism is an effective way to implement the \u201cexplaining away\u201d that is needed for segmenting\nhighly overlapping objects.\nConvolutional neural networks (CNNs) use translated replicas of learned feature detectors. This\nallows them to translate knowledge about good weight values acquired at one position in an image\nto other positions. This has proven extremely helpful in image interpretation. Even though we are\nreplacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling\nwith routing-by-agreement, we would still like to replicate learned knowledge across space. To\nachieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make\nhigher-level capsules cover larger regions of the image. Unlike max-pooling however, we do not throw\naway information about the precise position of the entity within the region. For low level capsules,\nlocation information is \u201cplace-coded\u201d by which capsule is active. As we ascend the hierarchy,\nmore and more of the positional information is \u201crate-coded\u201d in the real-valued components of the\noutput vector of a capsule. 
This shift from place-coding to rate-coding combined with the fact that\nhigher-level capsules represent more complex entities with more degrees of freedom suggests that the\ndimensionality of capsules should increase as we ascend the hierarchy.\n\n2 How the vector inputs and outputs of a capsule are computed\n\nThere are many possible ways to implement the general idea of capsules. The aim of this paper is not\nto explore this whole space but simply to show that one fairly straightforward implementation works\nwell and that dynamic routing helps.\nWe want the length of the output vector of a capsule to represent the probability that the entity\nrepresented by the capsule is present in the current input. We therefore use a non-linear \"squashing\"\nfunction to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a\nlength slightly below 1. We leave it to discriminative learning to make good use of this non-linearity.\n\nv_j = \\frac{\\|s_j\\|^2}{1 + \\|s_j\\|^2} \\frac{s_j}{\\|s_j\\|}    (1)\n\nwhere v_j is the vector output of capsule j and s_j is its total input.\nFor all but the first layer of capsules, the total input to a capsule s_j is a weighted sum over all\n\u201cprediction vectors\u201d \\hat{u}_{j|i} from the capsules in the layer below and is produced by multiplying the\noutput u_i of a capsule in the layer below by a weight matrix W_ij\n\ns_j = \\sum_i c_{ij} \\hat{u}_{j|i} , \\qquad \\hat{u}_{j|i} = W_{ij} u_i    (2)\n\nwhere the c_ij are coupling coefficients that are determined by the iterative dynamic routing process.\nThe coupling coefficients between capsule i and all the capsules in the layer above sum to 1 and are\ndetermined by a \u201crouting softmax\u201d whose initial logits b_ij are the log prior probabilities that capsule i\nshould be coupled to capsule j.\n\nc_{ij} = \\frac{\\exp(b_{ij})}{\\sum_k \\exp(b_{ik})}    (3)\n\nFootnote 1: This makes biological sense as it does not use large activities to get accurate representations of things that\nprobably don\u2019t exist.\n\n
The log priors can be learned discriminatively at the same time as all the other weights. They depend\non the location and type of the two capsules but not on the current input image (footnote 2). The initial coupling\ncoefficients are then iteratively refined by measuring the agreement between the current output v_j of\neach capsule, j, in the layer above and the prediction \\hat{u}_{j|i} made by capsule i.\nThe agreement is simply the scalar product a_ij = v_j . \\hat{u}_{j|i}. This agreement is treated as if it was a log\nlikelihood and is added to the initial logit, b_ij before computing the new values for all the coupling\ncoefficients linking capsule i to higher level capsules.\nIn convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in\nthe layer above using different transformation matrices for each member of the grid as well as for\neach type of capsule.\n\nProcedure 1 Routing algorithm.\nprocedure ROUTING(\\hat{u}_{j|i}, r, l)\n    for all capsule i in layer l and capsule j in layer (l + 1): b_ij \u2190 0.\n    for r iterations do\n        for all capsule i in layer l: c_i \u2190 softmax(b_i)    \u25b7 softmax computes Eq. 3\n        for all capsule j in layer (l + 1): s_j \u2190 \\sum_i c_ij \\hat{u}_{j|i}\n        for all capsule j in layer (l + 1): v_j \u2190 squash(s_j)    \u25b7 squash computes Eq. 1\n        for all capsule i in layer l and capsule j in layer (l + 1): b_ij \u2190 b_ij + \\hat{u}_{j|i} . v_j\n    return v_j\n\n3 Margin loss for digit existence\n\nWe are using the length of the instantiation vector to represent the probability that a capsule\u2019s entity\nexists. We would like the top-level capsule for digit class k to have a long instantiation vector if and\nonly if that digit is present in the image. 
To allow for multiple digits, we use a separate margin loss,\nL_k for each digit capsule, k:\n\nL_k = T_k \\max(0, m^+ - \\|v_k\\|)^2 + \\lambda (1 - T_k) \\max(0, \\|v_k\\| - m^-)^2    (4)\n\nwhere T_k = 1 iff a digit of class k is present (footnote 3) and m^+ = 0.9 and m^- = 0.1. The \u03bb down-weighting\nof the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity\nvectors of all the digit capsules. We use \u03bb = 0.5. The total loss is simply the sum of the losses of all\ndigit capsules.\n\nFootnote 2: For MNIST we found that it was sufficient to set all of these priors to be equal.\nFootnote 3: We do not allow an image to contain two instances of the same digit class. We address this weakness of\ncapsules in the discussion section.\n\n4 CapsNet architecture\n\nA simple CapsNet architecture is shown in Fig. 1. The architecture is shallow with only two\nconvolutional layers and one fully connected layer. Conv1 has 256, 9 \u00d7 9 convolution kernels with a\nstride of 1 and ReLU activation. This layer converts pixel intensities to the activities of local feature\ndetectors that are then used as inputs to the primary capsules.\nThe primary capsules are the lowest level of multi-dimensional entities and, from an inverse graphics\nperspective, activating the primary capsules corresponds to inverting the rendering process. This is a\nvery different type of computation than piecing instantiated parts together to make familiar wholes,\nwhich is what capsules are designed to be good at.\nThe second layer (PrimaryCapsules) is a convolutional capsule layer with 32 channels of convolutional\n8D capsules (i.e. each primary capsule contains 8 convolutional units with a 9 \u00d7 9 kernel and a stride\nof 2). Each primary capsule output sees the outputs of all 256 \u00d7 81 Conv1 units whose receptive\nfields overlap with the location of the center of the capsule. In total PrimaryCapsules has [32 \u00d7 6 \u00d7 6]\ncapsule outputs (each output is an 8D vector) and the capsules in the [6 \u00d7 6] grid share their\nweights with each other. One can see PrimaryCapsules as a Convolution layer with Eq. 1 as its block\nnon-linearity. The final layer (DigitCaps) has one 16D capsule per digit class and each of these\ncapsules receives input from all the capsules in the layer below.\n\nFigure 1: A simple CapsNet with 3 layers. This model gives comparable results to deep convolutional\nnetworks (such as Chang and Chen [2015]). The length of the activity vector of each capsule\nin the DigitCaps layer indicates the presence of an instance of each class and is used to calculate the\nclassification loss. W_ij is a weight matrix between each u_i, i \u2208 (1, 32 \u00d7 6 \u00d7 6) in PrimaryCapsules\nand v_j, j \u2208 (1, 10).\n\nFigure 2: Decoder structure to reconstruct a digit from the DigitCaps layer representation. The\nEuclidean distance between the image and the output of the Sigmoid layer is minimized during\ntraining. We use the true label as reconstruction target during training.\n\n
We have routing only between two consecutive capsule layers (e.g. PrimaryCapsules and DigitCaps).\nSince Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used\nbetween Conv1 and PrimaryCapsules. All the routing logits (b_ij) are initialized to zero. Therefore,\ninitially a capsule output (u_i) is sent to all parent capsules (v_0...v_9) with equal probability (c_ij).\nOur implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma\nand Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning\nrate, to minimize the sum of the margin losses in Eq. 4.\n\n4.1 Reconstruction as a regularization method\n\nWe use an additional reconstruction loss to encourage the digit capsules to encode the instantiation\nparameters of the input digit. 
During training, we mask out all but the activity vector of the correct\ndigit capsule. Then we use this activity vector to reconstruct the input image. The output of the digit\ncapsule is fed into a decoder consisting of 3 fully connected layers that model the pixel intensities as\ndescribed in Fig. 2. We minimize the sum of squared differences between the outputs of the logistic\nunits and the pixel intensities. We scale down this reconstruction loss by 0.0005 so that it does not\ndominate the margin loss during training. As illustrated in Fig. 3 the reconstructions from the 16D\noutput of the CapsNet are robust while keeping only important details.\n\nFigure 3: Sample MNIST test reconstructions of a CapsNet with 3 routing iterations. (l, p, r)\nrepresents the label, the prediction and the reconstruction target respectively. The two rightmost\ncolumns show two reconstructions of a failure example and explain how the model confuses a\n5 and a 3 in this image. The other columns are from correct classifications and show that the model\npreserves many of the details while smoothing the noise.\nColumns (l, p, r): (2, 2, 2), (5, 5, 5), (8, 8, 8), (9, 9, 9), (5, 3, 5), (5, 3, 3). Rows: Input, Output.\n\nTable 1: CapsNet classification test error. The MNIST average and standard deviation results are\nreported from 3 trials.\n\nMethod | Routing | Reconstruction | MNIST (%) | MultiMNIST (%)\nBaseline | - | - | 0.39 | 8.1\nCapsNet | 1 | no | 0.34\u00b10.032 | -\nCapsNet | 1 | yes | 0.29\u00b10.011 | 7.5\nCapsNet | 3 | no | 0.35\u00b10.036 | -\nCapsNet | 3 | yes | 0.25\u00b10.005 | 5.2\n\n5 Capsules on MNIST\n\nTraining is performed on 28 \u00d7 28 MNIST (LeCun et al. [1998]) images that have been shifted by up\nto 2 pixels in each direction with zero padding. No other data augmentation/deformation is used. The\ndataset has 60K and 10K images for training and testing respectively.\nWe test using a single model without any model averaging. Wan et al. 
[2013] achieves 0.21% test\nerror with ensembling and augmenting the data with rotation and scaling. They achieve 0.39%\nwithout them. We get a low test error (0.25%) on a 3 layer network previously only achieved by\ndeeper networks. Tab. 1 reports the test error rate on MNIST for different CapsNet setups and shows\nthe importance of routing and reconstruction regularizer. Adding the reconstruction regularizer boosts\nthe routing performance by enforcing the pose encoding in the capsule vector.\nThe baseline is a standard CNN with three convolutional layers of 256, 256, 128 channels. Each has\n5x5 kernels and stride of 1. The last convolutional layers are followed by two fully connected layers\nof size 328, 192. The last fully connected layer is connected with dropout to a 10 class softmax layer\nwith cross entropy loss. The baseline is also trained on 2-pixel shifted MNIST with Adam optimizer.\nThe baseline is designed to achieve the best performance on MNIST while keeping the computation\ncost as close as to CapsNet. In terms of number of parameters the baseline has 35.4M while CapsNet\nhas 8.2M parameters and 6.8M parameters without the reconstruction subnetwork.\n\n5.1 What the individual dimensions of a capsule represent\n\nSince we are passing the encoding of only one digit and zeroing out other digits, the dimensions of a\ndigit capsule should learn to span the space of variations in the way digits of that class are instantiated.\nThese variations include stroke thickness, skew and width. They also include digit-speci\ufb01c variations\nsuch as the length of the tail of a 2. We can see what the individual dimensions represent by making\nuse of the decoder network. After computing the activity vector for the correct digit capsule, we can\nfeed a perturbed version of this activity vector to the decoder network and see how the perturbation\naffects the reconstruction. Examples of these perturbations are shown in Fig. 4. 
We found that one\ndimension (out of 16) of the capsule almost always represents the width of the digit. While some\ndimensions represent combinations of global variations, there are other dimensions that represent\nvariation in a localized part of the digit. For example, different dimensions are used for the length of\nthe ascender of a 6 and the size of the loop.\n\nFigure 4: Dimension perturbations. Each row shows the reconstruction when one of the 16 dimensions\nin the DigitCaps representation is tweaked by intervals of 0.05 in the range [\u22120.25, 0.25].\nRow labels: Scale and thickness; Localized part; Stroke thickness; Localized skew; Width and translation; Localized part.\n\n5.2 Robustness to Affine Transformations\n\nExperiments show that each DigitCaps capsule learns a more robust representation for each class\nthan a traditional convolutional network. Because there is natural variance in skew, rotation, style, etc.\nin handwritten digits, the trained CapsNet is moderately robust to small affine transformations of the\ntraining data.\nTo test the robustness of CapsNet to affine transformations, we trained a CapsNet and a traditional\nconvolutional network (with MaxPooling and DropOut) on a padded and translated MNIST training\nset, in which each example is an MNIST digit placed randomly on a black background of 40 \u00d7 40\npixels. We then tested these networks on the affNIST (footnote 4) data set, in which each example is an MNIST digit\nwith a random small affine transformation. Our models were never trained with affine transformations\nother than translation and any natural transformation seen in the standard MNIST. An under-trained\nCapsNet with early stopping which achieved 99.23% accuracy on the expanded MNIST test set\nachieved 79% accuracy on the affNIST test set. 
A traditional convolutional model with a similar\nnumber of parameters which achieved similar accuracy (99.22%) on the expanded MNIST test set only\nachieved 66% on the affNIST test set.\n\nFootnote 4: Available at http://www.cs.toronto.edu/~tijmen/affNIST/.\n\n6 Segmenting highly overlapping digits\n\nDynamic routing can be viewed as a parallel attention mechanism that allows each capsule at one\nlevel to attend to some active capsules at the level below and to ignore others. This should allow\nthe model to recognize multiple objects in the image even if objects overlap. Hinton et al. propose\nthe task of segmenting and recognizing highly overlapping digits (Hinton et al. [2000]) and others\nhave tested their networks in a similar domain (Goodfellow et al. [2013], Ba et al. [2014], Greff et al.\n[2016]). The routing-by-agreement should make it possible to use a prior about the shape of objects\nto help segmentation and it should obviate the need to make higher-level segmentation decisions in\nthe domain of pixels.\n\n6.1 MultiMNIST dataset\n\nWe generate the MultiMNIST training and test dataset by overlaying a digit on top of another digit\nfrom the same set (training or test) but different class. Each digit is shifted up to 4 pixels in each\ndirection resulting in a 36 \u00d7 36 image. Considering a digit in a 28 \u00d7 28 image is bounded in a 20 \u00d7 20\nbox, two digits\u2019 bounding boxes on average have 80% overlap. For each digit in the MNIST dataset\nwe generate 1K MultiMNIST examples. So the training set size is 60M and the test set size is 10M.\n\nFigure 5: Sample reconstructions of a CapsNet with 3 routing iterations on MultiMNIST test dataset.\nThe two reconstructed digits are overlaid in green and red as the lower image. The upper image\nshows the input image. L:(l1, l2) represents the label for the two digits in the image and R:(r1, r2)\nrepresents the two digits used for reconstruction. 
The two rightmost columns show two examples\nwith wrong classification reconstructed from the label and from the prediction (P). In the (2, 8)\nexample the model confuses 8 with a 7 and in (4, 9) it confuses 9 with 0. The other columns have\ncorrect classifications and show that the model accounts for all the pixels while being able to assign\none pixel to two digits in extremely difficult scenarios (columns 1-4). Note that in dataset generation\nthe pixel values are clipped at 1. The two columns with the (*) mark show reconstructions from a\ndigit that is neither the label nor the prediction. These columns suggest that the model is not just\nfinding the best fit for all the digits in the image including the ones that do not exist. Therefore in case\nof (5, 0) it cannot reconstruct a 7 because it knows that there is a 5 and 0 that fit best and account for\nall the pixels. Also, in case of (8, 1) the loop of 8 has not triggered 0 because it is already accounted\nfor by 8. Therefore it will not assign one pixel to two digits if one of them does not have any other\nsupport.\nPanel labels (reconstruction / label): *R:(5, 7) / L:(5, 0); *R:(2, 3) / L:(4, 3); R:(2, 8) / L:(2, 8); R:P:(2, 7) / L:(2, 8); R:(2, 7) / L:(2, 7); R:(6, 0) / L:(6, 0); R:(6, 8) / L:(6, 8); R:(7, 1) / L:(7, 1); R:(8, 7) / L:(8, 7); R:(9, 4) / L:(9, 4); R:(9, 5) / L:(9, 5); R:(8, 4) / L:(8, 4); *R:(0, 8) / L:(1, 8); *R:(1, 6) / L:(7, 6); R:(4, 9) / L:(4, 9); R:P:(4, 0) / L:(4, 9).\n\n6.2 MultiMNIST results\n\nOur 3-layer CapsNet model trained from scratch on MultiMNIST training data achieves higher\ntest classification accuracy than our baseline convolutional model. We are achieving the same\nclassification error rate of 5.0% on highly overlapping digit pairs as the sequential attention model of\nBa et al. [2014] achieves on a much easier task that has far less overlap (80% overlap of the boxes\naround the two digits in our case vs < 4% for Ba et al. [2014]). 
On test images, which are composed\nof pairs of images from the test set, we treat the two most active digit capsules as the classi\ufb01cation\nproduced by the capsules network. During reconstruction we pick one digit at a time and use the\nactivity vector of the chosen digit capsule to reconstruct the image of the chosen digit (we know this\nimage because we used it to generate the composite image). The only difference with our MNIST\nmodel is that we increased the period of the decay step for the learning rate to be 10\u00d7 larger because\nthe training dataset is larger.\nThe reconstructions illustrated in Fig. 5 show that CapsNet is able to segment the image into the\ntwo original digits. Since this segmentation is not at pixel level we observe that the model is able to\ndeal correctly with the overlaps (a pixel is on in both digits) while accounting for all the pixels. The\nposition and the style of each digit is encoded in DigitCaps. The decoder has learned to reconstruct\na digit given the encoding. The fact that it is able to reconstruct digits regardless of the overlap\nshows that each digit capsule can pick up the style and position from the votes it is receiving from\nPrimaryCapsules layer.\n\n7\n\n\fTab. 1 emphasizes the importance of capsules with routing on this task. As a baseline for the\nclassi\ufb01cation of CapsNet accuracy we trained a convolution network with two convolution layers and\ntwo fully connected layers on top of them. The \ufb01rst layer has 512 convolution kernels of size 9 \u00d7 9\nand stride 1. The second layer has 256 kernels of size 5\u00d7 5 and stride 1. After each convolution layer\nthe model has a pooling layer of size 2 \u00d7 2 and stride 2. The third layer is a 1024D fully connected\nlayer. All three layers have ReLU non-linearities. The \ufb01nal layer of 10 units is fully connected. 
We\nuse the TensorFlow default Adam optimizer (Kingma and Ba [2014]) to train a sigmoid cross entropy\nloss on the output of \ufb01nal layer. This model has 24.56M parameters which is 2 times more parameters\nthan CapsNet with 11.36M parameters. We started with a smaller CNN (32 and 64 convolutional\nkernels of 5 \u00d7 5 and stride of 1 and a 512D fully connected layer) and incrementally increased the\nwidth of the network until we reached the best test accuracy on a 10K subset of the MultiMNIST\ndata. We also searched for the right decay step on the 10K validation set.\nWe decode the two most active DigitCaps capsules one at a time and get two images. Then by\nassigning any pixel with non-zero intensity to each digit we get the segmentation results for each\ndigit.\n\n7 Other datasets\n\nWe tested our capsule model on CIFAR10 and achieved 10.6% error with an ensemble of 7 models\neach of which is trained with 3 routing iterations on 24 \u00d7 24 patches of the image. Each model\nhas the same architecture as the simple model we used for MNIST except that there are three color\nchannels and we used 64 different types of primary capsule. We also found that it helped to introduce\na \"none-of-the-above\" category for the routing softmaxes, since we do not expect the \ufb01nal layer of\nten capsules to explain everything in the image. 10.6% test error is about what standard convolutional\nnets achieved when they were \ufb01rst applied to CIFAR10 (Zeiler and Fergus [2013]).\nOne drawback of Capsules which it shares with generative models is that it likes to account for\neverything in the image so it does better when it can model the clutter than when it just uses an\nadditional \u201corphan\u201d category in the dynamic routing. 
In CIFAR-10, the backgrounds are much too\nvaried to model in a reasonable sized net which helps to account for the poorer performance.\nWe also tested the exact same architecture as we used for MNIST on smallNORB (LeCun et al.\n[2004]) and achieved 2.7% test error rate, which is on-par with the state-of-the-art (Cire\u00b8san et al.\n[2011]). The smallNORB dataset consists of 96x96 stereo grey-scale images. We resized the images\nto 48x48 and during training processed random 32x32 crops of them. We passed the central 32x32\npatch during test.\nWe also trained a smaller network on the small training set of SVHN (Netzer et al. [2011]) with\nonly 73257 images. We reduced the number of \ufb01rst convolutional layer channels to 64, the primary\ncapsule layer to 16 6D-capsules with 8D \ufb01nal capsule layer at the end and achieved 4.3% on the test\nset.\n\n8 Discussion and previous work\n\nFor thirty years, the state-of-the-art in speech recognition used hidden Markov models with Gaussian\nmixtures as output distributions. These models were easy to learn on small computers, but they\nhad a representational limitation that was ultimately fatal: The one-of-n representations they use\nare exponentially inef\ufb01cient compared with, say, a recurrent neural network that uses distributed\nrepresentations. To double the amount of information that an HMM can remember about the string it\nhas generated so far, we need to square the number of hidden nodes. For a recurrent net we only need\nto double the number of hidden neurons.\nNow that convolutional neural networks have become the dominant approach to object recognition, it\nmakes sense to ask whether there are any exponential inef\ufb01ciencies that may lead to their demise. A\ngood candidate is the dif\ufb01culty that convolutional nets have in generalizing to novel viewpoints. 
The\nability to deal with translation is built in, but for the other dimensions of an affine transformation\nwe have to choose between replicating feature detectors on a grid that grows exponentially with the\nnumber of dimensions, or increasing the size of the labelled training set in a similarly exponential way.\nCapsules (Hinton et al. [2011]) avoid these exponential inefficiencies by converting pixel intensities\ninto vectors of instantiation parameters of recognized fragments and then applying transformation\nmatrices to the fragments to predict the instantiation parameters of larger fragments. Transformation\nmatrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute\nviewpoint invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011]\nproposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule\nlayer and their system required transformation matrices to be supplied externally. We propose a\ncomplete system that also answers \"how larger and more complex visual entities can be recognized\nby using agreements of the poses predicted by active, lower-level capsules\".\nCapsules make a very strong representational assumption: At each location in the image, there is\nat most one instance of the type of entity that a capsule represents. This assumption, which was\nmotivated by the perceptual phenomenon called \"crowding\" (Pelli et al. [2004]), eliminates the\nbinding problem (Hinton [1981a]) and allows a capsule to use a distributed representation (its activity\nvector) to encode the instantiation parameters of the entity of that type at a given location. 
This\ndistributed representation is exponentially more efficient than encoding the instantiation parameters\nby activating a point on a high-dimensional grid and with the right distributed representation, capsules\ncan then take full advantage of the fact that spatial relationships can be modelled by matrix multiplies.\nCapsules use neural activities that vary as viewpoint varies rather than trying to eliminate viewpoint\nvariation from the activities. This gives them an advantage over \"normalization\" methods like\nspatial transformer networks (Jaderberg et al. [2015]): They can deal with multiple different affine\ntransformations of different objects or object parts at the same time.\nCapsules are also very good for dealing with segmentation, which is another of the toughest problems\nin vision, because the vector of instantiation parameters allows them to use routing-by-agreement, as\nwe have demonstrated in this paper. The importance of the dynamic routing procedure is also backed by\nbiologically plausible models of invariant pattern recognition in the visual cortex. Hinton [1981b]\nproposes dynamic connections and canonical object based frames of reference to generate shape\ndescriptions that can be used for object recognition. Olshausen et al. [1993] improve upon the dynamic\nconnections of Hinton [1981b] and present a biologically plausible, position and scale invariant model\nof object representations.\nResearch on capsules is now at a similar stage to research on recurrent neural networks for speech\nrecognition at the beginning of this century. There are fundamental representational reasons for\nbelieving that it is a better approach but it probably requires a lot more small insights before it can\nout-perform a highly developed technology. The fact that a simple capsules system already gives\nunparalleled performance at segmenting overlapping digits is an early indication that capsules are a\ndirection worth exploring.\nAcknowledgement. 
Of the many who provided us with constructive comments, we are especially\ngrateful to Robert Gens, Eric Langlois, Vincent Vanhoucke, Chris Williams, and the reviewers for\ntheir fruitful comments and corrections.\n\nReferences\n\nMart\u00edn Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S\nCorrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine\nlearning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.\n\nJimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual\nattention. arXiv preprint arXiv:1412.7755, 2014.\n\nJia-Ren Chang and Yong-Sheng Chen. Batch-normalized maxout network in network. arXiv preprint\narXiv:1511.02583, 2015.\n\nDan C Cire\u015fan, Ueli Meier, Jonathan Masci, Luca M Gambardella, and J\u00fcrgen Schmidhuber.\nHigh-performance neural networks for visual object classification. arXiv preprint arXiv:1102.0183,\n2011.\n\nIan J Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet. Multi-digit number\nrecognition from street view imagery using deep convolutional neural networks. arXiv preprint\narXiv:1312.6082, 2013.\n\nKlaus Greff, Antti Rasmus, Mathias Berglund, Tele Hao, Harri Valpola, and J\u00fcrgen Schmidhuber.\nTagger: Deep unsupervised perceptual grouping. In Advances in Neural Information Processing\nSystems, pages 4484\u20134492, 2016.\n\nGeoffrey E Hinton. Shape representation in parallel systems. In International Joint Conference on\nArtificial Intelligence Vol 2, 1981a.\n\nGeoffrey E Hinton. A parallel computation that assigns canonical object-based frames of reference.\nIn Proceedings of the 7th international joint conference on Artificial intelligence-Volume 2, pages\n683\u2013685. Morgan Kaufmann Publishers Inc., 1981b.\n\nGeoffrey E Hinton, Zoubin Ghahramani, and Yee Whye Teh. Learning to parse images. 
In Advances\nin Neural Information Processing Systems, pages 463\u2013469, 2000.\n\nGeoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. Transforming auto-encoders. In International\nConference on Artificial Neural Networks, pages 44\u201351. Springer, 2011.\n\nMax Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer\nnetworks. In Advances in Neural Information Processing Systems, pages 2017\u20132025, 2015.\n\nDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\nYann LeCun, Corinna Cortes, and Christopher JC Burges. The MNIST database of handwritten digits,\n1998.\n\nYann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic object recognition with\ninvariance to pose and lighting. In Computer Vision and Pattern Recognition, 2004. CVPR 2004.\nProceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II\u2013104. IEEE,\n2004.\n\nYuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading\ndigits in natural images with unsupervised feature learning. In NIPS workshop on deep learning\nand unsupervised feature learning, volume 2011, page 5, 2011.\n\nBruno A Olshausen, Charles H Anderson, and David C Van Essen. A neurobiological model of visual\nattention and invariant pattern recognition based on dynamic routing of information. Journal of\nNeuroscience, 13(11):4700\u20134719, 1993.\n\nDenis G Pelli, Melanie Palomares, and Najib J Majaj. Crowding is unlike ordinary masking:\nDistinguishing feature integration from detection. Journal of Vision, 4(12):12\u201312, 2004.\n\nLi Wan, Matthew D Zeiler, Sixin Zhang, Yann LeCun, and Rob Fergus. Regularization of neural\nnetworks using dropconnect. In Proceedings of the 30th International Conference on Machine\nLearning (ICML-13), pages 1058\u20131066, 2013.\n\nMatthew D Zeiler and Rob Fergus. 
Stochastic pooling for regularization of deep convolutional neural\nnetworks. arXiv preprint arXiv:1301.3557, 2013.\n\nA How many routing iterations to use?\n\nTo verify the convergence of the routing algorithm experimentally, we plot the average change\nin the routing logits at each routing iteration. Fig. A.1 shows the average change in b_ij after each routing\niteration. Experimentally, we observe that from the start of training there is negligible change in the\nrouting by iteration 5. The average change in the 2nd pass of the routing settles down to 0.007 after\n500 epochs of training, while at routing iteration 5 the logits change by only 1e-5 on average.\n\nFigure A.1: Average change of each routing logit (b_ij) by each routing iteration. After 500 epochs of\ntraining on MNIST the average change has stabilized and, as shown in the right-hand panel, it decreases almost\nlinearly in log scale with more routing iterations.\n\n(a) During training.\n\n(b) Log scale of final differences.\n\nWe observed that, in general, more routing iterations increase the network capacity and tend to overfit\nthe training dataset. Fig. A.2 shows a comparison of Capsule training loss on CIFAR10 when trained\nwith 1 iteration of routing vs 3 iterations of routing. Motivated by Fig. A.2 and Fig. A.1, we suggest 3\niterations of routing for all experiments.\n\nFigure A.2: Training loss of CapsuleNet on the CIFAR10 dataset. The batch size at each training step is 128.\nThe CapsuleNet with 3 iterations of routing optimizes the loss faster and converges to a lower loss at\nthe end.\n", "award": [], "sourceid": 2100, "authors": [{"given_name": "Sara", "family_name": "Sabour", "institution": "Google"}, {"given_name": "Nicholas", "family_name": "Frosst", "institution": "Google"}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": "Google & University of Toronto"}]}