{"title": "Spatial Transformer Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2017, "page_last": 2025, "abstract": "Convolutional Neural Networks define an exceptionallypowerful class of model, but are still limited by the lack of abilityto be spatially invariant to the input data in a computationally and parameterefficient manner. In this work we introduce a new learnable module, theSpatial Transformer, which explicitly allows the spatial manipulation ofdata within the network. This differentiable module can be insertedinto existing convolutional architectures, giving neural networks the ability toactively spatially transform feature maps, conditional on the feature map itself,without any extra training supervision or modification to the optimisation process. We show that the useof spatial transformers results in models which learn invariance to translation,scale, rotation and more generic warping, resulting in state-of-the-artperformance on several benchmarks, and for a numberof classes of transformations.", "full_text": "Spatial Transformer Networks\n\nMax Jaderberg\n\nKaren Simonyan\n\nAndrew Zisserman\n\nKoray Kavukcuoglu\n\nGoogle DeepMind, London, UK\n\n{jaderberg,simonyan,zisserman,korayk}@google.com\n\nAbstract\n\nConvolutional Neural Networks de\ufb01ne an exceptionally powerful class of models,\nbut are still limited by the lack of ability to be spatially invariant to the input data\nin a computationally and parameter ef\ufb01cient manner. In this work we introduce a\nnew learnable module, the Spatial Transformer, which explicitly allows the spa-\ntial manipulation of data within the network. This differentiable module can be\ninserted into existing convolutional architectures, giving neural networks the abil-\nity to actively spatially transform feature maps, conditional on the feature map\nitself, without any extra training supervision or modi\ufb01cation to the optimisation\nprocess. 
We show that the use of spatial transformers results in models which learn invariance to translation, scale, rotation and more generic warping, resulting in state-of-the-art performance on several benchmarks, and for a number of classes of transformations.

1 Introduction

Over recent years, the landscape of computer vision has been drastically altered and pushed forward through the adoption of a fast, scalable, end-to-end learning framework, the Convolutional Neural Network (CNN) [18]. Though not a recent invention, we now see a cornucopia of CNN-based models achieving state-of-the-art results in classification, localisation, semantic segmentation, and action recognition tasks, amongst others.

A desirable property of a system which is able to reason about images is to disentangle object pose and part deformation from texture and shape. The introduction of local max-pooling layers in CNNs has helped to satisfy this property by allowing a network to be somewhat spatially invariant to the position of features. However, due to the typically small spatial support for max-pooling (e.g. 2 × 2 pixels) this spatial invariance is only realised over a deep hierarchy of max-pooling and convolutions, and the intermediate feature maps (convolutional layer activations) in a CNN are not actually invariant to large transformations of the input data [5, 19]. This limitation of CNNs is due to having only a limited, pre-defined pooling mechanism for dealing with variations in the spatial arrangement of data.

In this work we introduce the Spatial Transformer module, which can be included in a standard neural network architecture to provide spatial transformation capabilities. The action of the spatial transformer is conditioned on individual data samples, with the appropriate behaviour learnt during training for the task in question (without extra supervision).
Unlike pooling layers, where the receptive fields are fixed and local, the spatial transformer module is a dynamic mechanism that can actively spatially transform an image (or a feature map) by producing an appropriate transformation for each input sample. The transformation is then performed on the entire feature map (non-locally) and can include scaling, cropping, rotations, as well as non-rigid deformations. This allows networks which include spatial transformers to not only select regions of an image that are most relevant (attention), but also to transform those regions to a canonical, expected pose to simplify inference in the subsequent layers. Notably, spatial transformers can be trained with standard back-propagation, allowing for end-to-end training of the models they are injected in.

Figure 1: The result of using a spatial transformer as the first layer of a fully-connected network trained for distorted MNIST digit classification. (a) The input to the spatial transformer network is an image of an MNIST digit that is distorted with random translation, scale, rotation, and clutter. (b) The localisation network of the spatial transformer predicts a transformation to apply to the input image. (c) The output of the spatial transformer, after applying the transformation. (d) The classification prediction produced by the subsequent fully-connected network on the output of the spatial transformer.
The spatial transformer network (a CNN including a spatial transformer module) is trained end-to-end with only class labels – no knowledge of the groundtruth transformations is given to the system.

Spatial transformers can be incorporated into CNNs to benefit multifarious tasks, for example: (i) image classification: suppose a CNN is trained to perform multi-way classification of images according to whether they contain a particular digit – where the position and size of the digit may vary significantly with each sample (and are uncorrelated with the class); a spatial transformer that crops out and scale-normalizes the appropriate region can simplify the subsequent classification task, and lead to superior classification performance, see Fig. 1; (ii) co-localisation: given a set of images containing different instances of the same (but unknown) class, a spatial transformer can be used to localise them in each image; (iii) spatial attention: a spatial transformer can be used for tasks requiring an attention mechanism, such as in [11, 29], but is more flexible and can be trained purely with backpropagation without reinforcement learning. A key benefit of using attention is that transformed (and so attended), lower resolution inputs can be used in favour of higher resolution raw inputs, resulting in increased computational efficiency.

The rest of the paper is organised as follows: Sect. 2 discusses some work related to our own, we introduce the formulation and implementation of the spatial transformer in Sect. 3, and finally give the results of experiments in Sect. 4.
Additional experiments and implementation details are given in the supplementary material or can be found in the arXiv version.

2 Related Work

In this section we discuss the prior work related to the paper, covering the central ideas of modelling transformations with neural networks [12, 13, 27], learning and analysing transformation-invariant representations [3, 5, 8, 17, 19, 25], as well as attention and detection mechanisms for feature selection [1, 6, 9, 11, 23].

Early work by Hinton [12] looked at assigning canonical frames of reference to object parts, a theme which recurred in [13] where 2D affine transformations were modeled to create a generative model composed of transformed parts. The targets of the generative training scheme are the transformed input images, with the transformations between input images and targets given as an additional input to the network. The result is a generative model which can learn to generate transformed images of objects by composing parts. The notion of a composition of transformed parts is taken further by Tieleman [27], where learnt parts are explicitly affine-transformed, with the transform predicted by the network. Such generative capsule models are able to learn discriminative features for classification from transformation supervision.

The invariance and equivariance of CNN representations to input image transformations are studied in [19] by estimating the linear relationships between representations of the original and transformed images. Cohen & Welling [5] analyse this behaviour in relation to symmetry groups, which is also exploited in the architecture proposed by Gens & Domingos [8], resulting in feature maps that are more invariant to symmetry groups. Other attempts to design transformation invariant representations are scattering networks [3], and CNNs that construct filter banks of transformed filters [17, 25]. Stollenga et al.
[26] use a policy based on a network's activations to gate the responses of the network's filters for a subsequent forward pass of the same image and so can allow attention to specific features. In this work, we aim to achieve invariant representations by manipulating the data rather than the feature extractors, something that was done for clustering in [7].

Neural networks with selective attention manipulate the data by taking crops, and so are able to learn translation invariance. Works such as [1, 23] are trained with reinforcement learning to avoid the need for a differentiable attention mechanism, while [11] use a differentiable attention mechanism by utilising Gaussian kernels in a generative model. The work by Girshick et al. [9] uses a region proposal algorithm as a form of attention, and [6] show that it is possible to regress salient regions with a CNN. The framework we present in this paper can be seen as a generalisation of differentiable attention to any spatial transformation.

Figure 2: The architecture of a spatial transformer module. The input feature map U is passed to a localisation network which regresses the transformation parameters θ. The regular spatial grid G over V is transformed to the sampling grid T_θ(G), which is applied to U as described in Sect. 3.3, producing the warped output feature map V. The combination of the localisation network and sampling mechanism defines a spatial transformer.

3 Spatial Transformers

In this section we describe the formulation of a spatial transformer. This is a differentiable module which applies a spatial transformation to a feature map during a single forward pass, where the transformation is conditioned on the particular input, producing a single output feature map. For multi-channel inputs, the same warping is applied to each channel.
For simplicity, in this section we consider single transforms and single outputs per transformer; however, we can generalise to multiple transformations, as shown in experiments.

The spatial transformer mechanism is split into three parts, shown in Fig. 2. In order of computation, first a localisation network (Sect. 3.1) takes the input feature map, and through a number of hidden layers outputs the parameters of the spatial transformation that should be applied to the feature map – this gives a transformation conditional on the input. Then, the predicted transformation parameters are used to create a sampling grid, which is a set of points where the input map should be sampled to produce the transformed output. This is done by the grid generator, described in Sect. 3.2. Finally, the feature map and the sampling grid are taken as inputs to the sampler, producing the output map sampled from the input at the grid points (Sect. 3.3).

The combination of these three components forms a spatial transformer and will now be described in more detail in the following sections.

3.1 Localisation Network

The localisation network takes the input feature map U ∈ R^{H×W×C} with width W, height H and C channels and outputs θ, the parameters of the transformation T_θ to be applied to the feature map: θ = f_loc(U). The size of θ can vary depending on the transformation type that is parameterised, e.g. for an affine transformation θ is 6-dimensional as in (1).

The localisation network function f_loc() can take any form, such as a fully-connected network or a convolutional network, but should include a final regression layer to produce the transformation parameters θ.

3.2 Parameterised Sampling Grid

To perform a warping of the input feature map, each output pixel is computed by applying a sampling kernel centered at a particular location in the input feature map (this is described fully in the next section).
By pixel we refer to an element of a generic feature map, not necessarily an image. In general, the output pixels are defined to lie on a regular grid G = {G_i} of pixels G_i = (x_i^t, y_i^t), forming an output feature map V ∈ R^{H′×W′×C}, where H′ and W′ are the height and width of the grid, and C is the number of channels, which is the same in the input and output.

For clarity of exposition, assume for the moment that T_θ is a 2D affine transformation A_θ. We will discuss other transformations below. In this affine case, the pointwise transformation is

\[
\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix}
= \mathcal{T}_\theta(G_i)
= \mathrm{A}_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
= \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix}
\begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}
\tag{1}
\]

Figure 3: Two examples of applying the parameterised sampling grid to an image U producing the output V. (a) The sampling grid is the regular grid G = T_I(G), where I is the identity transformation parameters. (b) The sampling grid is the result of warping the regular grid with an affine transformation T_θ(G).

where (x_i^t, y_i^t) are the target coordinates of the regular grid in the output feature map, (x_i^s, y_i^s) are the source coordinates in the input feature map that define the sample points, and A_θ is the affine transformation matrix. We use height and width normalised coordinates, such that −1 ≤ x_i^t ≤ 1 when within the spatial bounds of the output, and −1 ≤ x_i^s ≤ 1 when within the spatial bounds of the input (and similarly for the y coordinates).
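The grid mapping of (1) is straightforward to state in code. The following is a minimal sketch in plain Python (the function name and layout are ours, not from the paper's implementation): it builds the regular target grid G in normalised coordinates and applies the 2 × 3 matrix A_θ pointwise.

```python
# Sketch of the parameterised sampling grid of Eq. (1). `theta` is the
# 2x3 affine matrix A_theta produced by the localisation network;
# coordinates are normalised to [-1, 1] as in the text above.

def affine_grid(theta, H_out, W_out):
    """Return source coordinates (x_s, y_s) for each target grid point G_i."""
    grid = []
    for i in range(H_out):
        for j in range(W_out):
            # Regular target grid G in normalised coordinates.
            x_t = -1.0 + 2.0 * j / (W_out - 1) if W_out > 1 else 0.0
            y_t = -1.0 + 2.0 * i / (H_out - 1) if H_out > 1 else 0.0
            # Pointwise transformation of Eq. (1):
            # [x_s, y_s]^T = A_theta [x_t, y_t, 1]^T
            x_s = theta[0][0] * x_t + theta[0][1] * y_t + theta[0][2]
            y_s = theta[1][0] * x_t + theta[1][1] * y_t + theta[1][2]
            grid.append((x_s, y_s))
    return grid

identity = [[1.0, 0.0, 0.0],
            [0.0, 1.0, 0.0]]
corners = affine_grid(identity, 2, 2)
# corners == [(-1.0, -1.0), (1.0, -1.0), (-1.0, 1.0), (1.0, 1.0)]
```

With the identity parameters the sampling grid coincides with the regular grid itself, exactly the situation of Fig. 3 (a).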
The source/target transformation and sampling is equivalent to the standard texture mapping and coordinates used in graphics.

The transform defined in (1) allows cropping, translation, rotation, scale, and skew to be applied to the input feature map, and requires only 6 parameters (the 6 elements of A_θ) to be produced by the localisation network. It allows cropping because if the transformation is a contraction (i.e. the determinant of the left 2 × 2 sub-matrix has magnitude less than unity) then the mapped regular grid will lie in a parallelogram of area less than the range of x_i^s, y_i^s. The effect of this transformation on the grid compared to the identity transform is shown in Fig. 3.

The class of transformations T_θ may be more constrained, such as that used for attention,

\[
\mathrm{A}_\theta = \begin{bmatrix} s & 0 & t_x \\ 0 & s & t_y \end{bmatrix}
\tag{2}
\]

allowing cropping, translation, and isotropic scaling by varying s, t_x, and t_y. The transformation T_θ can also be more general, such as a plane projective transformation with 8 parameters, piecewise affine, or a thin plate spline. Indeed, the transformation can have any parameterised form, provided that it is differentiable with respect to the parameters – this crucially allows gradients to be backpropagated through from the sample points T_θ(G_i) to the localisation network output θ. If the transformation is parameterised in a structured, low-dimensional way, this reduces the complexity of the task assigned to the localisation network. For instance, a generic class of structured and differentiable transformations, which is a superset of attention, affine, projective, and thin plate spline transformations, is T_θ = M_θ B, where B is a target grid representation (e.g. in (1), B is the regular grid G in homogeneous coordinates), and M_θ is a matrix parameterised by θ.
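To make the constrained parameterisations concrete, the attention transform in (2) can be expanded into a full 2 × 3 matrix by a fixed, differentiable function of the three regressed values; a sketch follows (the helper name is ours, for illustration only).

```python
# Hypothetical helper expanding the attention transform of Eq. (2): the
# localisation network regresses only s, t_x, t_y, and the remaining
# entries of A_theta are fixed at zero.

def attention_to_affine(s, t_x, t_y):
    """Isotropic scaling plus translation, the constrained A_theta of Eq. (2)."""
    return [[s,   0.0, t_x],
            [0.0, s,   t_y]]

# A scale s < 1 is a contraction, so the transformer attends to (crops)
# a sub-region of the input, positioned by (t_x, t_y).
A = attention_to_affine(0.5, 0.25, -0.25)
```

This reduces the localisation network's regression task from 6 dimensions to 3, in line with the remark above about structured, low-dimensional parameterisations.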
In this case it is possible to not only learn how to predict θ for a sample, but also to learn B for the task at hand.

3.3 Differentiable Image Sampling

To perform a spatial transformation of the input feature map, a sampler must take the set of sampling points T_θ(G), along with the input feature map U, and produce the sampled output feature map V. Each (x_i^s, y_i^s) coordinate in T_θ(G) defines the spatial location in the input where a sampling kernel is applied to get the value at a particular pixel in the output V. This can be written as

\[
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y)
\quad \forall i \in [1 \ldots H'W'] \;\; \forall c \in [1 \ldots C]
\tag{3}
\]

where Φ_x and Φ_y are the parameters of a generic sampling kernel k() which defines the image interpolation (e.g. bilinear), U_nm^c is the value at location (n, m) in channel c of the input, and V_i^c is the output value for pixel i at location (x_i^t, y_i^t) in channel c. Note that the sampling is done identically for each channel of the input, so every channel is transformed in an identical way (this preserves spatial consistency between channels).

In theory, any sampling kernel can be used, as long as (sub-)gradients can be defined with respect to x_i^s and y_i^s. For example, using the integer sampling kernel reduces (3) to

\[
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \delta(\lfloor x_i^s + 0.5 \rfloor - m) \, \delta(\lfloor y_i^s + 0.5 \rfloor - n)
\tag{4}
\]

where ⌊x + 0.5⌋ rounds x to the nearest integer and δ() is the Kronecker delta function. This sampling kernel equates to just copying the value at the nearest pixel to (x_i^s, y_i^s) to the output location (x_i^t, y_i^t).
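This nearest-pixel copy can be sketched directly. The sketch below is single channel and uses pixel rather than normalised coordinates for brevity; the function name and layout are ours, not the paper's.

```python
# Sketch of the integer sampling kernel of Eq. (4): each output pixel
# copies the nearest input pixel; locations outside the input produce 0.
import math

def nearest_sample(U, grid, H_out, W_out):
    """U is an H x W list-of-lists (one channel); grid holds (x_s, y_s)."""
    H, W = len(U), len(U[0])
    V = [[0.0] * W_out for _ in range(H_out)]
    for i in range(H_out):
        for j in range(W_out):
            x_s, y_s = grid[i * W_out + j]
            m = math.floor(x_s + 0.5)          # nearest column
            n = math.floor(y_s + 0.5)          # nearest row
            if 0 <= m < W and 0 <= n < H:
                V[i][j] = U[n][m]
    return V

U = [[1.0, 2.0],
     [3.0, 4.0]]
# An identity grid copies the input exactly.
V = nearest_sample(U, [(0, 0), (1, 0), (0, 1), (1, 1)], 2, 2)   # V == U
```

Because the kernel is a hard assignment, this choice only admits sub-gradients, which motivates the bilinear kernel discussed next.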
Alternatively, a bilinear sampling kernel can be used, giving

\[
V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)
\tag{5}
\]

To allow backpropagation of the loss through this sampling mechanism we can define the gradients with respect to U and G. For bilinear sampling (5) the partial derivatives are

\[
\frac{\partial V_i^c}{\partial U_{nm}^c} = \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)
\tag{6}
\]

\[
\frac{\partial V_i^c}{\partial x_i^s} = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |y_i^s - n|)
\begin{cases} 0 & \text{if } |m - x_i^s| \ge 1 \\ 1 & \text{if } m \ge x_i^s \\ -1 & \text{if } m < x_i^s \end{cases}
\tag{7}
\]

and similarly to (7) for ∂V_i^c/∂y_i^s.

This gives us a (sub-)differentiable sampling mechanism, allowing loss gradients to flow back not only to the input feature map (6), but also to the sampling grid coordinates (7), and therefore back to the transformation parameters θ and the localisation network, since ∂x_i^s/∂θ and ∂y_i^s/∂θ can be easily derived from (1) for example. Due to discontinuities in the sampling functions, sub-gradients must be used. This sampling mechanism can be implemented very efficiently on GPU, by ignoring the sum over all input locations and instead just looking at the kernel support region for each output pixel.

3.4 Spatial Transformer Networks

The combination of the localisation network, grid generator, and sampler forms a spatial transformer (Fig. 2). This is a self-contained module which can be dropped into a CNN architecture at any point, and in any number, giving rise to spatial transformer networks.
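The sampler of (5) can likewise be sketched in a few lines (single channel, pixel coordinates, illustrative names). The explicit double sum mirrors (5) literally; as noted above, a practical implementation would visit only the kernel support of each output pixel, i.e. the four neighbouring input pixels.

```python
# Sketch of the bilinear sampling kernel of Eq. (5):
# V_i = sum_{n,m} U_nm * max(0, 1-|x_s-m|) * max(0, 1-|y_s-n|).

def bilinear_sample(U, grid, H_out, W_out):
    """U is an H x W list-of-lists (one channel); grid holds (x_s, y_s)."""
    H, W = len(U), len(U[0])
    V = [[0.0] * W_out for _ in range(H_out)]
    for i in range(H_out):
        for j in range(W_out):
            x_s, y_s = grid[i * W_out + j]
            v = 0.0
            for n in range(H):
                for m in range(W):
                    wx = max(0.0, 1.0 - abs(x_s - m))
                    wy = max(0.0, 1.0 - abs(y_s - n))
                    v += U[n][m] * wx * wy
            V[i][j] = v
    return V

U = [[0.0, 2.0],
     [4.0, 6.0]]
# Sampling halfway between all four pixels averages them.
V = bilinear_sample(U, [(0.5, 0.5)], 1, 1)   # V == [[3.0]]
```

Chaining a grid generator into this sampler, with the grid produced from the localisation network's θ, gives the full forward pass of a spatial transformer as assembled in Sect. 3.4.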
This module is computationally very fast and does not impair the training speed, causing very little time overhead when used naively, and even potential speedups in attentive models due to subsequent downsampling that can be applied to the output of the transformer.

Placing spatial transformers within a CNN allows the network to learn how to actively transform the feature maps to help minimise the overall cost function of the network during training. The knowledge of how to transform each training sample is compressed and cached in the weights of the localisation network (and also the weights of the layers previous to a spatial transformer) during training. For some tasks, it may also be useful to feed the output of the localisation network, θ, forward to the rest of the network, as it explicitly encodes the transformation, and hence the pose, of a region or object.

It is also possible to use spatial transformers to downsample or oversample a feature map, as one can define the output dimensions H′ and W′ to be different to the input dimensions H and W. However, with sampling kernels with a fixed, small spatial support (such as the bilinear kernel), downsampling with a spatial transformer can cause aliasing effects.

Finally, it is possible to have multiple spatial transformers in a CNN. Placing multiple spatial transformers at increasing depths of a network allows transformations of increasingly abstract representations, and also gives the localisation networks potentially more informative representations to base the predicted transformation parameters on. One can also use multiple spatial transformers in parallel – this can be useful if there are multiple objects or parts of interest in a feature map that should be focussed on individually.
A limitation of this architecture in a purely feed-forward network is that the number of parallel spatial transformers limits the number of objects that the network can model.

Model         R    RTS  P    E
FCN           2.1  5.2  3.1  3.2
CNN           1.2  0.8  1.5  1.4
ST-FCN Aff    1.2  0.8  1.5  2.7
ST-FCN Proj   1.3  0.9  1.4  2.6
ST-FCN TPS    1.1  0.8  1.4  2.4
ST-CNN Aff    0.7  0.5  0.8  1.2
ST-CNN Proj   0.8  0.6  0.8  1.3
ST-CNN TPS    0.7  0.5  0.8  1.1

Table 1: Left: The percentage errors for different models on different distorted MNIST datasets. The different distorted MNIST datasets we test are TC: translated and cluttered, R: rotated, RTS: rotated, translated, and scaled, P: projective distortion, E: elastic distortion. All the models used for each experiment have the same number of parameters, and same base structure for all experiments. Right: Some example test images where a spatial transformer network correctly classifies the digit but a CNN fails. (a) The inputs to the networks. (b) The transformations predicted by the spatial transformers, visualised by the grid T_θ(G). (c) The outputs of the spatial transformers. E and RTS examples use thin plate spline spatial transformers (ST-CNN TPS), while R examples use affine spatial transformers (ST-CNN Aff) with the angles of the affine transformations given. For videos showing animations of these experiments and more see https://goo.gl/qdEhUu.

4 Experiments

In this section we explore the use of spatial transformer networks on a number of supervised learning tasks. In Sect. 4.1 we begin with experiments on distorted versions of the MNIST handwriting dataset, showing the ability of spatial transformers to improve classification performance through actively transforming the input images. In Sect.
4.2 we test spatial transformer networks on a challenging real-world dataset, Street View House Numbers [21], for number recognition, showing state-of-the-art results using multiple spatial transformers embedded in the convolutional stack of a CNN. Finally, in Sect. 4.3, we investigate the use of multiple parallel spatial transformers for fine-grained classification, showing state-of-the-art performance on the CUB-200-2011 birds dataset [28] by automatically discovering object parts and learning to attend to them. Further experiments with MNIST addition and co-localisation can be found in the supplementary material.

4.1 Distorted MNIST

In this section we use the MNIST handwriting dataset as a testbed for exploring the range of transformations to which a network can learn invariance by using a spatial transformer.

We begin with experiments where we train different neural network models to classify MNIST data that has been distorted in various ways: rotation (R); rotation, scale and translation (RTS); projective transformation (P); elastic warping (E) – note that elastic warping is destructive and cannot be inverted in some cases. The full details of the distortions used to generate this data are given in the supplementary material. We train baseline fully-connected (FCN) and convolutional (CNN) neural networks, as well as networks with spatial transformers acting on the input before the classification network (ST-FCN and ST-CNN). The spatial transformer networks all use bilinear sampling, but variants use different transformation functions: an affine transformation (Aff), projective transformation (Proj), and a 16-point thin plate spline transformation (TPS) with a regular grid of control points. The CNN models include two max-pooling layers.
All networks have approximately the same number of parameters, are trained with identical optimisation schemes (backpropagation, SGD, scheduled learning rate decrease, with a multinomial cross entropy loss), and all with three weight layers in the classification network.

The results of these experiments are shown in Table 1 (left). Looking at any particular type of distortion of the data, it is clear that a spatial transformer enabled network outperforms its counterpart base network. For the case of rotation, translation, and scale distortion (RTS), the ST-CNN achieves 0.5% and 0.6% error depending on the class of transform used for T_θ, whereas a CNN, with two max-pooling layers to provide spatial invariance, achieves 0.8% error. This is in fact the same error that the ST-FCN achieves, which is without a single convolution or max-pooling layer in its network, showing that using a spatial transformer is an alternative way to achieve spatial invariance. ST-CNN models consistently perform better than ST-FCN models due to max-pooling layers in ST-CNN providing even more spatial invariance, and convolutional layers better modelling local structure. We also test our models in a noisy environment, on 60 × 60 images with translated MNIST digits and background clutter (see Fig. 1 third row for an example): an FCN gets 13.2% error, a CNN gets 3.5% error, while an ST-FCN gets 2.0% error and an ST-CNN gets 1.7% error.

Looking at the results between different classes of transformation, the thin plate spline transformation (TPS) is the most powerful, being able to reduce error on elastically deformed digits by reshaping the input into a prototype instance of the digit, reducing the complexity of the task for the classification network, and does not overfit on simpler data, e.g. R. Interestingly, the transformation of inputs for all ST models leads to a "standard" upright posed digit – this is the mean pose found in the training data. In Table 1 (right), we show the transformations performed for some test cases where a CNN is unable to correctly classify the digit, but a spatial transformer network can.

Model            64px  128px
Maxout CNN [10]  4.0   -
CNN (ours)       4.0   5.6
DRAM* [1]        3.9   4.5
ST-CNN Single    3.7   3.9
ST-CNN Multi     3.6   3.9

Table 2: Left: The sequence error (%) for SVHN multi-digit recognition on crops of 64 × 64 pixels (64px), and inflated crops of 128 × 128 (128px) which include more background. *The best reported result from [1] uses model averaging and Monte Carlo averaging, whereas the results from other models are from a single forward pass of a single model. Right: (a) The schematic of the ST-CNN Multi model. The transformations of each spatial transformer (ST) are applied to the convolutional feature map produced by the previous layer. (b) The result of the composition of the affine transformations predicted by the four spatial transformers in ST-CNN Multi, visualised on the input image.

4.2 Street View House Numbers

We now test our spatial transformer networks on a challenging real-world dataset, Street View House Numbers (SVHN) [21]. This dataset contains around 200k real world images of house numbers, with the task to recognise the sequence of numbers in each image. There are between 1 and 5 digits in each image, with a large variability in scale and spatial arrangement.

We follow the experimental setup as in [1, 10], where the data is preprocessed by taking 64 × 64 crops around each digit sequence. We also use an additional, more loosely cropped 128 × 128 dataset as in [1].
We train a baseline character sequence CNN model with 11 hidden layers leading to five independent softmax classifiers, each one predicting the digit at a particular position in the sequence. This is the character sequence model used in [16], where each classifier includes a null-character output to model variable length sequences. This model matches the results obtained in [10].

We extend this baseline CNN to include a spatial transformer immediately following the input (ST-CNN Single), where the localisation network is a four-layer CNN. We also define another extension where before each of the first four convolutional layers of the baseline CNN, we insert a spatial transformer (ST-CNN Multi). In this case, the localisation networks are all two-layer fully connected networks with 32 units per layer. In the ST-CNN Multi model, the spatial transformer before the first convolutional layer acts on the input image as with the previous experiments, however the subsequent spatial transformers deeper in the network act on the convolutional feature maps, predicting a transformation from them and transforming these feature maps (this is visualised in Table 2 (right) (a)). This allows deeper spatial transformers to predict a transformation based on richer features rather than the raw image. All networks are trained from scratch with SGD and dropout [14], with randomly initialised weights, except for the regression layers of spatial transformers which are initialised to predict the identity transform. Affine transformations and bilinear sampling kernels are used for all spatial transformer networks in these experiments.

The results of this experiment are shown in Table 2 (left) – the spatial transformer models obtain state-of-the-art results, reaching 3.6% error on 64 × 64 images compared to the previous state-of-the-art of 3.9% error.
Interestingly, on 128 × 128 images, while other methods degrade in performance, an ST-CNN achieves 3.9% error, whereas the previous state of the art at 4.5% error is a recurrent attention model that uses an ensemble of models with Monte Carlo averaging – in contrast, the ST-CNN models require only a single forward pass of a single model. This accuracy is achieved due to the fact that the spatial transformers crop and rescale the parts of the feature maps that correspond to the digit, focussing resolution and network capacity only on these areas (see Table 2 (right) (b) for some examples). In terms of computation speed, the ST-CNN Multi model is only 6% slower (forward and backward pass) than the CNN.

Model              Accuracy (%)
Cimpoi '15 [4]     66.7
Zhang '14 [30]     74.9
Branson '14 [2]    75.7
Lin '15 [20]       80.9
Simon '15 [24]     81.0
CNN (ours) 224px   82.3
2×ST-CNN 224px     83.1
2×ST-CNN 448px     83.9
4×ST-CNN 448px     84.1

Table 3: Left: The accuracy (%) on the CUB-200-2011 bird classification dataset. Spatial transformer networks with two spatial transformers (2×ST-CNN) and four spatial transformers (4×ST-CNN) in parallel outperform other models. 448px resolution images can be used with the ST-CNN without an increase in computational cost due to downsampling to 224px after the transformers. Right: The transformation predicted by the spatial transformers of 2×ST-CNN (top row) and 4×ST-CNN (bottom row) on the input image. Notably for the 2×ST-CNN, one of the transformers (shown in red) learns to detect heads, while the other (shown in green) detects the body, and similarly for the 4×ST-CNN.

4.3 Fine-Grained Classification

In this section, we use a spatial transformer network with multiple transformers in parallel to perform fine-grained bird classification.
We evaluate our models on the CUB-200-2011 birds dataset [28], containing 6k training images and 5.8k test images, covering 200 species of birds. The birds appear at a range of scales and orientations, are not tightly cropped, and require detailed texture and shape analysis to distinguish. In our experiments, we only use image class labels for training.
We consider a strong baseline CNN model – an Inception architecture with batch normalisation [15], pre-trained on ImageNet [22] and fine-tuned on CUB – which by itself achieves state-of-the-art accuracy of 82.3% (the previous best result is 81.0% [24]). We then train a spatial transformer network, ST-CNN, which contains 2 or 4 parallel spatial transformers, parameterised for attention and acting on the input image. Discriminative image parts, captured by the transformers, are passed to the part description sub-nets (each of which is also initialised by Inception). The resulting part representations are concatenated and classified with a single softmax layer. The whole architecture is trained on image class labels end-to-end with backpropagation (details in supplementary material).
The results are shown in Table 3 (left). The 4×ST-CNN achieves an accuracy of 84.1%, outperforming the baseline by 1.8%. In the visualisations of the transforms predicted by the 2×ST-CNN (Table 3 (right)), one can see that interesting behaviour has been learnt: one spatial transformer (red) has learnt to become a head detector, while the other (green) fixates on the central part of the body of a bird. The resulting output from the spatial transformers for the classification network is a somewhat pose-normalised representation of a bird.
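Structurally, the parallel-transformer classifier reduces to concatenating per-part feature vectors and applying one softmax layer. The sketch below is a shape-level illustration only – `part_descriptor`, the random weights, and the pre-computed crops are hypothetical stand-ins for the Inception-initialised sub-nets and transformer outputs described above.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def part_descriptor(crop, W):
    # Stand-in for an Inception-initialised part description sub-net.
    return np.tanh(W @ crop.ravel())

n_parts, crop_hw, feat_dim, n_classes = 4, (8, 8), 16, 200
W_parts = [rng.normal(size=(feat_dim, crop_hw[0] * crop_hw[1]))
           for _ in range(n_parts)]
W_cls = rng.normal(size=(n_classes, n_parts * feat_dim))

# Pretend each of the 4 parallel transformers has already produced a crop:
crops = [rng.normal(size=crop_hw) for _ in range(n_parts)]
parts = [part_descriptor(c, W) for c, W in zip(crops, W_parts)]
# Concatenate part representations and classify with a single softmax layer:
probs = softmax(W_cls @ np.concatenate(parts))
```

Since every stage is differentiable, the class-label loss alone can shape where each transformer attends, which is how the head and body detectors emerge without keypoint supervision.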
While previous work such as [2] explicitly defines parts of the bird, training separate detectors for these parts with supplied keypoint training data, the ST-CNN is able to discover and learn part detectors in a data-driven manner without any additional supervision. In addition, spatial transformers allow the use of 448px resolution input images without any impact on performance, as the outputs of the transformers acting on the 448px images are sampled at 224px before being processed.
5 Conclusion
In this paper we introduced a new self-contained module for neural networks – the spatial transformer. This module can be dropped into a network to perform explicit spatial transformations of features, opening up new ways for neural networks to model data, and is learnt in an end-to-end fashion without making any changes to the loss function. While CNNs provide an incredibly strong baseline, we see gains in accuracy from using spatial transformers across multiple tasks, resulting in state-of-the-art performance. Furthermore, the regressed transformation parameters from the spatial transformer are available as an output and could be used for subsequent tasks. While we only explore feed-forward networks in this work, early experiments show spatial transformers to be powerful in recurrent models, and useful for tasks requiring the disentangling of object reference frames.
References
[1] J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple object recognition with visual attention. ICLR, 2015.
[2] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. BMVC, 2014.
[3] J. Bruna and S. Mallat. Invariant scattering convolution networks. IEEE PAMI, 35(8):1872–1886, 2013.
[4] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In CVPR, 2015.
[5] T. S. Cohen and M. Welling.
Transformation properties of learned visual representations. ICLR, 2015.
[6] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.
[7] B. J. Frey and N. Jojic. Fast, large-scale transformation-invariant clustering. In NIPS, 2001.
[8] R. Gens and P. M. Domingos. Deep symmetry networks. In NIPS, 2014.
[9] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[10] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet. Multi-digit number recognition from street view imagery using deep convolutional neural networks. arXiv:1312.6082, 2013.
[11] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra. DRAW: A recurrent neural network for image generation. ICML, 2015.
[12] G. E. Hinton. A parallel computation that assigns canonical object-based frames of reference. In IJCAI, 1981.
[13] G. E. Hinton, A. Krizhevsky, and S. D. Wang. Transforming auto-encoders. In ICANN, 2011.
[14] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[15] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML, 2015.
[16] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. NIPS DLW, 2014.
[17] A. Kanazawa, A. Sharma, and D. Jacobs. Locally scale-invariant convolutional neural networks. In NIPS, 2014.
[18] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[19] K. Lenc and A. Vedaldi. Understanding image representations by measuring their equivariance and equivalence.
CVPR, 2015.
[20] T. Lin, A. RoyChowdhury, and S. Maji. Bilinear CNN models for fine-grained visual recognition. arXiv:1504.07889, 2015.
[21] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS DLW, 2011.
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.
[23] P. Sermanet, A. Frome, and E. Real. Attention for fine-grained categorization. arXiv:1412.7054, 2014.
[24] M. Simon and E. Rodner. Neural activation constellations: Unsupervised part model discovery with convolutional networks. arXiv:1504.08289, 2015.
[25] K. Sohn and H. Lee. Learning invariant representations with local transformations. arXiv:1206.6418, 2012.
[26] M. F. Stollenga, J. Masci, F. Gomez, and J. Schmidhuber. Deep networks with internal selective attention through feedback connections. In NIPS, 2014.
[27] T. Tieleman. Optimizing Neural Networks that Generate Images. PhD thesis, University of Toronto, 2014.
[28] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[29] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015.
[30] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun. Efficient and accurate approximations of nonlinear convolutional networks.
arXiv:1411.4229, 2014.