{"title": "Measuring Invariances in Deep Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 646, "page_last": 654, "abstract": "For many computer vision applications, the ideal image feature would be invariant to multiple confounding image properties, such as illumination and viewing angle. Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, outside of using these learned features in a classifier, it can be difficult to evaluate them. In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different image transforms. We find that deep autoencoders become invariant to increasingly complex image transformations with depth. This further justifies the use of "deep" vs. "shallower" representations. Our performance metrics agree with existing measures of invariance. Our evaluation metrics can also be used to evaluate future work in unsupervised deep learning, and thus help the development of future algorithms.", "full_text": "Measuring Invariances in Deep Networks

Ian J. Goodfellow, Quoc V. Le, Andrew M. Saxe, Honglak Lee, Andrew Y. Ng

Computer Science Department
Stanford University
Stanford, CA 94305

{ia3n,quocle,asaxe,hllee,ang}@cs.stanford.edu

Abstract

For many pattern recognition tasks, the ideal input feature would be invariant to multiple confounding properties (such as illumination and viewing angle, in computer vision applications). Recently, deep architectures trained in an unsupervised manner have been proposed as an automatic method for extracting useful features. However, it is difficult to evaluate the learned features by any means other than using them in a classifier.
In this paper, we propose a number of empirical tests that directly measure the degree to which these learned features are invariant to different input transformations. We find that stacked autoencoders learn modestly increasingly invariant features with depth when trained on natural images. We find that convolutional deep belief networks learn substantially more invariant features in each layer. These results further justify the use of "deep" vs. "shallower" representations, but suggest that mechanisms beyond merely stacking one autoencoder on top of another may be important for achieving invariance. Our evaluation metrics can also be used to evaluate future work in deep learning, and thus help the development of future algorithms.

1 Introduction
Invariance to abstract input variables is a highly desirable property of features for many detection and classification tasks, such as object recognition. The concept of invariance implies a selectivity for complex, high-level features of the input and yet a robustness to irrelevant input transformations. This tension between selectivity and robustness makes learning invariant features nontrivial. In the case of object recognition, an invariant feature should respond only to one stimulus despite changes in translation, rotation, complex illumination, scale, perspective, and other properties. In this paper, we propose to use a suite of "invariance tests" that directly measure the invariance properties of features; this gives us a measure of the quality of features learned in an unsupervised manner by a deep learning algorithm.

Our work also seeks to address the question: why are deep learning algorithms useful? Bengio and LeCun gave a theoretical answer to this question, in which they showed that a deep architecture is necessary to represent many functions compactly [1].
A second answer can also be found in such work as [2, 3, 4, 5], which shows that such architectures lead to useful representations for classification. In this paper, we give another, empirical, answer to this question: namely, we show that with increasing depth, the representations learned can also enjoy an increased degree of invariance. Our observations lend credence to the common view of invariances to minor shifts, rotations and deformations being learned in the lower layers, and being combined in the higher layers to form progressively more invariant features.

In computer vision, one can view object recognition performance as a measure of the invariance of the underlying features. While such an end-to-end system performance measure has many benefits, it can also be expensive to compute and does not give much insight into how to directly improve representations in each layer of deep architectures. Moreover, it cannot identify specific invariances that a feature may possess. The test suite presented in this paper provides an alternative that can identify the robustness of deep architectures to specific types of variations. For example, using videos of natural scenes, our invariance tests measure the degree to which the learned representations are invariant to 2-D (in-plane) rotations, 3-D (out-of-plane) rotations, and translations. Additionally, such video tests have the potential to examine changes in other variables such as illumination.
We demonstrate that using videos gives similar results to the more traditional method of measuring responses to sinusoidal gratings; however, the natural video approach enables us to test invariance to a wide range of transformations while the grating test only allows changes in stimulus position, orientation, and frequency.

Our proposed invariance measure is broadly applicable to evaluating many deep learning algorithms for many tasks, but the present paper will focus on two different algorithms applied to computer vision. First, we examine the invariances of stacked autoencoder networks [2]. These networks were shown by Larochelle et al. [3] to learn useful features for a range of vision tasks; this suggests that their learned features are significantly invariant to the transformations present in those tasks. Unlike the artificial data used in [3], however, our work uses natural images and natural video sequences, and examines more complex variations such as out-of-plane changes in viewing angle. We find that when trained under these conditions, stacked autoencoders learn increasingly invariant features with depth, but the effect of depth is small compared to other factors such as regularization. Next, we show that convolutional deep belief networks (CDBNs) [5], which are hand-designed to be invariant to certain local image translations, do enjoy dramatically increasing invariance with depth. This suggests that there is a benefit to using deep architectures, but that mechanisms besides simple stacking of autoencoders are important for gaining increasing invariance.

2 Related work

Deep architectures have shown significant promise as a technique for automatically learning features for recognition systems. Deep architectures consist of multiple layers of simple computational elements.
By combining the output of lower layers in higher layers, deep networks can represent progressively more complex features of the input. Hinton et al. introduced the deep belief network, in which each layer consists of a restricted Boltzmann machine [4]. Bengio et al. built a deep network using an autoencoder neural network in each layer [2, 3, 6]. Ranzato et al. and Lee et al. explored the use of sparsity regularization in autoencoding energy-based models [7, 8] and sparse convolutional DBNs with probabilistic max-pooling [5] respectively. These networks, when trained subsequently in a discriminative fashion, have achieved excellent performance on handwritten digit recognition tasks. Further, Lee et al. and Raina et al. show that deep networks are able to learn good features for classification tasks even when trained on data that does not include examples of the classes to be recognized [5, 9].

Some work in deep architectures draws inspiration from the biology of sensory systems. The human visual system follows a similar hierarchical structure, with higher levels representing more complex features [10]. Lee et al., for example, compared the response properties of the second layer of a sparse deep belief network to V2, the second stage of the visual hierarchy [11]. One important property of the visual system is a progressive increase in the invariance of neural responses in higher layers. For example, in V1, complex cells are invariant to small translations of their inputs. Higher in the hierarchy in the medial temporal lobe, Quiroga et al. have identified neurons that respond with high selectivity to, for instance, images of the actress Halle Berry [12].
These neurons are remarkably invariant to transformations of the image, responding equally well to images from different perspectives, at different scales, and even responding to the text "Halle Berry." While we do not know exactly the class of all stimuli such neurons respond to (if tested on a larger set of images, they may well turn out to respond also to other stimuli than Halle Berry related ones), they nonetheless show impressive selectivity and robustness to input transformations.

Computational models such as the neocognitron [13], HMAX model [14], and Convolutional Network [15] achieve invariance by alternating layers of feature detectors with local pooling and subsampling of the feature maps. This approach has been used to endow deep networks with some degree of translation invariance [8, 5]. However, it is not clear how to explicitly imbue models with more complicated invariances using this fixed architecture. Additionally, while deep architectures provide a task-independent method of learning features, convolutional and max-pooling techniques are somewhat specialized to visual and audio processing.

3 Network architecture and optimization
We train all of our networks on natural images collected separately (and in geographically different areas) from the videos used in the invariance tests. Specifically, the training set comprises a set of still images taken in outdoor environments free from artificial objects, and was not designed to relate in any way to the invariance tests.

3.1 Stacked autoencoder
The majority of our tests focus on the stacked autoencoder of Bengio et al. [2], which is a deep network consisting of an autoencoding neural network in each layer.
In the single-layer case, in response to an input pattern x ∈ Rn, the activation of each neuron, hi, i = 1, · · · , m is computed as

h(x) = tanh (W1x + b1),

where h(x) ∈ Rm is the vector of neuron activations, W1 ∈ Rm×n is a weight matrix, b1 ∈ Rm is a bias vector, and tanh is the hyperbolic tangent applied componentwise. The network output is then computed as

x̂ = tanh (W2h(x) + b2),

where x̂ ∈ Rn is a vector of output values, W2 ∈ Rn×m is a weight matrix, and b2 ∈ Rn is a bias vector. Given a set of p input patterns x(i), i = 1, · · · , p, the weight matrices W1 and W2 are adapted using backpropagation [16, 17, 18] to minimize the reconstruction error ∑_{i=1}^p ‖x(i) − x̂(i)‖².

Following [2], we successively train up layers of the network in a greedy layerwise fashion. The first layer receives a 14 × 14 patch of an image as input. After it achieves acceptable levels of reconstruction error, a second layer is added, then a third, and so on.

In some of our experiments, we use the method of [11], and constrain the expected activation of the hidden units to be sparse. We never constrain W1 = W2ᵀ, although we found this to approximately hold in practice.

3.2 Convolutional Deep Belief Network
We also test a CDBN [5] that was trained using two hidden layers. Each layer includes a collection of "convolution" units as well as a collection of "max-pooling" units. Each convolution unit has a receptive field size of 10×10 pixels, and each max-pooling unit implements a probabilistic max-like operation over four (i.e., 2×2) neighboring convolution units, giving each max-pooling unit an overall receptive field size of 11×11 pixels in the first layer and 31×31 pixels in the second layer. The model is regularized in a way that the average hidden unit activation is sparse.
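The receptive field sizes quoted above are consistent with standard receptive-field arithmetic. As a quick sanity check, a sketch assuming stride-1 convolution and non-overlapping 2×2 pooling with stride 2 (strides that match the reported sizes but are not stated explicitly in the text):

```python
# Receptive-field growth through a stack of layers.
# Each layer grows the receptive field r by (k - 1) * j, where k is the
# kernel size and j is the cumulative stride ("jump") of its input grid.
def receptive_field(layers):
    r, j = 1, 1
    for k, stride in layers:
        r += (k - 1) * j
        j *= stride
    return r

# Layer 1: 10x10 convolution (stride 1) followed by 2x2 max-pooling (stride 2).
layer1 = [(10, 1), (2, 2)]
# Layer 2 repeats the same pattern on top of layer 1's pooling units.
layer2 = layer1 + [(10, 1), (2, 2)]

print(receptive_field(layer1))  # 11, matching the first-layer pooling units
print(receptive_field(layer2))  # 31, matching the second-layer pooling units
```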
We also use a small amount of L2 weight decay.

Because the convolution units share weights and because their outputs are combined in the max-pooling units, the CDBN is explicitly designed to be invariant to small amounts of image translation.

4 Invariance measure
An ideal feature for pattern recognition should be both robust and selective. We interpret the hidden units as feature detectors that should respond strongly when the feature they represent is present in the input, and otherwise respond weakly when it is absent. An invariant neuron, then, is one that maintains a high response to its feature despite certain transformations of its input. For example, a face selective neuron might respond strongly whenever a face is present in the image; if it is invariant, it might continue to respond strongly even as the image rotates.

Building on this intuition, we consider hidden unit responses above a certain threshold to be firing, that is, to indicate the presence of some feature in the input. We adjust this threshold to ensure that the neuron is selective, and not simply always active. In particular we choose a separate threshold for each hidden unit such that all units fire at the same rate when presented with random stimuli. After identifying an input that causes the neuron to fire, we can test the robustness of the unit by calculating its firing rate in response to a set of transformed versions of that input.

More formally, a hidden unit i is said to fire when sihi(x) > ti, where ti is a threshold chosen by our test for that hidden unit and si ∈ {−1, 1} gives the sign of that hidden unit's values. The sign term si is necessary because, in general, hidden units are as likely to use low values as to use high values to indicate the presence of the feature that they detect. We therefore choose si to maximize the invariance score.
For hidden units that are regularized to be sparse, we assume that si = 1, since their mean activity has been regularized to be low. We define the indicator function fi(x) = 1{sihi(x) > ti}, i.e., it is equal to one if the neuron fires in response to input x, and zero otherwise.

A transformation function τ(x, γ) transforms a stimulus x into a new, related stimulus, where the degree of transformation is parametrized by γ ∈ R. (One could also imagine a more complex transformation parametrized by γ ∈ Rn.) In order for a function τ to be useful with our invariance measure, |γ| should relate to the semantic dissimilarity between x and τ(x, γ). For example, γ might be the number of degrees by which x is rotated.

A local trajectory T(x) is a set of stimuli that are semantically similar to some reference stimulus x, that is

T(x) = {τ(x, γ) | γ ∈ Γ}

where Γ is a set of transformation amounts of limited size, for example, all rotations of less than 15 degrees.

The global firing rate is the firing rate of a hidden unit when applied to stimuli drawn randomly from a distribution P(x):

G(i) = E[fi(x)],

where P(x) is a distribution over the possible inputs x defined for each implementation of the test. Using these definitions, we can measure the robustness of a hidden unit as follows. We define the set Z as a set of inputs that activate hi near maximally.
The local firing rate is the firing rate of a hidden unit when it is applied to local trajectories surrounding inputs z ∈ Z that maximally activate the hidden unit,

L(i) = (1/|Z|) ∑_{z∈Z} (1/|T(z)|) ∑_{x∈T(z)} fi(x),

i.e., L(i) is the proportion of transformed inputs that the neuron fires in response to, and hence is a measure of the robustness of the neuron's response to the transformation τ.

Our invariance score for a hidden unit hi is given by

S(i) = L(i)/G(i).

The numerator is a measure of the hidden unit's robustness to transformation τ near the unit's optimal inputs, and the denominator ensures that the neuron is selective and not simply always active. In our tests, we tried to select the threshold ti for each hidden unit so that it fires one percent of the time in response to random inputs, that is, G(i) = 0.01. For hidden units that frequently repeat the same activation value (up to machine precision), it is sometimes not possible to choose ti such that G(i) = 0.01 exactly. In such cases, we choose the smallest value of ti such that G(i) > 0.01.

Each of the tests presented in the paper is implemented by providing a different definition of P(x), τ(x, γ), and Γ.

S(i) gives the invariance score for a single hidden unit. The invariance score Invp(N) of a network N is given by the mean of S(i) over the top-scoring proportion p of hidden units in the deepest layer of N. We discard the (1 − p) worst hidden units because different subpopulations of units may be invariant to different transformations.
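Putting the definitions above together, the per-unit score S(i) = L(i)/G(i) can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: the quantile-based threshold search and the synthetic unit responses are illustrative assumptions.

```python
import numpy as np

def invariance_score(random_resps, trajectory_resps, s=1, target_rate=0.01):
    """Compute S(i) = L(i) / G(i) for one hidden unit.

    random_resps:     responses h_i(x) to stimuli drawn from P(x).
    trajectory_resps: list of arrays, one per reference input z in Z,
                      holding responses over the local trajectory T(z).
    s:                sign s_i in {-1, +1}.
    """
    r = s * np.asarray(random_resps)
    # Choose t_i so the unit fires on ~1% of random stimuli, i.e. G(i) ~ 0.01.
    t = np.quantile(r, 1.0 - target_rate)
    g = np.mean(r > t)
    if g <= 0:  # degenerate unit repeating one value; back the threshold off
        t = r.max() - 1e-12
        g = np.mean(r > t)
    # L(i): mean firing rate over trajectories around maximally activating inputs.
    l = np.mean([np.mean(s * np.asarray(tr) > t) for tr in trajectory_resps])
    return l / g

rng = np.random.default_rng(0)
rand = rng.standard_normal(10_000)          # responses to random stimuli
trajs = [rand.max() - 0.01 * np.arange(11)  # responses decaying slowly along
         for _ in range(5)]                 # 5 local trajectories
print(invariance_score(rand, trajs))
```

A unit whose response decays only slightly along its trajectories, as in this toy example, fires on nearly all transformed inputs (L ≈ 1) while firing on only 1% of random inputs, yielding a score far above 1.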
Reporting the mean of all unit scores would strongly penalize networks that discover several hidden units that are invariant to transformation τ but do not devote more than proportion p of their hidden units to such a task.

Finally, note that while we use this metric to measure invariances in the visual features learned by deep networks, it could be applied to virtually any kind of feature in virtually any application domain.

5 Grating test
Our first invariance test is based on the response of neurons to synthetic images. Following such authors as Berkes et al. [19], we systematically vary the parameters used to generate images of gratings. We use as input an image I of a grating, with image pixel intensities given by

I(x, y) = b + a sin (ω(x cos(θ) + y sin(θ) − φ)),

where ω is the spatial frequency, θ is the orientation of the grating, and φ is the phase. To implement our invariance measure, we define P(x) as a distribution over grating images. We measure invariance to translation by defining τ(x, γ) to change φ by γ. We measure invariance to rotation by defining τ(x, γ) to change θ by γ.¹

6 Natural video test
While the grating-based invariance test allows us to systematically vary the parameters used to generate the images, it shares the difficulty faced by a number of other methods for quantifying invariance that are based on synthetic (or nearly synthetic) data [19, 20, 21]: it is difficult to generate data that systematically varies a large variety of image parameters.

Our second suite of invariance tests uses natural video data. Using this method, we will measure the degree to which various learned features are invariant to a wide range of more complex image parameters. This will allow us to perform quantitative comparisons of representations at each layer of a deep network.
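The grating stimuli of Section 5 can be rendered directly from the intensity formula there; a minimal sketch (the patch size, amplitude a, and offset b are illustrative choices not specified in the text):

```python
import numpy as np

def grating(size=14, omega=4.0, theta=0.0, phi=0.0, a=0.5, b=0.5):
    """Render I(x, y) = b + a * sin(omega * (x cos(theta) + y sin(theta) - phi))."""
    y, x = np.mgrid[0:size, 0:size] / size  # normalized pixel coordinates
    return b + a * np.sin(omega * (x * np.cos(theta) + y * np.sin(theta) - phi))

# A local trajectory for the phase (translation) test: vary phi around a
# hypothetical optimum phi_opt in steps of pi/20, holding theta and omega fixed.
phi_opt = 0.3
trajectory = [grating(phi=phi_opt + k * np.pi / 20) for k in range(-3, 4)]
print(trajectory[0].shape)  # (14, 14)
```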
We also verify that the results using this technique align closely with those obtained with the grating-based invariance tests.

6.1 Data collection
Our dataset consists of natural videos containing common image transformations such as translations, 2-D (in-plane) rotations, and 3-D (out-of-plane) rotations. In contrast to labeled datasets like the NORB dataset [21] where the viewpoint changes in large increments between successive images, our videos are taken at sixty frames per second, and thus are suitable for measuring more modest invariances, as would be expected in lower layers of a deep architecture. After collection, the images are reduced in size to 320 by 180 pixels and whitened by applying a band pass filter. Finally, we adjust the contrast of the whitened images with a scaling constant that varies smoothly over time and attempts to make each image use as much of the dynamic range of the image format as possible. Each video sequence contains at least one hundred frames. Some video sequences contain motion that is only represented well near the center of the image; for example, 3-D (out-of-plane) rotation about an object in the center of the field of view. In these cases we cropped the videos tightly in order to focus on the relevant transformation.

6.2 Invariance calculation
To implement our invariance measure using natural images, we define P(x) as a uniform distribution over image patches contained in the test videos, and τ(x, γ) to be the image patch at the same image location as x but occurring γ video frames later in time. We define Γ = {−5, . . . , 5}. To measure invariance to different types of transformation, we simply use videos that involve each type of transformation.
This obviates the need to define a complex τ capable of synthetically performing operations such as 3-D rotation.

7 Results
7.1 Stacked autoencoders
7.1.1 Relationship between grating test and natural video test
Sinusoidal gratings are already used as a common reference stimulus. To validate our approach of using natural videos, we show that videos involving translation give similar test results to the phase variation grating test. Fig. 1 plots the invariance score for each of 378 one-layer autoencoders regularized with a range of sparsity and weight decay parameters (shown in Fig. 3). We were not able to find as close of a correspondence between the grating orientation test and natural videos involving 2-D (in-plane) rotation. Our 2-D rotations were captured by hand-rotating a video camera in natural environments, which introduces small amounts of other types of transformations. To verify that the problem is not that rotation when viewed far from the image center resembles translation, we compare the invariance test scores for translation and for rotation in Fig. 2. The lack of any clear trend makes it obvious that while our 2-D rotation videos do not correspond exactly to rotation, they are certainly not well-approximated by translation.

¹Details: We define P(x) as a uniform distribution over patches produced by varying ω ∈ {2, 4, 6, 8}, θ ∈ {0, · · · , π} in steps of π/20, and φ ∈ {0, · · · , π} in steps of π/20. After identifying a grating that strongly activates the neuron, further local gratings T(x) are generated by varying one parameter while holding all other optimal parameters fixed. For the translation test, local trajectories T(x) are generated by modifying φ from the optimal value φopt to φ = φopt ± {0, · · · , π} in steps of π/20, where φopt is the optimal grating phase shift. For the rotation test, local trajectories T(x) are generated by modifying θ from the optimal value θopt to θ = θopt ± {0, · · · , π} in steps of π/40, where θopt is the optimal grating orientation.

Figure 1: Videos involving translation give similar test results to synthetic videos of gratings with varying phase. [Scatter plot: natural translation test score vs. grating phase test score.]

Figure 2: We verify that our translation and 2-D rotation videos do indeed capture different transformations. [Scatter plot: natural 2-D rotation test score vs. natural translation test score.]

Figure 3: Our invariance measure selects networks that learn edge detectors resembling Gabor functions as the maximally invariant single-layer networks. Unregularized networks that learn high-frequency weights also receive high scores, but are not able to match the scores of good edge detectors. Degenerate networks in which every hidden unit learns essentially the same function tend to receive very low scores. [Surface plot: layer 1 natural video invariance score vs. log10 target mean activation and log10 weight decay.]

7.1.2 Pronounced effect of sparsity and weight decay
We trained several single-layer autoencoders using sparsity regularization with various target mean activations and amounts of weight decay. For these experiments, we averaged the invariance scores of all the hidden units to form the network score, i.e., we used p = 1. Due to the presence of the sparsity regularization, we assume si = 1 for all hidden units. We found that sparsity and weight decay have a large effect on the invariance of a single-layer network. In particular, there is a semicircular ridge trading sparsity and weight decay where invariance scores are high. We interpret this to be the region where the problem is constrained enough that the autoencoder must throw away some information, but is still able to extract meaningful patterns from its input. These results are visualized in Fig. 3. We find that a network with no regularization obtains a score of 25.88, and the best-scoring network receives a score of 32.41.

7.1.3 Modest improvements with depth
To investigate the effect of depth on invariance, we chose to extensively cross-validate several depths of autoencoders using only weight decay. The majority of successful image classification results in the literature do not use sparsity, and cross-validating only a single parameter frees us to sample the search space more densely. We trained a total of 73 networks with weight decay at each layer set to a value from {10, 1, 10−1, 10−2, 10−3, 10−5, 0}. For these experiments, we averaged the invariance scores of the top 20% of the hidden units to form the network score, i.e., we used p = .2, and chose si for each hidden unit to maximize the invariance score, since there was no sparsity regularization to impose a sign on the hidden unit values.

Figure 4: Left to right: weight visualizations from layer 1, layer 2, and layer 3 of the autoencoders; layer 1 and layer 2 of the CDBN. Autoencoder weight images are taken from the best autoencoder at each depth. All weight images are contrast normalized independently but plotted on the same spatial scale. Weight images in deeper layers are formed by making linear combinations of weight images in shallower layers. This approximates the function computed by each unit as a linear function.

After performing this grid search, we trained 100 additional copies of the network with the best mean invariance score at each depth, holding the weight decay parameters constant and varying only the random weights used to initialize training. We found that the improvement with depth was highly significant statistically (see Fig. 5). However, the magnitude of the increase in invariance is limited compared to the increase that can be gained with the correct sparsity and weight decay.

7.2 Convolutional Deep Belief Networks
We also ran our invariance tests on a two-layer CDBN. This provides a measure of the effectiveness of hard-wired techniques for achieving invariance, including convolution and max-pooling. The results are summarized in Table 1. These results cannot be compared directly to the results for autoencoders, because of the different receptive field sizes. The receptive field sizes in the CDBN are smaller than those in the autoencoder for the lower layers, but larger than those in the autoencoder for the higher layers due to the pooling effect. Note that the greatest relative improvement comes in the natural image tests, which presumably require greater sophistication than the grating tests.
The single test with the greatest relative improvement is the 3-D (out-of-plane) rotation test. This is the most complex transformation included in our tests, and it is where depth provides the greatest percentagewise increase.

8 Discussion and conclusion
In this paper, we presented a set of tests for measuring invariances in deep networks. We defined a general formula for a test metric, and demonstrated how to implement it using synthetic grating images as well as natural videos which reveal more types of invariances than just 2-D (in-plane) rotation, translation and frequency.

[Figure 5 panels: bar plots of invariance score vs. layer (1-3) for mean invariance, translation, 2-D rotation, and 3-D rotation.]

Figure 5: To verify that the improvement in invariance score of the best network at each layer is an effect of the network architecture rather than the random initialization of the weights, we retrained the best network of each depth 100 times. We find that the increase in the mean is statistically significant with p < 10−60. Looking at the scores for individual invariances, we see that the deeper networks trade a small amount of translation invariance for a larger amount of 2-D (in-plane) rotation and 3-D (out-of-plane) rotation invariance.
All plots are on the same scale but with different baselines so that the worst invariance score appears at the same height in each plot.

At the level of a single hidden unit, our firing rate invariance measure requires learned features to balance high local firing rates with low global firing rates. This concept resembles the trade-off between precision and recall in a detection problem. As learning algorithms become more advanced, another appropriate measure of invariance may be a hidden unit's invariance to object identity. As an initial step in this direction, we attempted to score hidden units by their mutual information with categories in the Caltech 101 dataset [22]. We found that none of our networks gave good results. We suspect that current learning algorithms are not yet sophisticated enough to learn, from only natural images, individual features that are highly selective for specific Caltech 101 categories, but this ability will become measurable in the future.

Test                   Layer 1   Layer 2   % change
Grating phase             68.7      95.3       38.2
Grating orientation       52.3      77.8       48.7
Natural translation       15.2      23.0       51.0
Natural 3-D rotation      10.7      19.3       79.5

Table 1: Results of the CDBN invariance tests.

At the network level, our measure requires networks to have at least some subpopulation of hidden units that are invariant to each type of transformation. This is accomplished by using only the top-scoring proportion p of hidden units when calculating the network score. Such a qualification is necessary to give high scores to networks that decompose the input into separate variables. For example, one very useful way of representing a stimulus would be to use some subset of hidden units to represent its orientation, another subset to represent its position, and another subset to represent its identity.
Even though this would be an extremely powerful feature representation, a value of p set too high would penalize some of these subsets for not being invariant.

We also presented extensive findings obtained by applying the invariance tests to computer vision tasks. However, the definition of our metric is sufficiently general that it could easily be used to test, for example, invariance of auditory features to rate of speech, or invariance of textual features to author identity.

A surprising finding in our experiments with visual data is that stacked autoencoders yield only modest improvements in invariance as depth increases. This suggests that while depth is valuable, merely stacking shallow architectures may not be sufficient to exploit the full potential of deep architectures to learn invariant features.

Another interesting finding is that by incorporating sparsity, networks can become more invariant. This suggests that, in the future, a variety of mechanisms should be explored in order to learn better features. For example, one promising approach that we are currently investigating is the idea of learning slow features [19] from temporal data.

We also document that explicit approaches to achieving invariance, such as max-pooling and weight-sharing in CDBNs, are currently successful strategies for achieving invariance. This is not surprising given that invariance is hard-wired into the network, but it confirms that our metric faithfully measures invariance.
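The scoring scheme described above — a per-unit ratio of local to global firing rate, aggregated over only the top-scoring proportion p of hidden units — can be sketched as follows. This is a minimal illustration, not the paper's released code: the function names, the binary fire/no-fire encoding, and the small epsilon guard are our own assumptions.

```python
def unit_invariance_score(local_fires, global_fires):
    """Ratio of a unit's local firing rate (over transformed versions of
    stimuli it responds to) to its global firing rate (over all inputs).
    Both arguments are sequences of booleans: True means the unit fired."""
    local_rate = sum(local_fires) / len(local_fires)
    global_rate = sum(global_fires) / len(global_fires)
    # An invariant unit keeps firing under transformation (high local
    # rate) while remaining selective overall (low global rate), so a
    # large ratio indicates invariance. Epsilon avoids division by zero.
    return local_rate / max(global_rate, 1e-12)

def network_invariance_score(unit_scores, p=0.2):
    """Mean score of the top-scoring proportion p of hidden units, so a
    network is credited for having *some* invariant subpopulation rather
    than requiring every unit to be invariant to every transformation."""
    ranked = sorted(unit_scores, reverse=True)
    k = max(1, round(p * len(ranked)))
    return sum(ranked[:k]) / k
```

For example, with p = 0.4 and unit scores [10, 8, 2, 1, 0.5], only the two best units (10 and 8) enter the network score; units that instead encode other variables, such as position, do not drag the score down.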
It is not obvious how to extend these explicit strategies to achieve invariance to more intricate transformations, such as large-angle out-of-plane rotations and complex illumination changes, and we expect that our metrics will be useful in guiding efforts to develop learning algorithms that automatically discover much more invariant features without relying on hard-wired strategies.

Acknowledgments  This work was supported in part by the National Science Foundation under grant EFRI-0835878, and in part by the Office of Naval Research under MURI N000140710747. Andrew Saxe is supported by a Scott A. and Geraldine D. Macomber Stanford Graduate Fellowship. We would also like to thank the anonymous reviewers for their helpful comments.

References

[1] Y. Bengio and Y. LeCun. Scaling learning algorithms towards AI. In L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, editors, Large-Scale Kernel Machines. MIT Press, 2007.

[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In NIPS, 2007.

[3] H. Larochelle, D. Erhan, A. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, 2007.

[4] G.E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

[5] H. Lee, R. Grosse, R. Ranganath, and A.Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.

[6] H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep neural networks. The Journal of Machine Learning Research, pages 1–40, 2009.

[7] M. Ranzato, Y-L. Boureau, and Y. LeCun. Sparse feature learning for deep belief networks. In NIPS, 2007.

[8] M. Ranzato, F.-J. Huang, Y-L. Boureau, and Y. LeCun.
Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR. IEEE Press, 2007.

[9] R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: Transfer learning from unlabeled data. In ICML, 2007.

[10] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.

[11] H. Lee, C. Ekanadham, and A.Y. Ng. Sparse deep belief network model for visual area V2. In NIPS, 2008.

[12] R. Quian Quiroga, L. Reddy, G. Kreiman, C. Koch, and I. Fried. Invariant visual representation by single neurons in the human brain. Nature, 435:1102–1107, 2005.

[13] K. Fukushima and S. Miyake. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.

[14] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019–1025, 1999.

[15] Y. LeCun, B. Boser, J.S. Denker, D. Henderson, R.E. Howard, W. Hubbard, and L.D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1:541–551, 1989.

[16] P. Werbos. Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD thesis, Harvard University, 1974.

[17] Y. LeCun. Une procédure d'apprentissage pour réseau à seuil asymétrique (a learning scheme for asymmetric threshold networks). In Proceedings of Cognitiva 85, pages 599–604, Paris, France, 1985.

[18] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning representations by back-propagating errors. Nature, 323:533–536, 1986.

[19] P. Berkes and L. Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties.
Journal of Vision, 5(6):579–602, 2005.

[20] L. Wiskott and T. Sejnowski. Slow feature analysis: Unsupervised learning of invariances. Neural Computation, 14(4):715–770, 2002.

[21] Y. LeCun, F.J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In CVPR, 2004.

[22] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In CVPR Workshop on Generative-Model Based Vision, page 178, 2004.