{"title": "Eigen-Distortions of Hierarchical Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 3530, "page_last": 3539, "abstract": "We develop a method for comparing hierarchical image representations in terms of their ability to explain perceptual sensitivity in humans. Specifically, we utilize Fisher information to establish a model-derived prediction of sensitivity to local perturbations of an image. For a given image, we compute the eigenvectors of the Fisher information matrix with largest and smallest eigenvalues, corresponding to the model-predicted most- and least-noticeable image distortions, respectively. For human subjects, we then measure the amount of each distortion that can be reliably detected when added to the image. We use this method to test the ability of a variety of representations to mimic human perceptual sensitivity. We find that the early layers of VGG16, a deep neural network optimized for object recognition, provide a better match to human perception than later layers, and a better match than a 4-stage convolutional neural network (CNN) trained on a database of human ratings of distorted image quality. 
On the other hand, we find that simple models of early visual processing, incorporating one or more stages of local gain control, trained on the same database of distortion ratings, provide substantially better predictions of human sensitivity than either the CNN, or any combination of layers of VGG16.", "full_text": "Eigen-Distortions of Hierarchical Representations

Alexander Berardino
Center for Neural Science, New York University
agb313@nyu.edu

Johannes Ballé
Center for Neural Science, New York University*
johannes.balle@nyu.edu

Valero Laparra
Image Processing Laboratory, Universitat de València
valero.laparra@uv.es

Eero Simoncelli
Howard Hughes Medical Institute, Center for Neural Science, and Courant Institute of Mathematical Sciences, New York University
eero.simoncelli@nyu.edu

Abstract

We develop a method for comparing hierarchical image representations in terms of their ability to explain perceptual sensitivity in humans. Specifically, we utilize Fisher information to establish a model-derived prediction of sensitivity to local perturbations of an image. For a given image, we compute the eigenvectors of the Fisher information matrix with largest and smallest eigenvalues, corresponding to the model-predicted most- and least-noticeable image distortions, respectively. For human subjects, we then measure the amount of each distortion that can be reliably detected when added to the image. We use this method to test the ability of a variety of representations to mimic human perceptual sensitivity. We find that the early layers of VGG16, a deep neural network optimized for object recognition, provide a better match to human perception than later layers, and a better match than a 4-stage convolutional neural network (CNN) trained on a database of human ratings of distorted image quality. 
On the other hand, we find that simple models of early visual processing, incorporating one or more stages of local gain control, trained on the same database of distortion ratings, provide substantially better predictions of human sensitivity than either the CNN, or any combination of layers of VGG16.

*Currently at Google, Inc.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Human capabilities for recognizing complex visual patterns are believed to arise through a cascade of transformations, implemented by neurons in successive stages in the visual system. Several recent studies have suggested that representations of deep convolutional neural networks trained for object recognition can predict activity in areas of the primate ventral visual stream better than models constructed explicitly for that purpose (Yamins et al. [2014], Khaligh-Razavi and Kriegeskorte [2014]). These results have inspired exploration of deep networks trained on object recognition as models of human perception, explicitly employing their representations as perceptual distortion metrics or loss functions (Hénaff and Simoncelli [2016], Johnson et al. [2016], Dosovitskiy and Brox [2016]).

On the other hand, several other studies have used synthesis techniques to generate images that indicate a profound mismatch between the sensitivity of these networks and that of human observers. Specifically, Szegedy et al. [2013] constructed image distortions, imperceptible to humans, that cause their networks to grossly misclassify objects. Similarly, Nguyen et al. [2015] optimized randomly initialized images to achieve reliable recognition by a network, but found that the resulting 'fooling images' were uninterpretable by human viewers. 
Simpler networks, designed for texture classification and constrained to mimic the early visual system, do not exhibit such failures (Portilla and Simoncelli [2000]). These results have prompted efforts to understand why generalization failures of this type are so consistent across deep network architectures, and to develop more robust training methods to defend networks against attacks designed to exploit these weaknesses (Goodfellow et al. [2014]).

From the perspective of modeling human perception, these synthesis failures suggest that representational spaces within deep neural networks deviate significantly from those of humans, and that methods for comparing representational similarity, based on fixed object classes and discrete sampling of the representational space, are insufficient to expose these deviations. If we are going to use such networks as models for human perception, we need better methods of comparing model representations to human vision. Recent work has taken the first step in this direction, by analyzing deep networks' robustness to visual distortions on classification tasks, as well as the similarity of classification errors that humans and deep networks make in the presence of the same kind of distortion (Dodge and Karam [2017]).

Here, we aim to accomplish something in the same spirit, but rather than testing on a set of hand-selected examples, we develop a model-constrained synthesis method for generating targeted test stimuli that can be used to compare the layer-wise representational sensitivity of a model to human perceptual sensitivity. Utilizing Fisher information, we isolate the model-predicted most and least noticeable changes to an image. We test these predictions by determining how well human observers can discriminate these same changes. 
We apply this method to six layers of VGG16 (Simonyan and Zisserman [2015]), a deep convolutional neural network (CNN) trained to classify objects. We also apply the method to several models explicitly trained to predict human sensitivity to image distortions, including a 4-stage generic CNN, an optimally-weighted version of VGG16, and a family of highly-structured models explicitly constructed to mimic the physiology of the early human visual system. Example images from the paper, as well as additional examples, are available at http://www.cns.nyu.edu/~lcv/eigendistortions/.

1 Predicting discrimination thresholds

Suppose we have a model for human visual representation, defined by conditional density $p(\vec{r}|\vec{x})$, where $\vec{x}$ is an N-dimensional vector containing the image pixels, and $\vec{r}$ is an M-dimensional random vector representing responses internal to the visual system (e.g., firing rates of a population of neurons). If the image is modified by the addition of a distortion vector, $\vec{x} + \alpha\hat{u}$, where $\hat{u}$ is a unit vector, and scalar $\alpha$ controls the amplitude of distortion, the model can be used to predict the threshold at which the distorted image can be reliably distinguished from the original image. Specifically, one can express a lower bound on the discrimination threshold in direction $\hat{u}$ for any observer or model that bases its judgments on $\vec{r}$ (Seriès et al. 
[2009]):

$$T(\hat{u}; \vec{x}) \ge \beta \sqrt{\hat{u}^T J^{-1}[\vec{x}]\, \hat{u}} \qquad (1)$$

where $\beta$ is a scale factor that depends on the noise amplitude of the internal representation (as well as experimental conditions, when measuring discrimination thresholds of human observers), and $J[\vec{x}]$ is the Fisher information matrix (FIM; Fisher [1925]), a second-order expansion of the log likelihood:

$$J[\vec{x}] = \mathbb{E}_{\vec{r}|\vec{x}}\left[ \left( \frac{\partial}{\partial \vec{x}} \log p(\vec{r}|\vec{x}) \right) \left( \frac{\partial}{\partial \vec{x}} \log p(\vec{r}|\vec{x}) \right)^{T} \right] \qquad (2)$$

Here, we restrict ourselves to models that can be expressed as a deterministic (and differentiable) mapping from the input pixels to mean output response vector, $f(\vec{x})$, with additive white Gaussian noise in the response space. The log likelihood in this case reduces to a quadratic form:

$$\log p(\vec{r}|\vec{x}) = -\frac{1}{2} [\vec{r} - f(\vec{x})]^{T} [\vec{r} - f(\vec{x})] + \text{const.}$$

Substituting this into Eq. (2) gives:

$$J[\vec{x}] = \frac{\partial f}{\partial \vec{x}}^{T} \frac{\partial f}{\partial \vec{x}}$$

Thus, for these models, the Fisher information matrix induces a locally adaptive Euclidean metric on the space of images, as specified by the Jacobian matrix, $\partial f / \partial \vec{x}$.

Figure 1: Measuring and comparing model-derived predictions of image discriminability. Two models are applied to an image (depicted as a point $\vec{x}$ in the space of pixel values), producing response vectors $\vec{r}_A$ and $\vec{r}_B$. Responses are assumed to be stochastic, and drawn from known distributions $p(\vec{r}_A|\vec{x})$ and $p(\vec{r}_B|\vec{x})$. 
The Fisher Information Matrices (FIM) of the models, $J_A[\vec{x}]$ and $J_B[\vec{x}]$, provide a quadratic approximation of the discriminability of distortions relative to an image (rightmost plot, colored ellipses). The extremal eigenvalues and eigenvectors of the FIMs (directions indicated by colored lines) provide predictions of the most and least visible distortions. We test these predictions by measuring human discriminability in these directions (colored points). In this example, the ratio of discriminability along the extremal eigenvectors is larger for model A than for model B, indicating that model A provides a better description of human perception of distortions (for this image).

1.1 Extremal eigen-distortions

The FIM is generally too large to be stored in memory or inverted. Even if we could store and invert it, the high dimensionality of input (pixel) space renders the set of possible distortions too large to test experimentally. We resolve both of these issues by restricting our consideration to the most- and least-noticeable distortion directions, corresponding to the eigenvectors of $J[\vec{x}]$ with largest and smallest eigenvalues, respectively. First, note that if a distortion direction $\hat{e}$ is an eigenvector of $J[\vec{x}]$ with associated eigenvalue $\lambda$, then it is also an eigenvector of $J^{-1}[\vec{x}]$ (with eigenvalue $1/\lambda$), since the FIM is symmetric and positive semi-definite. In this case, Eq. (1) becomes

$$T(\hat{e}; \vec{x}) \ge \beta / \sqrt{\lambda}$$

If human discrimination thresholds attain this bound, or are a constant multiple above it, then the ratio of discrimination thresholds along two different eigenvectors is the square root of the ratio of their associated eigenvalues. 
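As a concrete sanity check on this relation, consider a toy linear-Gaussian model $f(\vec{x}) = A\vec{x}$ with unit-variance response noise (a hypothetical stand-in for the models above, not one of the paper's models): the mean response displacement produced by a perturbation $\alpha\hat{e}$ along an eigenvector has magnitude $\alpha\sqrt{\lambda}$, so detectability grows as $\sqrt{\lambda}$ and the threshold shrinks as $1/\sqrt{\lambda}$. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear model f(x) = A x with additive unit-variance Gaussian response noise.
A = rng.standard_normal((10, 6))
J = A.T @ A                      # Fisher information matrix for this model
w, V = np.linalg.eigh(J)         # eigenvalues in ascending order
lam_min, lam_max = w[0], w[-1]

# Mean response displacement for a perturbation alpha * e along an eigenvector e:
alpha = 0.5
d_max = np.linalg.norm(A @ (alpha * V[:, -1]))   # most-noticeable direction
d_min = np.linalg.norm(A @ (alpha * V[:, 0]))    # least-noticeable direction

# Displacement (and hence detectability) grows as sqrt(lambda), so the predicted
# threshold ratio between least- and most-noticeable directions is sqrt(lam_max/lam_min).
print(d_max / d_min, np.sqrt(lam_max / lam_min))
```

The two printed numbers agree, illustrating that for this model the extremal threshold ratio equals the square root of the extremal eigenvalue ratio.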
In this case, the strongest prediction arising from a given model is the ratio of the extremal (maximal and minimal) eigenvalues of its FIM, which can be compared to the ratio of human discrimination thresholds for distortions in the directions of the corresponding extremal eigenvectors (Fig. 1).

Although the FIM cannot be stored, it is straightforward to compute its product with an input vector (i.e., an image). Using this operation, we can solve for the extremal eigenvectors using the well-known power iteration method (von Mises and Pollaczek-Geiringer [1929]). Specifically, to obtain the maximal eigenvalue of a given function and its associated eigenvector ($\lambda_m$ and $\hat{e}_m$, respectively), we start with a vector consisting of white noise, $\hat{e}_m^{(0)}$, and then iteratively apply the FIM, renormalizing the resulting vector, until convergence:

$$\lambda_m^{(k+1)} = \left\| J[\vec{x}]\, \hat{e}_m^{(k)} \right\|; \qquad \hat{e}_m^{(k+1)} = J[\vec{x}]\, \hat{e}_m^{(k)} / \lambda_m^{(k+1)}$$

To obtain the minimal eigenvector, $\hat{e}_l$, we perform a second iteration using the FIM with the maximal eigenvalue subtracted from the diagonal:

$$\lambda_l^{(k+1)} = \left\| (J[\vec{x}] - \lambda_m I)\, \hat{e}_l^{(k)} \right\|; \qquad \hat{e}_l^{(k+1)} = (J[\vec{x}] - \lambda_m I)\, \hat{e}_l^{(k)} / \lambda_l^{(k+1)}$$

1.2 Measuring human discrimination thresholds

For each model under consideration, we synthesized extremal eigen-distortions for 6 images from the Kodak image set2. 
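The iterations above can be sketched in a few lines of numpy. This is an illustrative sketch, not the authors' implementation: the model $f(\vec{x}) = \tanh(W\vec{x})$ is a hypothetical stand-in for a network layer, and the FIM is applied matrix-free via $J[\vec{x}]\,v = (\partial f/\partial\vec{x})^T\,(\partial f/\partial\vec{x}\; v)$, so $J$ is never formed explicitly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy model f(x) = tanh(W x), standing in for a network layer.
N, M = 12, 16
W = rng.standard_normal((M, N)) / np.sqrt(N)
x = rng.standard_normal(N)

def jacobian(x):
    # Analytic Jacobian of f at x: diag(1 - tanh(Wx)^2) @ W
    return (1.0 - np.tanh(W @ x) ** 2)[:, None] * W

def fim_vec(x, v, shift=0.0):
    # Matrix-free product (J[x] - shift*I) v, where J[x] = Jf^T Jf.
    Jf = jacobian(x)
    return Jf.T @ (Jf @ v) - shift * v

def power_iteration(x, shift=0.0, n_iter=5000, seed=1):
    # Start from white noise and repeatedly apply (J[x] - shift*I), renormalizing.
    e = np.random.default_rng(seed).standard_normal(N)
    lam = 0.0
    for _ in range(n_iter):
        u = fim_vec(x, e, shift)
        lam = np.linalg.norm(u)
        e = u / lam
    return lam, e

lam_m, e_m = power_iteration(x)              # maximal eigenvalue / most-noticeable direction
mu, e_l = power_iteration(x, shift=lam_m)    # deflated second iteration
lam_l = lam_m - mu                           # minimal eigenvalue of J[x]
```

Note that the deflated operator $J[\vec{x}] - \lambda_m I$ has its largest-magnitude eigenvalue at $\lambda_l - \lambda_m$, so the second iteration converges to the minimal eigenvector, and $\lambda_l$ is recovered as $\lambda_m - \mu$.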
We then estimated human thresholds for detecting these distortions using a two-alternative forced-choice task. On each trial, subjects were shown (for one second each, with a half-second blank screen between images, and in randomized order) a photographic image (18 degrees across), $\vec{x}$, and the same image distorted using one of the extremal eigenvectors, $\vec{x} + \alpha\hat{e}$, and then asked to indicate which image appeared more distorted. This procedure was repeated for 120 trials for each distortion vector, $\hat{e}$, over a range of $\alpha$ values, with ordering chosen by a standard psychophysical staircase procedure. The proportion of correct responses, as a function of $\alpha$, was fit with a cumulative Gaussian function, and the subject's detection threshold, $T_s(\hat{e}; \vec{x})$, was estimated as the value of $\alpha$ for which the subject could distinguish the distorted image 75% of the time. We computed the natural logarithm of the ratio of these detection thresholds for the minimal and maximal eigenvectors, and averaged this over images (indexed by $i$) and subjects (indexed by $s$):

$$D(f) = \frac{1}{S} \frac{1}{I} \sum_{s=1}^{S} \sum_{i=1}^{I} \log\left( T_s(\hat{e}_{l,i}; \vec{x}_i) \,/\, T_s(\hat{e}_{m,i}; \vec{x}_i) \right)$$

where $T_s$ indicates the threshold measured for human subject $s$. $D(f)$ provides a measure of a model's ability to predict human performance with respect to distortion detection: the ratio of thresholds for model-generated extremal distortions will be larger for models that are more similar to the human subjects (Fig. 1).

2 Probing representational sensitivity of VGG16 layers

We begin by examining discrimination predictions derived from the deep convolutional network known as VGG16, which has been previously studied in the context of perceptual sensitivity. Specifically, Johnson et al. 
[2016] trained a neural network to generate super-resolution images using the representation of an intermediate layer of VGG16 as a perceptual loss function, and showed that the images this network produced looked significantly better than images generated with simpler loss functions (e.g., pixel-domain mean squared error). Hénaff and Simoncelli [2016] used VGG16 as an image metric to synthesize minimal-length paths (geodesics) between images modified by simple global transformations (rotation, dilation, etc.). The authors found that a modified version of the network produced geodesics that captured these global transformations well (as measured perceptually), especially in deeper layers. Implicit in both of these studies, and others like them (e.g., Dosovitskiy and Brox [2016]), is the idea that a deep neural network trained to recognize objects may exhibit additional human perceptual characteristics.

Here, we compare VGG16's sensitivity to distortions directly to human perceptual sensitivity to the same distortions. We transformed luminance-valued images and distortion vectors to proper inputs for VGG16 following the preprocessing steps described in the original paper, and verified that our implementation replicated the published object recognition results. For human perceptual measurements, all images were transformed to produce the same luminance values on our calibrated display as those assumed by the model.

We computed eigen-distortions of VGG16 at 6 different layers: the rectified convolutional layer immediately prior to the first max-pooling operation (Front), as well as each subsequent layer following a pooling operation (Layer2–Layer6). A subset of these are shown, both in isolation and superimposed on the image from which they were derived, in Fig. 3. 
Note that the detectability of these distortions in isolation is not necessarily indicative of their detectability when superimposed on the underlying image, as measured in our experiments. We compared all of these predictions to a baseline model (MSE), where the image transformation, $f(\vec{x})$, is replaced by the identity matrix. For this model, every distortion direction is equally discriminable, and distortions are generated as samples of Gaussian white noise.

Figure 2: Top: Average log-thresholds for detection of the least-noticeable (red) and most-noticeable (blue) eigen-distortions derived from layers within VGG16 (10 observers), and a baseline model (MSE) for which distortions in all directions are equally visible.

Figure 3: Eigen-distortions derived from three layers of the VGG16 network (Front, Layer 3, Layer 5) for an example image. Images are best viewed in a display with luminance range from 5 to 300 cd/m2 and a gamma exponent of 2.4. Top: Most-noticeable eigen-distortions. All distortion image intensities are scaled by the same amount (×4). Second row: Original image ($\vec{x}$), and sum of this image with each of the eigen-distortions. Third and fourth rows: Same, for the least-noticeable eigen-distortions. Distortion image intensities are scaled the same (×30).

2 Downloaded from http://www.cipr.rpi.edu/resource/stills/kodak.html.

Average human detection thresholds measured across 10 subjects and 6 base images are summarized in Fig. 2, and indicate that the early layers of VGG16 (in particular, Front and Layer3) are better predictors of human sensitivity than the deeper layers (Layer4, Layer5, Layer6). 
Specifically, the most-noticeable eigen-distortions from representations within VGG16 become more discriminable with depth, but so, generally, do the least-noticeable eigen-distortions. This discrepancy could arise from overlearned invariances, or invariances induced by network architecture (e.g., Layer6, the first stage in the network where the number of output coefficients falls below the number of input pixels, is an under-complete representation). Notably, including the 'L2 pooling' modification suggested in Hénaff and Simoncelli [2016] did not significantly alter the visibility of eigen-distortions synthesized from VGG16 (images and data not shown).

3 Probing representational similarity of IQA-optimized models

The results above suggest that training a neural network to recognize objects imparts some ability to predict human sensitivity to distortions. However, we find that deeper layers of the network produce worse predictions than shallower layers. This could be a result of the mismatched training objective function (object recognition) or the particular architecture of the network. Since we clearly cannot probe the entire space of networks that achieve good results on object recognition, we aim instead to probe a more general form of the latter question. Specifically, we train multiple models of differing architecture to predict human image quality ratings, and test their ability to generalize by measuring human sensitivity to their eigen-distortions.

Figure 4: Architecture of a 4-layer convolutional neural network (CNN). Each layer consists of a convolution, downsampling, and a rectifying nonlinearity (see text). The network was trained, using batch normalization, to maximize correlation with the TID-2008 database of human image distortion sensitivity.

We constructed a generic 4-layer convolutional neural network (CNN, 436,908 parameters; Fig. 4). 
Within this network, each layer applies a bank of 5 × 5 convolution filters to the outputs of the previous layer (or, for the first layer, the input image). The convolution responses are subsampled by a factor of 2 along each spatial dimension (the number of filters at each layer is increased by the same factor to maintain a complete representation at each stage). Following each convolution, we employ batch normalization, in which all responses are divided by the standard deviation taken over all spatial positions and all layers, and over a batch of input images (Ioffe and Szegedy [2015]). Finally, outputs are rectified with a softplus nonlinearity, log(1 + exp(x)). After training, the batch normalization factors are fixed to the global mean and variance across the entire training set.

We compare our generic CNN to a model reflecting the structure and computations of the Lateral Geniculate Nucleus (LGN), the visual relay center of the thalamus. Previous results indicate that such models can successfully mimic human judgments of image quality (Laparra et al. [2017]). The full model (On-Off) is constructed from a cascade of linear filtering and nonlinear computational modules (local gain control and rectification). The first stage decomposes the image into two separate channels. Within each channel, the image is filtered by a difference-of-Gaussians (DoG) filter (2 parameters, controlling the spatial sizes of the Gaussians; DoG filters in the On and Off channels are assumed to be of opposite sign). Following this linear stage, the outputs are normalized by two sequential stages of gain control, a known property of LGN neurons (Mante et al. [2008]). Filter outputs are first normalized by a local measure of luminance (2 parameters, controlling filter size and amplitude), and subsequently by a local measure of contrast (2 parameters, again controlling size and amplitude). 
Finally, the outputs of each channel are rectified by a softplus nonlinearity, for a total of 12 model parameters. In order to evaluate the necessity of each structural element of this model, we also test three reduced sub-models, each trained on the same data (Fig. 5).

Lastly, we compare both of these models to a version of VGG16 targeted at image quality assessment (VGG-IQA). This model computes the weighted mean squared error over all rectified convolutional layers of the VGG16 network (13 weight parameters in total), with weights trained on the same perceptual data as the other models.

Figure 5: Architecture of our LGN model (On-Off), and several reduced models (LGG, LG, and LN). Each model was trained to maximize correlation with the TID-2008 database of human image distortion sensitivity.

3.1 Optimizing models for IQA

We trained all of the models on the TID-2008 database, which contains a large set of original and distorted images, along with corresponding human ratings of perceived distortion [Ponomarenko et al., 2009]. Perceptual distortion distance for each model was calculated as the Euclidean distance between the model's representations of the original and distorted images:

$$D_\phi = \left\| f_\phi(\vec{x}) - f_\phi(\vec{x}\,') \right\|_2$$

For each model, we optimized its parameters, $\phi$, so as to maximize the correlation between the model-predicted perceptual distance, $D_\phi$, and the human mean opinion scores (MOS) reported in the TID-2008 database:

$$\phi^* = \arg\max_\phi \left( \mathrm{corr}(D_\phi, \mathrm{MOS}) \right)$$

Optimization of VGG-IQA weights was performed using non-negative least squares. 
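The correlation objective can be sketched with a toy one-parameter model. This is an illustrative stand-in, not the paper's implementation: the 'images', the distortion ratings, and the model $f_\phi(x) = \tanh(\phi x)$ are all synthetic, and a finite-difference gradient replaces autodiff with Adam.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: 50 (original, distorted) image pairs and human distortion ratings.
X = rng.standard_normal((50, 8))
Xd = X + 0.3 * rng.standard_normal((50, 8))
ratings = np.linalg.norm(X - Xd, axis=1) + 0.05 * rng.standard_normal(50)

def corr_objective(phi):
    # D_phi = || f_phi(x) - f_phi(x') ||_2 with toy model f_phi(x) = tanh(phi * x)
    D = np.linalg.norm(np.tanh(phi * X) - np.tanh(phi * Xd), axis=1)
    # Pearson correlation between model distances and human ratings
    return np.corrcoef(D, ratings)[0, 1]

# Gradient ascent on the correlation (finite differences stand in for autodiff/Adam).
phi = 0.1
for _ in range(200):
    g = (corr_objective(phi + 1e-4) - corr_objective(phi - 1e-4)) / 2e-4
    phi += 0.05 * g
```

Note that the correlation objective is invariant to any monotonic affine rescaling of $D_\phi$, which is why only the shape of the model response, not its overall gain, is constrained by training.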
Optimization of all other models was performed using regularized stochastic gradient ascent with the Adam algorithm (Kingma and Ba [2015]).

3.2 Comparing perceptual predictions of generic and structured models

After training, we evaluated each model's predictive performance using traditional cross-validation methods on a held-out test set of the TID-2008 database. By this measure, all three models performed well (Pearson correlation: CNN: ρ = .86, On-Off: ρ = .82, VGG-IQA: ρ = .84).

Figure 6: Top: Average log-thresholds for detection of the least-noticeable (red) and most-noticeable (blue) eigen-distortions derived from IQA models (19 human observers).

Stepping beyond the TID-2008 database, and using the more stringent eigen-distortion test, yielded a very different outcome (Figs. 7, 6 and 8). The average detection thresholds measured across 19 human subjects and 6 base images indicate that all of our models surpassed the baseline model in at least one of their predictions. However, the eigen-distortions derived from the generic CNN and VGG-IQA were significantly less predictive of human sensitivity than those derived from the On-Off model (Fig. 6) and, surprisingly, even somewhat less predictive than early layers of VGG16 (see Fig. 8). Thus, the eigen-distortion test reveals generalization failures in the CNN and VGG16 architectures that are not exposed by traditional methods of cross-validation. On the other hand, the models with architectures that mimic biology (On-Off, LGG, LG) are constrained in a way that enables better generalization.

We compared these results to the performance of each of our reduced LGN models (Fig. 5), to determine the necessity of each structural element of the full On-Off model. 
As expected, the models incorporating more LGN functional elements performed better on a traditional cross-validation test, with the most complex of the reduced models (LGG) performing at the same level as On-Off and the CNN (LN: ρ = .66, LG: ρ = .74, LGG: ρ = .83). Likewise, models with more LGN functional elements produced eigen-distortions with increasing predictive accuracy (Figs. 6 and 8). It is worth noting that the three LGN models that incorporate some form of local gain control perform significantly better than the CNN and VGG-IQA models, and better than all layers of VGG16, including the early layers (see Fig. 8).

4 Discussion

We have presented a new methodology for synthesizing most- and least-noticeable distortions from perceptual models, applied this methodology to a set of different models, and tested the resulting predictions by measuring their detectability by human subjects. We show that this methodology provides a powerful form of 'Turing test': perceptual measurements on this limited set of model-optimized examples reveal failures that are not apparent in measurements on a large set of hand-curated examples.

Figure 7: Eigen-distortions for several models (LG, LGG, On-Off, CNN, VGG-IQA) trained to maximize correlation with human distortion ratings in TID-2008 [Ponomarenko et al., 2009]. Images are best viewed in a display with luminance range from 5 to 300 cd/m2 and a gamma exponent of 2.4. Top: Most-noticeable eigen-distortions. All distortion image intensities are re-scaled by the same amount (×4). Second row: Original image ($\vec{x}$), and sum of this image with each eigen-distortion. 
Third and fourth rows: Same, for the least-noticeable eigen-distortions. All distortion image intensities re-scaled by the same amount (×30).

We are not the first to introduce a method of this kind. Wang and Simoncelli [2008] introduced Maximum Differentiation (MAD) competition, which creates images optimized for one metric while holding constant a competing metric's rating. Our method relies on a Fisher approximation to generate extremal perturbations, and uses the ratio of their empirically measured discrimination thresholds as an absolute measure of alignment to human sensitivity (as opposed to relative pairwise comparisons of model performance). Our method can easily be generalized to incorporate more physiologically realistic noise assumptions, such as Poisson noise, and could potentially be extended to include noise at each stage of a hierarchical model.

We've used this method to analyze the ability of VGG16, a deep convolutional neural network trained to recognize objects, to account for human perceptual sensitivity. First, we find that the early layers of the network are moderately successful in this regard. Second, these layers (Front, Layer3) surpassed the predictive power of a generic shallow CNN explicitly trained to predict human perceptual sensitivity, but underperformed models of the LGN trained on the same objective. And third, perceptual sensitivity predictions synthesized from a layer of VGG16 decline in accuracy for deeper layers.

Figure 8: Average empirical log-threshold ratio (D(f)) for eigen-distortions derived from each IQA-optimized model and each layer of VGG16.

We also showed that a highly structured model of the LGN generates predictions that substantially surpass the predictive power of any individual layer of VGG16, as well as a version of VGG16 trained to fit human sensitivity data (VGG-IQA), or a generic 4-layer CNN trained on the same data. 
These failures of both the shallow and deep neural networks were not seen in traditional cross-validation tests on the human sensitivity data, but were revealed by measuring human sensitivity to model-synthesized eigen-distortions. Finally, we confirmed that known functional properties of the early visual system (On and Off pathways) and ubiquitous neural computations (local gain control, Carandini and Heeger [2012]) have a direct impact on perceptual sensitivity, a finding that is buttressed by several other published results (Malo et al. [2006], Lyu and Simoncelli [2008], Laparra et al. [2010, 2017], Ballé et al. [2017]).

Most importantly, we demonstrate the utility of prior knowledge in constraining the choice of models. Although the biologically structured models used components similar to generic CNNs, they had far fewer layers and their parameterization was highly restricted, thus allowing a far more limited family of transformations. Despite this, they outperformed the generic CNN and VGG models. These structural choices were informed by knowledge of primate visual physiology, and training on human perceptual data was used to determine parameters of the model that are either unknown or underconstrained by current experimental knowledge. Our results imply that this imposed structure serves as a powerful regularizer, enabling these models to generalize much better than generic unstructured networks.

Acknowledgements

The authors would like to thank the members of the LCV and VNL groups at NYU, especially Olivier Henaff and Najib Majaj, for helpful feedback and comments on the manuscript. Additionally, we thank Rebecca Walton and Lydia Cassard for their tireless efforts in collecting the perceptual data presented here. This work was funded in part by the Howard Hughes Medical Institute, the NEI Visual Neuroscience Training Program, and the Samuel J. and Joan B. Williamson Fellowship.

References

J. Ballé, V. 
Laparra, and E.P. Simoncelli. End-to-end optimized image compression. ICLR 2017, pages 1–27, March 2017.

Matteo Carandini and David J. Heeger. Normalization as a canonical neural computation. Nature Reviews Neuroscience, 13, 2012.

Samuel Dodge and Lina Karam. A study and comparison of human and deep learning recognition performance under visual distortions. arXiv, 2017.

Alexey Dosovitskiy and Thomas Brox. Generating images with perceptual similarity metrics based on deep networks. NIPS 2016: Neural Information Processing Systems, 2016.

R.A. Fisher. Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society, 22:700–725, 1925.

I.J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. ICLR 2015, December 2014.

Olivier J. Hénaff and Eero P. Simoncelli. Geodesics of learned representations. ICLR 2016, November 2016.

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015, February 2015.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. ECCV: The European Conference on Computer Vision, 2016.

Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLOS Computational Biology, 10(11):e1003915, November 2014.

Diederik P. Kingma and Jimmy Lei Ba. Adam: A method for stochastic optimization. ICLR 2015, pages 1–15, January 2015.

V. Laparra, A. Berardino, J. Ballé, and E.P. Simoncelli. Perceptually optimized image rendering. Journal of the Optical Society of America A, 34(9):1511–1525, September 2017.

Valero Laparra, Jordi Muñoz-Marí, and Jesús Malo.
Divisive normalization image quality metric revisited. Journal of the Optical Society of America A, 27, 2010.

Siwei Lyu and Eero P. Simoncelli. Nonlinear image representation using divisive normalization. Proc. Computer Vision and Pattern Recognition, 2008.

J. Malo, I. Epifanio, R. Navarro, and E.P. Simoncelli. Nonlinear image representation for efficient perceptual coding. IEEE Transactions on Image Processing, 15, 2006.

Valerio Mante, Vincent Bonin, and Matteo Carandini. Functional mechanisms shaping lateral geniculate responses to artificial and natural stimuli. Neuron, 58(4):625–638, May 2008.

A. Nguyen, J. Yosinski, and J. Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. IEEE CVPR, 2015.

N. Ponomarenko, V. Lukin, and A. Zelensky. TID2008 - a database for evaluation of full-reference visual quality assessment metrics. Advances of Modern . . . , 2009.

Javier Portilla and Eero P. Simoncelli. A parametric texture model based on joint statistics of complex wavelet coefficients. Int'l Journal of Computer Vision, 40(1):49–71, December 2000.

Peggy Seriès, Alan A. Stocker, and Eero P. Simoncelli. Is the homunculus "aware" of sensory adaptation? Neural Computation, 2009.

Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. ICLR 2015, September 2014.

C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv, December 2013.

Richard von Mises and H. Pollaczek-Geiringer. Praktische Verfahren der Gleichungsauflösung. ZAMM - Zeitschrift für Angewandte Mathematik und Mechanik, 9:152–164, 1929.

Zhou Wang and Eero P. Simoncelli.
Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. Journal of Vision, 2008.

D. L. K. Yamins, H. Hong, C. Cadieu, E.A. Solomon, D. Seibert, and J.J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, June 2014.