{"title": "Metamers of neural networks reveal divergence from human perceptual systems", "book": "Advances in Neural Information Processing Systems", "page_first": 10078, "page_last": 10089, "abstract": "Deep neural networks have been embraced as models of sensory systems, instantiating representational transformations that appear to resemble those in the visual and auditory systems. To more thoroughly investigate their similarity to biological systems, we synthesized model metamers \u2013 stimuli that produce the same responses at some stage of a network\u2019s representation. We generated model metamers for natural stimuli by performing gradient descent on a noise signal, matching the responses of individual layers of image and audio networks to a natural image or speech signal. The resulting signals reflect the invariances instantiated in the network up to the matched layer. We then measured whether model metamers were recognizable to human observers \u2013 a necessary condition for the model representations to replicate those of humans. Although model metamers from early network layers were recognizable to humans, those from deeper layers were not. Auditory model metamers became more human-recognizable with architectural modifications that reduced aliasing from pooling operations, but those from the deepest layers remained unrecognizable. We also used the metamer test to compare model representations. Cross-model metamer recognition dropped off for deeper layers, roughly at the same point that human recognition deteriorated, indicating divergence across model representations. 
The results reveal discrepancies between model and human representations, but also show how metamers can help guide model refinement and elucidate model representations.", "full_text": "Metamers of neural networks reveal divergence from\n\nhuman perceptual systems\n\nJenelle Feather1,2,3 Alex Durango1,2,3 Ray Gonzalez1,2,3 Josh McDermott1,2,3,4\n\n1 Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology\n\n2 McGovern Institute, Massachusetts Institute of Technology\n\n3 Center for Brains Minds and Machines, Massachusetts Institute of Technology\n\n4 Speech and Hearing Bioscience and Technology, Harvard University\n\n{jfeather,durangoa,raygon,jhm}@mit.edu\n\nAbstract\n\nDeep neural networks have been embraced as models of sensory systems, in-\nstantiating representational transformations that appear to resemble those in\nthe visual and auditory systems. To more thoroughly investigate their similarity\nto biological systems, we synthesized model metamers \u2013 stimuli that produce\nthe same responses at some stage of a network\u2019s representation. We gener-\nated model metamers for natural stimuli by performing gradient descent on\na noise signal, matching the responses of individual layers of image and au-\ndio networks to a natural image or speech signal. The resulting signals re\ufb02ect\nthe invariances instantiated in the network up to the matched layer. We then\nmeasured whether model metamers were recognizable to human observers \u2013\na necessary condition for the model representations to replicate those of hu-\nmans. Although model metamers from early network layers were recognizable\nto humans, those from deeper layers were not. Auditory model metamers be-\ncame more human-recognizable with architectural modi\ufb01cations that reduced\naliasing from pooling operations, but those from the deepest layers remained\nunrecognizable. We also used the metamer test to compare model representa-\ntions. 
Cross-model metamer recognition dropped off for deeper layers, roughly\nat the same point that human recognition deteriorated, indicating divergence\nacross model representations. The results reveal discrepancies between model\nand human representations, but also show how metamers can help guide model\nre\ufb01nement and elucidate model representations.\n\n1 Introduction\n\nArti\ufb01cial neural networks now achieve human-level performance on tasks such as image and\nspeech recognition, raising the question of whether they should be taken seriously as models of\nbiological sensory systems [1, 2, 3, 4, 5]. Detailed comparisons of network performance character-\nistics in some cases reveal human-like error patterns, suggesting computational similarities with\nhumans [6, 7, 8]. Other studies have found that brain responses can be better predicted by features\nlearned by deep neural networks than by those of traditional sensory models [2, 8]. On the other\nhand, neural network models can typically be fooled by adversarial perturbations that have no\neffect on humans [9, 10], are in some cases excessively dependent on particular image features,\nsuch as texture [11], and do not fully mirror human sensitivity to image distortions [12, 13], sug-\ngesting differences with human perceptual systems. However, these discrepancies have primarily\nbeen demonstrated using stimuli speci\ufb01cally constructed to induce classi\ufb01cation errors. Here, we\ndemonstrate that the divergence between arti\ufb01cial network and human representations occurs\ngenerically rather than only in adversarial situations.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fWe use \u201cmodel metamers\u201d to test the similarity between human and arti\ufb01cial neural network\nrepresentations. Metamers are stimuli that are physically distinct but that are perceived to be\nthe same by an observer. 
Stimuli that are metameric for humans have long been used to infer\nthe underlying structure of the human perceptual system. Metamers provided some of the\noriginal evidence for trichromacy in human color vision, and have also been applied to texture\nperception [14] and visual crowding [15, 16]. Related ideas can also be used to test models of\nneural computation [17]. Here we leverage the idea that metamers for a valid model of human\nperception should also be metamers for humans. Model metamers produce the same activations\nin a model layer as some other stimulus (here a natural sound or image). Because the activations\nat all subsequent layers must also be the same, the metamers are classi\ufb01ed the same by the model.\nHere, we approximate model metamers via iterative optimization, producing stimuli that produce\nnearly the same activations as a natural stimulus, thus leading to the same network prediction. As a\ntest of whether the model accurately re\ufb02ects human perception, we measure whether humans also\ncorrectly classify the model metamers. Although this test is looser than the classical metamer test\n(which requires metamers to be fully indistinguishable), it is conservative with respect to the goal\nof testing a model of human recognition. We consider model metamers that are unrecognizable\nto a human to be a model failure, cognizant that models that do not perfectly match human\nrepresentations in this way might nonetheless be useful in other respects.\n\nBecause the neural network models we consider are trained to classify exemplars of highly variable\nobject or speech classes, and thus to instantiate representations that are invariant to within-class\nvariation, it is expected that metamers from deeper layers will exhibit greater physical variability\nthan those from early layers. 
The question we sought to answer is whether the nature of the\ninvariances would be similar to those of humans, in which case the model metamers should\nremain human-recognizable regardless of the stage from which they are generated. We generated\nmodel metamers for three image-trained and \ufb01ve sound-trained models that perform well on state-\nof-the-art tasks and then measured human recognition of the model metamers in psychophysical\nexperiments. We also applied the same method across networks, to ask whether the invariances\nlearned by one network resemble those learned by another. The results establish metamers as a\ntool to test and understand deep neural networks, with potential uses for multi-task applications,\ntransfer learning, and network interpretability.\n\n2 Related Work\n\n2.1 Visualization of deep networks\n\nPrevious neural network visualizations have used gradient descent on the input signals to visualize\nthe representations in neural networks [18], in some cases matching the activations at a given layer\n[19] as we do here. Natural image priors have been shown to make images reconstructed in this way\n\u201clook\u201d more natural, and further regularization tools have been proposed with a similar purpose\n[20, 21]. Although such regularization can generate visually appealing images, the importance of\nusing a natural image prior suggests differences between the network representations and those\nof humans. Taking this observation as a starting point, we measured the human-recognizability\nof images or sounds that were matched at different network stages without imposing a separate\nprior, to quantify the potential divergence in representations and get clues as to its origins.\n\n2.2 Comparing networks with other networks\n\nPrior work on network similarity relates the learned representations via methods such as canonical\ncorrelation analysis (CCA) [22, 23, 24]. 
Other such work has been inspired by the neuroscience\ntechnique of representational similarity analysis [25, 26]. Here we also use metamers for model\ncomparison, on the grounds that metamers for one model should also be metamers for another\nmodel (as measured here by producing the same class labels, although one could apply more\n\ufb01ne-grained methods) if the two models share invariances.\n\n2.3 Metamers applied to averaged features\n\nMetamers have been used to develop models of human perception by pooling features to directly\ninduce invariance across space or time. Work on visual crowding used images that have the\n\n2\n\n\fFigure 1: Model metamers are constructed by optimizing a random input signal such that it\nmatches the measured activations of an original signal at a particular network stage. Model\nmetamers are then presented to humans (or other networks) to measure the similarity of internal\nrepresentations.\n\nsame spatially-averaged statistics in the periphery and are indistinguishable from the original\nin particular viewing conditions [27, 16, 28, 29]. Other work has used time-averaged statistics\nmeasured from auditory models, generating auditory textures that are mistaken for the original\nnatural sound [30, 31]. Our work here is a more general instantiation of the metamerism approach,\napplicable to domains outside of peripheral vision and texture where invariances arise in the\nservice of recognition rather than as a direct consequence of pooling.\n\n3 Methods\n\n3.1 Metamer generation\n\nModel metamers were generated using an iterative feature visualization technique [19] 1. We\ninitialized the metamer with noise and then performed gradient descent to minimize the squared\nerror between its network activations and those for a paired natural signal. All models and metamer\ngeneration were implemented in TensorFlow [32]. 
Metamer synthesis used 15000 iterations of the\nAdam optimizer [33] with a learning rate of 0.001, with the exception of the VGGish Embedding\n(0.01) and DeepSpeech (0.0001) models.\n\nTo validate that we had appropriately matched the synthetic signal to the original, we\ncomputed the Spearman correlation between the layer activations for the model metamer and those\nfor the corresponding original signal. These correlations were typically close to 1 (Figure 2). Once candidate metamers were\ngenerated, the following two conditions had to be true for a model metamer to be included in our\nexperiments: (1) The network predicted the same label for the synthetic metamer and the paired\nnatural signal. This is the same classification test we apply to humans and other networks. (2) The\nSpearman \u03c1 between the activations for the metamer and those for the natural signal fell outside of a null distribution measured\nbetween 1,000,000 randomly chosen image or audio pairs from the training set. We compare\nto a null distribution rather than applying a strict threshold because the expected correlation\nvaries with the network and layer. A hard cutoff could label samples as metameric\nwhen they are matched no better than chance, and we empirically found this procedure crucial for the\nrandom network (Figure S3). Histograms of the null and metamer correlations for all networks\nand selected layers are included in Tables S4-S5 and Figures S1-S8.\n\nWe found empirically that it was difficult to match some layers after a ReLU activation due to\nthe initialized signal producing many activations of zero (Figure 2(b)). To improve the optimization,\nwe modified the gradient through the metamer generation layer ReLU to be 1 for all values,\nincluding for values below zero, when generating a metamer for activations immediately following\na ReLU. Figure 2(c) shows the matching fidelity (as measured by Spearman\u2019s \u03c1) for 20 example\nmetamers generated with either the normal gradient or the modified gradient. 
The modi\ufb01ed\ngradient substantially improved the matching on some layers (layer_3 of DeepSpeech, and conv_4\nof the Word Trained CNN). We used the modi\ufb01ed gradient for all metamers generated after a ReLU.\n\n1Example generation code and trained models: https://github.com/jenellefeather/model_metamers\n\n3\n\n\fFigure 2: Validation of model metamer optimization (a) The model metamer is intended to\nproduce the same activations as the original stimulus in a particular network layer. We quanti\ufb01ed\nthe \ufb01delity of the matching as the Spearman correlation between the activations produced by\na model metamer and the corresponding original stimulus, with histograms across stimuli. For\na comparison null distribution, we also measured the correlation for randomly chosen pairs of\nsignals from the training set. As intended, metamers generated from an early layer (top row)\nare well matched to the original in the early layer, with correlations close to 1 (blue distribution,\ntop left), far above the null distribution across stimuli (red). Because the networks used here are\ndeterministic and feedforward, the metamers should also produce the same activations at all\nsubsequent layers, and they do (correlations near 1 in late layers, blue distribution, top right).\nBecause of the many-to-one mapping instantiated by the network, metamers for a late layer\n(bottom row) do not match the activations in the early layer better than chance (left), but match\nthe late layer as intended (right). (b) Comparison of activation matching with a standard ReLU\nactivation function gradient and with the modi\ufb01ed ReLU gradient. Without the modi\ufb01cation,\nmany non-zero values in the original activation get matched to zero. (c) Example layer-wise\nmatching \ufb01delity for metamers generated with either the standard ReLU gradient (blue) and the\nlinear gradient ReLU (red) for two audio networks. 
In both networks there are layers that are\nsigni\ufb01cantly better matched using the modi\ufb01ed ReLU gradients.\n\nFor visual metamers, pixel values were bounded between 0-255 or 0-1 (matching the preprocessing\nof the trained network), and were initialized with white noise with mean at the center value of the\nrange. No other regularization was employed. For audio metamers, we applied gradient clipping\nto operations that resulted in problems with the optimization (speci\ufb01cally, logarithms and power\noperations) which were present in the audio pre-processing (that transformed the waveform to a\nfrequency representation that provided the input to the networks). The audio metamer generation\nwas initialized with pink noise at an RMS value of 0.01.\n\n3.2 Auditory models\n\nOur experiments used a \ufb01ve-layer convolutional network trained on the output of a model of the\nhuman ear. This cochlear model consisted of a \ufb01lterbank of 171 \ufb01lters spaced between 20Hz-80Hz\nwith bandwidths and spacing modeled on the human ear [34, 30]. The envelope of each resulting\naudio subband was extracted via the Hilbert transform, downsampled to 200Hz, and passed\nthrough a compressive non-linearity. This yielded a \u2018cochleagram\u2019 representation, similar to a\nconventional spectrogram but with frequency resolution based on the human cochlea. We trained\nan architecture similar to that in [8] (full architecture described in Table S2).\n\nMany neural networks do not obey the sampling theorem (because downsampling occurs without\na preceding lowpass \ufb01lter), and others have suggested that this could yield invariances that do\nnot align with human perception [35, 36, 37]. Motivated by these observations, we constructed\na modi\ufb01ed architecture to reduce aliasing artifacts (Table S3). The modi\ufb01cations replaced max\npooling operations with weighted average pooling using a hanning kernel applied with stride equal\nto that of the original max pooling. 
Any convolutional layer with a stride greater than one was\nreplaced with a convolutional layer with a stride of one, followed by a hanning pooling operation\nwith stride equal to the original convolutional stride.\n\nAs a demonstration that model metamers could be used to investigate representations in other\naudio models, we also generated example metamers from the VGGish network, which outputs\nembeddings used for training an environmental sound classifier and was released with the AudioSet\ndataset [38]. We also generated metamers for the publicly available DeepSpeech architecture [39].\n\n3.3 Auditory CNN training\n\nThe auditory models were trained on a word recognition task similar to [8], using segments from the\nWall Street Journal [40] and Spoken Wikipedia Corpora [41]. Two-second speech segments were\nused for training examples, with the word in the middle of the clip assigned as the class label for\ntraining. There were 793 word classes sourced from 432 unique speakers, with 230357 unique clips\nin the training set and 40651 segments in the validation set (full details of the dataset construction\nare in Section S1.1). During training, the speech segments were randomly shifted in time and\nsuperimposed on a subset of 718625 AudioSet examples, spanning 516 AudioSet categories [42].\nSome CNN models were trained to predict the AudioSet labels. In order to match performance\nbetween multiple models trained on the same task in Section 4.3 and eliminate confounds due to\ntask performance, we used an early stopping criterion on the validation set of 57% correct for the\nword task and a mean area under the curve (AUC) of 0.83 for the AudioSet task.\n\n3.4 Auditory metamer generation and experiments\n\nWe measured human recognition of model metamers using a task similar to that of [8]. A human\nobserver listened to a clip and chose one of 587 possible word labels. 
Sixteen participants com-\npleted the experiment, each completing \ufb01ve trials from each of the included conditions, randomly\nordered. Stimuli were generated from a set of 295 speech exemplars from the WSJ corpus (see\nTable S1.2 for a summary of auditory model metamers, and Figures S1-S5 for full histograms of the\nnull and metamer Spearman \u03c1). Five sets of CNN metamers were generated for the experiment,\none for each of \ufb01ve models: 1) the architecture inspired by [8], trained on the word task, 2) the\nrandom initialization of the reduced aliasing architecture, and 3) the reduced aliasing architecture\ntrained on the word task, 4) the reduced aliasing architecture trained on the AudioSet task, and\n5) the reduced aliasing architecture trained simultaneously on the AudioSet and word tasks. For\neach model, we included metamers constructed by matching the representations of the activation\nfollowing each convolutional layer, fully-connected layer, and the logits (with the exception of\nthe hanning pooling layer in the reduced aliasing networks that immediately followed strided\nconvolutions, to equate the number of features to that for the aliasing networks). We also included\nmetamers for the cochlear representation.\n\n3.5 Image models, metamer generation, and experiments\n\nImageNet-trained models were obtained from publicly available pretrained checkpoints2. We\ngenerated metamers from a subset of layers for each of VGG-19 [43], Inception-V3 [44], and ResNet-\n101-V2 [45]. To compare performance between networks and humans in the visual domain, we\nused a modi\ufb01ed version of the image classi\ufb01cation task described in [13]. 
For each of a set of layers\nin the three pretrained ImageNet models, we generated metamers of 36 randomly selected natural\nimages across each of the 16 MS-COCO categories (see Supplement Table 4 for a summary of\nmatching the visual model metamers, and Supplement Figures 1-3 for full histograms of the null\nand metamer Spearman \u03c1). Each of sixteen participants had to classify a subset of these metameric\nstimuli and their corresponding natural image seeds, choosing the MS-COCO category; each\nparticipant classified 10 examples per network-layer metamer condition.\n\n4 Results\n\n4.1 Image network model metamers\n\nFor all tested image networks, the metamers became unrecognizable to humans by the final stages\nof the network (Figure 3a-b). The appearance of the metamers to humans varied depending on\nthe architecture. In Inception-V3 and ResNet-101-V2 (both of which include convolutions with\na stride greater than one) there is visible \u2018gridding\u2019 in the metamers generated from early layers,\nplausibly due to aliasing.\n\n2https://github.com/tensorflow/models/tree/master/research/slim\n\nFigure 3: Deep network model metamers and their recognition by human observers. (a) Example\nvisual network model metamers synthesized to produce the same activations at a particular layer\nof a particular network as the image in the top left. (b) Human recognition of visual network\nmodel metamers. Recognition is good for early-layer metamers but poor for deep-layer metamers,\nimplying a divergence from human perceptual representations. Error bars are standard error of\nthe mean (SEM). (c) Example cochleagrams (time-frequency decompositions) for metamers from\nan audio network trained to recognize words. (d) Human recognition of word-trained CNN model\nmetamers. As for vision-trained models, recognition is good for early-layer metamers but poor for\ndeep-layer metamers. Error bars are SEM.\n\nFigure 4: Human recognition of audio network model metamers. Architectural manipulations that\nreduce aliasing (left), training (middle), and task (right) all altered the recognizability of metamers.\n\n4.2 Audio network model metamers\n\nThe metamers from the word-trained network with an architecture based on [8] also quickly\nbecome unrecognizable to humans (Figure 3c-d). Although not included in the human behavioral\nexperiment, we also generated example metamers from DeepSpeech and the VGGish Embedding\nNetwork3. All metamers from DeepSpeech sound unnatural due to the input representation\n(framed MFCCs). The metamers from the VGGish embedding network become difficult to recognize\nby conv_4 (perhaps unsurprisingly, as we only generated metamers for speech, and the network\nwas not trained for speech recognition).\n\n4.3 Model metamers from audio networks with modified task or architecture\n\nWe considered that the decrease in metamerism for humans might be due to aliasing (from\nconvolutional layers with strides greater than 1, and maxpooling layers). Consistent with this\nidea, the modified architecture that reduces aliasing yielded model metamers that were more\nrecognizable to humans (Figure 4). We also considered the effect of training on metamerism.\nUnlike those from the trained networks, metamers from a random network with reduced aliasing remained\nrecognizable through all convolutional layers, only becoming unrecognizable at the top\nfully-connected layer. This result suggests that task optimization adds invariances to the network that\ncan in some cases differ from human invariances. However, the human-recognizability of the\nmodel metamers was task-specific \u2013 the same network architecture trained to classify the AudioSet\nbackgrounds produced metamers that became unrecognizable more quickly than when trained on\nthe word task. 
Training on the AudioSet classification in addition to the word task did not impair\nmetamerism (Figure 4). In all cases the metamers from deep layers remained unrecognizable to\nhumans, but the effects of these manipulations raise the possibility that appropriate choices of\ntraining and architecture might produce a model that better accounts for human perception.\n\n4.4 Metamer comparisons between ImageNet architectures\n\nThe metamer test can also be used to compare different architectures. We generated metameric\nimages for one ImageNet-trained network and then presented its metamers to a second network.\nIf the representational spaces of the two networks are the same, then the second network\nshould be able to correctly classify the metamers from the first network. For all three tested\nnetworks, we find that the representations diverge from those of the other networks (Figure 5).\nFurther, at late layers the model metamers are generally not even recognizable to the same\nImageNet-trained architecture trained with a different initialization (especially evident in the\n1000-way classification task). Interestingly, image metamers for one network become non-metameric\nfor another network at roughly the same layer at which human performance diverges.\n\n3Example audio metamers: http://mcdermottlab.mit.edu/jfeather/model_metamers/audio_metamers.html\n\nFigure 5: Network recognition of metamers from other networks and for networks with the\nsame architecture but different initializations. All networks were trained on ImageNet. Top row:\nperformance on 16-way classification task with metamers (using groups of the original ImageNet\nclasses, used for human recognition experiment in Figure 3). Bottom row: performance on original\nImageNet classification task with metamers.\n\nFigure 6: Word-trained network recognition of metamers from other networks. 
Metamers were\ngenerated from networks with the same architecture but trained on different tasks. Error bars are\nbootstrapped SEM.\n\n4.5 Metamer comparisons between audio networks trained on different tasks\n\nIn the audio domain, we tested whether model metamers generalized across training tasks and\nrandom seeds (Figure 6). We measured performance of the word-trained network on metamers\ngenerated from networks with the same architecture but trained on a different task. Metamers\ngenerated from untrained networks were poorly recognized by the word-trained network, providing\nfurther evidence that training alters the network invariances. Model metamerism did not transfer\nbetween the word-trained network and the AudioSet-trained network, but metamers generated\nfrom the network trained on both tasks were only slightly less metameric than metamers from\na word-trained network with weights initialized with a different random seed. This latter result\nprovides a proof of concept that it is possible for metamers to be shared across distinct systems.\n\n5 Discussion\n\nOur results show that model metamers generated from deep layers of artificial neural networks\nare not metameric for humans or other networks. These findings demonstrate a divergence in the\ninvariances learned by neural networks from those present in human perceptual systems. They\nalso highlight the benefits of using model metamers as a network comparison tool. Our results\nsuggest that discrepancies between model and human representations, and between different\nmodels, arise in later model stages, identifying those stages as targets for model refinement.\nIndeed, we were able to modify some aspects of our audio-trained models to reduce aliasing and\nincrease human recognition of the model metamers. 
We also demonstrate that human recognition\nof the metamers is dependent on the training task, possibly suggesting that the failure of humans\nto recognize the model metamers may be a re\ufb02ection of training on a single task (in this case,\nrecognizing speech but ignoring the background). Future work could investigate this by modifying\ntasks to be more diverse, or more human-like, and assessing whether the improved models better\npredict human behavior.\n\nThe transfer of metamers with different random seeds was surprisingly different between the\nimage- and audio-trained networks. Further investigation revealed that optimizing the cochlea-\ngram representation rather than the audio yielded model metamers that were less recognizable\nby a network trained on a different random seed (Figure S9). This result raises the possibility\nthat the shared \"cochlear\" pre-processing (consisting of \ufb01xed stages of convolution, pooling,\nand non-linearities) enforces shared invariances between audio-trained networks with differ-\nent initializations. Future work could use metamerism to explore the use of shared early-layer\nrepresentations as a way to unify representations across models and potentially better model\nhuman perceptual systems, for instance by adding additional biological constraints on the input\nrepresentation.\n\nModel metamers are complementary to adversarial examples. Adversarial examples are metameric\n(perceived similarly) for humans but are not metameric to the network they are derived for, demon-\nstrating that the network lacks some invariances present in humans. Model metamers conversely\ndemonstrate that invariances present in networks are not necessarily invariances for human\nperception (or other networks). The relationship between adversarial and metameric images was\nexplored recently in [46], who concluded that the cross-entropy loss creates excessive invariance in\nthe \ufb01nal classi\ufb01cation layer, leading to adversarial examples. 
We explored related issues but examined\na more diverse set of network layers and explicitly performed human and network-network\nexperiments. Together, these lines of work suggest that techniques for reducing adversarial\nvulnerability may also improve the transfer of metamers across models. Moreover, metamers could\nbe useful for evaluating the adversarial vulnerability of a model. However, unlike adversarial\nexamples, which are specifically engineered to fool a particular system, model metamers are\nconstrained only to produce the same model activations (rather than to fool humans). The\nconsiderable lack of metamer transfer to humans thus arguably represents a more substantial model\nfailure, and a useful measuring stick for models of perceptual systems.\n\nAcknowledgements\n\nWe thank Richard McWalter, Alex Kell, and Sam Norman-Haignere for comments on an early draft\nof this work. We also thank Mark Saddler and Andrew Francl for contributing to a shared codebase\nused in this project. This work was funded by a McDonnell Scholar Award to J.H.M., NSF grant\nBCS-1634050, NIH grant R01-DC017970 and a DOE CSGF Fellowship to J.J.F.\n\nReferences\n\n[1] Nikolaus Kriegeskorte. Deep neural networks: a new framework for modeling biological\nvision and brain information processing. Annual Review of Neuroscience, pages 417\u2013446,\n2015.\n\n[2] Daniel LK Yamins and James J DiCarlo. Using goal-driven deep learning models to understand\nsensory cortex. Nature neuroscience, 19(3):356, 2016.\n\n[3] Alex Kell and Josh H. McDermott. Deep neural network models of sensory systems: windows\nonto the role of task constraints. Current Opinion in Neurobiology, 55:121\u2013132, 2019.\n\n[4] David GT Barrett, Ari S Morcos, and Jakob H Macke. Analyzing biological and artificial neural\nnetworks: challenges with opportunities for synergy? Current opinion in neurobiology,\n55:55\u201364, 2019.\n\n[5] Adam H Marblestone, Greg Wayne, and Konrad P Kording. 
Toward an integration of deep\nlearning and neuroscience. Frontiers in computational neuroscience, 10:94, 2016.\n\n[6] Rishi Rajalingham, Kailyn Schmidt, and James J DiCarlo. Comparison of object recognition\nbehavior in human and monkey. Journal of Neuroscience, 35(35):12127\u201312136, 2015.\n\n[7] Saeed Reza Kheradpisheh, Masoud Ghodrati, Mohammad Ganjtabesh, and Timoth\u00e9e\nMasquelier. Deep networks can resemble human feed-forward vision in invariant object\nrecognition. Scientific reports, 6:32672, 2016.\n\n[8] Alexander JE Kell, Daniel LK Yamins, Erica N Shook, Sam V Norman-Haignere, and Josh H\nMcDermott. A task-optimized neural network replicates human auditory behavior, predicts\nbrain responses, and reveals a cortical processing hierarchy. Neuron, 98(3):630\u2013644, 2018.\n\n[9] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial\nexamples. International Conference on Learning Representations (ICLR), 2014.\n\n[10] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text.\nIn 2018 IEEE Security and Privacy Workshops (SPW), pages 1\u20137. IEEE, 2018.\n\n[11] Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and\nWieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias\nimproves accuracy and robustness. International Conference on Learning Representations\n(ICLR), 2019.\n\n[12] Alexander Berardino, Valero Laparra, Johannes Ball\u00e9, and Eero Simoncelli. Eigen-distortions\nof hierarchical representations. In Advances in Neural Information Processing Systems, pages\n3530\u20133539, 2017.\n\n[13] Robert Geirhos, Carlos RM Temme, Jonas Rauber, Heiko H Sch\u00fctt, Matthias Bethge, and\nFelix A Wichmann. Generalisation in humans and deep neural networks. In Advances in\nNeural Information Processing Systems, pages 7538\u20137550, 2018.\n\n[14] B. Julesz. Visual pattern discrimination. 
IEEE Transactions on Information Theory, 8:84\u201392,\n1962.\n\n[15] Benjamin Balas, Lisa Nakano, and Ruth Rosenholtz. A summary-statistic representation in\nperipheral vision explains visual crowding. Journal of vision, 9(12):13\u201313, 2009.\n\n[16] Jeremy Freeman and Eero P Simoncelli. Metamers of the ventral stream. Nature neuroscience,\n14(9):1195, 2011.\n\n[17] Sam V Norman-Haignere and Josh H McDermott. Neural responses to natural and model-matched\nstimuli reveal distinct computations in primary and nonprimary auditory cortex.\nPLoS biology, 16(12):e2005127, 2018.\n\n[18] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017.\nhttps://distill.pub/2017/feature-visualization.\n\n[19] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations\nby inverting them. In Proceedings of the IEEE conference on computer vision and pattern\nrecognition, pages 5188\u20135196, 2015.\n\n[20] Jason Yosinski, Jeff Clune, Anh Nguyen, Thomas Fuchs, and Hod Lipson. Understanding\nneural networks through deep visualization. International Conference on Machine Learning,\nDeep Learning Workshop, 2015.\n\n[21] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Going deeper into\nneural networks. 2015.\n\n[22] Ari Morcos, Maithra Raghu, and Samy Bengio. Insights on representational similarity in\nneural networks with canonical correlation. In Advances in Neural Information Processing\nSystems, pages 5727\u20135736, 2018.\n\n[23] Yixuan Li, Jason Yosinski, Jeff Clune, Hod Lipson, and John E Hopcroft. Convergent learning:\nDo different neural networks learn the same representations? In FE@NIPS, pages 196\u2013212,\n2015.\n\n[24] Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. SVCCA: Singular\nvector canonical correlation analysis for deep learning dynamics and interpretability. 
In\nAdvances in Neural Information Processing Systems, pages 6076\u20136085, 2017.\n\n[25] Nikolaus Kriegeskorte, Marieke Mur, and Peter A Bandettini. Representational similarity\nanalysis-connecting the branches of systems neuroscience. Frontiers in systems neuroscience,\n2:4, 2008.\n\n[26] Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of\nneural network representations revisited. International Conference on Machine Learning,\n2019.\n\n[27] Thomas SA Wallis, Christina M Funke, Alexander S Ecker, Leon A Gatys, Felix A Wichmann,\nand Matthias Bethge. Image content is more important than Bouma\u2019s law for scene metamers.\neLife, 8:e42512, 2019.\n\n[28] T. S. A. Wallis, C. M. Funke, A. S. Ecker, L. A. Gatys, F. A. Wichmann, and M. Bethge. A\nparametric texture model based on deep convolutional features closely matches texture\nappearance for humans. Journal of Vision, 17(12), Oct 2017.\n\n[29] Arturo Deza, Aditya Jonnalagadda, and Miguel Eckstein. Towards metamerism via foveated\nstyle transfer. International Conference on Learning Representations, 2017.\n\n[30] Josh H McDermott and Eero P Simoncelli. Sound texture perception via statistics of the\nauditory periphery: Evidence from sound synthesis. Neuron, 71:926\u2013940, 2011.\n\n[31] Jenelle Feather and Josh H. McDermott. Auditory texture synthesis from task-optimized\nconvolutional neural networks. Conference on Computational Cognitive Neuroscience, 2018.\n\n[32] Mart\u00edn Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu\nDevin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. TensorFlow: A system for\nlarge-scale machine learning. In 12th USENIX Symposium on Operating Systems Design\nand Implementation (OSDI 16), pages 265\u2013283, 2016.\n\n[33] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. 
International\nConference on Learning Representations, 2015.\n\n[34] Brian Glasberg and Brian C J Moore. Derivation of auditory filter shapes from notched-noise\ndata. Hearing Research, 47:103\u2013138, 1990.\n\n[35] Olivier J H\u00e9naff and Eero P Simoncelli. Geodesics of learned representations. International\nConference on Learning Representations, 2016.\n\n[36] Richard Zhang. Making convolutional networks shift-invariant again. International Conference\non Machine Learning, 2019.\n\n[37] Aharon Azulay and Yair Weiss. Why do deep convolutional networks generalize so poorly to\nsmall image transformations? arXiv preprint arXiv:1805.12177, 2018.\n\n[38] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing\nMoore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron\nWeiss, and Kevin Wilson. CNN architectures for large-scale audio classification. In International\nConference on Acoustics, Speech and Signal Processing (ICASSP), 2017.\n\n[39] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan\nPrenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. Deep speech: Scaling up\nend-to-end speech recognition. arXiv preprint arXiv:1412.5567, 2014.\n\n[40] Douglas B Paul and Janet M Baker. The design for the Wall Street Journal-based CSR corpus. In\nProceedings of the workshop on Speech and Natural Language, pages 357\u2013362. Association for\nComputational Linguistics, 1992.\n\n[41] Arne K\u00f6hn, Florian Stegen, and Timo Baumann. Mining the Spoken Wikipedia for speech\ndata and beyond. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck,\nMarko Grobelnik, Bente Maegaard, Joseph Mariani, Asuncion Moreno, Jan Odijk, and\nStelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources\nand Evaluation (LREC 2016), Paris, France, May 2016. 
European Language Resources\nAssociation (ELRA).\n\n[42] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing\nMoore, Manoj Plakal, and Marvin Ritter. Audio Set: An ontology and human-labeled\ndataset for audio events. In Proc. IEEE ICASSP 2017, New Orleans, LA, 2017.\n\n[43] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale\nImage Recognition. International Conference on Learning Representations, pages 1\u201314, 2014.\n\n[44] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna.\nRethinking the inception architecture for computer vision. 2016 IEEE Conference on\nComputer Vision and Pattern Recognition (CVPR), pages 2818\u20132826, 2016.\n\n[45] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual\nnetworks. In ECCV, 2016.\n\n[46] J.-H. Jacobsen, J. Behrmann, R. Zemel, and M. Bethge. Excessive invariance causes adversarial\nvulnerability. International Conference on Learning Representations (ICLR), 2019.\n\n[47] Victor W Zue and Stephanie Seneff. Transcription and alignment of the TIMIT database. In\nRecent Research Towards Advanced Man-Machine Interface Through Spoken Language, pages\n515\u2013525. Elsevier, 1996.", "award": [], "sourceid": 5326, "authors": [{"given_name": "Jenelle", "family_name": "Feather", "institution": "MIT"}, {"given_name": "Alex", "family_name": "Durango", "institution": "MIT"}, {"given_name": "Ray", "family_name": "Gonzalez", "institution": "MIT"}, {"given_name": "Josh", "family_name": "McDermott", "institution": "Massachusetts Institute of Technology"}]}