{"title": "Adversarial Examples that Fool both Computer Vision and Time-Limited Humans", "book": "Advances in Neural Information Processing Systems", "page_first": 3910, "page_last": 3920, "abstract": "Machine learning models are vulnerable to adversarial examples: small changes to images can cause computer vision models to make mistakes such as identifying a school bus as an ostrich. However, it is still an open question whether humans are prone to similar mistakes. Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.", "full_text": "Adversarial Examples that Fool both Computer\n\nVision and Time-Limited Humans\n\nGamaleldin F. Elsayed\u21e4\n\nGoogle Brain\n\ngamaleldin.elsayed@gmail.com\n\nShreya Shankar\nStanford University\n\nBrian Cheung\nUC Berkeley\n\nNicolas Papernot\n\nPennsylvania State University\n\nAlexey Kurakin\n\nGoogle Brain\n\nIan Goodfellow\nGoogle Brain\n\nJascha Sohl-Dickstein\n\nGoogle Brain\n\njaschasd@google.com\n\nAbstract\n\nMachine learning models are vulnerable to adversarial examples: small changes\nto images can cause computer vision models to make mistakes such as identifying\na school bus as an ostrich. However, it is still an open question whether humans\nare prone to similar mistakes. 
Here, we address this question by leveraging recent techniques that transfer adversarial examples from computer vision models with known parameters and architecture to other models with unknown parameters and architecture, and by matching the initial processing of the human visual system. We find that adversarial examples that strongly transfer across computer vision models influence the classifications made by time-limited human observers.

1 Introduction

Machine learning models are easily fooled by adversarial examples: inputs optimized by an adversary to produce an incorrect model classification [39, 3]. In computer vision, an adversarial example is usually an image formed by making small perturbations to an example image. Many algorithms for constructing adversarial examples [39, 13, 33, 24, 27] rely on access to both the architecture and the parameters of the model to perform gradient-based optimization on the input. Without similar access to the brain, these methods do not seem applicable to constructing adversarial examples for humans.
One interesting phenomenon is that adversarial examples often transfer from one model to another, making it possible to attack models that an attacker has no access to [39, 26]. This naturally raises the question of whether humans are susceptible to these adversarial examples. Clearly, humans are prone to many cognitive biases and optical illusions [17], but these generally do not resemble small perturbations of natural images, nor are they currently generated by optimization of an ML loss function.
Thus the current understanding is that this class of transferable adversarial examples has no effect on human visual perception, but no thorough empirical investigation has yet been performed.
A rigorous investigation of the above question creates an opportunity both for machine learning to gain knowledge from neuroscience, and for neuroscience to gain knowledge from machine learning. Neuroscience has often provided existence proofs for machine learning—before we had working object recognition algorithms, we hypothesized it should be possible to build them because the human brain can recognize objects. See Hassabis et al. [15] for a review of the influence of neuroscience on artificial intelligence. If we knew conclusively that the human brain could resist a certain class of adversarial examples, this would provide an existence proof for a similar mechanism in machine learning security. If we knew conclusively that the brain can be fooled by adversarial examples, then machine learning security research should perhaps shift its focus from designing models that are robust to adversarial examples [39, 13, 32, 42, 40, 27, 19, 5] to designing systems that are secure despite including non-robust machine learning components. Likewise, if adversarial examples developed for computer vision affect the brain, this phenomenon discovered in the context of machine learning could lead to a better understanding of brain function.

*Work done as a member of the Google AI Residency program (g.co/airesidency).

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

In this work, we construct adversarial examples that transfer from computer vision models to the human visual system. In order to successfully construct these examples and observe their effect, we leverage three key ideas from machine learning, neuroscience, and psychophysics.
First, we use\nthe recent black box adversarial example construction techniques that create adversarial examples\nfor a target model without access to the model\u2019s architecture or parameters. Second, we adapt\nmachine learning models to mimic the initial visual processing of humans, making it more likely\nthat adversarial examples will transfer from the model to a human observer. Third, we evaluate\nclassi\ufb01cation decisions of human observers in a time-limited setting, so that even subtle effects on\nhuman perception are detectable. By making image presentation suf\ufb01ciently brief, humans are unable\nto achieve perfect accuracy even on clean images, and small changes in performance lead to more\nmeasurable changes in accuracy. Additionally, a brief image presentation limits the time in which\nthe brain can utilize recurrent and top-down processing pathways [34], and is believed to make the\nprocessing in the brain more closely resemble that in a feedforward arti\ufb01cial neural network.\nWe \ufb01nd that adversarial examples that transfer across computer vision models do successfully\nin\ufb02uence the perception of human observers, thus uncovering a new class of illusions that are shared\nbetween computer vision models and the human brain.\n\n2 Background and Related Work\n\n2.1 Adversarial Examples\n\nGoodfellow et al. [12] de\ufb01ne adversarial examples as \u201cinputs to machine learning models that an\nattacker has intentionally designed to cause the model to make a mistake.\u201d In the context of visual\nobject recognition, adversarial examples are images usually formed by applying a small perturbation\nto a naturally occurring image in a way that breaks the predictions made by a machine learning\nclassi\ufb01er. See Figure Supp.1a for a canonical example where adding a small perturbation to an\nimage of a panda causes it to be misclassi\ufb01ed as a gibbon. 
This perturbation is small enough to be\nimperceptible (i.e., it cannot be saved in a standard png \ufb01le that uses 8 bits because the perturbation\nis smaller than 1/255 of the pixel dynamic range). This perturbation relies on carefully chosen\nstructure based on the parameters of the neural network\u2014but when magni\ufb01ed to be perceptible,\nhuman observers cannot recognize any meaningful structure. Note that adversarial examples also\nexist in other domains like malware detection [14], but we focus here on image classi\ufb01cation tasks.\nTwo aspects of the de\ufb01nition of adversarial examples are particularly important for this work:\n\n1. Adversarial examples are designed to cause a mistake. They are not (as is commonly\nmisunderstood) de\ufb01ned to differ from human judgment. If adversarial examples were\nde\ufb01ned by deviation from human output, it would by de\ufb01nition be impossible to make\nadversarial examples for humans. On some tasks, like predicting whether input numbers\nare prime, there is a clear objectively correct answer, and we would like the model to get\nthe correct answer, not the answer provided by humans (time-limited humans are probably\nnot very good at guessing whether numbers are prime). It is challenging to de\ufb01ne what\nconstitutes a mistake for visual object recognition. After adding a perturbation to an image it\nlikely no longer corresponds to a photograph of a non-contrived physical scene. Furthermore,\nit is philosophically dif\ufb01cult to de\ufb01ne the real object class for an image that is not a picture\nof a real object. In this work, we assume that an adversarial image is misclassi\ufb01ed if the\noutput label differs from the human-provided label of the clean image that was used as the\nstarting point for the adversarial image. We make small adversarial perturbations and we\nassume that these small perturbations are insuf\ufb01cient to change the true class.\n\n2\n\n\f2. 
Adversarial examples are not (as is commonly misunderstood) defined to be imperceptible. If this were the case, it would be impossible by definition to make adversarial examples for humans, because changing the human's classification would constitute a change in what the human perceives (e.g., see Figure Supp.1b,c).

2.1.1 Clues that Transfer to Humans is Possible

Some observations give clues that transfer to humans may be possible. Adversarial examples are known to transfer across machine learning models, which suggests that these adversarial perturbations may carry information about target adversarial classes. Adversarial examples that fool one model often fool another model with a different architecture [39], a model that was trained on a different training set [39], or even a model trained with a different algorithm [30] (e.g., adversarial examples designed to fool a convolutional neural network may also fool a decision tree). The transfer effect makes it possible to perform black box attacks, where adversarial examples fool models that an attacker does not have access to [39, 31]. Kurakin et al. [24] found that adversarial examples transfer from the digital to the physical world, despite many transformations such as lighting and camera effects that modify their appearance when they are photographed in the physical world. Liu et al.
[26] showed that the transferability of an adversarial example can be greatly improved by optimizing it to fool many machine learning models rather than one model: an adversarial example that fools five models used in the optimization process is more likely to fool an arbitrary sixth model.
Moreover, recent studies on stronger adversarial examples that transfer across multiple settings have sometimes produced adversarial examples that appear more meaningful to human observers. For instance, a cat adversarially perturbed to resemble a computer [2] while transferring across geometric transformations develops features that appear computer-like (Figure Supp.1b), and the 'adversarial toaster' from Brown et al. [4] possesses features that seem toaster-like (Figure Supp.1c). This development of human-meaningful features is consistent with the adversarial example carrying true feature information and thus coming closer to fooling humans, provided we account for the notable differences between human visual processing and computer vision models (see section 2.2.2).

2.2 Biological and Artificial Vision

2.2.1 Similarities

Recent research has found similarities in representation and behavior between deep convolutional neural networks (CNNs) and the primate visual system [6]. This further motivates the possibility that adversarial examples may transfer from computer vision models to humans. Activity in deeper CNN layers has been observed to be predictive of activity recorded in the visual pathway of primates [6, 43]. Riesenhuber and Poggio [36] developed a model of object recognition in cortex that closely resembles many aspects of modern CNNs. Kümmerer et al. [21, 22] showed that CNNs are predictive of human gaze fixation. Style transfer [10] demonstrated that intermediate layers of a CNN capture notions of artistic style which are meaningful to humans. Freeman et al.
[9] used representations in a CNN-like model to develop psychophysical metamers, which are indistinguishable to humans when viewed briefly and with carefully controlled fixation. Psychophysics experiments have compared the pattern of errors made by humans to that made by neural network classifiers [11, 35].

2.2.2 Notable Differences

Differences between machine and human vision occur early in the visual system. Images are typically presented to CNNs as a static rectangular pixel grid with constant spatial resolution. The primate eye, on the other hand, has an eccentricity-dependent spatial resolution. Resolution is high in the fovea, or central ∼5° of the visual field, but falls off linearly with increasing eccentricity [41]. A perturbation which requires high acuity in the periphery of an image, as might occur as part of an adversarial example, would be undetectable by the eye, and thus would have no impact on human perception. Further differences include the sensitivity of the eye to temporal as well as spatial features, as well as non-uniform color sensitivity [25]. Modeling the early visual system continues to be an area of active study [29, 28]. As we describe in section 3.1.2, we mitigate some of these differences by using a biologically-inspired image input layer.
Beyond early visual processing, there are more major computational differences between CNNs and the human brain. All the CNNs we consider are fully feedforward architectures, while the visual cortex has many times more feedback than feedforward connections, as well as extensive recurrent dynamics [29]. Possibly due to these differences in architecture, humans have been found experimentally to make classification mistakes that are qualitatively different from those made by deep networks [8]. Additionally, the brain does not treat a scene as a single static image, but actively explores it with saccades [18].
As is common in psychophysics experiments [20], we mitigate these differences in processing by limiting both the way in which the image is presented, and the time the subject has to process it, as described in section 3.2.

3 Methods

Section 3.1 details our machine learning vision pipeline. Section 3.2 describes our psychophysics experiment to evaluate the impact of adversarial images on human subjects.

3.1 The Machine Learning Vision Pipeline

3.1.1 Dataset

In our experiment, we used images from ImageNet [7]. ImageNet contains 1,000 highly specific classes that typical people may not be able to identify, such as "Chesapeake Bay retriever". Thus, we combined some of these fine classes to form six coarse classes we were confident would be familiar to our experiment subjects ({dog, cat, broccoli, cabbage, spider, snake}). We then grouped these six classes into the following groups: (i) Pets group (dog and cat images); (ii) Hazard group (spider and snake images); (iii) Vegetables group (broccoli and cabbage images).

3.1.2 Ensemble of Models

We constructed an ensemble of ten CNN models trained on ImageNet. Each model is an instance of one of these architectures: Inception V3, Inception V4, Inception ResNet V2, ResNet V2 50, ResNet V2 101, and ResNet V2 152 [38, 37, 16]. To better match the initial processing of the human visual system, we prepend each model with a retinal layer, which pre-processes the input to incorporate some of the transformations performed by the human eye. In that layer, we perform an eccentricity-dependent blurring of the image to approximate the input received by the visual cortex of human subjects through their retinal lattice. The details of this retinal layer are described in Appendix B.
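The idea of eccentricity-dependent blurring can be illustrated with a toy sketch. This is not the actual layer from Appendix B: the function name and parameter values are illustrative, and a simple box average stands in for the true blur kernel.

```python
import numpy as np

def retinal_blur(img, fovea_deg=5.0, px_per_deg=10.0, slope=0.05):
    """Toy eccentricity-dependent blur (illustrative, not the Appendix B layer).

    Each pixel is replaced by a local box average whose radius grows
    linearly with eccentricity (distance from the image center, in
    degrees of visual angle) once outside a full-resolution foveal region.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    out = np.empty_like(img, dtype=np.float64)
    for y in range(h):
        for x in range(w):
            # eccentricity of this pixel in degrees of visual angle
            ecc_deg = np.hypot(y - cy, x - cx) / px_per_deg
            # blur radius: zero inside the fovea, then linear growth
            r = int(max(0.0, ecc_deg - fovea_deg) * slope * px_per_deg)
            ys = slice(max(0, y - r), y + r + 1)
            xs = slice(max(0, x - r), x + r + 1)
            out[y, x] = img[ys, xs].mean()
    return out
```

The center of the image passes through unchanged, while peripheral detail (where an adversarial perturbation might otherwise hide) is averaged away, mirroring the acuity falloff described above.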
We use eccentricity-dependent spatial resolution measurements (based on the macaque visual system) [41], along with the known geometry of the viewer and the screen, to determine the degree of spatial blurring at each image location. This limits the CNN to information which is also available to the human visual system. The layer is fully differentiable, allowing gradients to backpropagate through the network when running adversarial attacks. Further details of the models and their classification performance are provided in Appendix E.

3.1.3 Generating Adversarial Images

For a given image group, we wish to generate targeted adversarial examples that strongly transfer across models. This means that for a class pair (A, B) (e.g., A: cats and B: dogs), we generate adversarial perturbations such that models will classify perturbed images from A as B; similarly, we perturb images from B to be classified as A. A different perturbation is constructed for each image; however, the ℓ∞ norm of all perturbations is constrained to be equal to a fixed ε.
Formally: given a classifier, which assigns probability P(y | X) to each coarse class y given an input image X, a specified target class y_target and a perturbation magnitude ε, we want to find the image X_adv that minimizes −log(P(y_target | X_adv)) with the constraint that ||X_adv − X||∞ = ε. See Appendix C for details on computing the coarse class probabilities P(y | X). With the classifier's parameters, we can perform iterated gradient descent on X in order to generate our X_adv (see Appendix D). This iterative approach is commonly employed to generate adversarial images [24].
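A minimal NumPy sketch of this iterated procedure follows. The `grad_log_prob` callback is a hypothetical stand-in for backpropagating log P(y_target | X) through the model ensemble (including the differentiable retinal layer); the sign-step and projection are one common way to realize the iterative approach named above.

```python
import numpy as np

def targeted_attack(x, grad_log_prob, epsilon, steps=10, step_size=None):
    """Iterated-gradient sketch of the targeted attack in Section 3.1.3.

    x:             clean image, float array with pixel values in [0, 255]
    grad_log_prob: callback returning d log P(y_target | x) / dx
                   (stand-in for backprop through the ensemble + retinal layer)
    epsilon:       L-infinity budget for the perturbation
    """
    if step_size is None:
        step_size = epsilon / steps
    x_adv = x.astype(np.float64).copy()
    for _ in range(steps):
        g = grad_log_prob(x_adv)
        # ascend log P(y_target | x), i.e. descend -log P(y_target | x)
        x_adv += step_size * np.sign(g)
        # project back into the L-infinity ball and the valid pixel range
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)
        x_adv = np.clip(x_adv, 0.0, 255.0)
    return x_adv
```

With a consistent gradient sign, `steps * step_size = epsilon` drives the perturbation to the full ε magnitude at every pixel (up to pixel-range clipping), matching the equality constraint above.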
[Figure 1 panels: (a) image / adv (to dog) / adv (to cat) / flip; (b) image / adv; (c) high-refresh monitor, light sensor, response time box; (d) fixation 500–1000 ms, stimulus {71, 63} ms, masking 200 ms, response window {2500, 2200} ms, repeat.]

Figure 1: Experiment setup and task. (a) Example images from the conditions (image, adv, and flip). Top: adv targeting broccoli class. Bottom: adv targeting cat class. See definition of conditions in Section 3.2.2. (b) Example images from the false experiment condition. (c) Experiment setup and recording apparatus. (d) Task structure and timings. The subject is asked to repeatedly identify which of two classes (e.g. dog vs. cat) a briefly presented image belongs to. The image is either adversarial, or belongs to one of several control conditions. See Section 3.2 for details.

3.2 Human Psychophysics Experiment

38 subjects with normal or corrected vision participated in the experiment. Subjects gave informed consent to participate, and were awarded a reasonable compensation for their time and effort².

²The study was granted an Institutional Review Board (IRB) exemption by an external, independent, ethics board (Quorum review ID 33016).

3.2.1 Experimental Setup

Subjects sat on a fixed chair 61 cm away from a high refresh-rate computer screen (ViewSonic XG2530) in a room with dimmed light (Figure 1c). Subjects were asked to classify images that appeared on the screen into one of two classes (two alternative forced choice) by pressing buttons on a response time box (LOBES v5/6: USTC) using two fingers on their right hand. The assignment of classes to buttons was randomized for each experiment session, and labels were placed next to the buttons to prevent confusion. Each trial started with a fixation cross displayed in the middle of the screen for 500–1000 ms. Subjects were instructed to direct their gaze to the center of this cross (Figure 1d). After the fixation period, an image of size 15.24 cm × 15.24 cm (14.2° visual angle) was presented briefly at the center of the screen for a period of 63 ms (71 ms for some sessions). The image was followed by a sequence of ten high contrast binary random masks, each displayed for 20 ms (see example in Figure 1d). Subjects were asked to classify the object in the image (e.g., cat vs. dog) by pressing one of two buttons starting at the image presentation time and lasting until 2200 ms (or 2500 ms for some sessions) after the mask was turned off. The waiting period to start the next trial was of the same duration whether subjects responded quickly or slowly. Realized exposure durations were ±4 ms from the times reported above, as measured by a photodiode and oscilloscope in a separate test experiment. Each subject's response time was recorded by the response time box relative to the image presentation time (monitored by a photodiode). In the case where a subject pressed more than one button in a trial, only the class corresponding to their first choice was considered. Each subject completed between 140 and 950 trials. Further, subjects performed one or more demo trials at the start of the session, to gain familiarity with the task. During the demo trials only, image presentation time was long, and subjects were given feedback on the correctness of their choice.

3.2.2 Experiment Conditions

Each experimental session included only one of the image groups (Pets, Vegetables or Hazard). For each group, images were presented in one of four conditions as follows:

• image: images from the ImageNet training set (rescaled to the [40, 255 − 40] range to avoid clipping when adversarial perturbations are added; see Figure 1a left).

• adv: we added an adversarial perturbation adv to image, crafted to cause machine learning models to misclassify adv as the opposite class in the group (e.g., if image was originally a cat, we perturbed the image to be classified as a dog). We used a perturbation size large enough to be noticeable by humans on the computer screen but small with respect to the image intensity scale (ε = 32; see Figure 1a middle). In other words, we chose ε to be large (to improve the chances that adversarial examples transfer to time-limited humans) but kept it small enough that the perturbations are class-preserving (as judged by a no-limit human).

• flip: similar to adv, but the adversarial perturbation (adv) is flipped vertically before being added to image.
This is a control condition, chosen to have nearly identical perturbation statistics to the adv condition (see Figure 1a right).

• false: in this condition, subjects are forced to make a mistake. To show that adversarial perturbations actually control the chosen class, we include this condition where neither of the two options available to the subject is correct, so their accuracy is always zero. We test whether adversarial perturbations can influence which of the two wrong choices they make. We show a random image from an ImageNet class other than the two classes in the group, and adversarially perturb it toward one of the two classes in the group. The subject must then choose from these two classes. For example, we might show an airplane adversarially perturbed toward the dog class, while a subject is in a session classifying images as cats or dogs. We used a slightly larger perturbation in this condition (ε = 40; see Figure 1b).

The conditions (image, adv, flip) were balanced to have an equal number of trials within a session, either by uniformly sampling the condition type in some sessions or by randomly shuffling a sequence with identical trial counts for each condition in other sessions. The number of trials for each class in the group was also constrained to be equal. Similarly, for the false condition, the numbers of trials adversarially perturbed toward class 1 and class 2 were balanced for each session. To discourage subjects from using strategies based on overall color or brightness distinctions between classes, we pre-filtered the dataset to remove images that showed an obvious effect of this nature. Notably, in the pets group we excluded images that included large green lawns or fields, since in almost all cases these were photographs of dogs. See Appendix F for images used in the experiment for each coarse class.
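Under these definitions, the image, adv, and flip stimuli differ only in how a single precomputed perturbation is applied. A minimal sketch (the function name is hypothetical; ε = 32 and the [40, 255 − 40] rescaling are the values given in the text):

```python
import numpy as np

def make_conditions(image, perturbation, epsilon=32):
    """Build the image/adv/flip stimuli of Section 3.2.2.

    image:        clean image, assumed rescaled into [40, 255 - 40] so
                  adding a +/- epsilon perturbation never clips
    perturbation: adversarial perturbation with |delta| <= epsilon,
                  shape (H, W) or (H, W, C)
    """
    assert np.abs(perturbation).max() <= epsilon
    return {
        "image": image,
        # adversarial condition: perturbation added pixel-to-pixel
        "adv": image + perturbation,
        # control: same perturbation statistics, but flipped vertically,
        # destroying the pixel-to-pixel correspondence with the image
        "flip": image + perturbation[::-1],
    }
```

The vertical flip leaves the perturbation's pixel-value statistics untouched while breaking its spatial alignment with the image, which is what makes flip a useful control.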
For example images for each condition, see Figures Supp.5 through Supp.8.\n\n4 Results\n\n4.1 Adversarial Examples Transfer to Computer Vision Models\nWe \ufb01rst assess the transfer of our constructed images to two test models that were not included in\nthe ensemble used to generate adversarial examples. These test models are an adversarially trained\nInception V3 model [23] and a ResNet V2 50 model. Both models perform well (> 75% accuracy)\non clean images. Attacks in the adv and false conditions succeeded against the test models between\n57% and 89% of the time, depending on image class and experimental condition. The flip condition\nchanged the test model predictions on fewer than 1.5% of images in all conditions, validating its use\nas a control. See Tables Supp.3 - Supp.6 for accuracy and attack success measurements on both train\nand test models for all experimental conditions.\n\n4.2 Adversarial Examples Transfer to Humans\nWe now show that adversarial examples transfer to time-limited humans. One could imagine that\nadversarial examples merely degrade image quality or discard information, thus increasing error rate.\nTo rule out this possibility, we begin by showing that for a \ufb01xed error rate (in a setting where the\nhuman is forced to make a mistake), adversarial perturbations in\ufb02uence the human choice among two\nincorrect classes. 
Then, we demonstrate that adversarial examples increase the error rate.

4.2.1 Influencing the Choice between two Incorrect Classes

As described in Section 3.2.2, we used the false condition to test whether adversarial perturbations can influence which of two incorrect classes a subject chooses (see example images in Figure Supp.5).

[Figure 2 panels: (a) prob. target class, by group; (b) accuracy (hazard, vegetables, pets; image/flip/adv); (c) 25% snake (4), n=13; 67% snake (6), n=6; brief vs. long presentation.]

Figure 2: Adversarial images transfer to humans. (a) By adding adversarial perturbations to an image, we are able to bias which of two incorrect choices subjects make. Plot shows probability of choosing the adversarially targeted class when the true image class is not one of the choices that subjects can report (false condition), estimated by averaging the responses of all subjects (two-tailed t-test relative to chance level 0.5). (b) Adversarial images cause more mistakes than either clean images or images with the adversarial perturbation flipped vertically before being applied. Plot shows probability of choosing the true image class, when this class is one of the choices that subjects can report, averaged across all subjects. Accuracy is significantly less than 1 even for clean images due to the brief image presentation time. (error bars ± SE; *: p < 0.05; **: p < 0.01; ***: p < 0.001) (c) A spider image that time-limited humans frequently perceived as a snake (top parentheses: number of subjects tested on this image).
Right: accuracy on this adversarial image when presented briefly compared to when presented for a long time (long presentation is based on a post-experiment survey of 13 participants).

We measured our effectiveness at changing the perception of subjects using the rate at which subjects reported the adversarially targeted class. If the adversarial perturbation were completely ineffective, we would expect the choice of targeted class to be uncorrelated with the subject's reported class: the average rate at which subjects choose the target class would be 0.5, as each false image is perturbed to class 1 or class 2 in the group with equal probability. Figure 2a shows the probability of choosing the target class averaged across all subjects for all three experiment groups. In all cases, the probability was significantly above the chance level of 0.5. This demonstrates that the adversarial perturbations generated using CNNs biased human perception towards the targeted class. This effect was strongest for the hazard group, then pets, then vegetables. This difference in probability among the class groups was significant (p < 0.05; Pearson Chi-2 GLM test).
We also observed a significant difference in the mean response time between the class groups (p < 0.001; one-way ANOVA test; see Figure Supp.2a). Interestingly, the response time pattern across image groups (Figure Supp.2a) was inversely correlated with the perceptual bias pattern (Figure 2a) (Pearson correlation = −1, p < 0.01; two-tailed Pearson correlation test). In other words, subjects made quicker decisions for the hazard group, then the pets group, and then the vegetables group. This is consistent with subjects being more confident in their decision when the adversarial perturbation was more successful in biasing subjects' perception.
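The chance-level comparison in Figure 2a amounts to a two-tailed one-sample t-test of per-subject target-choice rates against 0.5. A dependency-free sketch (the rates below are hypothetical; the p-value lookup against the t distribution is omitted):

```python
import numpy as np

def t_vs_chance(rates, chance=0.5):
    """One-sample t statistic for per-subject target-choice rates vs chance.

    rates: per-subject probabilities of reporting the adversarially
           targeted class in the `false` condition
    Returns (t, dof); the two-tailed p-value would come from the
    t distribution with `dof` degrees of freedom.
    """
    rates = np.asarray(rates, dtype=np.float64)
    n = rates.size
    sem = rates.std(ddof=1) / np.sqrt(n)  # standard error of the mean
    return (rates.mean() - chance) / sem, n - 1
```

For example, `t_vs_chance([0.6, 0.55, 0.65, 0.6])` yields a t statistic near 4.9 with 3 degrees of freedom, i.e. rates significantly above chance.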
This inverse correlation between attack success and response time was observed within groups, as well as between groups (Figure Supp.3).

4.2.2 Adversarial Examples Increase Human Error Rate

We demonstrated that we are able to bias human perception toward a target class when the true class of the image is not one of the options that subjects can choose. Now we show that adversarial perturbations can cause subjects to choose an incorrect class even though the correct class is an available response. As described in Section 3.2.2, we presented image, flip, and adv.
Most subjects had lower accuracy in adv than in image (Table Supp.1). This is also reflected in the average accuracy across all subjects, which was significantly lower for adv than for image (Figure 2b).

[Figure 3 panels: texture modification; dark parts modification; edge enhancement; edge destruction (image vs. adv for each).]

Figure 3: Examples of the types of manipulations performed by the adversarial attack. See Figures Supp.6 through Supp.8 for additional examples of adversarial images. Also see Figure Supp.5 for adversarial examples from the false condition.

The above result in isolation might be simply explained by the fact that images from the adv condition include perturbations whereas images from the image condition are unaltered. While this issue is largely addressed by the false experiment results in Section 4.2.1, to provide a further control we also evaluated accuracy on flip images. This control case uses perturbations with identical statistics to adv up to a flip of the vertical axis. However, this control breaks the pixel-to-pixel correspondence between the adversarial perturbation and the image. The majority of subjects had lower accuracy in the adv condition than in the flip condition (Table Supp.1). When averaging across all trials, this effect was very significant for the pets and vegetables groups (p < 0.001), and less significant for the hazard group (p = 0.05) (Figure 2b). These results suggest that the direction of the adversarial image perturbation, in combination with a specific image, is perceptually relevant to features that the human visual system uses to classify objects. These findings thus give evidence that strong black box adversarial attacks can transfer from CNNs to humans, and show remarkable similarities between failure cases of CNNs and human vision.
In all cases, the average response time was longer for the adv condition relative to the other conditions (Figure Supp.2b), though this result was only statistically significant for two comparisons. If this trend remains predictive, it would seem to contradict the case when we presented false images (Figure Supp.2a). One interpretation is that in the false case, the transfer of adversarial features to humans was accompanied by more confidence, whereas here the transfer was accompanied by less confidence, possibly due to competing adversarial and true class features in the adv condition.

5 Discussion

Our results invite several questions that we discuss briefly.

5.1 Have we actually fooled human observers or did we change the true class?

One might naturally wonder whether we have fooled the human observer or whether we have replaced the input image with an image that actually belongs to a different class. In our work, the perturbations we made were small enough that they generally do not change the output class for a human who has no time limit (the reader may verify this by observing Figures 1a,b, 2c, and Supp.5 through Supp.8). We can thus be confident that we did not change the true class of the image, and that we really did fool the time-limited human. Future work aimed at fooling humans with no time limit will need to tackle the difficult problem of obtaining a better ground truth signal than visual labeling by humans.

5.2 How do the adversarial examples work?

We did not design controlled experiments to prove that the adversarial examples work in any specific way, but we informally observed a few apparent patterns illustrated in Figure 3: disrupting object edges, especially by mid-frequency modulations perpendicular to the edge; enhancing edges both by increasing contrast and creating texture boundaries; modifying texture; and taking advantage of dark regions in the image, where the perceptual magnitude of small ε perturbations can be larger.

5.3 What are the implications for machine learning security and society?

The fact that our transfer-based
adversarial examples fool time-limited humans but not no-limit humans suggests that the lateral and top-down connections used by the no-limit human are relevant to human robustness to adversarial examples. This suggests that machine learning security research should explore the significance of these top-down or lateral connections further. One possible explanation for our observation is that no-limit humans are fundamentally more robust to adversarial examples and achieve this robustness via top-down or lateral connections. If this is the case, it could point the way to the development of more robust machine learning models. Another possible explanation is that no-limit humans remain highly vulnerable to adversarial examples, but adversarial examples do not transfer from feed-forward networks to no-limit humans because of these architectural differences. Our results suggest that there is a risk that imagery could be manipulated to cause human observers to have unusual reactions; for example, perhaps a photo of a politician could be manipulated in a way that causes it to be perceived as unusually untrustworthy or unusually trustworthy in order to affect the outcome of an election.

5.4 Future Directions

In this study, we designed a procedure that, according to our hypothesis, would transfer adversarial examples to humans. An interesting set of questions relates to how sensitive that transfer is to different elements of our experimental design. For example: How does transfer depend on ε? Was model ensembling crucial to transfer? Can the retinal preprocessing layer be removed? We suspect that retinal preprocessing and ensembling are both important for transfer to humans, but that ε could be made smaller.
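To make these ingredients concrete, the following is a minimal sketch of an FGSM-style targeted step against an ensemble of models behind a fixed preprocessing layer. Everything here is an illustrative stand-in: the "models" are toy linear classifiers and the "retina" is a simple box blur, not the networks or eccentricity-dependent retinal layer used in the experiments.

```python
import numpy as np

def retina(x):
    # Crude stand-in for retinal preprocessing: a fixed 1-D box blur.
    # The kernel is symmetric, so the operator is its own adjoint,
    # which simplifies the gradient computation below.
    return np.convolve(x, np.ones(3) / 3.0, mode="same")

def target_logit(w, x, target):
    # Toy linear "classifier" applied after the preprocessing layer.
    return w[target] @ retina(x)

def ensemble_grad(ws, x, target):
    # Gradient of the mean target logit w.r.t. the input x.
    # For a linear model, d/dx (w_t @ retina(x)) = retina^T(w_t),
    # which equals retina(w_t) because the blur is self-adjoint.
    return sum(retina(w[target]) for w in ws) / len(ws)

def fgsm_ensemble(ws, x, target, eps):
    # One signed-gradient step of size eps toward the target class,
    # clipped back to the valid pixel range [0, 1].
    return np.clip(x + eps * np.sign(ensemble_grad(ws, x, target)), 0.0, 1.0)

rng = np.random.default_rng(0)
ws = [rng.normal(size=(5, 16)) for _ in range(4)]  # 4 toy models, 5 classes
x = 0.45 + 0.1 * rng.random(16)                    # toy "image", away from clip bounds
x_adv = fgsm_ensemble(ws, x, target=2, eps=0.05)
```

On this toy setup a single step reliably raises the ensemble's mean target logit; which of these ingredients (and how small an ε) actually matter for transfer to humans is exactly the open question.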
See Figure Supp.4 for a preliminary exploration of these questions.

6 Conclusion

Susceptibility to adversarial examples has been widely assumed – in the absence of experimental evidence – to be a property of machine learning classifiers, but not of human judgement. In this work, we correct this assumption by showing that adversarial examples based on perceptible but class-preserving perturbations that fool multiple machine learning models also fool time-limited humans. Our findings demonstrate striking similarities between the decision boundaries of convolutional neural networks and the human visual system. We expect this observation to lead to advances in both neuroscience and machine learning research.

Acknowledgements

We are grateful to Ari Morcos, Bruno Olshausen, David Sussillo, Hanlin Tang, John Cunningham, Santani Teng, and Daniel Yamins for useful discussions. We also thank Dan Abolafia, Simon Kornblith, Katherine Lee, Kathryn Rough, Niru Maheswaranathan, Catherine Olsson, David Sussillo, and Santani Teng for helpful feedback on the manuscript. We thank Google Brain residents for useful feedback on the work. We also thank Deanna Chen, Leslie Philips, Sally Jesmonth, Phing Turner, Melissa Strader, Lily Peng, and Ricardo Prada for assistance with IRB and experiment setup.

References

[1] Anish Athalye. Robust adversarial examples, 2017.

[2] Anish Athalye and Ilya Sutskever. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017.

[3] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time.
In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2013, Prague, Czech Republic, September 23-27, 2013, Proceedings, Part III, pages 387–402, 2013.

[4] Tom B Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. arXiv preprint arXiv:1712.09665, 2017.

[5] Jacob Buckman, Aurko Roy, Colin Raffel, and Ian Goodfellow. Thermometer encoding: One hot way to resist adversarial examples. International Conference on Learning Representations, 2018. Accepted as poster.

[6] Charles F Cadieu, Ha Hong, Daniel LK Yamins, Nicolas Pinto, Diego Ardila, Ethan A Solomon, Najib J Majaj, and James J DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12):e1003963, 2014.

[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.

[8] Miguel P Eckstein, Kathryn Koehler, Lauren E Welbourne, and Emre Akbas. Humans, but not deep neural networks, often miss giant targets in scenes. Current Biology, 27(18):2827–2832, 2017.

[9] Jeremy Freeman and Eero P Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195, 2011.

[10] Leon A Gatys, Alexander S Ecker, and Matthias Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.

[11] Robert Geirhos, David HJ Janssen, Heiko H Schütt, Jonas Rauber, Matthias Bethge, and Felix A Wichmann. Comparing deep neural networks against humans: object recognition when the signal gets weaker.
arXiv preprint arXiv:1706.06969, 2017.

[12] Ian Goodfellow, Nicolas Papernot, Sandy Huang, Yan Duan, Pieter Abbeel, and Jack Clark. Attacking machine learning with adversarial examples, 2017.

[13] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.

[14] Kathrin Grosse, Nicolas Papernot, Praveen Manoharan, Michael Backes, and Patrick D. McDaniel. Adversarial examples for malware detection. In ESORICS 2017, pages 62–79, 2017.

[15] Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017.

[16] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. arXiv e-prints, March 2016.

[17] James M Hillis, Marc O Ernst, Martin S Banks, and Michael S Landy. Combining sensory information: mandatory fusion within, but not between, senses. Science, 298(5598):1627–1630, 2002.

[18] Michael Ibbotson and Bart Krekelberg. Visual perception and saccadic eye movements. Current Opinion in Neurobiology, 21(4):553–558, 2011.

[19] J Zico Kolter and Eric Wong. Provable defenses against adversarial examples via the convex outer adversarial polytope. arXiv preprint arXiv:1711.00851, 2017.

[20] Gyula Kovács, Rufin Vogels, and Guy A Orban. Cortical correlate of pattern backward masking. Proceedings of the National Academy of Sciences, 92(12):5587–5591, 1995.

[21] Matthias Kümmerer, Lucas Theis, and Matthias Bethge. Deep Gaze I: Boosting saliency prediction with feature maps trained on ImageNet. arXiv preprint arXiv:1411.1045, 2014.

[22] Matthias Kümmerer, Tom Wallis, and Matthias Bethge. DeepGaze II: Predicting fixations from deep features over time and tasks. Journal of Vision, 17(10):1147–1147, 2017.

[23] A. Kurakin, I. Goodfellow, and S. Bengio.
Adversarial machine learning at scale. arXiv e-prints, November 2016.

[24] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In ICLR'2017 Workshop, 2016.

[25] Michael F Land and Dan-Eric Nilsson. Animal Eyes. Oxford University Press, 2012.

[26] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016.

[27] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[28] Lane McIntosh, Niru Maheswaranathan, Aran Nayebi, Surya Ganguli, and Stephen Baccus. Deep learning models of the retinal response to natural scenes. In Advances in Neural Information Processing Systems, pages 1369–1377, 2016.

[29] Bruno A Olshausen. 20 years of learning about vision: Questions answered, questions unanswered, and questions not yet asked. In 20 Years of Computational Neuroscience, pages 243–270. Springer, 2013.

[30] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016.

[31] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, pages 506–519. ACM, 2017.

[32] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In Security and Privacy (SP), 2016 IEEE Symposium on, pages 582–597. IEEE, 2016.

[33] Nicolas Papernot, Patrick D. McDaniel, Somesh Jha, Matt Fredrikson, Z.
Berkay Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. CoRR, abs/1511.07528, 2015.

[34] Mary C Potter, Brad Wyble, Carl Erick Hagmann, and Emily S McCourt. Detecting meaning in RSVP at 13 ms per picture. Attention, Perception, & Psychophysics, 76(2):270–279, 2014.

[35] Rishi Rajalingham, Elias B. Issa, Pouya Bashivan, Kohitij Kar, Kailyn Schmidt, and James J DiCarlo. Large-scale, high-resolution comparison of the core visual object recognition behavior of humans, monkeys, and state-of-the-art deep artificial neural networks. bioRxiv, 2018.

[36] Maximilian Riesenhuber and Tomaso Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019, 1999.

[37] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. arXiv e-prints, February 2016.

[38] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. arXiv e-prints, December 2015.

[39] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[40] F. Tramèr, A. Kurakin, N. Papernot, D. Boneh, and P. McDaniel. Ensemble adversarial training: Attacks and defenses. arXiv e-prints, May 2017.

[41] D. C. Van Essen and C. H Anderson. Information processing strategies and pathways in the primate visual system. In Zornetzer S. F., Davis J. L., Lau C., and McKenna T., editors, An Introduction to Neural and Electronic Networks, pages 45–76, San Diego, CA, 1995. Academic Press.

[42] Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. arXiv preprint arXiv:1704.01155, 2017.

[43] Daniel L. K. Yamins and James J. DiCarlo.
Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19:356–365, 2016.