{"title": "Deliberative Explanations: visualizing network insecurities", "book": "Advances in Neural Information Processing Systems", "page_first": 1374, "page_last": 1385, "abstract": "A new approach to explainable AI, denoted {\\it deliberative explanations,\\/}\n  is proposed. Deliberative explanations are a visualization technique\n  that aims to go beyond the simple visualization of the image regions\n  (or, more generally, input variables) responsible for a network\n  prediction. Instead, they aim to expose the deliberations carried\n  by the network to arrive at that prediction, by uncovering the\n  insecurities of the network about the latter. The\n  explanation consists of a list of insecurities, each composed of\n  1) an image region (more generally, a set of input variables), and 2)\n  an ambiguity formed by the pair of classes responsible for the network\n  uncertainty about the region. Since insecurity detection requires\n  quantifying the difficulty of network predictions, deliberative\n  explanations combine ideas from the literatures on visual explanations and\n  assessment of classification difficulty. More specifically,\n  the proposed implementation\n  combines attributions with respect to both class\n  predictions and a difficulty score.\n  An evaluation protocol that leverages object recognition (CUB200)\n  and scene classification (ADE20K) datasets that combine part and\n  attribute annotations is also introduced to evaluate the accuracy of\n  deliberative explanations. Finally, an experimental evaluation shows that\n  the most accurate explanations are achieved by combining non self-referential\n  difficulty scores and second-order attributions. The resulting\n  insecurities are shown to correlate with regions of attributes that\n  are shared by different classes. 
Since these regions are also ambiguous\n  for humans, deliberative explanations are intuitive, suggesting that\n  the deliberative process of modern networks correlates with human\n  reasoning.", "full_text": "Deliberative Explanations: visualizing network\n\ninsecurities\n\nPei Wang and Nuno Vasconcelos\n\nDepartment of Electrical and Computer Engineering\n\nUniversity of California, San Diego\n\n{pew062, nvasconcelos}@ucsd.edu\n\nAbstract\n\nA new approach to explainable AI, denoted deliberative explanations, is proposed.\nDeliberative explanations are a visualization technique that aims to go beyond\nthe simple visualization of the image regions (or, more generally, input variables)\nresponsible for a network prediction. Instead, they aim to expose the deliberations\ncarried by the network to arrive at that prediction, by uncovering the insecurities\nof the network about the latter. The explanation consists of a list of insecurities,\neach composed of 1) an image region (more generally, a set of input variables),\nand 2) an ambiguity formed by the pair of classes responsible for the network\nuncertainty about the region. Since insecurity detection requires quantifying the\ndif\ufb01culty of network predictions, deliberative explanations combine ideas from the\nliterature on visual explanations and assessment of classi\ufb01cation dif\ufb01culty. More\nspeci\ufb01cally, the proposed implementation combines attributions with respect to\nboth class predictions and a dif\ufb01culty score. An evaluation protocol that leverages\nobject recognition (CUB200) and scene classi\ufb01cation (ADE20K) datasets that\ncombine part and attribute annotations is also introduced to evaluate the accuracy of\ndeliberative explanations. Finally, an experimental evaluation shows that the most\naccurate explanations are achieved by combining non self-referential dif\ufb01culty\nscores and second-order attributions. 
The resulting insecurities are shown to correlate with regions of attributes shared by different classes. Since these regions are also ambiguous for humans, deliberative explanations are intuitive, suggesting that the deliberative process of modern networks correlates with human reasoning.

1 Introduction

While deep learning systems have enabled significant advances in computer vision, the black-box nature of their predictions causes difficulties for many applications. In general, it is difficult to trust a system unable to justify its decisions. This has motivated a large literature in explainable AI [22, 42, 21, 44, 43, 26, 28, 13, 18, 49, 3, 8, 34]. In computer vision, most approaches provide visual explanations, in the form of heatmaps or segments that localize the image regions responsible for network predictions [37, 51, 49, 31, 20]. More generally, explanations can be derived from attribution models that identify the input variables to which the prediction can be attributed [2, 34, 40, 1]. While insightful, all these explanations fall short of the richness of those produced by humans, which tend to reflect the deliberative nature of the inference process.

In general, explanations are most needed for inputs that have some type of ambiguity, such that the prediction could reasonably oscillate between different interpretations. In the limit of highly ambiguous inputs, it is even acceptable for different systems (or people) to make conflicting predictions, as long as they provide a convincing justification. A prime example of this is a visual illusion, such as that depicted on the left of Figure 1, where different image regions provide support for conflicting image interpretations. In this example, the image could depict a "country scene" or a "face".
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Left: Illustration of the deliberations made by a human to categorize an ambiguous image. Insecurities are image regions of ambiguous interpretation. Right: Deliberative explanations expose this deliberative process. Unlike existing methods, which simply attribute the prediction to image regions (left), they expose the insecurities experienced by the classifier while reaching that prediction (center). Each insecurity consists of an image region and an ambiguity, expressed as a pair of classes to which the region appears to belong. Examples from the confusing classes are shown on the right. Green dots locate attributes common to the two ambiguous classes, which cause the ambiguity.

While one of the explanations could be deemed more credible, both are sensible when accompanied by the proper justification. In fact, most humans would consider the two interpretations while deliberating on their final prediction: "I see a cottage in region A, but region B could be a tree trunk or a nose, and region C looks like a mustache, but could also be a shirt. Since there are sheep in the background, I am going with country scene." More generally, different regions can provide evidence for two or more distinct predictions, and there may be a need to deliberate between multiple explanations.

Having access to this deliberative process is important to trust an AI system. For example, in medical diagnosis, a single prediction can appear unintuitive to a doctor, even if accompanied by a heatmap.
The doctor's natural reaction would be to ask "what are you thinking about?" Ideally, instead of simply outputting a predicted label and a heatmap, the AI system should be able to expose a visualization of its deliberations, in the form of a list of image regions (or, more generally, groups of input variables) that support alternative predictions. This list should be ordered by degree of ambiguity, with the regions that generate more uncertainty at the top. This requires the ability to assess the difficulty of the decision and the region of support of this difficulty in the image.

The quantification of prediction difficulty has received recent interest, with the appearance of several prediction difficulty scoring methods [25, 47, 16, 50, 30, 4]. In this work, we combine ideas from this literature with ideas from the literature on visual explanations to derive a new explanation strategy. Beyond prediction heatmaps, we compute heatmaps of network insecurities, listing the regions that support alternative predictions. We refer to these as deliberative explanations, since they illustrate the deliberative process of the network. This is illustrated on the right of Figure 1, with an example from the fine-grained CUB birds dataset [48]. On this dataset, where many images contain a single bird, state-of-the-art visualization methods, such as Grad-CAM [31] (left inset), frequently produce heatmaps that 1) cover large portions of the bird, and 2) vary little across the classes of largest posterior probability, leading to very uninformative explanations. Instead, deliberative explanations provide a list of insecurities (center inset). Each of these consists of 1) an image region and 2) an ambiguity, formed by the pair of classes that led the network to be uncertain about the region.
Examples of ambiguous classes can also be shown (right inset).

By using deliberative explanations to analyze the decisions of fine-grained deep learning classifiers, we show that the latter perform intuitive deliberations. More precisely, we have found that network insecurities correlate with regions of attributes shared by different classes. For example, in Figure 1, insecurity 4 is caused by the presence of "black legs" and a "black solid fan-shaped tail," attributes shared by the "Common" and the "White Necked Raven." Similar ambiguities occur for the other insecurities. This observation is quantified by the introduction of a procedure to measure the alignment between network insecurities and attribute ambiguity, for datasets annotated with attributes. We note, however, that this is not necessary to produce the visualizations themselves, which can be obtained for any dataset. Deliberative visualizations can also leverage most existing visual explanation methods and most methods for difficulty score prediction, as well as benefit from future advances in these areas. Nevertheless, we propose the use of a second-order attribution function, which generalizes most existing visualization methods and is shown to produce more accurate explanations.

We believe that exposing network insecurities will be helpful for many applications.
For example, a doctor could choose to ignore the top network prediction, but go with a secondary one after inspection of the insecurities. This is much more efficient than exhaustively analyzing the image for alternatives. In fact, insecurities could help the doctor formulate hypotheses at the edge of his/her expertise, or help the doctor find a colleague more versed in these hypotheses. Designers of deep learning systems could also gain more insight into the errors of these systems, which could be leveraged to collect more diverse datasets. Rather than just more images, they could focus on collecting the images most likely to improve system performance. In the context of machine teaching [17, 38, 23], deliberative visualizations could be used to enhance the automated teaching of tasks, e.g. labeling of fine-grained image classes, to humans who lack expertise, e.g. Turk annotators. Finally, in cases where deep learning systems outperform humans, e.g. AlphaGo [35], they could even inspire new strategies for thinking about difficult problems, which could improve human performance.

2 Related work

Visualization: Several works have proposed visualizations of the inner workings of neural networks. Some of these aimed for a better understanding of the general computations of the model, namely the semantics of different network units [49, 3, 8]. This has shown that early layers tend to capture low-level features, such as edges or texture, while units of deeper layers are more sensitive to objects or scenes [49]. Network dissection [3] has also shown that modern networks tend to disentangle visual concepts, even though this is not needed for discrimination. However, there is also evidence that concept selectivity is accomplished by a distributed code, based on small numbers of units [8]. Other methods aim to explain predictions made for individual images.
One possibility is to introduce an additional network that explains the predictions of the target network. For example, vision-language models can be trained to output a natural language description of the visual predictions. [13] used an LSTM model and a discriminative loss to encourage the synthesized sentences to include class-specific attributes, and [18] trained a set of auxiliary networks to produce explanations based on complementary image descriptions. However, the predominant explanation strategy is to compute the contribution (also called importance, relevance, or attribution) of input variables (pixels, for vision models) to the prediction, in a post-hoc manner. Perturbation-based methods [51, 49] remove image pixels or segments and measure how this affects the prediction. These methods are computationally heavy, and it is usually impossible to test all candidate segments. This problem is overcome by backpropagation-based methods, which produce a 'saliency map' by computing the gradient of the prediction with respect to the input [37]. Variants include backpropagating layer-wise relevances [2], considering a reference input [34], integrated gradients [40], and other modifications of detail [1]. However, prediction gradients have been shown not to be very discriminant, originating similar heatmaps for different predictions [31]. This problem is ameliorated by the CAM family of methods [52, 31], which produce a heatmap based on the activations of the last convolutional network layer, weighting each activation by the gradient of the prediction with respect to it.

In our experience, while superior to simple backpropagation, these methods still produce relatively uninformative heatmaps for many images (e.g. Figure 1). We show that improved visualizations can be obtained with deliberative explanations.
In any case, our goal is not to propose a new attribution function, but to introduce a new explanation strategy, deliberative explanations, which visualizes network insecurities about the prediction. This strategy can be combined with any of the visualization approaches above, but also requires an attribution function for the prediction difficulty.

Deliberative explanations seem closely related to counterfactual explanations [14, 45, 10] (or contrastive explanations in some literature [6, 27]), but the two approaches have different motivations. Counterfactual explanations seek regions or language descriptions explaining why an image does not belong to a counter-class. Deliberative explanations seek insecurities, i.e. the regions that make it difficult for the model to reach its prediction. While, to produce an explanation, counterfactual methods only need to consider two pre-specified classes (predicted and counter), deliberative explanations must consider all classes and determine the ambiguous pair for each region.

Difficulty scores: Several scores have been proposed to measure the difficulty of a prediction. The most popular is the confidence score, the estimate of the posterior probability of the predicted class produced by the model itself. Low confidence scores identify a high probability of failure. However, this score is known to be unreliable; e.g. many adversarial examples [41, 9] are misclassified with high confidence. Proposals to mitigate this problem include the use of the posterior entropy or its maximum and sub-maximum probability values [47]. These methods also have known shortcomings [32], which confidence score calibration approaches aim to solve [5, 25]. Alternatives to confidence scores include Bayesian neural networks [24, 11, 15], which provide estimates of model uncertainty by placing a prior distribution on network weights.
However, they tend to be computationally heavy, for both learning and inference. A more efficient alternative is to train an auxiliary network to predict difficulty scores. A popular solution is a failure predictor trained on mistakes of the target network, which outputs failure scores for test samples in a post-hoc manner [50, 30, 4]. It is also possible to train a difficulty predictor jointly with the target network. This is denoted a hardness predictor in [46]. Deliberative explanations can leverage any of these difficulty prediction methods.

3 Deliberative explanations

In this section, we discuss the implementation of deliberative explanations. Consider the problem of $C$-class recognition, where an image drawn from random variable $X \in \mathcal{X}$ has a class label drawn from random variable $Y \in \{1, \ldots, C\}$. We assume a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ of $N$ i.i.d. samples, where $y_i$ is the label of image $x_i$, and a test set $\mathcal{T} = \{(x_j, y_j)\}_{j=1}^M$. Test set labels are only used to evaluate performance. The goal is to explain the class label prediction $\hat{y}$ produced by a classifier $\mathcal{F} : \mathcal{X} \rightarrow \{1, \ldots, C\}$ of the form $\mathcal{F}(x) = \arg\max_y f_y(x)$, where $f(x) : \mathcal{X} \rightarrow [0, 1]^C$ is a $C$-dimensional probability distribution with $\sum_{y=1}^C f_y(x) = 1$, implemented with a convolutional neural network (CNN). The explanation is based on the analysis of a tensor of activations $A \in \mathbb{R}^{W \times H \times D}$ of spatial dimensions $W \times H$ and $D$ channels, extracted at any layer of the network. We assume that either the classifier or an auxiliary predictor also produces a difficulty score $s(x) \in [0, 1]$ for the prediction. This score is self-referential if generated by the classifier itself and non-self-referential if generated by a separate network.

Following common practice in the field, explanations are provided as visualizations, in the form of image segments [29, 3, 51].
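As a minimal sketch of this setup (all function names here are ours, for illustration only), the prediction rule $\mathcal{F}(x) = \arg\max_y f_y(x)$ and the two self-referential difficulty scores later defined in Section 3.3 can be written as:

```python
import math

def predict(probs):
    # F(x) = argmax_y f_y(x); `probs` is the C-dimensional posterior f(x).
    assert abs(sum(probs) - 1.0) < 1e-6, "f(x) must sum to 1 over the C classes"
    return max(range(len(probs)), key=lambda y: probs[y])

def hesitancy_score(probs):
    # Self-referential score of Eq. (3): 1 - max_y f_y(x).
    return 1.0 - max(probs)

def entropy_score(probs):
    # Self-referential score of Eq. (4): normalized softmax entropy.
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(probs))
```

A non-self-referential score, such as the hardness score of [46], would instead be produced by a separate predictor network.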
For deliberative explanations, these segments expose the network insecurities about the prediction $\hat{y}$. An insecurity is a triplet $(r, a, b)$, where $r$ is a segmentation mask and $(a, b)$ an ambiguity. This is a pair of class labels such that the network is insecure as to whether the image region defined by $r$ should be attributed to class $a$ or $b$. Note that neither $a$ nor $b$ has to be the prediction $\hat{y}$ made by the network for the whole image, although this could happen for one of them. In Figure 1, $\hat{y}$ is the label "Pelagic Cormorant," and appears in insecurities 2, 5, and 6, but not in the remaining ones. This reflects the fact that certain parts of the bird could actually be shared by many classes. The explanation consists of a set of $Q$ insecurities $\mathcal{I} = \{(r_q, a_q, b_q)\}_{q=1}^Q$.

3.1 Generation of insecurities

Insecurities are generated by combining attribution maps for both class predictions and the difficulty score $s$. Given the feature activations $a_{i,j}$ at image location $(i, j)$, the attribution map $m^p_{i,j}$ for prediction $p$ is a map of the importance of $a_{i,j}$ to the prediction. Locations of activations irrelevant for the prediction receive zero attribution; locations of very informative activations receive maximal attribution. For deliberative explanations, $C + 1$ attribution maps are computed: a class prediction attribution map $m^c_{i,j}$ for each of the $c \in \{1, \ldots, C\}$ classes and the difficulty score attribution map $m^s_{i,j}$. Given these maps, the $K$ classes of largest attribution are identified at each location. This corresponds to sorting the attributions such that $m^{c_1}_{i,j} \geq m^{c_2}_{i,j} \geq \ldots \geq m^{c_C}_{i,j}$ and selecting the $K$ largest values. The resulting set $\mathcal{C}(i, j) = \{c_1, c_2, \ldots, c_K\}$ is the set of candidate classes for location $(i, j)$. A set of candidate class ambiguities is then computed by finding all class pairs that appear jointly in at least one candidate class list, i.e. $\mathcal{A} = \bigcup_{i,j} \{(a, b) \,|\, a, b \in \mathcal{C}(i, j), a \neq b\}$, and an ambiguity map

$$m^{(a,b)}_{i,j} = f(m^a_{i,j}, m^b_{i,j}, m^s_{i,j}) \quad (1)$$

is computed for each ambiguity in $\mathcal{A}$. While currently $f(\cdot)$ consists of the product of its arguments, we will investigate other possibilities in the future. The goal is for $m^{(a,b)}_{i,j}$ to be large only when location $(i, j)$ is deemed difficult to classify (large difficulty attribution $m^s_{i,j}$) and this difficulty is due to large attributions to both classes $a$ and $b$. Finally, the ambiguity map is thresholded to obtain the segmentation mask $r(a, b) = 1_{m^{(a,b)}_{i,j} > T}$, where $1_S$ is the indicator function of set $S$ and $T$ a threshold. The ambiguity $(a, b)$ and the mask $r(a, b)$ form an insecurity.

3.2 Attribution maps

The attribution map $m^p_{i,j}$ is a measure of how the activations $a_{i,j}$ at location $(i, j)$ contribute to prediction $p$. This could be a class prediction or a difficulty prediction. In this section, we make no difference between the two, simply denoting $p = g_p(A)$, where $g$ is the mapping from activation tensor $A$ into a prediction vector $g(A) \in [0, 1]^P$. For class predictions $P = C$, the prediction is a class, and $g_y(A(x)) = f_y(x)$. For difficulty predictions $P = 1$, the prediction is a difficulty score, and $g(A(x)) = s(x)$. Apart from the popular first-order attribution maps [33, 31, 40], we also consider an attribution map based on a second-order Taylor series expansion of $g_p$ at each location $(i, j)$, with some approximations that are discussed in the Appendix. This has the form

$$m^p_{i,j} = [\nabla g_p(A)]^T_{i,j} a_{i,j} + \frac{1}{2} a^T_{i,j} [H(A)]_{i,j} a_{i,j}, \quad (2)$$

where $H(A) = \nabla^2 g_p(A)$ is the Hessian matrix of $g_p$ at $A$. In the Appendix, we also show that most attribution maps previously used in the literature are special cases of (2), based on a first-order Taylor expansion. In Section 5.2, we show that the second-order approximation leads to more accurate results.

3.3 Difficulty scores

For class attributions, $g_p(A) = f_p(x)$, i.e. the $p$th output of the softmax at the top of the CNN. For difficulty scores, $g(A)$ is the output of a single unit that produces the score. The exact form of the mapping depends on the definition of the latter. We consider three scores previously used in the literature. The hesitancy score is defined as the complement of the largest class posterior probability, i.e.

$$s_{he}(x) = 1 - \max_y f_y(x). \quad (3)$$

This can be implemented by adding a max-pooling layer to the softmax outputs. The score is large when the confidence of the classification prediction is low. The entropy score [47] is the normalized entropy of the softmax probability distribution, defined by

$$s_e(x) = -\frac{1}{\log C} \sum_y f_y(x) \log f_y(x). \quad (4)$$

These two scores are self-referential. The final score, denoted the hardness score [46], relies on a classifier-specific hardness predictor $\mathcal{S}$, which is jointly trained with the classifier $\mathcal{F}$ and implemented by a network $s(x) : \mathcal{X} \rightarrow [0, 1]$ whose output is a sigmoid unit. The difficulty score is

$$s_{ha}(x) = s(x). \quad (5)$$

4 Evaluation of deliberative explanations

Explanations are usually difficult to evaluate, since explanation ground truth is usually not available. While some previous works only show visualizations [40, 39], two major classes of evaluation strategies have been used. One possibility is to perform Turk experiments, e.g.
measuring whether humans can predict a class label given a visualization, or identify the most trustworthy of two models that make identical predictions from their explanations [31]. In this paper, we attempted to measure whether, given an image and an insecurity produced by the explanation algorithm, humans can predict the associated ambiguities. While this strategy directly measures how intuitive the explanations appear to humans, it requires experiments that are somewhat cumbersome to perform and difficult to replicate. A second evaluation strategy is to rely on a proxy task, such as localization [52, 31] on datasets with object bounding boxes. This is much easier to implement and replicate, and is the approach that we pursue in this work. However, this strategy requires ground truth for insecurities. For this, we leverage datasets annotated with parts and attributes. More precisely, we equate segments to parts, and define insecurities as ambiguous parts, e.g. object parts common to multiple object classes or scene parts (e.g. objects) shared between different scene classes. To quantify part ambiguity, parts are annotated with attributes.¹ Specifically, the $k$th part is annotated with a semantic descriptor of $D_k$ attribute values. For example, in a bird dataset, the "eye" part can have color attribute values "green," "blue," "brown," etc. The descriptor is a probability distribution over these attribute values, characterizing the variability of attribute values of the part under each class. The attribute distribution of part $k$ under class $c$ is denoted $\phi^k_c$. The strength of the ambiguity between classes $a$ and $b$, according to segment $k$, is then defined as $\alpha^k_{a,b} = \gamma(\phi^k_a, \phi^k_b)$, where $\gamma$ is a similarity measure. This declares as ambiguous parts that have similar attribute distributions under the two classes.

To generate insecurity ground truth, ambiguity strengths $\alpha^k_{a,b}$ are computed for all parts $p_k$ and class pairs $(a, b)$. The $M$ insecurities $G = \{(p_i, a_i, b_i)\}_{i=1}^M$ of largest ambiguity strength are selected as the insecurity ground truth. Two metrics are used for evaluation, depending on the nature of the part annotations. For datasets where parts are labelled with a single location (usually the center of mass of the part), i.e. $p_i$ is a point, the quality of insecurity $(r, a, b)$ is computed by precision (P) and recall (R), where $P = \frac{J}{|\{k \,|\, p_k \in r\}|}$, $R = \frac{J}{|\{i \,|\, (p_i, a_i, b_i) \in G, a_i = a, b_i = b\}|}$, and $J = |\{i \,|\, p_i \in r, a_i = a, b_i = b\}|$ is the number of ground-truth insecurities included in the generated insecurity. For datasets where parts have segmentation masks, the quality of $(r, a, b)$ is computed with the intersection-over-union (IoU) metric $\text{IoU} = \frac{|r \cap p|}{|r \cup p|}$, where $p$ is the part in $G$ with ambiguity $(a, b)$ and largest overlap with $r$. Precision-recall and IoU curves are generated by varying the threshold $T$ used to transform the ambiguity maps of (1) into insecurity masks $r(a, b)$. For each image, $T$ is chosen so that insecurities cover from 1% to 90% of the image, in steps of 1%.

5 Experiments

In this section we discuss experiments performed to evaluate the quality of deliberative explanations.

5.1 Experimental setup

Dataset: Experiments were performed on the CUB200 [48] and ADE20K [53] datasets. CUB200 [48] is a dataset of fine-grained bird classes, annotated with parts. 15 part locations (points) are annotated, including back, beak, belly, breast, crown, forehead, left/right eye, left/right leg, left/right wing, nape, tail, and throat. Attributes are defined per part according to [48] (see Appendix).
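The evaluation metrics of Section 4, together with the two instantiations of $\gamma$ used below (a symmetric-KL similarity for CUB200 and a mean occurrence frequency for ADE20K), can be sketched in a few lines of Python. This is an illustrative sketch: the names, the representation of $p_i \in r$ as a set of part ids, and the `eps` smoothing are our assumptions.

```python
import math

def pr_of_insecurity(inside, G, a, b):
    # Precision/recall of one generated insecurity (r, a, b): `inside` is the
    # set of part ids whose annotated point falls inside the mask r, and G is
    # the list of ground-truth triplets (p_i, a_i, b_i).
    J = sum(1 for (p, ai, bi) in G if p in inside and (ai, bi) == (a, b))
    n_match = sum(1 for (p, ai, bi) in G if (ai, bi) == (a, b))
    return (J / len(inside) if inside else 0.0,
            J / n_match if n_match else 0.0)

def iou(r, p):
    # IoU = |r ∩ p| / |r ∪ p| for two boolean masks of equal shape.
    inter = sum(ri and pi for rr, pp in zip(r, p) for ri, pi in zip(rr, pp))
    union = sum(ri or pi for rr, pp in zip(r, p) for ri, pi in zip(rr, pp))
    return inter / union if union else 0.0

def gamma_kl(phi_a, phi_b, eps=1e-12):
    # CUB200-style similarity: exp(-(D(a||b) + D(b||a))), D = KL divergence;
    # identical attribute distributions give the maximal value 1.
    def kl(p, q):
        return sum(pi * math.log((pi + eps) / (qi + eps))
                   for pi, qi in zip(p, q))
    return math.exp(-(kl(phi_a, phi_b) + kl(phi_b, phi_a)))

def gamma_mean(phi_a, phi_b):
    # ADE20K-style similarity: mean occurrence frequency of the shared object.
    return 0.5 * (phi_a + phi_b)
```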
ADE20K [53] is a scene image dataset with more than 1000 scene categories and segmentation masks for 150 objects. In this case, objects are seen as scene parts, and each object has a single attribute, which is its probability of appearance in a scene. Both datasets were subject to standard normalizations. All results are presented on the standard CUB200 test set and the official validation set of ADE20K. Since deliberative explanations are most useful for examples that are difficult to classify, explanations were produced only for the 100 test images with largest difficulty score on each dataset. All experiments used candidate class sets $\mathcal{C}(i, j)$ with 3 members, drawn from the top 5 predictions, and were run three times.

Network: Unless otherwise noted, VGG16 [36] is used as the default architecture for all visualizations, since it is the most popular architecture in visualization papers. Its performance was also compared to those of ResNet50 [12] and AlexNet [19]. All classifiers and predictors were trained with standard strategies [36, 12, 19, 47, 46]. As is common in the visualization literature [20, 52, 31], the output of the last convolutional layer, with positive contributions, is used.

Evaluation: On CUB200, all semantic descriptors $\phi^k_c$ are multidimensional, i.e. $D_k > 1, \forall k$. In this case, ambiguity strengths $\alpha^k_{a,b}$ are computed with $\gamma(\phi^k_a, \phi^k_b) = e^{-\{D(\phi^k_a||\phi^k_b) + D(\phi^k_b||\phi^k_a)\}}$ [7], averaged over all attributes, where $D(\cdot||\cdot)$ is the Kullback-Leibler divergence. The number $M$ of ground-truth insecurities is set to the 20% of triplets $(p_i, a_i, b_i)$ in the dataset with strongest ambiguity. Since parts are labelled with points, insecurity accuracy is measured with precision and recall. On ADE20K, the semantic descriptors $\phi^k_c$ are scalar, i.e. $D_k = 1, \forall k$, and $\phi^k_c$ is the probability of occurrence of part (object) $k$ in scenes of class $c$. This is estimated by the relative frequency with which the part appears in scenes of class $c$. Only parts such that $\phi^k_c > 0.3$ are considered. Ambiguity strengths are computed with $\gamma(\phi^k_a, \phi^k_b) = \frac{1}{2}(\phi^k_a + \phi^k_b)$. This is large when object $k$ appears very frequently in both classes, i.e. the object adds ambiguity, and smaller when this is not the case. Due to the sparsity of the matrix of ambiguity strengths $\alpha^k_{a,b}$, the number $M$ of ground-truth insecurities is set to the 1% of triplets with strongest ambiguity. Insecurity accuracy is measured with the IoU metric.

¹It should be noted that part and attribute annotations are only required to evaluate the accuracy of insecurities, not to compute the visualizations. These require no annotation.

Figure 2: Impact of different algorithm components on precision-recall (CUB200). Left: difficulty scores, center: attribution functions, right: network architectures.

Methods              10%          20%          30%          40%          50%          Avg.
Hesitancy score      8.32(0.05)   15.62(0.01)  22.25(0.02)  28.45(0.06)  34.31(0.11)  21.79(0.03)
Entropy score [47]   8.16(0.06)   15.10(0.08)  21.26(0.07)  26.92(0.18)  32.23(0.30)  20.73(0.09)
Hardness score [46]  8.63(0.12)   16.59(0.16)  24.14(0.19)  31.34(0.22)  38.29(0.24)  23.80(0.19)
Gradient [33]        8.63(0.12)   16.59(0.16)  24.14(0.19)  31.34(0.22)  38.29(0.24)  23.80(0.19)
Gradient w/o ms      8.54(0.17)   16.35(0.44)  23.70(0.77)  30.67(1.16)  37.39(1.59)  23.33(0.82)
Int. grad. [40]      8.70(0.12)   16.75(0.20)  24.37(0.27)  31.60(0.31)  38.56(0.30)  23.99(0.24)
Gradient-Hessian     8.86(0.20)   17.00(0.29)  24.65(0.32)  31.92(0.35)  38.88(0.34)  24.26(0.30)
AlexNet              8.53(0.16)   16.03(0.39)  22.97(0.65)  29.50(0.90)  35.71(1.17)  22.55(0.65)
VGG16                8.63(0.12)   16.59(0.16)  24.14(0.19)  31.34(0.22)  38.29(0.24)  23.80(0.19)
ResNet50             8.23(0.14)   15.80(0.21)  22.92(0.24)  29.76(0.27)  36.30(0.26)  22.60(0.22)

Table 1: Impact of algorithm components on IoU precision (ADE20K).

5.2 Ablation study

Difficulty Scores: Figure 2 (left) shows the precision-recall curves obtained on CUB200 for different difficulty scores. The top section of Table 1 presents the corresponding analysis for IoUs on ADE20K. Some conclusions can be drawn. First, all methods substantially outperform random insecurity extraction, whose precision is around 20%. Second, precision curves are fairly constant, while IoU increases substantially above 30% image coverage. This suggests that insecurities tend to cover ambiguous image regions, but segmentation is imperfect. Third, in both cases, the hardness score substantially outperforms the remaining scores. This suggests that self-referential difficulty scores should be avoided. The hardness score is used in the remaining experiments.

Attribution Function: Deliberative explanations are compatible with any attribution function. Figure 2 (center) and the second section of Table 1 compare the second-order approximation of (2), denoted 'gradient-Hessian,' to the more popular first-order approximation [33] consisting of the first term of (2) only ('gradient'), and the integrated gradient of [40]. 'Gradient' is also implemented without using the difficulty score attribution map in (1), denoted 'gradient w/o ms'. A few conclusions are possible. First, gradient-Hessian generally outperforms gradient, although on ADE20K there is no significant difference. [39] found experimentally that the gains of the second-order term decrease as the number of classes increases. This could explain the absence of a clear gain on ADE20K (> 1000 categories) compared with CUB200 (200 categories).
Second, 'gradient w/o ms' has the worst performance of all methods, showing that difficulty attributions are important for deliberative explanations.

Network Architectures: Figure 2 (right) and the bottom section of Table 1 compare the explanations produced by ResNet50, VGG16, and AlexNet. Since for the ResNet the second- and higher-order terms of (2) are zero (see a proof in [31]), we used the first-order approximation in these experiments. On CUB200, AlexNet performed the worst and ResNet50 the best. Interestingly, although ResNet50 and VGG16 have similar classification performance, the ResNet insecurities are much more accurate than those of VGG16. This suggests that the ResNet architecture uses more intuitive, i.e. human-like, deliberations. On ADE20K, the classification task is harder (< 60% mean accuracy) and there is no clear difference among the three architectures.

Figure 3: Deliberative visualizations for two images from CUB. Left: a Glaucous Gull creates two insecurities. Top: the insecurity shown on the left elicits ambiguity between the California and Herring Gull classes. The attributes of the shared part are listed on the right. Bottom: insecurity with ambiguity between the Western and Glaucous Gull classes. Right: similar for a Black Tern. In all insecurities, green dots locate the shared part.

Figure 4: Deliberative visualizations for four images from ADE20K.

5.3 Deliberative explanation examples

We finish by discussing some deliberative visualizations of images from the two datasets.
These results were obtained with the hardness score of (5) and gradient-based attributions on ResNet50. Figure 3 shows two examples of two insecurities each. On the left side of the figure, an insecurity on the leg/belly region of a 'Glaucous gull' is due to an ambiguity with the classes 'California gull' and 'Herring gull', with which it shares leg color 'buff', belly color 'white', and belly pattern 'solid'. A second insecurity emerges in the bill/forehead region of the gull, due to an ambiguity between 'Glaucous gull' and 'Western gull', with which the 'Glaucous gull' shares a 'hooked' bill shape and a 'white' colored forehead. The right side of the figure shows insecurities for a 'Black tern', due to a tail ambiguity between the 'Arctic' and 'Elegant' terns and a wing ambiguity between the 'Elegant' and 'Forsters' terns. Figure 4 shows single insecurities from four images of ADE20K. In all cases, the insecurities correlate with regions of attributes shared by different classes. This shows that deliberative explanations unveil truly ambiguous image regions, generating intuitive insecurities that help understand network predictions. Note, for example, how the visualization of insecurities tends to highlight classes that are semantically very close, such as the different families of gulls or terns, and class subsets such as 'plaza', 'hacienda', and 'mosque' or 'bedroom' and 'living room'.
All of this suggests that the deliberative process of the network correlates well with human reasoning.

Figure 5: MTurk interface

6 Human evaluation results

The interface designed for the human experiment is shown, with an example, in Figure 5. The region of support of the uncertainty is shown on the left and examples from five classes are displayed on the right. These include the two ambiguous classes found by the explanation algorithm, the "Laysan Albatross" and the "Glaucous Winged Gull". If the turker selects these two classes, there is evidence that the insecurity is intuitive. Otherwise, there is evidence that it is not.

We performed a preliminary human evaluation of the generated insecurities on MTurk. Given an insecurity (r, a, b) found by the explanation algorithm, turkers were shown r and asked to identify (a, b) among 5 classes (for each of which a random image was displayed), including the two classes a and b found by the algorithm.
As a comparison, randomly cropped regions of the same size as the insecurities were also shown to turkers. We found that turkers agreed amongst themselves on a and b for 59.4% of the insecurities and 33.7% of the randomly cropped regions. Turkers agreed with the algorithm for 51.9% of the insecurities and 26.3% of the random crops. This shows that 1) insecurities are much more predictive of the ambiguities sensed by humans, and 2) the algorithm predicts those ambiguities with exciting levels of consistency, given the very limited amount of optimization of algorithm components that we have performed so far. In both cases, the "Don't know" rate was around 12%.

7 Conclusion

In this work, we have presented a novel explanation strategy, deliberative explanations, aimed at visualizing the deliberative process that leads a network to a certain prediction. A procedure was proposed to generate these explanations, using second-order attributions with respect to both classes and a difficulty score. Experimental results have shown that the latter outperform the first-order attributions commonly used in the literature, and that self-referential difficulty scores should be avoided whenever possible. Strong annotations are needed only to evaluate explanation performance, i.e. on the test set. Hence, the requirement for annotations is a limitation only for the evaluation of deliberative explanation methods, not for their use by practitioners. Finally, deliberative explanations were shown to identify insecurities that correlate with human notions of ambiguity, which makes them intuitive.

Acknowledgement

This work was partially funded by NSF awards IIS-1546305, IIS-1637941, IIS-1924937, and NVIDIA GPU donations.

References

[1] Marco Ancona, Enea Ceolini, Cengiz Öztireli, and Markus Gross. A unified view of gradient-based attribution methods for deep neural networks.
In NIPS 2017-Workshop on Interpreting, Explaining and\nVisualizing Deep Learning. ETH Zurich, 2017.\n\n[2] Sebastian Bach, Alexander Binder, Gr\u00e9goire Montavon, Frederick Klauschen, Klaus-Robert M\u00fcller, and\nWojciech Samek. On pixel-wise explanations for non-linear classi\ufb01er decisions by layer-wise relevance\npropagation. PloS one, 10(7):e0130140, 2015.\n\n[3] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantify-\ning interpretability of deep visual representations. In Proceedings of the IEEE Conference on Computer\nVision and Pattern Recognition, pages 6541\u20136549, 2017.\n\n[4] Shreyansh Daftry, Sam Zeng, J Andrew Bagnell, and Martial Hebert. Introspective perception: Learning to\npredict failures in vision systems. In IEEE International Conference on Intelligent Robots and Systems,\npages 1743\u20131750. IEEE, 2016.\n\n[5] Terrance DeVries and Graham W Taylor. Learning con\ufb01dence for out-of-distribution detection in neural\n\nnetworks. arXiv preprint arXiv:1802.04865, 2018.\n\n[6] Amit Dhurandhar, Pin-Yu Chen, Ronny Luss, Chun-Chen Tu, Paishun Ting, Karthikeyan Shanmugam, and\nPayel Das. Explanations based on the missing: Towards contrastive explanations with pertinent negatives.\nIn Advances in Neural Information Processing Systems, pages 592\u2013603, 2018.\n\n[7] Dominik Maria Endres and Johannes E Schindelin. A new metric for probability distributions. IEEE\n\nTransactions on Information theory, 2003.\n\n[8] Ruth Fong and Andrea Vedaldi. Net2vec: Quantifying and explaining how concepts are encoded by\n\ufb01lters in deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 8730\u20138738, 2018.\n\n[9] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples.\n\narXiv preprint arXiv:1412.6572, 2014.\n\n[10] Yash Goyal, Ziyan Wu, Jan Ernst, Dhruv Batra, Devi Parikh, and Stefan Lee. 
Counterfactual visual\nexplanations. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th Interna-\ntional Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages\n2376\u20132384, 2019.\n\n[11] Alex Graves. Practical variational inference for neural networks. In Advances in neural information\n\nprocessing systems, pages 2348\u20132356, 2011.\n\n[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.\nIn Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770\u2013778, 2016.\n\n[13] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell.\nGenerating visual explanations. In European Conference on Computer Vision, pages 3\u201319. Springer, 2016.\n\n[14] Lisa Anne Hendricks, Ronghang Hu, Trevor Darrell, and Zeynep Akata. Generating counterfactual\n\nexplanations with natural language. arXiv preprint arXiv:1806.09809, 2018.\n\n[15] Jos\u00e9 Miguel Hern\u00e1ndez-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of\n\nbayesian neural networks. In International Conference on Machine Learning, pages 1861\u20131869, 2015.\n\n[16] Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. To trust or not to trust a classi\ufb01er. In Advances\n\nin Neural Information Processing Systems, pages 5541\u20135552, 2018.\n\n[17] Edward Johns, Oisin Mac Aodha, and Gabriel J Brostow. Becoming the expert-interactive multi-class\nmachine teaching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,\npages 2616\u20132624, 2015.\n\n[18] Atsushi Kanehira and Tatsuya Harada. Learning to explain with complemental examples. arXiv preprint\n\narXiv:1812.01280, 2018.\n\n[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classi\ufb01cation with deep convolutional\n\nneural networks. 
In Advances in neural information processing systems, pages 1097–1105, 2012.

[20] Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, and Yun Fu. Tell me where to look: Guided attention inference network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9215–9223, 2018.

[21] Scott M Lundberg and Su-In Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

[22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[23] Oisin Mac Aodha, Shihan Su, Yuxin Chen, Pietro Perona, and Yisong Yue. Teaching categories to human learners with visual explanations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3820–3828, 2018.

[24] David JC MacKay. A practical Bayesian framework for backpropagation networks. Neural Computation, 4(3):448–472, 1992.

[25] Amit Mandelbaum and Daphna Weinshall. Distance-based confidence score for neural network classifiers. arXiv preprint arXiv:1709.09844, 2017.

[26] David Alvarez Melis and Tommi Jaakkola. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, pages 7775–7784, 2018.

[27] Tim Miller. Contrastive explanation: A structural-model approach. arXiv preprint arXiv:1811.03163, 2018.

[28] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder, Wojciech Samek, and Klaus-Robert Müller. Explaining nonlinear classification decisions with deep Taylor decomposition. Pattern Recognition, 65(C):211–222, 2017.

[29] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why should I trust you?: Explaining the predictions of any classifier.
In Proceedings of the ACM SIGKDD international conference on knowledge\ndiscovery and data mining, pages 1135\u20131144. ACM, 2016.\n\n[30] Dhruv Mauria Saxena, Vince Kurtz, and Martial Hebert. Learning robust failure response for autonomous\nvision based \ufb02ight. In IEEE International Conference on Robotics and Automation, pages 5824\u20135829.\nIEEE, 2017.\n\n[31] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and\nDhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In\nProceedings of the IEEE International Conference on Computer Vision, pages 618\u2013626, 2017.\n\n[32] Murat Sensoy, Melih Kandemir, and Lance Kaplan. Evidential deep learning to quantify classi\ufb01cation\n\nuncertainty. Advances in Neural Information Processing Systems, 2018.\n\n[33] Avanti Shrikumar, Peyton Greenside, Anna Shcherbina, and Anshul Kundaje. Not just a black box:\nLearning important features through propagating activation differences. arXiv preprint arXiv:1605.01713,\n2016.\n\n[34] Avanti Shrikumar, Peyton Greenside, and Anshul Kundaje. Learning important features through propagat-\ning activation differences. In Proceedings of the International Conference on Machine Learning, pages\n3145\u20133153. JMLR. org, 2017.\n\n[35] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez,\nThomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human\nknowledge. Nature, 550(7676):354, 2017.\n\n[36] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recogni-\n\ntion. arXiv preprint arXiv:1409.1556, 2014.\n\n[37] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside convolutional networks: Visualising\n\nimage classi\ufb01cation models and saliency maps. 
arXiv preprint arXiv:1312.6034, 2013.

[38] Adish Singla, Ilija Bogunovic, Gábor Bartók, Amin Karbasi, and Andreas Krause. Near-optimally teaching the crowd to classify. In Proceedings of the International Conference on Machine Learning, volume 1, page 3, 2014.

[39] Sahil Singla, Eric Wallace, Shi Feng, and Soheil Feizi. Understanding impacts of high-order loss approximations and features in deep learning interpretation. In International Conference on Machine Learning, 2019.

[40] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks. In Proceedings of the International Conference on Machine Learning, pages 3319–3328, 2017.

[41] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[42] Laurens Van Der Maaten. Learning a parametric embedding by preserving local structure. In Artificial Intelligence and Statistics, pages 384–391, 2009.

[43] Laurens Van Der Maaten. Accelerating t-SNE using tree-based algorithms. The Journal of Machine Learning Research, 15(1):3221–3245, 2014.

[44] Laurens Van der Maaten and Geoffrey Hinton. Visualizing non-metric similarities in multiple maps. Machine Learning, 87(1):33–55, 2012.

[45] Sandra Wachter, Brent Mittelstadt, and Chris Russell. Counterfactual explanations without opening the black box: Automated decisions and the GDPR. Harv. JL & Tech., 31:841, 2017.

[46] Pei Wang and Nuno Vasconcelos. Towards realistic predictors. In European Conference on Computer Vision, 2018.

[47] Xin Wang, Yujia Luo, Daniel Crankshaw, Alexey Tumanov, Fisher Yu, and Joseph E Gonzalez. IDK cascades: Fast deep learning by learning not to overthink. arXiv preprint arXiv:1706.00885, 2017.

[48] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona.
Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.

[49] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833, 2014.

[50] Peng Zhang, Jiuling Wang, Ali Farhadi, Martial Hebert, and Devi Parikh. Predicting failures of vision systems. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3566–3573, 2014.

[51] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object detectors emerge in deep scene CNNs. International Conference on Learning Representations, 2015.

[52] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.

[53] Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.