| Paper ID | 7516 |
|---|---|
| Title | Explanations can be manipulated and geometry is to blame |

The main concern with this work is originality. The paper claims to introduce the targeted attack method, although that method was already introduced in the previous literature (https://arxiv.org/pdf/1710.10547.pdf, which is also cited throughout the work), and it spends a substantial portion of its length on the targeted attack method and its results. The theoretical discussion, although touched on only shallowly in the previous literature and very intuitive, is very useful, and the direction of using differential geometry to analyze the interpretation fragility of ReLU networks sounds promising. I very much enjoyed the theoretical and empirical discussion that relates the SmoothGrad method to the introduced B-Smoothing method. Given the originality concern, I am afraid that the current contributions of the paper, overall, are not enough for a venue like NeurIPS. The material in the paper is also not coherent enough for the paper to be published as it is.
Some questions:
- How can we argue that, after changing the activation of a network to obtain the B-Smoothed maps, the resulting explanation still explains the original network? In other words, are we computing the explanations for a different network? I think comparing the saliency map given by this method to the original saliency map in a non-adversarial setting could show that similarity is preserved. (The question is similar to the accuracy-robustness trade-off studied by the adversarial defense community.)
- How does the B-Smoothed explanation method, viewed as a new explanation method, compare to other methods on metrics of explanation quality (such as removing pixels in order of saliency and tracking the prediction accuracy)?
- How would training a network with softplus activations from the beginning work? In this case, there is no guarantee that the Frobenius-norm bound on the Hessian of this new network is smaller than that of the same architecture trained with ReLU activations, so it would be hard to argue that the softplus network is more robust.
- Just to make sure: in the results section, for the B-Smoothed and SmoothGrad results, was the perturbation recreated for the specific explanations, or was the same perturbation as for the ReLU network used? (Figs. 4, 5 and 14-17)
- How could we use the provided theoretical discussion to compare the interpretation fragility of two trained models?
_________________________________________
ADDITIONAL COMMENT AFTER AUTHOR RESPONSE: The authors did a very good job of clarifying the distinctions from previous work. The given answers are very clear, and it is impressive that so many experiments were performed in the short rebuttal period. The score is adjusted to reflect the response.
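The first question above (whether a map computed after swapping the activation still explains the original network) could be probed with a check along the following lines. This is only a minimal sketch on a hypothetical toy network: the weights `W1`, `B1`, `W2`, the input `x`, and the helper names are all assumptions made for illustration, and the softplus with sharpness parameter `beta` stands in for the smoothed activation. It compares the plain ReLU input gradient with the softplus input gradient by cosine similarity in a non-adversarial setting.

```python
import math

# Hypothetical toy network: f(x) = sum_j W2[j] * act(W1[j] . x + B1[j]).
W1 = [[1.0, -2.0, 0.5], [-1.5, 0.5, 1.0], [0.5, 1.0, -1.0], [2.0, 0.0, -0.5]]
B1 = [0.1, -0.2, 0.3, -0.4]
W2 = [1.0, -1.0, 0.5, 2.0]

def relu_slope(z):
    # Derivative of ReLU (taken as 0 at z == 0).
    return 1.0 if z > 0 else 0.0

def softplus_slope(z, beta):
    # Derivative of (1/beta) * log(1 + exp(beta * z)) is sigmoid(beta * z).
    t = beta * z
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)  # numerically stable branch for negative t
    return e / (1.0 + e)

def input_gradient(x, slope):
    # Gradient of f with respect to the input x (the "gradient explanation").
    grad = [0.0] * len(x)
    for row, b, w_out in zip(W1, B1, W2):
        z = sum(w * xi for w, xi in zip(row, x)) + b
        s = w_out * slope(z)
        for i, w in enumerate(row):
            grad[i] += s * w
    return grad

def cosine(a, b):
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

x = [0.3, -0.7, 1.2]
relu_map = input_gradient(x, relu_slope)
betas = [1.0, 10.0, 100.0]
similarities = [
    cosine(relu_map, input_gradient(x, lambda z, b=b: softplus_slope(z, b)))
    for b in betas
]
for b, s in zip(betas, similarities):
    print(f"beta={b:6.1f}  cosine similarity to ReLU map = {s:.6f}")
```

On this toy example the similarity approaches 1 as `beta` grows, which is the kind of evidence the question asks for; whether it holds for the networks and maps in the paper would have to be checked on the actual models.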

Originality: To the best of my knowledge, adversarial manipulation of explanations is new, although it was foreshadowed by previous research. The constrained optimization method introduced to generate adversarial manipulations closely follows the ones used for generating class-adversarial examples, but it is also new. The theoretical analysis is very closely related to the ideas of Ghorbani et al. 2017, but it goes well beyond them.
Quality: The paper is of good quality. Its main strength is the intuition that the fragility of explanations can be effectively exploited. This is an important message. The methodological contribution is somewhat simple and limited (e.g. the manipulation algorithm only applies to differentiable explainers such as input gradients). The theoretical analysis does make up for it. The length of the related work is somewhat surprising, considering the amount of recent work on explainability.
Clarity: All ideas are presented clearly.
Significance: This paper showcases an important phenomenon at the interface between adversarial attacks and explainability; I am confident that it will receive a lot of attention.
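For readers unfamiliar with this kind of constrained optimization, a generic objective of the type referred to under Originality can be sketched as follows. The notation is an assumption made here for illustration and is not quoted from the paper: $h$ is the explanation map, $h^{t}$ a target map, $g$ the network output, and $\gamma > 0$ a trade-off weight.

```latex
% Sketch of a targeted explanation-manipulation objective (assumed notation).
% First term: pull the explanation of the perturbed input toward the target map.
% Second term: keep the network output close to its value on the original input.
\mathcal{L}(x_{\mathrm{adv}})
  = \bigl\| h(x_{\mathrm{adv}}) - h^{t} \bigr\|^{2}
  + \gamma \, \bigl\| g(x_{\mathrm{adv}}) - g(x) \bigr\|^{2},
  \qquad \gamma > 0 .
```

Minimizing such an objective by gradient descent requires differentiating $h$ with respect to the input, which is consistent with the limitation to differentiable explainers noted under Quality.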

The paper is well written and strikes a balance between theory and practice. I find Fig. 2 very clear, but Fig. 1 is not clear enough for the first figure of the paper: the caption should state that the bottom row shows the explanation maps, mention adversarial maps, present the figure as a proof of concept of bad manipulations in practice, etc. Please revise the caption.
- I would prefer a different title, such as "Adversarial explanations depend on the geometric curvature: Better smoothed than manipulated".
- In Eq. 4, do you mean a minus rather than a plus for gamma > 0? If so, please explain why.
- Line 65: state that the symbol denotes pointwise multiplication.
- Can you describe in more detail the effect of clamping at each iteration, and why not clamp only after a few iterations?
- Line 137: Hyperplane -> Hypersurface.
- Say that the second fundamental form is also called the embedded curvature, and maybe state that the first fundamental form is the metric tensor.
- After line 150, remove the second minus (the inner product is non-negative).
- I like Fig. 19 of the SI.
- Talk about manifold foliations in general?
Minor edits: use \mathrm{adv}; use upper-case letters in bibliography references 7 and 16.
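The clamping question above can be made concrete with a small experiment. The following is a toy sketch, not the paper's procedure: the quadratic `loss`, the box [0, 1], the starting point, and all function names are assumptions chosen only so that the two schedules give visibly different answers. It compares gradient descent that clamps after every step (projected gradient descent) with gradient descent that clamps only once at the end.

```python
def loss(x, y):
    # Hypothetical coupled quadratic whose unconstrained minimum (1.5, 0.5)
    # lies outside the valid box [0, 1] x [0, 1].
    u, v = x - 1.5, y - 0.5
    return u * u + v * v + u * v

def loss_grad(x, y):
    u, v = x - 1.5, y - 0.5
    return 2 * u + v, 2 * v + u

def clamp(t, lo=0.0, hi=1.0):
    return max(lo, min(hi, t))

def descend(clamp_every_step, steps=500, lr=0.1):
    x, y = 0.2, 0.2
    for _ in range(steps):
        gx, gy = loss_grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        if clamp_every_step:
            # Projection step: keep every iterate inside the valid range.
            x, y = clamp(x), clamp(y)
    # Both variants return a point inside the box.
    return clamp(x), clamp(y)

per_step = descend(clamp_every_step=True)
at_end = descend(clamp_every_step=False)
print("clamp every step:", per_step, "loss:", loss(*per_step))
print("clamp at the end:", at_end, "loss:", loss(*at_end))
```

In this toy setting, clamping at every step lets the remaining coordinate adapt to the active constraint and reaches a lower loss inside the box, whereas clamping only at the end simply truncates the unconstrained solution. Whether the same effect matters for the manipulation procedure in the paper is exactly what the question asks the authors to explain.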