NeurIPS 2019
Sun Dec 8th through Sat Dec 14th, 2019, at the Vancouver Convention Center
Paper ID: 8088
Title: Robust Attribution Regularization

Reviewer 1

In this paper, the authors focus on the problem of preventing attribution attacks (small input modifications that do not change the prediction but do change the attribution). They do this by exploring two different framings of robust optimization that consider the Path Integrated Gradients ((P)IGs) for examples near the original example. Most interesting to me (although I am not an expert in this subfield) is the degree to which the paper connects robust prediction and robust attribution: showing the two are equivalent in simple NNs, explaining how robust prediction optimization as in Madry et al. allows attribution shifts to "cancel out", and demonstrating empirically that optimizing for robust attribution improves robust prediction. I'd encourage the authors to draw this out more.

The biggest weakness for me is the question of whether this approach changes my interpretation of attributions. Following the pathology example from the intro, should I now trust the attributions from a model that is designed not to be sensitive to small perturbations? Attacking attributions demonstrates their brittleness and shows that interpreting the "reasoning" of a NN from these attributions should be taken with a grain of salt, but it is unclear to me whether this is a surface-level fix to a more deeply rooted problem with attribution or whether it directly addresses a central problem. So while the theoretical connection is interesting, I'd like to see a clearer argument for why robust attribution is an important property to design models for. Overall, this paper is interesting and, in my opinion, a valuable contribution.

Details:
- Sec 3.2 seems gratuitous to me. The objective is never used, and at least as written I didn't find the connection especially intriguing.
- Is there any functional value to the use of an intermediate layer for IG in (4)?
- Table 2 is redundant with Table 3.
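For reference, the attribution method discussed in this review is Integrated Gradients (IG). The sketch below shows a standard Riemann-sum approximation of IG; it is illustrative only, not the authors' code, and the `grad_fn` callable (returning the gradient of the class score with respect to the input) and the step count are assumptions.

```python
import numpy as np

def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Riemann-sum approximation of Integrated Gradients.

    grad_fn(z) is assumed to return dF(z)/dz for the class score F,
    where z has the same shape as the flattened input x.
    """
    # Midpoints of the straight-line path from the baseline to x.
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)   # shape: (steps, d)
    # Average the gradients along the path, then scale by (x - baseline).
    avg_grad = np.mean([grad_fn(z) for z in path], axis=0)
    return (x - baseline) * avg_grad
```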

Reviewer 2

I have scored the paper a 7 since I feel that this is a well-written paper on an interesting and relevant topic. Kudos to the authors for the clear exposition, for including code, and for providing some guidance on how to set the regularization parameters. While I don't have much to critique, I do have two comments:

1. I think the paper could be *much* stronger if the authors could present some negative results regarding "robust" saliency maps. For example, is it possible to use their methods to build saliency maps that are robust to perturbations yet completely flawed in their attributions? This would make a nice use case for the method, giving us a sanity check to prevent rationalization. In line with this comment, it is worth adding a short note on the issues with saliency maps that are not fixed here (see e.g., https://arxiv.org/abs/1901.09749, https://arxiv.org/abs/1811.10154).

2. In some sense the previous comment highlights the following issue, which I would mention in the text. The method "appears to work" because the experiments report metrics that are bound to improve given the constraints (e.g., Kendall's correlation between a saliency map and its perturbed versions is bound to improve given the set of constraints; a sketch of such a metric follows this review). While I don't think this is a limitation of the work, the paper should mention it explicitly for the sake of transparency. In short, it would help new readers to know that we are only fixing a form of brittleness that we can measure, and that this will not necessarily fix the more difficult attribution problem (though it can screen out clear cases of misattribution).

Other issues:
- l.183 One-Layer <- Single-Layer
- p.4 footnote: "We stress that this regularization term depends on model parameters θ through loss function l[y]" <- what does this mean?
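To make the metric in comment 2 concrete: the stability score in question is a rank correlation between the attribution map of an input and the attribution map of its perturbed version. Below is a minimal sketch assuming flattened attribution arrays; the function name is illustrative and not from the paper.

```python
from scipy.stats import kendalltau

def attribution_stability(attr_original, attr_perturbed):
    """Kendall rank correlation between an attribution map and its
    perturbed counterpart; higher tau means a more stable feature ranking."""
    tau, _ = kendalltau(attr_original.ravel(), attr_perturbed.ravel())
    return tau
```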

Reviewer 3

The work studies an important problem, that of optimizing networks with respect to attribution robustness. The problem is related not only to interpretability for its own sake but also to improving the robustness of models in general, and that connection is very well phrased in the paper.

Comments:
- The paper would benefit from another, perhaps richer, dataset in the evaluation. MNIST is not a great example, especially when it comes to interpretations and robustness.
- In general, it is unclear (even from previous work in this space) how attribution robustness correlates with the human perception of model interpretability. I am not aware of studies that have tried to measure this empirically, but if there are any, it would be very useful to include them in the paper. If such studies do not exist, it would be beneficial to at least have a paragraph that analytically explains how close this may or may not be to human interpretation. It is fair for the reader to know whether implementing such optimizations in practice would even be visible to people at all. On the same point, it is hard to understand how the improvements in IN and CC (Table 3) relate to practical improvements in robustness; a sketch of such a metric follows these comments. Does an improvement of 0.02 really make the model's attributions more robust to input changes?
- On the ineffective optimization paragraph, point (2): this point deserves further and more precise discussion of why the authors think the architecture is not suitable. It also needs clarification on whether "architecture" here means the generic nature of NNs or the particular architectures of the networks studied in this paper.
- Minor: you could write the full name of each metric in the header of Table 3 instead of the acronyms.
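Regarding the IN and CC columns in Table 3: assuming IN denotes a top-k intersection of the highest-attribution features before and after perturbation and CC a (Kendall) rank correlation, as is common in this line of work, the intersection metric would look roughly like the sketch below. This is illustrative only, not the paper's evaluation code, and the function name and default k are assumptions.

```python
import numpy as np

def topk_intersection(attr_original, attr_perturbed, k=100):
    """Fraction of the k most important features that keep their top-k status
    after the input is perturbed; 1.0 means the top-k set is unchanged."""
    top_orig = set(np.argsort(attr_original.ravel())[-k:])
    top_pert = set(np.argsort(attr_perturbed.ravel())[-k:])
    return len(top_orig & top_pert) / k
```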