NeurIPS 2019
Sun Dec 8th through Sat the 14th, 2019 at Vancouver Convention Center
Paper ID: 1057
Title: Defense Against Adversarial Attacks Using Feature Scattering-based Adversarial Training

Reviewer 1

To my knowledge, the idea of coupling perturbations across examples is new and worthy of additional exploration. This paper makes a nice contribution in that direction. The idea and its algorithmic instantiation both seem well done. The paper claims to (significantly) improve on the state of the art across a variety of tasks. Currently, the adversarial evaluation seems suspect, but if these holes are addressed, this paper would be a strong contribution in originality, quality, and significance. (I remain unclear whether the proposed algorithm ought to be an improvement over standard adversarial training, and thus the empirical results are particularly important here.)

My major concerns are whether gradient masking is present in the model, and whether it is tested against the strongest possible attacks. I will focus on CIFAR-10 at eps=8, as this is by far the most competitive benchmark.

First, it is suspicious that the black-box attack (transferring from an undefended model) does better than the white-box attack. Against most adversarially robust models, black-box attacks are extremely weak. E.g., Madry reports 86% accuracy when transferring from an undefended model.

Second, the large gap between PGD and the CW variant (the paper says this is PGD using the CW loss) is suspicious. If I understand correctly, the only difference is a cross-entropy loss vs. a margin loss. In typical models, e.g. Madry, these converge to similar values by 100 iterations, but here there is an 8% gap (68.6 vs. 60.6).

Several comments on clarity, and miscellaneous questions:

I found the description of Algorithm 1 unclear. For estimating grad_x’ D(mu, nu), my understanding is that the algorithm first estimates the transport matrix T (using e.g. Sinkhorn), and then computes the gradient treating T as fixed, ignoring the dependence of T on x’. On first read, it was unclear how this gradient was estimated; this reading seemed the most sensible to me, but please correct me if it is mistaken.

How are u_i and v_i defined? I couldn’t find this in the paper. The natural choice seems to be 1/N (uniform over the dataset), but why introduce u_i and v_i in this case? E.g., the description in equation (7) would be more straightforward.

Minor: I would mention Sinkhorn and IPOT sooner. Otherwise the reader is left wondering how the max in Eq. 6 is solved until the end of the experimental section.

Minor: Feature scattering seems to combine two distinct ideas: first, using an unsupervised adversary operating on distances between activations (rather than labels), and second, coupling perturbations across examples. It would be nice to decouple the impacts of these. Looking at the Identity matching scheme is nice (as this isolates the first idea without the second), and the comparison here could be developed further.

Why pick label smoothing lambda=0.5? In the supplement, lambda=0.8 seems significantly better (65% against the strongest adversary vs. 60%).

It’s nice that only 1 attack iteration is necessary for the reported results. I think this is worth emphasizing earlier.

Finally, I’m glad to see the authors note that they intend to open-source their model and evaluation code. I believe this is a great practice for the adversarial robustness community.

__________________

Update: I have changed my score from a 5 to a 7, largely in response to the identified bug in which the reported black-box numbers in fact corresponded to a white-box evaluation (as well as the additional convergence plots and stronger attacks). With the reconciled results, the paper proposes a promising idea (computing perturbations which are coupled across examples) and a significant improvement over SOTA on a competitive benchmark (CIFAR-10 at eps=8, white-box).

I am somewhat hesitant after the initial mistake. In particular, this is the kind of issue that a sanity check should have caught before submission, and the fact that it went unnoticed is somewhat concerning. The fact that the code is being open-sourced, so that it will be relatively easy for the community to verify the claims made in the paper, is a significant contributing factor to my updated score and to my not dwelling too much on this oversight. I would also encourage the authors to include the additional adversarial evaluations (with fixes as suggested by other reviewers) and ablation studies (over the label smoothing parameter) in the final version.

Reviewer 2

Update after reading the authors' response: I changed my score to accept after reading the authors' response because the authors addressed most of the reviewers' comments and also explained that the initial black-box results [which created the impression of gradient masking] were erroneously entered. Still, my concern about gradient-free attacks is not fully addressed. The authors did provide results on a gradient-free attack; however, the chosen method seems to be a weak attack. So I would highly encourage the authors to perform experiments using a few additional gradient-free attacks (ex: ) and report them in the final version of the paper.

-----------------------------------------------

Original review:

Originality: The paper proposes a novel technique to improve model robustness against L-infinity adversarial attacks. The technique takes into account similarity between features in the minibatch.

Quality: There are several issues with the paper which make me lean towards rejection:

1) One big question which seems not to be addressed is the computational complexity of the proposed method. With the same number of inner iterations T, the proposed method requires more compute than PGD adversarial training. So assuming that the number of iterations T was the same in the experiments, the proposed method is actually given some advantage in terms of extra compute.

2) I could not find information about the number of iterations T of the inner optimization used in the experiments. I assume that it was the same for PGD adversarial training and feature scattering, but it’s not clear from the text.

3) One of the issues which feature scattering claims to address is label leaking. However, as the authors mention, it could be addressed by other means, like guessing labels. So I wonder how the proposed method would compare with PGD adversarial training which does not use true labels.

4) There are issues with the evaluation of the proposed defenses:

4a) According to Table 3, the proposed method reaches 68.6 accuracy against the PGD100 attack and 60.6 accuracy against the CW100 attack in the white-box case. At the same time, according to the table in Section 5.3, it has lower accuracy against the same attacks in the black-box case. Black-box accuracy being lower than white-box accuracy is usually an indication of gradient masking.

4b) The authors limit their evaluation to no more than 100 iterations of the PGD attack. Also, it’s not clear whether they use random restarts, and if so how many. From the data tables it can be observed that accuracy decreases as the number of iterations increases. This means that the attack used for evaluation might be too weak.

4c) No attempt is made to use gradient-free attacks to evaluate robustness.

Clarity: The paper is reasonably well structured. The section about feature scattering is somewhat harder to read. I would encourage the authors to expand their intuitive explanation of what happens during feature scattering. It would also be useful to include an intuitive explanation and/or illustration of what happens in feature scattering adversarial training.

Significance: If all issues with the evaluation are addressed and the method still shows improvement over the baseline, then I would say it has moderate significance: it does not solve the problem of adversarial examples completely, but it does provide an interesting idea and improved metrics.
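The checks requested in 4a and 4b (and the PGD vs. CW comparison raised by Reviewer 1) can be sketched as follows: run PGD with a cross-entropy loss and with a margin (CW-style) loss for an increasing number of iterations, and see whether accuracy under attack levels off. This is only an illustration on a toy linear classifier with hand-written gradients. The model, data, `eps`, and step-size rule are hypothetical stand-ins, and pixel-range clipping is omitted.

```python
import numpy as np

def grad_ce(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x for logits z = W @ x + b."""
    z = W @ x + b
    p = np.exp(z - z.max())
    p /= p.sum()
    p[y] -= 1.0                      # softmax(z) - onehot(y)
    return W.T @ p

def grad_margin(W, b, x, y):
    """Gradient of the margin loss max_{j != y} z_j - z_y w.r.t. x."""
    z = W @ x + b
    z_other = z.copy()
    z_other[y] = -np.inf             # exclude the true class from the max
    j = int(np.argmax(z_other))
    return W[j] - W[y]

def pgd_linf(W, b, x, y, grad_fn, eps, step, n_iters, rng):
    """Signed-gradient ascent on the loss, projected onto the L-infinity ball around x."""
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)   # random start
    for _ in range(n_iters):
        x_adv = x_adv + step * np.sign(grad_fn(W, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)       # project back into the ball
    return x_adv

def robust_accuracy(W, b, X, Y, grad_fn, eps, n_iters, seed=0):
    """Fraction of examples still classified correctly after the attack."""
    rng = np.random.default_rng(seed)
    step = 2.5 * eps / n_iters       # assumed step-size heuristic
    correct = 0
    for x, y in zip(X, Y):
        x_adv = pgd_linf(W, b, x, y, grad_fn, eps, step, n_iters, rng)
        correct += int(np.argmax(W @ x_adv + b) == y)
    return correct / len(X)

rng = np.random.default_rng(0)
W, b = rng.normal(size=(3, 5)), np.zeros(3)
X = rng.normal(size=(200, 5))
Y = np.argmax(X @ W.T + b, axis=1)   # labels chosen so that clean accuracy is 100%

for k in (1, 10, 100):
    acc_ce = robust_accuracy(W, b, X, Y, grad_ce, eps=0.3, n_iters=k)
    acc_cw = robust_accuracy(W, b, X, Y, grad_margin, eps=0.3, n_iters=k)
    print(f"iters={k:3d}  CE loss: {acc_ce:.3f}  margin loss: {acc_cw:.3f}")
```

On a real defended network one would use autodiff gradients and a larger iteration budget. The point of the sketch is the shape of the experiment: if the two losses still disagree substantially, or accuracy keeps dropping at the largest iteration count, the attack has probably not converged.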

Reviewer 3

****** Update *******

Like other reviewers, I'm happy to see that there was a good explanation for why the black-box setting was broken. There are two additional points I'd like to make:

1) The authors either explain the multiple-restarts setting wrongly or apply it wrongly. Instead of running 5 separate evaluations and picking the worst (min of mean of accuracy under attack), they should repeat the attack 5 times for each example in the dataset and take the worst (mean of min). This is really important, and the authors should fix this and add these results to the paper.

2) I encourage the authors to share their model as quickly as possible and before the conference. This will allow the community to make sure that the authors' claims are correct. If they'd rather have this evaluation made in private before fully releasing the model, I'm happy for the AC to transmit my contact details.

All in all, I believe the current score is fair if the authors follow up on their promise to open-source their code (and hopefully model).

***********************

The authors introduce a feature scattering-based adversarial training approach (based on solving an optimal transport problem). The main motivation is to avoid label leaking. Hence the authors also use label smoothing.

Overall, the paper is well written. The motivation is sound and the evaluation seems appropriate. In any case, the results are very impressive (beating the previous state-of-the-art method by 4% in absolute terms on CIFAR-10 with a similar evaluation; see TRADES: TRadeoff-inspired Adversarial DEfense via Surrogate-loss minimization by Zhang et al.). As always, it is very hard to judge whether the evaluation is done correctly, and I would urge the authors to release their models if possible.

Pros:
* SOTA results.
* Novel method to generate adversarial examples.

Cons:
* Some details are missing (e.g., better study of label smoothing).
* The black-box attack is stronger than CW100, which is suspicious.

Details:

1) Label leaking seems to be the core of the problem. The authors should define it earlier in the introduction.

2) The transport cost is defined by the cosine distance in logit-space. Can the authors motivate this choice?

3) The batch size is likely to be an important factor. Can the authors provide results with varying batch sizes?

4) Label smoothing is very large (0.5). On CIFAR-10, I haven't seen such a large smoothing applied to adversarial training. The authors should analyze the effect of such smoothing (or consider removing it). In particular, could the authors add rows corresponding to 0.0, 0.1 and 0.2 to Table 1 in the supplementary material and also evaluate standard adversarial training (Madry et al.) with such smoothing applied?

5) The table in Section 5.3 (black-box attack) is slightly at odds with the other results. In particular, the examples found using the undefended network seem to transfer extremely well (even beating CW100 on the original model), yielding an accuracy under attack of 59.7% (while CW100 yields 60.6%). Maybe I misunderstood what the authors meant by black-box attack.

Minor:

a) In Table 1, the notation 0.0 and 0.00 is inconsistent.
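The distinction in point 1 of the update, between "min of mean" and "mean of min" over restarts, can be illustrated with a small simulation. The per-restart outcomes below are synthetic. The assumption that each restart independently breaks an example with probability 0.4 is made only so the gap is visible, and does not describe the paper's model.

```python
import numpy as np

rng = np.random.default_rng(0)
n_restarts, n_examples = 5, 1000

# correct[r, i] is True if example i is still classified correctly after restart r.
# Synthetic outcomes: every restart independently breaks an example with probability 0.4.
correct = rng.random((n_restarts, n_examples)) > 0.4

# Min of mean: run 5 full evaluations and report the lowest dataset-level accuracy.
min_of_mean = correct.mean(axis=1).min()

# Mean of min: an example counts as robust only if it survives every restart.
mean_of_min = correct.all(axis=0).mean()

print(f"min of mean: {min_of_mean:.3f}")
print(f"mean of min: {mean_of_min:.3f}")
```

The per-example aggregation can never report a higher accuracy than the per-run one, since an example that survives all restarts is counted as correct in every individual run. Under the independence assumption used here the two numbers are far apart (about 0.6 versus 0.6^5 ≈ 0.08 in expectation). For a real model, restarts are correlated and the gap is typically much smaller.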