Part of Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Kaiwen Yang, Tianyi Zhou, Yonggang Zhang, Xinmei Tian, Dacheng Tao
What is the minimum information a neural network D(⋅) needs from an image x to accurately predict its class? Extracting such information from x in the input space can locate the areas D(⋅) mainly attends to and shed novel insight on the detection of and defense against adversarial attacks. In this paper, we propose ''class-disentanglement'', which trains a variational autoencoder G(⋅) to extract this class-dependent information as x−G(x) via a trade-off between reconstructing x by G(x) and classifying x by D(x−G(x)): the former competes with the latter in decomposing x, so that the latter retains in x−G(x) only the information necessary for classification. We apply this decomposition to both clean images and their adversarially perturbed counterparts and find that the perturbations generated by adversarial attacks mainly lie in the class-dependent part x−G(x). The decomposition also provides novel interpretations of classification and attack models. Inspired by these observations, we propose to conduct adversarial detection on x−G(x) and adversarial defense on G(x), which consistently outperform detection and defense on the original x. In experiments, this simple approach substantially improves the detection of and defense against different types of adversarial attacks.
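To make the trade-off concrete, below is a minimal sketch of a training objective of this form in PyTorch. It is an illustration under stated assumptions, not the authors' implementation: the VAE architecture, the classifier D (assumed pretrained and differentiable), the loss weights lam and beta, and the optimizer are all hypothetical placeholders chosen for a 32×32 RGB input.

```python
# Sketch of a class-disentanglement-style objective (assumptions noted above):
# G is a small convolutional VAE; D is any differentiable classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    def __init__(self, z_dim=128):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 32, 4, 2, 1), nn.ReLU(),   # 32x32 -> 16x16
            nn.Conv2d(32, 64, 4, 2, 1), nn.ReLU(),  # 16x16 -> 8x8
            nn.Flatten())
        self.fc_mu = nn.Linear(64 * 8 * 8, z_dim)
        self.fc_logvar = nn.Linear(64 * 8 * 8, z_dim)
        self.fc_dec = nn.Linear(z_dim, 64 * 8 * 8)
        self.dec = nn.Sequential(
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Sigmoid())

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: sample z from N(mu, exp(logvar)).
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(self.fc_dec(z)), mu, logvar

def class_disentangle_loss(G, D, x, y, lam=1.0, beta=1.0):
    """Trade-off: reconstruct x with G(x) vs. classify x from x - G(x)."""
    recon, mu, logvar = G(x)
    rec_loss = F.mse_loss(recon, x)                                  # reconstruction term
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())    # VAE KL term
    cls_loss = F.cross_entropy(D(x - recon), y)                      # classify class-dependent part
    return rec_loss + beta * kl + lam * cls_loss

# Hypothetical usage (pretrained_classifier and loader are assumed to exist):
# G, D = VAE().cuda(), pretrained_classifier.cuda()
# opt = torch.optim.Adam(G.parameters(), lr=1e-3)
# for x, y in loader:
#     loss = class_disentangle_loss(G, D, x.cuda(), y.cuda())
#     opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch, lam controls how strongly the classification term on x−G(x) competes with the reconstruction term: larger values push more class-dependent information out of G(x) and into the residual x−G(x).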