Adversarial sample attacks perturb benign inputs to induce DNN misbehaviors. Recent research has demonstrated the widespread presence and the devastating consequences of such attacks. Existing defense techniques either assume prior knowledge of specific attacks or may not work well on complex models due to their underlying assumptions. We argue that adversarial sample attacks are deeply entangled with the interpretability of DNN models: while classification results on benign inputs can be reasoned about based on human-perceptible features/attributes, results on adversarial samples can hardly be explained. Therefore, we propose a novel adversarial sample detection technique for face recognition models, based on interpretability. It features a novel bi-directional correspondence inference between attributes and internal neurons to identify neurons critical for individual attributes. The activation values of critical neurons are strengthened to amplify the reasoning part of the computation, and the values of other neurons are weakened to suppress the uninterpretable part. The classification results after such transformation are compared with those of the original model to detect adversaries. Results show that our technique can achieve 94% detection accuracy for 7 different kinds of attacks with 9.91% false positives on benign inputs. In contrast, a state-of-the-art feature squeezing technique can only achieve 55% accuracy with 23.3% false positives.
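The detection idea described above can be sketched in a few lines. The snippet below is an illustrative toy only: the single-hidden-layer "model", the fixed `critical` mask, and the `ALPHA`/`BETA` scaling factors are all hypothetical stand-ins (in the actual technique, the critical-neuron set comes from the bi-directional attribute/neuron correspondence inference, which is not reproduced here). It shows only the final step: strengthen critical activations, weaken the rest, and flag an input when the transformed model disagrees with the original.

```python
import random

random.seed(0)

D, H, C = 8, 16, 3  # input dim, hidden units, classes (toy sizes)

# Toy single-hidden-layer classifier standing in for a face recognition model.
W1 = [[random.gauss(0, 1) for _ in range(H)] for _ in range(D)]
W2 = [[random.gauss(0, 1) for _ in range(C)] for _ in range(H)]

# Hypothetical mask of neurons deemed critical for human-perceptible
# attributes; in the paper this is derived, here it is simply fixed.
critical = [j < 6 for j in range(H)]
ALPHA, BETA = 2.0, 0.1  # illustrative strengthen/weaken factors

def predict(x, transform=False):
    # ReLU hidden activations.
    h = [max(sum(x[d] * W1[d][j] for d in range(D)), 0.0) for j in range(H)]
    if transform:
        # Strengthen critical neurons, weaken the uninterpretable rest.
        h = [(ALPHA if critical[j] else BETA) * h[j] for j in range(H)]
    logits = [sum(h[j] * W2[j][c] for j in range(H)) for c in range(C)]
    return logits.index(max(logits))

def is_adversarial(x):
    # Flag the input if the attribute-steered model disagrees
    # with the original model's prediction.
    return predict(x) != predict(x, transform=True)

x = [random.gauss(0, 1) for _ in range(D)]
print(is_adversarial(x))
```

On benign inputs, the reasoning part of the computation dominates, so the transformed model tends to agree with the original; on adversarial inputs, the suppressed uninterpretable part was carrying the misclassification, so the predictions tend to diverge.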