CMU-ML-20-102
Machine Learning Department
School of Computer Science, Carnegie Mellon University




Provable, structured, and efficient methods for
robustness of deep networks to adversarial examples

Eric Wong

May 2020

Ph.D. Thesis

CMU-ML-20-102.pdf


Keywords: Adversarial examples, provable defenses, adversarial training, deep networks


While deep networks have contributed to major leaps in raw performance across various applications, they are also known to be quite brittle to targeted data perturbations. By adding a small amount of adversarial noise to the data, it is possible to drastically change the output of a deep network. The existence of these so-called adversarial examples, perturbed data points which fool the model, poses a serious risk for safety- and security-centric applications where reliability and robustness are critical. In this dissertation, we present and analyze a number of approaches for mitigating the effect of adversarial examples, also known as adversarial defenses. These defenses can offer varying degrees and types of robustness, and we study defenses which differ in the strength of the robustness guarantee, the efficiency and simplicity of the defense, and the type of perturbation being defended against.

We start with the strongest type of guarantee, called provable adversarial defenses, showing that it is possible to compute duality-based certificates which guarantee that no adversarial examples exist within an 𝓁p-bounded region. These certificates are trainable and can be minimized to learn networks which are provably robust to adversarial attacks. The approach is agnostic to the specific architecture, is applicable to arbitrary computational graphs, and scales to medium-sized convolutional networks with random projections.
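
As a rough sketch of the underlying idea (the notation here is illustrative rather than the thesis's exact formulation), the robust training problem and the certified surrogate that is actually minimized take the form

\min_\theta \sum_{i=1}^{n} \max_{\|\delta\|_p \le \epsilon} \ell\big(f_\theta(x_i + \delta),\, y_i\big)
\;\le\;
\min_\theta \sum_{i=1}^{n} J_\epsilon(x_i, y_i; \theta),

where J_\epsilon denotes a dual-feasible upper bound on the inner maximization. Because J_\epsilon is an explicit, differentiable function of the network parameters, it can be minimized with standard gradient-based training, and at any input where the bound rules out a change in prediction, no 𝓁p-bounded adversarial example can exist.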

We then switch gears to developing a deeper understanding of a more empirical defense known as adversarial training. Although adversarial training does not come with formal guarantees, it can learn networks more efficiently and with better empirical performance against attacks. We study the optimization process and reveal several intriguing properties of the robust learning problem: a simple modification to one of the earliest adversarial attacks can be sufficient to learn networks robust to much stronger attacks, and adversarial training as a general procedure is highly susceptible to overfitting. These discoveries have significant implications for both the efficiency of adversarial training and the state of the field: for example, virtually all recent algorithmic improvements in adversarial training can be matched by simply using early stopping.
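
To illustrate the kind of simple modification involved, the following is a minimal PyTorch sketch of single-step (FGSM) adversarial training with a random start; the model, data loader, optimizer, and hyperparameter names are placeholders rather than the thesis's exact code.

import torch
import torch.nn.functional as F

def fgsm_train_epoch(model, loader, opt, eps, alpha, device="cuda"):
    """One epoch of single-step adversarial training with a random start."""
    model.train()
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        # Random initialization inside the eps-ball: the simple modification
        # that makes a single-step attack sufficient for training.
        delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
        F.cross_entropy(model(x + delta), y).backward()
        # One FGSM step on the perturbation, then project back onto the
        # l-infinity eps-ball. (Clipping x + delta to the valid input
        # range is omitted for brevity.)
        delta = (delta + alpha * delta.grad.sign()).clamp(-eps, eps).detach()
        # Update the model on the perturbed examples.
        opt.zero_grad()
        F.cross_entropy(model(x + delta), y).backward()
        opt.step()

Combined with early stopping on a held-out robust accuracy, this style of training is far cheaper than multi-step alternatives.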

The final component of this dissertation expands the realm of adversarial examples beyond 𝓁p-norm bounded perturbations, to enable more realistic threat models for applications beyond imperceptible noise. We define a threat model called the Wasserstein adversarial example, which captures semantically meaningful image transformations, such as translations and rotations, that are not captured by existing threat models. We present an efficient algorithm for projecting onto Wasserstein balls, enabling both the generation of and adversarial training against Wasserstein adversarial examples. Finally, we demonstrate how to generalize adversarial training to defend against multiple types of threats simultaneously, improving upon naive aggregations of adversarial attacks.
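
To give a sense of how a projection algorithm plugs into attack generation, here is a minimal sketch of projected gradient ascent on the loss with the projection step abstracted away; `project` is a placeholder argument, and a Wasserstein-ball projection (the thesis uses a Sinkhorn-style projected iteration) would be supplied in its place. All names here are illustrative.

import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, project, step_size=0.01, n_steps=50):
    """Generic projected gradient ascent: take ascent steps on the loss,
    projecting the perturbed input back onto the chosen perturbation set
    (e.g. an l-infinity ball or a Wasserstein ball) after each step."""
    x_adv = x.clone().detach()
    for _ in range(n_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        # Ascent step on the loss (a signed step is common for l-infinity),
        # followed by projection onto the threat model's ball around x.
        x_adv = project(x_adv.detach() + step_size * grad, x)
    return x_adv.detach()

Supplying an 𝓁∞ clipping function as `project` recovers the standard PGD attack, while supplying a Wasserstein-ball projection yields the corresponding Wasserstein attack; adversarial training then simply trains on the returned examples.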

168 pages

Thesis Committee:
J. Zico Kolter (Chair)
Barnabás Póczos
Matt Fredrikson
Aleksander Madry (Massachusetts Institute of Technology)

Roni Rosenfeld, Head, Machine Learning Department
Martial Hebert, Dean, School of Computer Science

