{"title": "Measuring Neural Net Robustness with Constraints", "book": "Advances in Neural Information Processing Systems", "page_first": 2613, "page_last": 2621, "abstract": "Despite having high accuracy, neural nets have been shown to be susceptible to adversarial examples, where a small perturbation to an input can cause it to become mislabeled. We propose metrics for measuring the robustness of a neural net and devise a novel algorithm for approximating these metrics based on an encoding of robustness as a linear program. We show how our metrics can be used to evaluate the robustness of deep neural nets with experiments on the MNIST and CIFAR-10 datasets. Our algorithm generates more informative estimates of robustness metrics compared to estimates based on existing algorithms. Furthermore, we show how existing approaches to improving robustness \u201coverfit\u201d to adversarial examples generated using a specific algorithm. Finally, we show that our techniques can be used to additionally improve neural net robustness both according to the metrics that we propose, but also according to previously proposed metrics.", "full_text": "Measuring Neural Net Robustness with Constraints\n\nOsbert Bastani\nStanford University\n\nobastani@cs.stanford.edu\n\nDimitrios Vytiniotis\nMicrosoft Research\n\ndimitris@microsoft.com\n\nYani Ioannou\n\nUniversity of Cambridge\n\nyai20@cam.ac.uk\n\nAditya V. Nori\n\nMicrosoft Research\n\nadityan@microsoft.com\n\nLeonidas Lampropoulos\nUniversity of Pennsylvania\nllamp@seas.upenn.edu\n\nAntonio Criminisi\nMicrosoft Research\n\nantcrim@microsoft.com\n\nAbstract\n\nDespite having high accuracy, neural nets have been shown to be susceptible to\nadversarial examples, where a small perturbation to an input can cause it to become\nmislabeled. We propose metrics for measuring the robustness of a neural net and\ndevise a novel algorithm for approximating these metrics based on an encoding of\nrobustness as a linear program. 
We show how our metrics can be used to evaluate the robustness of deep neural nets with experiments on the MNIST and CIFAR-10 datasets. Our algorithm generates more informative estimates of robustness metrics than estimates based on existing algorithms. Furthermore, we show how existing approaches to improving robustness “overfit” to adversarial examples generated using a specific algorithm. Finally, we show that our techniques can be used to improve neural net robustness not only according to the metrics that we propose, but also according to previously proposed metrics.

1 Introduction

Recent work [21] shows that it is often possible to construct an input mislabeled by a neural net by perturbing a correctly labeled input by a tiny amount in a carefully chosen direction. Lack of robustness can be problematic in a variety of settings, such as changing camera lens or lighting conditions, successive frames in a video, or adversarial attacks in security-critical applications [18]. A number of approaches have since been proposed to improve robustness [6, 5, 1, 7, 20]. However, work in this direction has been handicapped by the lack of objective measures of robustness. A typical approach to improving the robustness of a neural net f is to use an algorithm A to find adversarial examples, augment the training set with these examples, and train a new neural net f′ [5]. Robustness is then evaluated by using the same algorithm A to find adversarial examples for f′—if A discovers fewer adversarial examples for f′ than for f, then f′ is concluded to be more robust than f. However, f′ may have overfit to adversarial examples generated by A—in particular, a different algorithm A′ may find as many adversarial examples for f′ as for f. 
Having an objective robustness measure is vital not only to reliably compare different algorithms, but also to understand robustness of production neural nets—e.g., when deploying a login system based on face recognition, a security team may need to evaluate the risk of an attack using adversarial examples.
In this paper, we study the problem of measuring robustness. We propose to use two statistics of the robustness ρ(f, x∗) of f at point x∗ (i.e., the L∞ distance from x∗ to the nearest adversarial example) [21]. The first one measures the frequency with which adversarial examples occur; the other measures the severity of such adversarial examples. Both statistics depend on a parameter ε, which intuitively specifies the threshold below which adversarial examples should not exist (i.e., points x with L∞ distance to x∗ less than ε should be assigned the same label as x∗).

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

The key challenge is efficiently computing ρ(f, x∗). We give an exact formulation of this problem as an intractable optimization problem. To recover tractability, we approximate this optimization problem by constraining the search to a convex region Z(x∗) around x∗. Furthermore, we devise an iterative approach to solving the resulting linear program that produces an order of magnitude speed-up. Common neural nets (specifically, those using rectified linear units as activation functions) are in fact piecewise linear functions [15]; we choose Z(x∗) to be the region around x∗ on which f is linear. Since the linear nature of neural nets is often the cause of adversarial examples [5], our choice of Z(x∗) focuses the search where adversarial examples are most likely to exist.
We evaluate our approach on a deep convolutional neural network f for MNIST. 
We estimate ρ(f, x∗) using both our algorithm A_LP and (as a baseline) the algorithm A_L-BFGS introduced by [21]. We show that A_LP produces a substantially more accurate estimate of ρ(f, x∗) than A_L-BFGS. We then use data augmentation with each algorithm to improve the robustness of f, resulting in fine-tuned neural nets f_LP and f_L-BFGS. According to A_L-BFGS, f_L-BFGS is more robust than f, but not according to A_LP. In other words, f_L-BFGS overfits to adversarial examples computed using A_L-BFGS. In contrast, f_LP is more robust according to both A_L-BFGS and A_LP. Furthermore, to demonstrate scalability, we apply our approach to evaluate the robustness of the 23-layer network-in-network (NiN) neural net [13] for CIFAR-10, and reveal a surprising lack of robustness. We fine-tune NiN and show that robustness improves, albeit only by a small amount. In summary, our contributions are:

• We formalize the notion of pointwise robustness studied in previous work [5, 21, 6] and propose two statistics for measuring robustness based on this notion (§2).
• We show how computing pointwise robustness can be encoded as a constraint system (§3). We approximate this constraint system with a tractable linear program and devise an optimization for solving this linear program an order of magnitude faster (§4).
• We demonstrate experimentally that our algorithm produces substantially more accurate measures of robustness compared to algorithms based on previous work, and show evidence that neural nets fine-tuned to improve robustness (§5) can overfit to adversarial examples identified by a specific algorithm (§6).

1.1 Related work

The susceptibility of neural nets to adversarial examples was discovered by [21]. 
Given a test point x∗ with predicted label ℓ∗, an adversarial example is an input x∗ + r with predicted label ℓ ≠ ℓ∗, where the adversarial perturbation r is small (in L∞ norm). Then, [21] devises an approximate algorithm for finding the smallest possible adversarial perturbation r. Their approach is to minimize the combined objective loss(f(x∗ + r), ℓ) + c‖r‖∞, which is an instance of box-constrained convex optimization that can be solved using L-BFGS-B. The constant c is optimized using line search.
Our formalization of the robustness ρ(f, x∗) of f at x∗ corresponds to the notion in [21] of finding the minimal ‖r‖∞. We propose an exact algorithm for computing ρ(f, x∗) as well as a tractable approximation. The algorithm in [21] can also be used to approximate ρ(f, x∗); we show experimentally that our algorithm is substantially more accurate than [21].
There has been a range of subsequent work studying robustness; [17] devises an algorithm for finding purely synthetic adversarial examples (i.e., no initial image x∗), [22] searches for adversarial examples using random perturbations, showing that adversarial examples in fact exist in large regions of the pixel space, [19] shows that even intermediate layers of neural nets are not robust to adversarial noise, and [3] seeks to explain why neural nets may generalize well despite poor robustness properties.
Starting with [5], a major focus has been on devising faster algorithms for finding adversarial examples. 
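The box-constrained objective of [21] described above can be sketched end to end. The snippet below is illustrative only: a tiny random linear softmax model (W, b, and all constants are stand-ins, not from the paper) replaces a trained neural net, and scipy's L-BFGS-B with a coarse grid over c replaces the paper's line search.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # stand-in for a trained classifier's weights
b = rng.normal(size=3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def loss(x, label):
    # Cross-entropy of the target (adversarial) label.
    return -np.log(softmax(W @ x + b)[label] + 1e-12)

x_star = rng.normal(size=4)
target = int(np.argsort(W @ x_star + b)[-2])   # second most probable label

def objective(r, c):
    # Combined objective of [21]: loss(f(x* + r), l) + c * ||r||_inf
    # (the non-smooth max is handled by finite-difference gradients here).
    return loss(x_star + r, target) + c * np.max(np.abs(r))

best = None
for c in [0.01, 0.1, 1.0, 10.0]:               # coarse stand-in for line search
    res = minimize(objective, np.zeros(4), args=(c,), method="L-BFGS-B",
                   bounds=[(-1.0, 1.0)] * 4)   # box constraints on r
    r = res.x
    flipped = int(np.argmax(W @ (x_star + r) + b)) == target
    if flipped and (best is None or np.max(np.abs(r)) < np.max(np.abs(best))):
        best = r
```

The smallest `‖r‖∞` among the runs that flip the label plays the role of the baseline's robustness estimate.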
Their idea is that adversarial examples can then be computed on-the-fly and used as training examples, analogous to data augmentation approaches typically used to train neural nets [10]. To find adversarial examples quickly, [5] chooses the adversarial perturbation r to be in the direction of the signed gradient of loss(f(x∗ + r), ℓ) with fixed magnitude. Intuitively, given only the gradient of the loss function, this choice of r is most likely to produce an adversarial example with ‖r‖∞ ≤ ε. In this direction, [16] improves upon [5] by taking multiple gradient steps, [7] extends this idea to norms beyond the L∞ norm, [6] takes the approach of [21] but fixes c, and [20] formalizes [5] as robust optimization.
A key shortcoming of these lines of work is that robustness is typically measured using the same algorithm used to find adversarial examples, in which case the resulting neural net may have overfit to adversarial examples generated using that algorithm. For example, [5] shows improved accuracy on adversarial examples generated using their own signed gradient method, but does not consider whether robustness increases for adversarial examples generated using more precise approaches such as [21]. Similarly, [7] compares accuracy to adversarial examples generated using both itself and [5] (but not [21]), and [20] only considers accuracy on adversarial examples generated using their own approach on the baseline network. The aim of our paper is to provide metrics for evaluating robustness, and to demonstrate the importance of using such impartial measures to compare robustness.
Additionally, there has been work on designing neural network architectures [6] and learning procedures [1] that improve robustness to adversarial perturbations, though they do not obtain state-of-the-art accuracy on the unperturbed test sets. 
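The signed-gradient construction of [5] described earlier in this section can be sketched in a few lines. This is a minimal illustration on a toy linear softmax model (the weights are random stand-ins, not one of the paper's networks):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))      # toy linear softmax classifier (stand-in)
b = rng.normal(size=3)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgsm(x, eps):
    """Signed-gradient perturbation of [5]: a step of fixed L-inf magnitude
    eps in the direction of the signed gradient of the loss."""
    p = softmax(W @ x + b)
    label = int(np.argmax(p))
    grad_x = W.T @ (p - np.eye(3)[label])   # d cross-entropy(label) / dx
    return x + eps * np.sign(grad_x)

x = rng.normal(size=4)
x_adv = fgsm(x, eps=0.25)
print(np.max(np.abs(x_adv - x)))   # 0.25: the perturbation has L-inf norm eps
```

Note that every coordinate moves by exactly ±eps, which is why (as the paper later observes) this method cannot by itself estimate adversarial severity.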
There has also been work using smoothness regularization related to [5] to train neural nets, focusing on improving accuracy rather than robustness [14].
Robustness has also been studied in more general contexts; [23] studies the connection between robustness and generalization, [2] establishes theoretical lower bounds on the robustness of linear and quadratic classifiers, and [4] seeks to improve robustness by promoting resilience to deleting features during training. More broadly, robustness has been identified as a desirable property of classifiers beyond prediction accuracy. Traditional metrics such as (out-of-sample) accuracy, precision, and recall help users assess prediction accuracy of trained models; our work aims to develop analogous metrics for assessing robustness.

2 Robustness Metrics

Consider a classifier f : X → L, where X ⊆ Rn is the input space and L = {1, ..., L} are the labels. We assume that training and test points x ∈ X have distribution D. We first formalize the notion of robustness at a point, and then describe two statistics to measure robustness. Our two statistics depend on a parameter ε, which captures the idea that we only care about robustness below a certain threshold—we disregard adversarial examples x whose L∞ distance to x∗ is greater than ε. We use ε = 20 in our experiments on MNIST and CIFAR-10 (on the pixel scale 0-255).

Pointwise robustness. Intuitively, f is robust at x∗ ∈ X if a “small” perturbation to x∗ does not affect the assigned label. We are interested in perturbations sufficiently small that they do not affect human classification; an established condition is ‖x − x∗‖∞ ≤ ε for some parameter ε. Formally, we say f is (x∗, ε)-robust if for every x such that ‖x − x∗‖∞ ≤ ε, f(x) = f(x∗). 
Finally, the pointwise robustness ρ(f, x∗) of f at x∗ is the minimum ε for which f fails to be (x∗, ε)-robust:

ρ(f, x∗) def= inf{ε ≥ 0 | f is not (x∗, ε)-robust}.    (1)

This definition formalizes the notion of robustness in [5, 6, 21].

Adversarial frequency. Given a parameter ε, the adversarial frequency

φ(f, ε) def= Pr_{x∗∼D}[ρ(f, x∗) ≤ ε]

measures how often f fails to be (x∗, ε)-robust. In other words, if f has high adversarial frequency, then it fails to be (x∗, ε)-robust for many inputs x∗.

Adversarial severity. Given a parameter ε, the adversarial severity

µ(f, ε) def= E_{x∗∼D}[ρ(f, x∗) | ρ(f, x∗) ≤ ε]

measures the severity with which f fails to be robust at x∗ conditioned on f not being (x∗, ε)-robust. We condition on pointwise robustness since once f is (x∗, ε)-robust at x∗, then the degree to which f is robust at x∗ does not matter. Smaller µ(f, ε) corresponds to worse adversarial severity, since f is more susceptible to adversarial examples if the distances to the nearest adversarial example are small.
The frequency and severity capture different robustness behaviors. A neural net may have high adversarial frequency but low adversarial severity, indicating that most adversarial examples are about ε distance away from the original point x∗. Conversely, a neural net may have low adversarial frequency but high adversarial severity, indicating that it is typically robust, but occasionally severely fails to be robust. Frequency is typically the more important metric, since a neural net with low adversarial frequency is robust most of the time. 
Indeed, adversarial frequency corresponds to the accuracy on adversarial examples used to measure robustness in [5, 20]. Severity can be used to differentiate between neural nets with similar adversarial frequency.

Figure 1: Neural net with a single hidden layer and ReLU activations trained on a dataset with binary labels. (a) The training data and loss surface. (b) The linear region corresponding to the red training point.

Figure 2: For MNIST, (a) an image classified 1, (b) its adversarial example classified 3, and (c) the (scaled) adversarial perturbation. For CIFAR-10, (d) an image classified as “automobile”, (e) its adversarial example classified as “truck”, and (f) the (scaled) adversarial perturbation.

Given a set of samples X ⊆ X drawn i.i.d. from D, we can estimate φ(f, ε) and µ(f, ε) using the following standard estimators, assuming we can compute ρ:

ˆφ(f, ε, X) def= |{x∗ ∈ X | ρ(f, x∗) ≤ ε}| / |X|

ˆµ(f, ε, X) def= (Σ_{x∗∈X} ρ(f, x∗) · I[ρ(f, x∗) ≤ ε]) / |{x∗ ∈ X | ρ(f, x∗) ≤ ε}|.

An approximation ˆρ(f, x∗) ≈ ρ(f, x∗) of ρ, such as the one we describe in Section 4, can be used in place of ρ. In practice, X is taken to be the test set Xtest.

3 Computing Pointwise Robustness

3.1 Overview

Consider the training points in Figure 1 (a), colored based on the ground-truth label. To classify this data, we train a two-layer neural net f(x) = arg max_ℓ {(W2 g(W1 x))_ℓ}, where the ReLU function g is applied pointwise. 
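A net of this shape can be written down directly; the weights below are random stand-ins for the trained net in Figure 1 (which, per footnote 1, has 8 hidden units):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 2))     # 8 hidden units, 2-D inputs (as in Figure 1)
W2 = rng.normal(size=(2, 8))     # binary labels

def f(x):
    # Two-layer net f(x) = argmax_l (W2 g(W1 x))_l with pointwise ReLU g.
    g = np.maximum(W1 @ x, 0.0)
    return int(np.argmax(W2 @ g))

x = rng.normal(size=2)
print(f(x))   # predicted label, 0 or 1
```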
Figure 1 (a) includes contours of the per-point loss function of this neural net.
Exhaustively searching the input space to determine the distance ρ(f, x∗) to the nearest adversarial example for input x∗ (labeled ℓ∗) is intractable. Recall that neural nets with rectified-linear (ReLU) units as activations are piecewise linear [15]. Since adversarial examples exist because of this linearity in the neural net [5], we restrict our search to the region Z(x∗) around x∗ on which the neural net is linear. This region around x∗ is defined by the activation of the ReLU function: for each i, if (W1x∗)i ≥ 0 (resp., (W1x∗)i ≤ 0), we constrain to the half-space {x | (W1x)i ≥ 0} (resp., {x | (W1x)i ≤ 0}). The intersection of these half-spaces is convex, so it admits efficient search. Figure 1 (b) shows one such convex region.¹
Additionally, x is labeled ℓ exactly when f(x)_ℓ ≥ f(x)_ℓ′ for each ℓ′ ≠ ℓ. These constraints are linear since f is linear on Z(x∗). Therefore, we can find the distance to the nearest input with label ℓ ≠ ℓ∗ by minimizing ‖x − x∗‖∞ on Z(x∗). Finally, we can perform this search for each label ℓ ≠ ℓ∗, though for efficiency we take ℓ to be the label assigned the second-highest score by f. Figure 1 (b) shows the adversarial example found by our algorithm in our running example. 
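The search just described can be written as a small linear program. Below is a minimal sketch for a one-hidden-layer ReLU net using scipy, under the assumption that staying in Z(x∗) and beating every other label are all encoded as rows of A_ub z ≤ b_ub over the variables z = [x, ε]; the paper's implementation additionally handles deeper nets, convolutions, and the iterative constraint solving of §4.

```python
import numpy as np
from scipy.optimize import linprog

def pointwise_robustness(W1, b1, W2, b2, x_star, target):
    """Minimize eps s.t. ||x - x*||_inf <= eps, x stays in the linear
    region Z(x*) of f(x) = argmax_l (W2 relu(W1 x + b1) + b2)_l, and the
    target label scores at least as high as every other label."""
    n = x_star.size
    act = W1 @ x_star + b1 >= 0                # ReLU activation pattern at x*
    M = W2 @ (W1 * act[:, None])               # on Z(x*), f(x) = M x + v
    v = W2 @ (b1 * act) + b2

    A, ub = [], []                             # rows of A_ub z <= b_ub
    for i in range(n):                         # |x_i - x*_i| <= eps
        row = np.zeros(n + 1); row[i] = 1.0; row[-1] = -1.0
        A.append(row); ub.append(x_star[i])
        row = np.zeros(n + 1); row[i] = -1.0; row[-1] = -1.0
        A.append(row); ub.append(-x_star[i])
    for j in range(W1.shape[0]):               # stay inside the linear region:
        sgn = 1.0 if act[j] else -1.0          # active ReLUs stay >= 0, others <= 0
        row = np.zeros(n + 1); row[:n] = -sgn * W1[j]
        A.append(row); ub.append(sgn * b1[j])
    for l in range(W2.shape[0]):               # target label beats every other
        if l != target:
            row = np.zeros(n + 1); row[:n] = M[l] - M[target]
            A.append(row); ub.append(v[target] - v[l])

    c = np.zeros(n + 1); c[-1] = 1.0           # minimize eps
    res = linprog(c, A_ub=np.array(A), b_ub=np.array(ub),
                  bounds=[(None, None)] * n + [(0, None)])
    return (res.x[-1], res.x[:n]) if res.success else (None, None)

# Sanity check: f(x) = relu(x + 1); for x* = (0.5, 0), the nearest point where
# label 1 wins is x = (0.25, 0.25), at L-inf distance 0.25.
eps, x_adv = pointwise_robustness(np.eye(2), np.ones(2), np.eye(2),
                                  np.zeros(2), np.array([0.5, 0.0]), target=1)
print(eps)   # 0.25
```

Running this per test point (with the target fixed to the second-highest-scoring label) yields the approximation ˆρ(f, x∗) developed in §4.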
In Figure 1, note that the direction of the nearest adversarial example is not necessarily aligned with the signed gradient of the loss function, as observed by others [7].

¹ Our neural net has 8 hidden units, but for this x∗, 6 of the half-spaces entirely contain the convex region.

3.2 Formulation as Optimization

We compute ρ(f, x∗) by expressing (1) as constraints C, which consist of:

• Linear relations; specifically, inequalities C ≡ (wT x + b ≥ 0) and equalities C ≡ (wT x + b = 0), where x ∈ Rm (for some m) are variables and w ∈ Rm, b ∈ R are constants.
• Conjunctions C ≡ C1 ∧ C2, where C1 and C2 are themselves constraints. Both constraints must be satisfied for the conjunction to be satisfied.
• Disjunctions C ≡ C1 ∨ C2, where C1 and C2 are themselves constraints. One of the constraints must be satisfied for the disjunction to be satisfied.

The feasible set F(C) of C is the set of x ∈ Rm that satisfy C; C is satisfiable if F(C) is nonempty.
In the next section, we show that the condition f(x) = ℓ can be expressed as constraints Cf(x, ℓ); i.e., f(x) = ℓ if and only if Cf(x, ℓ) is satisfiable. 
Then, ρ(f, x∗) can be computed as follows:

ρ(f, x∗) = min_{ℓ≠ℓ∗} ρ(f, x∗, ℓ)    (2)

ρ(f, x∗, ℓ) def= inf{ε ≥ 0 | Cf(x, ℓ) ∧ ‖x − x∗‖∞ ≤ ε satisfiable}.    (3)

The optimization problem is typically intractable; we describe a tractable approximation in §4.

3.3 Encoding a Neural Network

We show how to encode the constraint f(x) = ℓ as constraints Cf(x, ℓ) when f is a neural net. We assume f has form f(x) = arg max_{ℓ∈L} {[f(k)(f(k−1)(...(f(1)(x))...))]_ℓ}, where the ith layer of the network is a function f(i) : R^{n_{i−1}} → R^{n_i}, with n0 = n and nk = |L|. We describe the encoding of fully-connected and ReLU layers; convolutional layers are encoded similarly to fully-connected layers and max-pooling layers are encoded similarly to ReLU layers. We introduce the variables x(0), ..., x(k) into our constraints, with the interpretation that x(i) represents the output vector of layer i of the network; i.e., x(i) = f(i)(x(i−1)). The constraint Cin(x) ≡ (x(0) = x) encodes the input layer. For each layer f(i), we encode the computation of x(i) given x(i−1) as a constraint Ci.

Fully-connected layer. In this case, x(i) = f(i)(x(i−1)) = W(i)x(i−1) + b(i), which we encode using the constraints Ci ≡ ⋀_{j=1}^{n_i} (x(i)_j = W(i)_j x(i−1) + b(i)_j), where W(i)_j is the j-th row of W(i).

ReLU layer. In this case, x(i)_j = max{x(i−1)_j, 0} (for each 1 ≤ j ≤ ni), which we encode using the constraints Ci ≡ ⋀_{j=1}^{n_i} Cij, where Cij = (x(i−1)_j < 0 ∧ x(i)_j = 0) ∨ (x(i−1)_j ≥ 0 ∧ x(i)_j = x(i−1)_j).

Finally, the constraints Cout(ℓ) ≡ ⋀_{ℓ′≠ℓ} (x(k)_ℓ ≥ x(k)_ℓ′) ensure that the output label is ℓ. Together, the constraints Cf(x, ℓ) ≡ Cin(x) ∧ (⋀_{i=1}^{k} Ci) ∧ Cout(ℓ) encode the computation of f:

Theorem 1 For any x ∈ X and ℓ ∈ L, we have f(x) = ℓ if and only if Cf(x, ℓ) is satisfiable.

4 Approximate Computation of Pointwise Robustness

Convex restriction. The challenge to solving (3) is the non-convexity of the feasible set of Cf(x, ℓ). To recover tractability, we approximate (3) by constraining the feasible set to x ∈ Z(x∗), where Z(x∗) ⊆ X is carefully chosen so that the constraints ˆCf(x, ℓ) ≡ Cf(x, ℓ) ∧ (x ∈ Z(x∗)) have a convex feasible set. We call ˆCf(x, ℓ) the convex restriction of Cf(x, ℓ). In some sense, convex restriction is the opposite of convex relaxation. 
Then, we can approximately compute robustness:

ˆρ(f, x∗, ℓ) def= inf{ε ≥ 0 | ˆCf(x, ℓ) ∧ ‖x − x∗‖∞ ≤ ε satisfiable}.    (4)

The objective is optimized over x ∈ Z(x∗), which approximates the optimum over x ∈ X.

Choice of Z(x∗). We construct Z(x∗) as the feasible set of constraints D(x∗); i.e., Z(x∗) = F(D(x∗)). We now describe how to construct D(x∗).
Note that F(wT x + b = 0) and F(wT x + b ≥ 0) are convex sets. Furthermore, if F(C1) and F(C2) are convex, then so is their conjunction F(C1 ∧ C2). However, their disjunction F(C1 ∨ C2) may not be convex; for example, F((x ≥ 0) ∨ (y ≥ 0)). The potential non-convexity of disjunctions makes (3) difficult to optimize.
We can eliminate disjunction operations by choosing one of the two disjuncts to hold. For example, note that for C1 ≡ C2 ∨ C3, we have both F(C2) ⊆ F(C1) and F(C3) ⊆ F(C1). In other words, if we replace C1 with either C2 or C3, the feasible set of the resulting constraints can only become smaller. Taking D(x∗) ≡ C2 (resp., D(x∗) ≡ C3) effectively replaces C1 with C2 (resp., C3).
To restrict (3), for every disjunction C1 ≡ C2 ∨ C3, we systematically choose either C2 or C3 to replace the constraint C1. In particular, we choose C2 if x∗ satisfies C2 (i.e., x∗ ∈ F(C2)) and choose C3 otherwise. In our constraints, disjunctions are always mutually exclusive, so x∗ never simultaneously satisfies both C2 and C3. We then take D(x∗) to be the conjunction of all our choices. The resulting constraints ˆCf(x, ℓ) contain only conjunctions of linear relations, so the feasible set is convex. 
In fact, it can be expressed as a linear program (LP) and can be solved using any standard LP solver.
For example, consider a rectified linear layer (as before, max-pooling layers are similar). The original constraint added for unit j of rectified linear layer f(i) is

(x(i−1)_j ≤ 0 ∧ x(i)_j = 0) ∨ (x(i−1)_j ≥ 0 ∧ x(i)_j = x(i−1)_j).

To restrict this constraint, we evaluate the neural network on the seed input x∗ and look at the input to f(i), which equals x∗(i−1) = f(i−1)(...(f(1)(x∗))...). Then, for each 1 ≤ j ≤ ni:

D(x∗) ← D(x∗) ∧ (x(i−1)_j ≤ 0 ∧ x(i)_j = 0)            if (x∗(i−1))_j ≤ 0
D(x∗) ← D(x∗) ∧ (x(i−1)_j ≥ 0 ∧ x(i)_j = x(i−1)_j)     if (x∗(i−1))_j > 0.

Iterative constraint solving. We implement an optimization for solving LPs by lazily adding constraints as necessary. Given all constraints C, we start off solving the LP with the subset of equality constraints ˆC ⊆ C, which yields a (possibly infeasible) solution z. If z is feasible, then z is also an optimal solution to the original LP; otherwise, we add to ˆC the constraints in C that are not satisfied by z and repeat the process. This process always yields the correct solution, since in the worst case ˆC becomes equal to C. In practice, this optimization is an order of magnitude faster than directly solving the LP with constraints C.

Single target label. 
For simplicity, rather than minimize over ρ(f, x∗, ℓ) for each ℓ ≠ ℓ∗, we fix ℓ to be the second most probable label ˜f(x∗); i.e.,

ˆρ(f, x∗) def= inf{ε ≥ 0 | ˆCf(x, ˜f(x∗)) ∧ ‖x − x∗‖∞ ≤ ε satisfiable}.    (5)

Approximate robustness statistics. We can use ˆρ in our statistics ˆφ and ˆµ defined in §2. Because ˆρ is an overapproximation of ρ (i.e., ˆρ(f, x∗) ≥ ρ(f, x∗)), the estimates ˆφ and ˆµ may not be unbiased (in particular, ˆφ(f, ε) ≤ φ(f, ε)). In §6, we show empirically that our algorithm produces substantially less biased estimates than existing algorithms for finding adversarial examples.

5 Improving Neural Net Robustness

Finding adversarial examples. We can use our algorithm for estimating ˆρ(f, x∗) to compute adversarial examples. Given x∗, the value of x computed by the optimization procedure used to solve (5) is an adversarial example for x∗ with ‖x − x∗‖∞ = ˆρ(f, x∗).

Fine-tuning. We use fine-tuning to reduce a neural net's susceptibility to adversarial examples. First, we use an algorithm A to compute adversarial examples for each x∗ ∈ Xtrain and add them to the training set. Then, we continue training the network on the augmented training set at a reduced training rate. We can repeat this process for multiple rounds (denoted T); at each round, we only consider x∗ in the original training set (rather than the augmented training set).

Neural Net          Accuracy (%)   Adversarial Frequency (%)   Adversarial Severity (pixels)
                                   Baseline     Our Algo.      Baseline     Our Algo.
LeNet (Original)    99.08          1.32         7.15           11.9         12.4
Baseline (T = 1)    99.14          1.02         6.89           11.0         12.3
Baseline (T = 2)    99.15          0.99         6.97           10.9         12.4
Our Algo. (T = 1)   99.17          1.18         5.40           12.8         12.2
Our Algo. (T = 2)   99.23          1.12         5.03           12.2         11.7

Table 1: Evaluation of fine-tuned networks. Our method discovers more adversarial examples than the baseline [21] for each neural net, hence producing better estimates. LeNet fine-tuned for T = 1, 2 rounds (bottom four rows) exhibits a notable increase in robustness compared to the original LeNet.

Figure 3: The cumulative number of test points x∗ such that ρ(f, x∗) ≤ ε as a function of ε. In (a) and (b), the neural nets are the original LeNet (black), LeNet fine-tuned with the baseline and T = 2 (red), and LeNet fine-tuned with our algorithm and T = 2 (blue); in (a), ˆρ is measured using the baseline, and in (b), ˆρ is measured using our algorithm. In (c), the neural nets are the original NiN (black) and NiN fine-tuned with our algorithm, and ˆρ is estimated using our algorithm.

Rounding errors. MNIST images are represented as integers, so we must round the perturbation to obtain an image, which oftentimes results in non-adversarial examples. When fine-tuning, we add a constraint x(k)_ℓ ≥ x(k)_ℓ′ + α for all ℓ′ ≠ ℓ, which eliminates this problem by ensuring that the neural net has high confidence on its adversarial examples. In our experiments, we fix α = 3.0.
Similarly, we modified the L-BFGS-B baseline so that during the line search over c, we only count x∗ + r as adversarial if x(k)_ℓ ≥ x(k)_ℓ′ + α for all ℓ′ ≠ ℓ. We choose α = 0.15, since larger α causes the baseline to find significantly fewer adversarial examples, and smaller α results in a smaller improvement in robustness. With this choice, rounding errors occur on 8.3% of the adversarial examples we find on the MNIST training set.

6 Experiments

6.1 Adversarial Images for CIFAR-10 and MNIST

We find adversarial examples for the neural net LeNet [12] (modified to use ReLUs instead of sigmoids) trained to classify MNIST [11], and for the network-in-network (NiN) neural net [13] trained to classify CIFAR-10 [9]. Both neural nets are trained using Caffe [8]. For MNIST, Figure 2 (b) shows an adversarial example (labeled 1) we find for the image in Figure 2 (a) labeled 3, and Figure 2 (c) shows the corresponding adversarial perturbation scaled so the difference is visible (it has L∞ norm 17). For CIFAR-10, Figure 2 (e) shows an adversarial example labeled “truck” for the image in Figure 2 (d) labeled “automobile”, and Figure 2 (f) shows the corresponding scaled adversarial perturbation (which has L∞ norm 3).

6.2 Comparison to Other Algorithms on MNIST

We compare our algorithm for estimating ρ to the baseline L-BFGS-B algorithm proposed by [21]. We use the tool provided by [22] to compute this baseline. For both algorithms, we use adversarial target label ℓ = ˜f(x∗). We use LeNet in our comparisons, since we find that it is substantially more robust than the neural nets considered in most previous work (including [21]). We also use versions of LeNet fine-tuned using both our algorithm and the baseline with T = 1, 2. 
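The fine-tuning procedure of Section 5, used to produce the fine-tuned nets in these experiments, can be sketched on synthetic data. Everything below is a stand-in: a linear softmax model replaces LeNet, a signed-gradient adversary replaces the LP algorithm A, and all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 2-D, 2-class data (class 1 shifted right); stand-in for MNIST.
X = rng.normal(size=(200, 2)) + np.array([2.0, 0.0]) * (np.arange(200) % 2)[:, None]
y = np.arange(200) % 2
W, b = np.zeros((2, 2)), np.zeros(2)

def softmax(Z):
    E = np.exp(Z - Z.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

def grad_logits(X, y):
    P = softmax(X @ W.T + b)
    P[np.arange(len(y)), y] -= 1.0        # d cross-entropy / d logits
    return P

def train(X, y, lr, steps=200):
    global W, b
    for _ in range(steps):
        G = grad_logits(X, y)
        W -= lr * (G.T @ X) / len(y)
        b -= lr * G.mean(axis=0)

def adversary(X, y, eps):
    # Algorithm A: signed-gradient perturbations (stand-in for the LP method).
    return X + eps * np.sign(grad_logits(X, y) @ W)

def adv_accuracy(eps):
    X_adv = adversary(X, y, eps)
    return np.mean(np.argmax(X_adv @ W.T + b, axis=1) == y)

train(X, y, lr=0.5)
before = adv_accuracy(eps=0.5)
for t in range(2):                        # T = 2 rounds of fine-tuning
    X_adv = adversary(X, y, eps=0.5)      # adversarial examples for the
    X_aug = np.vstack([X, X_adv])         # original training points only
    y_aug = np.concatenate([y, y])
    train(X_aug, y_aug, lr=0.05)          # reduced training rate
after = adv_accuracy(eps=0.5)
print(before, after)   # accuracy on adversarial examples, before vs. after
```

As in the paper's protocol, each round recomputes adversarial examples for the original training set only, and accuracy on adversarial examples plays the role of (one minus) adversarial frequency.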
To focus on the most severe adversarial examples, we use a stricter threshold for robustness of ε = 20 pixels.
We performed a similar comparison to the signed gradient algorithm proposed by [5] (with the signed gradient multiplied by ε = 20 pixels). For LeNet, this algorithm found only one adversarial example on the MNIST test set (out of 10,000) and four adversarial examples on the MNIST training set (out of 60,000), so we omit results.²

Results. In Figure 3, we plot the number of test points x∗ for which ˆρ(f, x∗) ≤ ε, as a function of ε, where ˆρ(f, x∗) is estimated using (a) the baseline and (b) our algorithm. These plots compare the robustness of each neural network as a function of ε. In Table 1, we show results evaluating the robustness of each neural net, including the adversarial frequency and the adversarial severity. The running times of our algorithm and the baseline algorithm are very similar; in both cases, computing ˆρ(f, x∗) for a single input x∗ takes about 1.5 seconds. For comparison, without our iterative constraint solving optimization, our algorithm took more than two minutes to run.

Discussion. For every neural net, our algorithm produces substantially higher estimates of the adversarial frequency. In other words, our algorithm estimates ˆρ(f, x∗) with substantially better accuracy compared to the baseline.
According to the baseline metrics shown in Figure 3 (a), the baseline neural net (red) is similarly robust to our neural net (blue), and both are more robust than the original LeNet (black). 
Our neural net is actually more robust than the baseline neural net for smaller values of ε, whereas the baseline neural net eventually becomes slightly more robust (i.e., where the red line dips below the blue line). This behavior is captured by our robustness statistics: the baseline neural net has lower adversarial frequency (so it has fewer adversarial examples with ρ̂(f, x∗) ≤ ε) but also worse adversarial severity (since its adversarial examples are on average closer to the original points x∗).
However, according to our metrics shown in Figure 3 (b), our neural net is substantially more robust than the baseline neural net. Again, this is reflected in our statistics: our neural net has substantially lower adversarial frequency than the baseline neural net, while maintaining similar adversarial severity. Taken together, our results suggest that the baseline neural net is overfitting to the adversarial examples found by the baseline algorithm. In particular, the baseline neural net does not learn the adversarial examples found by our algorithm. On the other hand, our neural net learns both the adversarial examples found by our algorithm and those found by the baseline algorithm.

6.3 Scaling to CIFAR-10

We also implemented our approach for the CIFAR-10 network-in-network (NiN) neural net [13], which obtains 91.31% test set accuracy. Computing ρ̂(f, x∗) for a single input on NiN takes about 10–15 seconds on an 8-core CPU. Unlike LeNet, NiN suffers severely from adversarial examples: we measure a 61.5% adversarial frequency and an adversarial severity of 2.82 pixels. Our neural net (NiN fine-tuned using our algorithm and T = 1) has test set accuracy 90.35%, which is similar to the test set accuracy of the original NiN. As can be seen in Figure 3 (c), our neural net improves slightly in terms of robustness, especially for smaller ε.
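The adversarial frequency and severity statistics used throughout these experiments can be computed directly from the per-input robustness estimates ρ̂(f, x∗). A minimal sketch, under the assumption that frequency is the fraction of inputs with ρ̂(f, x∗) ≤ ε and severity is the mean ρ̂ among those inputs (the ρ̂ values below are illustrative, not measured):

```python
def adversarial_stats(rho_hat, eps):
    """Frequency: fraction of inputs with an adversarial example within eps.
    Severity: mean distance to those adversarial examples (None if there are none)."""
    adv = [r for r in rho_hat if r <= eps]
    frequency = len(adv) / len(rho_hat)
    severity = sum(adv) / len(adv) if adv else None
    return frequency, severity

rho_hat = [3.0, 25.0, 7.5, 50.0, 12.0]   # illustrative per-input estimates (pixels)
freq, sev = adversarial_stats(rho_hat, eps=20)
print(freq, sev)   # -> 0.6 7.5
```

Note that a net can trade one statistic against the other: discarding only its easiest adversarial examples lowers its frequency while worsening (lowering) its severity.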
As before, these improvements are reflected in our metrics: the adversarial frequency of our neural net drops slightly to 59.6%, and the adversarial severity improves to 3.88. Nevertheless, unlike LeNet, our fine-tuned version of NiN remains very prone to adversarial examples. In this case, we believe that new techniques are required to significantly improve robustness.

7 Conclusion

We have shown how to formulate, efficiently estimate, and improve the robustness of neural nets using an encoding of the robustness property as a constraint system. Future work includes devising better approaches to improving robustness on large neural nets such as NiN and studying properties beyond robustness.

²Furthermore, the signed gradient algorithm cannot be used to estimate adversarial severity, since all the adversarial examples it finds have L∞ norm ε.

References

[1] K. Chalupka, P. Perona, and F. Eberhardt. Visual causal feature learning. 2015.

[2] A. Fawzi, O. Fawzi, and P. Frossard. Analysis of classifiers' robustness to adversarial perturbations. ArXiv e-prints, 2015.

[3] Jiashi Feng, Tom Zahavy, Bingyi Kang, Huan Xu, and Shie Mannor. Ensemble robustness of deep learning algorithms. arXiv preprint arXiv:1602.02389, 2016.

[4] Amir Globerson and Sam Roweis. Nightmare at test time: robust learning by feature deletion. In Proceedings of the 23rd International Conference on Machine Learning, pages 353–360. ACM, 2006.

[5] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 2015.

[6] S. Gu and L. Rigazio. Towards deep neural network architectures robust to adversarial examples. 2014.

[7] Ruitong Huang, Bing Xu, Dale Schuurmans, and Csaba Szepesvári.
Learning with a strong adversary. CoRR, abs/1511.03034, 2015.

[8] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.

[9] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images, 2009.

[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. 2012.

[11] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In S. Haykin and B. Kosko, editors, Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.

[13] Min Lin, Qiang Chen, and Shuicheng Yan. Network In Network. CoRR, abs/1312.4400, 2013.

[14] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. Distributional smoothing with virtual adversarial training. stat, 1050:25, 2015.

[15] Guido F. Montúfar, Razvan Pascanu, KyungHyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8–13 2014, Montreal, Quebec, Canada, pages 2924–2932, 2014.

[16] Seyed Mohsen Moosavi Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: a simple and accurate method to fool deep neural networks. In Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), number EPFL-CONF-218057, 2016.

[17] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images.
In Computer Vision and Pattern Recognition (CVPR), 2015 IEEE Conference on, pages 427–436. IEEE, 2015.

[18] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697, 2016.

[19] Sara Sabour, Yanshuai Cao, Fartash Faghri, and David J. Fleet. Adversarial manipulation of deep representations. arXiv preprint arXiv:1511.05122, 2015.

[20] Uri Shaham, Yutaro Yamada, and Sahand Negahban. Understanding adversarial training: Increasing local stability of neural nets through robust optimization. arXiv preprint arXiv:1511.05432, 2015.

[21] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. 2014.

[22] Pedro Tabacof and Eduardo Valle. Exploring the space of adversarial images. CoRR, abs/1510.05328, 2015.

[23] Huan Xu and Shie Mannor. Robustness and generalization. Machine Learning, 86(3):391–423, 2012.