{"title": "Convergence of Adversarial Training in Overparametrized Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 13029, "page_last": 13040, "abstract": "Neural networks are vulnerable to adversarial examples, i.e. inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network. Adversarial training \\cite{madry2017towards}, a heuristic form of robust optimization that alternates between minimization and maximization steps, has proven to be among the most successful methods to train networks to be robust against a pre-defined family of perturbations. This paper provides a partial answer to the success of adversarial training, by showing that it converges to a network where the surrogate loss with respect to the attack algorithm is within $\epsilon$ of the optimal robust loss. Then we show that the optimal robust loss is also close to zero, hence adversarial training finds a robust classifier. The analysis technique leverages recent work on the analysis of neural networks via the Neural Tangent Kernel (NTK), combined with motivation from online learning when the maximization is solved by a heuristic, and the expressiveness of the NTK kernel in the $\ell_\infty$-norm. In addition, we also prove that robust interpolation requires more model capacity, supporting the evidence that adversarial training requires wider networks.", "full_text": "Convergence of Adversarial Training in Overparametrized Neural Networks\n\nRuiqi Gao1,* Tianle Cai1,* Haochuan Li2 Liwei Wang3 Cho-Jui Hsieh4 Jason D.
Lee5\n\n1School of Mathematical Sciences, Peking University\n\n2Department of EECS, Massachusetts Institute of Technology\n\n3Key Laboratory of Machine Perception, MOE, School of EECS, Peking University\n\n4Department of Computer Science, University of California, Los Angeles\n\n5Department of Electrical Engineering, Princeton University\n\nAbstract\n\nNeural networks are vulnerable to adversarial examples, i.e. inputs that are imperceptibly perturbed from natural data and yet incorrectly classified by the network. Adversarial training [31], a heuristic form of robust optimization that alternates between minimization and maximization steps, has proven to be among the most successful methods to train networks to be robust against a pre-defined family of perturbations. This paper provides a partial answer to the success of adversarial training, by showing that it converges to a network where the surrogate loss with respect to the attack algorithm is within $\epsilon$ of the optimal robust loss. Then we show that the optimal robust loss is also close to zero, hence adversarial training finds a robust classifier. The analysis technique leverages recent work on the analysis of neural networks via the Neural Tangent Kernel (NTK), combined with motivation from online learning when the maximization is solved by a heuristic, and the expressiveness of the NTK kernel in the $\ell_\infty$-norm. In addition, we also prove that robust interpolation requires more model capacity, supporting the evidence that adversarial training requires wider networks.\n\n1 Introduction\n\nRecent studies have demonstrated that neural network models, despite achieving human-level performance on many important tasks, are not robust to adversarial examples: a small and human-imperceptible input perturbation can easily change the prediction label [44, 22].
This phenomenon raises security concerns when deploying neural network models in real-world systems [20]. In the past few years, many defense algorithms have been developed [23, 43, 30, 28, 39] to improve the network's robustness, but most of them are still vulnerable under stronger attacks, as reported in [3]. Among current defense methods, adversarial training [31] has become one of the most successful methods to train robust neural networks.\n\nTo obtain a robust network, we need to consider the \u201crobust loss\u201d instead of a regular loss. The robust loss is defined as the maximal loss within a neighborhood around the input of each sample, and minimizing the robust loss under the empirical distribution leads to a min-max optimization problem. Adversarial training [31] is a way to minimize the robust loss. At each iteration, it (approximately) solves the inner maximization problem by an attack algorithm A to get an adversarial sample, and then runs a (stochastic) gradient-descent update to minimize the loss on the adversarial samples. Although adversarial training has been widely used in practice and hugely improves the robustness of neural networks in many applications, its convergence properties are still unknown. It is unclear\n\n*Joint first author.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nwhether a network with small robust error exists and whether adversarial training is able to converge to a solution with minimal adversarial training loss.\n\nIn this paper, we study the convergence of adversarial training algorithms and try to answer the above questions for over-parameterized neural networks. We consider width-m neural networks, both in the setting of deep networks with H layers and in the setting of two-layer networks for some additional analysis.
Our contributions are summarized below.\n\n\u2022 For an H-layer deep network with ReLU activations and an arbitrary attack algorithm, when the width m is large enough, we show that projected gradient descent converges to a network where the surrogate loss with respect to the attack A is within $\epsilon$ of the optimal robust loss (Theorem 4.1). The required width is polynomial in the depth and the input dimension.\n\n\u2022 For a two-layer network with smooth activations, we provide a proof of convergence where the projection step is not required in the algorithm (Theorem 5.1).\n\n\u2022 We then consider the expressivity of neural networks w.r.t. the robust loss (or robust interpolation). We show that when the width m is sufficiently large, the neural network can achieve robust loss at most $\epsilon$; see Theorems 5.2 and C.1 for the precise statements. By combining this expressivity result with the previous bound of the loss over the optimal robust loss, we show that adversarial training finds networks of small robust training loss (Corollary 5.1 and Corollary C.1).\n\n\u2022 We show that the VC-Dimension of any model class which can robustly interpolate n samples is lower bounded by $\Omega(nd)$, where d is the dimension. In contrast, there are neural net architectures that can interpolate n samples with only O(n) parameters and VC-Dimension at most $O(n \log n)$. Therefore, the capacity required for robust learning is higher.\n\n2 Related Work\n\nAttack and Defense Adversarial examples are inputs that are slightly perturbed from a natural sample and yet incorrectly classified by the model. An adversarial example can be generated by maximizing the loss function within an $\epsilon$-ball around a natural sample. Thus, generating adversarial examples can be viewed as solving a constrained optimization problem and can be (approximately) solved by a projected gradient descent (PGD) method [31].
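This constrained maximization can be sketched in a few lines. Below is a minimal numpy illustration of projected gradient ascent on an $\ell_2$-ball, not the paper's exact attack: the linear model, squared loss, and the radius/step-size values are hypothetical stand-ins for illustration.

```python
import numpy as np

def pgd_attack(x, y, grad_x, delta=0.1, eta=0.05, steps=20):
    """Projected gradient *ascent* on the loss within the l2-ball of radius delta around x.

    grad_x(x_adv, y) should return the gradient of the loss w.r.t. the input.
    """
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + eta * grad_x(x_adv, y)   # ascend the loss
        diff = x_adv - x
        norm = np.linalg.norm(diff)
        if norm > delta:                          # project back onto the ball
            x_adv = x + diff * (delta / norm)
    return x_adv

# Hypothetical linear model with squared loss, for illustration only.
w = np.array([2.0, -1.0])

def grad_x(x, y):
    # d/dx of 0.5 * (w @ x - y)^2
    return (w @ x - y) * w

x0 = np.array([0.5, 0.5])
x_adv = pgd_attack(x0, y=1.0, grad_x=grad_x)
```

By construction, the returned point stays inside the ball while the loss can only grow relative to the clean input.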
Some other techniques have also been proposed in the literature, including L-BFGS [44], FGSM [22], iterative FGSM [26] and the C&W attack [12]; they differ from each other in their distance measures, loss functions, or optimization algorithms. There are also studies on adversarial attacks with limited information about the target model. For instance, [13, 24, 8] considered the black-box setting where the model is hidden but the attacker can make queries and observe the corresponding outputs of the model.\n\nImproving the robustness of neural networks against adversarial attacks, also known as defense, has been recognized as an important and unsolved problem in machine learning. Various kinds of defense methods have been proposed [23, 43, 30, 28, 39], but many of them are based on obfuscated gradients, which do not really improve robustness under stronger attacks [3]. As an exception, [3] reported that the adversarial training method developed in [31] is the only defense that works even under carefully designed attacks.\n\nAdversarial Training Adversarial training is one of the first defense ideas, proposed in earlier papers [22]. The main idea is to add adversarial examples into the training set to improve robustness. However, earlier work usually added adversarial examples only once or a few times during the training phase. Recently, [31] showed that adversarial training can be viewed as solving a min-max optimization problem where the training algorithm aims to minimize the robust loss, defined as the maximal loss within a certain $\epsilon$-ball around each training sample. Based on this formulation, a clean adversarial training procedure based on PGD attacks has been developed and achieved state-of-the-art results even under strong attacks. This also motivates some recent research on gaining theoretical understanding of robust error [9, 40].
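The min-max procedure described above alternates an (approximate) inner maximization with an outer gradient step on the adversarial points. A toy numpy sketch under simplifying assumptions (a linear "network", logistic loss, an $\ell_2$-ball attack); every name and hyperparameter here is illustrative, not the algorithm of [31] verbatim.

```python
import numpy as np

rng = np.random.default_rng(0)

def logistic_loss(score, y):
    return np.log1p(np.exp(-y * score))

def attack(w, x, y, delta=0.1, eta=0.05, steps=10):
    """Inner maximization: gradient ascent on the loss, projected onto the l2-ball around x."""
    x_adv = x.copy()
    for _ in range(steps):
        s = -y / (1.0 + np.exp(y * (w @ x_adv)))   # d loss / d score
        x_adv = x_adv + eta * s * w                # d loss / d x = s * w
        diff = x_adv - x
        n = np.linalg.norm(diff)
        if n > delta:
            x_adv = x + diff * (delta / n)         # projection step
    return x_adv

def adversarial_training(X, Y, alpha=0.1, epochs=60):
    """Outer minimization: gradient descent on the loss evaluated at the adversarial points."""
    w = 0.1 * rng.normal(size=X.shape[1])
    for _ in range(epochs):
        g = np.zeros_like(w)
        for x, y in zip(X, Y):
            x_adv = attack(w, x, y)                # approximate inner argmax
            s = -y / (1.0 + np.exp(y * (w @ x_adv)))
            g += s * x_adv                         # d loss / d w at x_adv
        w -= alpha * g / len(X)
    return w
```

Training on two antipodal points and re-attacking the result shows the attacked points are still classified correctly, which is the qualitative behavior the convergence analysis later formalizes.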
Also, adversarial training suffers from slow training, since it runs several steps of attack within one update, and several recent works try to resolve this issue [41, 53]. From the theoretical perspective, a recent work [46] quantitatively evaluates the convergence quality of the adversarial examples found in the inner maximization and thereby ensures robustness. [51] considers generalization upper and lower bounds for robust generalization. [29] improves robust generalization by data augmentation with a GAN. [21] reduces the optimization of a min-max problem to an online learning setting and uses the results to analyze the convergence of GANs. In this paper, our analysis of adversarial training is quite general and is not restricted to any specific kind of attack algorithm.\n\nGlobal Convergence of Gradient Descent Recent works on the over-parametrization of neural networks prove that when the width greatly exceeds the sample size, gradient descent converges to a global minimizer from random initialization [27, 18, 19, 1, 55]. The key idea in the earlier literature is to show that the Jacobian w.r.t. the parameters has its minimum singular value lower bounded, and thus there is a global minimum near every random initialization, with high probability. However, for the robust loss, the maximization cannot be evaluated and the Jacobian is not necessarily full rank. For the surrogate loss, the heuristic attack algorithm may not even be continuous, so the same arguments cannot be utilized.\n\nCertified Defense and Robustness Verification In contrast to attack algorithms, neural network verification methods [48, 47, 54, 42, 14, 38] try to find upper bounds of the robust loss and provide certified robustness measurements. Equipped with these verification methods for computing upper bounds of the robust error, one can then apply adversarial training to get a network with certified robustness.
Our analysis in Section 4 can also be extended to certified adversarial training.\n\n3 Preliminaries\n\n3.1 Notations\n\nLet $[n] = \{1, 2, \ldots, n\}$. We use $\mathcal{N}(0, I)$ to denote the standard Gaussian distribution. For a vector $v$, we use $\|v\|_2$ to denote the Euclidean norm. For a matrix $A$, we use $\|A\|_F$ to denote the Frobenius norm and $\|A\|_2$ to denote the spectral norm. We use $\langle \cdot, \cdot \rangle$ to denote the standard Euclidean inner product between two vectors, matrices, or tensors. We let $O(\cdot)$, $\Theta(\cdot)$ and $\Omega(\cdot)$ denote standard Big-O, Big-Theta and Big-Omega notations that suppress multiplicative constants.\n\n3.2 Deep Neural Networks\n\nHere we give the definition of our deep fully-connected neural networks. For the convenience of the proof, we use the same architecture as defined in [1].2 Formally, we consider a neural network of the following form. Let $x \in \mathbb{R}^d$ be the input. The fully-connected neural network is defined as follows: $A \in \mathbb{R}^{m \times d}$ is the first weight matrix, $W^{(h)} \in \mathbb{R}^{m \times m}$ is the weight matrix at the h-th layer for $h \in [H]$, $a \in \mathbb{R}^{m \times 1}$ is the output layer, and $\sigma(\cdot)$ is the ReLU activation function.3 The parameters are $W = (\mathrm{vec}\{A\}^\top, \mathrm{vec}\{W^{(1)}\}^\top, \cdots, \mathrm{vec}\{W^{(H)}\}^\top, a^\top)^\top$. However, without loss of generality, during training we will fix $A$ and $a$ once initialized, so later we will refer to $W$ as $W = (\mathrm{vec}\{W^{(1)}\}^\top, \cdots, \mathrm{vec}\{W^{(H)}\}^\top)^\top$.
The prediction function is defined recursively:\n\n$x^{(0)} = Ax$, $\quad \tilde{x}^{(h)} = W^{(h)} x^{(h-1)}$ for $h \in [H]$, $\quad x^{(h)} = \sigma(\tilde{x}^{(h)})$ for $h \in [H]$, $\quad f(W, x) = a^\top x^{(H)}$,  (1)\n\nwhere $\tilde{x}^{(h)}$ and $x^{(h)}$ are the feature vectors before and after the activation function, respectively. Sometimes we also denote $\tilde{x}^{(0)} = x^{(0)}$.\n\n2We only consider the setting where the network output is a scalar. However, it is not hard to extend our results to the setting of vector outputs.\n\n3We assume intermediate layers are square matrices of size m for simplicity. It is not difficult to generalize our analysis to rectangular weight matrices.\n\nWe use the following initialization scheme: each entry of $A$ and $W^{(h)}$ for $h \in [H]$ follows the i.i.d. Gaussian distribution $\mathcal{N}(0, \frac{2}{m})$, and each entry of $a$ follows the i.i.d. Gaussian distribution $\mathcal{N}(0, 1)$. As mentioned, we only train $W^{(h)}$ for $h \in [H]$ and fix $a$ and $A$. For a training set $\{x_i, y_i\}_{i=1}^n$, the loss function is denoted $\ell : (\mathbb{R}, \mathbb{R}) \mapsto \mathbb{R}$, and the (non-robust) training loss is $L(W) = \frac{1}{n} \sum_{i=1}^n \ell(f(W, x_i), y_i)$. We make the following assumption on the loss function:\n\nAssumption 3.1 (Assumption on the Loss Function). The loss $\ell(f(W, x), y)$ is Lipschitz, smooth, convex in $f(W, x)$ and satisfies $\ell(y, y) = 0$.\n\n3.3 Perturbation and the Surrogate Loss Function\n\nThe goal of adversarial training is to make the model robust in a neighborhood of each datum. We first introduce the definition of the perturbation set function to determine the perturbation at each point.\n\nDefinition 3.1 (Perturbation Set). Let the input space be $\mathcal{X} \subset \mathbb{R}^d$. The perturbation set function is $B : \mathcal{X} \to \mathcal{P}(\mathcal{X})$, where $\mathcal{P}(\mathcal{X})$ is the power set of $\mathcal{X}$. At each data point $x$, $B(x)$ gives the perturbation set on which we would like to guarantee robustness.
For example, a commonly used perturbation set is $B(x) = \{x' : \|x' - x\|_2 \le \delta\}$. Given a dataset $\{x_i, y_i\}_{i=1}^n$, we say that the perturbation set is compatible with the dataset if $B(x_i) \cap B(x_j) \neq \emptyset$ implies $y_i = y_j$. In the rest of the paper, we will always assume that $B$ is compatible with the given data.\n\nGiven a perturbation set, we are now ready to define the perturbation function that maps a data point to another point inside its perturbation set. We note that the perturbation function can be quite general, including the identity function and any adversarial attack4. Formally, we give the following definition.\n\nDefinition 3.2 (Perturbation Function). A perturbation function is defined as a function $\mathcal{A} : \mathcal{W} \times \mathbb{R}^d \to \mathbb{R}^d$, where $\mathcal{W}$ is the parameter space. Given the parameter $W$ of the neural network (1), $\mathcal{A}(W, x)$ maps $x \in \mathbb{R}^d$ to some $x' \in B(x)$, where $B(x)$ refers to the perturbation set defined in Definition 3.1.\n\nWithout loss of generality, throughout Sections 4 and 5 we will restrict our input $x$, as well as the perturbation set $B(x)$, to the surface of the unit ball $\mathcal{S} = \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$.\n\nWith the definition of the perturbation function, we can now define a large family of loss functions on the training set $\{x_i, y_i\}_{i=1}^n$. We will show that this definition covers both the standard loss used in empirical risk minimization and the robust loss used in adversarial training.\n\nDefinition 3.3 (Surrogate Loss Function). Given a perturbation function $\mathcal{A}$ defined in Definition 3.2, the current parameter $W$ of a neural network $f$, and a training set $\{x_i, y_i\}_{i=1}^n$, we define the surrogate loss $L_{\mathcal{A}}(W)$ on the training set as\n\n$L_{\mathcal{A}}(W) = \frac{1}{n} \sum_{i=1}^n \ell(f(W, \mathcal{A}(W, x_i)), y_i)$.\n\nIt can be easily observed that the standard training loss $L(W)$ is a special case of the surrogate loss function when $\mathcal{A}$ is the identity.
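The surrogate loss of Definition 3.3 can be computed directly once a perturbation function is fixed. A numpy sketch under illustrative assumptions (the model, loss, and random-search "attack" below are hypothetical placeholders, not the paper's setup); it shows that the identity perturbation recovers the standard training loss, while any attack that starts at x and keeps the worst point found yields a surrogate loss at least as large.

```python
import numpy as np

rng = np.random.default_rng(0)

def surrogate_loss(loss, f, W, X, Y, perturb):
    """L_A(W) = (1/n) * sum_i loss(f(W, A(W, x_i)), y_i)  (Definition 3.3)."""
    return np.mean([loss(f(W, perturb(W, x)), y) for (x, y) in zip(X, Y)])

def identity(W, x):
    return x   # with A = identity, L_A is the standard training loss L(W)

def make_random_search_attack(loss, f, y, delta=0.1, trials=100):
    """A crude perturbation function: sample the l2-ball of radius delta around x
    and keep the loss-maximizing point. (A label-aware stand-in for an attack
    algorithm; Definition 3.2 only requires that the output stays inside B(x).)"""
    def perturb(W, x):
        best, best_val = x, loss(f(W, x), y)
        for _ in range(trials):
            u = rng.normal(size=x.shape)
            cand = x + delta * rng.uniform() * u / np.linalg.norm(u)
            val = loss(f(W, cand), y)
            if val > best_val:
                best, best_val = cand, val
        return best
    return perturb
```

Because the search starts from x itself, the attacked loss dominates the clean loss pointwise, mirroring how the robust loss upper-bounds the standard one.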
The goal of adversarial training is to minimize the robust loss, i.e. the surrogate loss when $\mathcal{A}$ is the strongest possible attack. The formal definition is as follows:\n\nDefinition 3.4 (Robust Loss Function). The robust loss function is defined as $L^*(W) := L_{\mathcal{A}^*}(W)$, where\n\n$\mathcal{A}^*(W, x_i) = \arg\max_{x_i' \in B(x_i)} \ell(f(W, x_i'), y_i)$.\n\n4 Convergence Results of Adversarial Training\n\nWe consider optimizing the surrogate loss $L_{\mathcal{A}}$ with the perturbation function $\mathcal{A}(W, x)$ defined in Definition 3.2, which is what adversarial training does given any attack algorithm $\mathcal{A}$.\n\n4It is also not hard to extend our analysis to perturbation functions involving randomness.\n\nIn this section, we will prove that for a neural network with sufficient width, starting from the initialization $W_0$, after a certain number of steps of projected gradient descent within a convex set $B(R)$, the loss $L_{\mathcal{A}}$ is provably upper-bounded by the best minimax robust loss in this set, $\min_{W \in B(R)} L^*(W)$, where\n\n$B(R) = \{W : \|W^{(h)} - W^{(h)}_0\|_F \le \frac{R}{\sqrt{m}}, \; h \in [H]\}$.  (2)\n\nDenote by $P_{B(R)}$ the Euclidean projection onto the convex set $B(R)$. Denote the parameter $W$ after the $t$-th iteration by $W_t$, and similarly $W^{(h)}_t$. For each step in adversarial training, projected gradient descent takes an update\n\n$V_{t+1} = W_t - \alpha \nabla_W L_{\mathcal{A}}(W_t), \quad W_{t+1} = P_{B(R)}(V_{t+1})$,\n\nwhere\n\n$\nabla_W L_{\mathcal{A}}(W) = \frac{1}{n} \sum_{i=1}^n \ell'(f(W, \mathcal{A}(W, x_i)), y_i) \, \nabla_W f(W, \mathcal{A}(W, x_i))$,\n\nthe derivative $\ell'$ stands for $\frac{\partial \ell}{\partial f}$, and the gradient $\nabla_W f$ is with respect to the first argument $W$. Specifically, we have the following theorem.\n\nTheorem 4.1 (Convergence of Projected Gradient Descent for Optimizing Surrogate Loss).
Given $\epsilon > 0$, suppose $R = \Omega(1)$ and $m \ge \max\{\Theta(\frac{R^9 H^{16}}{\epsilon^7}), \Theta(d^2)\}$. Let the loss function satisfy Assumption 3.1.5 If we run projected gradient descent on the convex constraint set $B(R)$ with stepsize $\alpha = O(\frac{\epsilon}{m H^2})$ for $T = \Theta(\frac{R^2}{m \epsilon \alpha}) = \Omega(\frac{R^2 H^2}{\epsilon^2})$ steps, then with high probability we have\n\n$\min_{t=1,\cdots,T} L_{\mathcal{A}}(W_t) - L^*(W^*) \le \epsilon$,  (3)\n\nwhere $W^* = \arg\min_{W \in B(R)} L^*(W)$.\n\nRemark. Recall that $L_{\mathcal{A}}(W)$ is the loss suffered with respect to the perturbation function $\mathcal{A}$. This means, for example, that if the adversary uses the projected gradient ascent algorithm, then the theorem guarantees that projected gradient ascent cannot successfully attack the learned network. The stronger the attack algorithm used during training, the stronger the resulting guarantee on the surrogate loss.\n\nRemark. The value of $R$ depends on the approximation capability of the network: the greater $R$ is, the smaller $L^*(W^*)$ will be, thus affecting the overall bound on $\min_t L_{\mathcal{A}}(W_t)$. We will elaborate on this in the next section, where we show that for $R$ independent of $m$ there exists a network of small adversarial training error.\n\n4.1 Proof Sketch\n\nOur proof idea utilizes the same high-level intuition as [1, 27, 18, 55, 10, 11], namely that near the initialization the network is linear. However, unlike these earlier works, the surrogate loss is neither smooth nor semi-smooth, so there is no Polyak gradient domination phenomenon to allow for the global geometric contraction of gradient descent. In fact, due to the generality of the perturbation function $\mathcal{A}$ allowed, the surrogate loss is not differentiable or even continuous in $W$, and so the standard analysis cannot be applied. Our analysis utilizes two key observations. First, the network $f(W, \mathcal{A}(W, x))$ is still smooth w.r.t.
the first argument6, and is close to linear in the first argument near initialization, which is shown by directly bounding the Hessian w.r.t. $W$. Second, the perturbation function $\mathcal{A}$ can be treated as an adversary providing a worst-case loss function $\ell_{\mathcal{A}}(f, y)$, as done in online learning. However, online learning typically assumes the sequence of losses is convex, which is not the case here. We make a careful decoupling of the contribution to non-convexity from the first argument and the worst-case contribution from the perturbation function, and then we can prove that gradient descent succeeds in minimizing the surrogate loss. The full proof is in Appendix A.\n\n5We actually did not use the assumption $\ell(y, y) = 0$ in the proof, so common loss functions like the cross-entropy loss work in this theorem. Also, with some slight modifications, it is possible to prove the result for other loss functions, including the square loss.\n\n6It is not jointly smooth in $W$, which is part of the subtlety of the analysis.\n\n5 Adversarial Training Finds Robust Classifier\n\nMotivated by the optimization result in Theorem 4.1, we hope to show that there is indeed a robust classifier in $B(R)$. To show this, we utilize the connection between neural networks and their induced Reproducing Kernel Hilbert Space (RKHS) by viewing networks near initialization as a random feature scheme [15, 16, 25, 2]. Since we only need to show the existence of a network architecture that robustly fits the training data in $B(R)$, and neural networks are at least as expressive as their induced kernels, we may prove this via the RKHS connection. The strategy is to first show the existence of a robust classifier in the RKHS, and then show that a sufficiently wide network can approximate the kernel via random feature analysis.
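The random-feature view can be made concrete: the two-layer NTK used later in this section is an expectation over Gaussian weights, which a finite width approximates by averaging over sampled features. A numpy sketch; note the section's theory assumes a smooth activation, and ReLU is used here only because its kernel value on a unit vector has a simple closed form (K(x, x) = 1/2) to check against.

```python
import numpy as np

def ntk_monte_carlo(x, y, sigma_prime, n_features=200_000, seed=0):
    """Random-feature estimate of K(x, y) = E_{w ~ N(0, I_d)} <x s'(w.x), y s'(w.y)>
                                          = <x, y> * E[ s'(w.x) * s'(w.y) ].
    The finitely many sampled weights w_r play the role of a width-n_features network."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_features, x.shape[0]))   # rows are i.i.d. w_r ~ N(0, I_d)
    return (x @ y) * np.mean(sigma_prime(W @ x) * sigma_prime(W @ y))

relu_prime = lambda z: (z > 0).astype(float)        # derivative of ReLU (a.e.)
```

For unit vectors at angle theta, the ReLU kernel value is <x, y> * (pi - theta) / (2 pi), so the Monte Carlo average can be checked against this formula.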
The approximation results of this section will, in general, have an exponential dependence on the dimension, due to the known issue of d-dimensional functions having exponentially large RKHS norm [4]; they therefore only offer qualitative guidance on the existence of robust classifiers.\n\nSince deep networks contain two-layer networks as a sub-network, and we are concerned with expressivity, we focus on the local expressivity of two-layer networks. We write the standard two-layer network in the suggestive way7 (where the width $m$ is an even number)\n\n$f(W, x) = \frac{1}{\sqrt{m}} \left( \sum_{r=1}^{m/2} a_r \sigma(w_r^\top x) + \sum_{r=1}^{m/2} a'_r \sigma(\bar{w}_r^\top x) \right)$,  (4)\n\nand initialize as $w_r \sim \mathcal{N}(0, I_d)$ i.i.d. for $r = 1, \cdots, \frac{m}{2}$, where $a_r$ is randomly drawn from $\{1, -1\}$, $\bar{w}_r$ is set to be equal to $w_r$, and $a'_r = -a_r$. Similarly, we define the set $B(R) = \{W : \|W - W_0\|_F \le R\}$8 for $W = (w_1, \cdots, w_{m/2}, \bar{w}_1, \cdots, \bar{w}_{m/2})$, with $W_0$ being the initialization of $W$, and fix all $a_r$ after initialization.\n\nTo make things cleaner, we will use a smooth activation function $\sigma(\cdot)$ throughout this section9, formally stated as follows.\n\nAssumption 5.1 (Smoothness of Activation Function). The activation function $\sigma(\cdot)$ is smooth; that is, there exists an absolute constant $C > 0$ such that for any $z, z' \in \mathbb{R}$,\n\n$|\sigma'(z) - \sigma'(z')| \le C|z - z'|$.\n\nPrior to proving the approximation results, we first provide a version of the convergence theorem similar to Theorem 4.1, but for this two-layer setting.
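The symmetric initialization in Eq. (4), with each $\bar{w}_r$ duplicating $w_r$ and $a'_r = -a_r$, makes the network identically zero at initialization. A small numpy sketch of this construction; tanh is used as one example of a smooth activation satisfying Assumption 5.1, and all function names are illustrative.

```python
import numpy as np

def init_two_layer(d, m, rng):
    """Symmetric initialization for Eq. (4): duplicate each w_r and negate its output
    weight, so that f(W_0, x) = 0 for every input x."""
    assert m % 2 == 0, "the width m is assumed even"
    w = rng.normal(size=(m // 2, d))           # w_r ~ N(0, I_d), i.i.d.
    a = rng.choice([1.0, -1.0], size=m // 2)   # a_r drawn from {+1, -1}, then fixed
    return w, w.copy(), a, -a                  # (w, w_bar, a, a_prime)

def two_layer_f(w, w_bar, a, a_prime, x, sigma=np.tanh):
    """f(W, x) = (a . sigma(w x) + a' . sigma(w_bar x)) / sqrt(m), as in Eq. (4)."""
    m = 2 * w.shape[0]
    return (a @ sigma(w @ x) + a_prime @ sigma(w_bar @ x)) / np.sqrt(m)
```

Starting from exact zero removes the otherwise random output at initialization, which is the "technical nuisance" footnote 7 refers to.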
The reader is encouraged to read Appendix B for the proof of the following Theorem 5.1 first, since it is relatively cleaner than that of the deep setting while the proof logic is analogous.\n\nTheorem 5.1 (Convergence of Gradient Descent without Projection for Optimizing Surrogate Loss for Two-layer Networks). Suppose the loss function satisfies Assumption 3.1 and the activation function satisfies Assumption 5.1. With high probability, using the two-layer network defined above, for any $\epsilon > 0$, if we run gradient descent with step size $\alpha = O(\epsilon)$ and if $m = \Omega(\frac{R^4}{\epsilon^2})$, we have\n\n$\min_{t=1,\cdots,T} L_{\mathcal{A}}(W_t) - L^*(W^*) \le \epsilon$,  (5)\n\nwhere $W^* = \arg\min_{W \in B(R)} L^*(W)$ and $T = \Theta(\frac{\sqrt{m}}{\alpha})$.\n\nRemark. Compared to Theorem 4.1, we do not need the projection step for this two-layer theorem. We believe using a smooth activation function can also eliminate the need for the projection step in the deep setting from a technical perspective, and from a practical standpoint we conjecture that the projection step is not needed anyway.\n\nNow we are ready to proceed to the approximation results, i.e. proving that $L^*(W^*)$ is also small; combined with Equation (5), this gives an absolute bound on $\min_t L_{\mathcal{A}}(W_t)$. For the reader's convenience, we first introduce the Neural Tangent Kernel (NTK) [25] w.r.t. our two-layer network.\n\n7This makes $f(W, x) = 0$ at initialization, which helps eliminate some unnecessary technical nuisance.\n\n8Note that we have taken out the term $\frac{1}{\sqrt{m}}$ explicitly in the network expression for convenience, so in this section there is a difference of scaling by a factor of $\sqrt{m}$ from the $W$ used in the previous section.\n\n9Similar approximation results also hold for other activation functions like ReLU.\n\nDefinition 5.1 (NTK [25]).
The NTK with activation function $\sigma(\cdot)$ and initialization distribution $w \sim \mathcal{N}(0, I_d)$ is defined as $K_\sigma(x, y) = \mathbb{E}_{w \sim \mathcal{N}(0, I_d)} \langle x \sigma'(w^\top x), y \sigma'(w^\top y) \rangle$.\n\nFor a given kernel $K$, there is a reproducing kernel Hilbert space (RKHS) induced by $K$. We denote it by $\mathcal{H}(K)$. We refer the reader to [36] for an introduction to the theory of RKHS. We formally make the following assumption on the universality of the NTK.\n\nAssumption 5.2 (Existence of Robust Classifier in NTK). For any $\epsilon > 0$, there exists $f \in \mathcal{H}(K_\sigma)$ such that $|f(x'_i) - y_i| \le \epsilon$ for every $i \in [n]$ and $x'_i \in B(x_i)$.\n\nAlso, we make an additional assumption on the activation function $\sigma(\cdot)$:\n\nAssumption 5.3 (Lipschitz Property of Activation Function). The activation function $\sigma(\cdot)$ satisfies $|\sigma'(z)| \le C, \forall z \in \mathbb{R}$ for some constant $C$.\n\nUnder these assumptions, by applying the strategy of approximating the infinite-width limit by a finite sum of random features, we can obtain the following theorem:\n\nTheorem 5.2 (Existence of Robust Classifier near Initialization). Given a data set $D = \{(x_i, y_i)\}_{i=1}^n$ and a compatible perturbation set function $B$, with $x_i$ and its allowed perturbations taking values on $\mathcal{S}$, for the two-layer network defined in (4), if Assumptions 3.1, 5.1, 5.2, 5.3 hold, then for any $\epsilon, \delta > 0$ there exists $R_{D,B,\epsilon}$ such that when the width $m$ satisfies $m = \Omega(\frac{R^4_{D,B,\epsilon}}{\epsilon^2})$, with probability at least 0.99 over the initialization there exists $W$ such that\n\n$L^*(W) \le \epsilon$ and $W \in B(R_{D,B,\epsilon})$.\n\nCombining Theorems 5.1 and 5.2, we finally conclude:\n\nCorollary 5.1 (Adversarial Training Finds a Network of Small Robust Training Loss).
Given a data set on the unit sphere equipped with a compatible perturbation set function and an associated perturbation function $\mathcal{A}$, which also takes values on the unit sphere, suppose Assumptions 3.1, 5.1, 5.2, 5.3 are satisfied. Then there exists an $R_{D,B,\epsilon}$, which depends only on the dataset $D$, the perturbation $B$ and $\epsilon$, such that for any 2-layer fully connected network with width $m = \Omega(\frac{R^4_{D,B,\epsilon}}{\epsilon^2})$, if we run gradient descent with stepsize $\alpha = O(\epsilon)$ for $T = \Theta(\frac{R^2_{D,B,\epsilon}}{\epsilon \alpha})$ steps, then with probability 0.99,\n\n$\min_{t=1,\cdots,T} L_{\mathcal{A}}(W_t) \le \epsilon$.  (6)\n\nRemark 5.1. We point out that Assumption 5.2 is rather general and can be verified for a large class of activation functions by showing their induced kernel is universal, as done in [32]. Also, here we use an implicit expression for the radius $R_{D,B,\epsilon}$, but the dependence on $\epsilon$ can be calculated under a specific activation function, with or without the smoothness assumptions. As an example, using the quadratic ReLU as the activation function, we work out in Appendix C.2 the explicit dependence on $\epsilon$, which does not rely on Assumption 5.2.\n\nTherefore, adversarial training is guaranteed to find a robust classifier under a given attack algorithm when the network width is sufficiently large.\n\n6 Capacity Requirement of Robustness\n\nIn this section, we will show that in order to achieve adversarially robust interpolation (formally defined below), one needs more capacity than for normal interpolation. In fact, empirical evidence has already shown that to reliably withstand strong adversarial attacks, networks require a significantly larger capacity than for correctly classifying benign examples only [31].
This implies, in some sense, that using a neural network with larger width is necessary.\n\nLet $S_\delta = \{(x_1, \cdots, x_n) \in (\mathbb{R}^d)^n : \|x_i - x_j\|_2 > 2\delta \text{ for all } i \neq j\}$ and $B_\delta(x) = \{x' : \|x' - x\|_2 \le \delta\}$, where $\delta$ is a constant. We consider datasets in $S_\delta$ and use $B_\delta$ as the perturbation set function in this section. We begin with the definitions of the interpolation class and the robust interpolation class.\n\nDefinition 6.1 (Interpolation class). We say that a function class $\mathcal{F}$ of functions $f : \mathbb{R}^d \to \{1, -1\}$ is an n-interpolation class10 if the following is satisfied:\n\n$\forall (x_1, \cdots, x_n) \in S_\delta, \forall (y_1, \cdots, y_n) \in \{\pm 1\}^n, \exists f \in \mathcal{F} \text{ s.t. } f(x_i) = y_i, \forall i \in [n]$.\n\nDefinition 6.2 (Robust interpolation class). We say that a function class $\mathcal{F}$ is an n-robust interpolation class if the following is satisfied:\n\n$\forall (x_1, \cdots, x_n) \in S_\delta, \forall (y_1, \cdots, y_n) \in \{\pm 1\}^n, \exists f \in \mathcal{F} \text{ s.t. } f(x'_i) = y_i, \forall x'_i \in B_\delta(x_i), \forall i \in [n]$.\n\nWe will use the VC-Dimension of a function class $\mathcal{F}$ to measure its complexity. In fact, as shown in [6] (Equation (2)), for neural networks there is a tight connection between the number of parameters $W$, the number of layers $H$ and their VC-Dimension:\n\n$\Omega(HW \log(W/H)) \le \text{VC-Dimension} \le O(HW \log W)$.\n\nIn addition, combining this with the results in [52] (Theorem 3), which show the existence of a 4-layer neural network with $O(n)$ parameters that can interpolate any $n$ data points, i.e.
an n-interpolation class, we have that an n-interpolation class can be realized by a fixed-depth neural network with the VC-Dimension upper bound\n\n$\text{VC-Dimension} \le O(n \log n)$.  (7)\n\nFor a general hypothesis class $\mathcal{F}$, we can easily see that when $\mathcal{F}$ is an n-interpolation class, $\mathcal{F}$ has VC-Dimension at least $n$. For a neural network that is an n-interpolation class, without further architectural constraints, this lower bound on its VC-Dimension is tight up to logarithmic factors, as indicated in Equation (7). However, we show that a robust interpolation class has a much larger VC-Dimension lower bound:\n\nTheorem 6.1. If $\mathcal{F}$ is an n-robust interpolation class, then we have the following lower bound on the VC-Dimension of $\mathcal{F}$:\n\n$\text{VC-Dimension} \ge \Omega(nd)$,  (8)\n\nwhere $d$ is the dimension of the input space.\n\nFor neural networks, Equation (8) shows that any architecture that is an n-robust interpolation class must have VC-Dimension at least $\Omega(nd)$. Compared with Equation (7), which shows that an n-interpolation class can be realized by a network architecture with VC-Dimension $O(n \log n)$, we can conclude that robust interpolation by neural networks needs more capacity, so increasing the width of the neural network is indeed, in some sense, necessary.\n\n7 Discussion on Limitations and Future Directions\n\nThis work provides a theoretical analysis of the empirically successful adversarial training algorithm for training robust neural networks. Our main results indicate that adversarial training will find a network of low robust surrogate loss, even when the maximization is computed via a heuristic algorithm such as projected gradient ascent. However, there are still some limitations of our current theory, and we also feel our results can lead to several thought-provoking directions for future work, discussed as follows.\n\nRemoval of projection.
It is also natural to ask whether the projection step can be removed, as it is empirically unnecessary and is also unnecessary for our two-layer analysis. We believe using smooth activations might resolve this issue from a technical perspective, although in practice the projection step appears to be unnecessary in any case.

Generalizing to different attacks. Our current guarantee on the surrogate loss is based on the same perturbation function as the one used during training. It is natural to ask whether we can ensure that the surrogate loss is low with respect to a larger family of perturbation functions than the one used during training.

$^{10}$Here we let the classification output be $\pm 1$; a usual classifier $f$ outputting a number in $\mathbb{R}$ can be treated as $\mathrm{sign}(f)$ here.

Exploiting structures of network and data. As with the recent proofs of convergence for overparameterized networks in the non-robust setting, our analysis fails to incorporate useful network structures beyond sufficient width, and as a result increasing depth can only hurt the bound. It would be interesting to provide a finer analysis based on additional assumptions about the alignment between the network structure and the data distribution.

Improving the approximation bound. On the expressivity side, the current argument uses the fact that a neural net restricted to a local region can approximate its induced RKHS. Although the RKHS is universal, it does not avoid the curse of dimensionality (see Appendix C.2). However, we believe that in reality the radius $R$ of the region required to achieve robust approximation is not as large as the theorem demands. An interesting question is therefore whether the robust expressivity of neural networks can adapt to structure such as a low latent dimension of the data mechanism [17, 50], thereby reducing the approximation bound.

Capacity requirement of robustness and robust generalization.
Apart from this paper, there are other works supporting the need for capacity, including from the perspectives of network width [31], depth [49], and computational complexity [35]. It is argued in [51], via Rademacher complexity, that robust generalization is also harder. In fact, it appears empirically that robust generalization is even harder than robust training: it is observed that increasing capacity, though benefiting the decay of the training loss, has much less effect on robust generalization. There are also other factors behind robust generalization, such as the number of training data [40]. The questions about robust generalization, as well as the extent to which capacity influences it, are still subject to much debate.

The above are several interesting directions for further improvement of our current result. In fact, many of these questions are largely unanswered even for neural nets in the non-robust setting, so we leave them to future work.

8 Acknowledgements

We acknowledge useful discussions with Siyu Chen, Di He, Runtian Zhai, and Xiyu Zhai. RG and TC are partially supported by the elite undergraduate training program of the School of Mathematical Sciences at Peking University. LW acknowledges support by the National Key R&D Program of China (no. 2018YFB1402600) and BJNSF (L172037). JDL acknowledges support of the ARO under MURI Award W911NF-11-1-0303, the Sloan Research Fellowship, and NSF CCF #1900145.

References

[1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. arXiv preprint arXiv:1811.03962, 2018.

[2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

[3] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples.
In ICML, 2018.

[4] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.

[5] Francis Bach. On the equivalence between kernel quadrature rules and random feature expansions. The Journal of Machine Learning Research, 18(1):714–751, 2017.

[6] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research, 20(63):1–17, 2019.

[7] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. arXiv preprint arXiv:1905.12173, 2019.

[8] Wieland Brendel, Jonas Rauber, and Matthias Bethge. Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. arXiv preprint arXiv:1712.04248, 2017.

[9] Sébastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational constraints. arXiv preprint arXiv:1805.10204, 2018.

[10] Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. A Gram-Gauss-Newton method learning overparameterized deep neural networks for regression problems. arXiv preprint arXiv:1905.11675, 2019.

[11] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. arXiv preprint arXiv:1905.13210, 2019.

[12] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017.

[13] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017.

[14] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.

[15] Amit Daniely. SGD learns the conjugate kernel class of the network. In Advances in Neural Information Processing Systems, pages 2422–2430, 2017.

[16] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing Systems, pages 2253–2261, 2016.

[17] Simon S Du and Jason D Lee. On the power of over-parametrization in neural networks with quadratic activation. arXiv preprint arXiv:1803.01206, 2018.

[18] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, 2018.

[19] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

[20] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945, 2017.

[21] Alon Gonen and Elad Hazan. Learning in non-convex games with an optimization oracle. arXiv preprint arXiv:1810.07362, 2018.

[22] Ian Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[23] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens van der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017.

[24] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. In International Conference on Machine Learning, pages 2142–2151, 2018.

[25] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.

[26] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016.

[27] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. arXiv preprint arXiv:1808.01204, 2018.

[28] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In European Conference on Computer Vision, pages 381–397. Springer, 2018.

[29] Xuanqing Liu and Cho-Jui Hsieh. Rob-GAN: Generator, discriminator, and adversarial attacker. In CVPR, 2019.

[30] Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Michael E Houle, Grant Schoenebeck, Dawn Song, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018.

[31] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

[32] Charles A Micchelli, Yuesheng Xu, and Haizhang Zhang. Universal kernels. Journal of Machine Learning Research, 7(Dec):2651–2667, 2006.

[33] Mehryar Mohri and Andres Munoz Medina. New analysis and algorithm for learning with drifting distributions. In Algorithmic Learning Theory, pages 124–138. Springer, 2012.

[34] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.

[35] Preetum Nakkiran. Adversarial robustness may be at odds with simplicity. arXiv preprint arXiv:1901.00532, 2019.

[36] Vern I Paulsen and Mrinal Raghupathi. An Introduction to the Theory of Reproducing Kernel Hilbert Spaces, volume 152. Cambridge University Press, 2016.

[37] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In 2008 46th Annual Allerton Conference on Communication, Control, and Computing, pages 555–561. IEEE, 2008.

[38] Hadi Salman, Greg Yang, Huan Zhang, Cho-Jui Hsieh, and Pengchuan Zhang. A convex relaxation barrier to tight robust verification of neural networks. arXiv preprint arXiv:1902.08722, 2019.

[39] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-GAN: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018.

[40] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014–5026, 2018.

[41] Ali Shafahi, Mahyar Najibi, Amin Ghiasi, Zheng Xu, John Dickerson, Christoph Studer, Larry S Davis, Gavin Taylor, and Tom Goldstein. Adversarial training for free! arXiv preprint arXiv:1904.12843, 2019.

[42] Gagandeep Singh, Timon Gehr, Matthew Mirman, Markus Püschel, and Martin Vechev. Fast and effective robustness certification. In Advances in Neural Information Processing Systems, pages 10802–10813, 2018.

[43] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. PixelDefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017.

[44] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

[45] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.

[46] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the convergence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019.

[47] Tsui-Wei Weng, Huan Zhang, Hongge Chen, Zhao Song, Cho-Jui Hsieh, Luca Daniel, Duane Boning, and Inderjit Dhillon. Towards fast computation of certified robustness for ReLU networks. In International Conference on Machine Learning, pages 5273–5282, 2018.

[48] Eric Wong and Zico Kolter. Provable defenses against adversarial examples via the convex outer adversarial polytope. In International Conference on Machine Learning, pages 5283–5292, 2018.

[49] Cihang Xie and Alan Yuille. Intriguing properties of adversarial training. arXiv preprint arXiv:1906.03787, 2019.

[50] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. arXiv preprint arXiv:1802.03620, 2018.

[51] Dong Yin, Kannan Ramchandran, and Peter Bartlett. Rademacher complexity for adversarially robust generalization. arXiv preprint arXiv:1810.11914, 2018.

[52] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Finite sample expressive power of small-width ReLU networks. arXiv preprint arXiv:1810.07770, 2018.

[53] Dinghuai Zhang, Tianyuan Zhang, Yiping Lu, Zhanxing Zhu, and Bin Dong. You only propagate once: Painless adversarial training using maximal principle. arXiv preprint arXiv:1905.00877, 2019.

[54] Huan Zhang, Tsui-Wei Weng, Pin-Yu Chen, Cho-Jui Hsieh, and Luca Daniel. Efficient neural network robustness certification with general activation functions. In Advances in Neural Information Processing Systems, pages 4939–4948, 2018.

[55] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv preprint arXiv:1811.08888, 2018.