{"title": "NAIS-Net: Stable Deep Networks from Non-Autonomous  Differential Equations", "book": "Advances in Neural Information Processing Systems", "page_first": 3025, "page_last": 3035, "abstract": "This paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net), a very deep architecture where each stacked processing block is derived from a time-invariant non-autonomous dynamical system. Non-autonomy is implemented by skip connections from the block input to each of the unrolled processing stages and allows stability to be enforced so that blocks can be unrolled adaptively to a  pattern-dependent processing depth. NAIS-Net induces non-trivial, Lipschitz input-output maps, even for an infinite unroll length. We prove that the network is globally asymptotically stable so that for every initial condition there is exactly one input-dependent equilibrium assuming tanh units, and multiple stable equilibria for ReL units. An efficient implementation that enforces the stability under derived conditions for both fully-connected and convolutional layers is also presented. Experimental results show how NAIS-Net exhibits stability in practice, yielding a significant reduction in generalization gap compared to ResNets.", "full_text": "NAIS-NET: Stable Deep Networks from\nNon-Autonomous Differential Equations\n\nMarco Ciccone\u2217\nPolitecnico di Milano\n\nNNAISENSE SA\n\nmarco.ciccone@polimi.it\n\nMarco Gallieri\u2217\u2020\nNNAISENSE SA\n\nmarco@nnaisense.com\n\nJonathan Masci\nNNAISENSE SA\n\njonathan@nnaisense.com\n\nChristian Osendorfer\n\nNNAISENSE SA\n\nchristian@nnaisense.com\n\nFaustino Gomez\nNNAISENSE SA\n\ntino@nnaisense.com\n\nAbstract\n\nThis paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net),\na very deep architecture where each stacked processing block is derived from a\ntime-invariant non-autonomous dynamical system. 
Non-autonomy is implemented\nby skip connections from the block input to each of the unrolled processing stages\nand allows stability to be enforced so that blocks can be unrolled adaptively to\na pattern-dependent processing depth. NAIS-Net induces non-trivial, Lipschitz\ninput-output maps, even for an infinite unroll length. We prove that the network is\nglobally asymptotically stable so that for every initial condition there is exactly one\ninput-dependent equilibrium assuming tanh units, and multiple stable equilibria\nfor ReL units. An efficient implementation that enforces the stability under derived\nconditions for both fully-connected and convolutional layers is also presented.\nExperimental results show how NAIS-Net exhibits stability in practice, yielding a\nsignificant reduction in generalization gap compared to ResNets.\n\n1 Introduction\nDeep neural networks are now the state-of-the-art in a variety of challenging tasks, ranging from\nobject recognition to natural language processing and graph analysis [28, 3, 52, 43, 36]. With enough\nlayers, they can, in principle, learn arbitrarily complex abstract representations through an iterative\nprocess [13] where each layer transforms the output from the previous layer non-linearly until the\ninput pattern is embedded in a latent space where inference can be done efficiently.\nUntil the advent of Highway [40] and Residual (ResNet; [18]) networks, training nets beyond a certain\ndepth with gradient descent was limited by the vanishing gradient problem [19, 4]. These very deep\nnetworks (VDNNs) have skip connections that provide shortcuts for the gradient to flow back through\nhundreds of layers. 
Unfortunately, training them still requires extensive hyper-parameter tuning, and,\neven if there were a principled way to determine the optimal number of layers or processing depth for\na given task, it still would be fixed for all patterns.\n\n\u2217The authors equally contributed.\n\u2020The author derived the mathematical results.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nFigure 1: NAIS-Net architecture. Each block represents a time-invariant iterative process as the first layer\nin the i-th block, x_i(1), is unrolled into a pattern-dependent number, K_i, of processing stages, using weight\nmatrices A_i and B_i. The skip connections from the input, u_i, to all layers in block i make the process\nnon-autonomous. Blocks can be chained together (each block modeling a different latent space) by passing the final\nlatent representation, x_i(K_i), of block i as the input to block i + 1.\n\nRecently, several researchers have started to view VDNNs from a dynamical systems perspective.\nHaber and Ruthotto [15] analyzed the stability of ResNets by framing them as an Euler integration of\nan ODE, and [34] showed how using other numerical integration methods induces various existing\nnetwork architectures such as PolyNet [50], FractalNet [30] and RevNet [11]. A fundamental problem\nwith the dynamical systems underlying these architectures is that they are autonomous: the input\npattern sets the initial condition, only directly affecting the first processing stage. This means that if\nthe system converges, there is either exactly one fixpoint or exactly one limit cycle [42]. Neither case\nis desirable from a learning perspective because a dynamical system should have input-dependent\nconvergence properties so that representations are useful for learning. 
One possible approach to\nachieve this is to have a non-autonomous system where, at each iteration, the system is forced by an\nexternal input.\nThis paper introduces a novel network architecture, called the \u201cNon-Autonomous Input-Output Stable\nNetwork\u201d (NAIS-Net), that is derived from a dynamical system that is both time-invariant (weights\nare shared) and non-autonomous.3 NAIS-Net is a general residual architecture where a block (see\n\ufb01gure 1) is the unrolling of a time-invariant system, and non-autonomy is implemented by having the\nexternal input applied to each of the unrolled processing stages in the block through skip connections.\nResNets are similar to NAIS-Net except that ResNets are time-varying and only receive the external\ninput at the \ufb01rst layer of the block.\nWith this design, we can derive suf\ufb01cient conditions under which the network exhibits input-dependent\nequilibria that are globally asymptotically stable for every initial condition. More speci\ufb01cally, in\nsection 3, we prove that with tanh activations, NAIS-Net has exactly one input-dependent equilibrium,\nwhile with ReLU activations it has multiple stable equilibria per input pattern. Moreover, the\nNAIS-Net architecture allows not only the internal stability of the system to be analyzed but, more\nimportantly, the input-output stability \u2014 the difference between the representations generated by two\ndifferent inputs belonging to a bounded set will also be bounded at each stage of the unrolling.4\nIn section 4, we provide an ef\ufb01cient implementation that enforces the stability conditions for both fully-\nconnected and convolutional layers in the stochastic optimization setting. These implementations are\ncompared experimentally with ResNets on both CIFAR-10 and CIFAR-100 datasets, in section 5,\nshowing that NAIS-Nets achieve comparable classi\ufb01cation accuracy with a much better generalization\ngap. 
NAIS-Nets can also be 10 to 20 times deeper than the original ResNet without increasing the\ntotal number of network parameters, and, by stacking several stable NAIS-Net blocks, models that\nimplement pattern-dependent processing depth can be trained without requiring any normalization at\neach step (except when there is a change in layer dimensionality, to speed up training).\nThe next section presents a more formal treatment of the dynamical systems perspective of neural\nnetworks, and a brief overview of work to date in this area.\n\n2 Background and Related Work\nRepresentation learning is about \ufb01nding a mapping from input patterns to encodings that disentangle\nthe underlying variational factors of the input set. With such an encoding, a large portion of typical\nsupervised learning tasks (e.g. classi\ufb01cation and regression) should be solvable using just a simple\nmodel like logistic regression. A key characteristic of such a mapping is its invariance to input\ntransformations that do not alter these factors for a given input5. In particular, random perturbations\nof the input should in general not be drastically ampli\ufb01ed in the encoding. 
In the field of control theory, this property is central to stability analysis which investigates the properties of dynamical\nsystems under which they converge to a single steady state without exhibiting chaos [25, 42, 39].\n\n3 The DenseNet architecture [29, 22] is non-autonomous, but time-varying.\n4 In the supplementary material, we also show that these results hold both for shared and unshared weights.\n5 Such invariance conditions can be very powerful inductive biases on their own: For example, requiring\ninvariance to time transformations in the input leads to popular RNN architectures [45].\n\nIn machine learning, stability has long been central to the study of recurrent neural networks\n(RNNs) with respect to the vanishing [19, 4, 37], and exploding [9, 2, 37] gradient problems,\nleading to the development of Long Short-Term Memory [20] to alleviate the former. More recently,\ngeneral conditions for RNN stability have been presented [52, 24, 31, 47] based on general insights\nrelated to Matrix Norm analysis. Input-output stability [25] has also been analyzed for simple\nRNNs [41, 26, 16, 38].\nRecently, the stability of deep feed-forward networks was more closely investigated, mostly due\nto adversarial attacks [44] on trained networks. It turns out that sensitivity to (adversarial) input\nperturbations in the inference process can be avoided by ensuring certain conditions on the spectral\nnorms of the weight matrices [7, 49]. 
Additionally, special properties of the spectral norm of weight\nmatrices mitigate instabilities during the training of Generative Adversarial Networks [35].\nAlmost all successfully trained VDNNs [20, 18, 40, 6] share the following core building block:\n\nx(k + 1) = x(k) + f(x(k), \u03b8(k)), 1 \u2264 k \u2264 K.    (1)\nThat is, in order to compute a vector representation at layer k + 1 (or time k + 1 for recurrent\nnetworks), additively update x(k) with some non-linear transformation f(\u00b7) of x(k) which depends\non parameters \u03b8(k). The reason usually given for why Eq. (1) allows VDNNs to be trained is that the\nexplicit identity connections avoid the vanishing gradient problem.\nThe semantics of the forward path are however still considered unclear. A recent interpretation is that\nthese feed-forward architectures implement iterative inference [13, 23]. This view is reinforced by\nobserving that Eq. (1) is a forward Euler discretization [1] of the ordinary differential equation (ODE)\nx\u0307(t) = f(x(t), \u0398) if \u03b8(k) \u2261 \u0398 for all 1 \u2264 k \u2264 K in Eq. (1). This connection between dynamical\nsystems and feed-forward architectures was recently also observed by several other authors [48].\nThis point of view leads to a large family of new network architectures that are induced by various\nnumerical integration methods [34]. Moreover, stability problems in both the forward as well as the\nbackward path of VDNNs have been addressed by relying on well-known analytical approaches\nfor continuous-time ODEs [15, 5]. In the present paper, we instead address the problem directly in\ndiscrete-time, meaning that our stability result is preserved by the network implementation. 
With the\nexception of [33], none of this prior research considers time-invariant, non-autonomous systems.\nConceptually, our work shares similarities with approaches that build networks according to iterative\nalgorithms [14, 51] and recent ideas investigating pattern-dependent processing time [12, 46, 10].\n\n3 Non-Autonomous Input-Output Stable Nets (NAIS-Nets)\nThis section provides stability conditions for both fully-connected and convolutional NAIS-Net layers.\nWe formally prove that NAIS-Net provides a non-trivial input-dependent output for each iteration k\nas well as in the asymptotic case (k \u2192 \u221e). The following dynamical system:\n\nx(k + 1) = x(k) + h f(x(k), u, \u03b8), x(0) = 0,    (2)\nis used throughout the paper, where x \u2208 R^n is the latent state, u \u2208 R^m is the network input, and\nh > 0. For ease of notation, in the remainder of the paper the explicit dependence on the parameters,\n\u03b8, will be omitted.\n\nFully Connected NAIS-Net Layer. Our fully connected layer is defined by\n\nx(k + 1) = x(k) + h\u03c3(Ax(k) + Bu + b),    (3)\nwhere A \u2208 R^{n\u00d7n} and B \u2208 R^{n\u00d7m} are the state and input transfer matrices, and b \u2208 R^n is a bias.\nThe activation \u03c3 \u2208 R^n is a vector of (element-wise) instances of an activation function, denoted as\n\u03c3_i with i \u2208 {1, . . . , n}. In this paper, we only consider the hyperbolic tangent, tanh, and Rectified\nLinear Units (ReLU) activation functions. Note that by setting B = 0, and the step h = 1 the original\nResNet formulation is obtained.\n\nConvolutional NAIS-Net Layer. The architecture can be easily extended to Convolutional Networks\nby replacing the matrix multiplications in Eq. 
(3) with a convolution operator:\n\nX(k + 1) = X(k) + h\u03c3(C \u2217 X + D \u2217 U + E).    (4)\n\nConsider the case of N_C channels. The convolutional layer in Eq. (4) can be rewritten, for each latent\nmap c \u2208 {1, 2, . . . , N_C}, in the equivalent form:\n\nX^c(k + 1) = X^c(k) + h\u03c3( \u2211_{i}^{N_C} C^c_i \u2217 X^i(k) + \u2211_{j}^{N_C} D^c_j \u2217 U^j + E^c ),    (5)\n\nwhere: X^i(k) \u2208 R^{n_X\u00d7n_X} is the layer state matrix for channel i, U^j \u2208 R^{n_U\u00d7n_U} is the layer input data\nmatrix for channel j (where an appropriate zero padding has been applied) at layer k, C^c_i \u2208 R^{n_C\u00d7n_C}\nis the state convolution filter from state channel i to state channel c, D^c_j is its equivalent for the input,\nand E^c is a bias. The activation, \u03c3, is still applied element-wise. The convolution for X has a fixed\nstride s = 1, a filter size n_C and a zero padding of p \u2208 N, such that n_C = 2p + 1 (see footnote 6).\nConvolutional layers can be rewritten in the same form as fully connected layers (see proof of Lemma\n1 in the supplementary material). Therefore, the stability results in the next section will be formulated\nfor the fully connected case, but apply to both.\n\nStability Analysis. Here, the stability conditions for NAIS-Nets which were instrumental to their\ndesign are laid out. We are interested in using a cascade of unrolled NAIS blocks (see Figure 1),\nwhere each block is described by either Eq. (3) or Eq. (4). Since we are dealing with a cascade of\ndynamical systems, stability of the entire network can be enforced by having stable blocks [25].\nThe state-transfer Jacobian for layer k is defined as:\n\nJ(x(k), u) = \u2202x(k + 1) / \u2202x(k) = I + h (\u2202\u03c3(\u2206x(k)) / \u2202\u2206x(k)) A,    (6)\n\nwhere the argument of the activation function, \u03c3, is denoted as \u2206x(k). 
Take an arbitrarily small\nscalar \u03c3 > 0 and define the set of pairs (x, u) for which the activations are not saturated as:\n\nP = { (x, u) : \u2202\u03c3_i(\u2206x(k)) / \u2202\u2206x_i(k) \u2265 \u03c3, \u2200i \u2208 [1, 2, . . . , n] }.    (7)\n\nTheorem 1 below proves that the non-autonomous residual network produces a bounded output\ngiven a bounded, possibly noisy, input, and that the network state converges to a constant value as the\nnumber of layers tends to infinity, if the following stability condition holds:\nCondition 1. For any \u03c3 > 0, the Jacobian satisfies:\n\n\u03c1\u0304 = sup_{(x,u)\u2208P} \u03c1(J(x, u)), s.t. \u03c1\u0304 < 1,    (8)\n\nwhere \u03c1(\u00b7) is the spectral radius.\nThe steady states, x\u0304, are determined by a continuous function of u. This means that a small change in\nu cannot result in a very different x\u0304. For tanh activation, x\u0304 depends linearly on u, therefore the block\nneeds to be unrolled for a finite number of iterations, K, for the mapping to be non-linear. That is not\nthe case for ReLU, which can be unrolled indefinitely and still provide a piece-wise affine mapping.\nIn Theorem 1, the Input-Output (IO) gain function, \u03b3(\u00b7), describes the effect of norm-bounded input\nperturbations on the network trajectory. This gain provides insight as to the level of robust invariance\nof the classification regions to changes in the input data with respect to the training set. In particular,\nas the gain is decreased, the perturbed solution will be closer to the solution obtained from the\ntraining set. This can lead to increased robustness and generalization with respect to a network that\ndoes not satisfy Condition 1. Note that the IO gain, \u03b3(\u00b7), is linear, and hence the block IO map is\nLipschitz even for an infinite unroll length. The IO gain depends directly on the norm of the state\ntransfer Jacobian, in Eq. 
(8), as indicated by the term \u03c1\u0304 in Theorem 1 (see footnote 7).\n\n6 If s \u2265 0, then x can be extended with an appropriate number of constant zeros (not connected).\n7 See supplementary material for additional details and all proofs, where the untied case is also covered.\n\nAlgorithm 1 Fully Connected Reprojection\nInput: R \u2208 R^{\u00f1\u00d7n}, \u00f1 \u2264 n, \u03b4 = 1 \u2212 2\u03b5, \u03b5 \u2208 (0, 0.5).\nif \u2016R^T R\u2016_F > \u03b4 then\n  R\u0303 \u2190 \u221a\u03b4 R / \u221a(\u2016R^T R\u2016_F)\nelse\n  R\u0303 \u2190 R\nend if\nOutput: R\u0303\n\nAlgorithm 2 CNN Reprojection\nInput: \u03b4 \u2208 R^{N_C}, C \u2208 R^{n_X\u00d7n_X\u00d7N_C\u00d7N_C}, and 0 < \u03b5 < \u03b7 < 1.\nfor each feature map c do\n  \u03b4\u0303_c \u2190 max(min(\u03b4_c, 1 \u2212 \u03b7), \u22121 + \u03b7)\n  C\u0303^c_{icentre} \u2190 \u22121 \u2212 \u03b4\u0303_c\n  if \u2211_{j\u2260icentre} |C^c_j| > 1 \u2212 \u03b5 \u2212 |\u03b4\u0303_c| then\n    C\u0303^c_j \u2190 (1 \u2212 \u03b5 \u2212 |\u03b4\u0303_c|) C^c_j / \u2211_{j\u2260icentre} |C^c_j|\n  end if\nend for\nOutput: \u03b4\u0303, C\u0303\n\nFigure 2: Proposed algorithms for enforcing stability.\n\nTheorem 1. (Asymptotic stability for shared weights)\nIf Condition 1 holds, then NAIS-Net with ReLU or tanh activations is Asymptotically Stable with\nrespect to input dependent equilibrium points. More formally:\n\nx(k) \u2192 x\u0304 \u2208 R^n, \u2200x(0) \u2208 X \u2286 R^n, u \u2208 R^m.    (9)\nThe trajectory is described by \u2016x(k) \u2212 x\u0304\u2016 \u2264 \u03c1\u0304^k \u2016x(0) \u2212 x\u0304\u2016, where \u2016\u00b7\u2016 is a suitable matrix norm.\nIn particular:\n\u2022 With tanh activation, the steady state x\u0304 is independent of the initial state, and it is a linear function\nof the input, namely, x\u0304 = A^{\u22121}Bu. 
The network is Globally Asymptotically Stable.\nWith ReLU activation, x\u0304 is given by a continuous piecewise affine function of x(0) and u. The\nnetwork is Locally Asymptotically Stable with respect to each x\u0304.\n\n\u2022 If the activation is tanh, then the network is Globally Input-Output (robustly) Stable for any\nadditive input perturbation w \u2208 R^m. The trajectory is described by:\n\n\u2016x(k) \u2212 x\u0304\u2016 \u2264 \u03c1\u0304^k \u2016x(0) \u2212 x\u0304\u2016 + \u03b3(\u2016w\u2016), with \u03b3(\u2016w\u2016) = h (\u2016B\u2016 / (1 \u2212 \u03c1\u0304)) \u2016w\u2016,    (10)\nwhere \u03b3(\u00b7) is the input-output gain. For any \u00b5 \u2265 0, if \u2016w\u2016 \u2264 \u00b5 then the following set is robustly\npositively invariant (x(k) \u2208 X, \u2200k \u2265 0):\n\nX = {x \u2208 R^n : \u2016x \u2212 x\u0304\u2016 \u2264 \u03b3(\u00b5)}.    (11)\n\u2022 If the activation is ReLU, then the network is Globally Input-Output practically Stable. In other\nwords, \u2200k \u2265 0 we have:\n\n\u2016x(k) \u2212 x\u0304\u2016 \u2264 \u03c1\u0304^k \u2016x(0) \u2212 x\u0304\u2016 + \u03b3(\u2016w\u2016) + \u03b6.    (12)\n\nThe constant \u03b6 \u2265 0 is the norm ball radius for x(0) \u2212 x\u0304.\n\n4 Implementation\nIn general, an optimization problem with a spectral radius constraint as in Eq. (8) is hard [24]. One\npossible approach is to relax the constraint to a singular value constraint [24] which is applicable\nto both fully connected as well as convolutional layer types [49]. However, this approach is only\napplicable if the identity matrix in the Jacobian (Eq. (6)) is scaled by a factor 0 < c < 1 [24]. 
In this\nwork we instead fulfil the spectral radius constraint directly.\nThe basic intuition for the presented algorithms is the fact that for a simple Jacobian of the form\nI + M, M \u2208 R^{n\u00d7n}, Condition 1 is fulfilled, if M has eigenvalues with real part in (\u22122, 0) and\nimaginary part in the unit circle. In the supplemental material we prove that the following algorithms\nfulfil Condition 1 following this intuition. Note that, in the following, the presented procedures are\nto be performed for each block of the network.\n\nFully-connected blocks. In the fully connected case, we restrict the matrix A to be symmetric and\nnegative definite by choosing the following parameterization:\n\nA = \u2212R^T R \u2212 \u03b5I,    (13)\nwhere R \u2208 R^{n\u00d7n} is trained, and 0 < \u03b5 \u226a 1 is a hyper-parameter. Then, we propose a bound on the\nFrobenius norm, \u2016R^T R\u2016_F. Algorithm 1, performed during training, implements the following (see footnote 8):\n\n8 The more relaxed condition \u03b4 \u2208 (0, 2) is sufficient for Theorem 1 to hold locally (supplementary material).\n\nFigure 3: Single neuron trajectory and convergence. (Left) Average loss of NAIS-Net with different\nresidual architectures over the unroll length. Note that both RESNET-SH-STABLE and NAIS-Net satisfy\nthe stability conditions for convergence, but only NAIS-Net is able to learn, showing the importance of\nnon-autonomy. Cross-entropy loss vs processing depth. (Right) Activation of a NAIS-Net single neuron for input\nsamples from each class on MNIST. Trajectories not only differ with respect to the actual steady-state but also\nwith respect to the convergence time.\n\nTheorem 2. 
(Fully-connected weight projection)\nGiven R \u2208 R^{n\u00d7n}, the projection R\u0303 = \u221a\u03b4 R / \u221a(\u2016R^T R\u2016_F), with \u03b4 = 1 \u2212 2\u03b5 \u2208 (0, 1), ensures that\nA = \u2212R\u0303^T R\u0303 \u2212 \u03b5I is such that Condition 1 is satisfied for h \u2264 1 and therefore Theorem 1 holds.\nNote that \u03b4 = 2(1 \u2212 \u03b5) \u2208 (0, 2) is also sufficient for stability; however, the \u03b4 from Theorem 2 makes\nthe trajectory free from oscillations (critically damped), see Figure 3. This is further discussed in\nthe Appendix.\n\nConvolutional blocks. The symmetric parametrization assumed in the fully connected case cannot\nbe used for a convolutional layer. We will instead make use of the following result:\nLemma 1. The convolutional layer Eq. (4) with zero-padding p \u2208 N, and filter size n_C = 2p + 1\nhas a Jacobian of the form Eq. (6), with A \u2208 R^{n_X^2 N_C \u00d7 n_X^2 N_C}. The diagonal elements of this matrix,\nnamely, A_{n_X^2 c + j, n_X^2 c + j}, 0 \u2264 c < N_C, 0 \u2264 j < n_X^2, are the central elements of the (c + 1)-th\nconvolutional filter mapping X^{c+1}(k) into X^{c+1}(k + 1), denoted by C^c_{icentre}. The other elements in\nrow n_X^2 c + j, 0 \u2264 c < N_C, 0 \u2264 j < n_X^2, are the remaining filter values mapping to X^{(c+1)}(k + 1).\nTo fulfill the stability condition, the first step is to set C^c_{icentre} = \u22121 \u2212 \u03b4_c, where \u03b4_c is a trainable\nparameter satisfying |\u03b4_c| < 1 \u2212 \u03b7, and 0 < \u03b7 \u226a 1 is a hyper-parameter. Then we will suitably bound\nthe \u221e-norm of the Jacobian by constraining the remaining filter elements. The steps are summarized\nin Algorithm 2 which is inspired by the Gershgorin Theorem [21]. The following result is obtained:\nTheorem 3. 
(Convolutional weight projection)\nAlgorithm 2 fulfils Condition 1 for the convolutional layer, for h \u2264 1, hence Theorem 1 holds.\nNote that the algorithm complexity scales with the number of filters. A simple design choice for the\nlayer is to set \u03b4 = 0, which results in C^c_{icentre} being fixed at \u22121 (see footnote 9).\n\n9 Setting \u03b4 = 0 removes the need for hyper-parameter \u03b7 but does not necessarily reduce conservativeness as\nit will further constrain the remaining element of the filter bank. This is further discussed in the supplementary.\n\n5 Experiments\nExperiments were conducted comparing NAIS-Net with ResNet, and variants thereof, using both\nfully-connected (MNIST, section 5.1) and convolutional (CIFAR-10/100, section 5.2) architectures to\nquantitatively assess the performance advantage of having a VDNN where stability is enforced.\n\n5.1 Preliminary Analysis on MNIST\nFor the MNIST dataset [32] a single-block NAIS-Net was compared with 9 different 30-layer ResNet\nvariants each with a different combination of the following features: SH (shared weights i.e.\ntime-invariant), NA (non-autonomous i.e. input skip connections), BN (with Batch Normalization), Stable\n(stability enforced by Algorithm 1). For example, RESNET-SH-NA-BN refers to a 30-layer ResNet\nthat is time-invariant because weights are shared across all layers (SH), non-autonomous because\nit has skip connections from the input to all layers (NA), and uses batch normalization (BN). Since\nNAIS-Net is time-invariant, non-autonomous, and input/output stable (i.e. SH-NA-STABLE), the\nchosen ResNet variants represent ablations of these three features. 
For instance, RESNET-SH-NA\nis a NAIS-Net without I/O stability being enforced by the reprojection step described in Algorithm 1,\nand RESNET-NA is a non-stable NAIS-Net that is time-variant, i.e. with non-shared weights, etc. The\nNAIS-Net was unrolled for K = 30 iterations for all input patterns. All networks were trained using\nstochastic gradient descent with momentum 0.9 and learning rate 0.1, for 150 epochs.\n\nResults. Test accuracy for NAIS-NET was 97.28%, while RESNET-SH-BN was second best with\n96.69%, but without BatchNorm (RESNET-SH) it only achieved 95.86% (averaged over 10 runs).\nAfter training, the behavior of each network variant was analyzed by passing the activation, x(i),\nthrough the softmax classifier and measuring the cross-entropy loss. The loss at each iteration describes\nthe trajectory of each sample in the latent space: the closer the sample to the correct steady state the\ncloser the loss to zero (see Figure 3). All variants initially refine their predictions at each iteration\nsince the loss tends to decrease at each layer, but at different rates. However, NAIS-Net is the\nonly one that does so monotonically, not increasing loss as i approaches 30. Figure 3 shows how\nneuron activations in NAIS-Net converge to different steady state activations for different input\npatterns instead of all converging to zero as is the case with RESNET-SH-STABLE, confirming the\nresults of [15]. Importantly, NAIS-Net is able to learn even with the stability constraint, showing that\nnon-autonomy is key to obtaining representations that are stable and good for learning the task.\nNAIS-Net also allows training of unbounded processing depth without any feature normalization\nsteps. Note that BN actually speeds up loss convergence, especially for RESNET-SH-NA-BN (i.e.\nunstable NAIS-Net). 
Adding BN makes the behavior very similar to NAIS-Net because BN also\nimplicitly normalizes the Jacobian, but it does not ensure that its eigenvalues are in the stability\nregion.\n\n5.2 Image Classification on CIFAR-10/100\nExperiments on image classification were performed on standard image recognition benchmarks\nCIFAR-10 and CIFAR-100 [27]. These benchmarks are simple enough to allow for multiple runs to\ntest for statistical significance, yet sufficiently complex to require convolutional layers.\n\nSetup. The following standard architecture was used to compare NAIS-Net with ResNet (see footnote 10): three\nsets of 18 residual blocks with 16, 32, and 64 filters, respectively, for a total of 54 stacked blocks.\nNAIS-Net was tested in two versions: NAIS-NET1 where each block is unrolled just once, for a total\nprocessing depth of 108, and NAIS-NET10 where each block is unrolled 10 times, for\na total processing depth of 540. The initial learning rate of 0.1 was decreased by a factor of 10 at\nepochs 150, 250 and 350 and the experiments were run for 450 epochs. Note that each block in the\nResNet of [17] has two convolutions (plus BatchNorm and ReLU) whereas NAIS-Net unrolls with a\nsingle convolution. Therefore, to make the comparison of the two architectures as fair as possible by\nusing the same number of parameters, a single convolution was also used for ResNet.\n\nResults. Table 5.2 compares the performance on the two datasets, averaged over 5 runs. For\nCIFAR-10, NAIS-Net and ResNet performed similarly, and unrolling NAIS-Net for more than one\niteration had little effect. This was not the case for CIFAR-100 where NAIS-NET10 improves over\nNAIS-NET1 by 1%. Moreover, although mean accuracy is slightly lower than ResNet, the variance\nis considerably lower. Figure 4 shows that NAIS-Net is less prone to overfitting than a classic ResNet,\nreducing the generalization gap by 33%. 
This is a consequence of the stability constraint which\nimparts a degree of robust invariance to input perturbations (see Section 3). It is also important to\nnote that NAIS-Net can unroll up to 540 layers, and still train without any problems.\n\n10 https://github.com/tensorflow/models/tree/master/official/resnet\n\nMODEL | CIFAR-10 TRAIN / TEST | CIFAR-100 TRAIN / TEST\nRESNET | 99.86\u00b10.03 / 91.72\u00b10.38 | 97.42\u00b10.06 / 66.34\u00b10.82\nNAIS-NET1 | 99.37\u00b10.08 / 91.24\u00b10.10 | 86.90\u00b11.47 / 65.00\u00b10.52\nNAIS-NET10 | 99.50\u00b10.02 / 91.25\u00b10.46 | 86.91\u00b10.42 / 66.07\u00b10.24\n\nFigure 4: CIFAR Results. (Left) Classification accuracy on the CIFAR-10 and CIFAR-100 datasets averaged\nover 5 runs. Generalization gap on CIFAR-10. (Right) Dotted curves (training set) are very similar for the\ntwo networks but NAIS-Net has a considerably lower test curve (solid).\n\n5.3 Pattern-Dependent Processing Depth\nFor simplicity, the number of unrolling steps per block in the previous experiments was fixed. A\nmore general and potentially more powerful setup is to have the processing depth adapt automatically.\nSince NAIS-Net blocks are guaranteed to converge to a pattern-dependent steady state after an\nindeterminate number of iterations, processing depth can be controlled dynamically by terminating\nthe unrolling process whenever the distance between a layer representation, x(i), and that of the\nimmediately previous layer, x(i \u2212 1), drops below a specified threshold. With this mechanism,\nNAIS-Net can determine the processing depth for each input pattern. Intuitively, one could speculate\nthat similar input patterns would require similar processing depth in order to be mapped to the same\nregion in latent space. To explore this hypothesis, NAIS-Net was trained on CIFAR-10 with an\nunrolling threshold of \u03b5 = 10^{\u22124}. At test time the network was unrolled using the same threshold.\nFigure 5 shows selected images from four different classes organized according to the final network\ndepth used to classify them after training. The qualitative differences seen from low to high depth\nsuggest that NAIS-Net is using processing depth as an additional degree of freedom so that, for a\ngiven training run, the network learns to use models of different complexity (depth) for different types\nof inputs within each class. To be clear, the hypothesis is not that depth correlates to some notion of\ninput complexity where the same images are always classified at the same depth across runs.\n\n(a) frog  (b) bird  (c) ship  (d) airplane\n\nFigure 5: Image samples with corresponding NAIS-Net depth. The figure shows samples from CIFAR-10\ngrouped by final network depth, for four different classes. The qualitative differences evident in images inducing\ndifferent final depths indicate that NAIS-Net adapts processing systematically according to characteristics of the\ndata. For example, \u201cfrog\u201d images with textured background are processed with fewer iterations than those with\nplain background. Similarly, \u201cship\u201d and \u201cairplane\u201d images having a predominantly blue color are processed\nwith lower depth than those that are grey/white, and \u201cbird\u201d images are grouped roughly according to bird\nsize with larger species such as ostriches and turkeys being classified with greater processing depth. A higher\ndefinition version of the figure is made available in the supplementary materials.\n\n6 Conclusions\nWe presented NAIS-Net, a non-autonomous residual architecture that can be unrolled until the latent\nspace representation converges to a stable input-dependent state. 
This is achieved thanks to its stability\nand non-autonomy properties. We derived stability conditions for the model and proposed two\nefficient reprojection algorithms, for fully-connected and convolutional layers, that keep the\nnetwork parameters within the set of feasible solutions during training.\nNAIS-Net achieves asymptotic stability and, as a consequence, input-output stability. Stability\nmakes the model more robust, and we observe a substantial reduction of the generalization gap\nwithout negatively impacting performance. The question of scalability to benchmarks such\nas ImageNet [8] will be a main topic of future work.\nWe believe that cross-breeding machine learning and control theory will open up many new interesting\navenues for research, and that more robust and stable variants of commonly used neural networks,\nboth feed-forward and recurrent, will be possible.\n\nAcknowledgements\nWe want to thank Wojciech Ja\u015bkowski, Rupesh Srivastava and the anonymous reviewers for their\ncomments on the idea and initial drafts of the paper.\n\nReferences\n[1] U. M. Ascher and L. R. Petzold. Computer methods for ordinary differential equations and differential-algebraic equations, volume 61. SIAM, 1998.\n\n[2] P. Baldi and K. Hornik. Universal approximation and learning of trajectories using oscillators. In Advances in Neural Information Processing Systems, pages 451\u2013457, 1996.\n\n[3] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram, and Z. Zhu. Exploring neural transducers for end-to-end speech recognition. CoRR, abs/1707.07413, 2017.\n\n[4] Y. Bengio, P. Simard, and P. Frasconi. 
Learning long-term dependencies with gradient descent is difficult. Neural Networks, 5(2):157\u2013166, 1994.\n\n[5] Bo Chang, Lili Meng, Eldad Haber, Frederick Tung, and David Begert. Multi-level residual networks from dynamical systems view. arXiv preprint arXiv:1710.10348, 2017.\n\n[6] K. Cho, B. Van Merri\u00ebnboer, C. Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.\n\n[7] M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier. Parseval networks: Improving robustness to adversarial examples. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 854\u2013863, Sydney, Australia, 06\u201311 Aug 2017. PMLR.\n\n[8] J. Deng, W. Dong, R. Socher, L. J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In Conference on Computer Vision and Pattern Recognition (CVPR), 2009.\n\n[9] K. Doya. Bifurcations in the learning of recurrent neural networks. In Circuits and Systems, 1992. ISCAS\u201992. Proceedings., 1992 IEEE International Symposium on, volume 6, pages 2777\u20132780. IEEE, 1992.\n\n[10] M. Figurnov, A. Sobolev, and D. Vetrov. Probabilistic adaptive computation time. CoRR, abs/1712.00386, 2017.\n\n[11] A. Gomez, M. Ren, R. Urtasun, and R. B. Grosse. The reversible residual network: Backpropagation without storing activations. In NIPS, 2017.\n\n[12] A. Graves. Adaptive computation time for recurrent neural networks. CoRR, abs/1603.08983, 2016.\n\n[13] K. Greff, R. K. Srivastava, and J. Schmidhuber. Highway and residual networks learn unrolled iterative estimation. arXiv preprint arXiv:1612.07771, 2016.\n\n[14] K. Gregor and Y. LeCun. Learning fast approximations of sparse coding. In International Conference on Machine Learning (ICML), 2010.\n\n[15] E. 
Haber and L. Ruthotto. Stable architectures for deep neural networks. arXiv preprint arXiv:1705.03341, 2017.\n\n[16] R. Haschke and J. J. Steil. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing, 64:25\u201338, 2005.\n\n[17] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. arXiv preprint arXiv:1502.01852, 2015.\n\n[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770\u2013778, Dec 2016.\n\n[19] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, 1991. Advisor: J. Schmidhuber.\n\n[20] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735\u20131780, 1997.\n\n[21] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012.\n\n[22] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.\n\n[23] S. Jastrzebski, D. Arpit, N. Ballas, V. Verma, T. Che, and Y. Bengio. Residual connections encourage iterative inference. arXiv preprint arXiv:1710.04773, 2017.\n\n[24] S. Kanai, Y. Fujiwara, and S. Iwamura. Preventing gradient explosions in gated recurrent units. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 435\u2013444. Curran Associates, Inc., 2017.\n\n[25] H. K. Khalil. Nonlinear Systems. Pearson Education, 3rd edition, 2014.\n\n[26] J. N. Knight. Stability analysis of recurrent neural networks with applications. Colorado State University, 2008.\n\n[27] A. Krizhevsky and G. Hinton. 
Learning multiple layers of features from tiny images. 2009.\n\n[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), 2012.\n\n[29] J. K. Lang and M. J. Witbrock. Learning to tell two spirals apart. In D. Touretzky, G. Hinton, and T. Sejnowski, editors, Proceedings of the Connectionist Models Summer School, pages 52\u201359, Mountain View, CA, 1988.\n\n[30] G. Larsson, M. Maire, and G. Shakhnarovich. Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648, 2016.\n\n[31] T. Laurent and J. von Brecht. A recurrent neural network without chaos. arXiv preprint arXiv:1612.06212, 2016.\n\n[32] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.\n\n[33] Qianli Liao and Tomaso Poggio. Bridging the gaps between residual learning, recurrent neural networks and visual cortex. arXiv preprint arXiv:1604.03640, 2016.\n\n[34] Y. Lu, A. Zhong, D. Bin, and Q. Li. Beyond finite layer neural networks: Bridging deep architectures and numerical differential equations, 2018.\n\n[35] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. International Conference on Learning Representations, 2018.\n\n[36] F. Monti, D. Boscaini, J. Masci, E. Rodol\u00e0, J. Svoboda, and M. M. Bronstein. Geometric deep learning on graphs and manifolds using mixture model CNNs. In CVPR2017, 2017.\n\n[37] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310\u20131318, 2013.\n\n[38] J. Singh and N. Barabanov. Stability of discrete time recurrent neural networks and nonlinear optimization problems. Neural Networks, 74:58\u201372, 2016.\n\n[39] E. Sontag. 
Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer-Verlag, 2nd edition, 1998.\n\n[40] R. K. Srivastava, K. Greff, and J. Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, May 2015.\n\n[41] Jochen J Steil. Input Output Stability of Recurrent Neural Networks. Cuvillier G\u00f6ttingen, 1999.\n\n[42] S. H. Strogatz. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering. Westview Press, 2nd edition, 2015.\n\n[43] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.\n\n[44] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.\n\n[45] C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? International Conference on Learning Representations, 2018.\n\n[46] A. Veit and S. Belongie. Convolutional networks with adaptive computation graphs. CoRR, 2017.\n\n[47] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal. On orthogonality and learning recurrent networks with long term dependencies. arXiv preprint arXiv:1702.00071, 2017.\n\n[48] E. Weinan. A proposal on machine learning via dynamical systems. Communications in Mathematics and Statistics, 5(1):1\u201311, 2017.\n\n[49] Y. Yoshida and T. Miyato. Spectral norm regularization for improving the generalizability of deep learning. arXiv preprint arXiv:1705.10941, 2017.\n\n[50] X. Zhang, Z. Li, C. C. Loy, and D. Lin. Polynet: A pursuit of structural diversity in very deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3900\u20133908. IEEE, 2017.\n\n[51] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. 
Torr. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 1529\u20131537, 2015.\n\n[52] J. G. Zilly, R. K. Srivastava, J. Koutn\u00edk, and J. Schmidhuber. Recurrent highway networks. In ICML2017, pages 4189\u20134198. PMLR, 2017.", "award": [], "sourceid": 1571, "authors": [{"given_name": "Marco", "family_name": "Ciccone", "institution": "Politecnico di Milano"}, {"given_name": "Marco", "family_name": "Gallieri", "institution": "NNAISENSE"}, {"given_name": "Jonathan", "family_name": "Masci", "institution": "NNAISENSE"}, {"given_name": "Christian", "family_name": "Osendorfer", "institution": "NNAISENSE"}, {"given_name": "Faustino", "family_name": "Gomez", "institution": "NNAISENSE"}]}