{"title": "Universal Approximation of Input-Output Maps by Temporal Convolutional Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 14071, "page_last": 14081, "abstract": "There has been a recent shift in sequence-to-sequence modeling from recurrent network architectures to convolutional network architectures due to computational advantages in training and operation while still achieving competitive performance. For systems having limited long-term temporal dependencies, the approximation capability of recurrent networks is essentially equivalent to that of temporal convolutional nets (TCNs). We prove that TCNs can approximate a large class of input-output maps having approximately finite memory to arbitrary error tolerance. Furthermore, we derive quantitative approximation rates for deep ReLU TCNs in terms of the width and depth of the network and modulus of continuity of the original input-output map, and apply these results to input-output maps of systems that admit finite-dimensional state-space realizations (i.e., recurrent models).", "full_text": "Universal Approximation of Input-Output Maps by Temporal Convolutional Nets

Joshua Hanson
University of Illinois
Urbana, IL 61801
jmh4@illinois.edu

Maxim Raginsky
University of Illinois
Urbana, IL 61801
maxim@illinois.edu

Abstract

There has been a recent shift in sequence-to-sequence modeling from recurrent network architectures to convolutional network architectures due to computational advantages in training and operation while still achieving competitive performance. For systems having limited long-term temporal dependencies, the approximation capability of recurrent networks is essentially equivalent to that of temporal convolutional nets (TCNs). 
We prove that TCNs can approximate a large class of input-output maps having approximately finite memory to arbitrary error tolerance. Furthermore, we derive quantitative approximation rates for deep ReLU TCNs in terms of the width and depth of the network and the modulus of continuity of the original input-output map, and we apply these results to input-output maps of systems that admit finite-dimensional state-space realizations (i.e., recurrent models).

1 Introduction

Until recently, recurrent networks have been considered the de facto standard for modeling input-output maps that transform sequences to sequences. Convolutional network architectures are becoming favorable alternatives for several applications due to reduced computational overhead incurred during both training and regular operation, while often performing as well as or better than recurrent architectures in practice. The computational advantage of convolutional networks follows from the lack of feedback elements, which enables shifted copies of the input sequence to be processed in parallel rather than sequentially [Gehring et al., 2017]. Convolutional architectures have demonstrated exceptional accuracy in sequence modeling tasks that have typically been approached using recurrent architectures, such as machine translation, audio generation, and language modeling [Dauphin et al., 2017, Kalchbrenner et al., 2016, van den Oord et al., 2016, Wu et al., 2016, Gehring et al., 2017, Johnson and Zhang, 2017].

One explanation for this shift is that both convolutional and recurrent architectures are inherently suited to modeling systems with limited long-term dependencies. Recurrent models possess infinite memory (the output at each time is a function of the initial conditions and the entire history of inputs until that time), and thus are strictly more expressive than finite-memory autoregressive models. However, in synthetic stress tests designed to measure the ability to model long-term behavior, recurrent architectures often fail to learn long sequences [Bai et al., 2018]. Furthermore, this unlimited memory property is usually unnecessary, a point supported both in theory [Sharan et al., 2018] and in practice [Chelba et al., 2017, Gehring et al., 2017]. In situations where it is only important to learn finite-length sequences, feedforward architectures based on temporal convolutions (temporal convolutional nets, or TCNs) can achieve similar results and even outperform recurrent nets [Dauphin et al., 2017, Yin et al., 2017, Bai et al., 2018].

These results prompt a closer look at the conditions under which convolutional architectures provide better approximation than recurrent architectures. Recent work by Miller and Hardt [2019] has shown that recurrent models that are exponentially stable (in the sense that the effect of the initial conditions on the output decays exponentially with time) can be efficiently approximated by feedforward models. A key consequence is that exponentially stable recurrent models can be approximated by systems that only consider a finite number of recent values of the input sequence when determining the value of the subsequent output.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

However, this notion of stability is inherently tied to a particular state-space realization, and it is not difficult to come up with examples of sequence-to-sequence maps that have both a stable and an unstable state-space realization (e.g., simply by adding unstable states that do not affect the output). This suggests that the question of approximating sequence-to-sequence maps by feedforward convolutional maps should be studied by abstracting away the notion of stability and only requiring that the system output depend appreciably on recent input values and negligibly on input values in the distant past. The formalization of this property was introduced by Sandberg [1991] under the name of approximately finite memory, building on earlier work by Boyd and Chua [1985]. Outputs of systems characterized by this property can be approximated by the output of the same system applied to a truncated version of the input sequence. Such systems are naturally suited to being modeled by TCNs, which by construction only operate on values of the input sequence within a finite horizon into the past.

In this work, we aim to develop quantitative results on the capability of TCNs to approximate input-output maps that are causal, time-invariant, and have approximately finite memory. In Section 2, we introduce the necessary definitions and review the approximately finite memory property due to Sandberg [1991]. Section 3 gives the main result for approximating input-output maps by ReLU TCNs, together with a quantitative result on the equivalence between approximately finite memory and the related notion of fading memory [Boyd and Chua, 1985, Park and Sandberg, 1992]. These results are applied in Section 4 to recurrent models that are incrementally stable [Tran et al., 2017], i.e., for which the influence of the initial condition is asymptotically negligible. We show that incrementally stable recurrent models have approximately finite memory, and then use this formalism to derive a generalization of the result of Miller and Hardt [2019]. We provide a comparison in Section 5 to other architectures used for approximating input-output maps.
All omitted proofs are provided in the Supplementary Material.

2 Input-output maps and approximately finite memory

Let $S$ denote the set of all real-valued sequences $u = (u_t)_{t \in \mathbb{Z}_+}$, where $\mathbb{Z}_+ := \{0, 1, 2, \ldots\}$. An input-output map (or i/o map, for short) is a nonlinear operator $F : S \to S$ that maps an input sequence $u \in S$ to an output sequence $y = Fu \in S$. (We are considering real-valued input and output sequences for simplicity; all our results carry over to vector-valued sequences at the expense of additional notation.) We will denote the application and the composition of i/o maps by concatenation. In this paper, we are concerned with i/o maps $F$ that are:

- causal: for any $t \in \mathbb{Z}_+$, $u_{0:t} = v_{0:t}$ implies $(Fu)_t = (Fv)_t$, where $u_{0:t} := (u_0, \ldots, u_t)$;
- time-invariant: for any $k \in \mathbb{Z}_+$,

  $(F R^k u)_t = \begin{cases} (Fu)_{t-k}, & t \ge k \\ 0, & 0 \le t < k \end{cases}$

  where $R : S \to S$ is the right shift operator $(Ru)_t := u_{t-1} 1\{t \ge 1\}$.

The key notion we will work with is that of approximately finite memory [Sandberg, 1991]:

Definition 2.1. An i/o map $F$ has approximately finite memory on a set of inputs $M \subseteq S$ if for any $\varepsilon > 0$ there exists $m \in \mathbb{Z}_+$, such that

  $\sup_{u \in M} \sup_{t \in \mathbb{Z}_+} |(Fu)_t - (F W_{t,m} u)_t| \le \varepsilon$,   (1)

where $W_{t,m} : S \to S$ is the windowing operator $(W_{t,m} u)_\tau := u_\tau 1\{\max\{t-m, 0\} \le \tau \le t\}$. We will denote by $m^*_F(\varepsilon)$ the smallest $m \in \mathbb{Z}_+$ for which (1) holds.

If $m^*_F(0) < \infty$, then we say that $F$ has finite memory on $M$. If $F$ is causal and time-invariant, this is equivalent to the existence of an integer $m \in \mathbb{Z}_+$ and a nonlinear functional $f : \mathbb{R}^{m+1} \to \mathbb{R}$, such that $f(0, \ldots, 0) = 0$ and, for any $u \in M$ and any $t \in \mathbb{Z}_+$,

  $(Fu)_t = f(u_{t-m}, u_{t-m+1}, \ldots, u_t)$,   (2)

with the convention that $u_s = 0$ if $s < 0$. In this work, we will focus on the important case when $f$ is a feedforward neural net with rectified linear unit (ReLU) activations, $\mathrm{ReLU}(x) := \max\{x, 0\}$. That is, there exist $k$ affine maps $A_i : \mathbb{R}^{d_i} \to \mathbb{R}^{d_{i+1}}$ with $d_1 = m + 1$ and $d_{k+1} = 1$, such that $f$ is given by the composition

  $f = A_k \circ \mathrm{ReLU} \circ A_{k-1} \circ \mathrm{ReLU} \circ \cdots \circ \mathrm{ReLU} \circ A_1$,

where, for any $r \ge 1$, $\mathrm{ReLU}(x_1, \ldots, x_r) := (\mathrm{ReLU}(x_1), \ldots, \mathrm{ReLU}(x_r))$. Here, $k$ is the depth (number of layers) and $\max\{d_2, \ldots, d_k\}$ is the width (largest number of units in any hidden layer).

Definition 2.2. An i/o map $F$ is a ReLU temporal convolutional net (or ReLU TCN, for short) with context length $m$ if (2) holds for some feedforward ReLU neural net $f : \mathbb{R}^{m+1} \to \mathbb{R}$.

Remark 2.3. While such an $F$ is evidently causal, it is generally not time-invariant unless $f(0, \ldots, 0) = 0$.

3 The universal approximation theorem

In this section, we state and prove our main result: any causal and time-invariant i/o map that has approximately finite memory and satisfies an additional continuity condition can be approximated arbitrarily well by a ReLU temporal convolutional net. In what follows, we will consider i/o maps with uniformly bounded inputs, i.e., inputs in the set

  $M(R) := \{u \in S : \|u\|_\infty := \sup_{t \in \mathbb{Z}_+} |u_t| \le R\}$

for some $R > 0$.

For any $t \in \mathbb{Z}_+$ and any $u \in M(R)$, the finite subsequence $u_{0:t} = (u_0, \ldots, u_t)$ is an element of the cube $[-R, R]^{t+1} \subset \mathbb{R}^{t+1}$; conversely, any vector $x \in [-R, R]^{t+1}$ can be embedded into $M(R)$ by setting $u_s = x_s 1\{0 \le s \le t\}$. To any causal and time-invariant i/o map $F$ we can associate the nonlinear functional $\tilde{F}_t : \mathbb{R}^{t+1} \to \mathbb{R}$ defined in the obvious way: for any $x = (x_0, x_1, \ldots, x_t) \in \mathbb{R}^{t+1}$,

  $\tilde{F}_t(x) := (Fu)_t$,

where $u \in S$ is any input such that $u_s = x_s$ for $s \in \{0, 1, \ldots, t\}$ (the values of $u_s$ for $s > t$ can be arbitrary by causality). We impose the following assumptions on $F$:

Assumption 3.1. The i/o map $F$ has approximately finite memory on $M(R)$.

Assumption 3.2. For any $t \in \mathbb{Z}_+$, the functional $\tilde{F}_t : \mathbb{R}^{t+1} \to \mathbb{R}$ is uniformly continuous on $[-R, R]^{t+1}$ with modulus of continuity

  $\omega_{t,F}(\delta) := \sup\{|\tilde{F}_t(x) - \tilde{F}_t(x')| : x, x' \in [-R, R]^{t+1},\ \|x - x'\|_\infty \le \delta\}$

and inverse modulus of continuity

  $\omega^{-1}_{t,F}(\varepsilon) := \sup\{\delta > 0 : \omega_{t,F}(\delta) \le \varepsilon\}$,

where $\|x\|_\infty := \max_{0 \le i \le t} |x_i|$ is the $\ell^\infty$ norm on $\mathbb{R}^{t+1}$.

The following qualitative universal approximation result was obtained by Sandberg [1991]: if a causal and time-invariant i/o map $F$ satisfies the above two assumptions, then, for any $\varepsilon > 0$, there exist $m \in \mathbb{Z}_+$, an affine map $A : \mathbb{R}^{m+1} \to \mathbb{R}^d$, and a lattice map $\ell : \mathbb{R}^d \to \mathbb{R}$, such that

  $\sup_{u \in M(R)} \sup_{t \in \mathbb{Z}_+} |(Fu)_t - \ell \circ A(u_{t-m:t})| < \varepsilon$,   (3)

where we say that a map $\ell : \mathbb{R}^d \to \mathbb{R}$ is a lattice map if $\ell(x_0, \ldots, x_{d-1})$ is generated from $x = (x_0, \ldots, x_{d-1})$ by a finite number of min and max operations that do not depend on $x$. Any lattice map can be implemented using ReLU units, so (3) is a ReLU TCN approximation guarantee. Our main result is a quantitative version of Sandberg's theorem:

Theorem 3.3. Let $F$ be a causal and time-invariant i/o map satisfying Assumptions 3.1 and 3.2.
Then, for any $\varepsilon > 0$ and any $\gamma \in (0, 1)$, there exists a ReLU TCN $\hat{F}$ with context length $m = m^*_F(\gamma\varepsilon)$, width $m + 2$, and depth $\left(\frac{O(R)}{\omega^{-1}_{m,F}((1-\gamma)\varepsilon)}\right)^{m+2}$, such that

  $\sup_{u \in M(R)} \|Fu - \hat{F}u\|_\infty < \varepsilon$.   (4)

Remark 3.4. The role of the additional parameter $\gamma \in (0, 1)$ is to trade off the context length and the depth of the ReLU TCN.

Remark 3.5. While the approximating ReLU TCN $\hat{F}$ is clearly causal, it may not be time-invariant unless $\hat{f}(0, \ldots, 0) = 0$, where $\hat{f}$ is the ReLU net constructed in the proof below.

Proof. Let $m = m^*_F(\gamma\varepsilon)$. Since $\tilde{F}_m : \mathbb{R}^{m+1} \to \mathbb{R}$ is continuous with modulus of continuity $\omega_{m,F}(\cdot)$, there exists a ReLU net $\hat{f} : \mathbb{R}^{m+1} \to \mathbb{R}$ of width $m + 2$ and depth $\left(\frac{O(R)}{\omega^{-1}_{m,F}((1-\gamma)\varepsilon)}\right)^{m+2}$, such that

  $\sup_{x \in [-R,R]^{m+1}} |\tilde{F}_m(x) - \hat{f}(x)| < (1 - \gamma)\varepsilon$

[Hanin and Sellke, 2018]. Consider the TCN $\hat{F}$ defined by $(\hat{F}u)_t := \hat{f}(u_{t-m}, \ldots, u_t)$. Fix an input $u \in M(R)$ and consider two cases:

1) If $t \ge m$, then $u_{t-m:t} = (L^{t-m} W_{t,m} u)_{0:m}$, where $L : S \to S$ is the left shift operator $(Lu)_t := u_{t+1}$. Therefore,

  $(F W_{t,m} u)_t \overset{(a)}{=} (F R^{t-m} L^{t-m} W_{t,m} u)_t \overset{(b)}{=} (F L^{t-m} W_{t,m} u)_m \overset{(c)}{=} \tilde{F}_m(u_{t-m:t})$,

where (a) uses the fact that $t \ge m$, (b) is by time invariance of $F$, and (c) is by the definition of $\tilde{F}_m$.

2) If $t < m$, then $u_{t-m:t} = (R^{m-t} W_{t,m} u)_{0:m}$ (recall the convention that, for any $v$, we set $v_s = 0$ whenever $s < 0$). Therefore,

  $(F W_{t,m} u)_t \overset{(a)}{=} (R^{m-t} F W_{t,m} u)_m \overset{(b)}{=} (F R^{m-t} W_{t,m} u)_m \overset{(c)}{=} \tilde{F}_m(u_{t-m:t})$,

where (a) uses the fact that $m > t$, (b) is by time invariance, and (c) is by the definition of $\tilde{F}_m$.

In either case, the triangle inequality gives

  $|(Fu)_t - (\hat{F}u)_t| \le |(Fu)_t - (F W_{t,m} u)_t| + |(F W_{t,m} u)_t - (\hat{F}u)_t| = |(Fu)_t - (F W_{t,m} u)_t| + |\tilde{F}_m(u_{t-m:t}) - \hat{f}(u_{t-m:t})| < \gamma\varepsilon + (1 - \gamma)\varepsilon = \varepsilon$.

Since this holds for all $t$ and all $u$ with $\|u\|_\infty \le R$, the result follows.

3.1 The fading memory property

In order to apply Theorem 3.3, we need control on the context length $m^*_F(\cdot)$ and on the modulus of continuity $\omega_{t,F}(\cdot)$. In general, these quantities are difficult to estimate. However, it was shown by Park and Sandberg [1992] that the property of approximately finite memory is closely related to the notion of fading memory, first introduced by Boyd and Chua [1985]. Intuitively, an i/o map $F$ has fading memory if the outputs at any time $t$ due to any two inputs $u$ and $v$ that were close to one another in the recent past will also be close.

Let $W$ denote the subset of $S$ consisting of all sequences $w$, such that $w_t \in (0, 1]$ for all $t$ and $w_t \downarrow 0$ as $t \to \infty$. We will refer to the elements of $W$ as weighting sequences. Then we have the following definition, due to Park and Sandberg [1992]:

Definition 3.6. We say that an i/o map $F$ has fading memory on $M \subseteq S$ with respect to $w \in W$ if for any $\varepsilon > 0$ there exists $\delta > 0$ such that, for all $u, v \in M$ and all $t \in \mathbb{Z}_+$,

  $\max_{s \in \{0, \ldots, t\}} w_{t-s} |u_s - v_s| < \delta \implies |(Fu)_t - (Fv)_t| < \varepsilon$.   (5)

The weighting sequence $w$ governs the rate at which the past values of the input are discounted in determining the current output.
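The weighted closeness measure on the left-hand side of (5) is easy to compute directly. A minimal numpy sketch, assuming a geometric weighting sequence $w_k = \lambda^k$ (a hypothetical choice of $w \in W$, not prescribed by the paper), illustrating that perturbations in the distant past are heavily discounted while recent perturbations are not:

```python
import numpy as np

def weighted_distance(u, v, t, lam=0.5):
    """Left-hand side of (5): max over 0 <= s <= t of w_{t-s} |u_s - v_s|,
    with the hypothetical geometric weighting sequence w_k = lam**k."""
    s = np.arange(t + 1)
    w = lam ** (t - s)          # w_{t-s}: heaviest weight on the most recent inputs
    return float(np.max(w * np.abs(u[: t + 1] - v[: t + 1])))

rng = np.random.default_rng(0)
t = 50
u = rng.uniform(-1.0, 1.0, t + 1)

v = u.copy()
v[:10] = rng.uniform(-1.0, 1.0, 10)     # perturb only the distant past (s <= 9)
d_past = weighted_distance(u, v, t)     # tiny: bounded by 2 * lam**41

v2 = u.copy()
v2[-1] += 1.0                           # perturb the most recent input (s = t)
d_recent = weighted_distance(u, v2, t)  # w_0 * 1.0 = 1.0

print(d_past, d_recent)
```

Two inputs that disagree only far in the past are nearly indistinguishable under this metric, which is exactly the sense in which a fading-memory map treats them as interchangeable.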
To capture the best trade-offs in (5), we will also use a $w$-dependent modulus of continuity:

  $\alpha_{w,F}(\delta) := \sup\{|(Fu)_t - (Fv)_t| : t \in \mathbb{Z}_+,\ u, v \in M,\ \max_{s \in \{0, \ldots, t\}} w_{t-s} |u_s - v_s| \le \delta\}$.

It was shown by Park and Sandberg [1992] that an i/o map satisfies Assumptions 3.1 and 3.2 if and only if it has fading memory with respect to some (and hence any) $w \in W$. The following result provides a quantitative version of this equivalence:

Proposition 3.7. Let $F$ be an i/o map.

1. If $F$ satisfies Assumptions 3.1 and 3.2, then it has fading memory on $M$ with respect to any weighting sequence $w \in W$, and

  $\alpha^{-1}_{w,F}(\varepsilon) \ge w_{m^*_F(\varepsilon/3)}\, \omega^{-1}_{m^*_F(\varepsilon/3),F}(\varepsilon/3)$.   (6)

2. If $F$ has fading memory on $M(R)$ with respect to some $w \in W$, then it satisfies Assumptions 3.1 and 3.2, and

  $m^*_F(\varepsilon; R) \le \inf\{m \in \mathbb{Z}_+ : w_m \le \alpha^{-1}_{w,F}(\varepsilon)/R\}$ and $\omega_{t,F}(\delta) \le \alpha_{w,F}(\delta)$.   (7)

4 Recurrent systems

So far, we have considered arbitrary i/o maps $F : S \to S$. However, many such maps admit state-space realizations [Sontag, 1998]: there exist a state transition map $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n$, an output map $g : \mathbb{R}^n \to \mathbb{R}$, and an initial condition $\xi \in \mathbb{R}^n$, such that the output sequence $y = Fu$ is determined recursively by

  $x_{t+1} = f(x_t, u_t)$,   (8a)
  $y_t = g(x_t)$,   (8b)

with $x_0 = \xi$. The i/o map $F$ realized in this way is evidently causal, and it is time-invariant if $f(\xi, 0) = \xi$ and $g(\xi) = 0$. In this section, we will identify the conditions under which recurrent models satisfy Assumptions 3.1 and 3.2.
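For concreteness, the recursion (8) is straightforward to simulate; below is a minimal sketch with a hypothetical tanh transition map and linear readout (the specific maps are illustrative choices, not taken from the paper). The linear part of the transition has spectral norm 1/2, so the system forgets its initial condition, previewing the incremental-stability property studied next:

```python
import numpy as np

def simulate(f, g, u, xi):
    """Run the recursion (8): x_{t+1} = f(x_t, u_t), y_t = g(x_t), x_0 = xi."""
    x, ys = np.array(xi, dtype=float), []
    for ut in u:
        ys.append(g(x))
        x = f(x, ut)
    return np.array(ys)

# Hypothetical example system: tanh transition, linear readout.
n = 3
rng = np.random.default_rng(1)
A = rng.standard_normal((n, n))
A *= 0.5 / np.linalg.norm(A, 2)     # spectral norm 1/2; tanh is 1-Lipschitz,
B = rng.standard_normal(n)          # so the state map contracts at rate 1/2
c = rng.standard_normal(n)

f = lambda x, u: np.tanh(A @ x + B * u)
g = lambda x: float(c @ x)

u = rng.uniform(-1.0, 1.0, 200)
y0 = simulate(f, g, u, np.zeros(n))       # zero initial condition
y1 = simulate(f, g, u, np.full(n, 5.0))   # far-away initial condition

# The two output trajectories agree at late times: the initial condition
# washes out, so late outputs depend only on (a window of) the input.
print(abs(y0[-1] - y1[-1]))
```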
Along the way, we will derive the approximation results of Miller and Hardt [2019] as a special case.

4.1 Approximately finite memory and incremental stability

Consider the system in (8). Given any input $u \in S$, any $\xi \in \mathbb{R}^n$, and any $s, t \in \mathbb{Z}_+$ with $t \ge s$, we denote by $\varphi^u_{s,t}(\xi)$ the state at time $t$ when $x_s = \xi$. Let $M$ be a subset of $S$. We say that $X \subseteq \mathbb{R}^n$ is a positively invariant set of (8) for inputs in $M$ if, for all $\xi \in X$, all $u \in M$, and all $0 \le s \le t$, $\varphi^u_{s,t}(\xi) \in X$. We will be interested in systems with the following property [Tran et al., 2017]:

Definition 4.1. The system (8) is uniformly asymptotically incrementally stable for inputs in $M$ on a positively invariant set $X$ if there exists a function $\beta : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ of class $KL$ (see footnote 1), such that the inequality

  $\|\varphi^u_{s,t}(\xi) - \varphi^u_{s,t}(\xi')\| \le \beta(\|\xi - \xi'\|, t - s)$   (9)

holds for all inputs $u \in M$, all initial conditions $\xi, \xi' \in X$, and all $0 \le s \le t$, where $\|\cdot\|$ is the $\ell^2$ norm on $\mathbb{R}^n$.

(Footnote 1: A function $\beta : \mathbb{R}_+ \times \mathbb{R}_+ \to \mathbb{R}_+$ is of class $KL$ if it is continuous and strictly increasing in its first argument, continuous and strictly decreasing in its second argument, $\beta(0, t) = 0$ for any $t$, and $\lim_{t \to \infty} \beta(r, t) = 0$ for any $r$ [Sontag, 1998].)

In other words, a system is incrementally stable if the influence of any initial condition in $X$ on the state trajectory is asymptotically negligible. A key consequence is the following estimate:

Proposition 4.2. Let $u, \tilde{u}$ be two input sequences in $M$. Then, for any $\xi \in X$ and any $t \in \mathbb{Z}_+$,

  $\|\varphi^u_{0,t}(\xi) - \varphi^{\tilde{u}}_{0,t}(\xi)\| \le \sum_{s=0}^{t-1} \beta(\|f(\tilde{x}_s, u_s) - f(\tilde{x}_s, \tilde{u}_s)\|, t - s - 1)$,   (10)

where $x_s$ and $\tilde{x}_s$ denote the states at time $s$ due to inputs $u$ and $\tilde{u}$, respectively, with $x_0 = \tilde{x}_0 = \xi$.

Consider a state-space model (8) with a positively invariant set $X$, with the following assumptions:

Assumption 4.3. The state transition map $f(x, u)$ is $L_f$-Lipschitz in $u$ for all $x \in X$, and the output map $g(x)$ is $L_g$-Lipschitz in $x \in X$.

Assumption 4.4. For any initial condition $\xi \in X$ there exists a compact set $S_\xi \subseteq X$ such that $\varphi^u_{0,t}(\xi) \in S_\xi$ for all $u \in M(R)$ and all $t \in \mathbb{Z}_+$.

Assumption 4.5. The system (8) is uniformly asymptotically incrementally stable on $X$ for inputs in $M(R)$, and the function $\beta$ in (9) satisfies the summability condition

  $\sum_{t \in \mathbb{Z}_+} \beta(C, t) < \infty$   (11)

for any $C \ge 0$. (For example, if $\beta(C, k) = C k^{-\alpha}$ for some $\alpha > 1$, then this condition is satisfied.)

We are now in a position to prove the main result of this section:

Theorem 4.6. Suppose that Assumptions 4.3–4.5 are satisfied. Then the i/o map $F$ of the system (8) satisfies Assumptions 3.1 and 3.2 with

  $m^*_F(\varepsilon) \le \min\{m \in \mathbb{Z}_+ : \sum_{k \ge m} \beta(\mathrm{diam}(S_\xi), k) < \varepsilon/L_g\}$   (12)

and

  $\omega_{t,F}(\delta) \le L_g \sum_{s=0}^{t-1} \beta(L_f \delta, s)$ for all $t \in \mathbb{Z}_+$.   (13)

Proof. Fix some $t, m \in \mathbb{Z}_+$. For an arbitrary input $u \in M(R)$, let $\tilde{u} = W_{t,m} u$, where we may assume without loss of generality that $t \ge m$. Then $\tilde{u}_s = u_s 1\{t - m \le s \le t\}$, and therefore

  $\sum_{s=0}^{t-1} \beta(\|f(\tilde{x}_s, u_s) - f(\tilde{x}_s, \tilde{u}_s)\|, t - s - 1) = \sum_{s=0}^{t-m-1} \beta(\|f(\tilde{x}_s, u_s) - f(\tilde{x}_s, 0)\|, t - s - 1) \le \sum_{s=0}^{t-m-1} \beta(\mathrm{diam}(S_\xi), t - s - 1) \le \sum_{s=m}^{\infty} \beta(\mathrm{diam}(S_\xi), s)$.   (14)

By the summability condition (11), the summation in (14) converges to 0 as $m \uparrow \infty$. Thus, if we choose $m$ so that the right-hand side of (14) is smaller than $\varepsilon/L_g$, it follows from Proposition 4.2 that

  $|(Fu)_t - (F W_{t,m} u)_t| = |g(\varphi^u_{0,t}(\xi)) - g(\varphi^{\tilde{u}}_{0,t}(\xi))| \le L_g \|\varphi^u_{0,t}(\xi) - \varphi^{\tilde{u}}_{0,t}(\xi)\| < \varepsilon$.

This proves (12). Now fix any two $u, \tilde{u} \in M(R)$ with $\|u_{0:t} - \tilde{u}_{0:t}\|_\infty < \delta$. Then $\max_{0 \le s \le t} \|f(x, u_s) - f(x, \tilde{u}_s)\| \le L_f \delta$ for all $x \in X$, so Proposition 4.2 gives

  $|\tilde{F}_t(u_{0:t}) - \tilde{F}_t(\tilde{u}_{0:t})| = |g(\varphi^u_{0,t}(\xi)) - g(\varphi^{\tilde{u}}_{0,t}(\xi))| \le L_g \|\varphi^u_{0,t}(\xi) - \varphi^{\tilde{u}}_{0,t}(\xi)\| \le L_g \sum_{s=0}^{t-1} \beta(L_f \delta, s)$,

which proves (13).

4.2 Exponential incremental stability and the Demidovich criterion

Miller and Hardt [2019] consider the case of contracting systems: there exist some $\lambda \in (0, 1)$ and a set $U \subseteq \mathbb{R}^m$, such that

  $\|f(x, u) - f(x', u)\| \le \lambda \|x - x'\|$   (15)

for all $x, x' \in \mathbb{R}^n$ and all $u \in U$. Such a system is uniformly exponentially incrementally stable on any positively invariant set $X$, with $\beta(C, t) = C\lambda^t$.
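The contraction condition (15) can be spot-checked numerically for a concrete cell. A short sketch for a hypothetical tanh transition map whose linear part has spectral norm 0.8 (so (15) holds with $\lambda = 0.8$, since tanh is 1-Lipschitz coordinatewise); the cell and all constants are illustrative choices, not from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
W = rng.standard_normal((n, n))
W *= 0.8 / np.linalg.norm(W, 2)   # spectral norm 0.8 (hypothetical choice)
b = rng.standard_normal(n)

def f(x, u):
    """Transition map of a hypothetical cell: tanh(W x + b u)."""
    return np.tanh(W @ x + b * u)

# Since |tanh'| <= 1, ||f(x,u) - f(x',u)|| <= ||W (x - x')|| <= 0.8 ||x - x'||,
# i.e., (15) holds with lambda = 0.8.  Estimate the ratio over random pairs:
ratios = []
for _ in range(1000):
    x, xp = rng.standard_normal(n), rng.standard_normal(n)
    u = rng.uniform(-1.0, 1.0)
    ratios.append(np.linalg.norm(f(x, u) - f(xp, u)) / np.linalg.norm(x - xp))

lam_hat = max(ratios)
print(lam_hat)   # empirical contraction factor; provably at most 0.8
```

The empirical maximum ratio never exceeds the spectral-norm bound, consistent with (15) and with the exponential envelope $\beta(C, t) = C\lambda^t$.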
In this section, we obtain their result as a special case of a more general stability criterion, known in the literature on nonlinear system stability as the Demidovich criterion [Pavlov et al., 2006]. The following result is a simplified version of a more general result of Tran et al. [2017]:

Proposition 4.7 (the discrete-time Demidovich criterion). Consider the recurrent system (8) with a convex positively invariant set $X$, where the state transition map $f(x, u)$ is differentiable in $x$ for any $u \in U$. Suppose that there exist a symmetric positive definite matrix $P$ and a constant $\mu \in (0, 1)$, such that

  $\frac{\partial}{\partial x} f(x, u)^\top P\, \frac{\partial}{\partial x} f(x, u) - \mu P \preceq 0$   (16)

for all $x \in X$ and all $u \in U$, where $\frac{\partial}{\partial x} f(x, u)$ is the Jacobian of $f(\cdot, u)$ with respect to $x$. Then the system (8) is uniformly exponentially incrementally stable with $\beta(C, t) = \sqrt{\kappa(P)}\, C \mu^{t/2}$, where $\kappa(P)$ is the condition number of $P$.

Proof. Fix any $u \in U$ and $\xi, \xi' \in X$, and define the function $\Phi : [0, 1] \to \mathbb{R}$ by

  $\Phi(s) := (f(\xi, u) - f(\xi', u))^\top P f(s\xi + (1 - s)\xi', u)$.

Then

  $\Phi(1) - \Phi(0) = (f(\xi, u) - f(\xi', u))^\top P (f(\xi, u) - f(\xi', u))$.   (17)

By the mean-value theorem, there exists some $\bar{s} \in [0, 1]$, such that

  $\Phi(1) - \Phi(0) = \frac{d}{ds}\Phi(s)\big|_{s=\bar{s}} = (f(\xi, u) - f(\xi', u))^\top P\, \frac{\partial}{\partial x} f(\bar{\xi}, u)(\xi - \xi')$,   (18)

where $\bar{\xi} = \bar{s}\xi + (1 - \bar{s})\xi' \in X$, since $X$ is convex. From (16), (17), and (18), together with the Cauchy–Schwarz inequality in the inner product induced by $P$, it follows that

  $(f(\xi, u) - f(\xi', u))^\top P (f(\xi, u) - f(\xi', u)) \le (\xi - \xi')^\top \frac{\partial}{\partial x} f(\bar{\xi}, u)^\top P\, \frac{\partial}{\partial x} f(\bar{\xi}, u)(\xi - \xi') \le \mu (\xi - \xi')^\top P (\xi - \xi')$.

Define the function $V : X \times X \to \mathbb{R}_+$ by $V(\xi, \xi') := (\xi - \xi')^\top P (\xi - \xi')$. From the above estimate, it follows that $V$ is a Lyapunov function for the dynamics, i.e., for any $u \in U$ and $\xi, \xi' \in X$,

  $V(f(\xi, u), f(\xi', u)) \le \mu V(\xi, \xi')$.   (19)

Consequently, for any input $u$ with $u_t \in U$ for all $t$ and any $\xi, \xi' \in X$,

  $V(\varphi^u_{0,t+1}(\xi), \varphi^u_{0,t+1}(\xi')) = V(f(\varphi^u_{0,t}(\xi), u_t), f(\varphi^u_{0,t}(\xi'), u_t)) \le \mu V(\varphi^u_{0,t}(\xi), \varphi^u_{0,t}(\xi'))$.

Iterating, we obtain the inequality $V(\varphi^u_{0,t}(\xi), \varphi^u_{0,t}(\xi')) \le \mu^t V(\xi, \xi')$. Finally, since $P \succ 0$,

  $\|\varphi^u_{0,t}(\xi) - \varphi^u_{0,t}(\xi')\|^2 \le \frac{\lambda_{\max}(P)}{\lambda_{\min}(P)}\, \mu^t \|\xi - \xi'\|^2 = \kappa(P) \|\xi - \xi'\|^2 \mu^t$,

and the proof is complete.

Theorem 4.8. Suppose the system (8) satisfies Assumption 4.3 and the Demidovich criterion with $U = [-R, R]$, its positively invariant set $X$ contains $0$, and $f(0, 0) = 0$. Then its i/o map $F$ with zero initial condition $x_0 = 0$ satisfies Assumptions 3.1 and 3.2 with

  $m^*_F(\varepsilon) \le \frac{2 \log\left(\frac{2\kappa(P) L_f L_g R}{(1 - \sqrt{\mu})^2 \varepsilon}\right)}{\log \frac{1}{\mu}}$ and $\omega_{t,F}(\delta) \le \frac{\sqrt{\kappa(P)}\, L_f L_g \delta}{1 - \sqrt{\mu}}$.   (20)

Proof. Since $P$ is symmetric and positive definite, $\|x\|_P := \sqrt{x^\top P x}$ is a norm on $\mathbb{R}^n$ with $\lambda_{\min}(P) \|\cdot\|^2 \le \|\cdot\|_P^2 \le \lambda_{\max}(P) \|\cdot\|^2$. Then, for all $\xi \in X$, $u \in M(R)$, and $t$,

  $\|\varphi^u_{0,t+1}(\xi)\|_P = \|f(\varphi^u_{0,t}(\xi), u_t)\|_P \le \|f(\varphi^u_{0,t}(\xi), u_t) - f(0, u_t)\|_P + \|f(0, u_t) - f(0, 0)\|_P \le \sqrt{\mu}\, \|\varphi^u_{0,t}(\xi)\|_P + \sqrt{\lambda_{\max}(P)}\, L_f R$,

where we have used the Lyapunov bound (19). Unrolling the recursion gives the estimate

  $\sup_{u \in M(R)} \sup_{t \in \mathbb{Z}_+} \|\varphi^u_{0,t}(\xi)\|_P \le \sqrt{\mu}\, \|\xi\|_P + \frac{\sqrt{\lambda_{\max}(P)}\, L_f R}{1 - \sqrt{\mu}}$.

Thus, Assumption 4.4 is satisfied, where $S_\xi$ is the ball of $\ell^2$-radius $\sqrt{\kappa(P)}\left(\|\xi\| + \frac{L_f R}{1 - \sqrt{\mu}}\right)$ centered at $0$. Assumption 4.5 is also satisfied by Proposition 4.7. The estimates in (20) follow from Theorem 4.6.

The following result now follows as a direct consequence of Theorems 3.3 and 4.8:

Corollary 4.9. If the system (8) satisfies the conditions of Theorem 4.8, then its i/o map $F$ with zero initial condition can be $\varepsilon$-approximated in the sense of Theorem 3.3 by a ReLU TCN $\hat{F}$ with width $\mathrm{polylog}(1/\varepsilon)$ and depth $\mathrm{quasipoly}(1/\varepsilon)$ (see footnote 2).

4.3 Contractivity vs. the Demidovich criterion

If the contractivity condition (15) holds and $f(x, u)$ is differentiable in $x$, then the Demidovich criterion is satisfied with $P = I_n$ and $\mu = \lambda^2$. In that case, we immediately obtain the exponential estimate $\beta(C, t) \le C\lambda^t$. However, the Demidovich criterion covers a wider class of nonlinear systems. As an example, consider a discrete-time nonlinear system of Lur'e type (cf. Sandberg and Xu [1993], Kim and Braatz [2014], Sarkans and Logemann [2016] and references therein):

  $x_{t+1} = A x_t + B \psi(u_t - y_t)$,   (21a)
  $y_t = C x_t$.   (21b)

Here, the state $x_t$ is $n$-dimensional while the input $u_t$ and the output $y_t$ are scalar, so $A \in \mathbb{R}^{n \times n}$, $B \in \mathbb{R}^{n \times 1}$, and $C \in \mathbb{R}^{1 \times n}$. The map $\psi : \mathbb{R} \to \mathbb{R}$ is a fixed differentiable nonlinearity. The system in (21) has the form (8) with $f(x, u) = Ax + B\psi(u - Cx)$ and $g(x) = Cx$, and can be realized as the negative feedback interconnection of the discrete-time linear system

  $x_{t+1} = A x_t + B v_t$,   (22a)
  $y_t = C x_t$,   (22b)

and the nonlinear element $\psi$ using the feedback law $v_t = \psi(u_t - y_t)$. We make the following assumptions (see, e.g., Sontag [1998] for the requisite control-theoretic background):

Assumption 4.10. The nonlinearity $\psi : \mathbb{R} \to \mathbb{R}$ satisfies $\psi(0) = 0$, and there exist real numbers $-\infty < a \le b < \infty$ such that $a \le \psi'(\cdot) \le b$.

Assumption 4.11. $A$ is a Schur matrix, i.e., its spectral radius $\rho(A)$ is strictly smaller than 1; the pair $(A, B)$ is controllable, i.e., the $n \times n$ matrix $[B \mid AB \mid \cdots \mid A^{n-1}B]$ has rank $n$; and the pair $(A, C)$ is observable, i.e., the $n \times n$ matrix $[C^\top \mid A^\top C^\top \mid \cdots \mid (A^\top)^{n-1} C^\top]$ has rank $n$.

Assumption 4.12. Let $\mathbb{T} := \{z \in \mathbb{C} : |z| = 1\}$ denote the unit circle in the complex plane.
The rational function $G(z) := C(zI_n - A)^{-1}B$ satisfies

  $\|G\|_{H^\infty(\mathbb{T})} := \sup_{z \in \mathbb{T}} |G(z)| < \gamma^{-1}$   (23)

for some $\gamma > 0$ such that $r^2 \le \gamma^2$ for all $a \le r \le b$.

(Footnote 2: We say that a given quantity $N$ has quasipolynomial growth in $1/\varepsilon$, and write $N \le \mathrm{quasipoly}(1/\varepsilon)$, if $N = O(\exp(\mathrm{polylog}(1/\varepsilon)))$.)

Remark 4.13. Assumption 4.10 imposes a slope condition on $\psi$ and is standard in the analysis of Lur'e systems [Tsypkin, 1964, Sandberg, 1991, Kim and Braatz, 2014]. The function $G(z)$ is the transfer function of the linear system (22). Assumption 4.11 states that the triple $(A, B, C)$ is a minimal realization of $G$. The quantity $\|G\|_{H^\infty(\mathbb{T})}$ appearing in Eq. (23) in Assumption 4.12 is the $H^\infty$-norm of $G$ on the unit circle in the complex plane. Assumptions 4.11 and 4.12 are also common and are in the spirit of the well-known circle criterion [Tsypkin, 1964, Sandberg and Xu, 1993].

With these preliminaries out of the way, we have the following:

Proposition 4.14. Suppose that the system (21) satisfies Assumptions 4.10–4.12. Then it satisfies the discrete-time Demidovich criterion with $X = \mathbb{R}^n$ and $U = \mathbb{R}$, and moreover $\mu > \rho(A)^2$.

The crucial ingredient in the proof is the Discrete-Time Bounded-Real Lemma [Vaidyanathan, 1985], which guarantees the existence of the matrix $P$ appearing in the Demidovich criterion. The main takeaway here is that the function $f(x, u) = Ax + B\psi(u - Cx)$ need not be contractive in the Euclidean norm (i.e., it may be the case that $P \ne I_n$), but it will be contractive in the $\|\cdot\|_P$ norm.

5 Comparison of architectures

So far, we have shown that any i/o map $F$ with approximately finite memory can be approximated by a ReLU temporal convolutional net.
We have also considered recurrent models and shown that any incrementally stable recurrent model has approximately finite memory and can therefore be approximated by a ReLU TCN. As far as their approximation capabilities are concerned, both recurrent models and autoregressive models like TCNs are equivalent, since any finite-memory i/o map of the form (2) admits the state-space realization

    x^1_{t+1} = x^2_t,  x^2_{t+1} = x^3_t,  …,  x^{m−1}_{t+1} = x^m_t,  x^m_{t+1} = u_t
    y_t = f(x^1_t, x^2_t, …, x^m_t, u_t)

of the tapped delay line type, with zero initial condition (x^1_0, …, x^m_0) = (0, …, 0). (Compared to (8), we are allowing a direct 'feedthrough' connection from the input u_t to the output y_t.) The advantage of autoregressive models like TCNs shows up during training and regular operation, since shifted copies of the input sequence can be efficiently processed in parallel rather than sequentially.

Another point worth mentioning is that, while the construction in the proof of Theorem 3.3 makes use of ReLU nets as a universal function approximator, any other family of universal approximators can be used instead, for example, multivariate polynomials or rational functions. In fact, if one uses multivariate polynomials to approximate the functionals F̃_t, the resulting family of i/o maps is known as the (discrete-time) finite Volterra series [Boyd and Chua, 1985], and has been used widely in the analysis of nonlinear systems. However, TCNs generally provide a more parsimonious representation. To see this, consider the following (admittedly contrived) example of an i/o map:

    (Fu)_t = ReLU( Σ_{s=0}^∞ h_s u_{t−s} ),    (24)

where the filter coefficients h_t have the exponential decay property |h_t| ≤ Cλ^t for some C > 0 and λ ∈ (0, 1).
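The exponential decay of the filter in (24) is what makes truncation cheap. As a quick numerical sanity check, under the illustrative assumptions h_s = λ^s (so C = 1), λ = 1/2, and inputs bounded by 1 (none of these specific values come from the paper), the error of the lag-m truncation is controlled by the geometric tail of the filter:

```python
import numpy as np

lam = 0.5
M = 60                                   # long horizon standing in for the infinite sum
h = lam ** np.arange(M + 1)              # illustrative filter h_s = lam**s

rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, size=200)     # input sequence with |u_t| <= 1
u_pad = np.concatenate([np.zeros(M), u]) # zero past inputs

def output(m):
    # (F u)_t with the filter truncated at lag m: ReLU(sum_{s=0}^m h_s u_{t-s})
    ys = []
    for t in range(len(u)):
        window = u_pad[t + M - m : t + M + 1][::-1]   # (u_t, u_{t-1}, ..., u_{t-m})
        ys.append(max(0.0, float(np.dot(h[: m + 1], window))))
    return np.array(ys)

y_full = output(M)
for m in [5, 10, 20]:
    err = np.max(np.abs(output(m) - y_full))
    tail = lam ** (m + 1) / (1.0 - lam)  # geometric tail bound (ReLU is 1-Lipschitz)
    assert err <= tail + 1e-12
```

The bound holds because the ReLU is 1-Lipschitz, so the output error is at most the tail sum Σ_{s>m} |h_s|, which is O(λ^m) here.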
It is not hard to show that F has exponentially fading memory, and a very simple ε-approximation by a TCN is obtained by zeroing out all of the filter coefficients h_s, s > m ∼ log(1/ε):

    (F̂u)_t = ReLU( Σ_{s=0}^m h_s u_{t−s} ).

However, any ε-approximation of F by a Volterra series would need poly(1/ε) terms, since the best polynomial ε-approximation of the ReLU on any compact interval has degree Ω(1/ε) [DeVore and Lorentz, 1993, Chap. 9, Thm. 3.3]. On the other hand, if we consider an i/o map of the form (24), but with a degree-d univariate polynomial instead of the ReLU, then we can ε-approximate it with a TCN of depth O(d + log(d/ε)) and O(d log(d/ε)) units [Liang and Srikant, 2017].

Acknowledgments

This work was supported in part by the National Science Foundation under the Center for Advanced Electronics through Machine Learning (CAEML) I/UCRC award no. CNS-16-24811.

References

Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling, 2018. URL https://arxiv.org/abs/1803.01271.

Stephen Boyd and Leon O. Chua. Fading memory and the problem of approximating nonlinear operators with Volterra series. IEEE Transactions on Circuits and Systems, CAS-32(11):1150–1161, 1985.

Ciprian Chelba, Mohammad Norouzi, and Samy Bengio. N-gram language modeling using recurrent neural network estimation, 2017. URL https://arxiv.org/abs/1703.10724.

Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In International Conference on Machine Learning, 2017.

Ronald A. DeVore and George G. Lorentz. Constructive Approximation. Springer-Verlag, Berlin, 1993.

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin.
Convolutional sequence to sequence learning. In International Conference on Machine Learning, 2017.

Boris Hanin and Mark Sellke. Approximating continuous functions by ReLU nets of minimal width, 2018. URL http://arxiv.org/abs/1710.11278.

Rie Johnson and Tong Zhang. Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 562–570, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1052. URL https://www.aclweb.org/anthology/P17-1052.

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural machine translation in linear time, 2016. URL https://arxiv.org/abs/1610.10099.

Kwang-Ki K. Kim and Richard D. Braatz. Observer-based output feedback control of discrete-time Lur'e systems with sector-bounded slope-restricted nonlinearities. International Journal of Robust and Nonlinear Control, 24:2458–2472, 2014.

Shiyu Liang and R. Srikant. Why deep neural networks for function approximation? In International Conference on Learning Representations, 2017.

John Miller and Moritz Hardt. Stable recurrent models. In International Conference on Learning Representations, 2019.

Jooyoung Park and Irwin W. Sandberg. Criteria for the approximation of nonlinear systems. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 39(8):673–676, 1992.

Alexey Pavlov, Nathan van de Wouw, and Henk Nijmeijer. Uniform Output Regulation of Nonlinear Systems: A Convergent Dynamics Approach. Birkhäuser, 2006.

Irwin W. Sandberg. Structure theorems for nonlinear systems. Multidimensional Systems and Signal Processing, 2:267–286, 1991.

Irwin W. Sandberg and Lilian Y. Xu. Steady-state errors in discrete-time control systems.
Automatica, 29(2):523–526, 1993.

Elvijs Sarkans and Hartmut Logemann. Input-to-state stability of discrete-time Lur'e systems. SIAM Journal on Control and Optimization, 54(3):1739–1768, 2016.

Vatsal Sharan, Sham Kakade, Percy Liang, and Gregory Valiant. Prediction with a short memory. In Symposium on Theory of Computing, 2018.

Eduardo D. Sontag. Mathematical Control Theory: Deterministic Finite Dimensional Systems. Springer-Verlag, 1998.

Duc N. Tran, Björn S. Rüffer, and Christopher M. Kellett. Convergence properties for discrete-time nonlinear systems. IEEE Transactions on Automatic Control, 2017.

Yakov Z. Tsypkin. A criterion of absolute stability for sampled-data systems with monotone characteristics of the nonlinear element. Doklady Akademii Nauk SSSR, 155(5):1029–1032, 1964. In Russian.

Palghat P. Vaidyanathan. The discrete-time bounded-real lemma in digital filtering. IEEE Transactions on Circuits and Systems, CAS-32(9):918–924, September 1985.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio, 2016. URL https://arxiv.org/abs/1609.03499.

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's neural machine translation system: Bridging the gap between human and machine translation, 2016.

Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze.
Comparative study of CNN and RNN for natural language processing, 2017. URL https://arxiv.org/abs/1702.01923.