{"title": "Structured Transforms for Small-Footprint Deep Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 3088, "page_last": 3096, "abstract": "We consider the task of building compact deep learning pipelines suitable for deploymenton storage and power constrained mobile devices. We propose a uni-fied framework to learn a broad family of structured parameter matrices that arecharacterized by the notion of low displacement rank. Our structured transformsadmit fast function and gradient evaluation, and span a rich range of parametersharing configurations whose statistical modeling capacity can be explicitly tunedalong a continuum from structured to unstructured. Experimental results showthat these transforms can significantly accelerate inference and forward/backwardpasses during training, and offer superior accuracy-compactness-speed tradeoffsin comparison to a number of existing techniques. In keyword spotting applicationsin mobile speech recognition, our methods are much more effective thanstandard linear low-rank bottleneck layers and nearly retain the performance ofstate of the art models, while providing more than 3.5-fold compression.", "full_text": "Structured Transforms for\n\nSmall-Footprint Deep Learning\n\nVikas Sindhwani\n\nTara N. Sainath\nGoogle, New York\n\n{sindhwani, tsainath, sanjivk}@google.com\n\nSanjiv Kumar\n\nAbstract\n\nWe consider the task of building compact deep learning pipelines suitable for de-\nployment on storage and power constrained mobile devices. We propose a uni-\n\ufb01ed framework to learn a broad family of structured parameter matrices that are\ncharacterized by the notion of low displacement rank. Our structured transforms\nadmit fast function and gradient evaluation, and span a rich range of parameter\nsharing con\ufb01gurations whose statistical modeling capacity can be explicitly tuned\nalong a continuum from structured to unstructured. 
Experimental results show that these transforms can significantly accelerate inference and forward/backward passes during training, and offer superior accuracy-compactness-speed tradeoffs in comparison to a number of existing techniques. In keyword spotting applications in mobile speech recognition, our methods are much more effective than standard linear low-rank bottleneck layers and nearly retain the performance of state of the art models, while providing more than 3.5-fold compression.

1 Introduction

Non-linear vector-valued transforms of the form f(x, M) = s(Mx), where s is an elementwise nonlinearity, x is an input vector, and M is an m × n matrix of parameters, are building blocks of complex deep learning pipelines and non-parametric function estimators arising in randomized kernel methods [20]. When M is a large general dense matrix, the cost of storing mn parameters and computing matrix-vector products in O(mn) time can make it prohibitive to deploy such models on lightweight mobile devices and wearables where battery life is precious and storage is limited. This is particularly relevant for "always-on" mobile applications, such as continuously looking for specific keywords spoken by the user or processing a live video stream onboard a mobile robot. In such settings, the models may need to be hosted on specialized low-power digital signal processing components which are even more resource constrained than the device CPU.

A parsimonious structure typically imposed on parameter matrices is that of low-rankness [22]. If M is a rank r matrix, with r ≪ min(m, n), then it has a (non-unique) product representation of the form M = GH^T where G, H have only r columns. Clearly, this representation reduces the storage requirements to (mr + nr) parameters, and accelerates the matrix-vector multiplication time to O(mr + nr) via Mx = G(H^T x).
Another popular structure is that of sparsity [6], typically imposed during optimization via zero-inducing l0 or l1 regularizers. Other techniques include freezing M to be a random matrix as motivated via approximations to kernel functions [20], storing M in low fixed-precision formats [7, 24], using specific parameter sharing mechanisms [3], or training smaller models on outputs of larger models ("distillation") [11].

Structured Matrices: An m × n matrix which can be described in much fewer than mn parameters is referred to as a structured matrix. Typically, the structure should not only reduce memory requirements, but also dramatically accelerate inference and training via fast matrix-vector products and gradient computations. Below are classes of structured matrices arising pervasively in many contexts [18], with different types of parameter sharing (indicated by color in the original):

(i) Toeplitz:
\begin{bmatrix} t_0 & t_{-1} & \cdots & t_{-(n-1)} \\ t_1 & t_0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & t_{-1} \\ t_{n-1} & \cdots & t_1 & t_0 \end{bmatrix}

(ii) Vandermonde:
\begin{bmatrix} 1 & v_0 & \cdots & v_0^{n-1} \\ 1 & v_1 & \cdots & v_1^{n-1} \\ \vdots & \vdots & & \vdots \\ 1 & v_{n-1} & \cdots & v_{n-1}^{n-1} \end{bmatrix}

(iii) Cauchy:
\begin{bmatrix} \frac{1}{u_0 - v_0} & \cdots & \frac{1}{u_0 - v_{n-1}} \\ \vdots & & \vdots \\ \frac{1}{u_{n-1} - v_0} & \cdots & \frac{1}{u_{n-1} - v_{n-1}} \end{bmatrix}

Toeplitz matrices have constant values along each of their diagonals. When the same property holds for anti-diagonals, the resulting class of matrices are called Hankel matrices.
Toeplitz and Hankel matrices are intimately related to one-dimensional discrete convolutions [10], and arise naturally in time series analysis and dynamical systems. A Vandermonde matrix is determined by taking elementwise powers of its second column. A very important special case is the complex matrix associated with the Discrete Fourier Transform (DFT), which has Vandermonde structure with v_j = ω_n^j, j = 1 … n, where ω_n = exp(−2πi/n) is the primitive nth root of unity. Similarly, the entries of n × n Cauchy matrices are completely defined by two length-n vectors. Vandermonde and Cauchy matrices arise naturally in polynomial and rational interpolation problems.

"Superfast" Numerical Linear Algebra: The structure in these matrices can be exploited for faster linear algebraic operations such as matrix-vector multiplication, inversion and factorization. In particular, the matrix-vector product can be computed in time O(n log n) for Toeplitz and Hankel matrices, and in time O(n log^2 n) for Vandermonde and Cauchy matrices.

Displacement Operators: At first glance, these matrices appear to have very different kinds of parameter sharing and consequently very different algorithms to support fast linear algebra. It turns out, however, that each structured matrix class described above can be associated with a specific displacement operator, L : R^{m×n} → R^{m×n}, which transforms each matrix, say M, in that class into an m × n matrix L[M] that has very low rank, i.e. rank(L[M]) ≪ min(m, n). This displacement rank approach, which can be traced back to a seminal 1979 paper [13], greatly unifies algorithm design and complexity analysis for structured matrices [13], [18], [14].

Generalizations of Structured Matrices: Consider deriving a matrix by taking arbitrary linear combinations of products of structured matrices and their inverses, e.g.
α_1 T_1 T_2^{−1} + α_2 T_3 T_4^{−1} T_5, where each T_i is a Toeplitz matrix. The parameter sharing structure in such a derived matrix is by no means apparent anymore. Yet, it turns out that the associated displacement operator remarkably continues to expose the underlying parsimony structure, i.e. such derived matrices are still mapped to relatively low-rank matrices! The displacement rank approach allows fast linear algebra algorithms to be seamlessly extended to these broader classes of matrices. The displacement rank parameter controls the degree of structure in these generalized matrices.

Technical Preview, Contributions and Outline: We propose building deep learning pipelines where parameter matrices belong to the class of generalized structured matrices characterized by low displacement rank. In Section 2, we attempt to give a self-contained overview of the displacement rank approach [13], [18], drawing key results from the relevant literature on structured matrix computations (proved in our supplementary material [1] for completeness). In Section 3, we show that the proposed structured transforms for deep learning admit fast matrix multiplication and gradient computations, and have rich statistical modeling capacity that can be explicitly controlled by the displacement rank hyperparameter, covering, along a continuum, an entire spectrum of configurations from highly structured to unstructured matrices. While our focus in this paper is on Toeplitz-related transforms, our proposal extends to other structured matrix generalizations. In Section 4, we study inference and training-time acceleration with structured transforms as a function of displacement rank and dimensionality. We find that our approach compares highly favorably with numerous other techniques for learning size-constrained models on several benchmark datasets.
Finally, we demonstrate our approach on mobile speech recognition applications where we are able to match the performance of much bigger state of the art models with a fraction of the parameters.

Notation: Let e_1 … e_n denote the canonical basis elements of R^n (viewed as column vectors). I_n, 0_n denote the n × n identity and zero matrices respectively. J_n = [e_n … e_1] is the anti-identity reflection matrix whose action on a vector is to reverse its entries. When the dimension is obvious we may drop the subscript; for rectangular matrices, we may specify both dimensions explicitly, e.g. we use 0_{1×n} for a zero-valued row vector, and 1_n for the all-ones column vector of length n. u ◦ v denotes the Hadamard (elementwise) product between two vectors u, v. For a complex vector u, ū will denote the vector of complex conjugates of its entries. The Discrete Fourier Transform (DFT) matrix will be denoted by Ω (or Ω_n); we will also use fft(x) to denote Ωx, and ifft(x) to denote Ω^{−1}x. For a vector v, diag(v) denotes a diagonal matrix given by diag(v)_{ii} = v_i.

2 Displacement Operators associated with Structured Matrices

We begin by providing a brisk background on the displacement rank approach. Unless otherwise specified, for notational convenience we will henceforth assume square transforms, i.e., m = n, and discuss rectangular transforms later. Proofs of various assertions can be found in our self-contained supplementary material [1] or in [18, 19].

The Sylvester displacement operator, denoted L = ∇_{A,B} : R^{n×n} → R^{n×n}, is defined by

∇_{A,B}[M] = AM − MB    (1)

where A ∈ R^{n×n}, B ∈ R^{n×n} are fixed matrices referred to as operator matrices. Closely related is the Stein displacement operator, denoted L = △_{A,B} : R^{n×n} → R^{n×n}, and defined by

△_{A,B}[M] = M − AMB    (2)

By carefully choosing A and B one can instantiate Sylvester and Stein displacement operators with desirable properties. In particular, for several important classes of displacement operators, A and/or B are chosen to be an f-unit-circulant matrix defined as follows.

Definition 2.1 (f-unit-Circulant Matrix). For a real-valued scalar f, the (n × n) f-unit-circulant matrix, denoted Z_f, is defined as follows:

Z_f = [e_2, e_3, …, e_n, f e_1] = \begin{bmatrix} 0 & 0 & \cdots & f \\ 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix} = \begin{bmatrix} 0_{1×(n−1)} & f \\ I_{n−1} & 0_{(n−1)×1} \end{bmatrix}

The f-unit-circulant matrix is associated with a basic downward shift-and-scale transformation, i.e., the matrix-vector product Z_f v shifts the elements of the column vector v "downwards", and scales and brings the last element v_n to the "top", resulting in [f v_n, v_1, …, v_{n−1}]^T. It has several basic algebraic properties (see Proposition 1.1 [1]) that are crucial for the results stated in this section.

Figure 1 lists the rank of the Sylvester displacement operator in Eqn. 1 when applied to matrices belonging to various structured matrix classes, where the operator matrices A, B in Eqn. 1 are chosen to be diagonal and/or f-unit-circulant. It can be seen that despite the difference in their structures, all these classes are characterized by very low displacement rank. Figure 2 shows how this low-rank transformation happens in the case of a 4 × 4 Toeplitz matrix (also see Section 1, Lemma 1.2 [1]). Embedded in the 4 × 4 Toeplitz matrix T are two copies of a 3 × 3 Toeplitz matrix shown in black and red boxes.
The shift and scale action of Z_1 and Z_{−1} aligns these sub-matrices. By taking the difference, the Sylvester displacement operator nullifies the aligned submatrix, leaving a rank-2 matrix with non-zero elements only along its first row and last column. Note that the negative sign introduced by the TZ_{−1} term prevents the complete zeroing out of the value of t (marked by a red star) and is hence critical for invertibility of the displacement action.

Figure 1: Below, r is rank(∇_{A,B}[M]).

Structured Matrix M | A | B | r
Toeplitz T, T^{−1} | Z_1 | Z_{−1} | ≤ 2
Hankel H, H^{−1} | Z_1 | Z_0^T | ≤ 2
T + H | Z_0 + Z_0^T | Z_0 + Z_0^T | ≤ 4
Vandermonde V(v) | diag(v) | Z_0 | ≤ 1
V(v)^{−1} | Z_0 | diag(v) | ≤ 1
V(v)^T | Z_0^T | diag(v) | ≤ 1
Cauchy C(s, t) | diag(s) | diag(t) | ≤ 1
C(s, t)^{−1} | diag(t) | diag(s) | ≤ 1

Figure 2: Displacement Action on Toeplitz Matrix

T = \begin{bmatrix} t & u & v & w \\ x & t & u & v \\ y & x & t & u \\ z & y & x & t \end{bmatrix}, downshift: Z_1 T = \begin{bmatrix} z & y & x & t \\ t & u & v & w \\ x & t & u & v \\ y & x & t & u \end{bmatrix}, left shift: T Z_{−1} = \begin{bmatrix} u & v & w & −t \\ t & u & v & −x \\ x & t & u & −y \\ y & x & t & −z \end{bmatrix}, so Z_1 T − T Z_{−1} = \begin{bmatrix} * & * & * & * \\ 0 & 0 & 0 & * \\ 0 & 0 & 0 & * \\ 0 & 0 & 0 & * \end{bmatrix}

Each class of structured matrices listed in Figure 1 can be naturally generalized by allowing the rank of the displacement operator to be higher. Specifically, given a displacement operator L and a displacement rank parameter r, one may consider the class of matrices M that satisfies rank(L[M]) ≤ r. Clearly then, L[M] = GH^T for rank-r matrices G, H.
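The Figure 2 computation is easy to reproduce numerically. Below is a small numpy sketch (mine, not from the paper; helper names are invented) that forms ∇_{Z_1,Z_{−1}}[T] for a random Toeplitz matrix and checks that its rank is at most 2, while a dense unstructured matrix is generically full displacement rank:

```python
import numpy as np

def Z(f, n):
    """f-unit-circulant matrix Z_f = [e_2, ..., e_n, f*e_1]."""
    M = np.diag(np.ones(n - 1), -1)   # ones on the first subdiagonal (downshift)
    M[0, n - 1] = f                   # f in the top-right corner
    return M

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(2 * n - 1)    # free parameters t_{-(n-1)}, ..., t_{n-1}
T = np.array([[c[n - 1 + i - j] for j in range(n)] for i in range(n)])  # Toeplitz

D = Z(1, n) @ T - T @ Z(-1, n)        # Sylvester displacement of T
assert np.linalg.matrix_rank(D) <= 2  # Toeplitz => displacement rank at most 2

M = rng.standard_normal((n, n))       # unstructured matrix: full displacement rank
assert np.linalg.matrix_rank(Z(1, n) @ M - M @ Z(-1, n)) == n
```

As in Figure 2, printing D shows nonzeros only in the first row and last column.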
We refer to rank(L[M]) as the displacement rank of M under L, and to the low-rank factors G, H ∈ R^{n×r} as the associated low-displacement generators. For the operators listed in Figure 1, these broader classes of structured matrices are correspondingly called Toeplitz-like, Vandermonde-like and Cauchy-like. Fast numerical linear algebra algorithms extend to such matrices [18].

In order to express structured matrices with low displacement rank directly as a function of their low-displacement generators, we need to invert L and obtain a learnable parameterization. For the Stein type displacement operator, the following elegant result is known (see proof in [1]):

Theorem 2.2 ([19], Krylov Decomposition). If an n × n matrix M is such that △_{A,B}[M] = GH^T where G = [g_1 … g_r], H = [h_1 … h_r] ∈ R^{n×r} and the operator matrices satisfy A^n = aI, B^n = bI for some scalars a, b, then M can be expressed as:

M = \frac{1}{1 − ab} \sum_{j=1}^{r} krylov(A, g_j) krylov(B^T, h_j)^T    (3)

where krylov(A, v) is defined by:

krylov(A, v) = [v  Av  A^2 v  …  A^{n−1} v]    (4)

Henceforth, our focus in this paper will be on Toeplitz-like matrices, for which the displacement operator of interest (see Figure 1) is of Sylvester type: ∇_{Z_1, Z_{−1}}. In order to apply Theorem 2.2, one can switch between Sylvester and Stein operators, setting A = Z_1 and B = Z_{−1}, which both satisfy the conditions of Theorem 2.2 (see property 3, Proposition 1.1 [1]). The resulting expressions involve Krylov matrices generated by f-unit-circulant matrices, which are called f-circulant matrices in the literature.

Definition 2.3 (f-circulant matrix). Given a vector v, the f-circulant matrix, Z_f(v), is defined as follows:

Z_f(v) = krylov(Z_f, v) = \begin{bmatrix} v_0 & f v_{n−1} & \cdots & f v_1 \\ v_1 & v_0 & \ddots & f v_2 \\ \vdots & \ddots & \ddots & \vdots \\ v_{n−1} & \cdots & v_1 & v_0 \end{bmatrix}

Two special cases are of interest: f = 1 corresponds to Circulant matrices, and f = −1 corresponds to skew-Circulant matrices.

Finally, one can obtain an explicit parameterization for Toeplitz-like matrices, which turns out to involve taking sums of products of Circulant and skew-Circulant matrices.

Theorem 2.4 ([18]). If an n × n matrix M satisfies ∇_{Z_1,Z_{−1}}[M] = GH^T where G = [g_1 … g_r], H = [h_1 … h_r] ∈ R^{n×r}, then M can be written as:

M = \frac{1}{2} \sum_{j=1}^{r} Z_1(g_j) Z_{−1}(J h_j)    (5)

3 Learning Toeplitz-like Structured Transforms

Motivated by Theorem 2.4, we propose learning parameter matrices of the form in Eqn. 5 by optimizing the displacement factors G, H. First, from the properties of displacement operators [18], it follows that this class of matrices is very rich from a statistical modeling perspective.

Theorem 3.1 (Richness). The set of all n × n matrices that can be written as

M(G, H) = \sum_{i=1}^{r} Z_1(g_i) Z_{−1}(h_i)    (6)

for some G = [g_1 … g_r], H = [h_1 … h_r] ∈ R^{n×r} contains:
• All n × n Circulant and skew-Circulant matrices for r ≥ 1.
• All n × n Toeplitz matrices for r ≥ 2.
• Inverses of Toeplitz matrices for r ≥ 2.
• All products of the form A_1 … A_t for r ≥ 2t.
• All linear combinations of the form \sum_{i=1}^{p} β_i A_1^{(i)} … A_t^{(i)} for r ≥ 2tp.
• All n × n matrices for r = n.
where each A_i above is a Toeplitz matrix or the inverse of a Toeplitz matrix.

When we learn a parameter matrix structured as in Eqn. 6 with displacement rank equal to 1 or 2, we also search over convolutional transforms. In this sense, structured transforms with higher displacement rank generalize (one-dimensional) convolutional layers.
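To make the displacement-rank "knob" concrete, here is a small numpy sketch (mine, not from the paper; helper names are invented) that builds M(G, H) of Eqn. 6 from dense f-circulants and confirms that its Sylvester displacement ∇_{Z_1,Z_{−1}}[M(G, H)] has rank at most r, so 2nr free parameters control an n × n map of displacement rank r:

```python
import numpy as np

def Zf(v, f):
    """Dense f-circulant Z_f(v): entry (i, j) = v[i-j] if i >= j else f*v[n+i-j]."""
    n = len(v)
    return np.array([[v[i - j] if i >= j else f * v[n + i - j]
                      for j in range(n)] for i in range(n)])

def M_of(G, H):
    """M(G, H) = sum_i Z_1(g_i) Z_{-1}(h_i), Eqn. 6."""
    return sum(Zf(G[:, i], 1.0) @ Zf(H[:, i], -1.0) for i in range(G.shape[1]))

rng = np.random.default_rng(0)
n, r = 16, 3
G = rng.standard_normal((n, r))
H = rng.standard_normal((n, r))
M = M_of(G, H)

Z1 = np.roll(np.eye(n), 1, axis=0)    # 1-unit-circulant (downshift)
Zm1 = np.roll(np.eye(n), 1, axis=0)
Zm1[0, -1] = -1.0                     # (-1)-unit-circulant
disp = Z1 @ M - M @ Zm1               # Sylvester displacement of M(G, H)
assert np.linalg.matrix_rank(disp) <= r
```

The rank bound follows because Z_1 commutes with Z_1(g_i), Z_{−1} commutes with Z_{−1}(h_i), and Z_1 − Z_{−1} = 2 e_1 e_n^T has rank one, so each summand contributes at most one to the displacement rank.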
The displacement rank provides a knob on modeling capacity: low displacement rank matrices are highly structured and compact, while high displacement rank matrices start to contain increasingly unstructured dense matrices.

Next, we show that the associated structured transforms of the form f(x) = M(G, H)x admit fast evaluation, and gradient computations with respect to G, H. First we recall the following well-known result concerning the diagonalization of f-circulant matrices.

Theorem 3.2 (Diagonalization of f-circulant matrices, Theorem 2.6.4 [18]). For any f ≠ 0, let f = [1, f^{1/n}, f^{2/n}, …, f^{(n−1)/n}]^T ∈ C^n, and D_f = diag(f). Then,

Z_f(v) = D_f^{−1} Ω^{−1} diag(Ω(f ◦ v)) Ω D_f    (7)

This result implies that for the special cases f = 1 and f = −1, corresponding to Circulant and skew-Circulant matrices respectively, the matrix-vector multiplication can be computed in O(n log n) time via the Fast Fourier Transform:

y = Z_1(v) x = ifft(fft(v) ◦ fft(x))    (8)
y = Z_1(v)^T x = ifft(\overline{fft(v)} ◦ fft(x))    (9)
y = Z_{−1}(v) x = η̄ ◦ ifft(fft(η ◦ v) ◦ fft(η ◦ x))    (10)
y = Z_{−1}(v)^T x = η̄ ◦ ifft(\overline{fft(η ◦ v)} ◦ fft(η ◦ x))    (11)

where η = [1, η, η^2, …, η^{n−1}]^T with η = (−1)^{1/n} = exp(iπ/n), the root of negative unity.

In particular, a single matrix-vector product for Circulant and skew-Circulant matrices has the computational cost of 3 FFTs. Therefore, for matrices of the form in Eqn. 6, comprising r products of Circulant and skew-Circulant matrices, naively computing a matrix-vector product for a batch of b input vectors would take 6rb FFTs.
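The identities of Eqns. 8 and 10 are easy to sanity-check against the dense f-circulant construction; a small numpy sketch (mine, not from the paper; np.fft plays the role of Ω):

```python
import numpy as np

def Zf(v, f):
    # dense f-circulant: first column v, each column a downward f-shifted copy
    n = len(v)
    return np.array([[v[i - j] if i >= j else f * v[n + i - j]
                      for j in range(n)] for i in range(n)])

rng = np.random.default_rng(1)
n = 8
v, x = rng.standard_normal(n), rng.standard_normal(n)
eta = np.exp(1j * np.pi * np.arange(n) / n)   # powers of exp(i*pi/n)

fft, ifft = np.fft.fft, np.fft.ifft
y8 = ifft(fft(v) * fft(x)).real                              # Eqn. 8 (circulant)
y10 = (eta.conj() * ifft(fft(eta * v) * fft(eta * x))).real  # Eqn. 10 (skew)

assert np.allclose(y8, Zf(v, 1.0) @ x)
assert np.allclose(y10, Zf(v, -1.0) @ x)
```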
However, this cost can be significantly lowered to that of 2(rb + r + b) FFTs by making the following observation:

Y = \sum_{i=1}^{r} Z_1(g_i) Z_{−1}(h_i) X = Ω^{−1} \left( \sum_{i=1}^{r} diag(Ω g_i) Ω diag(η̄) Ω^{−1} diag(Ω(η ◦ h_i)) \right) X̃

where X̃ = Ω diag(η) X. Here, (1) the FFT of the parameters, Ω g_i and Ω(η ◦ h_i), is computed once and shared across multiple input vectors in the minibatch, (2) the (scaled) FFT of the input, Ω diag(η) X, is computed once and shared across the sum in Eqn. 6, and (3) the final inverse FFT is also shared. Thus, the following result is immediate.

Theorem 3.3 (Fast Multiplication). Given an n × b matrix X, the matrix-matrix product Y = (\sum_{i=1}^{r} Z_1(g_i) Z_{−1}(h_i)) X can be computed at the cost of 2(rb + b + r) FFTs, using the following algorithm.

▹ Set η = [1, η, η^2, …, η^{n−1}]^T where η = (−1)^{1/n} = exp(iπ/n)
▹ Set X̃ = fft(diag(η) X)
▹ Set G̃ = fft(G) = [g̃_1 … g̃_r] and H̃ = fft(diag(η) H) = [h̃_1 … h̃_r]
▹ Initialize Y = 0_{n×b}
▹ for i = 1 to r
  ◦ U = Z_{−1}(h_i) X = diag(η̄) ifft(diag(h̃_i) X̃)
  ◦ V = diag(g̃_i) fft(U)
  ◦ Y = Y + V
▹ Set Y = ifft(Y)
▹ Return Y

We now show that when our structured transforms are embedded in a deep learning pipeline, the gradient computation can also be accelerated. First, we note that the Jacobian structure of f-circulant transforms has the following pleasing form.

Proposition 3.4 (Jacobian of f-circulant transforms). The Jacobian of the map f(x, v) = Z_f(v) x with respect to the parameters v is Z_f(x).

This leads to the following expressions for the Jacobians of the structured transforms of interest.

Proposition 3.5 (Jacobians with respect to displacement generators G, H). Consider parameterized vector-valued transforms of the form

f(x, G, H) = \sum_{i=1}^{r} Z_1(g_i) Z_{−1}(h_i) x    (12)

The Jacobians of f with respect to the jth columns of G, H, i.e. g_j, h_j, at x, are as follows:

J_{g_j} f|_x = Z_1(Z_{−1}(h_j) x)    (13)
J_{h_j} f|_x = Z_1(g_j) Z_{−1}(x)    (14)

Based on Eqns. 13, 14, the gradient over a minibatch of size b requires computing \sum_{i=1}^{b} [J_{g_j} f|_{x_i}]^T δ_i and \sum_{i=1}^{b} [J_{h_j} f|_{x_i}]^T δ_i, where {x_i}_{i=1}^{b} and {δ_i}_{i=1}^{b} are batches of forward and backward inputs during backpropagation. These can be naively computed with 6rb FFTs. However, as before, by sharing the FFT of the forward and backward inputs, and the FFT of the parameters, this can be lowered to (4br + 4r + 2b) FFTs. Below we give a matricized implementation.

Proposition 3.6 (Fast Gradients). Let X, Z be n × b matrices whose columns are forward and backward inputs respectively of minibatch size b during backpropagation.
The gradients with respect to g_j, h_j can be computed at the cost of (4br + 4r + 2b) FFTs as follows:

▹ Compute Z̃ = fft(Z), X̃ = fft(diag(η) X), G̃ = fft(G), H̃ = fft(diag(η) H)
▹ Gradient wrt g_j (2b + 1 FFTs):
  ◦ return ifft\left( \left[ \overline{fft(diag(η̄) ifft(diag(h̃_j) X̃))} ◦ Z̃ \right] 1_b \right)
▹ Gradient wrt h_j (2b + 1 FFTs):
  ◦ return diag(η̄) ifft\left( \left[ \overline{X̃} ◦ fft(diag(η) ifft(diag(\overline{g̃_j}) Z̃)) \right] 1_b \right)

Rectangular Transforms: Variants of Theorems 2.2, 2.4 exist for rectangular transforms, see [19]. Alternatively, for m < n we can subsample the outputs of square n × n transforms at the cost of extra computations, while for m > n, assuming m is a multiple of n, we can stack m/n output vectors of square n × n transforms.

4 Empirical Studies

Acceleration with Structured Transforms: In Figure 3, we analyze the speedup obtained in practice using n × n Circulant and Toeplitz-like matrices relative to a dense unstructured n × n matrix (fully connected layer), as a function of displacement rank and dimension n. Three scenarios are considered: inference speed per test instance, training speed as implicitly dictated by forward passes on a minibatch, and gradient computations on a minibatch. Factors such as differences in cache optimization, SIMD vectorization and multithreading between Level-2 BLAS (matrix-vector multiplication), Level-3 BLAS (matrix-matrix multiplication) and FFT implementations (we use FFTW: http://www.fftw.org) influence the speedup observed in practice. Speedup gains start to show for dimensions as small as 512 for Circulant matrices.
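The batched multiply of Theorem 3.3 that underlies these speedups fits in a few lines. Below is a numpy sketch of it (mine, not from the paper; helper names are invented), with FFTs taken along axis 0 and the result checked against the dense construction:

```python
import numpy as np

def Zf_dense(v, f):
    # dense f-circulant, built only for checking
    n = len(v)
    return np.array([[v[i - j] if i >= j else f * v[n + i - j]
                      for j in range(n)] for i in range(n)])

def toeplitz_like_multiply(G, H, X):
    """Y = (sum_i Z_1(g_i) Z_{-1}(h_i)) X with 2(rb + r + b) FFTs (Theorem 3.3)."""
    n, r = G.shape
    eta = np.exp(1j * np.pi * np.arange(n) / n)[:, None]   # eta^k as a column
    Xt = np.fft.fft(eta * X, axis=0)                       # b FFTs, shared over i
    Gt = np.fft.fft(G, axis=0)                             # r FFTs, shared over batch
    Ht = np.fft.fft(eta * H, axis=0)                       # r FFTs, shared over batch
    Y = np.zeros_like(Xt)
    for i in range(r):
        U = eta.conj() * np.fft.ifft(Ht[:, [i]] * Xt, axis=0)  # U = Z_{-1}(h_i) X
        Y += Gt[:, [i]] * np.fft.fft(U, axis=0)                # accumulate in Fourier domain
    return np.fft.ifft(Y, axis=0).real                     # final shared inverse FFT

rng = np.random.default_rng(0)
n, r, b = 16, 3, 5
G = rng.standard_normal((n, r))
H = rng.standard_normal((n, r))
X = rng.standard_normal((n, b))
M = sum(Zf_dense(G[:, i], 1.0) @ Zf_dense(H[:, i], -1.0) for i in range(r))
assert np.allclose(M @ X, toeplitz_like_multiply(G, H, X))
```

The per-iteration FFT and inverse FFT cost b each, which together with the 2r parameter FFTs, the b input FFTs, and the b-FFT final inverse gives the 2(rb + r + b) count.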
The gains become dramatic, with acceleration of the order of 10 to 100 times for several thousand dimensions, even for higher displacement rank Toeplitz-like transforms.

Figure 3: Acceleration with n × n Structured Transforms (6-core 32-GB Intel(R) Xeon(R) machine; random datasets). In the plot, displacement rank = 0 corresponds to a Circulant Transform.

Effectiveness for learning compact Neural Networks: Next, we compare the proposed structured transforms with several existing techniques for learning compact feedforward neural networks. We exactly replicate the experimental setting from the recent paper on HASHEDNETS [3], which uses several image classification datasets first prepared by [15]. MNIST is the original 10-class MNIST digit classification dataset with 60000 training examples and 10000 test examples. BG-IMG-ROT refers to a challenging version of MNIST where digits are randomly rotated and placed against a random black and white background. RECT (1200 training images, 50000 test images) and CONVEX (8000 training images, 50000 test images) are 2-class binary image datasets where the task is to distinguish between tall and wide rectangles, and whether the "on" pixels form a convex region or not, respectively. In all datasets, input images are of size 28 × 28.
Several existing techniques are benchmarked in [3] for compressing a reference single hidden layer model with 1000 hidden nodes:
• Random Edge Removal (RER) [5], where a fraction of weights are randomly frozen to be zero-valued.
• Low-rank Decomposition (LRD) [9].
• Neural Network (NN), where the hidden layer size is reduced to satisfy a parameter budget.
• Dark Knowledge (DK) [11]: A small neural network is trained with respect to both the original labeled data, as well as soft targets generated by a full uncompressed neural network.
• HashedNets (HN) [3]: This approach uses a low-cost hash function to randomly group connection weights which share the same value.
• HashedNets with Dark Knowledge (HNDK): Trains a HashedNet with respect to both the original labeled data, as well as soft targets generated by a full uncompressed neural network.

We consider learning models of comparable size with the weights in the hidden layer structured as a Toeplitz-like matrix. We also compare with the FASTFOOD approach of [25, 16], where the weight matrix is a product of diagonal parameter matrices and fixed permutation and Walsh-Hadamard matrices, also admitting O(n log n) multiplication and gradient computation time. The CIRCULANT Neural Network approach proposed in [4] is a special case of our framework (Theorem 3.1).

Results in Table 1 show that Toeplitz-like structured transforms outperform all competing approaches on all datasets, sometimes by a very significant margin, with a similar or drastically smaller number of parameters. It should also be noted that while random weight tying in HASHEDNETS reduces the number of parameters, the lack of structure in the resulting weight matrix cannot be exploited for FFT-like O(n log n) multiplication time.
We note in passing that for HASHEDNETS weight matrices whose entries assume only one of B distinct values, the Mailman algorithm [17] can be used for faster matrix-vector multiplication, with complexity O(n^2 log(B)/log n), which is still much slower than matrix-vector multiplication for Toeplitz-like matrices. Also note that the distillation ideas of [11] are complementary to our approach and can further improve our results.

Table 1: Error rate and number of parameters (italicized). Best results in blue. Each cell below is error rate / number of parameters.

Method | MNIST | BG-IMG-ROT | CONVEX | RECT
RER | 15.03 / 12406 | 73.17 / 12406 | 37.22 / 12281 | 18.23 / 12281
LRD | 28.99 / 12406 | 80.63 / 12406 | 39.93 / 12281 | 23.67 / 12281
NN | 6.28 / 12406 | 79.03 / 12406 | 34.37 / 12281 | 5.68 / 12281
DK | 6.32 / 12406 | 77.40 / 12406 | 31.85 / 12281 | 5.78 / 12281
HN | 2.79 / 12406 | 59.20 / 12406 | 31.77 / 12281 | 3.67 / 12281
HNDK | 2.65 / 12406 | 58.25 / 12406 | 30.43 / 12281 | 3.37 / 12281
Fastfood | 6.61 / 10202 | 68.4 / 10202 | 33.92 / 3922 | 21.45 / 3922
CIRCULANT | 3.12 / 8634 | 62.11 / 8634 | 24.76 / 2352 | 2.91 / 2352
TOEPLITZ (1) | 2.79 / 9418 | 57.66 / 9418 | 17.43 / 3138 | 0.70 / 3138
TOEPLITZ (2) | 2.54 / 10986 | 55.21 / 10986 | 16.18 / 4706 | 0.89 / 4706
TOEPLITZ (3) | 2.09 / 12554 | 53.94 / 12554 | 20.23 / 6774 | 0.66 / 6774

[Figure 3 plots omitted: three panels (Inference; Forward Pass, minibatch 100; Gradient, minibatch 100) showing speedup (unstructured / structured) versus displacement rank, for n = 512, 1024, 2048, 4096, 8192, 16384, 32768.]

Mobile Speech Recognition: We now demonstrate the techniques developed in this paper on a speech recognition application meant for mobile deployment. Specifically, we consider a keyword spotting (KWS) task, where a deep neural network is trained to detect a specific phrase, such as "Ok Google" [2].
The data used for these experiments consists of 10-15K utterances of selected phrases (such as "play-music", "decline-call"), and a larger set of 396K utterances to serve as negative training examples. The utterances were randomly split into training, development and evaluation sets in the ratio of 80 : 5 : 15. We created a noisy evaluation set by artificially adding babble-type cafeteria noise at 0dB SNR to the "play-music" clean data set. We will refer to this noisy data set as CAFE0. We refer the reader to [23] for more details about the datasets. We consider the task of shrinking a large model for this task whose architecture is as follows [23]: the input layer consists of 40-dimensional log-mel filterbanks, stacked with a temporal context of 32, to produce an input of size 32 × 40 whose dimensions are in time and frequency respectively. This input is fed to a convolutional layer with filter size 32 × 8, frequency stride 4 and 186 filters. The output of the convolutional layer is of size 9 × 186 = 1674. The output of this layer is fed to a 1674 × 1674 fully connected layer, followed by a softmax layer for predicting 4 classes constituting the phrase "play-music". The full training set contains about 90 million samples. We use asynchronous distributed stochastic gradient descent (SGD) in a parameter server framework [8], with 25 worker nodes for optimizing various models. The global learning rate is set to 0.002, while our structured transform layers use a layer-specific learning rate of 0.0005; both are decayed by an exponential factor of 0.1.

Figure 4: "play-music" detection performance: (left) End-to-end keyword spotting performance in terms of false reject (FR) rate per false alarm (FA) rate (lower is better); (right) Classification accuracy as a function of training time.
Displacement rank is shown in parentheses for Toeplitz-like models.

Results with 11 different models are reported in Figure 4 (left), including the state-of-the-art keyword spotting model developed in [23]. At an operating point of 1 false alarm per hour, the following observations can be made. With just 3348 parameters, a displacement rank 1 TOEPLITZ-LIKE structured transform outperforms a standard low-rank bottleneck model with rank 16 containing 16 times more parameters; it also lowers the false reject rate from 10.2% with CIRCULANT and 14.2% with FASTFOOD transforms to about 8.2%. With displacement rank 10, the false reject rate is 6.2%, compared to 6.8% with the 3-times-larger rank 32 standard low-rank bottleneck model. Our best Toeplitz-like model comes within 0.4% of the performance of the 80-times-larger fully-connected model and the 3.6-times-larger reference model [23]. In terms of raw classification accuracy as a function of training time, Figure 4 (right) shows that our models (with displacement ranks 1, 2 and 10) come within 0.2% accuracy of the fully-connected and reference models, and provide much better accuracy-time tradeoffs than the standard low-rank bottleneck models and the Circulant and Fastfood baselines. The conclusions are similar for other noise conditions (see supplementary material [1]).

5 Perspective

We have introduced and shown the effectiveness of new notions of parsimony rooted in the theory of structured matrices.
Our proposal can be extended to various other structured matrix classes, including block and multi-level Toeplitz-like [12] matrices related to multidimensional convolution [21]. We hope that such ideas might lead to new generalizations of Convolutional Neural Networks.

Acknowledgements: We thank Yu-hsin Chen, Carolina Parada, Rohit Prabhavalkar, Alex Gruenstein, Rajat Monga, Baris Sumengen, Kilian Weinberger and Wenlin Chen for their contributions.

[Figure 4 legend, model (number of parameters): fullyconnected (2.8M), reference (122K), lowrank4 (13.4K), lowrank8 (26.8K), lowrank16 (53.6K), lowrank32 (107K), circulant (1674), fastfood (5022), toeplitz-disprank1 (3348), toeplitz-disprank2 (6696), toeplitz-disprank10 (33.5K); (left) false rejects vs. false alarms per hour on play-music:cafe0, (right) accuracy (%) vs. training time (hours).]

References

[1] Supplementary material: Structured transforms for small-footprint deep learning. http://vikas.sindhwani.org/st_supplementary.pdf, 2015.

[2] G. Chen, C. Parada, and G. Heigold. Small-footprint keyword spotting using deep neural networks. In ICASSP, 2014.

[3] W. Chen, J. T. Wilson, S. Tyree, K. Q. Weinberger, and Y. Chen. Compressing neural networks with the hashing trick. In ICML, 2015.

[4] Y. Cheng, F. X. Xu, R. S. Feris, S. Kumar, A. Choudhary, and S.-F. Chang. Fast neural networks with circulant projections. arXiv:1502.03436, 2015.

[5] D. C. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber. High-performance neural networks for visual object classification. arXiv:1102.0183, 2011.

[6] M. D. Collins and P. Kohli. Memory-bounded deep convolutional neural networks.
In ICASSP, 2013.

[7] M. Courbariaux, J.-P. David, and Y. Bengio. Low-precision storage for deep learning. In ICLR, 2015.

[8] J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin, Q. V. Le, M. Z. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, and A. Y. Ng. Large-scale distributed deep networks. In NIPS, 2012.

[9] M. Denil, B. Shakibi, L. Dinh, and N. de Freitas. Predicting parameters in deep learning. In NIPS, 2013.

[10] R. M. Gray. Toeplitz and circulant matrices: A review. Foundations and Trends in Communications and Information Theory, 2005.

[11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. In NIPS workshop, 2014.

[12] T. Kailath and J. Chun. Generalized displacement structure for block Toeplitz, Toeplitz block and Toeplitz-derived matrices. SIAM J. Matrix Anal. Appl., 15, 1994.

[13] T. Kailath, S. Y. Kung, and M. Morf. Displacement ranks of matrices and linear equations. Journal of Mathematical Analysis and Applications, pages 395-407, 1979.

[14] T. Kailath and A. H. Sayed. Displacement structure: Theory and applications. SIAM Review, 37, 1995.

[15] H. Larochelle, D. Erhan, A. C. Courville, J. Bergstra, and Y. Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, 2007.

[16] Q. Le, T. Sarlos, and A. Smola. Fastfood - approximating kernel expansions in loglinear time. In ICML, 2013.

[17] E. Liberty and S. W. Zucker. The Mailman algorithm: a note on matrix vector multiplication. Information Processing Letters, 2009.

[18] V. Pan. Structured Matrices and Polynomials: Unified Superfast Algorithms. Springer, 2001.

[19] V. Pan. Inversion of displacement operators. SIAM Journal on Matrix Analysis and Applications, pages 660-677, 2003.

[20] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, 2007.

[21] M. V. Rakhuba and I. V. Oseledets.
Fast multidimensional convolution in low-rank tensor formats via cross approximation. SIAM J. Sci. Comput., 37, 2015.

[22] T. Sainath, B. Kingsbury, V. Sindhwani, E. Arisoy, and B. Ramabhadran. Low-rank matrix factorization for deep neural network training with high-dimensional output targets. In ICASSP, 2013.

[23] T. Sainath and C. Parada. Convolutional neural networks for small-footprint keyword spotting. In Proc. Interspeech, 2015.

[24] V. Vanhoucke, A. Senior, and M. Z. Mao. Improving the speed of neural networks on CPUs. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

[25] Z. Yang, M. Moczulski, M. Denil, N. de Freitas, A. Smola, L. Song, and Z. Wang. Deep fried convnets. arXiv:1412.7149, 2015.