{"title": "Learning Compressed Transforms with Low Displacement Rank", "book": "Advances in Neural Information Processing Systems", "page_first": 9052, "page_last": 9060, "abstract": "The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a rich class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks even outperform general unstructured layers while using more than 20x fewer parameters.", "full_text": "Learning Compressed Transforms with Low Displacement Rank\n\nAnna T. Thomas\u2020*, Albert Gu\u2020*, Tri Dao\u2020, Atri Rudra\u2021, Christopher R\u00e9\u2020\n\n\u2020 Department of Computer Science, Stanford University\n\n\u2021 Department of Computer Science and Engineering, University at Buffalo, SUNY\n\n{thomasat,albertgu,trid}@stanford.edu, atri@buffalo.edu, chrismre@cs.stanford.edu\n\nAbstract\n\nThe low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. 
Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks also outperform general unstructured layers while using more than 20x fewer parameters.\n\n1 Introduction\n\nRecent years have seen a surge of interest in structured representations for deep learning, motivated by achieving compression and acceleration while maintaining generalization properties. A popular approach for learning compact models involves constraining the weight matrices to exhibit some form of dense but compressible structure and learning directly over the parameterization of this structure. Examples of structures explored for the weight matrices of deep learning pipelines include low-rank matrices [15, 42], low-distortion projections [49], (block-)circulant matrices [8, 17], Toeplitz-like matrices [34, 45], and constructions derived from Fourier-related transforms [37]. Though they confer significant storage and computation benefits, these constructions tend to underperform general fully-connected layers in deep learning. 
This raises the question of whether broader classes of structured matrices can achieve superior downstream performance while retaining compression guarantees.\nOur approach leverages the low displacement rank (LDR) framework (Section 2), which encodes structure through two sparse displacement operators and a low-rank residual term [27]. Previous work studying neural networks with LDR weight matrices assumes fixed displacement operators and learns only over the residual [45, 50]. The only case attempted in practice that explicitly employs the LDR framework uses fixed operators encoding shift invariance, producing weight matrices which were found to achieve higher downstream quality than several other compression approaches [45].\nUnlike previous work, we consider learning the displacement operators jointly with the low-rank residual. Building upon recent progress on structured dense matrix-vector multiplication [14], we introduce a more general class of LDR matrices and develop practical algorithms for using these matrices in deep learning architectures. We show that the resulting class of matrices subsumes many previously used structured layers, including constructions that did not explicitly use the LDR framework [17, 37]. When compressing weight matrices in fully-connected, convolutional, and recurrent neural networks, we empirically demonstrate improved accuracy over existing approaches. Furthermore, on several tasks our constructions achieve higher accuracy than general unstructured layers while using an order of magnitude fewer parameters.\n\n*These authors contributed equally.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\nTo shed light on the empirical success of LDR matrices in machine learning, we draw connections to recent work on learning equivariant representations, and hope to motivate further investigations of this link. 
Notably, many successful previous methods for compression apply classes of structured matrices related to convolutions [8, 17, 45]; while their explicit aim is to accelerate training and reduce memory costs, this constraint implicitly encodes a shift-invariant structure that is well-suited for image and audio data. We observe that the LDR construction enforces a natural notion of approximate equivariance to transformations governed by the displacement operators, suggesting that, in contrast, our approach of learning the operators allows for modeling and learning more general latent structures in data that may not be precisely known in advance.\nDespite their increased expressiveness, our new classes retain the storage and computational benefits of conventional structured representations. Our construction provides guaranteed compression (from quadratic to linear parameters) and matrix-vector multiplication algorithms that are quasi-linear in the number of parameters. We additionally provide the first analysis of the sample complexity of learning neural networks with LDR weight matrices, which extends to low-rank, Toeplitz-like and other previously explored fixed classes of LDR matrices. More generally, our analysis applies to structured matrices whose parameters can interact multiplicatively with high degree. We prove that the class of neural networks constructed from these matrices retains VC dimension almost linear in the number of parameters, which implies that LDR matrices with learned displacement operators are still efficiently recoverable from data. 
This is consistent with our empirical results, which suggest that constraining weight layers to our broad class of LDR matrices can reduce the sample complexity of learning compared to unstructured weights.\nWe provide a detailed review of previous work and connections to our approach in Appendix B.\n\nSummary of contributions:\n\n\u2022 We introduce a rich class of LDR matrices where the displacement operators are explicitly learned from data, and provide multiplication algorithms implemented in PyTorch (Section 3).\u00b2\n\u2022 We prove that the VC dimension of multi-layer neural networks with LDR weight matrices, which encompasses a broad class of previously explored approaches including the low-rank and Toeplitz-like classes, is quasi-linear in the number of parameters (Section 4).\n\u2022 We empirically demonstrate that our construction improves downstream quality when compressing weight layers in fully-connected, convolutional, and recurrent neural networks compared to previous compression approaches, and on some tasks can even outperform general unstructured layers (Section 5).\n\n2 Background: displacement rank\n\nThe generic term structured matrix refers to an $m \\times n$ matrix that can be represented in far fewer than $mn$ parameters, and admits fast operations such as matrix-vector multiplication. The displacement rank approach represents a structured matrix $M \\in \\mathbb{R}^{m \\times n}$ through displacement operators $(A \\in \\mathbb{R}^{m \\times m}, B \\in \\mathbb{R}^{n \\times n})$ defining a linear map $\\nabla_{A,B} : M \\mapsto AM - MB$ on matrices, and a residual R, so that if\n\n$$AM - MB = R \\tag{1}$$\n\nthen M can be manipulated solely through the compressed representation (A, B, R). We assume that A and B have disjoint eigenvalues, which guarantees that M can be recovered from A, B, R (cf. Theorem 4.3.2, Pan [40]). The rank of R (also denoted $\\nabla_{A,B}[M]$) is called the displacement rank of M w.r.t. (A, B).\u00b3
\n\n\u00b2 Our code is available at https://github.com/HazyResearch/structured-nets.\n\u00b3 Throughout this paper, we use square matrices for simplicity, but LDR is well-defined for rectangular matrices.\n\nThe displacement approach was originally introduced to describe the Toeplitz-like matrices, which are not perfectly Toeplitz but still have shift-invariant structure [27]. These matrices have LDR with respect to shift/cycle operators. A standard formulation uses $A = Z_1$, $B = Z_{-1}$, where $Z_f = \\begin{bmatrix} 0_{1 \\times (n-1)} & f \\\\ I_{n-1} & 0_{(n-1) \\times 1} \\end{bmatrix}$ denotes the matrix with 1 on the subdiagonal and $f$ in the top-right corner. The Toeplitz-like matrices have previously been applied in deep learning and kernel approximation, and in several cases have performed significantly better than competing compressed approaches [10, 34, 45]. Figure 1 illustrates the displacement (1) for a Toeplitz matrix, showing how the shift invariant structure of the matrix leads to a residual of rank at most 2.\n\n$$\\begin{bmatrix} & & & 1 \\\\ 1 & & & \\\\ & \\ddots & & \\\\ & & 1 & \\end{bmatrix} \\begin{bmatrix} a_0 & a_1 & \\cdots & a_{n-1} \\\\ a_{-1} & a_0 & \\ddots & \\vdots \\\\ \\vdots & \\ddots & \\ddots & a_1 \\\\ a_{-(n-1)} & \\cdots & a_{-1} & a_0 \\end{bmatrix} - \\begin{bmatrix} a_0 & a_1 & \\cdots & a_{n-1} \\\\ a_{-1} & a_0 & \\ddots & \\vdots \\\\ \\vdots & \\ddots & \\ddots & a_1 \\\\ a_{-(n-1)} & \\cdots & a_{-1} & a_0 \\end{bmatrix} \\begin{bmatrix} & & & -1 \\\\ 1 & & & \\\\ & \\ddots & & \\\\ & & 1 & \\end{bmatrix} = \\begin{bmatrix} x & \\cdots & y & 2a_0 \\\\ & & & z \\\\ & & & \\vdots \\\\ & & & w \\end{bmatrix}$$\n\nFigure 1: Displacement equation for a Toeplitz matrix with respect to shift operators $Z_1, Z_{-1}$.\n\nA few distinct classes of useful matrices are known to satisfy a displacement property: the classic types are the Toeplitz-, Hankel-, Vandermonde-, and Cauchy-like matrices (Appendix C, Table 5), which are ubiquitous in other disciplines [40]. These classes have fixed operators consisting of diagonal or shift matrices, and LDR properties have traditionally been analyzed in detail only for these special cases. 
Nonetheless, a few elegant properties hold for generic operators, stating that certain combinations of (and operations on) LDR matrices preserve low displacement rank. We call these closure properties, and introduce an additional block closure property that is related to convolutional filter channels (Section 5.2).\nWe use the notation $\\mathcal{D}^r_{A,B}$ to refer to the matrices of displacement rank $\\leq r$ with respect to (A, B).\nProposition 1. LDR matrices are closed under the following operations:\n\n(a) Transpose/Inverse: If $M \\in \\mathcal{D}^r_{A,B}$, then $M^T \\in \\mathcal{D}^r_{B^T,A^T}$ and $M^{-1} \\in \\mathcal{D}^r_{B,A}$.\n(b) Sum: If $M \\in \\mathcal{D}^r_{A,B}$ and $N \\in \\mathcal{D}^s_{A,B}$, then $M + N \\in \\mathcal{D}^{r+s}_{A,B}$.\n(c) Product: If $M \\in \\mathcal{D}^r_{A,B}$ and $N \\in \\mathcal{D}^s_{B,C}$, then $MN \\in \\mathcal{D}^{r+s}_{A,C}$.\n(d) Block: Let $M_{ij}$ satisfy $M_{ij} \\in \\mathcal{D}^r_{A_i,B_j}$ for $i = 1 \\dots k$, $j = 1 \\dots \\ell$. Then the $k \\times \\ell$ block matrix $(M_{ij})_{ij}$ has displacement rank $rk\\ell$.\n\nProposition 1 is proved in Appendix C.\n\n3 Learning displacement operators\n\nWe consider two classes of new displacement operators. These operators are fixed to be matrices with particular sparsity patterns, where the entries are treated as learnable parameters.\nThe first operator class consists of subdiagonal (plus corner) matrices: $A_{i+1,i}$, along with the corner $A_{0,n-1}$, are the only possible non-zero entries. As $Z_f$ is a special case matching this sparsity pattern, this class is the most direct generalization of Toeplitz-like matrices with learnable operators.\nThe second class of operators consists of tridiagonal (plus corner) matrices: with the exception of the outer corners $A_{0,n-1}$ and $A_{n-1,0}$, $A_{i,j}$ can only be non-zero if $|i - j| \\leq 1$. Figure 2 shows the displacement operators for the Toeplitz-like class and our more general operators. We henceforth let LDR-SD and LDR-TD denote the classes of matrices with low displacement rank with respect to subdiagonal and tridiagonal operators, respectively. 
Note that LDR-TD contains LDR-SD.\n\n$$\\begin{bmatrix} 0 & 0 & \\cdots & 0 & f \\\\ 1 & 0 & \\cdots & 0 & 0 \\\\ 0 & 1 & \\ddots & \\vdots & \\vdots \\\\ \\vdots & \\ddots & \\ddots & 0 & 0 \\\\ 0 & \\cdots & 0 & 1 & 0 \\end{bmatrix} \\qquad \\begin{bmatrix} 0 & 0 & \\cdots & 0 & x_0 \\\\ x_1 & 0 & \\cdots & 0 & 0 \\\\ 0 & x_2 & \\ddots & \\vdots & \\vdots \\\\ \\vdots & \\ddots & \\ddots & 0 & 0 \\\\ 0 & \\cdots & 0 & x_{n-1} & 0 \\end{bmatrix} \\qquad \\begin{bmatrix} b_0 & a_0 & 0 & \\cdots & s \\\\ c_0 & b_1 & a_1 & \\ddots & \\vdots \\\\ 0 & c_1 & \\ddots & \\ddots & 0 \\\\ \\vdots & \\ddots & \\ddots & b_{n-2} & a_{n-2} \\\\ t & \\cdots & 0 & c_{n-2} & b_{n-1} \\end{bmatrix}$$\n\nFigure 2: The $Z_f$ operator (left), and our learnable subdiagonal (center) and tridiagonal (right) operators, corresponding to our proposed LDR-SD and LDR-TD classes.\n\nExpressiveness The matrices we introduce can model rich structure and subsume many types of linear transformations used in machine learning. We list some of the structured matrices that have LDR with respect to tridiagonal displacement operators:\nProposition 2. The LDR-TD matrices contain:\n\n(a) Toeplitz-like matrices, which themselves include many Toeplitz and circulant variants (including standard convolutional filters; see Section 5.2 and Appendix C, Corollary 1) [8, 17, 45].\n(b) low-rank matrices.\n(c) the other classic displacement structures: Hankel-like, Vandermonde-like, and Cauchy-like matrices.\n(d) orthogonal polynomial transforms, including the Discrete Fourier and Cosine Transforms.\n(e) combinations and derivatives of these classes via the closure properties (Proposition 1), including structured classes previously used in machine learning such as ACDC [37] and block circulant layers [17].\n\nThese reductions are stated more formally and proved in Appendix C.1. We also include a diagram of the structured matrix classes included by the proposed LDR-TD class in Figure 5 in Appendix C.1.\n\nOur parameterization Given the parameters A, B, R, the operation that must ultimately be performed is matrix-vector multiplication by $M = \\nabla_{A,B}^{-1}[R]$. 
Several schemes for explicitly reconstructing M from its displacement parameters are known for specific cases [41, 44], but do not always apply to our general operators. Instead, we use A, B, R to implicitly construct a slightly different matrix with at most double the displacement rank, which is simpler to work with.\nProposition 3. Let $K(A, v)$ denote the $n \\times n$ Krylov matrix, defined to have $i$-th column $A^i v$. For any vectors $g_1, \\dots, g_r, h_1, \\dots, h_r \\in \\mathbb{R}^n$, the matrix\n\n$$\\sum_{i=1}^{r} K(A, g_i) K(B^T, h_i)^T \\tag{2}$$\n\nhas displacement rank at most $2r$ with respect to $A^{-1}, B$.\n\nThus our representation stores the parameters A, B, G, H, where A, B are either subdiagonal or tridiagonal operators (containing $n$ or $3n$ parameters), and $G, H \\in \\mathbb{R}^{n \\times r}$. These parameters implicitly define the matrix (2), which is the LDR weight layer we use.\n\nAlgorithms for LDR-SD Generic and near-linear time algorithms for matrix-vector multiplication by LDR matrices with even more general operators, including both the LDR-TD and LDR-SD classes, were recently shown to exist [14]. However, complete algorithms were not provided, as they relied on theoretical results such as the transposition principle [6] that only imply the existence of algorithms. Additionally, the recursive polynomial-based algorithms are difficult to implement efficiently. For LDR-SD, we provide explicit and complete near-linear time algorithms for multiplication by (2), and substantially simplify them to be useful in practical settings and implementable with standard library operations. We empirically compare the efficiency of our implementation and unstructured matrix-vector multiplication in Figure 8 and Table 14 in Appendix E, showing that LDR-SD accelerates inference by 3.34-46.06x for $n \\geq 4096$. We also show results for the low-rank and Toeplitz-like classes, which have a lower computational cost. 
For LDR-TD, we explicitly construct the $K(A, g_i)$ and $K(B^T, h_i)$ matrices for $i = 1, \\dots, r$ from Proposition 3 and then apply the standard $O(n^2)$ matrix-vector multiplication algorithm. Efficient implementations of near-linear time algorithms for LDR-TD are an interesting area of future work.\nTheorem 1. Define the simultaneous computation of $k$ Fast Fourier Transforms (FFT), each with size $m$, to be a batched FFT with total size $km$.\nConsider any subdiagonal matrix $A \\in \\mathbb{R}^{n \\times n}$ and vectors $g, h \\in \\mathbb{R}^n$. Then $K(A, g)^T$ or $K(A, g)$ can be multiplied by any vector $x$ by computing $8 \\log_2(n)$ batched FFTs, each of total size $2n$. The total number of computations is $O(n \\log^2 n)$.\n\nThese algorithms are also automatically differentiable, which we use to compute the gradients when learning. More complete descriptions of these algorithms are presented in Appendix C.\n\n4 Theoretical properties of structured matrices\n\nComplexity of LDR neural networks The matrices we use (2) are unusual in that the parameters interact multiplicatively (namely in $A^i, B^i$) to implicitly define the actual layer. In contrast, fully-connected layers are linear and other structured layers, such as Fastfood and ACDC [31, 37, 49], are constant degree in their parameters. However, we can prove that this does not significantly change the learnability of our classes:\nTheorem 2. Let $\\mathcal{F}$ denote the class of neural networks with $L$ LDR layers, $W$ total parameters, and piecewise linear activations. Let $\\operatorname{sign} \\mathcal{F}$ denote the corresponding classification functions, i.e. $\\{x \\mapsto \\operatorname{sign} f(x) : f \\in \\mathcal{F}\\}$. The VC dimension of this class is\n\n$$\\operatorname{VCdim}(\\operatorname{sign} \\mathcal{F}) = O(LW \\log W).$$\n\nTheorem 2 matches the standard bound for unconstrained weight matrices [4, 24]. This immediately implies a standard PAC-learnability guarantee [47]. Theorem 2 holds for even more general activations and matrices that for example include the broad classes of [14]. 
The proof is in Appendix D, and we empirically validate the generalization and sample complexity properties of our class in Section 5.3.\n\nDisplacement rank and equivariance We observe that displacement rank is related to a line of work outside the resource-constrained learning community, specifically on building equivariant (also called covariant in some contexts [5, 35]) feature representations that transform in predictable ways when the input is transformed. An equivariant feature map $\\Phi$ satisfies\n\n$$\\Phi(B(x)) = A(\\Phi(x)) \\tag{3}$$\n\nfor transformations A, B (invariance is the special case when A is the identity) [16, 33, 43]. This means that perturbing the input by a transformation B before passing through the map $\\Phi$ is equivalent to first finding the features $\\Phi$ then transforming by A.\nIntuitively, LDR matrices are a suitable choice for modeling approximately equivariant linear maps, since the residual $A\\Phi - \\Phi B$ of (3) has low complexity. Furthermore, approximately equivariant maps should retain the compositional properties of equivariance, which LDR satisfies via Proposition 1. For example, Proposition 1(c) formalizes the notion that the composition of two approximately equivariant maps is still approximately equivariant. Using this intuition, the displacement representation (1) of a matrix decomposes into two parts: the operators A, B define transformations to which the model is approximately equivariant, and the low complexity residual R controls standard model capacity.\nEquivariance has been used in several ways in the context of machine learning. One formulation, used for example to model ego-motions, supposes that (3) holds only approximately, and uses a fixed transformation B along with data for (3) to learn an appropriate A [1, 33]. Another line of work uses the representation theory formalization of equivariant maps [12, 28]. 
We describe this formulation in more detail and show how LDR satisfies this definition as well in Appendix C.3, Proposition 7. In contrast to previous settings, which fix one or both of A, B, our formulation stipulates that $\\Phi$ can be uniquely determined from A, B, and learns the latter as part of an end-to-end model. In Section 5.4 we include a visual example of latent structure that our displacement operators learn, where they recover centering information about objects from a 2D image dataset.\n\n5 Empirical evaluation\n\nOverview In Section 5.1 we consider a standard setting of compressing a single hidden layer (SHL) neural network and the fully-connected (FC) layer of a CNN for image classification tasks. Following previous work [7, 45], we test on two challenging MNIST variants [30], and include two additional datasets with more realistic objects (CIFAR-10 [29] and NORB [32]). Since SHL models take a single channel as input, we converted CIFAR-10 to grayscale for this task. Our classes and the structured baselines are tested across different parameter budgets in order to show tradeoffs between compression and accuracy. As shown in Table 1, in the SHL model, our methods consistently have higher test accuracy than baselines for compressed training and inference, by 3.14, 2.70, 3.55, and 3.37 accuracy points on MNIST-bg-rot, MNIST-noise, CIFAR-10, and NORB respectively. In the CNN model, as shown in Table 7 in Appendix E, we found improvements of 5.56, 0.95, and 1.98 accuracy points over baselines on MNIST-bg-rot, MNIST-noise, and NORB respectively. 
Additionally, to explore whether learning the displacement operators can facilitate adaptation to other domains, we replace the input-hidden weights in an LSTM for a language modeling task, and show improvements of 0.81-30.47 perplexity points compared to baselines at several parameter budgets.\nIn addition to experiments on replacing fully-connected layers, in Section 5.2 we also replace the convolutional layer of a simple CNN while preserving performance within 1.05 accuracy points on CIFAR-10. In Section 5.3, we consider the effect of a higher parameter budget. By increasing the rank to just 16, the LDR-SD class meets or exceeds the accuracy of the unstructured FC layer in all datasets we tested on, for both SHL and CNN.\u2074 Appendix F includes more experimental details and protocols. Our PyTorch code is publicly available at github.com/HazyResearch/structured-nets.\n\n5.1 Compressing fully-connected layers\n\nImage classification Sindhwani et al. [45] showed that for a fixed parameter budget, the Toeplitz-like class significantly outperforms several other compression approaches, including Random Edge Removal [11], Low Rank Decomposition [15], Dark Knowledge [25], HashedNets [7], and HashedNets with Dark Knowledge. Following previous experimental settings [7, 45], Table 1 compares our proposed classes to several baselines using dense structured matrices to compress the hidden layer of a single hidden layer neural network. In addition to Toeplitz-like, we implement and compare to other classic LDR types, Hankel-like and Vandermonde-like, which were previously indicated as an unexplored possibility [45, 50]. We also show results when compressing the FC layer of a 7-layer CNN based on LeNet in Appendix E, Table 7. 
In Appendix E, we show comparisons to additional baselines at multiple budgets, including network pruning [23] and a baseline used in [7], in which the number of hidden units is adjusted to meet the parameter budget.\nAt rank one (the most compressed setting), our classes with learned operators achieve higher accuracy than the fixed operator classes, and on the MNIST-bg-rot, MNIST-noise, and NORB datasets even improve on FC layers of the same dimensions, by 1.73, 13.30, and 2.92 accuracy points respectively on the SHL task, as shown in Table 1. On the CNN task, our classes improve upon unstructured fully-connected layers by 0.85 and 2.25 accuracy points on the MNIST-bg-rot and MNIST-noise datasets (shown in Table 7 in Appendix E). As noted above, at higher ranks our classes meet or improve upon the accuracy of FC layers on all datasets in both the SHL and CNN architectures.\nAdditionally, in Figure 3 we evaluate the performance of LDR-SD at higher ranks. Note that the ratio of parameters between LDR-SD and the Toeplitz-like or low-rank classes is $\\frac{r+1}{r}$, so the overhead becomes negligible at higher ranks. Figure 3 shows that at just rank 16, the LDR-SD class meets or exceeds the performance of the FC layer on all four datasets, by 5.87, 15.05, 0.74, and 6.86 accuracy points on MNIST-bg-rot, MNIST-noise, CIFAR-10, and NORB respectively, while still maintaining at least 20x fewer parameters.\nOf particular note is the poor performance of low-rank matrices. As mentioned in Section 2, every fixed-operator class has the same parameterization (a low-rank matrix). 
We hypothesize that the main contribution to their marked performance difference is the effect of the learned displacement operator modeling latent invariances in the data, and that the improvement in the displacement rank classes (from low-rank to Toeplitz-like to our learned operators) comes from more accurate representations of these invariances. As shown in Figure 3, broadening the operator class (from Toeplitz-like at r = 1 to LDR-SD at r = 1) is consistently a more effective use of parameters than increasing the displacement rank (from Toeplitz-like at r = 1 to r = 2). Note that LDR-SD (r = 1) and Toeplitz-like (r = 2) have the same parameter count.\nFor the rest of our experiments outside Section 5.1 we use the algorithms in Appendix C specifically for LDR-SD matrices, and focus on further evaluation of this class on more expensive models.\n\n\u2074 In addition to the results reported in Table 1, Figure 3 and Table 7 in Appendix E, we also found that at rank 16 the LDR-SD class on the CNN architecture achieved test accuracies of 68.48% and 75.45% on CIFAR-10 and NORB respectively.\n\nTable 1: Test accuracy when replacing the hidden layer with structured classes. Where applicable, rank (r) is in parentheses after the method name, and the number of parameters in the architecture is given in parentheses after each accuracy. Comparisons to previously unexplored classic LDR types as well as additional structured baselines are included, with the ranks adjusted to match the parameter count of LDR-TD where possible. The Fastfood [49] and Circulant [8] methods do not have rank parameters, and the parameter count for these methods cannot be exactly controlled. Additional results when replacing the FC layer of a CNN are in Appendix E. Details for all experiments are in Appendix F.\n\nMethod | MNIST-bg-rot | MNIST-noise | CIFAR-10 | NORB\nUnstructured | 44.08 (622506) | 65.15 (622506) | 46.03 (1058826) | 59.83 (1054726)\nLDR-TD (r = 1) | 45.81 (14122) | 78.45 (14122) | 45.33 (18442) | 62.75 (14342)\nToeplitz-like [45] (r = 4) | 42.67 (14122) | 75.75 (14122) | 41.78 (18442) | 59.38 (14342)\nHankel-like (r = 4) | 42.23 (14122) | 73.65 (14122) | 41.40 (18442) | 60.09 (14342)\nVandermonde-like (r = 4) | 37.14 (14122) | 59.80 (14122) | 33.93 (18442) | 48.98 (14342)\nLow-rank [15] (r = 4) | 35.67 (14122) | 52.25 (14122) | 32.28 (18442) | 43.66 (14342)\nFastfood [49] | 38.13 (10202) | 63.55 (10202) | 39.64 (13322) | 59.02 (9222)\nCirculant [8] | 34.46 (8634) | 65.35 (8634) | 34.28 (11274) | 46.45 (7174)\n\nFigure 3: Test accuracy vs. rank for unstructured, LDR-SD, Toeplitz-like, and low-rank classes. 
On each dataset, LDR-SD meets or exceeds the accuracy of the unstructured FC baseline at higher ranks. At rank 16, the compression ratio of an LDR-SD layer compared to the unstructured layer ranges from 23 to 30. Shaded regions represent two standard deviations from the mean, computed over five trials with randomly initialized weights.\n\nLanguage modeling Here, we replace the input-hidden weights in a single layer long short-term memory network (LSTM) for a language modeling task. We evaluate on the WikiText-2 dataset, consisting of 2M training tokens and a vocabulary size of 33K [36]. We compare to Toeplitz-like and low-rank baselines, both previously investigated for compressing recurrent nets [34]. As shown in Table 2, LDR-SD improves upon the baselines for each budget tested. Though our class does not outperform the unstructured model, we did find that it achieves a significantly lower perplexity than the fixed Toeplitz-like class (by 19.94-42.92 perplexity points), suggesting that learning the displacement operator can help adapt to different domains.\n\nTable 2: Test perplexity when replacing input-hidden matrices of an LSTM with structured classes on WikiText-2. An unconstrained layer, with 65536 parameters, has perplexity 117.74. 
Parameter budgets correspond to ranks 1, 2, 4, 8, 16, 24 for LDR-SD. Lower is better.\n\nNum. Parameters | LDR-SD | Toeplitz-like | Low-rank\n2048 | 166.97 | 186.91 | 205.72\n3072 | 154.51 | 177.60 | 179.46\n5120 | 141.91 | 178.07 | 172.38\n9216 | 143.60 | 186.52 | 144.41\n17408 | 132.43 | 162.58 | 135.65\n25600 | 129.46 | 155.73 | 133.37\n\n5.2 Replacing convolutional layers\n\nConvolutional layers of CNNs are a prominent example of equivariant feature maps.\u2075 It has been noted that convolutions are a subcase of Toeplitz-like matrices with a particular sparsity pattern\u2076 [8, 45]. As channels are simply block matrices\u2077, the block closure property implies that multi-channel convolutional filters are simply a Toeplitz-like matrix of higher rank (see Appendix C, Corollary 1). In light of the interpretation of LDR as an approximately equivariant linear map (as discussed in Section 4), we investigate whether replacing convolutional layers with more general representations can recover similar performance, without needing the hand-crafted sparsity pattern.\nBriefly, we test the simplest multi-channel CNN model on the CIFAR-10 dataset, consisting of one layer of convolutional channels (3 in/out channels), followed by a FC layer, followed by the softmax layer. The final accuracies are listed in Table 3. The most striking result is for the simple architecture consisting of two layers of a single structured matrix. This comes within 1.05 accuracy points of the highly specialized architecture consisting of convolutional channels + pooling + FC layer, while using fewer layers, hidden units, and parameters. The full details are in Appendix F.\n\n\u2075 Convolutions are designed to be shift equivariant, i.e. shifting the input is equivalent to shifting the output.\n\u2076 E.g. a $3 \\times 3$ convolutional filter on an $n \\times n$ matrix has a Toeplitz weight matrix supported on diagonals $-1, 0, 1, n - 1, n, n + 1, 2n - 1, \\dots$.\n\u2077 A layer consisting of $k$ in-channels and $\\ell$ out-channels, each of which is connected by a weight matrix of class C, is the same as a $k \\times \\ell$ block matrix.\n\nTable 3: Replacing a five-layer CNN consisting of convolutional channels, max pooling, and FC layers with two generic LDR matrices results in only slight test accuracy decrease while containing fewer layers, hidden units, and parameters. 
Rank (r) is in parentheses.\n\nFirst hidden layer(s) | Last hidden layer | Hidden units | Parameters | Test Acc.\n3 Convolutional Channels (CC) | FC | 3072, 512 | 1573089 | 54.59\n3CC + Max Pool | FC | 3072, 768, 512 | 393441 | 55.14\n4CC + Max Pool | FC | 4096, 1024, 512 | 524588 | 60.05\nToeplitz-like (r = 16) channels | Toeplitz-like (r = 16) | 3072, 512 | 393216 | 57.29\nLDR-SD (r = 16) channels | LDR-SD (r = 16) | 3072, 512 | 417792 | 59.36\nToeplitz-like (r = 48) matrix | Toeplitz-like (r = 16) | 3072, 512 | 393216 | 55.29\nLDR-SD (r = 48) matrix | LDR-SD (r = 16) | 3072, 512 | 405504 | 59.00\n\n5.3 Generalization and sample complexity\n\nTheorem 2 states that the theoretical sample complexity of neural networks with structured weight matrices scales almost linearly in the total number of parameters, matching the results for networks with fully-connected layers [4, 24]. As LDR matrices have far fewer parameters, the VC dimension bounds for LDR networks are correspondingly lower than those of general unstructured networks. Though the VC dimension bounds are sufficient but not necessary for learnability, one might still expect to be able to learn over compressed networks with fewer samples than over unstructured networks. We empirically investigate this result using the same experimental setting as Table 1 and Figure 3. As shown in Table 12 (Appendix E), the structured classes consistently have lower generalization error (measured by the difference between training and test error) than the unstructured baseline.\n\n[Figure 4 panels: (a) Toeplitz-like, (b) LDR-SD, (c) Subdiagonal of B, (d) Input examples]\n\nFigure 4: The learned weight matrices (a,b) of models trained on MNIST-bg-rot. Unlike the Toeplitz-like matrix, the LDR-SD matrix displays grid-like periodicity corresponding to the 2D input. Figure (c) shows the values of the subdiagonal of B, reshaped as an image. The size and location of the circle roughly correspond to the location of objects of interest in the 2D inputs. 
A similar centering phenomenon was found on the NORB dataset, shown in Figure 6 in Appendix E.\n\nReducing sample complexity We investigate whether LDR models with learned displacement operators require fewer samples to achieve the same test error, compared to unstructured weights, in both the single hidden layer and CNN architectures. Tables 10 and 11 in Appendix E show our results. In the single hidden layer architecture, when using only 25% of the training data the LDR-TD class exceeds the performance of an unstructured model trained on the full MNIST-noise dataset. On the CNN model, only 50% of the training data is sufficient for the LDR-TD to exceed the performance of an unstructured layer trained on the full dataset.\n\n5.4 Visualizing learned weights\n\nFinally, we examine the actual structures that our models learn. Figure 4(a,b) shows the heat map of the weight matrix $W \\in \\mathbb{R}^{784 \\times 784}$ for the Toeplitz-like and LDR-SD classes, trained on MNIST-bg-rot with a single hidden layer model. As is convention, the input is flattened to a vector in $\\mathbb{R}^{784}$. 
The\nToeplitz-like class is unable to determine that the input is actually a 28 \u00d7 28 image instead of a vector.\nIn contrast, the LDR-SD class is able to pick up regularity in the input, as the weight matrix displays\ngrid-like periodicity of size 28.\n\nFigure 4(c) reveals why the weight matrix displays this pattern. The equivariance interpretation\n(Section 4) predicts that B should encode a meaningful transformation of the inputs. The entries of\nthe learned subdiagonal are in fact recovering a latent invariant of the 2D domain: when visualized as\nan image, the pixel intensities correspond to how the inputs are centered in the dataset (Figure 4(d)).\nFigure 6 in Appendix E shows a similar visualization for the NORB dataset, which has smaller objects, and\nwe found that the subdiagonal learns a correspondingly smaller circle.\n\n6 Conclusion\n\nWe generalize the class of low displacement rank matrices explored in machine learning by considering\nclasses of LDR matrices with displacement operators that can be learned from data. We show\nthese matrices can improve performance on downstream tasks compared to compression baselines\nand, on some tasks, general unstructured weight layers. We hope this work inspires additional ways\nof using structure to achieve both more compact and higher quality representations, especially for\ndeep learning models, which are commonly acknowledged to be overparameterized.", "award": [], "sourceid": 5425, "authors": [{"given_name": "Anna", "family_name": "Thomas", "institution": "Stanford"}, {"given_name": "Albert", "family_name": "Gu", "institution": "Stanford"}, {"given_name": "Tri", "family_name": "Dao", "institution": "Stanford University"}, {"given_name": "Atri", "family_name": "Rudra", "institution": "University at Buffalo, SUNY"}, {"given_name": "Christopher", "family_name": "R\u00e9", "institution": "Stanford"}]}