{"title": "Tensor Switching Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2038, "page_last": 2046, "abstract": "We present a novel neural network algorithm, the Tensor Switching (TS) network, which generalizes the Rectified Linear Unit (ReLU) nonlinearity to tensor-valued hidden units. The TS network copies its entire input vector to different locations in an expanded representation, with the location determined by its hidden unit activity. In this way, even a simple linear readout from the TS representation can implement a highly expressive deep-network-like function. The TS network hence avoids the vanishing gradient problem by construction, at the cost of larger representation size. We develop several methods to train the TS network, including equivalent kernels for infinitely wide and deep TS networks, a one-pass linear learning algorithm, and two backpropagation-inspired representation learning algorithms. Our experimental results demonstrate that the TS network is indeed more expressive and consistently learns faster than standard ReLU networks.", "full_text": "Tensor Switching Networks\n\nChuan-Yung Tsai\u2217, Andrew Saxe\u2217, David Cox\n\nCenter for Brain Science, Harvard University, Cambridge, MA 02138\n\n{chuanyungtsai,asaxe,davidcox}@fas.harvard.edu\n\nAbstract\n\nWe present a novel neural network algorithm, the Tensor Switching (TS) network,\nwhich generalizes the Recti\ufb01ed Linear Unit (ReLU) nonlinearity to tensor-valued\nhidden units. The TS network copies its entire input vector to different locations in\nan expanded representation, with the location determined by its hidden unit activity.\nIn this way, even a simple linear readout from the TS representation can implement\na highly expressive deep-network-like function. 
The TS network hence avoids the\nvanishing gradient problem by construction, at the cost of larger representation size.\nWe develop several methods to train the TS network, including equivalent kernels\nfor in\ufb01nitely wide and deep TS networks, a one-pass linear learning algorithm, and\ntwo backpropagation-inspired representation learning algorithms. Our experimental\nresults demonstrate that the TS network is indeed more expressive and consistently\nlearns faster than standard ReLU networks.\n\n1\n\nIntroduction\n\nDeep networks [1, 2] continue to post impressive successes in a wide range of tasks, and the Recti\ufb01ed\nLinear Unit (ReLU) [3, 4] is arguably the most used simple nonlinearity. In this work we develop a\nnovel deep learning algorithm, the Tensor Switching (TS) network, which generalizes the ReLU such\nthat each hidden unit conveys a tensor, instead of scalar, yielding a more expressive model. Like the\nReLU network, the TS network is a linear function of its input, conditioned on the activation pattern\nof its hidden units. By separating the decision to activate from the analysis performed when active,\neven a linear classi\ufb01er can reach back across all layers to the input of the TS network, implementing\na deep-network-like function while avoiding the vanishing gradient problem [5], which can otherwise\nsigni\ufb01cantly slow down learning in deep networks. 
The trade-off is the representation size.\nWe exploit the properties of TS networks to develop several methods suitable for learning in different\nscaling regimes, including their equivalent kernels for SVMs on small to medium datasets, a one-pass\nlinear learning algorithm which visits each data point only once for use with very large but simpler\ndatasets, and two backpropagation-inspired representation learning algorithms for more generic use.\nOur experimental results show that TS networks are indeed more expressive and consistently learn\nfaster than standard ReLU networks.\nRelated work is brie\ufb02y summarized as follows. With respect to improving the nonlinearities, the idea\nof severing activation and analysis weights (or having multiple sets of weights) in each hidden layer\nhas been studied in [6, 7, 8]. Reordering activation and analysis is proposed by [9]. On tackling the\nvanishing gradient problem, tensor methods are used by [10] to train single-hidden-layer networks.\nConvex learning and inference in various deep architectures can be found in [11, 12, 13] too. Finally,\nconditional linearity of deep ReLU networks is also used by [14], mainly to analyze their performance.\nIn comparison, the TS network does not simply reorder or sever activation and analysis within each\nhidden layer. Instead, it is a cross-layer generalization of these concepts, which can be applied with\nmost of the recent deep learning architectures [15, 9], not only to increase their expressiveness, but\nalso to help avoiding the vanishing gradient problem (see Sec. 2.3).\n\n\u2217Equal contribution.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\n\fFigure 1: (Left) A single-hidden-layer standard (i.e. Scalar Switching) ReLU network. 
(Right) A single-hidden-layer Tensor Switching ReLU network, where each hidden unit conveys a vector of activities\u2014inactive units (top-most unit) convey a vector of zeros while active units (bottom two units) convey a copy of their input.\n\n2 Tensor Switching Networks\n\nIn the following we first construct the definition of shallow (single-hidden-layer) TS networks, then generalize the definition to deep TS networks, and finally describe their qualitative properties. For simplicity, we only show fully-connected architectures using the ReLU nonlinearity. However, other popular nonlinearities, e.g. max pooling and maxout [16], in addition to ReLU, are also supported in both fully-connected and convolutional architectures.\n\n2.1 Shallow TS Networks\n\nThe TS-ReLU network is a generalization of standard ReLU networks that permits each hidden unit to convey an entire tensor of activity (see Fig. 1). To describe it, we build up from the standard ReLU network. Consider a ReLU layer with weight matrix W1 \u2208 Rn1\u00d7n0 responding to an input vector X0 \u2208 Rn0. The resulting hidden activity X1 \u2208 Rn1 of this layer is X1 = max (0n1, W1X0) = H (W1X0) \u25e6 (W1X0), where H is the Heaviside step function and \u25e6 denotes elementwise product. The rightmost equation splits apart each hidden unit\u2019s decision to activate, represented by the term H (W1X0), from the information (i.e. result of analysis) it conveys when active, denoted by W1X0. We then go one step further to rewrite X1 as\n\nX1 = (Z1 \u2299 W1) \u00d7 1n0, where Z1 = H (W1X0) \u2297 X0, (1)\n\nwhere we have made use of the following tensor operations: vector-tensor cross product C = A \u2297 B \u21d2 ci,j,k,... = ai bj,k,..., tensor-matrix Hadamard product C = A \u2299 B \u21d2 c...,j,i = a...,j,i bj,i, and tensor summative reduction C = A \u00d7 1n \u21d2 c...,k,j = \u2211ni=1 a...,k,j,i. In (1), the input vector X0 is first expanded into a new matrix representation Z1 \u2208 Rn1\u00d7n0 with one row per hidden unit. If a hidden unit is active, the input vector X0 is copied to the corresponding row. Otherwise, the row is filled with zeros. Finally, this expanded representation Z1 is collapsed back by projection onto W1.\n\nThe central idea behind the TS-ReLU network is to learn a linear classifier directly from the rich, expanded representation Z1, rather than collapsing it back to the lower dimensional X1. That is, in a standard ReLU network, the hidden layer activity X1 is sent through a linear classifier fX (WX X1) trained to minimize some loss function LX (fX). In the TS-ReLU network, by contrast, the expanded representation Z1 is sent to a linear classifier fZ (WZ vec (Z1)) with loss function LZ (fZ). Each TS-ReLU neuron thus transmits a vector of activities (a row of Z1), compared to a standard ReLU neuron that transmits a single scalar (see Fig. 1). Because of this difference, in the following we call the standard ReLU network a Scalar Switching ReLU (SS-ReLU) network.\n\n2.2 Deep TS Networks\n\nThe construction given above generalizes readily to deeper networks. Define a nonlinear expansion operation as X \u2295 W = H (WX) \u2297 X and a linear contraction operation as Z \u2296 W = (Z \u2299 W) \u00d7 1n, such that (1) becomes Xl = ((Xl\u22121 \u2295 Wl) \u2299 Wl) \u00d7 1nl\u22121 = Xl\u22121 \u2295 Wl \u2296 Wl for a given layer l 
with Xl \u2208 Rnl and Wl \u2208 Rnl\u00d7nl\u22121. A deep SS-ReLU network with L layers may then be expressed as a sequence of alternating expansion and contraction steps,\n\nXL = X0 \u2295 W1 \u2296 W1 \u00b7\u00b7\u00b7 \u2295 WL \u2296 WL. (2)\n\nTo obtain the deep TS-ReLU network, we further define the ternary expansion operation Z \u2295X W = H (WX) \u2297 Z, such that the decision to activate is based on the SS-ReLU variables X, but the entire tensor Z is transmitted when the associated hidden unit is active. Let Z0 = X0. The l-th layer activity tensor of a TS network can then be written as Zl = H (Wl Xl\u22121) \u2297 Zl\u22121 = Zl\u22121 \u2295Xl\u22121 Wl \u2208 Rnl\u00d7nl\u22121\u00d7\u00b7\u00b7\u00b7\u00d7n0. Thus, compared to a deep SS-ReLU network, a deep TS-ReLU network simply omits the contraction steps,\n\nZL = Z0 \u2295X0 W1 \u00b7\u00b7\u00b7 \u2295XL\u22121 WL. (3)\n\nBecause there are no contraction steps, the order of Zl \u2208 Rnl\u00d7nl\u22121\u00d7\u00b7\u00b7\u00b7\u00d7n0 grows with depth, adding an additional dimension for each layer. One interpretation of this scheme is that, if a hidden unit at layer l is active, the entire tensor Zl\u22121 is copied to the appropriate position in Zl.1 Otherwise a tensor of zeros is copied. Another equivalent interpretation is that the input vector X0 is copied to a given position Zl(i, j, . . . , k, :) only if hidden units i, j, . . . , k at layers l, l \u2212 1, . . . , 1 respectively are all active. Otherwise, Zl(i, j, . . . , k, :) = 0n0. Hence activity propagation in the deep TS-ReLU network preserves the layered structure of a deep SS-ReLU network, in which a chain of hidden units across layers must activate for activity to propagate from input to output.\n\n2.3 Properties\n\nThe TS network decouples a hidden unit\u2019s decision to activate (as encoded by the activation weights {Wl}) from the analysis performed on the input when the unit is active (as encoded by the analysis weights WZ). 
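The operations just defined are easy to make concrete. The following is a minimal NumPy sketch (toy sizes and helper names of our own choosing, not the paper's released tsnet code) of the expansion-only TS forward pass (3), together with the fact, used later in Sec. 4.2, that gathering all contractions at the end recovers the SS activations XL:

```python
import numpy as np

relu = lambda v: np.maximum(0.0, v)

def expand(Z, X, W):
    # Ternary expansion Z (+)_X W = H(W X) (x) Z: the switching decision
    # H(W X) comes from the SS variables X, and each active hidden unit
    # receives a full copy of the tensor Z (inactive units receive zeros).
    return np.multiply.outer((W @ X > 0).astype(float), Z)

def contract(A, W):
    # Linear contraction A (-) W = (A (.) W) x 1_n: Hadamard product against
    # W over the two trailing axes, then sum out the last axis.
    return np.einsum('...ji,ji->...j', A, W)

rng = np.random.default_rng(0)
n0, n1, n2 = 5, 4, 3
X0 = rng.standard_normal(n0)
W1 = rng.standard_normal((n1, n0))
W2 = rng.standard_normal((n2, n1))

# SS-ReLU forward pass: alternating expansion and contraction, as in (2).
X1 = relu(W1 @ X0)
X2 = relu(W2 @ X1)

# TS-ReLU forward pass: expansions only, as in (3), with Z0 = X0.
Z1 = expand(X0, X0, W1)    # matrix in R^{n1 x n0}
Z2 = expand(Z1, X1, W2)    # order-3 tensor in R^{n2 x n1 x n0}

# Shallow case, eq. (1): contracting Z1 against W1 recovers the ReLU layer.
assert np.allclose(contract(Z1, W1), X1)

# Deep case: gathering all contractions at the end recovers X2 exactly.
assert np.allclose(contract(contract(Z2, W1), W2), X2)
```

Because the network is linear once the switching pattern is fixed, the order of expansion and contraction steps does not affect the output; this is the identity exploited by the inverted backpropagation scheme of Sec. 4.2.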
This distinguishing feature leads to the following three properties.\nCross-layer analysis. Since the TS representation preserves the layered structure of a deep network and offers direct access to the entire input (parcellated by the activated hidden units), a simple linear readout can effectively reach back across layers to the input and thus implicitly learns analysis weights for all layers at one time in WZ. Therefore it avoids the vanishing gradient problem by construction.2\nError-correcting analysis. As activation and analysis are severed, a careful selection of the analysis weights can \u201cclean up\u201d a certain amount of inexactitude in the choice to activate, e.g. from noisy or even random activation weights. In the SS network, by contrast, bad activation also implies bad analysis.\nFine-grained analysis. To see this, we consider single-hidden-layer TS and SS networks with just one hidden unit. The TS unit, when active, conveys the entire input vector, and hence any full-rank linear map from input to output may be implemented. The SS unit, when active, conveys just a single scalar, and hence can only implement a rank-1 linear map between input and output. By choosing the right analysis weights, a TS network can always implement an SS network,3 but not vice versa. As such, it clearly has greater modeling capacity for a fixed number of hidden units.\nAlthough the TS representation is highly expressive, it comes at the cost of an exponential increase in the size of its representation with depth, i.e. \u220fl nl. This renders TS networks of substantial width and depth very challenging (except as kernels). But as we will show, the expressiveness permits TS networks to perform fairly well without having to be extremely wide and deep, and often noticeably better than SS networks of the same sizes. Also, TS networks of useful sizes can still be implemented with reasonable computing resources, especially when combined with techniques in Sec. 4.3.\n\n3 Equivalent Kernels\n\nIn this section we derive equivalent kernels for TS-ReLU networks with arbitrary depth and an infinite number of hidden units at each layer, with the aim of providing theoretical insight into how TS-ReLU is analytically different from SS-ReLU. These kernels represent the extreme of infinite (but unlearned) features, and might be used in SVM on datasets of small to medium sizes.\n\n1For convolutional networks using max pooling, the convolutional-window-sized input patch winning the max pooling is copied. In other words, different nonlinearities only change the way the input is switched.\n2It is in spirit similar to models with skip connections to the output [17, 18], although not exactly reducible.\n3Therefore TS networks are also universal function approximators [19].\n\nFigure 2: Equivalent kernels as a function of the angle between unit-length vectors x and y. The deep SS-ReLU kernel converges to 1 everywhere as L \u2192 \u221e, while the deep TS-ReLU kernel converges to 1 at the origin and 0 everywhere else.\n\nConsider a single-hidden-layer TS-ReLU network with n1 hidden units in which each element of the activation weight matrix W1 \u2208 Rn1\u00d7n0 is i.i.d. zero mean Gaussian with arbitrary standard deviation \u03c3. The infinite-width random TS-ReLU kernel between two vectors x, y \u2208 Rn0 is the dot product between their expanded representations (scaled by \u221a(2/n1) for convenience) in the limit of infinite hidden units,\n\nkTS1 (x, y) = limn1\u2192\u221e vec(\u221a(2/n1) x \u2295 W1)\u22a4 vec(\u221a(2/n1) y \u2295 W1) = 2 E[H (w\u22a4x) H (w\u22a4y)] x\u22a4y,\n\nwhere w \u223c N(0, \u03c32I) is an n0-dimensional random Gaussian vector. The expectation is the probability that a randomly chosen vector w lies within 90 degrees of both x and y. 
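This probability is easy to spot-check by sampling w directly; below is a small Monte Carlo sketch (the angle pi/3, the dimensionality, and the sample count are arbitrary choices of ours, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.pi / 3                       # angle between the two unit vectors
x = np.array([1.0, 0.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta), 0.0])

W = rng.standard_normal((400_000, 3))   # rows are i.i.d. isotropic draws of w
p = np.mean((W @ x > 0) & (W @ y > 0))  # P(w within 90 degrees of both)

# Geometric prediction: the fraction (pi - theta) / (2 pi) of directions.
assert abs(p - (np.pi - theta) / (2 * np.pi)) < 0.01

# The kernel value 2 E[.] x.y then matches the closed form derived next.
assert abs(2 * p * (x @ y) - (1 - theta / np.pi) * (x @ y)) < 0.02
```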
Because w is drawn from an isotropic Gaussian, if x and y differ by an angle \u03b8, then only the fraction (\u03c0\u2212\u03b8)/2\u03c0 of randomly drawn w will be within 90 degrees of both, yielding the equivalent kernel of a single-hidden-layer infinite-width random TS-ReLU network given in (5).4\n\nkSS1 (x, y) = \u00afkSS(\u03b8) x\u22a4y = (1 \u2212 (\u03b8 \u2212 tan \u03b8)/\u03c0) x\u22a4y, (4)\nkTS1 (x, y) = \u00afkTS(\u03b8) x\u22a4y = (1 \u2212 \u03b8/\u03c0) x\u22a4y. (5)\n\nFigure 2 compares (5) against the linear kernel and the single-hidden-layer infinite-width random SS-ReLU kernel (4) from [20] (see Linear, TS L = 1 and SS L = 1). It has two important qualitative features. First, it has a discontinuous derivative at \u03b8 = 0, and hence a much sharper peak than the other kernels.5 Intuitively this means that a very close match counts for much more than a moderately close match. Second, unlike the SS-ReLU kernel which is non-negative everywhere, the TS-ReLU kernel still has a negative lobe, though it is substantially reduced relative to the linear kernel. Intuitively this means that being dissimilar to a support vector can provide evidence against a particular classification, but this negative evidence is much weaker than in a standard linear kernel.\nTo derive kernels for deeper TS-ReLU networks, we need to consider the deeper SS-ReLU kernels as well, since the TS network\u2019s activation and analysis are severed, and the activation instead depends on its SS-ReLU counterpart. Based upon the recursive formulation from [20], first we define the zeroth-layer kernel k\u20220 (x, y) = x\u22a4y and the generalized angle \u03b8\u2022l = cos\u22121(k\u2022l (x, y)/\u221a(k\u2022l (x, x) k\u2022l (y, y))), where \u2022 denotes SS or TS. Then we can easily get kSSl+1 (x, y) = \u00afkSS(\u03b8SSl) kSSl (x, y),6 and kTSl+1 (x, y) = \u00afkTS(\u03b8SSl) kTSl (x, y), where \u00afk\u2022 follows (4) or (5) accordingly.\n\nFigure 2 also plots the deep TS-ReLU and SS-ReLU kernels as a function of depth. The shape of these kernels reveals sharply divergent behavior between the TS and SS networks. As depth increases, the equivalent kernel of the TS network falls off ever more rapidly as the angle between input vectors increases. This means that vectors must be an ever closer match to retain a high kernel value. As argued earlier, this highlights the ability of the TS network to pick up on and amplify small differences between inputs, resulting in a quasi-nearest-neighbor behavior. In contrast, the equivalent kernel of the SS network limits to one as depth increases. Thus, rather than amplifying small differences, it collapses them with depth such that even very dissimilar vectors receive high kernel values.\n\n4This proof is succinct using a geometric view, while a longer proof can be found in the Supplementary Material. As the kernel is directly defined as a dot product between feature vectors, it is naturally a valid kernel.\n5Interestingly, a similar kernel is also observed by [21] for models with explicit skip connections.\n6We write (4) and kSSl differently from [20] for cleaner comparisons against TS-ReLU kernels. However they are numerically unstable expressions and are not used in our experiments to replace the original ones in [20].\n\nFigure 3: Inverted backpropagation learning flowchart, where \u2192 denotes signal flow, \u21e2 denotes pseudo gradient flow, and = denotes equivalence. 
(Top row) The SS pathway. (Bottom row) The TS and auxiliary pathways, where the Zl\u2019s are related by nonlinear expansions, and the Al\u2019s are related by linear contractions. The resulting AL is equivalent to the alternating expansion and contraction in the SS pathway that yields XL.\n\n4 Learning Algorithms\n\nIn the following we present three learning algorithms suitable for different scenarios. One-pass ridge regression in Sec. 4.1 learns only the linear readout (i.e. analysis weights WZ), leaving the hidden-layer representations (i.e. activation weights {Wl}) random, hence it is convex and exactly solvable. Inverted backpropagation in Sec. 4.2 learns both analysis and activation weights. Linear Rotation-Compression in Sec. 4.3 also learns both weights, but learns activation weights in an indirect way.\n\n4.1 Linear Readout Learning via One-pass Ridge Regression\n\nIn this scheme, we leverage the intuition that precision in the decision for a hidden unit to activate is less important than carefully tuned analysis weights, which can in part compensate for poorly tuned activation weights. We randomly draw and fix the activation weights {Wl}, and then solve for the analysis weights WZ using ridge regression, which can be done in a single pass through the dataset. First, each data point p = 1, . . . , P is expanded into its tensor representation ZpL and then accumulated into the correlation matrices CZZ = \u2211p vec(ZpL) vec(ZpL)\u22a4 and CyZ = \u2211p yp vec(ZpL)\u22a4. After all data points are processed once, the analysis weights are determined as WZ = CyZ (CZZ + \u03bbI)\u22121, where \u03bb is an L2 regularization parameter.\n\nUnlike a standard SS network, which in this setting would only be able to select a linear readout from the top hidden layer to the final classification decision, the TS network offers direct access to entire input vectors, parcellated by the hidden units they activate. 
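For a single hidden layer, the whole procedure fits in a few lines. The sketch below (toy sizes and an arbitrary lambda of our choosing, not the paper's code) also confirms that the accumulated one-pass solution matches batch ridge regression on the stacked expanded features:

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1, P, n_cls, lam = 6, 5, 200, 3, 0.1
W1 = rng.standard_normal((n1, n0))             # fixed random activation weights

X = rng.standard_normal((P, n0))               # toy inputs
Y = np.eye(n_cls)[rng.integers(0, n_cls, P)]   # one-hot targets y_p

def expanded(x):
    # vec(Z1) for one data point: active rows hold copies of x, others zeros.
    return np.multiply.outer((W1 @ x > 0).astype(float), x).ravel()

d = n1 * n0
Czz = np.zeros((d, d))
Cyz = np.zeros((n_cls, d))
for x, y in zip(X, Y):                         # a single pass over the data
    z = expanded(x)
    Czz += np.outer(z, z)
    Cyz += np.outer(y, z)

Wz = Cyz @ np.linalg.inv(Czz + lam * np.eye(d))

# Same answer as solving the ridge problem on all expanded features at once.
Z = np.stack([expanded(x) for x in X])
Wz_batch = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y).T
assert np.allclose(Wz, Wz_batch)
```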
In this way, even a linear readout can effectively reach back across layers to the input, implementing a complex function not representable with an SS network with random filters. However, this scheme requires high memory usage, on the order of O(\u220fLl=0 n2l) for storing CZZ, and an even higher computation cost7 for solving WZ, which makes deep architectures (i.e. L > 1) impractical. Therefore, this scheme may best suit online learning applications which allow only one-time access to data, but do not require a deep classifier.\n\n4.2 Representation Learning via Inverted Backpropagation\n\nThe ridge regression learning uses random activation weights and only learns analysis weights. Here we provide a \u201cgradient-based\u201d procedure to learn both weights. Learning the analysis weights (i.e. the final linear layer) WZ simply requires \u2202LZ/\u2202WZ, which is generally easy to compute. However, since the activation weights Wl in the TS network only appear inside the Heaviside step function H with zero (or undefined) derivative, the gradient \u2202LZ/\u2202Wl is also zero. To bypass this, we introduce a sequence of auxiliary variables Al defined by A0 = ZL and the recursion Al = Al\u22121 \u2296 Wl \u2208 RnL\u00d7nL\u22121\u00d7\u00b7\u00b7\u00b7\u00d7nl. We then derive a pseudo gradient using the proposed inverted backpropagation as\n\n\u2202LZ/\u2202Wl \u2248 \u2202LZ/\u2202A0 (\u2202A1/\u2202A0)\u2020 \u00b7\u00b7\u00b7 (\u2202Al/\u2202Al\u22121)\u2020 \u2202Al/\u2202Wl, (6)\n\nwhere \u2020 denotes the Moore\u2013Penrose pseudoinverse. Because the Al\u2019s are related via the linear contraction operator, these derivatives are non-zero and easy to compute. 
We find this works sufficiently well as a non-zero proxy for \u2202LZ/\u2202Wl.\n\n7Nonetheless this is a one-time cost and still can be advantageous over other slowly converging algorithms.\n\nOur motivation with this scheme is to \u201crecover\u201d the learning behavior in SS networks. To see this, first note that AL = A0 \u2296 W1 \u00b7\u00b7\u00b7 \u2296 WL = XL (see Fig. 3). This reflects the fact that the TS and SS networks are linear once the active set of hidden units is known, such that the order of expansion and contraction steps has no effect on the final output. Hence the linear contraction steps, which alternate with expansion steps in (2), can instead be gathered at the end after all expansion steps. The gradient in the SS network is then\n\n\u2202LX/\u2202Wl = \u2202LX/\u2202AL \u2202AL/\u2202AL\u22121 \u00b7\u00b7\u00b7 \u2202Al+1/\u2202Al \u2202Al/\u2202Wl = \u2202LX/\u2202A0 (\u2202A1/\u2202A0)\u2020 \u00b7\u00b7\u00b7 (\u2202Al/\u2202Al\u22121)\u2020 \u2202Al/\u2202Wl, where \u2202LX/\u2202A0 = \u2202LX/\u2202AL \u2202AL/\u2202AL\u22121 \u00b7\u00b7\u00b7 \u2202A1/\u2202A0. (7)\n\nReplacing \u2202LX/\u2202A0 in (7) with \u2202LZ/\u2202A0, such that the expanded representation may influence the inverted gradient, we recover (6). Compared to one-pass ridge regression, this scheme controls the memory and time complexities at O(\u220fl nl), which makes training of a moderately-sized TS network on modern computing resources feasible. 
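The pseudoinverse step at the heart of (6) and (7) can be seen on a single contraction: the Jacobian of A1 with respect to A0 is block diagonal with full row rank (almost surely), so pseudo-inverting it exactly recovers the upstream gradient. A minimal sketch with toy sizes (our own construction, not the released code):

```python
import numpy as np

rng = np.random.default_rng(0)
n0, n1 = 4, 3
W = rng.standard_normal((n1, n0))
A0 = rng.standard_normal((n1, n0))   # stands in for Z_L in the case L = 1
t = rng.standard_normal(n1)          # regression target

A1 = np.einsum('ji,ji->j', A0, W)    # A1 = A0 (-) W, the linear contraction
g1 = A1 - t                          # dL/dA1 for L = 0.5 * ||A1 - t||^2
g0 = (g1[:, None] * W).ravel()       # dL/dA0 by the chain rule

# Jacobian dA1/dA0: row j holds W[j] in its own n0-wide block.
J = np.zeros((n1, n1 * n0))
for j in range(n1):
    J[j, j * n0:(j + 1) * n0] = W[j]

# J J^+ = I here, so applying the Moore-Penrose pseudoinverse to dL/dA0
# recovers dL/dA1 exactly -- the inversion used in (6) and (7).
assert np.allclose(g0 @ np.linalg.pinv(J), g1)
```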
The ability to train activation weights also relaxes the assumption that analysis weights can \u201cclean up\u201d inexact activations caused by using even random weights.\n\n4.3 Indirect Representation Learning via Linear Rotation-Compression\n\nAlthough the inverted backpropagation learning controls memory and time complexities better than the one-pass ridge regression, the exponential growth of a TS network\u2019s representation still severely constrains its potential toward being applied in recent deep learning architectures, where network width and depth can easily go beyond, e.g., a thousand. In addition, the success of recent deep learning architectures also heavily depends on the acceleration provided by highly-optimized GPU-enabled libraries, where the operations of the previous learning schemes are mostly unsupported.\n\nTo address these two concerns, we provide a standard backpropagation-compatible learning algorithm, where we no longer keep separate X and Z variables. Instead we define Xl = W\u2217l vec(Xl\u22121 \u2295 Wl), which directly flattens the expanded representation and linearly projects it against W\u2217l \u2208 Rn\u2217l\u00d7nlnl\u22121. In this scheme, even though Wl still lacks a non-zero gradient, the W\u2217l\u22121 of the previous layer can be learned using backpropagation to properly \u201crotate\u201d Xl\u22121, such that it can be utilized by Wl and the TS nonlinearity. Therefore, the representation learning here becomes indirect. To simultaneously control the representation size, one can let n\u2217l < nlnl\u22121 such that W\u2217l becomes \u201ccompressive.\u201d Interestingly, we find n\u2217l = nl often works surprisingly well, which suggests linearly compressing the expanded TS representation back to the size of an SS representation can still retain its advantage, and thus is used as the default. This scheme can also be combined with inverted backpropagation if learning Wl is still desired.\n\nTo understand why linear compression does not remove the TS representation power, we note that it is not equivalent to the linear contraction operation \u2296, where each tensor-valued unit is down-projected independently. Linear compression introduces extra interaction between tensor-valued units. Another way to view the linear compression\u2019s role is through kernel analysis as shown in Sec. 3\u2014adding a linear layer does not change the shape of a given TS kernel.\n\n5 Experimental Results\n\nOur experiments focus on comparing TS and SS networks with the goal of determining how the TS nonlinearities differ from their SS counterparts. SVMs using SS-ReLU and TS-ReLU kernels are implemented in Matlab based on libsvm-compact [22]. TS networks and all three learning algorithms in Sec. 4 are implemented in Python based on Numpy\u2019s ndarray data structure. Both implementations utilize multicore CPU acceleration. In addition, TS networks with only the linear rotation-compression learning are also implemented in Keras, which enjoys much faster GPU acceleration.\n\nWe adopt three datasets, viz. MNIST, CIFAR10 and SVHN2, where we reserve the last 5,000 training images for validation. 
We also include SVHN2\u2019s extra training set (except for SVMs8) in the training process, and zero-pad MNIST images such that all datasets have the same spatial resolution\u201432 \u00d7 32.\n\n8Due to the prohibitive kernel matrix size, as SVMs here can only be solved in the dual form.\n\nTable 1: Error rate (%) and run time (\u00d7) comparison. Each dataset column lists one-pass \u2013 asymptotic error rates, with the depth of the corresponding model in parentheses (for CNNs, convolutional + fully-connected layers).\n\nModel | MNIST | CIFAR10 | SVHN2 | Time\nSS SVM | \u2013 1.40 (5) | \u2013 43.18 (7) | \u2013 21.60 (1) | 1.0\nTS SVM | \u2013 1.40 (3) | \u2013 43.60 (2) | \u2013 20.38 (1) | 2.1\nSS MLP | 16.34 (2) \u2013 2.36 (3) | 66.41 (1) \u2013 46.91 (2) | 30.24 (3) \u2013 12.20 (3) | 1.0\nTS MLP RR | 2.99 (1) \u2013 | 47.71 (1) \u2013 | 27.11 (1) \u2013 | 156.2\nTS MLP LRC | 3.33 (2) \u2013 2.06 (2) | 55.69 (1) \u2013 46.87 (2) | 20.42 (2) \u2013 12.58 (3) | 11.7\nTS MLP IBP-LRC | 3.33 (1) \u2013 2.33 (1) | 55.69 (1) \u2013 45.86 (2) | 20.20 (2) \u2013 12.63 (3) | 17.4\nSS CNN | 43.74 (3+1) \u2013 1.08 (4+2) | 74.84 (3+3) \u2013 26.73 (5+2) | 13.69 (7+1) \u2013 4.96 (6+1) | 1.0\nTS CNN LRC | 3.85 (5+3) \u2013 0.86 (6+2) | 54.40 (3+3) \u2013 25.74 (8+3) | 9.13 (7+3) \u2013 5.06 (6+3) | 2.0\n\nRR = One-Pass Ridge Regression, LRC = Linear Rotation-Compression, IBP = Inverted Backpropagation.\n\nFigure 4: Comparison of SS CNN and TS CNN LRC models. (Left) Each dot\u2019s coordinate indicates the differences of one-pass and asymptotic error rates between one pair of SS CNN and TS CNN LRC models sharing the same hyperparameters. The first quadrant shows where the TS CNN LRC is better in both errors. (Right) Validation error rates vs. training time on CIFAR10 from the shallower, intermediate and deeper models.\n\nFor SVMs, we grid search for both kernels with depth from 1 to 10, C from 1 to 1,000, and PCA dimension reduction of the images to 32, 64, 128, 256, or no reduction. For SS and TS networks with fully-connected (i.e. 
MLP) architectures, we grid search for depth from 1 to 3 and width (including PCA of the input) from 32 to 256 based on our Python implementation. For SS and TS networks with convolutional (i.e. CNN) architectures, we adopt VGG-style [15] convolutional layers with 3 standard SS convolution-max pooling blocks,9 where each block can have up to three 3 \u00d7 3 convolutions, plus 1 to 3 fully-connected SS or TS layers of fixed width 256. CNN experiments are based on our Keras implementation. For all MLPs and CNNs, we universally use SGD with learning rate 10\u22123, momentum 0.9, L2 weight decay 10\u22123 and batch size 128 to reduce the grid search complexity by focusing on architectural hyperparameters. All networks are trained for 100 epochs on MNIST and CIFAR10, and 20 epochs on SVHN2, without data augmentation. The source code and scripts for reproducing our experiments are available at https://github.com/coxlab/tsnet.\n\nTable 1 summarizes our experimental results, including both one-pass (i.e. first-epoch) and asymptotic (i.e. all-epoch) error rates and the corresponding depths (for CNNs, convolutional and fully-connected layers are listed separately). The TS nonlinearities perform better in almost all categories, confirming our theoretical insights in Sec. 2.3\u2014the cross-layer analysis (as evidenced by their low error rates after only one epoch of training), the error-correcting analysis (on MNIST and CIFAR10, for instance, the one-pass error rates of TS MLP RR using fixed random activation are close to the asymptotic error rates of TS MLP LRC and IBP-LRC with trained activation), and the fine-grained analysis (the TS networks in general achieve better asymptotic error rates than their SS counterparts).\n\n9This decision is mainly to accelerate the experimental process, since TS convolution runs much slower, but we also observe that TS nonlinearities in lower layers are not always helpful. 
See later for more discussion.\n\nFigure 5: Visualization of filters learned on (Top) MNIST, (Middle) CIFAR10 and (Bottom) SVHN2, by backpropagation (SS MLP) and inverted backpropagation (TS MLP IBP).\n\nTo further demonstrate how using TS nonlinearities affects the distribution of performance across different architectures (here, mainly depth), we plot the performance gains (viz. one-pass and asymptotic error rates) introduced by using the TS nonlinearities on all CNN variants in Fig. 4. The fact that most dots are in the first quadrant (and none in the third quadrant) suggests the TS nonlinearities are predominantly beneficial. Also, to ease the concern that the TS networks\u2019 higher complexity may simply consume their advantage in actual run time, we also provide examples of learning progress (i.e. validation error rate) over run time in Fig. 4. The results suggest that even our unoptimized TS network implementation can still provide sizable gains in learning speed.\n\nFinally, to verify the effectiveness of inverted backpropagation in learning useful activation filters even without the actual gradient, we train single-hidden-layer SS and TS MLPs with 16 hidden units each (without using PCA dimension reduction of the input) and visualize the learned filters in Fig. 5. The results suggest inverted backpropagation functions equally well.\n\n6 Discussion\n\nWhy do TS networks learn quickly? In general, the TS network sidesteps the vanishing gradient problem as it skips the long chain of linear contractions against the analysis weights (i.e. the auxiliary pathway in Fig. 3). Its linear readout has direct access to the full input vector, which is switched to different parts of the highly expressive expanded representation. 
This direct access accelerates learning. Also, a well-flowing gradient confers benefits beyond the TS layers; e.g. SS layers placed before TS layers also learn faster, since the TS layers “self-organize” rapidly, permitting useful error signals to flow to the lower layers sooner.10 Lastly, when using the inverted backpropagation or linear rotation-compression learning, although {Wl} or {W∗l} do not learn as fast as WZ, and may still be quite random in the first few epochs, the error-correcting nature of WZ can still compensate for their slow learning progress.

Challenges toward deeper TS networks. As shown in Fig. 2, the equivalent kernels of deeper TS networks can be extremely sharp and discriminative, which unavoidably hurts invariant recognition of dissimilar examples. This may explain why we find that having TS nonlinearities in only the higher (instead of all) layers works better: the lower SS layers can form invariant representations for the higher TS layers to classify. To remedy this, we may need to consider other types of regularization for WZ (instead of L2) or other smoothing techniques [25, 26].

Future work. Our main future direction is to improve the TS network's scalability, which may require more parallelism (e.g. multi-GPU processing), more customization (e.g. GPU kernels utilizing the sparsity of TS representations), and preferably more memory storage/bandwidth (e.g. GPUs using 3D-stacked memory). With improved scalability, we also plan to further verify the TS nonlinearity's efficiency in state-of-the-art architectures [27, 9, 18], which are still computationally prohibitive with our current implementation.

Acknowledgments

We would like to thank James Fitzgerald, Mien “Brabeeba” Wang, Scott Linderman, and Yu Hu for fruitful discussions. We also thank the anonymous reviewers for their valuable comments.
This work was supported by NSF (IIS 1409097), IARPA (contract D16PC00002), and the Swartz Foundation.

10 This is a crucial aspect of gradient descent dynamics in layered structures, which behave like a chain: the weakest link must change first [23, 24].

References

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, 2015.
[2] J. Schmidhuber, “Deep learning in neural networks: An overview,” Neural Networks, 2015.
[3] R. Hahnloser, R. Sarpeshkar, M. Mahowald, R. Douglas, and S. Seung, “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit,” Nature, 2000.
[4] V. Nair and G. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in ICML, 2010.
[5] S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber, “Gradient Flow in Recurrent Nets: the Difficulty of Learning Long-Term Dependencies,” in A Field Guide to Dynamical Recurrent Networks, 2001.
[6] A. Courville, J. Bergstra, and Y. Bengio, “A Spike and Slab Restricted Boltzmann Machine,” in AISTATS, 2011.
[7] K. Konda, R. Memisevic, and D. Krueger, “Zero-bias autoencoders and the benefits of co-adapting features,” in ICLR, 2015.
[8] R. Srivastava, K. Greff, and J. Schmidhuber, “Training Very Deep Networks,” in NIPS, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, “Identity Mappings in Deep Residual Networks,” in ECCV, 2016.
[10] M. Janzamin, H. Sedghi, and A. Anandkumar, “Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods,” arXiv, 2015.
[11] L. Deng and D. Yu, “Deep Convex Net: A Scalable Architecture for Speech Pattern Classification,” in Interspeech, 2011.
[12] B. Amos and Z. Kolter, “Input-Convex Deep Networks,” in ICLR Workshop, 2015.
[13] Ö. Aslan, X. Zhang, and D.
Schuurmans, \u201cConvex Deep Learning via Normalized Kernels,\u201d in NIPS, 2014.\n\n[14] S. Wang, A. Mohamed, R. Caruana, J. Bilmes, M. Plilipose, M. Richardson, K. Geras, G. Urban, and\n\nO. Aslan, \u201cAnalysis of Deep Neural Networks with the Extended Data Jacobian Matrix,\u201d in ICML, 2016.\n\n[15] K. Simonyan and A. Zisserman, \u201cVery Deep Convolutional Networks for Large-Scale Image Recognition,\u201d\n\nin ICLR, 2015.\n\n[16] I. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, and Y. Bengio, \u201cMaxout Networks,\u201d in ICML,\n\n2013.\n\n[17] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and\n\nA. Rabinovich, \u201cGoing Deeper with Convolutions,\u201d in CVPR, 2015.\n\n[18] G. Huang, Z. Liu, and K. Weinberger, \u201cDensely Connected Convolutional Networks,\u201d arXiv, 2016.\n\n[19] S. Sonoda and N. Murata, \u201cNeural network with unbounded activation functions is universal approximator,\u201d\n\nApplied and Computational Harmonic Analysis, 2015.\n\n[20] Y. Cho and L. Saul, \u201cLarge-Margin Classi\ufb01cation in In\ufb01nite Neural Networks,\u201d Neural Computation, 2010.\n\n[21] D. Duvenaud, O. Rippel, R. Adams, and Z. Ghahramani, \u201cAvoiding pathologies in very deep networks,\u201d in\n\nAISTATS, 2014.\n\n[22] J. And\u00e9n and S. Mallat, \u201cDeep Scattering Spectrum,\u201d IEEE T-SP, 2014.\n\n[23] A. Saxe, J. McClelland, and S. Ganguli, \u201cExact solutions to the nonlinear dynamics of learning in deep\n\nlinear neural networks,\u201d in ICLR, 2014.\n\n[24] A. Saxe, \u201cA deep learning theory of perceptual learning dynamics,\u201d in COSYNE, 2015.\n\n[25] T. Miyato, S. Maeda, M. Koyama, K. Nakae, and S. Ishii, \u201cDistributional Smoothing with Virtual\n\nAdversarial Training,\u201d in ICLR, 2016.\n\n[26] Q. Bai, S. Rosenberg, Z. Wu, and S. Sclaroff, \u201cDifferential Geometric Regularization for Supervised\n\nLearning of Classi\ufb01ers,\u201d in ICML, 2016.\n\n[27] J. 
Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for Simplicity: The All Convolutional Net,” in ICLR Workshop, 2015.