{"title": "A Normative Theory of Adaptive Dimensionality Reduction in Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 2269, "page_last": 2277, "abstract": "To make sense of the world our brains must analyze high-dimensional datasets streamed by our sensory organs. Because such analysis begins with dimensionality reduction, modelling early sensory processing requires biologically plausible online dimensionality reduction algorithms. Recently, we derived such an algorithm, termed similarity matching, from a Multidimensional Scaling (MDS) objective function. However, in the existing algorithm, the number of output dimensions is set a priori by the number of output neurons and cannot be changed. Because the number of informative dimensions in sensory inputs is variable there is a need for adaptive dimensionality reduction. Here, we derive biologically plausible dimensionality reduction algorithms which adapt the number of output dimensions to the eigenspectrum of the input covariance matrix. We formulate three objective functions which, in the offline setting, are optimized by the projections of the input dataset onto its principal subspace scaled by the eigenvalues of the output covariance matrix. In turn, the output eigenvalues are computed as i) soft-thresholded, ii) hard-thresholded, iii) equalized thresholded eigenvalues of the input covariance matrix. In the online setting, we derive the three corresponding adaptive algorithms and map them onto the dynamics of neuronal activity in networks with biologically plausible local learning rules. Remarkably, in the last two networks, neurons are divided into two classes which we identify with principal neurons and interneurons in biological circuits.", "full_text": "A Normative Theory of Adaptive Dimensionality\n\nReduction in Neural Networks\n\nCengiz Pehlevan\n\nSimons Center for Data Analysis\n\nSimons Foundation\nNew York, NY 10010\n\nDmitri B. 
Chklovskii

Simons Center for Data Analysis
Simons Foundation
New York, NY 10010

cpehlevan@simonsfoundation.org
dchklovskii@simonsfoundation.org

Abstract

To make sense of the world our brains must analyze high-dimensional datasets streamed by our sensory organs. Because such analysis begins with dimensionality reduction, modeling early sensory processing requires biologically plausible online dimensionality reduction algorithms. Recently, we derived such an algorithm, termed similarity matching, from a Multidimensional Scaling (MDS) objective function. However, in the existing algorithm, the number of output dimensions is set a priori by the number of output neurons and cannot be changed. Because the number of informative dimensions in sensory inputs is variable, there is a need for adaptive dimensionality reduction. Here, we derive biologically plausible dimensionality reduction algorithms which adapt the number of output dimensions to the eigenspectrum of the input covariance matrix. We formulate three objective functions which, in the offline setting, are optimized by the projections of the input dataset onto its principal subspace scaled by the eigenvalues of the output covariance matrix. In turn, the output eigenvalues are computed as i) soft-thresholded, ii) hard-thresholded, iii) equalized thresholded eigenvalues of the input covariance matrix. In the online setting, we derive the three corresponding adaptive algorithms and map them onto the dynamics of neuronal activity in networks with biologically plausible local learning rules. Remarkably, in the last two networks, neurons are divided into two classes which we identify with principal neurons and interneurons in biological circuits.

1 Introduction

Our brains analyze high-dimensional datasets streamed by our sensory organs with efficiency and speed rivaling modern computers.
At the early stage of such analysis, the dimensionality of sensory inputs is drastically reduced, as evidenced by anatomical measurements. The human retina, for example, conveys signals from ≈125 million photoreceptors to the rest of the brain via ≈1 million ganglion cells [1], suggesting a hundred-fold dimensionality reduction. Therefore, biologically plausible dimensionality reduction algorithms may offer a model of early sensory processing.

In a seminal work [2], Oja proposed that a single neuron may compute the first principal component of activity in upstream neurons. At each time point, Oja's neuron projects a vector composed of firing rates of upstream neurons onto the vector of synaptic weights by summing up currents generated by its synapses. In turn, synaptic weights are adjusted according to a Hebbian rule depending on the activities of only the postsynaptic and corresponding presynaptic neurons [2].

Following Oja's work, many multineuron circuits were proposed to extract multiple principal components of the input; for a review see [3]. However, most multineuron algorithms did not meet the same level of rigor and biological plausibility as the single-neuron algorithm [2, 4], which can be derived using a normative approach, from a principled objective function [5], and contains only local Hebbian learning rules. Algorithms derived from principled objective functions either did not possess local learning rules [6, 4, 7, 8] or had other biologically implausible features [9]. In other algorithms, local rules were chosen heuristically rather than derived from a principled objective function [10, 11, 12, 9, 3, 13, 14, 15, 16].

There is a notable exception to the above observation, but it has other shortcomings. The two-layer circuit with reciprocal synapses [17, 18, 19] can be derived from the minimization of the representation error.
However, the activity of principal neurons in the circuit is a dummy variable without its own dynamics. Therefore, such principal neurons do not integrate their input in time, contradicting existing experimental observations.

Other normative approaches use an information theoretical objective to compare theoretical limits with experimentally measured information in single neurons or populations [20, 21, 22] or to calculate optimal synaptic weights in a postulated neural network [23, 22].

Recently, a novel approach to the problem has been proposed [24]. Starting with the Multidimensional Scaling (MDS) strain cost function [25, 26], we derived an algorithm which maps onto a neuronal circuit with local learning rules. However, [24] had major limitations, which are shared by various other multineuron algorithms:

1. The number of output dimensions was determined by the fixed number of output neurons, precluding adaptation to the varying number of informative components. A better solution would be to let the network decide, depending on the input statistics, how many dimensions to represent [14, 15]. The dimensionality of neural activity in such a network would usually be less than the maximum set by the number of neurons.

2. Because output neurons were coupled by anti-Hebbian synapses, which are most naturally implemented by inhibitory synapses, if these neurons were to have excitatory outputs, as suggested by cortical anatomy, they would violate Dale's law (i.e. each neuron uses only one fast neurotransmitter). Here, following [10], by anti-Hebbian we mean synaptic weights that get more negative with correlated activity of pre- and postsynaptic neurons.

3. The output had a wide dynamic range which is difficult to implement using biological neurons with a limited range.
A better solution [27, 13] is to equalize the output variance across neurons.

In this paper, we advance the normative approach of [24] by proposing three new objective functions which allow us to overcome the above limitations. We optimize these objective functions by proceeding as follows. In Section 2, we formulate and solve three optimization problems of the form:

Offline setting: Y* = arg min_Y L(X, Y).   (1)

Here, the input to the network, X = [x_1, ..., x_T], is an n × T matrix with T centered input data samples in R^n as its columns, and the output of the network, Y = [y_1, ..., y_T], is a k × T matrix with corresponding outputs in R^k as its columns. We assume T ≫ k and T ≫ n. Such optimization problems are posed in the so-called offline setting, where outputs are computed after seeing all the data. Whereas optimization problems in the offline setting admit closed-form solutions, such a setting is ill-suited for modeling neural computation on the mechanistic level and must be replaced by the online setting. Indeed, neurons compute an output, y_T, for each data sample presentation, x_T, before the next data sample is presented, and past outputs cannot be altered. In such an online setting, optimization is performed at every time step, T, on an objective which is a function of all inputs and outputs up to time T.
Moreover, an online algorithm (also known as streaming) is not capable of storing all previous inputs and outputs and must rely on a smaller number of state variables.

In Section 3, we formulate three corresponding online optimization problems with respect to y_T, while keeping all the previous outputs fixed:

Online setting: y_T ← arg min_{y_T} L(X, Y).   (2)

Then we derive algorithms solving these problems online and map their steps onto the dynamics of neuronal activity and local learning rules for synaptic weights in three neural networks.

We show that the solutions of the optimization problems and the corresponding online algorithms remove the limitations outlined above by performing the following computational tasks:

Figure 1: Input-output functions of the three offline solutions and neural network implementations of the corresponding online algorithms. A-C. Input-output functions of covariance eigenvalues. A. Soft-thresholding. B. Hard-thresholding. C. Equalization after thresholding. D-F. Corresponding network architectures.

1. Soft-thresholding the eigenvalues of the input covariance matrix, Figure 1A: eigenvalues below the threshold are set to zero and the rest are shrunk by the threshold magnitude. Thus, the number of output dimensions is chosen adaptively. This algorithm maps onto a single-layer neural network with the same architecture as in [24], Figure 1D, but with modified learning rules.

2. Hard-thresholding of input eigenvalues, Figure 1B: eigenvalues below the threshold vanish as before, but eigenvalues above the threshold remain unchanged. The steps of such an algorithm map onto the dynamics of neuronal activity in a network which, in addition to principal neurons, has a layer of interneurons reciprocally connected with principal neurons and each other, Figure 1E.

3. Equalization of non-zero eigenvalues, Figure 1C.
The corresponding network's architecture, Figure 1F, lacks reciprocal connections among interneurons. As before, the number of above-threshold eigenvalues is chosen adaptively and cannot exceed the number of principal neurons. If the two are equal, this network whitens the output.

In Section 4, we demonstrate that the online algorithms perform well on a synthetic dataset and, in Discussion, we compare our neural circuits with biological observations.

2 Dimensionality reduction in the offline setting

In this Section, we introduce and solve, in the offline setting, three novel optimization problems whose solutions reduce the dimensionality of the input. We state our results in three Theorems which are proved in the Supplementary Material.

2.1 Soft-thresholding of covariance eigenvalues

We consider the following optimization problem in the offline setting:

min_Y ||X⊤X − Y⊤Y − αT I_T||²_F ,   (3)

where α ≥ 0 and I_T is the T × T identity matrix. To gain intuition behind this choice of the objective function, let us expand the squared norm and keep only the Y-dependent terms:

arg min_Y ||X⊤X − Y⊤Y − αT I_T||²_F = arg min_Y ||X⊤X − Y⊤Y||²_F + 2αT Tr(Y⊤Y),   (4)

where the first term matches the similarity of input and output [24] and the second term is the nuclear norm of Y⊤Y, known to be a convex relaxation of the matrix rank used for low-rank matrix modeling [28]. Thus, objective function (3) enforces low-rank similarity matching.

We show that the optimal output Y is a projection of the input data, X, onto its principal subspace. The subspace dimensionality is set by m, the number of eigenvalues of the data covariance matrix, C = (1/T) XX⊤ = (1/T) Σ_{t=1}^T x_t x_t⊤, that are greater than or equal to the parameter α.

Theorem 1. Suppose an eigen-decomposition of X⊤X = V^X Λ^X V^X⊤, where Λ^X = diag(λ^X_1, ..., λ^X_T) with λ^X_1 ≥ ... ≥ λ^X_T. Note that Λ^X has at most n nonzero eigenvalues coinciding with those of TC. Then,

Y* = U_k ST_k(Λ^X, αT)^{1/2} V^X_k⊤,   (5)

are optima of (3), where ST_k(Λ^X, αT) = diag(ST(λ^X_1, αT), ..., ST(λ^X_k, αT)), ST is the soft-thresholding function, ST(a, b) = max(a − b, 0), V^X_k consists of the columns of V^X corresponding to the top k eigenvalues, i.e. V^X_k = [v^X_1, ..., v^X_k], and U_k is any k × k orthogonal matrix, i.e. U_k ∈ O(k). The form (5) uniquely defines all optima of (3), except when k < m, λ^X_k > αT, and λ^X_k = λ^X_{k+1}.

2.2 Hard-thresholding of covariance eigenvalues

Consider the following minimax problem in the offline setting:

min_Y max_Z ||X⊤X − Y⊤Y||²_F − ||Y⊤Y − Z⊤Z − αT I_T||²_F ,   (6)

where α ≥ 0 and we introduced an internal variable Z, which is an l × T matrix Z = [z_1, ..., z_T] with z_t ∈ R^l. The intuition behind this objective function is again based on similarity matching, but rank regularization is applied indirectly via the internal variable, Z.

Theorem 2. Suppose an eigen-decomposition of X⊤X = V^X Λ^X V^X⊤, where Λ^X = diag(λ^X_1, ..., λ^X_T) with λ^X_1 ≥ ... ≥ λ^X_T ≥ 0. Assume l ≥ min(k, m). Then,

Y* = U_k HT_k(Λ^X, αT)^{1/2} V^X_k⊤,   Z* = U_l ST_{l,min(k,m)}(Λ^X, αT)^{1/2} V^X_l⊤,   (7)

are optima of (6), where HT_k(Λ^X, αT) = diag(HT(λ^X_1, αT), ..., HT(λ^X_k, αT)), HT(a, b) = a Θ(a − b) with Θ() being the step function: Θ(a − b) = 1 if a ≥ b and Θ(a − b) = 0 if a < b, ST_{l,min(k,m)}(Λ^X, αT) = diag(ST(λ^X_1, αT), ..., ST(λ^X_{min(k,m)}, αT), 0, ..., 0) with l − min(k, m) trailing zeros, V^X_p = [v^X_1, ..., v^X_p], and U_p ∈ O(p). The form (7) uniquely defines all optima of (6) except when either 1) α is an eigenvalue of C or 2) k < m and λ^X_k = λ^X_{k+1}.

2.3 Equalizing thresholded covariance eigenvalues

Consider the following minimax problem in the offline setting:

min_Y max_Z Tr(−X⊤X Y⊤Y + Y⊤Y Z⊤Z + αT Y⊤Y − βT Z⊤Z),   (8)

where α ≥ 0 and β > 0. This objective function follows from (6) after dropping the quartic Z term.

Theorem 3. Suppose an eigen-decomposition of X⊤X is X⊤X = V^X Λ^X V^X⊤, where Λ^X = diag(λ^X_1, ..., λ^X_T) with λ^X_1 ≥ ... ≥ λ^X_T ≥ 0. Assume l ≥ min(k, m). Then,

Y* = U_k √(βT) Θ_k(Λ^X, αT)^{1/2} V^X_k⊤,   Z* = U_l Σ_{l×T} O_{Λ^{Y*}} V^X⊤,   (9)

are optima of (8), where Θ_k(Λ^X, αT) = diag(Θ(λ^X_1 − αT), ..., Θ(λ^X_k − αT)), Σ_{l×T} is an l × T rectangular diagonal matrix whose top min(k, m) diagonal entries are set to arbitrary nonnegative constants and the rest are zero, O_{Λ^{Y*}} is a block-diagonal orthogonal matrix that has two blocks: the top block is min(k, m)-dimensional and the bottom block is (T − min(k, m))-dimensional, V^X_p = [v^X_1, ..., v^X_p], and U_p ∈ O(p). The form (9) uniquely defines all optima of (8) except when either 1) α is an eigenvalue of C or 2) k < m and λ^X_k = λ^X_{k+1}.

Remark 1. If k = m, then Y is full-rank and (1/T) YY⊤ = β I_k, implying that the output is whitened, equalizing variance across all channels.

3 Online dimensionality reduction using Hebbian/anti-Hebbian neural nets

In this Section, we formulate online versions of the dimensionality reduction optimization problems presented in the previous Section, derive corresponding online algorithms, and map them onto the dynamics of neural networks with biologically plausible local learning rules.
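Before turning to the online derivations, the three offline eigenvalue maps of Theorems 1-3 can be illustrated with a small numerical sketch (our own illustration, not the paper's code; the arbitrary rotation U_k is fixed to the identity, and the equalizing output is restricted to k channels):

```python
import numpy as np

def offline_outputs(X, k, alpha, beta=1.0):
    """Sketch of the offline solutions of Theorems 1-3 (with U_k = I).

    X: n x T matrix of centered inputs. Returns a dict of k x T outputs Y*
    for the soft-thresholding, hard-thresholding and equalizing objectives.
    """
    T = X.shape[1]
    # Eigen-decomposition of the Gram matrix X^T X (eigenvalues lambda^X).
    lam, V = np.linalg.eigh(X.T @ X)
    lam, V = lam[::-1], V[:, ::-1]            # sort in descending order
    lam_k, V_k = lam[:k], V[:, :k]
    thr = alpha * T
    eig_maps = {
        "soft": np.maximum(lam_k - thr, 0.0),   # ST(lambda, alpha*T), Theorem 1
        "hard": lam_k * (lam_k >= thr),         # HT(lambda, alpha*T), Theorem 2
        "equal": beta * T * (lam_k >= thr),     # beta*T*Theta(lambda - alpha*T), Theorem 3
    }
    # Y* = U_k diag(map)^(1/2) V_k^T with U_k = I.
    return {name: np.sqrt(vals)[:, None] * V_k.T for name, vals in eig_maps.items()}
```

By construction, the Gram matrix Y*Y*⊤ of each output equals the corresponding thresholded eigenvalue matrix, which is an easy sanity check on the three theorems.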
The order of subsections corresponds to that in the previous Section.

3.1 Online soft-thresholding of eigenvalues

Consider the following optimization problem in the online setting:

y_T ← arg min_{y_T} ||X⊤X − Y⊤Y − αT I_T||²_F .   (10)

By keeping only the terms that depend on y_T, we get the following objective for (2):

L = −4 x_T⊤ (Σ_{t=1}^{T−1} x_t y_t⊤) y_T + 2 y_T⊤ (Σ_{t=1}^{T−1} y_t y_t⊤ + αT I_k) y_T − 2 ||x_T||² ||y_T||² + ||y_T||⁴.   (11)

In the large-T limit, the last two terms can be dropped since the first two terms grow linearly with T and dominate. The remaining cost is a positive definite quadratic form in y_T and the optimization problem is convex. At its minimum, the following equality holds:

y_T = (Σ_{t=1}^{T−1} y_t y_t⊤ + αT I_k)^{−1} (Σ_{t=1}^{T−1} y_t x_t⊤) x_T.   (12)

While a closed-form analytical solution via matrix inversion exists for y_T, we are interested in biologically plausible algorithms. Instead, we use a weighted Jacobi iteration where y_T is updated according to:

y_T ← (1 − η) y_T + η (W^{YX}_T x_T − W^{YY}_T y_T),   (13)

where η is the weight parameter, and W^{YX}_T and W^{YY}_T are normalized input-output and output-output covariances,

W^{YX}_{T,ik} = (Σ_{t=1}^{T−1} y_{t,i} x_{t,k}) / (αT + Σ_{t=1}^{T−1} y²_{t,i}),   W^{YY}_{T,i,j≠i} = (Σ_{t=1}^{T−1} y_{t,i} y_{t,j}) / (αT + Σ_{t=1}^{T−1} y²_{t,i}),   W^{YY}_{T,ii} = 0.   (14)

Iteration (13) can be implemented by the dynamics of neuronal activity in a single-layer network, Figure 1D. Then, W^{YX}_T and W^{YY}_T represent the weights of feedforward (x_t → y_t) and lateral (y_t → y_t) synaptic connections, respectively. Remarkably, synaptic weights appear in the online solution despite their absence in the optimization problem formulation (3). Previously, nonnormalized covariances have been used as state variables in an online dictionary learning algorithm [29].

To formulate a fully online algorithm, we rewrite (14) in a recursive form. This requires introducing a scalar variable D^Y_{T,i} representing the cumulative activity of a neuron i up to time T − 1, D^Y_{T,i} = αT + Σ_{t=1}^{T−1} y²_{t,i}. Then, at each data sample presentation, T, after the output y_T converges to a steady state, the following updates are performed:

D^Y_{T+1,i} ← D^Y_{T,i} + α + y²_{T,i},
W^{YX}_{T+1,ij} ← W^{YX}_{T,ij} + (y_{T,i} x_{T,j} − (α + y²_{T,i}) W^{YX}_{T,ij}) / D^Y_{T+1,i},
W^{YY}_{T+1,i,j≠i} ← W^{YY}_{T,ij} + (y_{T,i} y_{T,j} − (α + y²_{T,i}) W^{YY}_{T,ij}) / D^Y_{T+1,i}.   (15)

Hence, we arrive at a neural network algorithm that solves the optimization problem (10) for streaming data by alternating between two phases.
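One pass of these two phases can be transcribed as the following minimal NumPy sketch (an illustration under our own initialization choices, not the authors' code; the fixed iteration count stands in for a convergence test):

```python
import numpy as np

def online_soft_threshold(X, k, alpha, eta=0.1, n_iter=100):
    """Sketch of the online soft-thresholding network, eqs. (13)-(15).

    For each sample: phase 1 runs the Jacobi/neural dynamics (13) toward a
    fixed point; phase 2 applies the local Hebbian/anti-Hebbian updates (15).
    """
    n, T = X.shape
    rng = np.random.default_rng(0)
    W_yx = rng.standard_normal((k, n)) / np.sqrt(n)  # feedforward weights (random init)
    W_yy = np.zeros((k, k))                          # lateral weights, zero diagonal
    D = np.full(k, 10.0)                             # cumulative activities, 1/D = 0.1
    Y = np.zeros((k, T))
    for t in range(T):
        x, y = X[:, t], np.zeros(k)
        for _ in range(n_iter):                      # phase 1: neural dynamics (13)
            y = (1 - eta) * y + eta * (W_yx @ x - W_yy @ y)
        D += alpha + y**2                            # phase 2: local updates (15)
        W_yx += (np.outer(y, x) - (alpha + y**2)[:, None] * W_yx) / D[:, None]
        W_yy += (np.outer(y, y) - (alpha + y**2)[:, None] * W_yy) / D[:, None]
        np.fill_diagonal(W_yy, 0.0)                  # enforce W^{YY}_ii = 0
        Y[:, t] = y
    return Y, W_yx
```

Note that all updates are local: each synapse sees only its pre- and postsynaptic activities and the postsynaptic cumulative activity D.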
After a data sample is presented at time T, in the first phase of the algorithm (13), neuron activities are updated until convergence to a fixed point. In the second phase of the algorithm, synaptic weights are updated for feedforward connections according to a local Hebbian rule (15) and for lateral connections according to a local anti-Hebbian rule (due to the (−) sign in equation (13)). Interestingly, in the α = 0 limit, these updates have the same form as the single-neuron Oja rule [24, 2], except that the learning rate is not a free parameter but is determined by the cumulative neuronal activity 1/D^Y_{T+1,i} [4, 5].

3.2 Online hard-thresholding of eigenvalues

Consider the following minimax problem in the online setting, where we assume α > 0:

{y_T, z_T} ← arg min_{y_T} arg max_{z_T} ||X⊤X − Y⊤Y||²_F − ||Y⊤Y − Z⊤Z − αT I_T||²_F .   (16)

By keeping only those terms that depend on y_T or z_T and considering the large-T limit, we get the following objective:

L = 2αT ||y_T||² − 4 x_T⊤ (Σ_{t=1}^{T−1} x_t y_t⊤) y_T − 2 z_T⊤ (Σ_{t=1}^{T−1} z_t z_t⊤ + αT I_l) z_T + 4 y_T⊤ (Σ_{t=1}^{T−1} y_t z_t⊤) z_T.   (17)

Note that this objective is strongly convex in y_T and strongly concave in z_T. The solution of this minimax problem is the saddle-point of the objective function, which is found by setting the gradient of the objective with respect to {y_T, z_T} to zero [30]:

αT y_T = (Σ_{t=1}^{T−1} y_t x_t⊤) x_T − (Σ_{t=1}^{T−1} y_t z_t⊤) z_T,   (Σ_{t=1}^{T−1} z_t z_t⊤ + αT I_l) z_T = (Σ_{t=1}^{T−1} z_t y_t⊤) y_T.   (18)

To obtain a neurally plausible algorithm, we solve these equations by a weighted Jacobi iteration:

y_T ← (1 − η) y_T + η (W^{YX}_T x_T − W^{YZ}_T z_T),   z_T ← (1 − η) z_T + η (W^{ZY}_T y_T − W^{ZZ}_T z_T).   (19)

Here, similarly to (14), W_T are normalized covariances that can be updated recursively:

D^Y_{T+1,i} ← D^Y_{T,i} + α,   D^Z_{T+1,i} ← D^Z_{T,i} + α + z²_{T,i},
W^{YX}_{T+1,ij} ← W^{YX}_{T,ij} + (y_{T,i} x_{T,j} − α W^{YX}_{T,ij}) / D^Y_{T+1,i},
W^{YZ}_{T+1,ij} ← W^{YZ}_{T,ij} + (y_{T,i} z_{T,j} − α W^{YZ}_{T,ij}) / D^Y_{T+1,i},
W^{ZY}_{T+1,i,j} ← W^{ZY}_{T,ij} + (z_{T,i} y_{T,j} − (α + z²_{T,i}) W^{ZY}_{T,ij}) / D^Z_{T+1,i},
W^{ZZ}_{T+1,i,j≠i} ← W^{ZZ}_{T,ij} + (z_{T,i} z_{T,j} − (α + z²_{T,i}) W^{ZZ}_{T,ij}) / D^Z_{T+1,i},   W^{ZZ}_{T,ii} = 0.   (20)

Equations (19) and (20) define an online algorithm that can be naturally implemented by a neural network with two populations of neurons: principal neurons and interneurons, Figure 1E. Again, after each data sample presentation, T, the algorithm proceeds in two phases. First, (19) is iterated until convergence by the dynamics of neuronal activities.
Second, synaptic weights are updated according to local, anti-Hebbian (for synapses from interneurons) and Hebbian (for all other synapses) rules.

3.3 Online thresholding and equalization of eigenvalues

Consider the following minimax problem in the online setting, where we assume α > 0 and β > 0:

{y_T, z_T} ← arg min_{y_T} arg max_{z_T} Tr(−X⊤X Y⊤Y + Y⊤Y Z⊤Z + αT Y⊤Y − βT Z⊤Z).   (21)

By keeping only those terms that depend on y_T or z_T and considering the large-T limit, we get the following objective:

L = αT ||y_T||² − 2 x_T⊤ (Σ_{t=1}^{T−1} x_t y_t⊤) y_T − βT ||z_T||² + 2 y_T⊤ (Σ_{t=1}^{T−1} y_t z_t⊤) z_T.   (22)

This objective is strongly convex in y_T and strongly concave in z_T, and its saddle point is given by:

αT y_T = (Σ_{t=1}^{T−1} y_t x_t⊤) x_T − (Σ_{t=1}^{T−1} y_t z_t⊤) z_T,   βT z_T = (Σ_{t=1}^{T−1} z_t y_t⊤) y_T.   (23)

To obtain a neurally plausible algorithm, we solve these equations by a weighted Jacobi iteration:

y_T ← (1 − η) y_T + η (W^{YX}_T x_T − W^{YZ}_T z_T),   z_T ← (1 − η) z_T + η W^{ZY}_T y_T.   (24)

As before, W_T are normalized covariances which can be updated recursively:

D^Y_{T+1,i} ← D^Y_{T,i} + α,   D^Z_{T+1,i} ← D^Z_{T,i} + β,
W^{YX}_{T+1,ij} ← W^{YX}_{T,ij} + (y_{T,i} x_{T,j} − α W^{YX}_{T,ij}) / D^Y_{T+1,i},
W^{YZ}_{T+1,ij} ← W^{YZ}_{T,ij} + (y_{T,i} z_{T,j} − α W^{YZ}_{T,ij}) / D^Y_{T+1,i},
W^{ZY}_{T+1,i,j} ← W^{ZY}_{T,ij} + (z_{T,i} y_{T,j} − β W^{ZY}_{T,ij}) / D^Z_{T+1,i}.   (25)

Equations (24) and (25) define an online algorithm that can be naturally implemented by a neural network with principal neurons and interneurons. As before, after each data sample presentation at time T, the algorithm, first, iterates (24) by the dynamics of neuronal activities until convergence and, second, updates synaptic weights according to local anti-Hebbian (for synapses from interneurons) and Hebbian (25) (for all other synapses) rules.

While an algorithm similar to (24), (25), but with predetermined learning rates, was previously given in [15, 14], it has not been derived from an optimization problem. Plumbley's convergence analysis of his algorithm [14] suggests that at the fixed point of synaptic updates, the interneuron activity is also a projection onto the principal subspace. This result is a special case of our offline solution, (9), supported by the online numerical simulations (next Section).

Figure 2: Performance of the three neural networks: soft-thresholding (A), hard-thresholding (B), equalization after thresholding (C). Top: eigenvalue error; bottom: subspace error as a function of data presentations. Solid lines: means; shades: stds over 10 runs. Red: principal neurons; blue: interneurons. Dashed lines: best-fit power laws. For metric definitions see text.

4 Numerical simulations

Here, we evaluate the performance of the three online algorithms on a synthetic dataset, which is generated by an n = 64 dimensional colored Gaussian process with a specified covariance matrix. In this covariance matrix, the eigenvalues λ_{1..4} = {5, 4, 3, 2} and the remaining λ_{5..64} are chosen uniformly from the interval [0, 0.5]. Correlations are introduced in the covariance matrix by generating random orthonormal eigenvectors. For all three algorithms, we choose α = 1 and, for the equalizing algorithm, we choose β = 1.
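For concreteness, such a covariance matrix and the corresponding Gaussian inputs can be generated as in the following sketch (our own illustration, not the paper's code; the QR-based construction of random orthonormal eigenvectors and the small Cholesky jitter are implementation choices):

```python
import numpy as np

def make_synthetic_covariance(n=64, top=(5.0, 4.0, 3.0, 2.0), noise_max=0.5, seed=0):
    """Build an n x n covariance with a few informative eigenvalues and a
    bulk of small ones, mixed by random orthonormal eigenvectors."""
    rng = np.random.default_rng(seed)
    eigvals = np.concatenate([top, rng.uniform(0.0, noise_max, n - len(top))])
    # Random orthonormal eigenvectors via QR of a Gaussian matrix.
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return (Q * eigvals) @ Q.T        # Q diag(eigvals) Q^T

def sample_inputs(C, T, seed=1):
    """Draw T samples of the colored Gaussian process with covariance C."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(C + 1e-9 * np.eye(C.shape[0]))  # jitter for safety
    return L @ rng.standard_normal((C.shape[0], T))        # n x T data matrix
```

The four large eigenvalues sit well above the α = 1 threshold, while the bulk lies below it, so the adaptive networks should retain exactly four output dimensions.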
In all simulated networks, the number of principal neurons is k = 20 and, for the hard-thresholding and the equalizing algorithms, the number of interneurons is l = 5. Synaptic weight matrices were initialized randomly, and synaptic update learning rates, 1/D^Y_{0,i} and 1/D^Z_{0,i}, were initialized to 0.1. Network dynamics is run with a weight η = 0.1 until the relative change in y_T and z_T in one cycle is < 10⁻⁵.

To quantify the performance of these algorithms, we use two different metrics. The first metric, eigenvalue error, measures the deviation of output covariance eigenvalues from their optimal offline values given in Theorems 1, 2 and 3. The eigenvalue error at time T is calculated by summing squared differences between the eigenvalues of (1/T) YY⊤ or (1/T) ZZ⊤ and their optimal offline values at time T. The second metric, subspace error, quantifies the deviation of the learned subspace from the true principal subspace. To form such a metric, at each T, we calculate the linear transformations that map inputs, x_T, to outputs, y_T = F^{YX}_T x_T and z_T = F^{ZX}_T x_T, at the fixed points of the neural dynamics stages ((13), (19), (24)) of the three algorithms. Exact expressions for these matrices for all algorithms are given in the Supplementary Material.
Then, at each T, the deviation is ||F_{m,T} F_{m,T}⊤ − U^X_{m,T} U^X_{m,T}⊤||²_F, where F_{m,T} is an n × m matrix whose columns are the top m right singular vectors of F_T, F_{m,T} F_{m,T}⊤ is the projection matrix onto the subspace spanned by these singular vectors, U^X_{m,T} is an n × m matrix whose columns are the principal eigenvectors of the input covariance matrix C at time T, and U^X_{m,T} U^X_{m,T}⊤ is the projection matrix onto the principal subspace.

Further numerical simulations comparing the performance of the soft-thresholding algorithm with α = 0 with other neural principal subspace algorithms can be found in [24].

5 Discussion and conclusions

We developed a normative approach for dimensionality reduction by formulating three novel optimization problems, the solutions of which project the input onto its principal subspace and rescale the data by i) soft-thresholding, ii) hard-thresholding, iii) equalization after thresholding of the input eigenvalues. Remarkably, we found that these optimization problems can be solved online using biologically plausible neural circuits.
The dimensionality of neural activity is the number of either input covariance eigenvalues above the threshold, m (if m < k), or output neurons, k (if k ≤ m). The former case is ubiquitous in the analysis of experimental recordings, for a review see [31].

Interestingly, the division of neurons into two populations, principal neurons and interneurons, in the last two models has natural parallels in biological neural networks. In biology, principal neurons and interneurons are usually excitatory and inhibitory, respectively. However, we cannot make such an assignment in our theory, because the signs of neural activities, x_T and y_T, and, hence, the signs of synaptic weights, W, are unconstrained. Previously, interneurons were included in neural circuits [32], [33] outside of the normative approach.

Similarity matching in the offline setting has been used to analyze experimentally recorded neuron activity, lending support to our proposal. Semantically similar stimuli result in similar neural activity patterns in human (fMRI) and monkey (electrophysiology) IT cortices [34, 35]. In addition, [36] computed similarities among visual stimuli by matching them with the similarity among corresponding retinal activity patterns (using an information theoretic metric).

We see several possible extensions to the algorithms presented here: 1) Our online objective functions may be optimized by alternative algorithms, such as gradient descent, which map onto different circuit architectures and learning rules. Interestingly, gradient descent-ascent on convex-concave objectives has been previously related to the dynamics of principal neurons and interneurons [37]. 2) Inputs coming from a non-stationary distribution (with a time-varying covariance matrix) can be processed by algorithms derived from objective functions in which contributions from older data points are "forgotten", or "discounted".
Such discounting results in higher learning rates in the corresponding online algorithms, even at large T, giving them the ability to respond to variations in data statistics [24, 4]. Hence, the output dimensionality can track the number of input dimensions whose eigenvalues exceed the threshold. 3) In general, the output of our algorithms is not decorrelated. Such decorrelation can be achieved by including a correlation-penalizing term in our objective functions [38]. 4) Choosing the threshold parameter α requires a priori knowledge of the input statistics. A better solution, to be presented elsewhere, would be to let the network adjust the threshold adaptively, e.g., by filtering out all the eigenmodes with power below the mean eigenmode power. 5) Here, we focused on dimensionality reduction using only the spatial, as opposed to spatio-temporal, correlation structure.
We thank L. Greengard, A. Sengupta, A. Grinshpan, S. Wright, A. Barnett and E. Pnevmatikakis.

References

[1] David H Hubel. Eye, brain, and vision. Scientific American Library/Scientific American Books, 1995.
[2] E Oja. Simplified neuron model as a principal component analyzer. J Math Biol, 15(3):267–273, 1982.
[3] KI Diamantaras and SY Kung. Principal component neural networks: theory and applications. John Wiley & Sons, Inc., 1996.
[4] B Yang. Projection approximation subspace tracking. IEEE Trans. Signal Process., 43(1):95–107, 1995.
[5] T Hu, ZJ Towfic, C Pehlevan, A Genkin, and DB Chklovskii. A neuron as a signal processing device. In Asilomar Conference on Signals, Systems and Computers, pages 362–366. IEEE, 2013.
[6] E Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.
[7] R Arora, A Cotter, K Livescu, and N Srebro. Stochastic optimization for PCA and PLS. In Allerton Conf. on Communication, Control, and Computing, pages 861–868.
IEEE, 2012.
[8] J Goes, T Zhang, R Arora, and G Lerman. Robust stochastic principal component analysis. In Proc. 17th Int. Conf. on Artificial Intelligence and Statistics, pages 266–274, 2014.
[9] Todd K Leen. Dynamics of learning in recurrent feature-discovery networks. NIPS, 3, 1990.
[10] P Földiák. Adaptive network for optimal linear feature extraction. In Int. Joint Conf. on Neural Networks, pages 401–405. IEEE, 1989.
[11] TD Sanger. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459–473, 1989.
[12] J Rubner and P Tavan. A self-organizing network for principal-component analysis. EPL, 10:693, 1989.
[13] MD Plumbley. A Hebbian/anti-Hebbian network which optimizes information capacity by orthonormalizing the principal subspace. In Proc. 3rd Int. Conf. on Artificial Neural Networks, pages 86–90, 1993.
[14] MD Plumbley. A subspace network that determines its own output dimension. Tech. Rep., 1994.
[15] MD Plumbley. Information processing in negative feedback neural networks. Network-Comp Neural, 7(2):301–305, 1996.
[16] P Vertechi, W Brendel, and CK Machens. Unsupervised learning of an efficient short-term memory network. In NIPS, pages 3653–3661, 2014.
[17] BA Olshausen and DJ Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Res, 37(23):3311–3325, 1997.
[18] AA Koulakov and D Rinberg. Sparse incomplete representations: a potential role of olfactory granule cells. Neuron, 72(1):124–136, 2011.
[19] S Druckmann, T Hu, and DB Chklovskii. A mechanistic model of early sensory processing based on subtracting sparse representations. In NIPS, pages 1979–1987, 2012.
[20] AL Fairhall, GD Lewen, W Bialek, and RRR van Steveninck. Efficiency and ambiguity in an adaptive neural code.
Nature, 412(6849):787–792, 2001.
[21] SE Palmer, O Marre, MJ Berry, and W Bialek. Predictive information in a sensory population. PNAS, 112(22):6908–6913, 2015.
[22] E Doi, JL Gauthier, GD Field, J Shlens, et al. Efficient coding of spatial information in the primate retina. J Neurosci, 32(46):16256–16264, 2012.
[23] R Linsker. Self-organization in a perceptual network. Computer, 21(3):105–117, 1988.
[24] C Pehlevan, T Hu, and DB Chklovskii. A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Comput, 27:1461–1495, 2015.
[25] G Young and AS Householder. Discussion of a set of points in terms of their mutual distances. Psychometrika, 3(1):19–22, 1938.
[26] WS Torgerson. Multidimensional scaling: I. Theory and method. Psychometrika, 17(4):401–419, 1952.
[27] HG Barrow and JML Budd. Automatic gain control by a basic neural circuit. Artificial Neural Networks, 2:433–436, 1992.
[28] EJ Candès and B Recht. Exact matrix completion via convex optimization. Found Comput Math, 9(6):717–772, 2009.
[29] J Mairal, F Bach, J Ponce, and G Sapiro. Online learning for matrix factorization and sparse coding. JMLR, 11:19–60, 2010.
[30] S Boyd and L Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[31] P Gao and S Ganguli. On simplicity and complexity in the brave new world of large-scale neuroscience. Curr Opin Neurobiol, 32:148–155, 2015.
[32] M Zhu and CJ Rozell. Modeling inhibitory interneurons in efficient sensory coding models. PLoS Comput Biol, 11(7):e1004353, 2015.
[33] PD King, J Zylberberg, and MR DeWeese. Inhibitory interneurons decorrelate excitatory cells to drive sparse code formation in a spiking model of V1. J Neurosci, 33(13):5475–5485, 2013.
[34] N Kriegeskorte, M Mur, DA Ruff, R Kiani, et al.
Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6):1126–1141, 2008.
[35] R Kiani, H Esteky, K Mirpour, and K Tanaka. Object category structure in response patterns of neuronal population in monkey inferior temporal cortex. J Neurophysiol, 97(6):4296–4309, 2007.
[36] G Tkačik, E Granot-Atedgi, R Segev, and E Schneidman. Retinal metric: a stimulus distance measure derived from population neural responses. PRL, 110(5):058104, 2013.
[37] HS Seung, TJ Richardson, JC Lagarias, and JJ Hopfield. Minimax and Hamiltonian dynamics of excitatory-inhibitory networks. NIPS, 10:329–335, 1998.
[38] C Pehlevan and DB Chklovskii. Optimization theory of Hebbian/anti-Hebbian networks for PCA and whitening. In Allerton Conf. on Communication, Control, and Computing, 2015.