{"title": "A Similarity-preserving Network Trained on Transformed Images Recapitulates Salient Features of the Fly Motion Detection Circuit", "book": "Advances in Neural Information Processing Systems", "page_first": 14201, "page_last": 14212, "abstract": "Learning to detect content-independent transformations from data is one of the central problems in biological and artificial intelligence. An example of such problem is unsupervised learning of a visual motion detector from pairs of consecutive video frames. Rao and Ruderman formulated this problem in terms of learning infinitesimal transformation operators (Lie group generators) via minimizing image reconstruction error. Unfortunately, it is difficult to map their model onto a biologically plausible neural network (NN) with local learning rules. Here we propose a biologically plausible model of motion detection. We also adopt the transformation-operator approach but, instead of reconstruction-error minimization, start with a similarity-preserving objective function. An online algorithm that optimizes such an objective function naturally maps onto an NN with biologically plausible learning rules. The trained NN recapitulates major features of the well-studied motion detector in the fly. In particular, it is consistent with the experimental observation that local motion detectors combine information from at least three adjacent pixels, something that contradicts the celebrated Hassenstein-Reichardt model.", "full_text": "A Similarity-preserving Neural Network Trained on\nTransformed Images Recapitulates Salient Features\n\nof the Fly Motion Detection Circuit\n\nYanis Bahroun \u2020\n\u2020Flatiron Institute\n\nAnirvan M. Sengupta \u2020\u2021\nDmitri B. 
Chklovskii\u2020\u2217\n\u2021Rutgers University \u2217NYU Langone Medical Center\nanirvans@physics.rutgers.edu,\n\n{ybahroun,dchklovskii}@\ufb02atironinstitute.org,\n\nAbstract\n\nLearning to detect content-independent transformations from data is one of the\ncentral problems in biological and arti\ufb01cial intelligence. An example of such prob-\nlem is unsupervised learning of a visual motion detector from pairs of consecutive\nvideo frames. Rao and Ruderman formulated this problem in terms of learning\nin\ufb01nitesimal transformation operators (Lie group generators) via minimizing image\nreconstruction error. Unfortunately, it is dif\ufb01cult to map their model onto a biologi-\ncally plausible neural network (NN) with local learning rules. Here we propose a\nbiologically plausible model of motion detection. We also adopt the transformation-\noperator approach but, instead of reconstruction-error minimization, start with a\nsimilarity-preserving objective function. An online algorithm that optimizes such\nan objective function naturally maps onto an NN with biologically plausible learn-\ning rules. The trained NN recapitulates major features of the well-studied motion\ndetector in the \ufb02y. In particular, it is consistent with the experimental observation\nthat local motion detectors combine information from at least three adjacent pixels,\nsomething that contradicts the celebrated Hassenstein-Reichardt model.\n\n1\n\nIntroduction\n\nHumans can recognize objects, such as human faces, even when presented at various distances, from\nvarious angles and under various illumination conditions. Whereas the brain performs such a task\nalmost effortlessly, this is a challenging unsupervised learning problem. Because the number of\ntraining views for any given face is limited, such transformations must be learned from data com-\nprising different faces, or in a content-independent manner. 
Therefore, learning content-independent transformations plays a central role in reverse engineering the brain and building artificial intelligence.

Perhaps the simplest example of this task is learning a visual motion detector, which computes the optic flow from pairs of consecutive video frames regardless of their content. Motion detector learning was addressed by Rao and Ruderman [31], who formulated this problem as learning infinitesimal translation operators (or generators of the translation Lie group). They learned a motion detector by minimizing, for each pair of consecutive video frames, the squared mismatch between the observed variation in pixel intensity values and that predicted by the scaled infinitesimal translation operator. Whereas such an approach learns the operators and evaluates transformation magnitudes correctly [31, 22, 42], its biological implementation has been lacking (see below).

The non-biological nature of the neural networks (NNs) derived from the reconstruction approach has been previously encountered in the context of discovery of latent degrees of freedom, e.g., dimensionality reduction and sparse coding [8, 26]. When such NNs are derived from the reconstruction-error-minimization objective, they require non-local learning rules, which are not biologically plausible.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

To overcome this, [28, 29, 30] proposed deriving NNs from objectives that strive to preserve the similarity between pairs of inputs in the corresponding outputs.

Inspired by [29, 30], we propose a similarity-preserving objective for learning infinitesimal translation operators.
Instead of preserving the similarity of input pairs, as was done for dimensionality reduction NNs, our objective function preserves the similarity of input features formed by the outer product of the variation in pixel intensity and the pixel intensity, which is suggested by the translation-operator formalism. Such an objective is optimized by an online algorithm that maps onto a biologically plausible NN. After training the similarity-preserving NN on one-dimensional (1D) and two-dimensional (2D) translations, we obtain an NN that recapitulates salient features of the fly motion detection circuit. Thus, our main contribution is the derivation of a biologically plausible NN for learning content-independent transformations by similarity preservation of outer product input features.

1.1 Contrasting reconstruction and similarity-preservation NNs

We start by reviewing the NNs for discovery of latent degrees of freedom from principled objective functions. Although these NNs do not detect transformations, they provide a useful analogy that will be important for understanding our approach. First, we explain why the NNs derived from minimizing the reconstruction error lack biological plausibility. Then, we show how the NNs derived from similarity preservation objectives solve this problem.

To introduce our notation, the input to the NN is a set of vectors, x_t ∈ R^n, t = 1, ..., T, with components represented by the activity of n upstream neurons at time, t. In response, the NN outputs an activity vector, y_t ∈ R^m, t = 1, ..., T, where m is the number of output neurons.

The reconstruction approach starts with minimizing the squared reconstruction error:

\min_{W,\, y_{t=1\ldots T} \in \mathbb{R}^m} \sum_{t=1}^{T} \|x_t - W y_t\|^2 = \min_{W,\, y_{t=1\ldots T} \in \mathbb{R}^m} \sum_{t=1}^{T} \left[ \|x_t\|^2 - 2 x_t^\top W y_t + y_t^\top W^\top W y_t \right],   (1)

possibly subject to additional constraints on the latent variables y_t or on the weights W ∈ R^{n×m}. Without additional constraints, this objective is optimized offline by a projection onto the principal subspace of the input data, of which PCA is a special case [24].

In an online setting, the objective can be optimized by alternating minimization [26]. After the arrival of a data sample, x_t: firstly, the objective (1) is minimized with respect to the output, y_t, while the weights, W, are kept fixed; secondly, the weights are updated according to the following learning rule, derived by gradient descent with respect to W for fixed y_t:

\dot{y}_t = W_{t-1}^\top x_t - W_{t-1}^\top W_{t-1} y_t, \qquad W_t \leftarrow W_{t-1} + \eta\, (x_t - W_{t-1} y_t)\, y_t^\top.   (2)

In the NN implementations of the algorithm (2), the elements of the matrix W are represented by synaptic weights and the principal components by the activities of the output neurons, y_j, Fig. 1a [23].

However, implementing the update (2, right) in the single-layer NN architecture, Fig. 1a, requires non-local learning rules, making it biologically implausible. Indeed, the last term in (2, right) implies that updating the weight of a synapse requires the knowledge of the output activities of all other neurons, which are not available to the synapse.
Moreover, the matrix of lateral connection weights, −W_{t−1}^⊤ W_{t−1}, in the last term of (2, left) is computed as a Gramian of the feedforward weights, a non-local operation. This problem is not limited to PCA and arises in nonlinear NNs as well [26, 18].

Whereas NNs with local learning rules have been proposed [26], their two-layer feedback architecture is not consistent with most biological sensory systems, with the exception of olfaction [17]. Most importantly, such a feedback architecture seems inappropriate for motion detection, which requires speedy processing of streamed stimuli.

To address these difficulties, [29] derived NNs from similarity-preserving objectives. Such objectives require that similar input pairs, x_t and x_{t′}, evoke similar output pairs, y_t and y_{t′}. If the similarity of a pair of vectors is quantified by their scalar product, one such objective is similarity matching (SM):

\min_{\forall t \in \{1,\ldots,T\}:\ y_t \in \mathbb{R}^m} \frac{1}{2} \sum_{t,t'=1}^{T} (x_t \cdot x_{t'} - y_t \cdot y_{t'})^2.   (3)

This offline optimization problem is also solved by projecting the input data onto the principal subspace [44, 5, 19]. Remarkably, the optimization problem (3) can be converted algebraically to a tractable form by introducing variables W and M [30]:

\min_{\{y_t \in \mathbb{R}^m\}_{t=1}^{T}} \min_{W \in \mathbb{R}^{n \times m}} \max_{M \in \mathbb{R}^{m \times m}} \left[ \sum_{t=1}^{T} \left( -2 x_t^\top W y_t + y_t^\top M y_t \right) + T\, \mathrm{Tr}(W^\top W) - \frac{T}{2}\, \mathrm{Tr}(M^\top M) \right].   (4)

In the online setting, first, we minimize (4) with respect to the output variables, y_t, by gradient descent while keeping W and M fixed [29]:

\dot{y}_t = W^\top x_t - M y_t.   (5)

To find y_t after presenting the corresponding input, x_t, (5) is iterated until convergence. After the convergence of y_t, we update W and M by gradient descent and gradient ascent, respectively [29]:

W_{ij} \leftarrow W_{ij} + \eta\, (x_i y_j - W_{ij}), \qquad M_{ij} \leftarrow M_{ij} + \eta\, (y_i y_j - M_{ij}).   (6)

The algorithm (5), (6) can be implemented by a biologically plausible NN, Fig. 1b. As before, the activity (firing rate) of the upstream neurons encodes the input variables, x_t. The output variables, y_t, are computed by the dynamics of activity (5) in a single layer of neurons. The elements of the matrices W and M are represented by the weights of synapses in feedforward and lateral connections, respectively. The learning rules (6) are local, i.e., the weight update, ΔW_{ij}, for the synapse between the ith input neuron and the jth output neuron depends only on the activities, x_i, of the ith input neuron and, y_j, of the jth output neuron, and the synaptic weight. The learning rules (6) for the synaptic weights W and −M (here the minus sign indicates inhibitory synapses, see Eq. (5)) are Hebbian and anti-Hebbian, respectively.

Figure 1: Single-layer NNs performing online (a) reconstruction error minimization (1) [23, 26], (b) similarity matching (SM) (3) [29], and (c) nonnegative similarity matching (NSM) (7) [28].

We now compare the objective functions of the two approaches. After dropping invariant terms, the reconstructive objective function has the following interactions among input and output variables: −2x_t^⊤ W y_t + y_t^⊤ W^⊤ W y_t (Eq. 1). The SM approach leads to −2x_t^⊤ W y_t + y_t^⊤ M y_t (Eq. 4). The term linear in y_t, a cross-term between inputs and outputs, −2x_t^⊤ W y_t, is common to both approaches and is responsible for projecting the data onto the principal subspace via the feedforward connections in Fig. 1ab. The terms quadratic in y_t decorrelate different output channels via a competition implemented by the lateral connections in Fig. 1ab and are different in the two approaches.
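The neural dynamics (5), iterated to convergence for each input, can be checked numerically. Below is a minimal sketch, not the paper's code: W and M are random (with M assumed positive definite so the dynamics converge), and the settled output is compared with the fixed point M y = W^⊤ x.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 8, 3                                # input and output dimensions

x = rng.standard_normal(n)                 # input encoded by upstream activities
W = rng.standard_normal((n, m)) * 0.1      # feedforward synaptic weights
# Lateral weight matrix M, built to be positive definite so (5) converges
B = rng.standard_normal((m, m)) * 0.1
M = np.eye(m) + B @ B.T

# Euler-integrate  y' = W^T x - M y  until the output settles
y = np.zeros(m)
for _ in range(2000):
    y += 0.1 * (W.T @ x - M @ y)

# The fixed point of the dynamics solves  M y = W^T x
y_star = np.linalg.solve(M, W.T @ x)
assert np.allclose(y, y_star, atol=1e-6)
```

Because M is positive definite, the linear dynamics contract to this unique fixed point for a small enough step size.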
In particular, the inhibitory interaction between the neuronal activities y_j in the reconstruction approach depends upon W^⊤W, which is tied to the trained W in a non-local way. In contrast, in the SM approach the inhibitory interaction matrix M is learned for the y_j's via a local anti-Hebbian rule.

The SM approach can be applied to other computational tasks such as clustering and learning manifolds by tiling them with localized receptive fields [34]. To this end, we modify the offline optimization problem (3) by constraining the output, y_t ∈ R^m_+, which represents assignment indices (as, e.g., in the K-means algorithm):

\min_{\forall t \in \{1,\ldots,T\}:\ y_t \in \mathbb{R}^m_+} \frac{1}{2} \sum_{t,t'=1}^{T} (x_t \cdot x_{t'} - y_t \cdot y_{t'})^2.   (7)

Such nonnegative SM (NSM) (7), just like the optimization problem (3), can be converted algebraically to a tractable form by introducing similar variables W and M [28]. The synaptic weight update rules presented in (6) remain unchanged, and the only difference between the online solutions of (3) and (7) is the dynamics of the neurons, which, instead of being linear, are now rectifying, Fig. 1c.

In the next section, we will address transformation learning. Similarly, we will review the reconstruction approach, identify the key term analogous to the cross-term −2x_t^⊤ W y_t, and then alter the objective function so that the cross-term is preserved but the inhibition between output neurons can be learned in a biologically plausible manner.

2 Learning a motion detector using similarity preservation

Now, we focus on learning to detect transformations from pairs of consecutive video frames, x_t and x_{t+1}.
We start with the observation that much of the change in pixel intensities in consecutive frames arises from a translation of the image. For infinitesimal translations, the pixel intensity change is given by a linear operator (or matrix), denoted by A^a, multiplying the vector of pixel intensities, scaled by the magnitude of the translation, denoted by θ^a. Because for a 2D image multiple directions of translation are possible, there is a set of translation matrices with corresponding magnitudes. Our goal is to learn the translation matrices from pairs of consecutive video frames and to compute the magnitudes of the translations for each pair. Such a learning problem will reduce to the one discussed in the previous section, but performed on an unusual feature: the outer product of the pixel intensity and the variation of pixel intensity vectors.

2.1 Reconstruction-based transformation learning

We represent a video frame at time, t, by the pixel intensity vector, x_t, formed by reshaping an image matrix into a vector. For infinitesimal transformations, the difference, Δx_t, between two consecutive frames, x_t and x_{t+1}, is:

\Delta x_t = x_{t+1} - x_t = \sum_{a=1}^{K} \theta^a_t A^a x_t, \qquad \forall t \in \{1, \ldots, T-1\},   (8)

where, for each transformation, a ∈ {1, ..., K}, between the frames, t and t + 1, we define a transformation matrix A^a and a magnitude of transformation, θ^a_t.
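For intuition, (8) can be checked numerically in the special case of 1D translation, where the generator is a discrete spatial-derivative operator. The sketch below is illustrative only (our assumptions: a sinusoidal image, periodic boundaries, a central-difference derivative matrix):

```python
import numpy as np

n = 100
s = 2 * np.pi * np.arange(n) / n          # periodic 1D "retina" coordinates
theta = 0.05                              # small translation magnitude

x = np.sin(s)                             # frame at time t
x_shift = np.sin(s - theta)               # frame at time t+1: image moved by theta
dx = x_shift - x                          # observed pixel-intensity change

# Central-difference derivative matrix with periodic boundaries:
# (D x)_i = (x_{i+1} - x_{i-1}) / (2 ds)
ds = s[1] - s[0]
D = (np.roll(np.eye(n), 1, axis=1) - np.roll(np.eye(n), -1, axis=1)) / (2 * ds)

# Infinitesimal translation generator: A = -D, so that  dx ≈ theta * A x
A = -D
assert np.allclose(dx, theta * A @ x, atol=5e-3)
```

The residual is O(θ²) from the infinitesimal approximation plus O(ds²) from the finite difference, which is why a loose tolerance suffices.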
Whereas for image translation A^a is known to implement a spatial derivative operator, we are interested in learning A^a from data in an unsupervised fashion.

Previously, unsupervised algorithms for learning both A^a and θ^a_t were derived by minimizing, with respect to A^a and θ^a_t, the squared prediction error [31], where the optimal A^a and θ^a_t minimize the mismatch between the actual image and the one computed based on the learned model:

\sum_t \Big\|\Delta x_t - \sum_{a=1}^{K} \theta^a_t A^a x_t\Big\|^2 = \sum_t \Big[ \|\Delta x_t\|^2 - 2 \Delta x_t^\top \sum_{a=1}^{K} \theta^a_t A^a x_t + \Big\|\sum_{a=1}^{K} \theta^a_t A^a x_t\Big\|^2 \Big].   (9)

Whereas solving (9) in the offline setting leads to reasonable estimates of A^a and θ^a_t [31], it is rather non-biological. In a biologically plausible online setting, the data are streamed sequentially and θ^a_t (A^a) must be computed (updated) with minimum latency. The algorithm can store only the latest pair of images and a small number of variables, i.e., a sufficient statistic, but not any significant part of the dataset. Although a sketch of a neural architecture was proposed in [31], it is clear from Section 1.1 that, due to the quadratic term in the output, θ^a_t, a detailed architecture will suffer from the same non-locality as the reconstruction approach to latent variable NNs (1).

As the cross-term in (9) plays a key role in projecting the data (Section 1.1), we re-write it as follows:

\sum_t \Delta x_t^\top \sum_{a=1}^{K} \theta^a_t A^a x_t = \sum_{i,j,t,a} \Delta x_{t,i}\, \theta^a_t\, A^a_{i,j}\, x_{t,j} = \sum_t \Theta_t^\top A\, \mathrm{Vec}(\Delta x_t x_t^\top),   (10)

where we introduced A ∈ R^{K×n²}, the matrix whose rows represent the vectorized generators, A_{a,:} = Vec(A^a), ∀a ∈ {1, ..., K}, and Θ_t = (θ^{a=\{1\ldots K\}}_t)^⊤, the vector whose components represent the magnitude of the transformation, a, at time, t.

Eq. (10) shows that the cross-term favors aligning A_{a,:} in the direction of the outer product of the pixel intensity variation and pixel intensity vectors, Vec(Δx x^⊤). Although central to the learning of transformations in (9), the outer product of the pixel intensity variation and pixel intensity vectors was not explicitly highlighted in the transformation-operator learning approach [31, 10, 22].

2.2 Why the outer product of pixel intensity variation and pixel intensity vectors?

Here, we provide intuitions for using outer products in the content-independent detection of translations. For simplicity, we consider 1D motion in a 1D world. Motion detection relies on a correspondence between consecutive video frames, x_t and x_{t+1}.

One may think that such correspondences can be detected by a neuron adding up the responses of displaced filters applied to x_t and x_{t+1}. While possible in principle, such a neuron's response would be highly dependent on the image content [20, 21]. This is because summing the outputs of the two filters amounts to applying an OR operation to them, which does not selectively respond to translation.

To avoid such dependence on the content, [20] proposed to invoke an AND operation, which is implemented by multiplication. Specifically, consider forming an outer product of x_t and x_{t+1} and summing its values along each diagonal. If the image is static, then the main diagonal produces the highest correlation. If the image is shifted by one pixel between the frames, then the first sub(super)-diagonal yields the highest correlation. If the image is shifted by two pixels, the second sub(super)-diagonal yields the highest correlation, and so on.
Then, if the sum over each diagonal is represented by a different neuron, the velocity of the object is given by the most active neuron. Other models relying on multiplication are "mapping units" [15], "dynamic mappings" [41], and other bilinear models [25].

Our algorithm for motion detection adopts multiplication to detect correspondences but computes an outer product between the vectors of pixel intensity, x_t, and pixel intensity variation, Δx_t. Compared to the approach in [20], one advantage of our approach is that we do not require separate neurons to represent different velocities but rather have a single output neuron (for each direction of motion), whose activity increases with velocity. Previously, a similar outer product feature was proposed in [3] (for a formal connection, see Supplement A). Another advantage of our approach is the derivation from the principled SM objective motivated by the transformation-operator formalism.

2.3 A novel similarity matching objective for learning transformations

Having identified the cross-term in (9) analogous to that in (1), we propose a novel objective function where the inhibition between output neurons is learned in a biologically plausible manner. By analogy with Eq. (3), we substitute the reconstruction-error-minimization objective with an SM objective for transformation learning. We denote the vectorized outer product between Δx_t and x_t as χ_t ∈ R^{n²}:

\chi_{t,\alpha} = (\Delta x_t x_t^\top)_{i,j}, \quad \text{with } \alpha = (i-1)n + j.   (11)

We concatenate these vectors into a matrix, χ ≡ [χ_1, ..., χ_T], as well as the transformation magnitude vectors, Θ ≡ [Θ_1, ..., Θ_T].
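The feature (11) and the cross-term (10) can be made concrete in a few lines. An illustrative sketch (not the paper's code): when the rows of a filter matrix equal the vectorized operators Vec(A^a), filtering χ_t reproduces Δx_t^⊤ A^a x_t for each a.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 5, 2

x = rng.standard_normal(n)             # pixel intensities at time t
dx = rng.standard_normal(n)            # pixel intensity variation

# Outer-product feature (11): n^2 components, alpha = (i-1)n + j
chi = np.outer(dx, x).ravel()

# K toy transformation operators; matrix A holds their vectorized rows
ops = [rng.standard_normal((n, n)) for _ in range(K)]
A = np.stack([op.ravel() for op in ops])      # shape (K, n^2)

# Cross-term (10): projecting chi through A equals dx^T A^a x for each a
proj = A @ chi
direct = np.array([dx @ op @ x for op in ops])
assert np.allclose(proj, direct)
```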
Using these notations, we introduce the following SM objective:

\min_{\Theta \in \mathbb{R}^{K \times T}} \frac{1}{T^2} \|\chi^\top \chi - \Theta^\top \Theta\|_F^2 = \min_{\Theta_1,\ldots,\Theta_T} \frac{1}{T^2} \sum_{t=1}^{T} \sum_{t'=1}^{T} (\chi_t^\top \chi_{t'} - \Theta_t^\top \Theta_{t'})^2.   (12)

To reconcile (9) and (12), we first show that the cross-terms are the same by introducing the following optimization over a matrix, W ∈ R^{K×n²}:

\frac{1}{T^2} \sum_{t,t'=1}^{T} \Theta_t^\top \Theta_{t'}\, \chi_t^\top \chi_{t'} = \frac{1}{T^2} \sum_{t=1}^{T} \Theta_t^\top \Big[ \sum_{t'=1}^{T} \Theta_{t'} \chi_{t'}^\top \Big] \chi_t = \max_{W} \frac{2}{T} \sum_{t=1}^{T} \Theta_t^\top W \chi_t - \mathrm{Tr}\, W^\top W.   (13)

Therefore, the SM approach yields the cross-term, Θ_t^⊤ W χ_t, which is the same as Θ_t^⊤ A Vec(Δx_t x_t^⊤) in [31]. We can thus identify the rows W_{a,:} with the vectorized transformation matrices, Vec(A^a), Fig. 2a. Solutions of (12) are known to be projections onto the principal subspace of χ, the vectorized outer product of Δx_t and x_t, which are equivalent, up to an orthogonal rotation, to PCA.

If we constrain the output to be nonnegative (NSM):

\min_{\Theta \in \mathbb{R}^{K \times T}_+} \|\chi^\top \chi - \Theta^\top \Theta\|_F^2,   (14)

then, by analogy with Sec.
1.1 [28], this objective function clusters data or tiles data manifolds [34].

2.4 Online algorithm and NN

To derive online learning algorithms for (12) and (14), we follow the similarity matching approach [29]. The optimality condition of each online problem is given by [28, 29] for SM and NSM, respectively:

\text{SM: } \Theta_t^* = W \chi_t - M \Theta_t^*; \qquad \text{NSM: } \Theta_t^* = \max(W \chi_t - M \Theta_t^*,\, 0),   (15)

with W and M found using recursive formulations, ∀a ∈ {1, ..., K}, ∀α ∈ {1, ..., n²}:

W_{a\alpha} \leftarrow W_{a\alpha} + \Theta_{t-1,a} \left( \chi_{t-1,\alpha} - W_{a\alpha} \Theta_{t-1,a} \right) \big/ \hat{\Theta}_{t,a},   (16)

M_{aa' \neq a} \leftarrow M_{aa'} + \Theta_{t-1,a} \left( \Theta_{t-1,a'} - M_{aa'} \Theta_{t-1,a} \right) \big/ \hat{\Theta}_{t,a},   (17)

\hat{\Theta}_{t,a} = \hat{\Theta}_{t-1,a} + (\Theta_{t-1,a})^2.   (18)

This algorithm is similar to the model proposed in [29], but it is more difficult to implement in a biologically plausible way. This is because χ_t is an outer product of the input data and cannot be identified with the inputs to a single neuron. To implement this algorithm, we break up W into rank-1 components, each of which is computed in a separate neuron, such that:

\Theta_{t,a}^* = \sum_i \Delta x_{t,i} \sum_j W_{ija} x_{t,j} - \sum_{a'} M_{aa'} \Theta_{t,a'}^*.   (19)

Each element of the tensor, W_{ija}, will be encoded in the weight of a feedforward synapse from the j-th pixel onto the i-th neuron encoding the a-th transformation (see Fig. 2a). Biologically plausible implementations of this algorithm are given in Section 3.

2.5 Numerical experiments

Here, we implement the biologically plausible algorithms presented in the previous subsection and report the learned transformation matrices.
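A minimal numerical sketch of the NSM variant of the online updates (15)-(18) is given below. The learning-rate schedule 1/Θ̂ follows (16)-(18); everything else (the smoothed random 1D world, whole-pixel rather than subpixel shifts, no whitening, iteration counts) is our simplifying assumption, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(2)
n, K, T = 5, 2, 4000

# Smoothed random 1D "world" viewed through an n-pixel sliding window
world = np.convolve(rng.standard_normal(20 * T), np.ones(8) / 8, mode="same")

W = rng.standard_normal((K, n * n)) * 0.1   # feedforward weights, rows ~ Vec(A^a)
M = np.zeros((K, K))                        # lateral (anti-Hebbian) weights
theta_hat = np.ones(K)                      # cumulative squared activity (18)

pos = 10 * T                                # start in the middle of the world
for t in range(T):
    shift = rng.choice([-1, 1])             # direction of self-motion
    x = world[pos:pos + n]
    pos += shift
    x1 = world[pos:pos + n]
    chi = np.outer(x1 - x, x).ravel()       # outer-product feature (11)

    # Output: iterate the rectified fixed-point equation (15), NSM variant
    theta = np.zeros(K)
    for _ in range(50):
        theta = np.maximum(W @ chi - M @ theta, 0.0)

    # Local updates (16)-(18) with learning rate 1 / theta_hat
    theta_hat += theta ** 2
    W += (np.outer(theta, chi) - (theta ** 2)[:, None] * W) / theta_hat[:, None]
    M += (np.outer(theta, theta) - (theta ** 2)[:, None] * M) / theta_hat[:, None]
    np.fill_diagonal(M, 0.0)                # only a' != a entries are learned (17)

assert np.all(np.isfinite(W)) and np.all(np.isfinite(M))
```

Reshaping each row of W into an n × n matrix then gives the learned operator candidates discussed in the experiments.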
To validate the results of SM and NSM applied to the outer-product feature, χ, we compare them with those of PCA and K-means, respectively, also applied to χ, as formally defined in Supplement B. These standard but biologically implausible algorithms were chosen because they perform similar computations in the context of latent variable discovery.

The 1D visual world is represented by a continuous profile of light intensity as a function of one coordinate. A 1D eye measures light intensity in a 1D window consisting of n discrete pixels. To imitate self-motion, such a window can move left and right by a fraction of a pixel at each time step. For the purpose of evaluating the proposed algorithms and derived NNs, we generated artificial training data by subjecting a randomly generated 1D image (Gaussian, exponentially correlated noise) to known horizontal subpixel translations. Then, we spatially whitened the discrete images using the ZCA whitening technique [2].

We start by learning K = 2 transformation matrices using each algorithm. After the rows of the synaptic weights, W, are reshaped into n × n matrices, they can be identified with the transformation operators, A. Then the magnitude of the transformation is given by Δx_t^⊤ A x_t, Fig. 2a.

SM and PCA. The filters learned from SM are shown in Fig. 2c and those learned from PCA in Fig. 2e. The left panels of Fig. 2ce represent the singular vectors capturing the maximum variance. They replicate the known operator of translation, a spatial derivative, found in [31]. The right panels of Fig. 2ce show the singular vectors capturing the second largest variance, which do not account for a known transformation matrix. In the absence of a nonnegativity constraint, a reversal of translation is represented by a change of sign of the transformation magnitude.

NSM and K-means.
The filters learned by NSM are shown in Fig. 2d and those learned by K-means in Fig. 2f. They are similar to the first singular vector learned by SM, PCA, and [31]. However, in NSM and K-means the output must be nonnegative, so representing the opposite directions of motion requires two filters, which are sign inversions of each other.

For the various models, the rows of the learned operators, A^a, are identical except for a shift, i.e., the same operator is applied at each image location. As expected, the learned filters compute a spatial derivative of the pixel intensity, red rectangle in Fig. 2a. The learned weights can be approximated by the filter keeping only the three central pixels, Fig. 2b, which we name the cartoon model of the motion detector. It computes a correlation between the spatial derivative, denoted by Δ_i x_{t,i}, and the temporal derivative, Δ_t x_t. Such an algorithm may be viewed as a Bayesian optimal estimate of velocity in the low-SNR regime (Supplement C) appropriate for the fly visual system [36].

Figure 2: The rows of the synaptic weight matrix W_{a,:} are reshaped into n × n transformation matrices A^a. Then, the magnitude of the transformation is Δx_t^⊤ A^a x_t. Such a computation can be approximated by the cartoon model (b). Synaptic weights learned from 1D translation on a vector of size 5 pixels by (c) SM, (d) NSM, (e) PCA (decreasing eigenvalues), and (f) K-means.

The results presented in Fig. 2 were obtained with n = 5 pixels, but the same structure was observed with larger values of n.
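The cartoon model, correlating the temporal derivative at each pixel with the spatial difference of its neighbors, can be written directly. An illustrative sketch with our own toy stimulus and periodic boundaries (not the paper's code):

```python
import numpy as np

def cartoon_detector(x_prev, x_next):
    """Sum over pixels of (temporal derivative) x (3-pixel spatial derivative)."""
    dt = x_next - x_prev                               # temporal derivative
    dspace = np.roll(x_prev, 1) - np.roll(x_prev, -1)  # x_{i-1} - x_{i+1}
    return np.sum(dt * dspace)

rng = np.random.default_rng(3)
x = rng.standard_normal(200)                  # whitened 1D image

right = cartoon_detector(x, np.roll(x, 1))    # image shifted right by one pixel
left = cartoon_detector(x, np.roll(x, -1))    # image shifted left by one pixel

# Opposite directions of motion produce responses of opposite sign
assert right > 0 > left
```

A single signed output per direction of motion is exactly the behavior described for the learned detector; under the NSM constraint, the two signs would instead be carried by two rectified neurons.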
Similar results were also obtained with models trained on moving periodic sine-wave gratings, often used in fly experiments.

We also trained our NN on motion in the four cardinal directions and on planar rotations of two-dimensional images, as was done in [31], and showed that our model can learn such transformations. By using NSM, we can again distinguish between motion in the four cardinal directions, and between clockwise and counterclockwise rotations, which was not possible with prior approaches (see Supplement D).

3 Learning transformations in a biologically plausible way

In this section, we propose two biologically plausible implementations of a motion detector by taking advantage of the decomposition of the outer product feature matrix into single-row components (19). The first implementation models computation in a mammalian neuron, such as a cortical pyramidal cell. The second models computation in a Drosophila motion-detecting neuron T4 (the same arguments apply to T5). In the following, for simplicity, we focus on the cartoon model, Fig. 2b.

3.1 Multi-compartment neuron model

Mammalian neurons can implement the motion computation by representing each row of the transformation matrix, W, in a different dendritic branch originating from the soma (cell body). Each such branch forms a compartment with its own membrane potential [14, 37], allowing it to perform its own non-linear computation, the results of which are then summed in the soma. Each dendritic compartment receives the pixel intensity variation from only one pixel via a proximal shunting inhibitory synapse [40, 16] and the pixel intensity vector via more distal synapses, Fig. 3a. We assume that the conductance of the shunting inhibitory synapse decreases with the variation in pixel intensity. The weights of the more distal synapses represent the corresponding row of the outer product feature matrix.
When the variation in pixel intensity is low, the shunting inhibition vetoes other post-synaptic currents. When the variation in pixel intensity is high, the shunting is absent and the remaining post-synaptic currents flow into the soma. A formal analysis shows that this operation can be viewed as a multiplication [40, 16]. Different compartments compute such products for the variation in intensity of different pixels, after which these products are summed in the soma (19), Fig. 3a.

The weight of a distal synapse is updated using a Hebbian learning rule applied to the corresponding pixel intensity, available pre-synaptically, and the transformation magnitude, modulated by the shunting inhibition representing the pixel intensity variation, Fig. 3b. The transformation magnitude is computed in the soma and reaches the distal synapses via backpropagating dendritic spikes [38]. Such a backpropagating signal is modulated by the shunting inhibition, thus implementing the multiplication of the transformation magnitude and the pixel intensity variation (16), Fig. 3b. Competition between the neurons detecting motion in different directions is mediated by inhibitory interneurons [27].

Figure 3: A multi-compartment model of a mammalian neuron. (a) Each dendrite multiplies the pixel intensity variation signaled by the shunting inhibitory synapse and the weighted vector of pixel intensities carried by more distal synapses. Products computed in each dendrite are summed in the soma to yield the transformation magnitude encoded in the spike rate.
(b) Synaptic weights are updated by the product of the corresponding pre-synaptic pixel intensities and the backpropagating spikes modulated by the shunting inhibition.

3.2 A learned similarity-preserving NN replicates the structure of the fly motion detector

The Drosophila visual system comprises retinotopically organized layers of neurons, meaning that nearby columns process photoreceptor signals (identified with x_i below) from nearby locations in the visual field. Unlike the implementation in the previous subsection, motion computation is performed across multiple neurons. The local motion signal is first computed in each of the hundreds of T4 neurons that jointly tile the visual field. Their outputs are integrated by the downstream giant tangential neurons. Each T4 neuron receives the light intensity variation of only one pixel via synapses from neurons Mi1 and Tm3, and the light intensities of nearby pixels via synapses from neurons Mi4 and Mi9 (with opposite signs) [39], Fig. 3c. Therefore, in each T4 neuron Δx is a scalar, W is a vector, and the local motion velocity can be computed by a single-compartment neuron. If the weights of synapses from Mi4 and Mi9 of different columns represent W, then the multiplication of Δx and Wx can be accomplished as before using shunting inhibition. Competition among T4s detecting different directions of motion is implemented by inhibitory lateral connections.

Figure 4: An NN trained on 1D translations recapitulates the motion detection circuit in Drosophila. (a) Each motion-detecting neuron receives a pixel intensity variation signal from pixel i and pixel intensity signals from at least pixels i − 1 and i + 1 (with opposite signs).
(b) In Drosophila, each retinotopically organized column contains neurons Mi1/Tm3, Mi9, and Mi4 [39], which respond to light intensity in the corresponding pixel according to the impulse responses shown in (c) (from [1]). Each T4 neuron selectively samples different inputs from different columns [39]: it receives the light intensity variation via Mi1/Tm3 and the light intensity via Mi4 and Mi9 (with opposite signs).
Our model correlates inputs from at least three pixels, in agreement with recent experimental results [39, 1, 11, 33], rather than from two as in the celebrated Hassenstein-Reichardt detector (HRD) [32]. In the fly, the outputs of T4s are summed over the visual field in downstream neurons. The summed output of our detectors is equivalent to the summed output of HRDs and thus consistent with multiple prior behavioral experiments and physiological recordings from downstream neurons (see Supplement E). There is experimental evidence both for nonlinear interactions of T4 inputs [33, 13], supporting a multiplicative model, and for linear summation of inputs [11, 43].
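The summed-output equivalence (detailed in Supplement E) is easy to verify numerically. The sketch below assumes the cartoon form Δx_i (x_{i−1} − x_{i+1}) for each local detector, a unit frame delay in the HRD, periodic boundary conditions, and random frames; these simplifications are ours, for illustration only:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 32
x_prev = rng.standard_normal(n)   # frame at time t-1
x_now = rng.standard_normal(n)    # frame at time t
dx = x_now - x_prev               # discrete time derivative per pixel

# Cartoon three-pixel detector at pixel i: dx_i * (x_{i-1} - x_{i+1}),
# with periodic boundary conditions (np.roll).
three_pixel = dx * (np.roll(x_now, 1) - np.roll(x_now, -1))

# Opponent Hassenstein-Reichardt detector with a unit delay:
# x_i(t-1) * x_{i+1}(t) - x_{i+1}(t-1) * x_i(t)
hrd = x_prev * np.roll(x_now, -1) - np.roll(x_prev, -1) * x_now

# Individual local detectors disagree...
assert not np.allclose(three_pixel, hrd)
# ...but their outputs summed over the visual field coincide.
assert np.isclose(three_pixel.sum(), hrd.sum())
```

The cancellation works because the instantaneous x(t)·x(t) cross-terms of the three-pixel detector telescope away under the sum, leaving exactly the opponent delay-and-correlate terms of the HRD.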
Even if summation is linear, the neuronal output nonlinearity can generate multiplicative terms for outer product computation.

[Fig. 3, panel detail: neural activity computed according to (19), Θ*_{t,a} = Σ_i Δx_{t,i} Σ_j W_{ija} x_{t,j} − Σ_{a'} M_{aa'} Θ*_{t,a'}; synaptic updates by dendritic backpropagation according to (16–18).]

The main difference between our learned model (Fig. 2a) and most published models is that the motion detector is learned from data using biologically plausible learning rules in an unsupervised setting. Thus, our model can generate somewhat different receptive fields for different natural image statistics, such as those in the ON and OFF pathways, potentially accounting for minor differences reported between the T4 and T5 circuits [39].
A recent model [33] also combines three differently preprocessed inputs. Unlike our model, which relies on a derivative computation in the middle pixel, the model in [33] is composed of a shared non-delay line flanked by two delay lines.
As shown in Supplement E, after integration over the visual field, the global signal from our cartoon model of Fig. 2b is equivalent to that from the HRD. The same observation has been made for the model in [33]. Yet, the predicted output of a single motion detector in our model differs from both the HRD and [33].

3.3 Experimentally established properties of the global motion detector

Until recently, most experiments confirmed the predictions of the HRD model.
However, almost all of these experiments measured either the activity of downstream giant neurons integrating T4 output over the whole visual field or the behavioral response generated by these giant neurons. Because, after integration over the visual field, the global signal from our cartoon model of Fig. 2b is equivalent to that from the HRD, various experimental confirmations of the HRD predictions are inherited by our model. Below, we list some of the confirmed predictions.

Dependence of the output on the image contrast. Because the HRD multiplies signals from the two photoreceptors, its output should be quadratic in the stimulus contrast. Similarly, in our model, the output should be proportional to contrast squared because it is given by the covariance between the time and space derivatives of the light intensity (Supplement C), each proportional to contrast. Note that this prediction differs from [31], whose output is contrast-independent. Several experiments have confirmed these predictions in the low-SNR regime [12, 7, 9, 35, 4]. Of course, the output cannot grow unabated and, in the high-SNR regime, the output becomes contrast-independent. A likely cause is the signal normalization between the photoreceptors and T4 [12].

Oscillations in the motion signal locked to the visual stimulus. In accordance with the oscillating output of the HRD in response to a moving periodic stimulus, physiological recordings have reported such phase-locked oscillations [6]. Our model reproduces such oscillations.

Dependence of the peak velocity on the wavelength. In our model, just like in the HRD, the output first increases with the velocity of the visual stimulus and then decreases.
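Both the contrast-squared dependence and this rise-and-fall velocity tuning can be reproduced with the cartoon detector of Fig. 2b. A minimal sketch, assuming a hypothetical ring of pixels, a unit time step, and a drifting sinusoid x_i(t) = c sin(ki − ωt):

```python
import numpy as np

def summed_output(n, k, omega, c, t=7.0):
    """Cartoon detector dx_i * (x_{i-1} - x_{i+1}) summed over a ring of n
    pixels, for a drifting sinusoid x_i(t) = c*sin(k*i - omega*t), unit delay."""
    i = np.arange(n)
    x_now = c * np.sin(k * i - omega * t)
    x_prev = c * np.sin(k * i - omega * (t - 1))
    dx = x_now - x_prev
    return np.sum(dx * (np.roll(x_now, 1) - np.roll(x_now, -1)))

n, k = 60, 2 * np.pi * 3 / 60     # 3 spatial periods on the ring

# Quadratic in contrast: doubling c quadruples the output.
assert np.isclose(summed_output(n, k, 1.0, 0.2), 4 * summed_output(n, k, 1.0, 0.1))

# The output rises and then falls with the temporal frequency omega
# (hence with the velocity v = omega / k).
low, peak, high = (summed_output(n, k, w, 1.0) for w in (0.3, np.pi / 2, 2.8))
assert peak > low and peak > high
```

For this stimulus the summed output evaluates to n c² sin(ω) sin(k), so the optimum sits at a fixed temporal frequency, consistent with an optimal velocity proportional to the spatial wavelength.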
The optimal velocity is proportional to the spatial wavelength of the visual stimulus because the temporal frequency of the optimal stimulus is then a constant, given by the inverse of the time delay in one of the arms.

In conclusion, we learn transformation matrices using a similarity-preserving approach, leading to a biologically plausible model of a motion detector. Generalizing our work to the learning of other content-preserving transformations will open a path towards principled, biologically plausible object recognition.

Acknowledgments

We are grateful to P. Gunn and A. Genkin for discussion and comments on this manuscript. We thank D. Clark, J. Fitzgerald, E. Hunsicker, and B. Olshausen for helpful discussions.

References

[1] Alexander Arenz, Michael S Drews, Florian G Richter, Georg Ammer, and Alexander Borst. The temporal tuning of the Drosophila motion detectors is determined by the dynamics of their input elements. Current Biology, 27(7):929–944, 2017.

[2] Anthony J Bell and Terrence J Sejnowski. The “independent components” of natural scenes are edge filters. Vision Research, 37(23):3327–3338, 1997.

[3] Matthias Bethge, Sebastian Gerwinn, and Jakob H Macke. Unsupervised learning of a steerable basis for invariant image representations. In Human Vision and Electronic Imaging XII, volume 6492, page 64920C. International Society for Optics and Photonics, 2007.

[4] Erich Buchner. Elementary movement detectors in an insect visual system. Biological Cybernetics, 24(2):85–101, 1976.

[5] Trevor F Cox and Michael AA Cox. Multidimensional Scaling. Chapman and Hall/CRC, 2000.

[6] Martin Egelhaaf and Alexander Borst. A look into the cockpit of the fly: visual orientation, algorithms, and identified neurons. The Journal of Neuroscience, 13(11), 1993.

[7] G Fermi and W Reichardt. Optomotor reactions of the fly, Musca domestica.
Dependence of the reaction on wavelength, velocity, contrast and median brightness of periodically moved stimulus patterns. Kybernetik, 2:15–28, 1963.

[8] Peter Földiák. Learning invariance from transformation sequences. Neural Computation, 3(2):194–200, 1991.

[9] KG Götz. Optomotor studies of the visual system of several eye mutants of the fruit fly Drosophila. Kybernetik, 2(2):77, 1964.

[10] David B Grimes and Rajesh PN Rao. Bilinear sparse coding for invariant vision. Neural Computation, 17(1):47–73, 2005.

[11] Eyal Gruntman, Sandro Romani, and Michael B Reiser. Simple integration of fast excitation and offset, delayed inhibition computes directional selectivity in Drosophila. Nature Neuroscience, 21(2):250, 2018.

[12] Juergen Haag, Winfried Denk, and Alexander Borst. Fly motion vision is based on Reichardt detectors regardless of the signal-to-noise ratio. Proceedings of the National Academy of Sciences, 101(46):16333–16338, 2004.

[13] Juergen Haag, Abhishek Mishra, and Alexander Borst. A common directional tuning mechanism of Drosophila motion-sensing neurons in the ON and in the OFF pathway. eLife, 6:e29044, 2017.

[14] Michael Häusser and Bartlett Mel. Dendrites: bug or feature? Current Opinion in Neurobiology, 13(3):372–383, 2003.

[15] Geoffrey E Hinton. A parallel computation that assigns canonical object-based frames of reference. In Proceedings of the 7th International Joint Conference on Artificial Intelligence - Volume 2, pages 683–685. Morgan Kaufmann Publishers Inc., 1981.

[16] Christof Koch, Tomaso Poggio, and Vincent Torre. Nonlinear interactions in a dendritic tree: localization, timing, and role in information processing. Proceedings of the National Academy of Sciences, 80(9):2799–2802, 1983.

[17] Alexei A Koulakov and Dmitry Rinberg. Sparse incomplete representations: a potential role of olfactory granule cells.
Neuron, 72(1):124–136, 2011.

[18] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.

[19] KV Mardia, JT Kent, and JM Bibby. Multivariate Analysis. Academic Press, 1980.

[20] Roland Memisevic. Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces. arXiv preprint arXiv:1110.0107, 2011.

[21] Roland Memisevic. Learning to relate images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1829–1846, 2013.

[22] Xu Miao and Rajesh PN Rao. Learning the Lie groups of visual invariance. Neural Computation, 19(10):2665–2693, 2007.

[23] E Oja. Principal components, minor components, and linear neural networks. Neural Networks, 5(6):927–935, 1992.

[24] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of Mathematical Biology, 15(3):267–273, 1982.

[25] Bruno A Olshausen, Charles Cadieu, Jack Culpepper, and David K Warland. Bilinear models of natural images. In Human Vision and Electronic Imaging XII, volume 6492, page 649206. International Society for Optics and Photonics, 2007.

[26] Bruno A Olshausen and David J Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381:607–609, 1996.

[27] Cengiz Pehlevan and Dmitri Chklovskii. A normative theory of adaptive dimensionality reduction in neural networks. In Advances in Neural Information Processing Systems, pages 2269–2277, 2015.

[28] Cengiz Pehlevan and Dmitri B Chklovskii. A Hebbian/anti-Hebbian network derived from online non-negative matrix factorization can cluster and discover sparse features. In 2014 48th Asilomar Conference on Signals, Systems and Computers, pages 769–775. IEEE, 2014.

[29] Cengiz Pehlevan, Tao Hu, and Dmitri B Chklovskii.
A Hebbian/anti-Hebbian neural network for linear subspace learning: A derivation from multidimensional scaling of streaming data. Neural Computation, 27(7):1461–1495, 2015.

[30] Cengiz Pehlevan, Anirvan M Sengupta, and Dmitri B Chklovskii. Why do similarity matching objectives lead to Hebbian/anti-Hebbian networks? Neural Computation, 30(1):84–124, 2018.

[31] Rajesh PN Rao and Daniel L Ruderman. Learning Lie groups for invariant visual perception. In Advances in Neural Information Processing Systems, pages 810–816, 1999.

[32] Werner Reichardt. Autocorrelation, a principle for the evaluation of sensory information by the central nervous system. Sensory Communication, pages 303–317, 1961.

[33] Emilio Salazar-Gatzimas, Margarida Agrochao, James E Fitzgerald, and Damon A Clark. The neuronal basis of an illusory motion percept is explained by decorrelation of parallel motion pathways. Current Biology, 28(23):3748–3762, 2018.

[34] Anirvan Sengupta, Cengiz Pehlevan, Mariano Tepper, Alexander Genkin, and Dmitri Chklovskii. Manifold-tiling localized receptive fields are optimal in similarity-preserving neural networks. In Advances in Neural Information Processing Systems, pages 7080–7090, 2018.

[35] Sandra Single and Alexander Borst. Dendritic integration and its role in computing image velocity. Science, 281(5384):1848–1850, 1998.

[36] Shiva R Sinha, William Bialek, and Rob R van Steveninck. Optimal local estimates of visual motion in a natural environment. arXiv preprint arXiv:1812.11878, 2018.

[37] Nelson Spruston. Pyramidal neurons: dendritic structure and synaptic integration. Nature Reviews Neuroscience, 9(3):206, 2008.

[38] Greg Stuart, Nelson Spruston, Bert Sakmann, and Michael Häusser. Action potential initiation and backpropagation in neurons of the mammalian CNS.
Trends in Neurosciences, 20(3):125–131, 1997.

[39] Shin-ya Takemura, Aljoscha Nern, Dmitri B Chklovskii, Louis K Scheffer, Gerald M Rubin, and Ian A Meinertzhagen. The comprehensive connectome of a neural substrate for ‘ON’ motion detection in Drosophila. eLife, 6, 2017.

[40] V Torre and T Poggio. A synaptic mechanism possibly underlying directional selectivity to motion. Proceedings of the Royal Society of London. Series B. Biological Sciences, 202(1148):409–416, 1978.

[41] Christoph von der Malsburg. The correlation theory of brain function. In Models of Neural Networks, pages 95–119. Springer, 1994.

[42] Jimmy Wang, Jascha Sohl-Dickstein, and Bruno Olshausen. Unsupervised learning of Lie group operators from image sequences. In Frontiers in Systems Neuroscience. Conference Abstract: Computational and Systems Neuroscience, volume 1130, 2009.

[43] Carl FR Wienecke, Jonathan CS Leong, and Thomas R Clandinin. Linear summation underlies direction selectivity in Drosophila. Neuron, 99(4):680–688, 2018.

[44] Christopher KI Williams. On a connection between kernel PCA and metric multidimensional scaling. In Advances in Neural Information Processing Systems, pages 675–681, 2001.