{"title": "Developing Population Codes by Minimizing Description Length", "book": "Advances in Neural Information Processing Systems", "page_first": 11, "page_last": 18, "abstract": null, "full_text": "Developing Population Codes By \nMinimizing Description Length \n\nRichard S. Zemel \n\nCNL, The Salk Institute \n\n10010 North Torrey Pines Rd. \n\nLa Jolla, CA 92037 \n\nGeoffrey E. Hinton \n\nDepartment of Computer Science \n\nUniversity of Toronto \n\nToronto M5S 1A4 Canada \n\nAbstract \n\nThe Minimum Description Length principle (MDL) can be used to \ntrain the hidden units of a neural network to extract a representation \nthat is cheap to describe but nonetheless allows the input to \nbe reconstructed accurately. We show how MDL can be used to \ndevelop highly redundant population codes. Each hidden unit has \na location in a low-dimensional implicit space. If the hidden unit \nactivities form a bump of a standard shape in this space, they can \nbe cheaply encoded by the center of this bump. So the weights from \nthe input units to the hidden units in an autoencoder are trained \nto make the activities form a standard bump. The coordinates of \nthe hidden units in the implicit space are also learned, thus allowing \nflexibility, as the network develops a discontinuous topography \nwhen presented with different input classes. Population-coding in \na space other than the input enables a network to extract nonlinear \nhigher-order properties of the inputs. \n\nMost existing unsupervised learning algorithms can be understood using the Minimum \nDescription Length (MDL) principle (Rissanen, 1989). Given an ensemble \nof input vectors, the aim of the learning algorithm is to find a method of coding \neach input vector that minimizes the total cost, in bits, of communicating the input \nvectors to a receiver. 
There are three terms in the total description length: \n\n\u2022 The code-cost is the number of bits required to communicate the code \nthat the algorithm assigns to each input vector. \n\n\u2022 The model-cost is the number of bits required to specify how to reconstruct \ninput vectors from codes (e.g., the hidden-to-output weights). \n\n\u2022 The reconstruction-error is the number of bits required to fix up any \nerrors that occur when the input vector is reconstructed from its code. \n\nFormulating the problem in terms of a communication model allows us to derive an \nobjective function for a network (note that we are not actually sending the bits). \nFor example, in competitive learning (vector quantization), the code is the identity \nof the winning hidden unit, so by limiting the system to $\\mathcal{H}$ units we limit the \naverage code-cost to at most $\\log_2 \\mathcal{H}$ bits. The reconstruction-error is proportional \nto the squared difference between the input vector and the weight-vector of the \nwinner, and this is what competitive learning algorithms minimize. The model-cost \nis usually ignored. \n\nThe representations produced by vector quantization contain very little information \nabout the input (at most $\\log_2 \\mathcal{H}$ bits). To get richer representations we must allow \nmany hidden units to be active at once and to have varying activity levels. Principal \ncomponents analysis (PCA) achieves this for linear mappings from inputs to codes. \nIt can be viewed as a version of MDL in which we limit the code-cost by having \nonly a few hidden units, and ignore the model-cost and the accuracy with which \nthe hidden activities must be coded. An autoencoder (see Figure 2) that tries to \nreconstruct the input vector on its output units will perform a version of PCA if the \noutput units are linear. 
We can obtain novel and interesting unsupervised learning \nalgorithms using this MDL approach by considering various alternative methods of \ncommunicating the hidden activities. The algorithms can all be implemented by \nbackpropagating the derivative of the code-cost for the hidden units in addition to \nthe derivative of the reconstruction-error backpropagated from the output units. \n\nAny method that communicates each hidden activity separately and independently \nwill tend to lead to factorial codes because any mutual information between hidden \nunits will cause redundancy in the communicated message, so the pressure to keep \nthe message short will squeeze out the redundancy. In (Zemel, 1993) and (Hinton \nand Zemel, 1994), we present algorithms derived from this MDL approach aimed \nat developing factorial codes. Although factorial codes are interesting, they are not \nrobust against hardware failure nor do they resemble the population codes found in \nsome parts of the brain. Our aim in this paper is to show how the MDL approach \ncan be used to develop population codes in which the activities of hidden units are \nhighly correlated. For a more complete discussion of the details of this algorithm, \nsee (Zemel, 1993). \n\nUnsupervised algorithms contain an implicit assumption about the nature of the \nstructure or constraints underlying the input set. For example, competitive learning \nalgorithms are suited to datasets in which each input can be attributed to one of \na set of possible causes. In the algorithm we present here, we assume that each \ninput can be described as a point in a low-dimensional continuous constraint space. \nFor instance, a complex shape may require a detailed representation, but a set of \nimages of that shape from multiple viewpoints can be concisely represented by first \ndescribing the shape, and then encoding each instance as a point in the constraint \nspace spanned by the viewing parameters. 
Our goal is to find and represent the \nconstraint space underlying high-dimensional data samples. \n\n[Figure 1: plot of unit activities drawn as blobs; axes: size and orientation.] \n\nFigure 1: The population code for an instance in a two-dimensional implicit space. \nThe position of each blob corresponds to the position of a unit within the population, \nand the blob size corresponds to the unit's activity. Here one dimension describes \nthe size and the other the orientation of a shape. We can determine the instantiation \nparameters of this particular shape by computing the center of gravity of the blob \nactivities, marked here by an \"X\". \n\n1 POPULATION CODES \n\nIn order to represent inputs as points drawn from a constraint space, we choose \na population code style of representation. In a population code, each code unit is \nassociated with a position in what we call the implicit space, and the code units' \npattern of activity conveys a single point in this space. This implicit space should \ncorrespond to the constraint space. For example, suppose that each code unit \nis assigned a position in a two-dimensional implicit space, where one dimension \ncorresponds to the size of the shape and the second to its orientation in the image \n(see Figure 1). A population of code units broadly tuned to different positions can \nrepresent any particular instance of the shape by their relative activity levels. \n\nThis example illustrates that population codes involve three quite different spaces: \nthe input-vector space (the pixel intensities in the example); the hidden-vector space \n(where each hidden, or code, unit entails an additional dimension); and this third, \nlow-dimensional space which we term the implicit space. 
In a learning algorithm \nfor population codes, this implicit space is intended to come to smoothly represent \nthe underlying dimensions of variability in the inputs, i.e., the constraint space. \nFor instance, the Kohonen (1982) algorithm defines the implicit space topology \nthrough fixed neighborhood relations, and the algorithm then manipulates hidden-vector \nspace so that neighbors in implicit space respond to similar inputs. \n\nThis form of coding has several computational advantages, in addition to its significance \ndue to its prevalence in biological systems. Population codes contain some \nredundancy and hence have some degree of fault-tolerance, and they reflect the underlying \nstructure of the input, in that similar inputs are mapped to nearby implicit \npositions. They also possess a hyperacuity property, as the number of implicit \npositions that can be represented far exceeds the number of code units. \n\n2 LEARNING POPULATION CODES WITH MDL \n\nAutoencoders are a general way of addressing issues of coding: the hidden \nunit activities for an input, produced by the input-hidden weights, are the code \nfor that input, and reconstruction from the code is done by the \nhidden-output mapping. In order to allow an autoencoder to develop population \ncodes for an input set, we need some additional structure in the hidden layer that \nwill allow a code vector to be interpreted as a point in implicit space. While most \ntopographic-map formation algorithms (e.g., the Kohonen and elastic net (Durbin \nand Willshaw, 1987) algorithms) define the topology of this implicit space by fixed \nneighborhood relations, in our algorithm we use a more explicit representation. \nEach hidden unit has weights coming from the input units that determine its activity \nlevel. 
But in addition to these weights, it has another set of adjustable parameters \nthat represent its coordinates in the implicit space. To determine what implicit \nposition is represented by a vector of hidden activities, we can average together the \nimplicit coordinates of the hidden units, weighting each coordinate vector by the \nactivity level of the unit. \n\nSuppose, for example, that each hidden unit is connected to an 8x8 retina and has \n2 implicit coordinates that represent the size and orientation of a particular kind of \nshape on the retina, as in our earlier example. If we plot the hidden activity levels \nin the implicit space (not the input space), we would like to see a bump of activity \nof a standard shape (e.g., a Gaussian) whose center represents the instantiation \nparameters of the shape (Figure 2 depicts this for a 1D implicit space). If the \nactivities form a perfect Gaussian bump of fixed variance we can communicate \nthem by simply communicating the coordinates of the mean of the Gaussian; this \nis very economical if there are far fewer implicit coordinates than hidden units. \n\nIt is important to realize that the activity of a hidden unit is actually caused by the \ninput-to-hidden weights, but by setting these weights appropriately we can make \nthe activity match the height under the Gaussian in implicit space. If the activity \nbump is not quite perfect, we must also encode the bump-error, i.e., the misfit between \nthe actual activity levels and the levels predicted by the Gaussian bump. The \ncost of encoding this misfit is what forces the activity bump in implicit space to \napproximate a Gaussian. \n\nThe reconstruction-error is then the deviation of the output from the input. This \nreconstruction ignores implicit space; the output activities depend only on the vector \nof hidden activities and weights. 
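
The bump idea described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the five-unit layout, and the use of a plain sum of squared differences as the misfit are assumptions made for illustration, and the bump center is taken to be the activity-weighted average of the implicit coordinates. The predicted activities are normalized to sum to one, on the assumption that the hidden activities are normalized as well.

```python
import math

def bump_center(coords, activities):
    # Activity-weighted average of the units' implicit coordinates (1D here).
    total = sum(activities)
    return sum(x * b for x, b in zip(coords, activities)) / total

def gaussian_bump(coords, center, sigma):
    # Height of a Gaussian bump at each unit's implicit position,
    # normalized so that the predicted activities sum to 1.
    heights = [math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2)) for x in coords]
    norm = sum(heights)
    return [h / norm for h in heights]

def bump_misfit(activities, predicted):
    # Sum of squared differences between actual and predicted activities
    # (an assumed stand-in for the cost of coding the bump-error).
    return sum((b - p) ** 2 for b, p in zip(activities, predicted))

# Hypothetical example: 5 hidden units evenly spaced in a 1D implicit space.
coords = [0.0, 1.0, 2.0, 3.0, 4.0]
sigma = 1.0

# Activities that already form an exact bump centered at 2.0.
perfect = gaussian_bump(coords, 2.0, sigma)
center = bump_center(coords, perfect)
misfit = bump_misfit(perfect, gaussian_bump(coords, center, sigma))

# Activities that do not form a bump incur a larger misfit.
noisy = [0.5, 0.1, 0.1, 0.1, 0.2]
noisy_center = bump_center(coords, noisy)
noisy_misfit = bump_misfit(noisy, gaussian_bump(coords, noisy_center, sigma))

print(center, misfit, noisy_misfit)
```

When the activities form an exact bump, sending the center alone is enough and the misfit is essentially zero; the irregular pattern leaves a residual that would also have to be communicated.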
\n\n2.1 The objective function \n\nCurrently, we ignore the model-cost, so the description length to be minimized is: \n\n$$E^t = B^t + R^t = \\sum_{j=1}^{\\mathcal{H}} (b_j^t - \\hat{b}_j^t)^2 / 2V_B + \\sum_{k=1}^{N} (a_k^t - c_k^t)^2 / 2V_R \\qquad (1)$$ \n\nwhere $a$, $b$, $c$ are the activities of units in the input, hidden, and output layers, \nrespectively, $V_B$ and $V_R$ are the fixed variances of the Gaussians used for coding the \nbump-errors and the reconstruction-errors, and the other symbols are explained in \nthe caption of Figure 2. \n\n[Figure 2: left panel, NETWORK, with Input (1...N), Hidden (1...H), and Output (1...N) layers; right panel, IMPLICIT SPACE, plotting Activity (b) against Position (x) for the hidden units, with the best-fit Gaussian overlaid.] \n\nFigure 2: Each of the $\\mathcal{H}$ hidden units in the autoencoder has an associated position \nin implicit space. Here we show a 1D implicit space. The activity $b_j^t$ of each hidden \nunit $j$ on case $t$ is shown by a solid line. The network fits the best Gaussian to this \npattern of activity in implicit space. The predicted activity, $\\hat{b}_j^t$, of unit $j$ under this \nGaussian is based on the distance from $x_j$ to the mean $\\mu^t$; it serves as a target for \n$b_j^t$. \n\nWe compute the actual activity of a hidden unit, $b_j^t$, as a normalized exponential \nof its total input (footnote 1). Note that a unit's actual activity is independent of its position \nin implicit space. Its expected activity is its normalized value under the predicted \nGaussian bump: \n\n$$\\hat{b}_j^t = \\frac{\\exp(-(x_j - \\mu^t)^2 / 2\\sigma^2)}{\\sum_{i=1}^{\\mathcal{H}} \\exp(-(x_i - \\mu^t)^2 / 2\\sigma^2)} \\qquad (2)$$ \n\nwhere $\\sigma$ is the width of the bump, which we assume for now is fixed throughout \ntraining. \n\nWe have explored several methods for computing the mean of this bump. 
Simply \ncomputing the center of gravity of the representation units' positions, weighted \nby their activity, produces a bias towards points in the center of implicit space. \nInstead, on each case, a separate minimization determines $\\mu^t$; it is the position in \nimplicit space that minimizes $B^t$ given $\\{x_j, b_j^t\\}$. The network has full inter-layer \nconnectivity, and linear output units. Both the network weights and the implicit \ncoordinates of the hidden units are adapted to minimize $E$. \n\nFootnote 1: $b_j^t = \\exp(net_j^t) / \\sum_{i=1}^{\\mathcal{H}} \\exp(net_i^t)$, where $net_j^t$ is the net input into unit $j$ on case $t$. \n\n[Figure 3: two surface plots of Activity over X position and Y position; panel titles: Unit 18 - Epoch 0 and Unit 18 - Epoch 23.] \n\nFigure 3: This figure shows the receptive field in implicit space for a hidden unit. \nThe left panel shows that before learning, the unit responds randomly to 100 different \ntest patterns, generated by positioning a shape in the image at each point in a \n10x10 grid. Here the 2 dimensions in implicit space correspond to x and y positions. \nThe right panel shows that after learning, the hidden unit responds to objects in \na particular position, and its activity level falls off smoothly as the object position \nmoves away from the center of the learned receptive field. \n\n3 EXPERIMENTAL RESULTS \n\nIn the first experiment, each 8x8 real-valued input image contained an instance of a \nsimple shape in a random (x, y)-position. The network began with random weights, \nand with each of the 100 hidden units at a random 2D implicit position; we trained it using \nconjugate gradient on 400 examples. The network converged after 25 epochs. Each \nhidden unit developed a receptive field so that it responded to inputs in a limited \nneighborhood that corresponded to its learned position in implicit space (see Figure \n3). 
The set of hidden units covered the range of possible positions. \n\nIn a second experiment, we also varied the orientation of the shape and we gave \neach hidden unit three implicit coordinates. The network converged after 60 epochs \nof training on 1000 images. The hidden unit activities formed a population code \nthat allowed the input to be accurately reconstructed. \n\nA third experiment employed a training set where each image contained either a \nhorizontal or vertical bar, in some random position. The hidden units formed an \ninteresting 2D implicit space in this case: one set of hidden units moved to one \ncorner of the space, and represented instances of one shape, while the other group \nmoved to an opposite corner and represented the other (Figure 4). The network \nwas thus able to squeeze a third dimension (i.e., which shape) into the 2D implicit \nspace. This type of representation would be difficult to learn in a Kohonen network; \nthe fact that the hidden units learn their implicit coordinates allows more flexibility \nthan a system in which these coordinates are fixed in advance. \n\n
[Figure 4: two scatter plots with axes X and Y; panel titles: Implicit Space (Epoch 0) and Implicit Space (Epoch 120); the legend includes mean.V and mean.H.] \n\nFigure 4: This figure shows the positions of the hidden units and the means in the 2D \nimplicit space before and after training on the horizontal/vertical task. The means \nin the top right of the second plot all correspond to images containing vertical bars, \nwhile the other set correspond to horizontal bar images. Note that some hidden \nunits are far from all the means; these units do not play a role in the coding of the \ninput, and are free to be recruited for other types of input cases. \n\n4 RELATED WORK \n\nThis new algorithm bears some similarities to several earlier algorithms. In the \nexperiments presented above, each hidden unit learns to act as a Radial Basis \nFunction (RBF) unit. 
Unlike standard RBFs, however, here the RBF activity serves \nas a target for the activity levels, and is determined by distance in a space other \nthan the input space. \n\nOur algorithm is more similar to topographic map formation algorithms, such as the \nKohonen and elastic-net algorithms. In these methods, however, the population-code \nis in effect formed in input space. Population coding in a space other than \nthe input enables our networks to extract nonlinear higher-order properties of the \ninputs. \n\nIn (Saund, 1989), hidden unit patterns of activity in an autoencoder are trained to \nform Gaussian bumps, where the center of the bump is intended to correspond to \nthe position in an underlying dimension of the inputs. In addition to the objective \nfunctions being quite different in the two algorithms, another crucial difference \nexists: in his algorithm, as well as the other earlier algorithms, the implicit space \ntopology is statically determined by the ordering of the hidden units, while units in \nour model learn their implicit coordinates. \n\n5 CONCLUSIONS AND CURRENT DIRECTIONS \n\nWe have shown how MDL can be used to develop non-factorial, redundant representations. \nThe objective function is derived from a communication model where \nrather than communicating each hidden unit activity independently, we instead \ncommunicate the location of a Gaussian bump in a low-dimensional implicit space. \nIf hidden units are appropriately tuned in this space their activities can then be \ninferred from the bump location. \n\nOur method can easily be applied to networks with multiple hidden layers, where \nthe implicit space is constructed at the last hidden layer before the output and \nderivatives are then backpropagated; this allows the implicit space to correspond \nto arbitrarily high-order input properties. 
Alternatively, instead of using multiple \nhidden layers to extract a single code for the input, one could use a hierarchical \nsystem in which the code-cost is computed at every layer. \n\nA limitation of this approach (as well as the aforementioned approaches) is the \nneed to predefine the dimensionality of implicit space. We are currently working \non an extension that will allow the learning algorithm to determine for itself the \nappropriate number of dimensions in implicit space. We start with many dimensions \nbut include the cost of specifying $\\mu^t$ in the description length. This obviously \ndepends on how many implicit coordinates are used. If all of the hidden units have \nthe same value for one of the implicit coordinates, it costs nothing to communicate \nthat value for each bump. In general, the cost of an implicit coordinate depends \non the ratio between its variance (over all the different bumps) and the accuracy \nwith which it must be communicated. So the network can save bits by reducing \nthe variance for unneeded coordinates. This creates a smooth search space for \ndetermining how many implicit coordinates are needed. \n\nAcknowledgements \n\nThis research was supported by grants from NSERC, the Ontario Information Technology \nResearch Center, and the Institute for Robotics and Intelligent Systems. Geoffrey Hinton \nis the Noranda Fellow of the Canadian Institute for Advanced Research. We thank Peter \nDayan for helpful discussions. \n\nReferences \n\nDurbin, R. and Willshaw, D. (1987). An analogue approach to the travelling salesman \nproblem. Nature, 326:689-691. \n\nHinton, G. and Zemel, R. (1994). Autoencoders, minimum description length, and \nHelmholtz free energy. To appear in Cowan, J.D., Tesauro, G., and Alspector, \nJ. (eds.), Advances in Neural Information Processing Systems 6. San Francisco, \nCA: Morgan Kaufmann. \n\nKohonen, T. (1982). Self-organized formation of topologically correct feature maps. 
\nBiological Cybernetics, 43:59-69. \n\nRissanen, J. (1989). Stochastic Complexity in Statistical Inquiry. World Scientific Publishing \nCo., Singapore. \n\nSaund, E. (1989). Dimensionality-reduction using connectionist networks. IEEE \nTransactions on Pattern Analysis and Machine Intelligence, 11(3):304-314. \n\nZemel, R. (1993). A Minimum Description Length Framework for Unsupervised Learning. \nPh.D. Thesis, Department of Computer Science, University of Toronto. \n", "award": [], "sourceid": 845, "authors": [{"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}