{"title": "Bipartite expander Hopfield networks as self-decoding high-capacity error correcting codes", "book": "Advances in Neural Information Processing Systems", "page_first": 7688, "page_last": 7699, "abstract": "Neural network models of memory and error correction famously include the Hopfield network, which can directly store---and error-correct through its dynamics---arbitrary N-bit patterns, but only for ~N such patterns. On the other end of the spectrum, Shannon's coding theory established that it is possible to represent exponentially many states (~e^N) using N symbols in such a way that an optimal decoder could correct all noise upto a threshold. We prove that it is possible to construct an associative content-addressable network that combines the properties of strong error correcting codes and Hopfield networks: it simultaneously possesses exponentially many stable states, these states are robust enough, with large enough basins of attraction that they can be correctly recovered despite errors in a finite fraction of all nodes, and the errors are intrinsically corrected by the network\u2019s own dynamics. The network is a two-layer Boltzmann machine with simple neural dynamics, low dynamic-range (binary) pairwise synaptic connections, and sparse expander graph connectivity. 
Thus, quasi-random sparse structures---characteristic of important error-correcting codes---may provide for high-performance computation in artificial neural networks and the brain.", "full_text": "Bipartite expander Hopfield networks as self-decoding high-capacity error correcting codes\n\nRishidev Chaudhuri\nCenter for Neuroscience, Departments of Mathematics and Neurobiology, Physiology and Behavior,\nUniversity of California, Davis, Davis, CA 95616\nrchaudhuri@ucdavis.edu\n\nIla Fiete\nBrain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA 02139\nfiete@mit.edu\n\nAbstract\n\nNeural network models of memory and error correction famously include the Hopfield network, which can directly store—and error-correct through its dynamics—arbitrary N-bit patterns, but only for ∼N such patterns. On the other end of the spectrum, Shannon's coding theory established that it is possible to represent exponentially many states (∼e^N) using N symbols in such a way that an optimal decoder could correct all noise up to a threshold. We prove that it is possible to construct an associative content-addressable network that combines the properties of strong error correcting codes and Hopfield networks: it simultaneously possesses exponentially many stable states; these states are robust, with basins of attraction large enough that they can be correctly recovered despite errors in a finite fraction of all nodes; and the errors are intrinsically corrected by the network's own dynamics. The network is a two-layer Boltzmann machine with simple neural dynamics, low dynamic-range (binary) pairwise synaptic connections, and sparse expander graph connectivity. 
Thus, quasi-random sparse structures—characteristic of important error-correcting codes—may provide for high-performance computation in artificial neural networks and the brain.\n\n1 Introduction\n\nNeural systems must be able to recover stored states from partial or noisy cues (pattern completion or cleanup), the definition of an associative memory. If the memory state can be addressed by its content, it is furthermore called content-addressable. The classic framework for neural associative content-addressable (ACA) memory is the conceptually powerful Hopfield network [1–3].\n\nHere, we are motivated by both the Hopfield network and by strong error-correcting codes (ECCs). ECCs are simultaneously compact, allowing the specification of exponentially many coding states using codewords that grow only linearly in size; and well-separated or robust, permitting the correction of relatively large errors by an appropriate decoder. Specifically, ECCs can use strings of N bits to encode exponentially many messages (∼2^{αN}, 0 < α < 1), and to retrieve them in the presence of noise that corrupts a finite fraction of bits [4, 5].\n\nHowever, in ECCs, the decoder is external to the system and its costs (in time, space, and computational complexity) are not taken into account when determining the capacity of the code. By contrast, the simple neural dynamics of Hopfield networks permits them to clean up or decode their own states, though only for a small number of such states. 
Here we pose (and answer) the question of whether it is possible to build simple, Hopfield-like neural networks that can represent exponentially many well-separated states using linearly many neurons, and perform error correction or decoding on these states in response to errors on a finite fraction of all neurons.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nVery generally, in any representational system there is a tradeoff between capacity and noise robustness [4]: a robust system must have redundancy to recover from noise, but the redundancy comes at the price of fewer representational states. Conversely, a system with a large number of coding states leaves less room for redundancy or surrounding basins for robustness. The entire state space of a neural network with N binary nodes is 2^N; thus the goal of achieving exponentially many stable coding states or codewords—the same scaling as the total number of states—simultaneously with large basin sizes around each codeword, for robust correction of errors that occur at a finite rate at every neuron (for a total number of errors linear in network size), is highly non-trivial.\n\nWe show that a bipartite neural network and its stochastic equivalent, a Boltzmann machine [6]—when equipped with random expander-graph connectivity between layers and clustered inhibition in the hidden layer—achieves the desiderata of strong error-correcting codes with built-in decoding, meaning that the network exhibits exponential capacity, robustness to large errors, and self-decoding or clean-up of these errors. 
By forging connections between the theory of low-density parity-check (LDPC) codes on expander graphs [7] and robust representation in neural networks, this construction leads to ACA networks with an unprecedented combination of robustness and number of states.\n\n2 Results\n\nFor this work, we define a Hopfield network to have N binary neurons, symmetric (undirected) weights, and asynchronous updates of a single randomly selected neuron at each discrete time-step:\n\nx_i^{t+1} = 1 if Σ_j W_ij x_j^t + b_i > 0;  x_i^{t+1} = 0 if Σ_j W_ij x_j^t + b_i < 0;  x_i^{t+1} = Bern(0.5) if Σ_j W_ij x_j^t + b_i = 0.   (1)\n\nHere W is the weight matrix, b is a vector of biases, and Bern(0.5) is 0 or 1 with equal probability. The neuron sums its inputs and turns on (off) if the sum exceeds (falls short of) a threshold. The network dynamics lead it to some stable fixed point, determined by the connection weights and initial state. Stable fixed points are minima of a generalized energy function:\n\nE(x|b, W) = −(1/2) Σ_{i≠j} W_ij x_i x_j − Σ_i b_i x_i   (2)\n\nWe define a high pattern number (HPN) network as one with exponentially many stable states (i.e., C ∼ 2^{αN} for some constant 0 < α ≤ 1, with N neurons). An HPN network retains a non-vanishing information rate (ratio of the log number of coding states to the log of the total number of possible states: log(C)/log(2^N) = α) even as the network grows in size. Note that, as with the codewords of a good ECC, the stable states of an HPN network cannot be arbitrarily chosen (and theoretical results show that there can be at most O(N) arbitrarily chosen stable states in a Hopfield network [8–10]).\n\nWe define a robust system as one that can recover the original (or nearest in Hamming distance) codeword from a perturbed version with a constant error rate (p) in each node. 
Robust networks must thus tolerate a number of errors (pN) proportional to network size, which requires the memory states to be surrounded by attracting basins that grow sufficiently fast with network size.\n\nTo our knowledge, no existing neural network model has been shown to combine the two capabilities of exponential capacity and high robustness, in addition to performing self-decoding (see Discussion for more details on previous work).\n\nIn the main paper, our focus is on the mapping between ECCs and Hopfield networks, and on the intuition behind why the network combines exponentially many fixed points with large basins of attraction. Proofs and more technical results are in the SI. In particular, S3–4 prove that general linear codes can be partially mapped onto Hopfield networks but the complete mapping fails, S7–9 prove that expander codes can be mapped to Hopfield networks and provide further details on the construction, S10–11 consider extensions to weaker constraints and noisy updates, and S12 describes a self-organization rule that generates the network.\n\n2.1 Using ECCs to construct Hopfield networks with exponentially many well-separated minima and failure of decoding\n\nFirst, we consider how to directly embed the codewords of a linear ECC into fixed points of a neural network to generate well-separated stable states [11, 12].\n\nLinear ECCs consist of binary codewords of length K + K′ = N that satisfy K′ parity constraint equations, for a total of 2^K codewords separated from each other by up to K′ bits. These codes can permit correction of errors on up to (K′ − 1)/2 bits (see SI S2 for a brief pedagogical overview of ECCs), by an appropriate decoder.\n\nWe demonstrate the embedding using the classic (7,4) Hamming code [13], which consists of 2^4 codewords of length 7, including 3 parity check bits, Fig. 
1a (see SI S3 for the embedding of more general linear ECC codewords into neural network fixed points). The 3-bit separation of codewords means that all single-bit flips should be corrected by an optimal decoder. The correct codeword is the one that is closest in Hamming distance to the single-bit corrupted state.\n\nThe Hamming code can be mapped into the fixed points of an ACA network using a Hopfield network of 7 neurons and 4th-order weights, Fig. 1b [11]: the binary state of one neuron represents one bit (letter) in the codeword, while each weight represents a 4-way constraint on the nodes, Figure 1b. By construction, the minimum energy states are the Hamming codewords (SI S3 for a proof).\n\nWe achieve the same result more realistically, using only pairwise rather than K-th order connections, by introducing a bipartite architecture: each hidden node can connect pairwise with multiple (K) input neurons to enforce the appropriate K-th order relationships among the input neurons, Fig. 1c.\n\nThough the well-separated codewords are well-separated minima of the network dynamics, running the dynamics does not appropriately correct a flipped bit, Fig. 1d. Starting one bit-flip from a codeword (at x1), there are many paths for the network to move downhill in energy (energy function in Fig. 1b). One is to appropriately correct the error bit. But, in general, there is only one downhill and correct direction, and many more (∼N/2) ways to move downhill in energy that correspond to steps away in Hamming distance (e.g. by flipping bits x3 or x7), producing incorrect decoding. For network embeddings of longer Hamming codes, the error probability approaches unity (SI S4). 
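The code structure described above can be checked directly. A minimal sketch (illustrative code, not from the paper) enumerating the (7,4) Hamming codewords from the three parity equations shown in Fig. 1a:

```python
from itertools import product

# Parity checks of the (7,4) Hamming code as given in Fig. 1a
# (bits x1..x7; sums are modulo 2)
def checks(x):
    x1, x2, x3, x4, x5, x6, x7 = x
    return ((x1 + x2 + x3 + x5) % 2 == 0 and
            (x2 + x3 + x4 + x6) % 2 == 0 and
            (x1 + x3 + x4 + x7) % 2 == 0)

codewords = [x for x in product((0, 1), repeat=7) if checks(x)]
assert len(codewords) == 16  # 2^4 codewords

# Minimum Hamming distance between distinct codewords is 3,
# so any single bit flip is correctable by a nearest-codeword decoder
dmin = min(sum(a != b for a, b in zip(u, v))
           for u in codewords for v in codewords if u != v)
assert dmin == 3
```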
Indeed, we expect similar decoding failures for a large class of error-correcting codes (SI S4).\n\nThe network's failure should not be surprising, since decoding strong ECCs is computationally hard [14], involving more complex message-passing dynamics than the local sums and pointwise nonlinearities of neural networks. The failure is one of credit assignment: the network cannot identify the actual flipped bit, and flips a different bit to move downhill in energy. A successful solution must solve the credit-assignment problem.\n\n2.2 Exponential-capacity robust error-correcting ACAs\n\nThe central contribution of the present work is to show, we believe for the first time, that it is possible to implement strong error correcting codes with self-decoding in Hopfield networks. Our solution is based on recent developments in graph theory and coding theory that led to ECCs that can be decoded by simple greedy algorithms [7].\n\nThese ECCs rely on a constraint structure determined by an expander graph. In an expander graph, all small subsets of vertices are connected to relatively large numbers of vertices in their complements. For instance, a subset of 4 vertices, each with out-degree 3, shows good expansion if it projects to 12 other vertices, Figure 2a (top).\n\nWe construct a neural network embedding by constructing a bipartite expander graph, consisting of N and NC input and hidden (constraint) nodes (NC ≲ N), respectively. Each input node is a neuron, while each hidden node is a small network of neurons interacting competitively through inhibition, whose size K does not grow with N. Different neurons in a constraint node are connected to the same input subset, but differ in their weights; the competitive interactions serve to separate stable states within the constraint node.\n\nAt the level of nodes, connections run between input and constraint layers but not between nodes in a layer, as in a restricted Boltzmann machine [15]. 
The input and constraint nodes have degrees z and zC respectively (thus Nz = NC zC; for expositional simplicity we assume fixed degree; see SI S6–9 for generalization to variable degree). We are interested in sparse networks, where the number of connections the input and hidden nodes each make does not increase with network size.\n\nFigure 1: Successful embedding but incorrect decoding of ECC codewords in Hopfield networks. (a) Codewords of the (7,4) Hamming code are binary strings of length 7 (right) that satisfy 3 constraint equations (bottom left; sums are modulo 2). (b) The codewords are embedded as the stable states of a Hopfield network with 7 nodes (circles) and 4th-order edges (lines of a single color represent a single 4th-order edge; the edge weight between nodes i, j, k, l is J_ijkl; binary Hamming states are mapped to {−1, 1} in the Hopfield network for notational convenience). Each edge implements one constraint equation. Bottom: the energy function E; it is minimized at each codeword. (c) The 4th-order edges can be replaced by conventional pairwise edges if the recurrent network is transformed into a network with a hidden layer. A constraint node (square) is not a single neuron but a small network that implements a parity operation on its inputs (e.g. as in Fig. 3). (d) Schematic showing decoding error. The initial state −1111111 (lavender cross), one flip from the closest codeword (1111111; blue), can proceed along multiple trajectories that decrease energy, either to the correct codeword (blue) or to a codeword two flips away (black). Solid lines show regions in state space closest to a particular codeword.\n\n
This bipartite graph is a (γ, (1 − ε)) expander if every sufficiently small subset S (of size |S| < γN, for some fixed γ < 1) of input vertices has at least (1 − ε)z|S| neighbors among the hidden vertices [7, 16–18] (0 ≤ ε < 1 is some constant, with ε → 0 corresponding to increasing expansion).\n\nSurprisingly, and helpfully in the neural context, sparse random bipartite graphs even with small degree yield good expansion (with ε < 1/4 generically) [7, 16–18]. Thus we choose connections between layers randomly, and verify that they are indeed good expanders (SI S5 and Figure S1).\n\nThe Lyapunov (generalized energy) function of the network is:\n\nE(x, h) = −( x^T U h + b^T h + (1/2) h^T W h ).   (3)\n\nHere x and h are vectors of input and constraint neuron activity, respectively; U is a sparse matrix with non-zero entries determined by an expander graph; W is a block-diagonal matrix of inhibitory connections; and b is a vector of fixed background inputs to the constraint neurons (for further details see SI S8).\n\nThe network capacity is exponential, Figure 2b: the number of stable states grows exponentially in the total size of the network (total number of neurons in input and constraint nodes). 
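The expansion property defined above can be checked by brute force on a small random bipartite graph. A sketch with illustrative sizes (much smaller than the simulated networks, and not the paper's verification code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)
N, NC, z = 30, 24, 5  # illustrative sizes

# Random bipartite graph: each input node sends z edges to constraint nodes
neighbors = [set(rng.choice(NC, size=z, replace=False)) for _ in range(N)]

def expansion_eps(subset_size):
    """Worst-case epsilon over all input subsets S of the given size,
    where |neighbors(S)| >= (1 - eps) * z * |S|."""
    worst = 1.0
    for S in combinations(range(N), subset_size):
        nb = set().union(*(neighbors[i] for i in S))
        worst = min(worst, len(nb) / (z * len(S)))
    return 1.0 - worst

# Small subsets of a sparse random graph typically expand well (eps small)
for s in (1, 2, 3):
    print(s, expansion_eps(s))
```

Exhaustive enumeration is only feasible for tiny graphs; SI S5 of the paper describes how expansion is verified for the simulated networks.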
Moreover, the network dynamics appropriately moves perturbed states back to their closest (in Hamming distance) stable states, correcting a number of errors that increases in proportion to the number of input nodes (and thus also total network size), Figure 2c–e (proofs in SI, S6–9).\n\nSpecifically, if each constraint node contains at most K_max neurons (independent of N; thus the total network size is N_total ≤ N + K_max NC), we prove that the expected total number of energy minima and the total number of correctable errors are given by\n\nN_states ≥ 2^{α N_total} and N_errors ≥ βN,   (4)\n\nwhere α = (1 − ⟨r⟩ẑ)/(1 + K_max ẑ) and β = γ(1 − 2ε) are both finite constants. Here ẑ = ⟨z⟩/⟨zC⟩ is the ratio of the average input and hidden layer degrees across nodes (= z/zC for a regular network), and −⟨r⟩ is the average across nodes of the log ratio of permitted states to all states for a constraint node (SI S8 for more details). The number of minimum energy states is exponential in total network size because α is independent of N, N_total. We need only be concerned with errors in the inputs because the initial conditions in the constraint nodes are set by relaxation after briefly clamping the input nodes (equivalently, all errors in the constraint neurons are irrelevant/correctable).\n\nGiven any per-neuron error probability smaller than β, the network corrects all errors (with probability → 1 as N → ∞). 
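The bounds in Eq. (4) can be evaluated numerically. The sketch below assumes regular degrees, parity constraints (so ⟨r⟩ = 1 bit, since half of all input configurations are permitted), and K_max = 2^{zC−1}; the values of γ and ε are hypothetical placeholders, not taken from the paper:

```python
# Illustrative evaluation of the capacity bounds in Eq. (4).
# Assumptions (not from the paper's simulations): regular degrees,
# parity constraints (<r> = 1 bit), K_max = 2^(zC - 1).
def capacity_bounds(N, z, zC, gamma, eps):
    NC = N * z // zC          # from N z = NC zC
    z_hat = z / zC            # ratio of average degrees
    K_max = 2 ** (zC - 1)
    r = 1.0                   # parity: log2(all/permitted) = 1 bit
    alpha = (1 - r * z_hat) / (1 + K_max * z_hat)
    beta = gamma * (1 - 2 * eps)
    N_total = N + K_max * NC
    return alpha, beta, N_total

# gamma and eps below are hypothetical example values
alpha, beta, N_total = capacity_bounds(N=1000, z=5, zC=6, gamma=0.1, eps=0.2)
# N_states >= 2^(alpha * N_total), N_errors >= beta * N
```

Note that α > 0 requires ẑ < 1 under these parity assumptions, i.e. z < zC, matching the regime NC < N used in the simulations.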
Typical of strong ECCs, the probability of correct inference is step-like: all errors smaller than a threshold size are corrected, while those exceeding the threshold result in failure.\n\nThe results of Figure 2 involve an expansion coefficient ε < 1/4; constraint nodes that only permit input configurations that differ in the states of at least two neurons (described below); and noise-free (Hopfield-like) update dynamics (beyond the randomness present from the asynchronous nature of the updates). In SI S10 and S11 we extend these results to less stringent conditions on the constraint nodes and show that analogous results, up to small fluctuations around the noiseless stable state, hold for stochastic dynamics in a Boltzmann machine (Fig. S2).\n\n2.3 Compositional structure and dynamics of the robust high-capacity ACA\n\nWe next examine the structure and dynamics of this network, to explain how it works. In the section after, we examine why, by virtue of its sparse expander structure, it does not fall prey to the credit-assignment errors exhibited by neural network implementations of Hamming codes.\n\nWithin a constraint node, a small network of K neurons with recurrent inhibition attempts to drive the input states to a smaller set of permitted states.^1 Thus, although the network has a restricted Boltzmann machine architecture at the level of input neurons and constraint nodes, the constraint neurons in a given node interact through lateral inhibition. All neurons in one constraint node connect to the same subset of zC input neurons, Figure 3a, but each will prefer a different configuration of states on these input neurons, as determined by its weights. 
For our construction, it is important that these preferred configurations differ from one another in at least two bits, or two input neuron states. We will show below a simple self-organized way for these well-separated configurations to emerge, but for the moment will simply assume that these hidden node preferred states are at least two bits apart. Then, there are at most 2^{zC−1} preferred states (out of 2^{zC} possible states for the input subset), requiring K ≤ 2^{zC−1} neurons in a constraint node (Figure 3b). Operationally, zC can be small (between 2 and 6 in Figure 2). For simplicity, in Figures 2 and 3 we choose the preferred states to be even-parity states of the input, by setting weights appropriately (see SI S6, S8–10 for generalizations).\n\nBy virtue of their shared inputs, neurons in a constraint node could be viewed as a glomerulus (this describes connectivity and need not imply physical clustering as for olfactory or cerebellar glomeruli; it also does not perform amplification as in sensory glomeruli). While input-to-constraint node connectivity is random, the connectivity of individual neurons within a constraint node is correlated, since these neurons receive the same set of inputs — the network is therefore not fully random.\n\nTaken together, the stable states of the network are compositions or combinations of the preferred input configurations of the different constraint nodes. In the absence of constraints, the input layer has 2^N possible states. Each constraint node cuts in half the number of possible states (if preferred patterns in the constraint node are based on parity; see SI S6, S8–10 for generalizations); thus, with NC constraints there are 2^{N−NC} stable states. Since NC is a fraction of N, the number of stable states is exponential in N (and in total network size; see above). 
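The stable-state count 2^{N−NC} can be verified by computing the null-space dimension of the parity-constraint matrix over the binary field F2, the same calculation described for Fig. 2b. A minimal sketch with illustrative sizes (not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(2)
N, NC, zC = 40, 38, 4  # illustrative sizes

# Each row of H is one parity constraint over a random subset of zC inputs
H = np.zeros((NC, N), dtype=np.uint8)
for c in range(NC):
    H[c, rng.choice(N, size=zC, replace=False)] = 1

def rank_f2(M):
    """Rank via Gaussian elimination over the binary field F2."""
    M = M.copy() % 2
    rank, col = 0, 0
    rows, cols = M.shape
    while rank < rows and col < cols:
        pivot = np.nonzero(M[rank:, col])[0]
        if pivot.size == 0:
            col += 1
            continue
        M[[rank, rank + pivot[0]]] = M[[rank + pivot[0], rank]]  # row swap
        for r in range(rows):
            if r != rank and M[r, col]:
                M[r] ^= M[rank]  # eliminate in F2
        rank += 1
        col += 1
    return rank

# Number of stable states = size of the null space of H over F2;
# equals 2^(N - NC) when no constraints are linearly dependent
n_states = 2 ** (N - rank_f2(H))
```

Duplicate or linearly dependent constraints only increase the count, which matches the "slight scatter" noted in the Fig. 2b caption.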
By virtue of the ≥2 separation in patterns per constraint node and the expansion property of the connectivity, the stable states are separated by at least γN, which is linear in N (recall that γ is the expansion coefficient).\n\nTo understand how the network works, first consider the activity of a constraint node. Constraint nodes are conditionally independent of each other given the inputs; thus each constraint node can be studied individually, with its inputs. When a neuron within a constraint node receives an input that exactly matches its preferred configuration, it becomes active and silences the others through strong inhibition, Figure 3a. This is a low-energy state for that node (SI S8), which we will refer to as\n\n^1 Within-node inhibition can be replaced by common global inhibition across all neurons and constraint nodes, slowing convergence dynamics but not affecting the overall quality of the computation.\n\nFigure 2: ACA network with exponential capacity and robust error correction. (a) Good (top) vs. poor (bottom) expansion in bipartite graphs. Input nodes each send z = 3 edges to the constraint layer. Subsets of input neurons (one subset highlighted by dashed line) have many (few) neighbors in good (poor) expanders. (b) Network capacity is exponential. Gray line: derived theoretical lower bound on the number of robust stable states versus the number of input neurons (N). Open circles: number of robust stable states in simulated networks (100 simulations for each point). Since constraint nodes apply parity constraints (see SI S6, S8–10 for generalizations), we numerically calculate the number of stable states as the dimension of the null space of the constraint matrix in the binary field F2. The slight scatter of points reflects occasional duplicate constraints in small random networks (vanishingly rare for large N). 
(c) Fraction of times the network converges to the correct state when a finite fraction of nodes (thus a linearly growing number of nodes with network size) is corrupted. The increasingly sharp transition between recovery and failure (as N increases) is characteristic of ECCs. Error bars: standard error. Minimum of 25 runs for each data point. (d) Energy (gray) and number of node flips (black) over time in an N = 500 neuron network with 4% initial corruption, as the network relaxes to the closest stable state. Energy always decreases monotonically, but the number of node flips need not. For panels (b–d), NC = 0.95N, 5 ≤ z ≤ 10 and 2 ≤ zC ≤ 6 (see SI S8 for further details). (e) Network state-space trajectories (projected onto 2D space) in a simulated small network. Black dots: two stable states. Different initial states (with 1–5 nodes corrupted; purple, gray, and orange dots) in the vicinity of the stable state to the left, and their flows to the stable states. All initial states within a threshold Hamming distance of the original stable state flow to it (purple dots). Those not within that distance (gray and orange dots) flow to other stable states, including spurious states. The trajectories that end in the adjacent stable state shown at right are in orange. N = 18, NC = 15, z = 5, zC = 6.\n\nFigure 3: Architecture and dynamics of a robust exponential capacity ACA. (a) A constraint node (square box) is a small subnetwork of 2^{zC−1} neurons, with global inhibition within the node (hashed node is inhibitory). Recall that zC is very small (typically 5-10) and does not change with network size. 
(b) Single constraint node: all neurons in one constraint node connect to the same inputs, but with different binary weights (+'s and −'s shown above a few neurons) that determine which input pattern a constraint neuron prefers. E.g. (first example, top row) a constraint neuron with input weights −−++ prefers 0011 and wins the competition to be active for that input (on state: white fill), silencing the rest (off state: gray). Green check-mark (red cross) shows the constraint node in a satisfied (unsatisfied) state. Here, even-parity patterns (an even number of 1's) are the only preferred input states. (c) Top left: network with both satisfied and unsatisfied constraints. Flipping an input attached to more unsatisfied than satisfied constraint nodes (red arrow, bottom) lowers the energy of the network, by flipping the status of all its constraint nodes (top right). Bottom: the problem of poor expansion: if two constraints have overlapping input states, they cannot identify the source of multiple shared errors. A single error in the shared inputs violates both constraints (center), and if a second shared input is corrupted (right) both constraints are satisfied and the iteration process from above fails.\n\n"satisfied", Figure 3b (left, green). If the input exactly matches none of the preferred configurations, more than one constraint neuron will receive equal drive, Figure 3b (right, red); inhibition is strong enough that no more than two among the equally-driven constraint neurons can be active. Such a state corresponds to a higher-energy, "unsatisfied" configuration. 
The overall energy of this network is proportional to the total number of unsatisfied constraint nodes (SI S3 and S9), allowing local energy-minimizing dynamics to successfully correct errors, as we describe below.\n\n2.4 Credit assignment with expander graph architecture\n\nHeuristically, the network solves credit assignment as follows. For networks with sparse and expansive connectivity, any small set of input neurons shares few common constraint nodes (Fig. 2a and 3c). Thus, if a small fraction of input nodes is wrong, each constraint node will typically receive one or zero wrong inputs. The total number of unsatisfied constraints, which is proportional to the energy of the network, reflects the number of corrupted inputs, aligning the metrics of energy and corrupted input bits. In addition, the specific constellation of unsatisfied constraint nodes for codes on sparse expander networks (but not in standard ECC embeddings into a neural network, including the Hamming code implementations above) is a detailed fingerprint of the pattern of input errors; this fingerprint determines which specific input neurons are in error and should be flipped.\n\nSpecifically, for a sufficiently small input error rate (< γ(1 − 2ε)), a majority of the constraint nodes that receive any corrupted inputs are connected to only one corrupted input. These constraint nodes are appropriately unsatisfied. Next, the network dynamics highly preferentially updates inputs that are connected to more unsatisfied than satisfied constraint nodes (for proofs, see SI, S6–9 and [7]); this process reduces both the energy of the network and the number of corrupted inputs: all unsatisfied constraints connected to the flipped input bit will now be satisfied, and vice versa. 
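This greedy flip rule (update an input connected to more unsatisfied than satisfied constraints) can be sketched for parity constraints as follows; the sizes and random constraint structure are illustrative, and unlike the paper's construction the graph's expansion is not verified here:

```python
import numpy as np

rng = np.random.default_rng(3)
N, NC, zC = 60, 57, 4  # illustrative sizes

# Random sparse parity-check structure: constraint c watches zC inputs
H = np.zeros((NC, N), dtype=np.uint8)
for c in range(NC):
    H[c, rng.choice(N, size=zC, replace=False)] = 1

def unsatisfied(x):
    """A parity constraint is unsatisfied if its inputs have odd sum."""
    return (H @ x) % 2 == 1

def greedy_decode(x, max_iters=10000):
    """Flip any input attached to more unsatisfied than satisfied
    constraints; repeat until no such input exists. Each flip strictly
    reduces the number of unsatisfied constraints, so this terminates."""
    x = x.copy()
    for _ in range(max_iters):
        u = unsatisfied(x)
        # per input: (# unsatisfied neighbors) - (# satisfied neighbors)
        gain = H.T @ u.astype(int) - H.T @ (~u).astype(int)
        candidates = np.nonzero(gain > 0)[0]
        if candidates.size == 0:
            break
        x[rng.choice(candidates)] ^= 1
    return x

x0 = np.zeros(N, dtype=np.uint8)      # all-zeros is always a codeword
noisy = x0.copy()
noisy[rng.choice(N, size=2, replace=False)] ^= 1  # corrupt 2 input bits
decoded = greedy_decode(noisy)
```

For small error rates on a good expander this dynamics provably reaches the nearest codeword (Sec. 2.2 and [7]); on an unverified random instance it is only typically, not always, successful.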
The network thus iteratively reduces unsatisfied constraints and, in doing so, corrects errors.\n\n2.5 Self-organization to robust exponential capacity\n\nRobust exponential capacity requires that preferred patterns at each constraint node differ by at least two bits. The neurons within each constraint node, connected to the same subset of input neurons, can come to prefer sufficiently non-overlapping patterns through a simple self-organization rule. Briefly, 
In sum, we have constructed simple ACA networks with capacity and robustness comparable to\nstate-of-the-art codes in communications theory, moreover with the decoder built into the dynamics.\nSpin glasses (random-weight symmetric Hop\ufb01eld networks) possess exponentially many (quasi)stable\n\ufb01xed points [19] but these energy minima have not been shown to have large basins of attraction.\nInstead, most minima are not observable in numerical explorations [20], suggesting small basins\n\u221aN stable\nof attraction. Hop\ufb01eld networks designed for constraint satisfaction problems have \u223c 2\nstates [21\u201323]; a recent construction with hidden nodes has capacity that is exponential in the ratio\nof the number of hidden to input neurons[24]; and a network based on a sparse bipartite graph with\ninteger-valued neural activity shows exponentially many stable states [25]. However, in all cases the\nfraction of correctable errors either vanishes with network size or is negligible to begin with. Thus\nnone of these networks are robust to noise.\nThe \ufb01rst attempts to link ECCs with spin glasses and Hop\ufb01eld networks were by Sourlas [11, 12],\nwhere the goal was to use the relationship to decode noisy messages. However, these studies construct\na separate neural network (with different weights and energy function) for decoding inputs in the\nneighborhood of each codeword. Thus, each network carries out decoding speci\ufb01c to a particular\ninput and does not possess exponentially-many stable states with large basins of attraction that\nappropriately decode states near any of the robust stable states.\nNear-exponential capacity was recently realized in Hop\ufb01eld networks with clique structures. Hillar\n& Tran [26] construct a network whose stored patterns correspond to the cliques of an abstract\n\ngraph, with capacity C =\u223c e\u03b1\u221aN . Fiete et. 
al [27] divide a network of N neurons into N/ log(N )\nnon-overlapping binary switches, for capacity C =\u223c eN/ log(N ). These networks have multiple nice\nfeatures: they have large basins of attraction, converge rapidly and are easy to construct. However,\ntheir information rate (\u03c1 \u2261 ln(C)/N) still vanishes asymptotically in N and thus does not match\ngood error-correcting codes. They are also susceptible to sub-linear sized adversarial patterns of error.\nOur network, which does not break into distinct sub-linear sized subnetworks due to the intermeshed\nexpander graph structure is not similarly susceptible.\nThe network has a two-layer restricted Boltzmann machine-like architecture and can be represented\nas a factor graph (constraint modules are factors), undirected graphical model (clique potentials\nare indicator functions on visible neurons in a constraint) and Bayes net (in several different ways).\nOur network can be made structurally similar to the Bayes nets used to decode ECCs [28, 29] by\nadding an input layer with \ufb01xed noisy inputs and slightly rewriting constraint modules. The primary\ndifference is dynamical rather than structural. Expander codes permit simple local decoding [7] and\nthus we are able to use Hop\ufb01eld dynamics as the decoding rule in the network. These dynamics are\nsigni\ufb01cantly simpler than belief propagation (BP). In general, codes that can be decoded by BP do\nnot admit simple network decoding by the energy-based Hop\ufb01eld rule and we do not expect general\nECC decoding to be performed by Hop\ufb01eld dynamics. In the SI we make preliminary attempts to\ncharacterize when ECCs can be mapped to neural networks. 
It will be interesting, in future work, to determine when the mapping is possible and to analyze a broader set of ECCs as neural networks.

Hopfield networks of N neurons cannot store for full recall more than O(N) arbitrary patterns [8, 9]. The patterns stored in our exponential-capacity network are not arbitrary: they satisfy a conjunction of many sparse constraints. Given these restrictions, how might these (or any other supralinear-capacity networks [26, 27]) be used?

One possibility is that the networks are used in the traditional Hopfield network sense, as content-addressable memories to recall entire input patterns, but with very high capacity for inputs with appropriate structure. Indeed, natural inputs that are stored well by brains are not random. For example, natural images are generated from latent causes or sources in the world, each imposing constraints on a sparse subset of the retinal data we receive, and might be reasonable candidates to store in such a Hopfield network [30]. Alternatively, processing the data to decorrelate lower moments [31–33] while preserving or adding information in higher moments might produce appropriate structure. It is an open question which stimuli might either already be described within this framework or be naturally transformed to acquire the appropriate structure.

A second possibility is to use these networks as high-capacity pattern labelers or locality-sensitive hash functions. Here, input patterns in a very high-dimensional space (possibly neocortex) are mapped to the exponentially many stable states of an HPN attractor network (possibly hippocampus), which serve as memory labels for the patterns. The connectivity matrix can be constructed in a simple, online, Hebbian way. When presented with a noisy version of an input pattern, the memory network robustly retrieves the correct label (and maintains it in the absence of input).
Such a pattern labeler can be used for recognition or familiarity detection, template matching, classification, locality-sensitive hashing, and nearest-neighbor computations, and could also play a role in memory consolidation and learning conjunctive representations.

Our network is structurally simple, apart from the modular organization of the constraints: connectivity is pairwise and random, and weights have low dynamic range (in fact they are binary). This simplicity suggests that such error-correction strategies may be used in the brain, and it will be interesting to look for signatures of them in neural data. One signature is that the network has many more constraint cells than input cells. Constraint cells are very sparsely active, because only the few that receive their preferred input are driven, along with some others that are transiently active to correct errors. The smaller number of constraint cells that receive their preferred inputs, together with the input cells that carry the actual representation, will respond stably. Consequently, representations are predicted to contain a dense but small stable core, with many other neurons that are transiently active and not reliably responsive across repeats of a pattern (because the errors might differ each time). This is reminiscent of observations in place cell populations [34] and of the sparse, heavy-tailed distribution of population activities across hippocampus and neocortex [35, 36]. A second signature of these strategies is that representations in the stably active neurons should have decorrelated second-order (i.e., pairwise) statistics but contain structure in higher-order moments that the network exploits for error correction.

Our results generalize along multiple directions, and provide several insights into the structure of sparse neural networks.
We show many other results in the supplement, including that: 1) The same principles carry over to stochastic Boltzmann machines (SI S11). 2) Sparse random higher-order Hopfield networks, even without hidden nodes, are isomorphic to expander codes (SI S7). Thus, higher-order Hopfield networks generically have exponential capacity with high robustness if they are sparse and random. Higher-order Hopfield networks are used to model disordered systems in physics, often under the name of p-spin infinite-range models [37]; the connection to expander codes may lead to further insights into these models. 3) Energy-based decoding is likely to generically fail on dense networks, suggesting that sparsity may be necessary, not merely sufficient, for high-capacity neural networks (SI S4). 4) We extend the finding that the total capacity (the product of the information per pattern and the number of patterns) of any Hopfield network with pairwise connections is theoretically bounded at O(N) arbitrary dense patterns [8–10] and O(N^2) arbitrary sparse patterns [38–40], to show that these bounds also hold for architectures with hidden nodes (SI S13).

Expander graph neural networks combine many weak constraints and exploit properties that are rare in low dimensions but generic in high dimensions, both common tropes in modern computer science and machine learning [41–44]. Expander graphs have found widespread recent use in designing algorithms [17, 7, 45]. Because neural networks are large and sparse, and because large sparse random networks are good expanders [17, 18], expander architectures may provide broader insight into the computational capabilities of the brain. Our results may bridge problems in neural networks and a growing body of powerful expander-graph-based techniques and algorithms in computer science.

Code availability All simulations were run using scripts written in Python 2.7, along with standard packages.
Code is available at https://chaudhurilab.ucdavis.edu/code.

Acknowledgments

We are indebted to Peter Latham for extensive and thought-provoking comments and discussion, as well as suggestions for improving the exposition of our results. We are grateful to Yoram Burak, David Schwab, and Ngoc Tran for many helpful discussions on early parts of this work, and to Yoram Burak and Christopher Hillar for comments on the manuscript. IF is an HHMI Faculty Scholar, a CIFAR Senior Fellow, and acknowledges funding from the Simons Foundation and the ONR under a YIP award. Part of this work was performed by RC and IF in residence at the Simons Institute for the Theory of Computing at Berkeley, where RC was supported by a Google Fellowship.

References

[1] Little, W. The existence of persistent states in the brain. Math. Biosci. 19, 101–120 (1974).

[2] Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proc. Natl. Acad. Sci. U. S. A. 79, 2554–8 (1982).

[3] Grossberg, S. Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Netw. 1, 17–61 (1988).

[4] MacKay, D. Information Theory, Inference, and Learning Algorithms (Cambridge University Press, 2004).

[5] Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 27, 379–423, 623–656 (1948).

[6] Hinton, G. E. & Sejnowski, T. J. Optimal perceptual inference. In Proc. CVPR IEEE, 448–453 (Citeseer, 1983).

[7] Sipser, M. & Spielman, D. A. Expander codes. IEEE Trans. Inf. Theory 42, 1710–1722 (1996).

[8] Abu-Mostafa, Y. S. & St Jacques, J. Information capacity of the Hopfield model. IEEE Trans. Inf. Theory 31, 461–464 (1985).

[9] Gardner, E. & Derrida, B. Optimal storage properties of neural network models. J. Phys. A 21, 271 (1988).

[10] Treves, A. & Rolls, E. T.
What determines the capacity of autoassociative memories in the brain? Network: Comp. Neural 2, 371–397 (1991).

[11] Sourlas, N. Spin-glass models as error-correcting codes. Nature 339, 693–695 (1989).

[12] Sourlas, N. Statistical mechanics and capacity-approaching error-correcting codes. Physica A 302, 14–21 (2001).

[13] Hamming, R. W. Error detecting and error correcting codes. Bell Syst. Tech. J. 29, 147–160 (1950).

[14] Berlekamp, E. R., McEliece, R. J. & Van Tilborg, H. C. On the inherent intractability of certain coding problems. IEEE Trans. Inf. Theory 24, 384–386 (1978).

[15] Smolensky, P. Information processing in dynamical systems: Foundations of harmony theory. In Parallel distributed processing: Explorations in the microstructure of cognition, 194–281 (MIT Press, 1986).

[16] Luby, M. G., Mitzenmacher, M., Shokrollahi, M. A. & Spielman, D. A. Improved low-density parity-check codes using irregular graphs. IEEE Trans. Inf. Theory 47, 585–598 (2001).

[17] Hoory, S., Linial, N. & Wigderson, A. Expander graphs and their applications. Bull. Am. Math. Soc. 43, 439–561 (2006).

[18] Lubotzky, A. Expander graphs in pure and applied mathematics. Bull. Am. Math. Soc. 49, 113–162 (2012).

[19] Tanaka, F. & Edwards, S. Analytic theory of the ground state properties of a spin glass. I. Ising spin glass. J. Phys. F 10, 2769 (1980).

[20] Kirkpatrick, S. & Sherrington, D. Infinite-ranged models of spin-glasses. Phys. Rev. B 17, 4384 (1978).

[21] Hopfield, J. J. & Tank, D. W. “Neural” computation of decisions in optimization problems. Biol. Cybern. 52, 141–152 (1985).

[22] Hopfield, J. J. & Tank, D. W. Computing with neural circuits: A model. Science 233, 625–633 (1986).

[23] Tank, D. W. & Hopfield, J. J.
Simple ‘neural’ optimization networks: An A/D converter, signal decision circuit, and a linear programming circuit. IEEE Trans. Circuits Syst. 33, 533–541 (1986).

[24] Alemi, A. & Abbara, A. Exponential capacity in an autoencoder neural network with a hidden layer. arXiv preprint arXiv:1705.07441 (2017).

[25] Salavati, A. H., Kumar, K. R. & Shokrollahi, A. Nonbinary associative memory with exponential pattern retrieval capacity and iterative learning. IEEE Trans. Neural Netw. Learn. Syst. 25, 557–570 (2014).

[26] Hillar, C. J. & Tran, N. M. Robust exponential memory in Hopfield networks. J. Math. Neurosci. 8, 1 (2018).

[27] Fiete, I., Schwab, D. J. & Tran, N. M. A binary Hopfield network with 1/log(n) information rate and applications to grid cell decoding. arXiv preprint arXiv:1407.6029 (2014).

[28] McEliece, R. J., MacKay, D. J. & Cheng, J.-F. Turbo decoding as an instance of Pearl’s “belief propagation” algorithm. IEEE J. Sel. Areas Commun. 16, 140–152 (1998).

[29] Kschischang, F. R. & Frey, B. J. Iterative decoding of compound codes by probability propagation in graphical models. IEEE J. Sel. Areas Commun. 16, 219–230 (1998).

[30] Hillar, C., Mehta, R. & Koepsell, K. A Hopfield recurrent neural network trained on natural images performs state-of-the-art image compression. In IEEE Image Proc., 4092–4096 (IEEE, 2014).

[31] Barlow, H. B. Possible principles underlying the transformations of sensory messages. In Sensory Communication (MIT Press, 1961).

[32] Olshausen, B. A. & Field, D. J. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609 (1996).

[33] Vinje, W. E. & Gallant, J. L. Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287, 1273–1276 (2000).

[34] Ziv, Y. et al. Long-term dynamics of CA1 hippocampal place codes.
Nat. Neurosci. 16, 264–266 (2013).

[35] Buzsáki, G. & Mizuseki, K. The log-dynamic brain: how skewed distributions affect network operations. Nat. Rev. Neuro. 15, 264–278 (2014).

[36] Rich, P. D., Liaw, H.-P. & Lee, A. K. Large environments reveal the statistical structure governing hippocampal representations. Science 345, 814–817 (2014).

[37] Binder, K. & Young, A. P. Spin glasses: Experimental facts, theoretical concepts, and open questions. Rev. Mod. Phys. 58, 801 (1986).

[38] Tsodyks, M. V. & Feigelman, M. V. The enhanced storage capacity in neural networks with low activity level. Europhys. Lett. 6, 101–105 (1988).

[39] Gardner, E. The space of interactions in neural network models. J. Phys. A 21, 257 (1988).

[40] Palm, G. & Sommer, F. T. Information capacity in recurrent McCulloch–Pitts networks with sparsely coded memory states. Network: Comput. Neural Syst. 3, 177–186 (1992).

[41] Candes, E. J., Romberg, J. K. & Tao, T. Stable signal recovery from incomplete and inaccurate measurements. Comm. Pure Appl. Math. 59, 1207–1223 (2006).

[42] Donoho, D. L. Compressed sensing. IEEE Trans. Inf. Theory 52, 1289–1306 (2006).

[43] Kuncheva, L. I. Combining pattern classifiers: methods and algorithms (John Wiley & Sons, 2004).

[44] Schapire, R. E. The strength of weak learnability. Mach. Learn. 5, 197–227 (1990).

[45] Larsen, K. G., Nelson, J., Nguyên, H. L. & Thorup, M. Heavy hitters via cluster-preserving clustering. In FOCS, 61–70 (IEEE, 2016).