{"title": "Hyper-Graph-Network Decoders for Block Codes", "book": "Advances in Neural Information Processing Systems", "page_first": 2329, "page_last": 2339, "abstract": "Neural decoders were shown to outperform classical message passing techniques for short BCH codes. In this work, we extend these results to much larger families of algebraic  block codes, by performing message passing with graph neural networks. The parameters of the sub-network at each variable-node in the Tanner graph are obtained from a hypernetwork that receives the absolute values of the current message as input. To add stability, we employ a simplified version of the arctanh activation that is based on a high order Taylor approximation of this activation function. Our results show that for a large number of algebraic block codes, from diverse families of codes (BCH, LDPC, Polar), the decoding obtained with our method outperforms the vanilla belief propagation method as well as other learning techniques from the literature.", "full_text": "Hyper-Graph-Network Decoders for Block Codes\n\nEliya Nachmani and Lior Wolf\n\nFacebook AI Research and Tel Aviv University\n\nAbstract\n\nNeural decoders were shown to outperform classical message passing techniques\nfor short BCH codes. In this work, we extend these results to much larger families of\nalgebraic block codes, by performing message passing with graph neural networks.\nThe parameters of the sub-network at each variable-node in the Tanner graph are\nobtained from a hypernetwork that receives the absolute values of the current\nmessage as input. To add stability, we employ a simpli\ufb01ed version of the arctanh\nactivation that is based on a high order Taylor approximation of this activation\nfunction. 
Our results show that for a large number of algebraic block codes, from\ndiverse families of codes (BCH, LDPC, Polar), the decoding obtained with our\nmethod outperforms the vanilla belief propagation method as well as other learning\ntechniques from the literature.\n\n1\n\nIntroduction\n\nDecoding algebraic block codes is an open problem and learning techniques have recently been\nintroduced to this \ufb01eld. While the \ufb01rst networks were fully connected (FC) networks, these were\nreplaced with recurrent neural networks (RNNs), which follow the steps of the belief propagation\n(BP) algorithm. These RNN solutions weight the messages that are being passed as part of the BP\nmethod with \ufb01xed learnable weights.\nIn this work, we add compute to the message passing iterations, by turning the message graph into\na graph neural network, in which one type of nodes, called variable nodes, processes the incoming\nmessages with a FC network g. Since the space of possible messages is large and its underlying\nstructure random, training such a network is challenging. Instead, we propose to make this network\nadaptive, by training a second network f to predict the weights \u03b8g of network g.\nThis \u201chypernetwork\u201d scheme, in which one network predicts the weights of another, allows us to\ncontrol the capacity, e.g., we can have a different network per node or per group of nodes. Since\nthe nodes in the decoding graph are naturally strati\ufb01ed and since a per-node capacity is too high for\nthis problem, the second option is selected. Unfortunately, training such a hypernetwork still fails to\nproduce the desired results, without applying two additional modi\ufb01cations. The \ufb01rst modi\ufb01cation\nis to apply an absolute value to the input of network f, thus allowing it to focus on the con\ufb01dence\nin each message rather than on the content of the messages. 
The second is to replace the arctanh\nactivation function that is employed by the check nodes with a high order Taylor approximation of\nthis function, which avoids its asymptotes.\nWhen applying learning solutions to algebraic block codes, the exponential size of the input space\ncan be mitigated by ensuring that certain symmetry conditions are met. In this case, it is sufficient\nto train the network on a noisy version of the zero codeword. As we show, the architecture of the\nhypernetwork we employ is selected such that these conditions are met.\nApplied to a wide variety of codes, our method outperforms the current learning-based solutions,\nas well as the classical BP method, both for a finite number of iterations and at convergence of the\nmessage passing iterations.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n2 Related Work\n\nOver the past few years, deep learning techniques were applied to error correcting codes. This\nincludes encoding, decoding, and even, as shown recently in [11], designing new feedback codes.\nThe new feedback codes, which were designed by an RNN, outperform the well-known state-of-the-art\ncodes (Turbo, LDPC, Polar) for a Gaussian noise channel with feedback.\nFully connected neural networks were used for decoding polar codes [7]. For short polar codes,\ne.g., n = 16 bits, the obtained results are close to the optimal performance obtained with maximum\na posteriori (MAP) decoding. Since the number of codewords is exponential in the number of\ninformation bits k, scaling the fully connected network to larger block codes is infeasible.\nSeveral methods were introduced for decoding larger block codes (n \u2265 100). For example, in [17]\nthe belief propagation (BP) decoding method is unfolded into a neural network in which weights\nare assigned to each variable edge. 
The same neural decoding technique was then extended to the\nmin-sum algorithm, which is more hardware-friendly [16]. In both cases, an improvement is shown\nin comparison to the baseline BP method.\nAnother approach was presented for decoding Polar codes [5]. The polar encoding graph is partitioned\ninto sub-blocks, and the decoding is performed on each sub-block separately. In [12] an RNN decoding\nscheme is introduced for convolutional and Turbo codes, and shown to achieve close to the optimal\nperformance, similar to the classical convolutional code decoders, Viterbi and BCJR.\nOur work decodes block codes, such as LDPC, BCH, and Polar. The most relevant comparison is\nwith [18], which improves upon [17]. A similar method was applied to Polar codes in [21], and another\nrelated work on Polar codes [5] introduced a non-iterative and parallel decoder. Another contribution\nlearns the node activations based on components from existing decoders (BF, GallagerB, MSA,\nSPA) [22]. In contrast, our method learns the node activations from scratch.\nThe term hypernetworks refers to a framework in which a network f is trained to predict\nthe weights \u03b8g of another network g. Earlier work in the field [14, 20] learned the weights of specific\nlayers in the context of tasks that required a dynamic behavior. Fuller networks were trained to\npredict video frames and stereo views [10]. The term itself was coined in [8], which employed such\nmeta-functions in the context of sequence modeling. A Bayesian formulation was introduced in a\nsubsequent work [15]. The application of hypernetworks as meta-learners in the context of few-shot\nlearning was introduced in [2].\nIn [4], hypernetworks were applied to search over the architecture space: candidate architectures are\nevaluated with predicted weights, rather than by performing gradient descent with each architecture. 
Recently, graph hypernetworks were introduced for searching\nover possible architectures [23]. Given an architecture, a graph hypernetwork that is conditioned\non the graph of the architecture and shares its structure, generates the weights of the network with\nthe given architecture. In our work, a non-graph network generates the weights of a graph network.\nTo distinguish between the two approaches, we call our method hyper-graph-network and not graph\nhypernetwork.\n\n3 Background\n\nWe consider codes with a block size of n bits. A code is defined by a binary generator matrix G of size\nk \u00d7 n and a binary parity check matrix H of size (n \u2212 k) \u00d7 n.\nThe parity check matrix defines a Tanner graph, which has n variable nodes and (n \u2212 k) check nodes,\nsee Fig. 1(a). The edges of the graph correspond to the on-bits in each column of the matrix H. For\nnotational convenience, we assume that the degree of each variable node in the Tanner graph, i.e., the\nsum of each column of H, has a fixed value dv.\nThe Tanner graph is unrolled into a Trellis graph. This graph starts with n variable nodes and is\nthen composed of two types of columns, variable columns and check columns. Variable columns\nconsist of variable processing units and check columns consist of check processing units. dv variable\nprocessing units are associated with each received bit, and the number of processing units in the\nvariable column is, therefore, E = dvn. The check processing units are also directly linked to the\nedges of the Tanner graph, where each parity check corresponds to a row of H. Therefore, the check\n\nFigure 1: (a) The Tanner graph for a linear block code with n = 5, k = 2 and dv = 2. (b) The\ncorresponding Trellis graph, with two iterations.\n\ncolumns also have E processing units each. The Trellis graph ends with an output layer of n variable\nnodes. See Fig. 1(b).\nMessage passing algorithms operate on the Trellis graph. 
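As an illustration of the Tanner-graph construction just described, the edge sets N(v) and N(c) can be read directly off H. The following sketch is ours, not the authors' code; the toy parity check matrix is a hypothetical example with n = 5, k = 2 and dv = 2, matching the dimensions of Fig. 1(a) but not necessarily its exact graph:

```python
import numpy as np

# Hypothetical (n - k) x n parity check matrix with n = 5, k = 2,
# chosen so that every variable node (column) has degree dv = 2.
H = np.array([[1, 1, 1, 0, 1],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 1, 1]])

n_minus_k, n = H.shape          # (n - k) check nodes, n variable nodes
edges = [(c, v) for c in range(n_minus_k) for v in range(n) if H[c, v] == 1]

# N(v): all edges in which variable v participates (a column of H);
# N(c): all edges in which check c participates (a row of H).
N_v = {v: [e for e in edges if e[1] == v] for v in range(n)}
N_c = {c: [e for e in edges if e[0] == c] for c in range(n_minus_k)}

dv = len(N_v[0])                # fixed variable degree assumed in the text
E = dv * n                      # processing units per Trellis column
assert E == len(edges)
```

Each Trellis column then holds one processing unit per edge (c, v), which is why both variable and check columns contain E units.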
The messages propagate from variable\ncolumns to check columns and from check columns to variable columns, in an iterative manner. The\nleftmost layer corresponds to a vector of log likelihood ratios (LLR) l \u2208 R^n of the input bits:\n\nl_v = log [ Pr(c_v = 1|y_v) / Pr(c_v = 0|y_v) ],\n\nwhere v \u2208 [n] is an index and y_v is the channel output for the corresponding bit c_v, which we wish to\nrecover.\nLet x^j be the vector of messages that a column in the Trellis graph propagates to the next column. At\nthe first round of message passing j = 1, and similarly to other cases where j is odd, a variable node\ntype of computation is performed, in which the messages are added:\n\nx^j_e = x^j_{(c,v)} = l_v + \u2211_{e'\u2208N(v)\\{(c,v)}} x^{j\u22121}_{e'},    (1)\n\nwhere each variable processing unit is indexed by an edge e = (c, v) of the Tanner graph and N(v) =\n{(c, v)|H(c, v) = 1}, i.e., the set of all edges in which v participates. By definition x^0 = 0,\nand when j = 1 the messages are directly determined by the vector l.\nFor even j, the check layer performs the following computation:\n\nx^j_e = x^j_{(c,v)} = 2 arctanh( \u220f_{e'\u2208N(c)\\{(c,v)}} tanh( x^{j\u22121}_{e'} / 2 ) ),    (2)\n\nwhere N(c) = {(c, v)|H(c, v) = 1} is the set of edges in the Tanner graph in which row c of the\nparity check matrix H participates.\nA slightly different formulation is provided by [18]. In this formulation, the tanh activation is moved\nto the variable node processing units. In addition, a set of learned weights w_e is added. 
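For concreteness, one round of the message passing of Eqs. 1 and 2 can be sketched as follows (our illustration, not the authors' code; the dictionary-based edge indexing is an assumption, and the product is clipped before arctanh purely to keep the toy example finite):

```python
import numpy as np

def bp_round(l, x_prev, edges):
    """One variable update (Eq. 1) followed by one check update (Eq. 2).

    l      : length-n vector of input LLRs.
    x_prev : dict mapping an edge (c, v) to its previous message x^{j-1}_e.
    edges  : list of Tanner-graph edges (c, v).
    """
    # Variable step (odd j): x^j_(c,v) = l_v plus the incoming messages
    # on all other edges of variable v.
    x_var = {}
    for (c, v) in edges:
        x_var[(c, v)] = l[v] + sum(x_prev[(c2, v2)] for (c2, v2) in edges
                                   if v2 == v and (c2, v2) != (c, v))
    # Check step (even j): x^j_(c,v) = 2 arctanh( prod tanh(x / 2) )
    # over all other edges of check c.
    x_chk = {}
    for (c, v) in edges:
        prod = 1.0
        for (c2, v2) in edges:
            if c2 == c and (c2, v2) != (c, v):
                prod *= np.tanh(x_var[(c2, v2)] / 2.0)
        x_chk[(c, v)] = 2.0 * np.arctanh(np.clip(prod, -0.999999, 0.999999))
    return x_var, x_chk
```

With `x_prev` all zero (the case j = 1), the variable step reduces to `x_var[(c, v)] = l[v]`, matching the remark that the first messages are directly determined by l.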
Note that\nthe learned weights are shared across all iterations j of the Trellis graph.\n\nx^j_e = x^j_{(c,v)} = tanh( (1/2) ( l_v + \u2211_{e'\u2208N(v)\\{(c,v)}} w_{e'} x^{j\u22121}_{e'} ) ),  if j is odd    (3)\n\nx^j_e = x^j_{(c,v)} = 2 arctanh( \u220f_{e'\u2208N(c)\\{(c,v)}} x^{j\u22121}_{e'} ),  if j is even    (4)\n\nAs mentioned, the computation graph alternates between variable columns and check columns, with\nL layers of each type. The final layer marginalizes the messages from the last check layer with the\nlogistic (sigmoid) activation function \u03c3, and outputs n bits. The vth output bit at layer 2L + 1, in the\nweighted version, is given by:\n\no_v = \u03c3( l_v + \u2211_{e'\u2208N(v)} \u00afw_{e'} x^{2L}_{e'} ),    (5)\n\nwhere \u00afw_{e'} is a second set of learnable weights.\n\n4 Method\n\nWe suggest further adding learned components into the message passing algorithm. Specifically, we\nreplace Eq. 3 (odd j) with the following equation:\n\nx^j_e = x^j_{(c,v)} = g( l_v, x^{j\u22121}_{N(v,c)}, \u03b8^j_g ),    (6)\n\nwhere x^{j\u22121}_{N(v,c)} is a vector of length dv \u2212 1 that contains the elements of x^{j\u22121} that correspond to the\nindices N(v) \\ {(c, v)}, and \u03b8^j_g holds the weights of network g at iteration j.\nIn order to make g adaptive to the current input messages at every variable node, we employ a\nhypernetwork scheme and use a network f to determine its weights:\n\n\u03b8^j_g = f( |x^{j\u22121}|, \u03b8_f ),    (7)\n\nwhere \u03b8_f are the learned weights of network f. Note that g is shared by all variable nodes in the same\ncolumn. We have also experimented with different weights per variable node (further conditioning g on\nthe specific messages x^{j\u22121}_{N(v,c)} for the variable with index e = (c, v)). 
However, the added capacity\nseems detrimental.\nThe adaptive nature of the hypernetwork allows the variable computation to adapt, for example, by\nneglecting part of the inputs of g in case the input message l contains errors.\nNote that the messages x^{j\u22121} are passed to f in absolute value (Eq. 7). The absolute value of a\nmessage is sometimes seen as a measure of its correctness, and the sign of the message as the value\n(zero or one) of the corresponding bit [19]. Since we want the network f to focus on the correctness\nof the message and not the information bits, we remove the signs.\nThe architecture of both f and g does not contain bias terms and employs tanh activations. The\nnetwork g has p layers, i.e., \u03b8_g = (W_1, ..., W_p), for some weight matrices W_i. The network f ends\nwith p linear projections, each corresponding to one of the layers of network g. As noted above, if a\nset of symmetry conditions is met, then it is sufficient to learn to correct the zero codeword. The link\nbetween the architectural choices of the networks and the symmetry conditions is studied in Sec. 5.\nA second modification is applied to the check columns of the Trellis graph. For\neven values of j, we employ the following computation, instead of Eq. 4:\n\nx^j_e = x^j_{(c,v)} = 2 \u2211_{m=0}^{q} (1/(2m+1)) ( \u220f_{e'\u2208N(c)\\{(c,v)}} x^{j\u22121}_{e'} )^{2m+1},    (8)\n\nin which arctanh is replaced with its Taylor approximation of degree q. The approximation is\nemployed as a way to stabilize the training process. The arctanh activation has asymptotes at\nx = \u00b11, and training with it often explodes. Its Taylor approximation is a well-behaved polynomial,\nsee Figure 2.\n\n4.1 Training\n\nIn addition to observing the final output of the network, as given in Eq. 5, we consider the following\nmarginalization for each iteration where j is odd: o^j_v = \u03c3( l_v + \u2211_{e'\u2208N(v)} \u00afw_{e'} x^j_{e'} ). 
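The truncated arctanh of Eq. 8 and the bias-free tanh network g of Eqs. 6-7 can be sketched as follows (our illustration; the hypernetwork f is stubbed out, and the weight shapes are hypothetical):

```python
import numpy as np

def taylor_arctanh(x, q=1005):
    """Taylor approximation of arctanh used inside Eq. 8:
    sum_{m=0}^{q} x^(2m+1) / (2m+1). Unlike arctanh, this polynomial has
    no asymptotes at x = +/-1, which stabilizes training."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    power = x.copy()
    for m in range(q + 1):
        total += power / (2 * m + 1)
        power = power * x * x      # advance to the next odd power of x
    return total

def g_forward(inputs, weights):
    """Bias-free tanh MLP g (Eq. 6). `weights` stands in for the list
    (W_1, ..., W_p) that the hypernetwork f would emit via Eq. 7; here the
    matrices are placeholders. With no biases and the odd tanh activation,
    g(-z) = -g(z), which is what the variable symmetry condition needs."""
    h = np.asarray(inputs, dtype=float)
    for W in weights:
        h = np.tanh(h @ W)
    return h
```

Because f sees only |x^{j-1}|, the emitted weights are identical for a message vector and its negation, so the sign symmetry of g carries over to the full variable computation.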
Similarly to [18],\nwe employ the cross entropy loss function, which considers the error after every check node iteration\nout of the L iterations:\n\nL = \u2212(1/n) \u2211_{h=0}^{L} \u2211_{v=1}^{n} [ c_v log(o^{2h+1}_v) + (1 \u2212 c_v) log(1 \u2212 o^{2h+1}_v) ],    (9)\n\nwhere c_v is the ground truth bit. This loss simplifies, when learning the zero codeword, to\n\u2212(1/n) \u2211_{h=0}^{L} \u2211_{v=1}^{n} log(1 \u2212 o^{2h+1}_v).\n\nFigure 2: Taylor approximation of the arctanh activation function.\n\nThe learning rate was 1e\u22124 for all types of codes, and the Adam optimizer [13] is used for training.\nThe decoding network has ten layers, which simulate L = 5 iterations of a modified BP algorithm.\n\n5 Symmetry conditions\n\nFor block codes that maintain certain symmetry conditions, the decoding error is independent of the\ntransmitted codeword [19, Lemma 4.92]. A direct implication is that we can train our network to\ndecode only the zero codeword. Otherwise, training would need to be performed for all 2^k words.\nNote that training with the zero codeword should give the same results as training with all 2^k words.\nThere are two symmetry conditions.\n1. For a check node with index (c, v) at iteration j and for any vector b \u2208 {0, 1}^{dv\u22121}:\n\n\u03a6( b \u2299 x^{j\u22121}_{N(v,c)} ) = ( \u220f_{k=1}^{K} b_k ) \u03a6( x^{j\u22121}_{N(v,c)} ),    (10)\n\nwhere \u2299 denotes the element-wise product, K = dv \u2212 1, x^{j\u22121}_{N(v,c)} is a vector of length dv \u2212 1 that\ncontains the elements of x^{j\u22121} that correspond to the indices N(c) \\ {(c, v)}, and \u03a6 is the activation\nfunction used, e.g., arctanh or the truncated version of it.\n\n2. 
For a variable node with index (c, v) at iteration j, which performs the computation \u03a8:\n\n\u03a8( \u2212l_v, \u2212x^{j\u22121}_{N(v,c)} ) = \u2212\u03a8( l_v, x^{j\u22121}_{N(v,c)} ).    (11)\n\nIn the proposed architecture, \u03a8 is a FC neural network (g) with tanh activations and no bias\nterms.\n\nOur method, by design, maintains the symmetry condition on both the variable and the check nodes.\nThis is verified in the following lemmas.\nLemma 1. Assuming that the check node calculation is given by Eq. (8), the proposed architecture\nsatisfies the first symmetry condition.\n\nProof. In our case, the activation function \u03a6 is the Taylor approximation of arctanh. Let the input\nmessage at iteration j be x^{j\u22121}_{N(v,c)} = ( x^{j\u22121}_1, ..., x^{j\u22121}_K ) for K = dv \u2212 1. We can verify that:\n\nx^j( b_1 x^{j\u22121}_1, ..., b_K x^{j\u22121}_K ) = 2 \u2211_{m=0}^{q} (1/(2m+1)) ( \u220f_{k=1}^{K} b_k x^{j\u22121}_k )^{2m+1} = 2 ( \u220f_{k=1}^{K} b_k ) \u2211_{m=0}^{q} (1/(2m+1)) ( \u220f_{k=1}^{K} x^{j\u22121}_k )^{2m+1} = ( \u220f_{k=1}^{K} b_k ) x^j( x^{j\u22121}_1, ..., x^{j\u22121}_K ),\n\nwhere the second equality holds since 2m + 1 is odd.\nLemma 2. Assuming that the variable node calculation is given by Eq. (6) and Eq. (7), and that g does not\ncontain bias terms and employs the tanh activation, the proposed architecture satisfies the\nvariable symmetry condition.\n\nProof. Let K = dv \u2212 1 and x^j_{N(v,c)} = ( x^j_1, ..., x^j_K ). In the proposed architecture, for any odd\nj \u2265 0, \u03a8 is given as\n\n\u03a8 = g( l_v, x^{j\u22121}_1, ..., x^{j\u22121}_K, \u03b8^j_g ) = tanh( W\u22a4_p ( ... tanh( W\u22a4_2 tanh( W\u22a4_1 ( l_v, x^{j\u22121}_1, ..., x^{j\u22121}_K ) ) ) ) ),\n\nwhere p is the number of layers and the weights W_1, ..., W_p constitute \u03b8^j_g = f(|x^{j\u22121}|, \u03b8_f).\nSince tanh(x) is an odd function for any real valued input, if \u03b8^{lhs}_g = \u03b8^{rhs}_g for real valued weights \u03b8^{lhs}_g and \u03b8^{rhs}_g, then\ng( \u2212l_v, \u2212x^{j\u22121}_1, ..., \u2212x^{j\u22121}_K, \u03b8^{lhs}_g ) = \u2212g( l_v, x^{j\u22121}_1, ..., x^{j\u22121}_K, \u03b8^{rhs}_g ). In our case,\n\n\u03b8^{lhs}_g = f( |x^{j\u22121}|, \u03b8_f ) = f( |\u2212x^{j\u22121}|, \u03b8_f ) = \u03b8^{rhs}_g.    (12)\n\n6 Experiments\n\nIn order to evaluate our method, we train the proposed architecture with three classes of\nlinear block codes: Low Density Parity Check (LDPC) codes [6], Polar codes [1] and\nBose\u2013Chaudhuri\u2013Hocquenghem (BCH) codes [3]. All generator matrices and parity check matrices\nare taken from [9].\nTraining examples are generated as a zero codeword transmitted over an additive white Gaussian\nnoise channel. For validation, we use the generator matrix G, in order to simulate valid codewords. Each\ntraining batch contains examples with different Signal-To-Noise Ratio (SNR) values.\nThe hyperparameters for each family of codes are determined by practical considerations. For Polar\ncodes, which are denser than LDPC codes, we use a batch size of 90 examples. We train with SNR\nvalues of 1dB, 2dB, ..., 6dB, where from each SNR we present 15 examples per single batch. For\nBCH and LDPC codes, we train with SNR ranges of 1\u22128dB (120 samples per batch). In our results,\nwe report the test error up to an SNR of 6dB, since evaluating the statistics for higher SNRs in a\nreliable way requires the evaluation of a large number of test samples (recall that in training, we only\nneed to train on a noisy version of a single codeword). 
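The training-data generation described above can be sketched as follows (our illustration: the BPSK mapping 0 → +1, the noise variance σ² = 1/(2·r·Eb/N0) with code rate r = k/n, and the LLR sign convention are standard assumptions rather than details stated in the text):

```python
import numpy as np

def zero_codeword_llr_batch(n, k, snr_db_values, per_snr, rng):
    """Build one training batch: the all-zero codeword, BPSK modulated,
    sent over AWGN at several SNR (Eb/N0) values, returned as channel LLRs."""
    rate = k / n
    batches = []
    for snr_db in snr_db_values:
        ebno = 10.0 ** (snr_db / 10.0)
        sigma = np.sqrt(1.0 / (2.0 * rate * ebno))
        # all-zero codeword -> all-(+1) BPSK symbols, plus Gaussian noise
        y = 1.0 + sigma * rng.standard_normal((per_snr, n))
        # LLR of a binary-input AWGN channel; the sign convention here
        # (positive favors the transmitted zero bit) is an assumption
        llr = 2.0 * y / sigma ** 2
        batches.append(llr)
    return np.concatenate(batches, axis=0)

# e.g., a Polar-style batch: SNRs 1dB..6dB, 15 examples each -> 90 examples
batch = zero_codeword_llr_batch(128, 96, [1, 2, 3, 4, 5, 6], 15,
                                np.random.default_rng(0))
```

Validation, in contrast, would encode random information words with G, so that the reported BER reflects genuine codewords rather than the training shortcut.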
However, for BCH codes, which are the focus\nof the current literature, we extend the tests to 8dB in some cases.\nIn our experiments, the order of the Taylor series of arctanh is set to q = 1005. The network f has\nfour layers with 32 neurons at each layer. The network g has two layers with 16 neurons at each layer.\nFor BCH codes, we also tested a deeper configuration in which the network f has four layers with\n128 neurons at each layer.\nThe results are reported as bit error rates (BER) for different SNR values (dB). Fig. 3 shows the\nresults for sample codes, and Tab. 1 lists results for more codes. As can be seen in the figure, for the\nPolar(128,96) code with five iterations of BP we get an improvement of 0.48dB over [18]. For the LDPC\nMacKay(96,48) code, we get an improvement of 0.15dB. For the BCH(63,51) with the large f we get\n\nTable 1: A comparison of the negative natural logarithm of Bit Error Rate (BER) for three SNR\nvalues of our method with literature baselines. Higher is better.\n\nMethod | BP: 4, 5, 6 | [18]: 4, 5, 6 | Ours: 4, 5, 6 | Ours deeper f: 4, 5, 6\n\u2014 after five iterations \u2014\n\nPolar (63,32)\nPolar (64,48)\nPolar (128,64)\nPolar (128,86)\nPolar (128,96)\nLDPC (49,24)\nLDPC (121,60)\nLDPC (121,70)\nLDPC (121,80)\nMacKay (96,48)\nCCSDS (128,64)\nBCH (31,16)\nBCH (63,36)\nBCH (63,45)\nBCH (63,51)\n\nPolar (63,32)\nPolar (64,48)\nPolar (128,64)\nPolar (128,86)\nPolar (128,96)\nLDPC (49,24)\nMacKay (96,48)\nBCH (63,36)\nBCH (63,45)\nBCH (63,51)\n\n3.52 4.04 4.48\n4.15 4.68 5.31\n3.38 3.80 4.15\n3.80 4.19 4.62\n3.99 4.41 4.78\n5.30 7.28 9.88\n4.82 7.21 10.87\n5.88 8.76 13.04\n6.66 9.82 13.98\n6.84 9.40 12.57\n6.55 9.65 13.78\n4.63 5.88 7.60\n3.72 4.65 5.66\n4.08 4.96 6.07\n4.34 5.29 6.35\n\n4.26 5.38 6.50\n4.74 5.94 7.42\n4.10 5.11 6.15\n4.49 5.65 6.97\n4.61 5.79 7.08\n6.23 8.19 11.72\n8.15 11.29 14.29\n4.03 5.42 7.26\n4.36 5.55 7.26\n4.58 5.82 7.42\n\n4.25 5.49 7.02 \u2014 \u2014 \u2014\n4.91 6.48 8.41 
\u2014 \u2014 \u2014\n3.89 5.18 6.94 \u2014 \u2014 \u2014\n4.57 6.18 8.27 \u2014 \u2014 \u2014\n4.73 6.39 8.57 \u2014 \u2014 \u2014\n5.76 7.90 11.17 \u2014 \u2014 \u2014\n5.22 8.29 13.00 \u2014 \u2014 \u2014\n6.39 9.81 14.04 \u2014 \u2014 \u2014\n6.95 10.68 15.80 \u2014 \u2014 \u2014\n7.19 10.02 13.16 \u2014 \u2014 \u2014\n6.99 10.57 15.27 \u2014 \u2014 \u2014\n4.96 6.63 8.80\n5.05 6.64 8.80\n3.96 5.35 7.20\n4.00 5.42 7.34\n4.41 5.91 7.91\n4.48 6.07 8.45\n4.64 6.08 8.16\n4.67 6.19 8.22\n\n4.14 5.32 6.67\n4.77 6.12 7.84\n3.73 4.78 5.87\n4.37 5.71 7.19\n4.56 5.98 7.53\n5.49 7.44 10.47\n5.12 7.97 12.22\n6.27 9.44 13.47\n6.97 10.47 14.86\n7.04 9.67 12.75\n6.82 10.15 13.96\n4.74 6.25 8.00\n3.94 5.27 6.97\n4.37 5.78 7.67\n4.54 5.98 7.73\n\u2014 at convergence \u2014\n4.59 6.10 7.69 \u2014 \u2014 \u2014\n4.22 5.59 7.30\n4.92 6.44 8.39 \u2014 \u2014 \u2014\n4.70 5.93 7.55\n4.52 6.12 8.25 \u2014 \u2014 \u2014\n4.19 5.79 7.88\n4.95 6.84 9.28 \u2014 \u2014 \u2014\n4.58 6.31 8.65\n4.94 6.76 9.09 \u2014 \u2014 \u2014\n4.63 6.31 8.54\n6.23 8.54 11.95 \u2014 \u2014 \u2014\n6.05 8.34 11.80\n8.66 11.52 14.32\n8.90 11.97 14.94 \u2014 \u2014 \u2014\n4.15 5.73 7.88 \u2014 \u2014 \u2014 4.29 5.91 8.01\n4.49 6.01 8.20 \u2014 \u2014 \u2014 4.64 6.27 8.51\n4.64 6.21 8.21 \u2014 \u2014 \u2014 4.80 6.44 8.58\n\nan improvement of 0.45dB, and with the small f we get a similar improvement of 0.43dB. Furthermore,\nfor every number of iterations, our method obtains better results than [18]. We can also observe\nthat our method with 5 iterations achieves the same results as [18] with 50 iterations, for the BCH(63,51)\nand Polar(128,96) codes. Similar improvements were also observed for other BCH and Polar codes.\nFig. 3(e) provides experiments for large and non-regular LDPC codes, WRAN(384,256) and TU-KL(96,48).\nAs can be seen, our method improves the results, even in non-regular codes where the\ndegree varies. 
Note that we learned just one network g, which corresponds to the maximal\ndegree, and we discard the irrelevant outputs for nodes with lower degrees. In Tab. 1 we present the\nnegative natural logarithm of the BER. For the 15 block codes tested, our method obtains better results\nthan the BP and [18] algorithms. These results remain true at the convergence point of the algorithms,\ni.e., when we run the algorithms for 50 iterations.\nTo evaluate the contribution of the various components of our method, we ran an ablation analysis.\nWe compare (i) our complete method, (ii) a method in which the parameters of g are fixed and g\nreceives an additional input of |x^{j\u22121}|, (iii) a similar method where the number of hidden units in g\nis increased to match the combined number of parameters of f and g, (iv) a method in which f\nreceives x^{j\u22121} instead of its absolute value, (v) a variant of our method in which arctanh\nreplaces its Taylor approximation, and (vi) a similar method to the previous one, in which gradient\nclipping is used to prevent explosion. The results, reported in Tab. 2, demonstrate the advantage of\nour complete method. We can observe that without the hypernetwork and without the absolute value in\nEq. 7, the results degrade below those of [18]. We can also observe that variants (ii), (iii) and (iv)\nreach the same low quality of performance. For (v) and (vi), the training process explodes\nand the performance is equal to a random guess. In (vi), we train our method while clipping the\narctanh at multiple threshold values (TH = 0.5, 1, 2, 4, 5, applied to both the positive and negative\nsides, for multiple block codes: BCH(31,16), BCH(63,45), BCH(63,51), LDPC (49,24), LDPC (121,80),\nPOLAR(64,32), POLAR(128,96), with L = 5 iterations). In all cases, the training exploded, similar to the\nno-threshold vanilla arctanh (v). In order to understand this, we observe the values when arctanh is\napplied at initialization, for our method and for [17, 18]. In [17, 18], which are initialized to mimic the\nvanilla BP, the activations are such that the maximal arctanh value at initialization is 3.45. However,\nin our case, in many of the units, the value explodes to infinity. Clipping does not help, since for any\nthreshold value, the number of units that are above the threshold (and receive no gradient) is large.\nSince we employ hypernetworks, the weights \u03b8^j_g of the network g are dynamically determined by the\nnetwork f and vary between samples, making it challenging to control the activations g produces.\nThis highlights the critical importance of the Taylor approximation for the usage of hypernetworks in\nour setting. The table also shows that in most cases, the method of [18] slightly benefits from the\nusage of the approximated arctanh.\n\nTable 2: Ablation analysis. The negative natural logarithm of BER results of our complete method\nare compared with alternative methods. Higher is better.\n\nVariant/SNR | BCH (31,16): 4, 6 | BCH (63,45): 4, 6 | BCH (63,51): 4, 6\n(i) Complete method | 4.96, 8.80 | 4.41, 7.91 | 4.67, 8.22\n(ii) No hypernetwork | 2.94, 3.85 | 3.54, 4.76 | 3.83, 5.18\n(iii) No hypernetwork, higher capacity | 2.94, 3.85 | 3.54, 4.76 | 3.83, 5.18\n(iv) No abs in Eq. 7 | 2.86, 3.99 | 3.55, 4.77 | 3.84, 5.20\n(v) Not truncating arctanh | 0.69, 0.69 | 0.69, 0.69 | 0.69, 0.69\n(vi) Gradient clipping | 0.69, 0.69 | 0.69, 0.69 | 0.69, 0.69\n[18] | 4.74, 8.00 | 3.97, 7.10 | 4.54, 7.73\n[18] with truncated arctanh | 4.78, 8.24 | 4.34, 7.34 | 4.53, 7.84\n\n7 Conclusions\n\nWe present graph networks in which the weights are a function of the node's input, and demonstrate\nthat this architecture provides the adaptive computation that is required in the case of decoding block\ncodes. 
Training networks in this domain can be challenging and we present a method to avoid gradient\nexplosion that seems more effective, in this case, than gradient clipping. By carefully designing our\nnetworks, important symmetry conditions are met and we can train efficiently. Our results go far\nbeyond the current literature on learning block codes and we present results for a large number of\ncodes from multiple code families.\n\nAcknowledgments\n\nWe thank Sebastian Cammerer and Chieh-Fang Teng for the helpful discussion and for providing code\nfor the deep polar decoder. The contribution of Eliya Nachmani is part of Ph.D. thesis research conducted\nat Tel Aviv University.\n\nFigure 3: BER for various values of SNR for various codes. (a) Polar (128,96), (b) LDPC\nMacKay(96,48), (c) BCH (63,51), (d) BCH(63,51) with a deeper network f, (e) Large and non-regular\nLDPC codes: WRAN(384,256) and TU-KL(96,48).\n\nReferences\n[1] Erdal Arikan. Channel polarization: A method for constructing capacity-achieving codes. In\n2008 IEEE International Symposium on Information Theory, pages 1173\u20131177. IEEE, 2008.\n\n[2] Luca Bertinetto, Jo\u00e3o F Henriques, Jack Valmadre, Philip Torr, and Andrea Vedaldi. Learning\nfeed-forward one-shot learners. In Advances in Neural Information Processing Systems, pages\n523\u2013531, 2016.\n\n[3] Raj Chandra Bose and Dwijendra K Ray-Chaudhuri. On a class of error correcting binary group\ncodes. Information and Control, 3(1):68\u201379, 1960.\n\n[4] Andrew Brock, Theo Lim, J.M. Ritchie, and Nick Weston. SMASH: One-shot model architecture\nsearch through hypernetworks. In International Conference on Learning Representations,\n2018.\n\n[5] Sebastian Cammerer, Tobias Gruber, Jakob Hoydis, and Stephan ten Brink. Scaling deep\nlearning-based decoding of polar codes via partitioning. In GLOBECOM 2017-2017 IEEE\nGlobal Communications Conference, pages 1\u20136. 
IEEE, 2017.\n\n[6] Robert Gallager. Low-density parity-check codes. IRE Transactions on information theory,\n\n8(1):21\u201328, 1962.\n\n[7] Tobias Gruber, Sebastian Cammerer, Jakob Hoydis, and Stephan ten Brink. On deep learning-\nbased channel decoding. In 2017 51st Annual Conference on Information Sciences and Systems\n(CISS), pages 1\u20136. IEEE, 2017.\n\n[8] David Ha, Andrew Dai, and Quoc V Le. Hypernetworks. arXiv preprint arXiv:1609.09106,\n\n2016.\n\n[9] Michael Helmling, Stefan Scholl, Florian Gensheimer, Tobias Dietz, Kira Kraft, Stefan Ruzika,\nand Norbert Wehn. Database of Channel Codes and ML Simulation Results. www.uni-kl.de/\nchannel-codes, 2019.\n\n[10] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc V Gool. Dynamic \ufb01lter networks. In\n\nAdvances in Neural Information Processing Systems, pages 667\u2013675, 2016.\n\n[11] Hyeji Kim, Yihan Jiang, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Deepcode:\nFeedback codes via deep learning. In Advances in Neural Information Processing Systems\n(NIPS), pages 9436\u20139446, 2018.\n\n[12] Hyeji Kim, Yihan Jiang, Ranvir Rana, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath.\nCommunication algorithms via deep learning. In Sixth International Conference on Learning\nRepresentations (ICLR), 2018.\n\n[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[14] Benjamin Klein, Lior Wolf, and Yehuda Afek. A dynamic convolutional layer for short range\nweather prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, pages 4840\u20134848, 2015.\n\n[15] David Krueger, Chin-Wei Huang, Riashat Islam, Ryan Turner, Alexandre Lacoste, and Aaron\n\nCourville. Bayesian hypernetworks. arXiv preprint arXiv:1710.04759, 2017.\n\n[16] Loren Lugosch and Warren J Gross. Neural offset min-sum decoding. In 2017 IEEE Interna-\n\ntional Symposium on Information Theory (ISIT), pages 1361\u20131365. 
IEEE, 2017.\n\n[17] Eliya Nachmani, Yair Be\u2019ery, and David Burshtein. Learning to decode linear codes using\ndeep learning. In 2016 54th Annual Allerton Conference on Communication, Control, and\nComputing (Allerton), pages 341\u2013346. IEEE, 2016.\n\n[18] Eliya Nachmani, Elad Marciano, Loren Lugosch, Warren J Gross, David Burshtein, and Yair\nBe\u2019ery. Deep learning methods for improved decoding of linear codes. IEEE Journal of Selected\nTopics in Signal Processing, 12(1):119\u2013131, 2018.\n\n[19] Tom Richardson and Ruediger Urbanke. Modern Coding Theory. Cambridge University Press,\n2008.\n\n[20] G. Riegler, S. Schulter, M. R\u00fcther, and H. Bischof. Conditioned regression models for non-blind\nsingle image super-resolution. In 2015 IEEE International Conference on Computer Vision\n(ICCV), pages 522\u2013530, Dec 2015.\n\n[21] Chieh-Fang Teng, Chen-Hsi Derek Wu, Andrew Kuan-Shiuan Ho, and An-Yeu Andy Wu.\nLow-complexity recurrent neural network-based polar decoder with weight quantization mechanism.\nIn ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal\nProcessing (ICASSP), pages 1413\u20131417. IEEE, 2019.\n\n[22] Bane Vasi\u0107, Xin Xiao, and Shu Lin. Learning to decode LDPC codes with finite-alphabet message\npassing. In 2018 Information Theory and Applications Workshop (ITA), pages 1\u20139. IEEE, 2018.\n\n[23] Chris Zhang, Mengye Ren, and Raquel Urtasun. Graph hypernetworks for neural architecture\nsearch. In International Conference on Learning Representations, 2019.\n", "award": [], "sourceid": 1362, "authors": [{"given_name": "Eliya", "family_name": "Nachmani", "institution": "Tel Aviv University and Facebook AI Research"}, {"given_name": "Lior", "family_name": "Wolf", "institution": "Facebook AI Research"}]}