{"title": "Revisit Fuzzy Neural Network: Demystifying Batch Normalization and ReLU with Generalized Hamming Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1923, "page_last": 1932, "abstract": "We revisit fuzzy neural network with a cornerstone notion of generalized hamming distance, which provides a novel and theoretically justified framework to re-interpret many useful neural network techniques in terms of fuzzy logic. In particular, we conjecture and empirically illustrate that, the celebrated batch normalization (BN) technique actually adapts the \u201cnormalized\u201d bias such that it approximates the rightful bias induced by the generalized hamming distance. Once the due bias is enforced analytically, neither the optimization of bias terms nor the sophisticated batch normalization is needed. Also in the light of generalized hamming distance, the popular rectified linear units (ReLU) can be treated as setting a minimal hamming distance threshold between network inputs and weights. This thresholding scheme, on the one hand, can be improved by introducing double-thresholding on both positive and negative extremes of neuron outputs. On the other hand, ReLUs turn out to be non-essential and can be removed from networks trained for simple tasks like MNIST classification. 
The proposed generalized hamming network (GHN) as such not only lends itself to rigorous analysis and interpretation within the fuzzy logic theory but also demonstrates fast learning speed, well-controlled behaviour and state-of-the-art performances on a variety of learning tasks.", "full_text": "Revisit Fuzzy Neural Network:\n\nDemystifying Batch Normalization and ReLU with\n\nGeneralized Hamming Network\n\nLixin Fan\n\nlixin.fan@nokia.com\n\nNokia Technologies\nTampere, Finland\n\nAbstract\n\nWe revisit fuzzy neural network with a cornerstone notion of generalized ham-\nming distance, which provides a novel and theoretically justi\ufb01ed framework to\nre-interpret many useful neural network techniques in terms of fuzzy logic. In par-\nticular, we conjecture and empirically illustrate that, the celebrated batch normaliza-\ntion (BN) technique actually adapts the \u201cnormalized\u201d bias such that it approximates\nthe rightful bias induced by the generalized hamming distance. Once the due bias\nis enforced analytically, neither the optimization of bias terms nor the sophisticated\nbatch normalization is needed. Also in the light of generalized hamming distance,\nthe popular recti\ufb01ed linear units (ReLU) can be treated as setting a minimal ham-\nming distance threshold between network inputs and weights. This thresholding\nscheme, on the one hand, can be improved by introducing double-thresholding on\nboth positive and negative extremes of neuron outputs. On the other hand, ReLUs\nturn out to be non-essential and can be removed from networks trained for simple\ntasks like MNIST classi\ufb01cation. 
The proposed generalized hamming network (GHN) as such not only lends itself to rigorous analysis and interpretation within the fuzzy logic theory but also demonstrates fast learning speed, well-controlled behaviour and state-of-the-art performances on a variety of learning tasks.

1 Introduction

Since the early 1990s, the integration of fuzzy logic and computational neural networks has given birth to fuzzy neural networks (FNN) [1]. While the formal fuzzy set theory provides a strict mathematical framework in which vague conceptual phenomena can be precisely and rigorously studied [2, 3, 4, 5], application-oriented fuzzy technologies lag far behind these theoretical studies. In particular, fuzzy neural networks have only demonstrated limited success on toy examples such as [6, 7]. In order to catch up with the rapid advances in recent neural network developments, especially those with deep layered structures, it is the goal of this paper to demonstrate the relevance of FNN and, moreover, to provide a novel view on its non-fuzzy counterparts.
Our revisiting of FNN is not merely for the fond remembrance of the golden age of "soft computing" [8]. Instead it provides a novel and theoretically justified perspective on neural computing, in which we are able to re-examine and demystify some useful techniques that were proposed to improve either the effectiveness or the efficiency of neural network training. Among many others, batch normalization (BN) [9] is probably the most influential yet mysterious trick: it significantly improved training efficiency by adapting to the change in the distribution of layers' inputs (coined internal covariate shift). 
Such adaptations, when viewed within the fuzzy neural network framework, can be interpreted as rectifications of the deficiencies of neuron outputs with respect to the rightful generalized hamming distance (see definition 1) between inputs and neuron weights.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Once the appropriate rectification is applied, the ill effects of internal covariate shift are automatically eradicated, and consequently, one is able to enjoy a fast training process without resorting to the sophisticated learning method used by BN.
Another crucial component in neural computing, the rectified linear unit (ReLU), has been widely used due to its strong biological motivations and mathematical justifications [10, 11, 12]. We show that, within the generalized hamming group endowed with the generalized hamming distance, ReLU can be regarded as setting a minimal hamming distance threshold between network inputs and neuron weights. This novel view immediately leads us to an effective double-thresholding scheme that suppresses fuzzy elements of the generalized hamming group.
The proposed generalized hamming network (GHN) is founded on the cornerstone notion of generalized hamming distance (GHD), which is essentially defined as h(x, w) := x + w − 2xw for any x, w ∈ R (see definition 1). Its connection with the inferencing rule in neural computing is obvious: the last term (−2xw) corresponds to element-wise multiplications of neuron inputs and weights, and since we aim to measure the GHD between inputs x and weights w, the bias term then should take the value x + w. In this article we define any network whose neuron outputs fulfil this requirement (3) as a generalized hamming network. 
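This bias prescription can be checked numerically. Below is a minimal Python sketch (the function names and sample values are ours, purely illustrative, not from any released code): with the bias fixed analytically to b = −(Σ_l w_l + Σ_l x_l)/2, the familiar affine neuron output w · x + b equals −L/2 times the mean generalized hamming distance between x and w, so no bias term needs to be learned.

```python
def ghd(a, b):
    """Generalized hamming distance h(a, b) = a + b - 2ab (definition 1)."""
    return a + b - 2 * a * b

def mean_ghd(x, w):
    """G_L(x (+)_L w): arithmetic mean of the element-wise GHD."""
    return sum(ghd(a, b) for a, b in zip(x, w)) / len(x)

def neuron_output(x, w):
    """w . x + b with the 'rightful' analytic bias b = -(sum(w) + sum(x)) / 2."""
    b = -0.5 * (sum(w) + sum(x))
    return sum(xi * wi for xi, wi in zip(x, w)) + b

# Toy input and weight vectors (illustrative values only).
x = [0.2, -1.3, 0.7, 2.0]
w = [0.9, 0.4, -0.5, 1.1]
L = len(x)
# The affine form with the analytic bias equals -(L/2) * mean GHD exactly.
assert abs(neuron_output(x, w) - (-(L / 2) * mean_ghd(x, w))) < 1e-12
```

The identity is exact term by term: w · x − (Σw + Σx)/2 = −(1/2) Σ_l (x_l + w_l − 2 x_l w_l).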
Since the underlying GHD induces a fuzzy XOR logic, GHN lends itself to rigorous analysis within fuzzy logic theory (see definition 4). Apart from its theoretical appeal, GHN also demonstrates appealing features in terms of fast learning speed, well-controlled behaviour and simple parameter settings (see Section 4).

1.1 Related Work

Fuzzy logic and fuzzy neural network: the notion of fuzzy logic is based on the rejection of the fundamental principle of bivalence of classical logic, i.e. that any declarative sentence has only two possible truth values, true and false. Although the earliest connotation of fuzzy logic is attributed to Aristotle, the founder of classical logic [13], it was Zadeh's publication in 1965 that ignited the enthusiasm about the theory of fuzzy sets [2]. Since then mathematical developments have advanced to a very high standard and are still forthcoming today [3, 4, 5]. Fuzzy neural networks were proposed to take advantage of the flexible knowledge-acquiring capability of neural networks [1, 14]. In theory it was proved that fuzzy systems and certain classes of neural networks are equivalent and convertible to each other [15, 16]. In practice, however, successful applications of FNNs have been limited to toy examples [6, 7].
Demystifying neural networks: efforts to interpret neural networks by means of propositional logic date back to McCulloch & Pitts' seminal paper [17]. Recent research along this line includes [18] and the references therein, in which First Order Logic (FOL) rules are encoded using soft logic on continuous truth values from the interval [0, 1]. These interpretations, albeit interesting, seldom explain effective neural network techniques such as batch normalization or ReLU. 
Recently [19] provided an improvement (and explanation) of batch normalization through weight normalization, which removes the dependencies between the examples in a minibatch.
Binary-valued neural network: the Restricted Boltzmann Machine (RBM) was used to model an "ensemble of binary vectors" and rose to prominence in the mid-2000s after fast learning algorithms were demonstrated by Hinton et al. [20, 21]. Recent binarized neural networks [22, 23] approximate standard CNNs by binarizing filter weights and/or inputs, with the aim of reducing computational complexity and memory consumption. The XNOR operation employed in [23] is limited to binary hamming distance and not readily applicable to non-binary neuron weights and inputs.
Ensemble of binary patterns: the distributive property of GHD described in (1) provides an intriguing view on neural computing: even though real-valued patterns are involved in the computation, the computed GHD is strictly equivalent to the mean of binary hamming distances across two ensembles of binary patterns! This novel view illuminates the connection between generalized hamming networks and efficient binary features, which have long been used in various computer vision tasks, for instance, the celebrated Adaboost face detection [24], numerous binary features for key-point matching [25, 26] and binary codes for large database hashing [27, 28, 29, 30].

Figure 1: (a) h(a, b) has one fuzzy region near the identity element 0.5 (in white), two positively confident (in red) and two negatively confident (in blue) regions from above and below, respectively. (b) Fuzziness F(h(a, b)) = h(a, b) ⊕ h(a, b) has its maxima along a = 0.5 or b = 0.5. (c) μ(h(a, b)) : U → I where μ(h) = 1/(1 + exp(0.5 − h)) is the logistic function to assign membership to fuzzy set elements (see definition 4). (d) partial derivative of μ(h(a, b)). 
Note that the magnitudes of the gradient in the fuzzy region are non-negligible.

2 Generalized Hamming Distance

Definition 1. Let a, b, c ∈ U ⊆ R, and let a generalized hamming distance (GHD), denoted by ⊕, be a binary operator h : U × U → U; h(a, b) := a ⊕ b = a + b − 2 · a · b. Then
(i) for U = {0, 1} GHD de-generalizes to the binary hamming distance with

0 ⊕ 0 = 0; 0 ⊕ 1 = 1; 1 ⊕ 0 = 1; 1 ⊕ 1 = 0;

(ii) for U = [0.0, 1.0], the unitary interval I, a ⊕ b ∈ I (closure);

Remark: this case is referred to as the "restricted" hamming distance, in the sense that inverses of elements in I are not necessarily contained in I (see below for the definition of inverse).
(iii) for U = R, H := (R, ⊕) is a group satisfying the five abelian group axioms, and thus is referred to as the generalized hamming group or hamming group:
• a ⊕ b = (a + b − 2 · a · b) ∈ R (closure);
• a ⊕ b = (a + b − 2 · a · b) = b ⊕ a (commutativity);
• (a ⊕ b) ⊕ c = (a + b − 2 · a · b) + c − 2 · (a + b − 2 · a · b) · c = a + (b + c − 2 · b · c) − 2 · a · (b + c − 2 · b · c) = a ⊕ (b ⊕ c) (associativity);
• ∃ e = 0 ∈ R such that e ⊕ a = a ⊕ e = (0 + a − 2 · 0 · a) = a (identity element);
• for each a ∈ R \ {0.5}, ∃ a⁻¹ := a/(2a − 1) s.t. a ⊕ a⁻¹ = a + a/(2a − 1) − 2a · a/(2a − 1) = 0 = e; and we define ∞ := (0.5)⁻¹ (inverse element).

Remark: note that 1 ⊕ a = 1 − a, which complements a. 
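The group axioms above are easy to verify numerically. The following Python sketch (illustrative code of ours, not from the paper) checks the binary degeneration, associativity, the identity 0, the inverse a/(2a − 1), and the complement and fixed-point identities:

```python
def ghd(a, b):
    """a (+) b = a + b - 2ab (definition 1)."""
    return a + b - 2 * a * b

def inv(a):
    """Group inverse a^{-1} = a / (2a - 1), defined for a != 0.5."""
    return a / (2 * a - 1)

# (i) degenerates to the binary hamming distance (XOR) on {0, 1}
assert [ghd(0, 0), ghd(0, 1), ghd(1, 0), ghd(1, 1)] == [0, 1, 1, 0]

a, b, c = 1.7, -0.3, 2.4  # arbitrary reals
assert abs(ghd(ghd(a, b), c) - ghd(a, ghd(b, c))) < 1e-12  # associativity
assert ghd(a, 0) == a                                      # identity e = 0
assert abs(ghd(a, inv(a))) < 1e-12                         # a (+) a^{-1} = 0
assert abs(ghd(a, 1) - (1 - a)) < 1e-12                    # 1 (+) a = 1 - a
assert abs(ghd(a, 0.5) - 0.5) < 1e-12                      # 0.5 is a fixed point
```

Floating-point tolerances are used since ⊕ mixes additions and multiplications; algebraically every assertion holds exactly.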
"0.5" is a fixed point since ∀a ∈ R, 0.5 ⊕ a = 0.5, and 0.5 ⊕ ∞ = 0 according to definition 1.

(iv) GHD naturally leads to a measurement of fuzziness: F(a) := a ⊕ a, R → (−∞, 0.5], with F(a) ≥ 0 ∀a ∈ [0, 1] and F(a) < 0 otherwise. Therefore [0, 1] is referred to as the fuzzy region, in which F(0.5) = 0.5 attains the maximal fuzziness and F(0) = F(1) = 0 are two boundary points. The outer regions (−∞, 0] and [1, ∞) are the negative and positive confident regions, respectively. See Figure 1 (a) for the surface of h(a, b), which has one central fuzzy region, two positive confident and two negative confident regions.

(v) The direct sum of hamming groups is still a hamming group¹ H^L := ⊕_{l∈L} H_l: let x = {x_1, . . . , x_L}, y = {y_1, . . . , y_L} ∈ H^L be two group members; then the generalized hamming distance is defined as the arithmetic mean of the element-wise GHD: G_L(x ⊕_L y) := (1/L)(x_1 ⊕ y_1 + . . . + x_L ⊕ y_L).
And let x̃ = (x_1 + . . . + x_L)/L, ỹ = (y_1 + . . . + y_L)/L be the arithmetic means of the respective elements; then G_L(x ⊕_L y) = x̃ + ỹ − (2/L)(x · y), where x · y = Σ_{l=1}^{L} x_l · y_l is the dot product.

¹ By this extension, it is R̄ = R ∪ {−∞, +∞} instead of R on which we have all group members.

(vi) Distributive property: let X̄_M = (x^1 + . . . + x^M)/M ∈ H^L be the element-wise arithmetic mean of a set of members x^m ∈ H^L, and let Ȳ_N be defined in the same vein. Then GHD is distributive:

G_L(X̄_M ⊕_L Ȳ_N) = (1/L) Σ_{l=1}^{L} x̄_l ⊕ ȳ_l = (1/L) Σ_{l=1}^{L} (1/M) Σ_{m=1}^{M} (1/N) Σ_{n=1}^{N} x^m_l ⊕ y^n_l = (1/(MN)) Σ_{m=1}^{M} Σ_{n=1}^{N} G_L(x^m ⊕_L y^n).   (1)

Remark: in case x^m_l, y^n_l ∈ {0, 1}, i.e. for two sets of binary patterns, the mean of the binary hamming distances between the two sets can be efficiently computed as the GHD between the two real-valued patterns X̄_M, Ȳ_N. Conversely, a real-valued pattern can be viewed as the element-wise average of an ensemble of binary patterns.

3 Generalized Hamming Network

Despite recent progress in deep learning, artificial neural networks have long been criticized for their "black box" nature: "they capture hidden relations between inputs and outputs with a highly accurate approximation, but no definitive answer is offered for the question of how they work" [16]. In this section we provide an interpretation of neural computing by showing that, if the condition specified in (3) is fulfilled, the outputs of each neuron can be strictly defined as the generalized hamming distance between inputs and weights. Moreover, the computation of GHD induces a fuzzy implication of the XOR connective, and therefore the inferencing of the entire network can be regarded as a logical calculus in the same vein as described in McCulloch & Pitts' seminal paper [17].

3.1 New perspective on neural computing

The bearing of generalized hamming distance on neural computing is elucidated by looking at the negative of the generalized hamming distance (GHD, see definition 1) between inputs x ∈ H^L and weights w ∈ H^L, in which L denotes the length of the neuron weights, e.g. 
in convolution kernels:

−G_L(w ⊕_L x) = (2/L) w · x − (1/L) Σ_{l=1}^{L} w_l − (1/L) Σ_{l=1}^{L} x_l   (2)

Divide (2) by the constant 2/L and let

b = −(1/2) (Σ_{l=1}^{L} w_l + Σ_{l=1}^{L} x_l);   (3)

then it becomes the familiar form (w · x + b) of neuron outputs, save for the non-linear activation function. By enforcing the bias term to take the given value in (3), standard neuron outputs measure negatives of the GHD between inputs and weights. Note that, for each layer, the bias term Σ_{l=1}^{L} x_l is averaged over neighbouring neurons in each individual input image. The bias term Σ_{l=1}^{L} w_l is computed separately for each filter in fully connected or convolution layers. When weights are updated during the optimization, Σ_{l=1}^{L} w_l changes accordingly to keep up with the weights and maintain stable neuron outputs. We discuss below (re-)interpretations of neural computing in terms of GHD.
Fuzzy inference: As illustrated in definition 4, GHD induces a fuzzy XOR connective. Therefore the negative of GHD quantifies the degree of equivalence between inputs x and weights w (see definition 4 of fuzzy XOR), i.e. the fuzzy truth value of the statement "x ↔ w", where ↔ denotes a fuzzy equivalence relation. When multiple layers are stacked together, neighbouring neuron outputs from the previous layer are integrated to form composite statements, e.g. "(x^1_1 ↔ w^1_1, . . . , x^1_i ↔ w^1_i) ↔ w^2_j", where the superscripts correspond to the two layers. Thus stacked layers form more complex, and hopefully more powerful, statements as the layer depth increases.

Figure 2: Left to right: mean, max and min of neuron outputs, with/without batch normalization (BN, WO_BN) and with generalized hamming distance (XOR). 
Outputs are averaged over all 64 filters in the first convolution layer and plotted over 30 epochs of training of the MNIST network used in our experiment (see Section 4).

Batch normalization demystified: When a mini-batch of training samples X = {x^1, . . . , x^M} is involved in the computation, due to the distributive property of GHD, the data-dependent bias term Σ_{l=1}^{L} x_l equals the arithmetic mean of the corresponding bias terms computed for each sample in the mini-batch, i.e. (1/M) Σ_{m=1}^{M} Σ_{l=1}^{L} x^m_l. It is almost impossible to maintain a constant scalar b that fulfils this requirement when the mini-batch changes, especially at deep layers of the network whose inputs are influenced by the weights of incoming layers. The celebrated batch normalization (BN) technique therefore proposed a learning method to compensate for the input vector change, with additional parameters γ, β to be learnt during the training [9]. It is our conjecture that batch normalization approximates this rightful bias through optimization, and this connection is empirically revealed in Figure 2, with very similar neuron outputs obtained by BN and GHD. Indeed they are highly correlated during the course of training (Pearson correlation coefficient = 0.97), confirming our view that BN is attempting to influence the bias term according to (3).
Once b is enforced to follow (3), neither the optimization of bias terms nor the sophisticated learning method of BN is needed. In the following section we illustrate a rectified neural network designed as such.
Rectified linear units (ReLU) redesigned: Due to its strong biological motivations [10] and mathematical justifications [11], the rectified linear unit (ReLU) is the most popular activation function used for deep neural networks [31]. 
If neuron outputs are rectified as generalized hamming distances, the activation function max(0, 0.5 − h(x, w)) then simply sets a minimal hamming distance threshold of 0.5 (see Figure 1). Astute readers may immediately spot two limitations of this activation function: a) it only takes into account the negative confidence region while disregarding the positive confidence regions; b) it allows elements in the fuzzy regime near 0.5 to misguide the optimization with their non-negligible gradients.
A straightforward remedy to ReLU is to suppress elements within the fuzzy region by setting outputs between [0.5 − r, 0.5 + r] to 0.5, where r is a parameter to control the acceptable fuzziness in neuron outputs. In particular, we may set the thresholds adaptively, e.g. [0.5 − r · O, 0.5 + r · O], where O is the maximal magnitude of neuron outputs and the threshold ratio r is adjusted by the optimizer. This double-thresholding strategy effectively prevents noisy gradients of fuzzy elements, since 0.5 is a fixed point and x ⊕ 0.5 = 0.5 for any x. Empirically we found that this scheme, in tandem with the rectification (3), dramatically boosts training efficiency for challenging tasks such as CIFAR10/100 image classification. It must be noted, however, that the use of a non-linear activation as such is not essential for GHD-based neural computing. When the double-thresholding is switched off (by fixing r = 0), learning is prolonged for the challenging CIFAR10/100 image classification, but its influence on the simple MNIST classification is almost negligible (see Section 4 for experimental results).

3.2 Generalized hamming network with induced fuzzy XOR

Definition 2. 
A generalized hamming network (GHN) is any network consisting of neurons whose outputs h ∈ H^L are related to neuron inputs x ∈ H^L and weights w ∈ H^L by h = x ⊕_L w.

Remark: In case the bias term is computed directly from (3) such that h = x ⊕_L w is fulfilled strictly, the network is called a rectified GHN or simply a GHN. In other cases, where bias terms approximate the rightful offsets (e.g. by batch normalization [9]), the trained network is called an approximated GHN.

Compared with traditional neural networks, the optimization of bias terms is no longer needed in GHN. Empirically, it is shown that the proposed GHN benefits from a fast and robust learning process that is on par with that of the batch-normalization approach, yet without resorting to the sophisticated learning of additional parameters (see Section 4 for experimental results). On the other hand, GHN also benefits from the rapid developments of neural computing techniques, in particular those employing parallel computing on GPUs. Due to this efficient implementation of GHNs, it is the first time that fuzzy neural networks have demonstrated state-of-the-art performances on learning tasks with large-scale datasets.
Often neuron outputs are clamped by a logistic activation function to within the range [0, 1], so that outputs can be compared with the target labels in supervised learning. As shown below, GHD followed by such a non-linear activation actually induces a fuzzy XOR connective. 
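As a concrete illustration of Definition 2, here is a minimal Python sketch of a rectified-GHN dense layer (our own naming; a practical implementation would use GPU tensor operations): each output is the mean element-wise GHD between an input sample and a filter, computed in the affine form with the analytic offsets Σw per filter and the data-dependent Σx per sample, rather than with a learned bias.

```python
def ghn_dense(X, W):
    """Rectified GHN layer (definition 2): out[i][j] = G_L(x_i (+)_L w_j),
    the mean element-wise generalized hamming distance between sample x_i
    and filter w_j.  There is no learned bias: the offset is enforced
    analytically, with sum(w_j) per filter and sum(x_i) per sample."""
    L = len(W[0])
    out = []
    for x in X:
        sx = sum(x)  # data-dependent part of the rightful offset
        row = []
        for w in W:
            dot = sum(a * b for a, b in zip(x, w))
            row.append((sum(w) + sx) / L - 2.0 * dot / L)
        out.append(row)
    return out

# Cross-check against the direct definition h = x (+)_L w (toy values).
X = [[0.1, -0.4, 1.2], [0.8, 0.3, -2.0]]
W = [[0.5, 0.9, -0.1], [1.5, -0.7, 0.2]]
out = ghn_dense(X, W)
for i, x in enumerate(X):
    for j, w in enumerate(W):
        direct = sum(a + b - 2 * a * b for a, b in zip(x, w)) / len(x)
        assert abs(out[i][j] - direct) < 1e-12
```

The affine form is just equation (2) divided by −2/L, so the two computations agree up to floating-point rounding.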
We briefly review the basic notions of fuzzy sets used in our work and refer readers to [2, 32, 13] for thorough treatments and reviews of the topic.
Definition 3. Fuzzy Set: Let X be a universal set of elements x ∈ X; then a fuzzy set A is a set of pairs A := {(x, μ_A(x)) | x ∈ X, μ_A(x) ∈ I}, in which μ_A : X → I is called the membership function (or grade membership).
Remark: In this work we let X be a Cartesian product of two sets, X = P × U, where P is a (2D or 3D) collection of neural nodes and U are real numbers in I or R. We define the membership function μ_X(x) := μ_U(x_p), ∀x = (p, x_p) ∈ X, such that it depends on x_p only. For the sake of brevity we abuse the notation and use μ(x), μ_X(x) and μ_U(x_p) interchangeably.
Definition 4. Induced fuzzy XOR: let two fuzzy set elements a, b ∈ U be assigned respective grades of membership by a membership function μ : U → I, with μ(a) = i, μ(b) = j; then the generalized hamming distance h(a, b) : U × U → U induces a fuzzy XOR connective E : I × I → I whose membership function is given by

μ_R(i, j) = μ(h(μ⁻¹(i), μ⁻¹(j))).   (4)

Remark: For the restricted case U = I, the membership function can be trivially defined as the identity function μ = id_I, as proved in [4].
Remark: For the generalized case where U = R, the fuzzy membership μ can be defined by a sigmoid function such as the logistic, tanh, or any function : U → I. In this work we adopt the logistic function μ(a) = 1/(1 + exp(0.5 − a)), and the resulting fuzzy XOR connective is given by the following membership function:

μ_R(i, j) = 1 / (1 + exp(0.5 − μ⁻¹(i) ⊕ μ⁻¹(j))),   (5)

where μ⁻¹(a) = −ln(1/a − 1) + 1/2 is the inverse of μ(a). 
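Equations (4)-(5) can be sketched directly in Python (an illustrative toy of ours, not the paper's code); the assertions check that μ⁻¹ inverts the logistic μ and that the induced connective agrees with μ(h(a, b)) on the underlying universe:

```python
import math

def mu(a):
    """Logistic membership mu(a) = 1 / (1 + exp(0.5 - a)), U = R -> I."""
    return 1.0 / (1.0 + math.exp(0.5 - a))

def mu_inv(i):
    """Inverse membership mu^{-1}(i) = -ln(1/i - 1) + 1/2."""
    return -math.log(1.0 / i - 1.0) + 0.5

def fuzzy_xor(i, j):
    """Induced fuzzy XOR (4)/(5): mu_R(i, j) = mu(h(mu^{-1}(i), mu^{-1}(j)))."""
    a, b = mu_inv(i), mu_inv(j)
    return mu(a + b - 2 * a * b)

# Round trip and consistency with GHD on the underlying universe U = R.
for a, b in [(0.0, 1.0), (0.3, 0.8), (-1.2, 2.5)]:
    assert abs(mu_inv(mu(a)) - a) < 1e-9
    h = a + b - 2 * a * b
    assert abs(fuzzy_xor(mu(a), mu(b)) - mu(h)) < 1e-9
```

At the crisp points a, b ∈ {0, 1} the connective reproduces classical XOR on the grades μ(0) and μ(1), e.g. fuzzy_xor(μ(0), μ(1)) = μ(1).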
Following this analysis, it is possible to rigorously formulate the neural computing of the entire network according to the inference rules of fuzzy logic theory (in the same vein as illustrated in [17]). Nevertheless, research along this line is out of the scope of the present article and will be reported elsewhere.

4 Performance evaluation

4.1 A case study with MNIST image classification

Overall performance: we tested a simple four-layered GHN (cv[1,5,5,16]-pool-cv[16,5,5,64]-pool-fc[1024]-fc[1024,10]) on the MNIST dataset, with 99.0% test accuracy obtained. For this relatively simple dataset, GHN is able to reach test accuracies above 0.95 within 1000 mini-batches at a learning rate of 0.1. This learning speed is on par with that of batch normalization (BN), but without resorting to the learning of additional parameters in BN. It was also observed that a wide range of large learning rates (from 0.01 to 0.1) all resulted in similar final accuracies (see below). We ascribe this well-controlled, robust learning behaviour to the rectified bias terms enforced in GHNs.

Figure 3: Test accuracies of MNIST classification with a Generalized Hamming Network (GHN). Left: test accuracies without using non-linear activation (by setting r = 0). Middle: with r optimized for each layer. Right: with r optimized for each filter. Four learning rates, i.e. {0.1, 0.05, 0.025, 0.01}, are used for each case, with the final accuracy reported in brackets. Note that the numbers of mini-batches are on a logarithmic scale along the x-axis.

Influence of learning rate: This experiment compares performances with different learning rates, and Figure 3 (middle, right) shows that a very large learning rate (0.1) leads to much faster learning without the risk of divergence. A small learning rate (0.01) suffices to guarantee a comparable final test accuracy. 
Therefore we set the learning rate to a constant 0.1 for all experiments unless stated otherwise.
Influence of non-linear double-thresholding: The non-linear double-thresholding can be turned off by setting the threshold ratio r = 0 (see Section 3.1). Optionally, the parameter r is automatically optimized together with the optimization of neuron weights. Figure 3 (left) shows that the GHN without non-linear activation (r = 0) performs equally well compared with the cases where r is optimized (Figure 3 middle, right). There are no significant differences between the two settings for this relatively simple task.

4.2 CIFAR10/100 image classification

In this experiment, we tested a six-layered GHN (cv[3,3,3,64]-cv[64,5,5,256]-pool-cv[256,5,5,256]-pool-fc[1024]-fc[1024,512]-fc[1024,nclass]) on both the CIFAR10 (nclass=10) and CIFAR100 (nclass=100) datasets. Figure 4 shows that the double-thresholding scheme improves learning efficiency dramatically for these challenging image classification tasks: when the parameter r is optimized for each feature filter, the numbers of iterations required to reach the same level of test accuracy are reduced by 1 to 2 orders of magnitude. It must be noted that the performances of such a simple generalized hamming network (89.3% for CIFAR10 and 60.1% for CIFAR100) are on par with many sophisticated networks reported in [33]. In our view, the rectified bias enforced by (3) can be readily applied to these sophisticated networks, although the resulting improvements may vary and remain to be tested.

4.3 Generative modelling with Variational Autoencoder

In this experiment, we tested the effect of rectification in GHN applied to a generative modelling setting. One crucial difference is that the objective is now to minimize reconstruction error instead of classification error. 
It turns out that the double-thresholding scheme is no longer relevant in this setting, and it was thus not used in the experiment.
The baseline network (784-400-400-20) used in this experiment is an improved implementation [34] of the influential paper [35], trained on the MNIST dataset of images of handwritten digits. We rectified the outputs following (3) and, instead of optimizing the lower bound of the log marginal likelihood as in [35], we directly minimize the reconstruction error. Also, we did not include weight regularization terms in the optimization, as this is unnecessary for GHN. Figure 5 (left) illustrates the reconstruction error with respect to the number of training steps (mini-batches). It is shown that the rectified generalized hamming network converges to a lower minimal reconstruction error than the baseline network, with about a 28% reduction. The rectification also leads to faster convergence, which is in accordance with our observations in other experiments.

[Figure 3 legend data — left: rate 0.1 (98.97%), 0.05 (98.86%), 0.025 (98.96%), 0.01 (98.69%); middle: rate 0.1 (98.91%), 0.05 (99.01%), 0.025 (98.86%), 0.01 (98.65%); right: rate 0.1 (98.98%), 0.05 (98.83%), 0.025 (98.84%), 0.01 (98.63%).]

Figure 4: Left: GHN test accuracies of CIFAR10 classification (OPT THRES: parameter r optimized; WO THRES: without non-linear activation). Right: GHN test accuracies of CIFAR100 classification (OPT THRES: parameter r optimized; WO THRES: without non-linear activation).

Figure 5: Left: Reconstruction errors of convolution VAE with and w/o rectification. 
Right: Evaluation accuracies of sentence classification with GHN rectification and w/o rectification.

4.4 Sentence classification

A simple CNN has been used for sentence-level classification tasks, and excellent results were demonstrated on multiple benchmarks [36]. The baseline network used in this experiment is a re-implementation of [36] made available from [37]. Figure 5 (right) plots accuracy curves from both networks. It was observed that the rectified GHN did improve the learning speed, but did not improve the final accuracy compared with the baseline network: both networks yielded a final evaluation accuracy of around 74%, even though the training accuracies were almost 100%. The over-fitting in this experiment is probably due to the relatively small size of the Movie Review dataset, with 10,662 example review sentences, half positive and half negative.

5 Conclusion

In summary, we proposed a rectified generalized hamming network (GHN) architecture which materializes a re-emerging principle of fuzzy logic inferencing. This principle has been extensively studied from a theoretical fuzzy logic point of view, but has been largely overlooked in practical research on ANNs. The rectified neural network derives fuzzy logic implications with underlying generalized hamming distances computed in neuron outputs. Bearing this rectified view in mind, we proposed to compute bias terms analytically without resorting to sophisticated learning methods such as batch normalization. Moreover, we have shown that rectified linear units (ReLU) are theoretically non-essential and can be skipped for some easy tasks, while for challenging classification problems the double-thresholding scheme did improve learning efficiency significantly.
The simple architecture of GHN, on the one hand, lends itself to being analysed rigorously, and this follow-up research will be reported elsewhere. 
On the other hand, GHN is the first fuzzy neural network of its kind that has demonstrated fast learning speed, well-controlled behaviour and state-of-the-art performances on a variety of learning tasks. By cross-checking existing networks against GHN, one is able to grasp the most essential ingredients of deep learning. It is our hope that this kind of comparative study will shed light on future deep learning research and eventually open the "black box" of artificial neural networks [16].

[Figure plots: evaluation accuracy vs. log(#mini_batch) for OPT_THRES (89.26%) vs. WO_THRES (84.63%) and for OPT_THRES (60.05%) vs. WO_THRES (51.71%); reconstruction error vs. #mini_batch for GHN vs. VAE; accuracy vs. #mini_batch for GHN vs. CNN.]

Acknowledgement

I am grateful to the anonymous reviewers for their constructive comments to improve the quality of this paper. I greatly appreciate valuable discussions and support from colleagues at Nokia Technologies.

References

[1] M. M. Gupta and D. H. Rao. Invited review: On the principles of fuzzy neural networks. Fuzzy Sets and Systems, 61:1–18, 1994.

[2] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.

[3] József Tick and János Fodor. Fuzzy implications and inference process. Computing and Informatics, 24:591–602, 2005.

[4] Benjamín C. Bedregal, Renata H. S. Reiser, and Graçaliz P. Dimuro. Xor-implications and E-implications: Classes of fuzzy implications based on fuzzy xor. Electronic Notes in Theoretical Computer Science, 247:5–18, 2009.

[5] Krassimir Atanassov. On Zadeh's intuitionistic fuzzy disjunction and conjunction. NIFS, 17(1):1–4, 2011.

[6] Abhay B. Bulsari. Training artificial neural networks for fuzzy logic.
Complex Systems, 6:443–457, 1992.

[7] Witold Pedrycz and Giancarlo Succi. fXOR fuzzy logic networks. Soft Computing, 7, 2002.

[8] H.-J. Zimmermann. Fuzzy set theory. Wiley Interdisciplinary Reviews: Computational Statistics, 2010.

[9] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis R. Bach and David M. Blei, editors, ICML, volume 37, pages 448–456, 2015.

[10] R. Hahnloser, R. Sarpeshkar, M. Mahowald, R. J. Douglas, and H. S. Seung. Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit. Nature, 405:947–951, 2000.

[11] R. Hahnloser and H. S. Seung. Permitted and forbidden sets in symmetric threshold-linear networks. In NIPS, 2001.

[12] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, 11–13 Apr 2011.

[13] R. Belohlavek, J. W. Dauben, and G. J. Klir. Fuzzy Logic and Mathematics: A Historical Perspective. Oxford University Press, 2017.

[14] P. Liu and H. X. Li. Fuzzy Neural Network Theory and Application. Series in Machine Perception and Artificial Intelligence. World Scientific, 2004.

[15] Jyh-Shing Roger Jang and Chuen-Tsai Sun. Functional equivalence between radial basis function networks and fuzzy inference systems. IEEE Trans. Neural Networks, 4(1):156–159, 1993.

[16] José Manuel Benítez, Juan Luis Castro, and Ignacio Requena. Are artificial neural networks black boxes? IEEE Trans. Neural Networks, 8(5):1156–1164, 1997.

[17] Warren McCulloch and Walter Pitts. A logical calculus of the ideas immanent in nervous activity.
Bulletin of Mathematical Biophysics, 5:115–133, 1943.

[18] Zhiting Hu, Xuezhe Ma, Zhengzhong Liu, Eduard H. Hovy, and Eric P. Xing. Harnessing deep neural networks with logic rules. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL 2016), Volume 1: Long Papers, Berlin, Germany, 2016.

[19] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.

[20] Geoffrey Hinton and Ruslan Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

[21] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814. Omnipress, 2010.

[22] Matthieu Courbariaux and Yoshua Bengio. Binarized neural network: Training deep neural networks with weights and activations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.

[23] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. XNOR-Net: ImageNet classification using binary convolutional neural networks. CoRR, abs/1603.05279, 2016.

[24] Paul Viola and Michael J. Jones. Robust real-time face detection. Int. J. Comput. Vision, 57(2):137–154, May 2004.

[25] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. BRIEF: Binary robust independent elementary features. In Proceedings of the 11th European Conference on Computer Vision: Part IV, ECCV'10, pages 778–792, 2010.

[26] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary Bradski. ORB: An efficient alternative to SIFT or SURF.
In Proceedings of the 2011 International Conference on Computer Vision, ICCV '11, pages 2564–2571, Washington, DC, USA, 2011.

[27] Brian Kulis and Trevor Darrell. Learning to hash with binary reconstructive embeddings. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS'09, pages 1042–1050, 2009.

[28] Mohammad Norouzi and David M. Blei. Minimal loss hashing for compact binary codes. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 353–360, New York, NY, USA, 2011.

[29] Kevin Lin, Huei-Fang Yang, Jen-Hao Hsiao, and Chu-Song Chen. Deep learning of binary hash codes for fast image retrieval. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2015.

[30] Mohammad Norouzi, David J. Fleet, and Ruslan R. Salakhutdinov. Hamming distance metric learning. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1061–1069, 2012.

[31] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015.

[32] H.-J. Zimmermann. Fuzzy Set Theory \u2014 and Its Applications. Kluwer Academic Publishers, Norwell, MA, USA, 2001.

[33] What is the class of this image? Discover the current state of the art in objects classification. http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html. Accessed: 2017-07-19.

[34] A baseline variational auto-encoder based on "Auto-encoding variational bayes". https://github.com/y0ast/VAE-TensorFlow. Accessed: 2017-05-19.

[35] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.

[36] Yoon Kim. Convolutional neural networks for sentence classification.
CoRR, abs/1408.5882, 2014.

[37] A baseline CNN for sentence classification implemented with TensorFlow. https://github.com/dennybritz/cnn-text-classification-tf. Accessed: 2017-05-19.