{"title": "Learning Complex Boolean Functions: Algorithms and Applications", "book": "Advances in Neural Information Processing Systems", "page_first": 911, "page_last": 918, "abstract": null, "full_text": "Learning Complex Boolean Functions: \n\nAlgorithms and Applications \n\nArlindo L. Oliveira and Alberto Sangiovanni-Vincentelli \n\nDept. of EECS \n\nUC Berkeley \n\nBerkeley CA 94720 \n\nAbstract \n\nThe most commonly used neural network models are not well suited \nto direct digital implementations because each node needs to per(cid:173)\nform a large number of operations between floating point values. \nFortunately, the ability to learn from examples and to generalize is \nnot restricted to networks ofthis type. Indeed, networks where each \nnode implements a simple Boolean function (Boolean networks) can \nbe designed in such a way as to exhibit similar properties. Two \nalgorithms that generate Boolean networks from examples are pre(cid:173)\nsented. The results show that these algorithms generalize very \nwell in a class of problems that accept compact Boolean network \ndescriptions. The techniques described are general and can be ap(cid:173)\nplied to tasks that are not known to have that characteristic. Two \nexamples of applications are presented: image reconstruction and \nhand-written character recognition. \n\n1 \n\nIntroduction \n\nThe main objective of this research is the design of algorithms for empirical learning \nthat generate networks suitable for digital implementations. Although threshold \ngate networks can be implemented using standard digital technologies, for many \napplications this approach is expensive and inefficient. Pulse stream modulation \n[Murray and Smith, 1988] is one possible approach, but is limited to a relatively \nsmall number of neurons and becomes slow if high precision is required. 
Dedicated boards based on DSP processors can achieve very high performance and are very flexible, but may be too expensive for some applications.\n\nThe algorithms described in this paper accept as input a training set and generate networks where each node implements a relatively simple Boolean function. Such networks will be called Boolean networks. Many applications can benefit from such an approach because the speed and compactness of digital implementations are still unmatched by their analog counterparts. Additionally, many alternatives are available to designers who want to implement Boolean networks, from full-custom design to field programmable gate arrays. This makes the digital alternative more cost effective than solutions based on analog designs.\n\nOccam's razor [Blumer et al., 1987; Rissanen, 1986] provides the theoretical foundation for the development of algorithms that can be used to obtain Boolean networks that generalize well. According to this paradigm, simpler explanations for the available data have higher predictive power. The induction problem can therefore be posed as an optimization problem: given a labeled training set, derive the least complex Boolean network that is consistent^1 with the training set.\n\nOccam's razor, however, doesn't help in the choice of the particular way of measuring complexity that should be used. In general, different types of problems may require different complexity measures. The algorithms described in sections 3.1 and 3.2 are greedy algorithms that aim at minimizing one specific complexity measure: the size of the overall network. Although this particular way of measuring complexity may prove inappropriate in some cases, we believe the approach proposed can be generalized and used with minor modifications in many other tasks. 
The problem of finding the smallest Boolean network consistent with the training set is NP-hard [Garey and Johnson, 1979] and cannot be solved exactly in most cases. Heuristic approaches like the ones described are therefore required.\n\n2 Definitions\n\nWe consider the problem of supervised learning in an attribute based description language. The attributes (input variables) are assumed to be Boolean and every exemplar in the training set is labeled with a value that describes its class. Both algorithms try to maximize the mutual information between the network output and these labels.\n\nLet variable X take the values {x_1, x_2, ..., x_n} with probabilities p(x_1), p(x_2), ..., p(x_n). The entropy of X is given by H(X) = -Σ_j p(x_j) log p(x_j) and is a measure of the uncertainty about the value of X. The uncertainty about the value of X when the value of another variable Y is known is given by H(X|Y) = -Σ_i p(y_i) Σ_j p(x_j|y_i) log p(x_j|y_i).\n\nThe amount by which the uncertainty of X is reduced when the value of variable Y is known, I(Y, X) = H(X) - H(X|Y), is called the mutual information between Y and X. In this context, Y will be a variable defined by the output of one or more nodes in the network and X will be the target value specified in the training set.\n\n^1 Up to some specified level.\n\n3 Algorithms\n\n3.1 Muesli - An algorithm for the design of multi-level logic networks\n\nThis algorithm derives the Boolean network by performing gradient descent in the mutual information between a set of nodes and the target values specified by the labels in the training set.\n\nIn the pseudo code description of the algorithm given in figure 1, the function I(S) computes the mutual information between the nodes in S (viewed as a multi-valued variable) and the target output. 
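These estimates can be computed directly from empirical frequencies in the training set; the short Python sketch below is illustrative only (the function names are ours, not part of the algorithms that follow):

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum_j p(x_j) log2 p(x_j), estimated from a sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def mutual_information(ys, xs):
    """I(Y, X) = H(X) - H(X|Y), estimated from paired samples of Y and X."""
    n = len(xs)
    h_x_given_y = 0.0
    for y, count_y in Counter(ys).items():
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        h_x_given_y += (count_y / n) * entropy(xs_given_y)
    return entropy(xs) - h_x_given_y
```

For instance, a feature identical to the labels carries one full bit: mutual_information([0, 0, 1, 1], [0, 0, 1, 1]) evaluates to 1.0, while a feature independent of the labels evaluates to 0.0.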
\n\nmuesli(nlist) {\n    nlist <- sort_nlist_by_I(nlist, 1);\n    sup <- 2;\n    while (not_done(nlist) /\ sup < max_sup) {\n        act <- 0;\n        do {\n            act++;\n            success <- improve_mi(act, nlist, sup);\n        } while (success = FALSE /\ act < max_act);\n        if (success = TRUE) {\n            sup <- 2;\n            while (success = TRUE)\n                success <- improve_mi(act, nlist, sup);\n        }\n        else sup++;\n    }\n}\n\nimprove_mi(act, nlist, sup) {\n    nlist <- sort_nlist_by_I(nlist, act);\n    f <- best_function(nlist, act, sup);\n    if (I(nlist[1:act-1] ∪ f) > I(nlist[1:act])) {\n        nlist <- nlist ∪ f;\n        return(TRUE);\n    }\n    else return(FALSE);\n}\n\nFigure 1: Pseudo-code for the Muesli algorithm.\n\nThe algorithm works by keeping a list of candidate nodes, nlist, that initially contains only the primary inputs. The act variable selects which node in nlist is active. Initially, act is set to 1 and the node that provides the most information about the output is selected as the active node. Function improve_mi() tries to combine the active node with other nodes so as to increase the mutual information.\n\nExcept for very simple functions, a point will be reached where no further improvements can be made for the single most informative node. The value of act is then increased (up to a pre-specified maximum) and improve_mi is again called to select auxiliary features using other nodes in nlist as the active node. If this fails, the value of sup (the size of the support of each selected function) is increased, until no further improvements are possible or the target is reached.\n\nThe function sort_nlist_by_I(nlist, act) sorts the first act nodes in the list by decreasing value of the information they provide about the labels. More explicitly, the first node in the sorted list is the one that provides maximal information about the labels. 
The second node is the one that provides the most additional information after the first has been selected, and so on.\n\nFunction improve_mi() calls best_function(nlist, act, sup) to select the Boolean function f that takes as inputs node nlist[act] plus sup-1 other nodes and maximizes I(nlist[1:act-1] ∪ f). When sup is larger than 2 it is infeasible to search all 2^(2^sup) possible functions to select the desired one. However, given sup input variables, finding such a function is equivalent to selecting a partition^2 of the 2^sup points in the input space that maximizes a specific cost function. This partition is found using the Kernighan-Lin algorithm [Kernighan and Lin, 1970] for graph partitioning.\n\nFigure 2 exemplifies how the algorithm works when learning the simple Boolean function f = ab + cde from a complete training set. In this example, the value of sup is kept at 2, so only 2-input Boolean functions are generated.\n\nFigure 2: The Muesli algorithm, illustrated. The trace proceeds as follows: mi([]) = 0.0; with nlist = [a,b,c,d,e] and act = 1, mi([a]) = 0.16; x = ab is selected, giving nlist = [x,c,d,e,a,b] and mi([x]) = 0.52; no f(x,?) with mi([f]) > 0.52 is found, so act is set to 2, with mi([x,c]) = 0.63; y = cd is selected, giving mi([x,y]) = 0.74; w = ye is selected, giving mi([x,w]) = 0.93; no f(w,?) with mi([x,f]) > 0.93 is found, and z = x + w is selected, giving nlist = [z,x,y,a,b,c,d,e], act = 1 and mi([z]) = 0.93.\n\n^2 A single output Boolean function is equivalent to a partition of the input space in two sets.\n\n3.2 Fulfringe - a network generation algorithm based on decision trees\n\nThis algorithm uses binary decision trees [Quinlan, 1986] as the basic underlying representation. 
A binary decision tree is a rooted, directed, acyclic graph, where each terminal node (a node with no outgoing edges) is labeled with one of the possible output labels and each non-terminal node has exactly two outgoing edges, labeled 0 and 1. Each non-terminal node is also labeled with the name of the attribute that is tested at that node. A decision tree can be used to classify a particular example by starting at the root node and taking, until a terminal is reached, the edge labeled with the value of the attribute tested at the current node.\n\nDecision trees are usually built in a greedy way. At each step, the algorithm greedily selects the attribute to be tested as the one that provides maximal information about the label of the examples that reached that node in the decision tree. It then recurs after splitting these examples according to the value of the tested attribute.\n\nFulfringe works by identifying patterns near the fringes of the decision tree and using them to build new features. The idea was first proposed in [Pagallo and Haussler, 1990].\n\nFigure 3: Fringe patterns identified by fulfringe. The patterns correspond to the composite features -p&-g, p&-g, -p&g, p&g, -p+-g, -p+g, p+-g, p+g, and the two parity features, including p(+)g.\n\nFigure 3 shows the patterns that fulfringe identifies. Dcfringe, proposed in [Yang et al., 1991], identifies the patterns shown in the first two rows. These patterns correspond to 8 Boolean functions of 2 variables. Since there are only 10 distinct Boolean functions that depend on two variables^3, it is natural to add the patterns in the third row and identify all possible functions of 2 variables. As in dcfringe and fringe, these new composite features are added (if they have not yet been generated) to the list of available features and a new decision tree is built. 
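The count of ten such functions is easy to verify by enumerating all 16 truth tables over a pair of inputs and keeping those that depend on both; a small illustrative Python check (not part of fulfringe itself):

```python
import itertools

def two_input_functions():
    """Enumerate the Boolean functions f(p, g) that depend on both inputs."""
    dependent = []
    for table in itertools.product([0, 1], repeat=4):
        # table[2*p + g] is the value of f on input (p, g)
        f = {(p, g): table[2 * p + g] for p in (0, 1) for g in (0, 1)}
        depends_on_p = any(f[(0, g)] != f[(1, g)] for g in (0, 1))
        depends_on_g = any(f[(p, 0)] != f[(p, 1)] for p in (0, 1))
        if depends_on_p and depends_on_g:
            dependent.append(table)
    return dependent
```

Of the 16 truth tables, two are constant and four depend on a single input, so len(two_input_functions()) is 10, as stated above.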
The process is iterated until a decision tree with only one decision node is built. The attribute tested at this node is a complex feature and can be viewed as the output of a Boolean network that matches the training set data.\n\n^3 The remaining 6 functions of 2 variables depend on only one or none of the variables.\n\n3.3 Encoding multivalued outputs\n\nBoth muesli and fulfringe generate Boolean networks with a single binary valued output. When the target label can have more than 2 values, some encoding must be used. The preferred solution is to encode the outputs using an error correcting code [Dietterich and Bakiri, 1991]. This approach preserves most of the compactness of a digital encoding while being much less sensitive to errors in one of the output variables. Additionally, the Hamming distance between an observed output and the closest valid codeword gives a measure of the certainty of the classification. This can be used to our advantage in problems where a failure to classify is less serious than the output of a wrong classification.\n\n4 Performance evaluation\n\nTo evaluate the algorithms, we selected a set of 11 functions of variable complexity. A complete description of these functions can be found in [Oliveira, 1994]. The first 6 functions were proposed as test cases in [Pagallo and Haussler, 1990] and accept compact disjunctive normal form descriptions. The remaining ones accept compact multi-level representations but have large two-level descriptions. The algorithms described in sections 3.1 and 3.2 were compared with the cascade-correlation algorithm [Fahlman and Lebiere, 1990] and a standard decision tree algorithm analogous to ID3 [Quinlan, 1986]. 
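The nearest-codeword decoding used with the error correcting output codes of section 3.3 can be sketched as follows; the 4-class codebook below is a toy code with minimum Hamming distance 3, not the 15-bit Hadamard code used in section 5, and all names are illustrative:

```python
def hamming(a, b):
    """Number of bit positions in which two words differ."""
    return sum(x != y for x, y in zip(a, b))

def decode(output_bits, codebook):
    """Return the class whose codeword is closest in Hamming distance,
    together with that distance (a rough certainty measure: 0 = exact)."""
    dists = {cls: hamming(output_bits, cw) for cls, cw in codebook.items()}
    best = min(dists, key=dists.get)
    return best, dists[best]

# Toy 4-class code with pairwise Hamming distance >= 3: any single
# flipped output bit is still decoded to the correct class.
codebook = {
    0: (0, 0, 0, 0, 0, 0),
    1: (1, 1, 1, 0, 0, 0),
    2: (0, 0, 0, 1, 1, 1),
    3: (1, 1, 1, 1, 1, 1),
}
```

For example, decode((1, 0, 0, 0, 0, 0), codebook) returns (0, 1): class 0, at distance 1. A large distance to every codeword can also be used to reject, rather than misclassify, an uncertain output.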
As in [Pagallo and Haussler, 1990], the number of examples in the training set was selected to be equal to 1/ε times the description length of the function under a fixed encoding scheme, where ε was set equal to 0.1. For each function, 5 training sets were randomly selected. The average accuracy for the 5 runs on an independent set of 4000 examples is listed in table 1.\n\nTable 1: Accuracy of the four algorithms.\n\nFunction   # inputs   # examples   muesli   fulfringe   ID3     CasCor\ndnf1       80         3292         99.91    99.98       75.38   82.09\ndnf2       40         2185         99.28    98.89       73.11   88.84\ndnf3       32         1650         99.94    100.00      79.19   89.98\ndnf4       64         2640         100.00   100.00      58.41   72.61\nxor4_16    16         1200         98.35    100.00      99.91   75.20\nxor5_32    32         4000         60.16    100.00      99.97   51.41\nsm12       12         1540         99.90    100.00      98.98   99.81\nsm18       18         2720         100.00   99.92       91.30   91.48\nstr18      18         2720         100.00   100.00      92.57   94.55\nstr27      27         4160         98.64    99.35       93.90   94.24\ncarry8     16         2017         99.50    98.71       99.22   96.70\nAverage                            95.97    99.71       87.45   85.35\n\nThe results show that the performance of muesli and fulfringe is consistently superior to that of the other two algorithms. Muesli performs poorly on examples that contain many xor functions, due to the greedy nature of the algorithm. In particular, muesli failed to find a solution in the allotted time for 4 of the 5 runs of xor5_32 and found the exact solution in only one of the runs.\n\nID3 was the fastest of the algorithms and Cascade-Correlation the slowest. Fulfringe and muesli exhibited similar running times for these tasks. We observed, however, that for larger problems the runtime of fulfringe becomes prohibitively high and muesli is comparatively much faster. 
\n\n5 Applications\n\nTo evaluate the techniques described on real problems, experiments were performed in two domains: noisy image reconstruction and handwritten character recognition. The main objective was to investigate whether the approach is applicable to problems that are not known to accept a compact Boolean network representation. The outputs were encoded using a 15 bit Hadamard error correcting code.\n\n5.1 Image reconstruction\n\nThe speed required by applications in image processing makes it a very interesting field for this type of approach. In this experiment, 16 level gray scale images were corrupted by random noise that switched each bit with 5% probability. Samples of this image were used to train a network in the reconstruction of the original image. The training set consisted of 5x5 pixel regions of corrupted images (100 binary variables per sample) labeled with the value of the center pixel. Figure 4 shows a detail of the reconstruction performed on an independent test image by the network obtained using fulfringe.\n\nFigure 4: Image reconstruction experiment (original, corrupted and reconstructed images).\n\n5.2 Handwritten character recognition\n\nThe NIST database of handwritten characters was used for this task. Individually segmented digits were normalized to a 16 by 16 binary grid. A set of 53629 digits was used for training and the resulting network was tested on a different set of 52467 digits. Training was performed using muesli. The algorithm was stopped after a pre-specified time (48 hours on a DECstation 5000/260) elapsed. The resulting network was placed and routed using the TimberWolf [Sechen and Sangiovanni-Vincentelli, 1986] package and occupies an area of 78.8 sq. mm. using 0.8µ technology.\n\nThe accuracy on the test set was 93.9%. 
This value compares well with the performance obtained by alternative approaches that use a similarly sized training set and little domain knowledge, but falls short of the best results published so far. Ongoing research on this problem is concentrated on the use of domain knowledge to restrict the search for compact networks and speed up the training.\n\nAcknowledgements\n\nThis work was supported by Joint Services Electronics Program grant F49620-93-C-0014.\n\nReferences\n\n[Blumer et al., 1987] A. Blumer, A. Ehrenfeucht, D. Haussler, and M. Warmuth. Occam's razor. Information Processing Letters, 24:377-380, 1987.\n\n[Dietterich and Bakiri, 1991] T. G. Dietterich and G. Bakiri. Error-correcting output codes: A general method for improving multiclass inductive learning programs. In Proceedings of the Ninth National Conference on Artificial Intelligence (AAAI-91), pages 572-577. AAAI Press, 1991.\n\n[Fahlman and Lebiere, 1990] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. In D. S. Touretzky, editor, Advances in Neural Information Processing Systems, volume 2, pages 524-532, San Mateo, 1990. Morgan Kaufmann.\n\n[Garey and Johnson, 1979] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, New York, 1979.\n\n[Kernighan and Lin, 1970] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, pages 291-307, February 1970.\n\n[Murray and Smith, 1988] Alan F. Murray and Anthony V. W. Smith. Asynchronous VLSI neural networks using pulse-stream arithmetic. IEEE Journal of Solid-State Circuits, 23(3):688-697, 1988.\n\n[Oliveira, 1994] Arlindo L. Oliveira. Inductive Learning by Selection of Minimal Representations. PhD thesis, UC Berkeley, 1994. In preparation.\n\n[Pagallo and Haussler, 1990] G. Pagallo and D. Haussler. 
Boolean feature discovery in empirical learning. Machine Learning, 5:71-99, 1990.\n\n[Quinlan, 1986] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986.\n\n[Rissanen, 1986] J. Rissanen. Stochastic complexity and modeling. Annals of Statistics, 14:1080-1100, 1986.\n\n[Sechen and Sangiovanni-Vincentelli, 1986] Carl Sechen and Alberto Sangiovanni-Vincentelli. TimberWolf3.2: A new standard cell placement and global routing package. In Proceedings of the 23rd Design Automation Conference, pages 432-439, 1986.\n\n[Yang et al., 1991] D. S. Yang, L. Rendell, and G. Blix. Fringe-like feature construction: A comparative study and a unifying scheme. In Proceedings of the Eighth International Conference on Machine Learning, pages 223-227, San Mateo, 1991. Morgan Kaufmann.\n", "award": [], "sourceid": 857, "authors": [{"given_name": "Arlindo", "family_name": "Oliveira", "institution": null}, {"given_name": "Alberto", "family_name": "Sangiovanni-Vincentelli", "institution": null}]}