{"title": "Comparison of three classification techniques: CART, C4.5 and Multi-Layer Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 963, "page_last": 969, "abstract": null, "full_text": "Comparison of three classification techniques, 

CART, C4.5 and Multi-Layer Perceptrons 

A C Tsoi 
Department of Electrical Engineering 
University of Queensland 
St Lucia, Queensland 4072 
Australia 

R A Pearson 
Department of Computer Science 
Aust Defence Force Academy 
Campbell, ACT 2600 
Australia 

Abstract 

In this paper, after some introductory remarks on the classification problem as considered in various research communities, and some discussion of the reasons for ascertaining the performances of the three chosen algorithms, viz., CART (Classification and Regression Tree), C4.5 (one of the more recent versions of a popular induction tree technique known as ID3), and a multi-layer perceptron (MLP), we compare the performances of these algorithms under two criteria: classification and generalisation. It is found that, in general, the MLP has better classification and generalisation accuracies than the other two algorithms. 

1 Introduction 

Classification of data into categories has been pursued by a number of research communities, viz., applied statistics, knowledge acquisition, and neural networks. 

In applied statistics, there are a number of techniques, e.g., clustering algorithms (see e.g., Hartigan) and CART (Classification and Regression Trees, see e.g., Breiman et al). Clustering algorithms are used when the underlying data naturally fall into a number of groups; the distances among groups are measured by various metrics [Hartigan]. CART [Breiman et al] has been very popular among applied statisticians. 
\nIt assumes that the underlying data can be separated into categories, the decision \nboundaries can either be parallel to the axis or they can be a linear combination \nof these axes!. Under certain assumptions on the input data and their associated \n\nlIn CART, and C4.5, the axes are the same as the input features \n\n963 \n\n\f964 \n\nTsoi and Pearson \n\noutput categories, its properties can be proved rigorously [Breiman et al]. The way \nin which CART organises its data set is quite sophisticated. For example, it grows \na number of decision trees by a cross validation method. \nKnowledge acquisition is an important topic in expert systems studies, see e.g., \nCharniak, McDermott. In this case, one is presented with a subset of input output \nexamples drawn from the set of all possible input output examples exhibited by the \nunderlying system. The problem is how to \"distill\" a set of rules describing the set \nof input output examples. The rules are often expressed in the form of \"if statement \n1, then statement 2, else statement 3\". Once this set of rules is obtained, it can \nbe used in a knowledge base for inference or for consulting purposes. It is trivial \nto observe that the rules can be represented in the form of a binary tree structure. \nIn the process of building this binary tree, the knowledge acquisition system must \nlearn about the set of input output examples. Often this problem is pursued in the \nmachine learning community, see e.g., Michalski et al. \n\nOne of the most popular induction tree algorithms is known as ID3, or its later \nvariants, known as C4 (see e.g., Quinlan, Utgoff). There has not been any explicit \nmention of the underlying assumptions on the data. However, it can be postulated \nthat for an induction tree technqiue to work efficiently, there must be some under(cid:173)\nlying assumptions on the data set considered. 
By analogy with CART, it can be observed that an important underlying assumption must be that the data can be divided into categories, with decision boundaries parallel to the axes (i.e., the algorithm does not find a linear combination of the underlying axes to form a possible decision boundary). In contrast to CART and similar techniques, it does not yet have a rigorous theoretical basis. Its learning algorithm, and the way in which it organises the data set, are somewhat different from CART's. 

Recently, there has been considerable activity in the study of yet another classification method, known generally as the artificial neural network (ANN) approach (see e.g., Hecht-Nielsen). In this approach, the idea is to model a given set of input output examples using a system of artificial neurons with very simple internal dynamics, interconnected with each other. One selects an architecture of interconnection of artificial neurons, and a learning algorithm for finding the unknown parameters in the architecture. A particularly popular ANN architecture is known as the multi-layer perceptron (MLP). In this architecture, signals travel in only one direction, i.e., there is no feedback from the output to the input. A simple version of this architecture, consisting of only input and output layers of neurons, was popularised by Rosenblatt in the 1950s and 1960s. An improved version incorporating one or more layers of hidden neurons has been used in the more recent past. A learning algorithm for finding the set of unknown parameters in this architecture while minimising a least squares criterion is known as the back propagation algorithm (see e.g., Rumelhart, McClelland). 

There has been much analysis recently towards understanding why an MLP can be used to classify given input output examples, and what underlying assumptions are required (see e.g., Cybenko, Hornik et al). 
It can be proved that the MLP can approximate any given nonlinear input output mapping under certain not too restrictive assumptions on the mapping and the underlying input output variables. 

Given that the three methods mentioned above, viz., CART, C4.5 (the latest version of the C4 induction tree methodology), and the MLP, all enjoy popularity in their respective research communities, and that they all perform classification based on a given set of input output examples, a natural question to ask is: how do they perform compared with one another? 

There might be some objections to why a comparison among these algorithms is necessary, since each is designed to operate under some predetermined conditions. Secondly, even if it is shown that a particular algorithm performs better on a set of particular examples, there is no guarantee that the algorithm will perform better under a different set of circumstances. This may throw some doubt on the desirability of making a comparison among these algorithms. 

As indicated above, each algorithm makes some underlying assumptions in the construction of a data model, whether these assumptions are made explicit or not. In a practical problem, e.g., power system forecasting [Atlas et al], it is not possible to determine the underlying assumptions in the data. But in an artificially generated example, it is possible to constrain the data so that they have the desired characteristics. From this, it is possible to at least make some qualitative statements concerning the algorithms. These qualitative statements may guide a practitioner to watch out for possible pitfalls in applying a particular algorithm to practical problems. Hence, it is worthwhile to carry out comparison studies. 

The comparison question is not new. 
In fact there are already a number of studies comparing the performances of some or all of the three algorithms mentioned.^2 For example, Atlas et al compared the performances of CART and the MLP. In addition, they applied these two algorithms to a practical problem, viz., power system forecasting. Dietterich et al compared the performances of ID3 and the MLP, and applied them to the text-to-speech mapping problem. In general, their conclusions are that the MLP is more accurate in generalising to unseen examples, while ID3 or CART is much faster in performing the classification task. 

^2 Both Atlas et al and Dietterich et al were brought to our attention during the conference. Hence some of their conclusions were only communicated to us at that time. 

In this paper, we will consider the performances of all three algorithms, viz., CART, C4.5 and the MLP, on two criteria: 

\u2022 Classification capabilities 
\u2022 Generalisation capabilities 

In order to ascertain how these algorithms perform, we have chosen to study their performances using a closed set of input output examples. In this aspect, we have chosen a version of the Penzias example, first considered by Denker et al. This class of problems has been shown to require at least one hidden layer in an MLP architecture, indicating that the relationship between the input and output is nonlinear. Secondly, the problem complexity depends on the number of input neurons (in CART and C4.5, input features). Hence it is possible to test the algorithms using a progressively complex set of examples. 

We have chosen to compare the algorithms under the two criteria because some of them, at least in the case of CART, were designed for classification purposes. CART was not originally intended for generalisation purposes. 
By generalisation, we mean that the trained system is used to predict the categories of unseen examples when only the input variables are given. The predicted categories are then compared with the true categories to ascertain how well the trained system has performed. 

The separate comparison is necessary because classification and generalisation are rather different. In classification studies, the main purpose is to train a system to classify the given set of input output examples. The desired characteristics are a good model of the data and good accuracy in classifying the given set of examples. In generalisation, the main goal is good accuracy in predicting the output categories of the set of unseen examples. It does not matter much if the results of applying the trained data model to the training data set are less accurate. 

An important point to note is that all the algorithms have a number of parameters or procedures which allow them to perform better. For example, it is possible to vary the a priori assumption on the occurrence of different output categories in CART, while performing a similar task in C4.5 or the MLP is rather more difficult. It is possible to train the MLP with ever increasing iterations until the error is small, given a sufficient number of hidden layer neurons. On the other hand, in C4.5 or CART, the number of iterations is not an externally adjustable parameter. 

In order to avoid pitfalls like these, as well as to avoid the criticism of favouring one algorithm over another, the results presented here have not been consciously tuned to give the best performance. For example, even though from observation we know that the distribution of the different output categories is uneven, we have not made any adjustments to the a priori probabilities in running CART. We will assume that the output categories occur with equal prior probabilities. 
We have not tuned the number of hidden layer neurons in the MLP, except that we have taken a particular number which has been used by others. We have not tuned the learning rate nor the momentum rate in the MLP, taking just a nominal default value which appears to work for other examples. We have not tuned C4.5 nor CART apart from using the default values. Hopefully by doing this, the comparison will appear fairer. 

The structure of the paper is as follows: in section 2, we will describe the classification results, while in section 3 we will present the generalisation results. 

2 Comparison of classification performances 

Before we present the results of comparing the performances of the algorithms, we will give a brief description of the testing example used. This example is known as a clump example in Denker et al, while in Maxwell et al it is referred to as the contiguity example (see [Webb, Lowe]). 

There are N input features, each of which can take only the values 0 or 1. Thus there are altogether 2^N examples. The output class of a particular input feature vector is the number of clumps of 1's in the input feature vector. Thus, for example, if the input feature vector is 00110100, then it is in class 2, as there are two distinct clumps of 1's in the input features. Hence it is possible to generate the closed set of all input output examples given a particular value of N. For convenience, we will call this an Nth order Penzias example. In the case considered here, we have used N = 8, i.e., there are 256 examples in the entire set. The input features are the binary equivalents of their ordinal numbers. For example, example 10 is 00001010. This allows us to denote any sample within the set more conveniently. 
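
The labelling rule just described can be sketched in a few lines of Python. This is our own illustrative reconstruction, not code from the paper; in particular, folding the all-zeros vector (zero clumps) into class 1 is an assumption, made so that the generated class counts match the class distribution reported below. 

```python
def count_clumps(bits):
    """Count maximal runs (clumps) of 1's in a 0/1 sequence."""
    clumps, prev = 0, 0
    for b in bits:
        if b == 1 and prev == 0:  # a clump starts at each 0 -> 1 transition
            clumps += 1
        prev = b
    return clumps

def penzias_dataset(n):
    """Generate all 2^n labelled examples of the nth order Penzias problem."""
    data = []
    for i in range(2 ** n):
        # example i is the n-bit binary expansion of i, e.g. example 10 -> 00001010
        bits = [(i >> (n - 1 - k)) & 1 for k in range(n)]
        # assumption: the all-zeros vector is grouped with class 1
        label = max(count_clumps(bits), 1)
        data.append((bits, label))
    return data
```

For n = 8 this yields the closed set of 256 examples used throughout the paper; for instance, the vector 00110100 receives label 2. 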
\nThe distribution of the output classes are as follows: \n\ntotal number \n37 \n126 \n84 \n9 \n\nclass \n1 \n2 \n3 \n4 \nFor classification purposes, we use all 256 examples as both the training and testing \ndata sets. The following table summarises the classificiation results. \nname # of errors \ncart \nc4.5 \nmlp1 \nmlp2 \n\n96 \n105 \n117 \n47 \n\naccur % \n0.625 \n0.59 \n0.54 \n0.82 \n\nwhere mlp1 and mlp2 are the values related to the MLP when it has run for 10000 \niterations and 100000 iterations respectively. We have run the MLP in the fol(cid:173)\nlowing fashion: we run it 10000 times and then in steps of 10000 iterations but \nat the beginning of each 10000 iterations it is run with a different initial param(cid:173)\neter estimate. In this way, we can ensure that the MLP will not fall into a local \nminimum. Secondly, we can observe how the MLP accuracies will improve with \nincreasing number of iterations. We found that in general, the MLP converges in \nabout 20000 iterations. After that the number of iterations the results do not im(cid:173)\nprove by a significant amount. In addition, becasue of the way in which we run the \nexperiemnt the convergence would be closer to the average convergence rather than \nthe convergence for a particular initial condition. \n\nThe parameter values used in running the experiments are as follows: In the MLP, \nboth the learning rate and the momentum are set at 0.1. The architeture used \nis: 8 input neurons, 5 hidden layer neurons, and 4 output neurons. In CART, the \nprior probability is set to be equi-probable. The pruning is performed when the \nprobability of the leaf node is equal 0.5. In C4.5, all the default values are used. \n\nWe have also examined the ways in which each algorithm predicts the output cat(cid:173)\negories. We found that none of the algorithms ever predict an output category of \n4. This is interesting in that the output category 4 occurs only 9 times out of a \ntotal possible of 256. 
Thus each algorithm, whether or not it is able to adjust the prior probability of the output categories, has made an implicit assumption of equal prior probability. This leads to category 4, the least frequently occurring one, never being predicted. 

Secondly, all the algorithms have a default prediction. For example, in CART, the default is class 2, the most frequently occurring output category in the training examples, while in the case of C4.5, the default is determined by the algorithm. On the other hand, in the cases of C4.5 and the MLP, it is not clear how the default cases are determined. 

Thirdly, the algorithms make mistaken predictions in different places. For example, for sample 1, C4.5 makes the wrong prediction of category 3 and the MLP makes the wrong prediction of category 2, while CART makes the correct prediction. For sample 9, both CART and C4.5 make wrong predictions, while the MLP makes the correct prediction. 

3 Comparison of generalisation performances 

We have used the same set of input output examples generated by the 8th order Penzias example. For testing the generalisation capabilities, we have used the first 200 examples as the training data set, and the remaining 56 examples as the testing data set. 

The results are summarised in the following table: 

       training                 testing 
name   # of errors   accur %   # of errors   accur % 
cart   84            58        34            39.3 
c4.5   97            51.5      25            55.4 
mlp1   100           50        28            50 
mlp2   50            75        25            55.4 

It is noted that the generalisation accuracy of the MLP is better than that of CART, and is comparable to that of C4.5. 

We have also examined closely the mistakes made by the algorithms as well as the default predictions. In this case, the comments made in section 2 also appear to hold. 

4 Conclusions 

In this paper, we considered three classification algorithms, viz., CART, C4.5, and the MLP. 
We compared their performances in terms of both classification and generalisation on one example, the 8th order generalised Penzias example. It is found that the MLP, once it has converged, in general has better classification and generalisation accuracies than CART or C4.5. On the other hand, it is also noted that the prediction errors made by each algorithm are different. This indicates that there may be a possibility of combining these algorithms in such a way that their prediction accuracies could be improved. This is presented as a challenge for future research. 

References 

J. Hartigan. (1974) Clustering Algorithms. J. Wiley, New York. 

L. Breiman, J.H. Friedman, R.A. Olshen, C.J. Stone. (1984) Classification and Regression Trees. Wadsworth and Brooks, Monterey, Calif. 

E. Charniak, D. McDermott. (1985) Introduction to Artificial Intelligence. Addison Wesley, Reading, Mass. 

R. Michalski, J.G. Carbonell, T. Mitchell. (1983) Machine Learning: An Artificial Intelligence Approach. Tioga, Palo Alto, Calif. 

J.R. Quinlan. (1983) Learning efficient classification procedures and their application to chess end games. In R. Michalski et al (ed.), Machine Learning: An Artificial Intelligence Approach. Tioga, Palo Alto, Calif. 

J.R. Quinlan. (1986) Induction of decision trees. Machine Learning, 1, 81-106. 

P. Utgoff. (1989) Incremental induction of decision trees. Machine Learning, 4, 161-186. 

R. Hecht-Nielsen. (1990) Neurocomputing. Addison Wesley, New York. 

F. Rosenblatt. (1962) Principles of Neurodynamics. Spartan Books, Washington, DC. 

D. Rumelhart, J. McClelland. (1987) Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1. MIT Press: Bradford Books. 

G. Cybenko. (1989) Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:4. 
\n\nK. Hornik, M. Stinchcombe, H. White. (1989) Multi-layer feedforward networks \nare universal approximators. Neural Networks, 2:5, 359-366. \n\nL. Atlas, R. Cole, Y. Muthusamy, A. Lippman, J. Connor, D. Park, M. EI(cid:173)\nSharkawi, R. Marks II. (1990). A Performance Comparison of Trained Multilayer \nPerceptrons and Trained Classification Trees. Proc IEEE, 78:10, 1614-1619. \n\nT. Dietterich, H. Hild, G. Bakiri, (1990), \"A Comparison ofID3 and Backpropa(cid:173)\ngation for English Text-to-Speech Mapping\", Preprint. \n\nJ. Denker, et al. (1987) Large automatic learning, rule extraction, and generalisa(cid:173)\ntion. Complex Systems, 3 877-922. \nT. Maxwell, L. Giles, Y.C. Lee. (1987) Generalisation in neural networks, the \ncontiguity problem. Proc IEEE 1st Int Conf on Neural Networks, San Diego, Calif. \nA.R. Webb, D. Lowe. (1990) The Optimised internal representation of multilayered \nclassifier networks performs nonlinear discriminant analysis. Neural Networks 3:4, \n367-376. \n\n\f", "award": [], "sourceid": 410, "authors": [{"given_name": "A. C.", "family_name": "Tsoi", "institution": null}, {"given_name": "R.", "family_name": "Pearson", "institution": null}]}