{"title": "A Comparative Study of the Practical Characteristics of Neural Network and Conventional Pattern Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 970, "page_last": 976, "abstract": "", "full_text": "A Comparative Study of the Practical \nCharacteristics of Neural Network and \nConventional Pattern Classifiers \n\nKenney Ng \nBBN Systems and Technologies \nCambridge, MA 02138 \n\nRichard P. Lippmann \nLincoln Laboratory, MIT \nLexington, MA 02173-9108 \n\nAbstract \n\nSeven different pattern classifiers were implemented on a serial computer \nand compared using artificial and speech recognition tasks. Two neural \nnetwork (radial basis function and high order polynomial GMDH network) \nand five conventional classifiers (Gaussian mixture, linear tree, K nearest \nneighbor, KD-tree, and condensed K nearest neighbor) were evaluated. \nClassifiers were chosen to be representative of different approaches to pattern \nclassification and to complement and extend those evaluated in a \nprevious study (Lee and Lippmann, 1989). This and the previous study \nboth demonstrate that classification error rates can be equivalent across \ndifferent classifiers when they are powerful enough to form minimum error \ndecision regions, when they are properly tuned, and when sufficient \ntraining data is available. Practical characteristics such as training time, \nclassification time, and memory requirements, however, can differ by orders \nof magnitude. These results suggest that the selection of a classifier \nfor a particular task should be guided not so much by small differences in \nerror rate, but by practical considerations concerning memory usage, computational \nresources, ease of implementation, and restrictions on training \nand classification times. \n\n1 INTRODUCTION \nFew studies have compared practical characteristics of adaptive pattern classifiers \nusing real data. 
There has frequently been an over-emphasis on back-propagation \nclassifiers and artificial problems and a focus on classification error rate as the main \nperformance measure. No study has compared the practical trade-offs in training \ntime, classification time, memory requirements, and complexity provided by the \nmany alternative classifiers that have been developed (e.g. see Lippmann 1989). \n\nThe purpose of this study was to better understand and explore practical characteristics \nof classifiers not included in a previous study (Lee and Lippmann, 1989; Lee \n1989). Seven different neural network and conventional pattern classifiers were evaluated. \nThese included radial basis function (RBF), high order polynomial GMDH \nnetwork, Gaussian mixture, linear decision tree, K nearest neighbor (KNN), KD \ntree, and condensed K nearest neighbor (CKNN) classifiers. All classifiers were \nimplemented on a serial computer (Sun 3/110 Workstation with FPA) and tested \nusing a digit recognition task (7 digits, 22 cepstral inputs, 16 talkers, 70 training \nand 112 testing patterns per talker), a vowel recognition task (10 vowels, 2 formant \nfrequency inputs, 67 talkers, 338 training and 333 testing patterns), and two artificial \ntasks with two input dimensions that require either a single convex or two \ndisjoint decision regions. Tasks are as in (Lee and Lippmann, 1989) and details of \nexperiments are described in (Ng, 1990). \n\n2 TUNING EXPERIMENTS \nInternal parameters or weights of classifiers were determined using training data. \nGlobal free parameters that provided low error rates were found experimentally \nusing cross-validation and the training data or by using test data. 
Global parameters \nincluded an overall basis function width scale factor for the RBF classifier, order \nof nodal polynomials for the GMDH network, and number of nearest neighbors for \nthe KNN, KD tree, and CKNN classifiers. \n\nExperiments were also performed to match the complexity of each classifier to that \nof the training data. Many classifiers exhibit a characteristic divergence between \ntraining and testing error rates as a function of their complexity. Poor performance \nresults when a classifier is too simple to model the complexity of training data \nand also when it is too complex and \"over-fits\" the training data. Cross-validation \nand statistical techniques were used to determine the correct size of the linear tree \nand GMDH classifiers where training and test set error rates diverged substantially. \nAn information theoretic measure (Predicted Square Error) was used to limit the \ncomplexity of the GMDH classifier. This classifier was allowed to grow by adding \nlayers and widening layers to find the number of layers and the layer width which \nminimized predicted square error. Nodes in the linear tree were pruned using 10-fold \ncross-validation and a simple statistical test to determine the minimum size tree \nthat provides good performance. Training and test set error rates did not diverge \nfor the RBF and Gaussian mixture classifiers. Test set performance was thus used \nto determine the number of Gaussian centers for these classifiers. \n\nA new multi-scale radial basis function classifier was developed. It has multiple \nradial basis functions centered on each basis function center with widths that vary \nover 1 1/2 orders of magnitude. Multi-scale RBF classifiers provided error rates \nthat were similar to those of more conventional RBF classifiers but eliminated the \nneed to search for a good value of the global basis function width scale factor. \n\nThe CKNN classifier used in this study was also new. 
It was developed to reduce \nmemory requirements and dependency on training data order. In the more conventional \nCKNN classifier, training patterns are presented sequentially and classified \nusing a KNN rule. Patterns are stored as exemplars only if they are classified incorrectly. \nIn the new CKNN classifier, this conventional CKNN training procedure \nis repeated N times with different orderings of the training patterns. All exemplar \npatterns stored using any ordering are combined into a new reduced set of training \npatterns which is further pruned by using it as training data for a final pass of \nconventional CKNN training. This approach typically required less memory than \na KNN or a conventional CKNN classifier. Other experiments described in (Chang \nand Lippmann, 1990) demonstrate how genetic search algorithms can further reduce \nKNN classifier memory requirements. \n\n[Figure 1 panels: A) GAUSSIAN MIXTURE and B) POLYNOMIAL GMDH NETWORK; axes: F1 (Hz) versus F2 (Hz).] \n\nFigure 1: Decision Regions Created by (A) RBF and (B) GMDH Classifiers for the \nVowel Problem. \n\n3 DECISION REGIONS \nClassifiers differ not only in their structure and training but also in how decision \nregions are formed. Decision regions formed by the RBF classifier for the vowel \nproblem are shown in Figure 1A. Boundaries are smooth spline-like curves that \ncan form arbitrarily complex regions. This improves generalization for many real \nproblems where data for different classes form one or more roughly ellipsoidal clusters. \nDecision regions for the high-order polynomial (GMDH) network classifier are \nshown in Figure 1B. Decision region boundaries are smooth and well behaved only \nin regions of the input space that are densely sampled by the training data. 
Decision \nboundaries are erratic in regions where there is little training data due to the high \npolynomial order of the discriminant functions formed by the GMDH classifier. As \na result, the GMDH classifier generalizes poorly in regions with little training data. \nDecision boundaries for the linear tree classifier are hyperplanes. This classifier may \nalso generalize poorly if data is in ellipsoidal clusters. \n\n[Figure 2 panels: BULLSEYE, DISJOINT, VOWEL, and DIGIT; vertical axis: classification error (%); horizontal axis: classifier.] \n\nFigure 2: Test Data Error Rates for All Classifiers and All Problems. \n\n4 ERROR RATES \nFigure 2 shows the classification (test set) error rates for all classifiers on the bullseye, \ndisjoint, vowel, and digit problems. The solid line in each plot represents the \nmean test set error rate across all the classifiers for that problem. The shaded regions \nrepresent one binomial standard deviation, $\\sigma$, above and below. The binomial \nstandard deviation was calculated as $\\sigma = \\sqrt{\\epsilon(1 - \\epsilon)/N}$, where $\\epsilon$ is the estimated \naverage problem test set error rate and $N$ is the number of test patterns for each \nproblem. The shaded region gives a rough measure of the range of expected statistical \nfluctuation if the error rates for different classifiers are identical. 
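As a rough illustration of the binomial standard deviation just described, the following Python sketch computes the half-width of the shaded band from a mean error rate and a test set size. The function name binomial_std and the example error rate of 0.20 are illustrative assumptions, not values reported in the study; only the 333-pattern vowel test set size is taken from the text.

```python
import math


def binomial_std(error_rate, n_test):
    # Standard deviation of an error-rate estimate over n_test independent test patterns:
    # sqrt(e * (1 - e) / N)
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError('error_rate must lie in [0, 1]')
    if n_test <= 0:
        raise ValueError('n_test must be positive')
    return math.sqrt(error_rate * (1.0 - error_rate) / n_test)


# Hypothetical mean error rate of 20% on a 333-pattern test set (assumed for illustration)
mean_error = 0.20
sigma = binomial_std(mean_error, 333)
lower, upper = mean_error - sigma, mean_error + sigma
print(f'sigma = {sigma:.4f}, band = [{lower:.4f}, {upper:.4f}]')
```

Because the band width shrinks only with the square root of the number of test patterns, small differences in error rate measured on small test sets tend to fall inside it.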
A more \ndetailed statistical analysis of the test set error rates for classifiers was performed \nusing McNemar's significance test. At a significance level of $\\alpha = 0.01$, the error \nrates of the different classifiers on the bullseye, disjoint, and vowel problems do not \ndiffer significantly from each other. \n\nPerformance on the more difficult digit problem, however, did differ significantly \nacross classifiers. This problem has very little training data (10 training patterns \nper class) and high dimensional inputs (an input dimension of 22). Some classifiers, \nincluding the RBF and Gaussian mixture classifiers, were able to achieve very low \nerror rates on this problem and generalize well even in this high dimensional space \nwith little training data. Other classifiers, including the multi-scale RBF, KD-tree, \nand CKNN classifiers, provided intermediate error rates. The GMDH network \nclassifier and the linear tree classifier provided high error rates. \n\nThe linear tree classifier performed poorly on the digit problem because there is \nnot enough training data to sample the input space densely enough for the training \nalgorithm to form decision boundaries that can generalize well. The poor performance \nof the GMDH network classifier is due, in part, to the inability of the GMDH \nnetwork classifier to extrapolate well to regions with little training data. \n\n5 PERFORMANCE TRADE-OFFS \nAlthough differences in the error rates of most classifiers are small, differences in \npractical performance characteristics are often large. For example, on the vowel \nproblem, although both the Gaussian mixture and KD tree classifiers perform well, \nthe Gaussian mixture classifier requires 20 times less classification memory than the \nKD tree classifier, but takes 10 times longer to train. \n\n
[Figure 3 panels: A) VOWEL PROBLEM and B) DIGIT PROBLEM; vertical axis: training time (CPU seconds, logarithmic scale); horizontal axis: CLASSIFICATION MEMORY (BYTES); points labeled GMDH, RBF-MS, L-TREE, RBF, CKNN, GMIX, KD-TREE, and KNN.] \n\nFigure 3: Training Time Versus Classification Memory Usage For All Classifiers On \nThe (A) Vowel And (B) Digit Problems. \n\nFigure 3 shows the relationship between training time (in CPU seconds measured on \na Sun 3/110 with FPA) and classification memory usage (in bytes) for the different \nclassifiers on the vowel and digit problems. On these problems, the KNN and KD-tree \nclassifiers train quickly, but require large amounts of memory. The Gaussian \nmixture (GMIX) and linear tree (L-TREE) classifiers use little memory, but require \nmore training time. The RBF and CKNN classifiers have intermediate memory and \ntraining time requirements. Due to the extra basis functions, the multiscale RBF \n(RBF-MS) classifier requires more training time and memory than the conventional \nRBF classifier. The GMDH classifier has intermediate memory requirements, but \ntakes the longest to train. On average, the GMDH classifier takes 10 times longer \nto train than the RBF classifier, and 100 times longer than the KD tree classifier. \nIn general, classifiers that use little memory require long training times, while those \nthat train rapidly are not memory efficient. \n
\nFigure 4 shows the relationship between classification time (in CPU milliseconds \n\n\fPractical Characteristics of Neural Network and Conventional Pattern Classifiers \n\n975 \n\nA) VOWEL PROBLEM \n\nB) DIGIT PROBLEM \n\n50~----------TI-----------'I---' \n\n100 ,-----------,--1----------...-.---, \n\nRBF\u00b7MS \n\n30 \n\n6\" 40 \nw en \n::2 \nw \n::2 \n~ \nz \n0 \n~ \n<t \n(.) \nli: \n(j) \nen \n:5 \n\n20 \n\n10 \n\n() \n\nGMIX \u2022 \n\nCKNN \n\nGMOH \n\n\u2022 \n\u2022 \u2022 \n\u2022\u2022 \n\nRBF\u00b7MS \n\nKNN \n\nKO\u00b7TREE \n\nRBF \n\n\u2022 \n\n-\n\n6\" 80 \nw \nen \n::2 \nw \n::2 60 \ni= \nz \nQ \nf-<3 40 \nli: \nen \nen \n:5 20 \n\n() \n\nGMOH \u2022 \n\u2022 GMIX \n\n\u2022 CKNN \n\nKNN \u2022 \u2022 \n\nKO\u00b7TREE \n\n\u2022 \nRBF \n\n-\n\n-\n\n-\n\n-\n\nOL-________ ~-~~I----------~I--~ \n\nl\u00b7TREE \n\n100 \n10000 \nCLASSIFICATION MEMORY USAGE (BYTES) \n\n1000 \n\nl\u00b7TREE \n\nOL---------~\u00b7~I----------~I--~ \n100 \n10000 \nCLASSIFICATION MEMORY USAGE (BYTES) \n\n1000 \n\nFigure 4: Classification Time Versus Classification Memory Usage For All Classifiers \nOn The (A) Vowel And (B) Digit Problems. \n\nfor one pattern) and classification memory usage (in bytes) for the different clas(cid:173)\nsifiers on the vowel and digit problems. At one extreme, the linear tree classifier \nrequires very little memory and classifies almost instantaneously. At the other, the \nGMDH classifier takes the longest to classify and requires a large amount of mem(cid:173)\nory. Gaussian mixture and RBF classifiers are intermediate. On the vowel problem, \nthe CKNN and the KD tree classifiers are faster than the conventional KNN clas(cid:173)\nsifier. On the digit problem, the CKNN classifier is faster than both the KD tree \nand KNN classifiers because of the greatly reduced number of stored patterns (15 \nout of 70). The speed up in search provided by the KD tree is greatly reduced for \nthe digit problem due to the increase in input dimensionality. 
In general, the trend \nis for classification time to be proportional to the amount of classification memory. \nIt is important to note, however, that trade-offs in performance characteristics depend \non the particular problem and can vary for different implementations of the \nclassifiers. \n\n6 SUMMARY \nSeven different neural network and conventional pattern classifiers were compared \nusing artificial and speech recognition tasks. High order polynomial GMDH classifiers \ntypically provided intermediate error rates and often required long training \ntimes and large amounts of memory. In addition, the decision regions formed did \nnot generalize well to regions of the input space with little training data. Radial basis \nfunction classifiers generalized well in high dimensional spaces, and provided low \nerror rates with training times that were much less than those of back-propagation \nclassifiers (Lee and Lippmann, 1989). Gaussian mixture classifiers provided good \nperformance when the numbers and types of mixtures were selected carefully to \nmodel class densities well. Linear tree classifiers were the most computationally efficient \nbut performed poorly with high dimensionality inputs and when the number \nof training patterns was small. KD-tree classifiers reduced classification time by a \nfactor of four over conventional KNN classifiers for low-dimensional (2-input) problems. \nThey provided little or no reduction in classification time for high-dimensional (22-input) \nproblems. Improved condensed KNN classifiers reduced memory requirements \nover conventional KNN classifiers by a factor of two to fifteen for all problems, \nwithout increasing the error rate significantly. \n\n7 CONCLUSION \nThis and a previous study (Lee and Lippmann, 1989) explored the performance of 18 \nneural network, AI, and statistical pattern classifiers. 
Both studies demonstrated \nthe need to carefully select and tune global parameters and the need to match \nclassifier complexity to that of the training data using cross-validation and/or information \ntheoretic approaches. Two new variants of existing classifiers (multi-scale \nRBF and improved versions of the CKNN classifier) were developed as part of this \nstudy. Classification error rates on speech problems in both studies were equivalent \nwith most classifiers when classifiers were powerful enough to form minimum \nerror decision regions, when sufficient training data was available, and when classifiers \nwere carefully tuned. Practical classifier characteristics including training \ntime, classification time, and memory usage, however, differed by orders of magnitude. \nThese results suggest that the selection of a classifier for a particular task \nshould be guided not so much by small differences in error rate, but by practical \nconsiderations concerning memory usage, ease of implementation, computational \nresources, and restrictions on training and classification times. Researchers should \ntake time to understand the wide range of classifiers that are available and the \npractical tradeoffs that these classifiers provide. \n\nAcknowledgements \nThis work was sponsored by the Air Force Office of Scientific Research and the Department \nof the Air Force. \n\nReferences \nEric I. Chang and Richard P. Lippmann. Using Genetic Algorithms to Improve Pattern \nClassification Performance. In R. Lippmann, J. Moody, and D. Touretzky (Eds.), Advances \nin Neural Information Processing Systems 3, 1990. \n\nYuchun Lee. Classifiers: Adaptive modules in pattern recognition systems. Master's \nThesis, Massachusetts Institute of Technology, Department of Electrical Engineering and \nComputer Science, Cambridge, MA, May 1989. \n\nYuchun Lee and R. P. Lippmann. 
Practical Characteristics of Neural Network and Conventional \nPattern Classifiers on Artificial and Speech Problems. In D. Touretzky (Ed.), \nAdvances in Neural Information Processing Systems 2, 168-177, 1989. \n\nR. P. Lippmann. Pattern Classification Using Neural Networks. IEEE Communications \nMagazine, 27(27):47-54, 1989. \n\nKenney Ng. A Comparative Study of the Practical Characteristics of Neural Network and \nConventional Pattern Classifiers. Master's Thesis, Massachusetts Institute of Technology, \nDepartment of Electrical Engineering and Computer Science, Cambridge, MA, May 1990. \n", "award": [], "sourceid": 431, "authors": [{"given_name": "Kenney", "family_name": "Ng", "institution": null}, {"given_name": "Richard", "family_name": "Lippmann", "institution": null}]}