{"title": "Identifying Fault-Prone Software Modules Using Feed-Forward Networks: A Case Study", "book": "Advances in Neural Information Processing Systems", "page_first": 793, "page_last": 800, "abstract": null, "full_text": "Identifying Fault-Prone Software \n\nModules Using Feed-Forward Networks: \n\nA Case Study \n\nN. Karunanithi \n\nRoom 2E-378, Bellcore \n\n435 South Street \n\nMorristown, NJ 07960 \n\nE-mail: karun@faline.bellcore.com \n\nAbstract \n\nFunctional complexity of a software module can be measured in \nterms of static complexity metrics of the program text. Classify(cid:173)\ning software modules, based on their static complexity measures, \ninto different fault-prone categories is a difficult problem in soft(cid:173)\nware engineering. This research investigates the applicability of \nneural network classifiers for identifying fault-prone software mod(cid:173)\nules using a data set from a commercial software system. A pre(cid:173)\nliminary empirical comparison is performed between a minimum \ndistance based Gaussian classifier, a perceptron classifier and a \nmultilayer layer feed-forward network classifier constructed using \na modified Cascade-Correlation algorithm. The modified version \nof the Cascade-Correlation algorithm constrains the growth of the \nnetwork size by incorporating a cross-validation check during the \noutput layer training phase. Our preliminary results suggest that \na multilayer feed-forward network can be used as a tool for iden(cid:173)\ntifying fault-prone software modules early during the development \ncycle. Other issues such as representation of software metrics and \nselection of a proper training samples are also discussed. \n\n793 \n\n\f794 \n\nKarunanithi \n\n1 Problem Statement \n\nDeveloping reliable software at a low cost is an important issue in the area of soft(cid:173)\nware engineering (Karunanithi, Whitley and Malaiya, 1992). 
Both the reliability of a software system and the development cost can be reduced by identifying troublesome software modules early in the development cycle. Many measurable program attributes have been identified and studied to characterize the intrinsic complexity and the fault proneness of software systems. The intuition behind software complexity metrics is that complex program modules tend to be more error prone than simple modules. By controlling the complexity of software modules during development, one can produce software systems that are easy to maintain and enhance (because simple program modules are easy to understand). Static complexity metrics are measured from the passive program texts early in the development cycle and can be used as valuable feedback for allocating resources in future development efforts (future releases or new projects). \n\nTwo approaches can be applied to relate static complexity measures with faults found or program changes made during testing. In the estimative approach, regression models are used to predict the actual number of faults that will be disclosed during testing (Lipow, 1982; Gaffney, 1984; Shen et al., 1985; Crawford et al., 1985; Munson and Khoshgoftaar, 1992). Regression models assume that the metrics that constitute the independent variables are independent and normally distributed. However, most practical measures often violate the normality assumption and exhibit high correlation with other metrics (i.e., multicollinearity). The resulting fit of the regression models often tends to produce inconsistent predictions. \n\nUnder the classification approach, software modules are categorized into two or more fault-prone classes (Rodriguez and Tsai, 1987; Munson and Khoshgoftaar, 1992; Karunanithi, 1993; Khoshgoftaar et al., 1993). 
A special case of the classification approach is to classify software modules into either low-fault (non-complex) or high-fault (complex) categories. The main rationale behind this approach is that software managers are often interested in getting approximate feedback from this type of model rather than accurate predictions of the number of faults that will be disclosed. Existing two-class categorization models are based on the linear discriminant principle (Rodriguez and Tsai, 1987; Munson and Khoshgoftaar, 1992). Linear discriminant models assume that the metrics are orthogonal and that they follow a normal distribution. To reduce multicollinearity, researchers often use principal component analysis or some other dimensionality reduction technique. However, the reduced metrics may not explain all the variability if the original metrics have nonlinear relationships. \n\nIn this paper, the applicability of neural network classifiers for identifying the fault proneness of software modules is examined. The motivation behind this research is to evaluate whether classifiers can be developed without the usual assumptions about the input metrics. In order to study the usefulness of neural network classifiers, a preliminary comparison is made between a simple minimum distance based Gaussian classifier, a single layer perceptron and a multilayer feed-forward network developed using a modified version of Fahlman's Cascade-Correlation algorithm (Fahlman and Lebiere, 1990). The modified algorithm incorporates a cross-validation check for constraining the growth of the size of the network. In this investigation, other issues such as the selection of proper training samples and the representation of metrics are also considered. 
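As a side note on the dimensionality reduction mentioned above, a principal component projection of a modules-by-metrics matrix can be sketched as follows. This is only an illustration of the technique being criticized: the data are synthetic, and the variance threshold is an arbitrary choice, not a value from the study.

```python
import numpy as np

def principal_components(metrics, var_explained=0.95):
    """Project a modules-by-metrics matrix onto the principal
    components that explain the requested fraction of variance."""
    centered = metrics - metrics.mean(axis=0)
    # SVD of the centered data gives the principal axes directly.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variance = s ** 2 / (len(metrics) - 1)
    ratio = np.cumsum(variance) / variance.sum()
    k = int(np.searchsorted(ratio, var_explained)) + 1
    return centered @ vt[:k].T, k

# Synthetic example: two highly correlated "metrics" (multicollinearity)
# plus one independent metric; PCA collapses the correlated pair.
rng = np.random.default_rng(0)
loc = rng.normal(size=(100, 1))
metrics = np.hstack([loc,
                     2 * loc + 0.01 * rng.normal(size=(100, 1)),
                     rng.normal(size=(100, 1))])
reduced, k = principal_components(metrics)
```

Because the first two synthetic columns are nearly collinear, two components suffice to cover 95% of the variance here.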
\n\n2 Data Set Used \n\nThe metrics data used in this study were obtained from research conducted by Lind and Vairavan (Lind and Vairavan, 1989) on a Medical Imaging System software. The complete system consisted of approximately 4500 modules amounting to about 400,000 lines of code written in Pascal, FORTRAN, PL/M and assembly language. From this set, a random sample of 390 high level language routines was selected for the analysis. For each module in the sample, program changes were recorded as an indication of software faults. The number of changes in the program modules varied from zero to 98. In addition to changes, 11 software complexity metrics were extracted from each module. These metrics range from total lines of code to Belady's bandwidth metric. (Readers curious about these metrics may refer to Table I of Lind and Vairavan, 1989.) For the purpose of our classification study, these metrics represent 11 input (both real and integer) variables of the classifier. \n\nA software module is considered a low fault-prone module (Category I) if there are 0 or 1 changes and a high fault-prone module (Category II) if there are 10 or more changes. The remaining modules are considered the medium fault category. For the purpose of this study we consider only the low and high fault-prone modules. Our extreme categorization and deliberate discarding of program modules is similar to the approach used in other studies (Rodriguez and Tsai, 1987; Munson and Khoshgoftaar, 1992). After discarding medium fault-prone modules, there are 203 modules left in the data set. Of the 203 modules, 114 belong to the low fault-prone category while the remaining 89 belong to the high fault-prone category. The output layer of the neural nets had two units corresponding to the two fault categories. 
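The two-way labeling rule above (0 or 1 changes maps to Category I, 10 or more to Category II, and the medium band is discarded) can be sketched as follows; the function name is illustrative, not from the study.

```python
def label_module(changes):
    """Map a module's recorded change count to a fault-prone category.

    Returns 1 (low fault-prone) for 0-1 changes, 2 (high fault-prone)
    for 10 or more changes, and None for the discarded medium range.
    """
    if changes <= 1:
        return 1
    if changes >= 10:
        return 2
    return None

# Modules with 2-9 changes fall in the medium band and are dropped.
labels = [label_module(c) for c in (0, 1, 5, 10, 98)]
```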
\n\n3 Training Data Selection \n\nWe had two objectives in selecting training data: 1) to evaluate how well a neural network classifier will perform across different sized training sets and 2) to keep the training data as unbiased as possible. The first objective was motivated by the need to evaluate whether a neural network classifier can be used early in the software development cycle. Thus the classification experiments were conducted using training samples of size S = 1/4, 1/3, 1/2, 2/3, 3/4 and 9/10 of the 203 samples belonging to Categories I and II. The remaining (1 - S) fraction of the samples was used for testing the classifiers. In order to avoid bias in the training data, we randomly selected 10 different training samples for each fraction S. This resulted in 6 x 10 (= 60) different training and test sets. \n\n4 Classifiers Compared \n\n4.1 A Minimum Distance Classifier \n\nIn order to compare neural network classifiers and linear discriminant classifiers we implemented a simple minimum distance based two-class Gaussian classifier of the form (Nilsson, 1990): \n\n|X - Gi| = ((X - Gi)(X - Gi)^t)^(1/2) \n\nwhere Gi, i = 1, 2, represent the prototype points for Categories I and II, X is an 11-dimensional metrics vector, and t is the transpose operator. The prototype points G1 and G2 are calculated from the training set based on the normality assumption. In this approach a given arbitrary input vector X is placed in Category I if |X - G1| < |X - G2| and in Category II otherwise. \n\nAll raw component metrics had distributions that are asymmetric with a positive skew (i.e., a long tail to the right) and they had different numerical ranges. Note that asymmetric distributions do not conform to the normality assumption of a typical Gaussian classifier. 
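The minimum distance decision rule above can be sketched directly: compute a prototype (mean vector) per category from the training set and assign a module to the nearer prototype. This is a minimal sketch with synthetic two-dimensional data (the study's vectors are 11-dimensional), not the study's implementation.

```python
import numpy as np

def fit_prototypes(X, y):
    """Prototype points G1, G2: the per-category mean vectors."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def classify(x, prototypes):
    """Assign x to the category whose prototype is nearest in
    Euclidean distance, i.e. |X - G1| < |X - G2| -> Category I."""
    return min(prototypes, key=lambda c: np.linalg.norm(x - prototypes[c]))

# Two synthetic clusters standing in for the metric vectors.
X = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 9.0], [10.0, 10.0]])
y = np.array([1, 1, 2, 2])
protos = fit_prototypes(X, y)
```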
First, to remove the extreme asymmetry of the original distribution of each individual metric, we transformed each metric using the natural logarithm. Second, to mask the influence of individual component metrics on the distance score, we divided each metric by its standard deviation over the training set. These transformations considerably improved the performance of the Gaussian classifier. To be consistent in our comparison we used the log transformed inputs for the other classifiers as well. \n\n4.2 A Perceptron Classifier \n\nA perceptron with a hard-limiting threshold can be considered a realization of a non-parametric linear discriminant classifier. If we use a sigmoidal unit, then the continuous valued output of the perceptron can be interpreted as a likelihood or probability with which inputs are assigned to different classes. In our experiment we implemented a perceptron with two sigmoidal units (outputs 1 and 2) corresponding to the two categories. A given arbitrary vector X is assigned to Category I if the value of output unit 1 is greater than that of output unit 2, and to Category II otherwise. The weights of the network are determined iteratively using a least squares error minimization procedure. In almost all our experiments, the perceptron learned about 75 to 80 percent of the training set. This implies that the remaining training samples are not linearly separable. \n\n4.3 A Multilayer Network Classifier \n\nTo evaluate whether a multilayer network can perform better than the other two classifiers, we repeated the same set of experiments using feed-forward networks constructed by Fahlman's Cascade-Correlation algorithm. The Cascade-Correlation algorithm is a constructive training algorithm which builds a suitable network architecture by adding one hidden (layer) unit at a time. (Refer to Fahlman and Lebiere, 1990 for more details on the Cascade-Correlation algorithm.) 
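The two-sigmoidal-output perceptron of Section 4.2 can be sketched as below: a single weight layer trained by gradient descent on the squared error, with a pattern assigned to Category I when output 1 exceeds output 2. The data, learning rate and epoch count are illustrative, not the study's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_perceptron(X, T, epochs=2000, lr=0.5):
    """Fit a single-layer net with two sigmoidal output units by
    batch gradient descent on the sum-of-squares error."""
    Xb = np.hstack([X, np.ones((len(X), 1))])        # append bias column
    W = np.zeros((Xb.shape[1], 2))
    for _ in range(epochs):
        Y = sigmoid(Xb @ W)
        grad = Xb.T @ ((Y - T) * Y * (1 - Y))        # d(SSE)/dW
        W -= lr * grad
    return W

def predict(X, W):
    """Category I (1) if output unit 1 exceeds unit 2, else II (2)."""
    Y = sigmoid(np.hstack([X, np.ones((len(X), 1))]) @ W)
    return np.where(Y[:, 0] > Y[:, 1], 1, 2)

# Linearly separable toy data standing in for the log-transformed metrics.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
T = np.array([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=float)
W = train_perceptron(X, T)
```

On a linearly separable toy set like this one the perceptron fits all patterns; the paper's point is that roughly 20-25 percent of the real training patterns could not be fit this way.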
Our initial results suggested that the multilayer networks constructed by the Cascade-Correlation algorithm are not capable of producing a better classification accuracy than the other two classifiers. An analysis of the networks suggested that the resulting networks had too many free parameters (i.e., too many hidden units). A further analysis of the rate of decrease of the residual error versus the number of hidden units added to the networks revealed that the Cascade-Correlation algorithm tends to add more hidden units to learn individual training patterns at the later stages of the training phase than in the earlier stages. This happens if the training set contains patterns that are interspersed across different decision regions, or what might be called \"border patterns\" (Ahmed and Tesauro, 1989). In an effort to constrain the growth of the size of the network, we modified the Cascade-Correlation algorithm to incorporate a cross-validation check during the output layer training phase. For each training set of size S, one third was used for cross-validation and the remaining two thirds were used to train the network. The network construction was stopped as soon as the residual error on the cross-validation set failed to decrease from the residual error at the end of the previous output layer training phase. The resulting networks learned about 95% of the training patterns. However, the cross-validated construction considerably improved the classification performance of the networks on the test set. Table 1, presented in the next section, provides a comparison between the networks developed with and without cross-validation. 
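The cross-validation stopping check described above can be sketched abstractly. The `train_phase` callable below is a hypothetical stand-in for one output-layer training phase of a Cascade-Correlation trainer; all it must do is report the residual error on the held-out third after training with a given number of hidden units.

```python
def grow_with_cross_validation(train_phase, max_hidden=50):
    """Add hidden units one at a time; stop as soon as the residual
    error on the cross-validation set fails to decrease relative to
    the previous output-layer training phase.

    train_phase(n_hidden) trains with n_hidden units and returns the
    cross-validation residual error after that phase.
    """
    best = float("inf")
    for n_hidden in range(max_hidden + 1):
        err = train_phase(n_hidden)
        if err >= best:                 # validation error stopped improving
            return n_hidden - 1, best   # keep the previous network size
        best = err
    return max_hidden, best

# Stand-in trainer: validation error falls, then rises (overfitting).
errors = [0.9, 0.5, 0.3, 0.25, 0.27, 0.4]
size, err = grow_with_cross_validation(lambda n: errors[n])
```

With the illustrative error sequence above, growth stops at three hidden units, mirroring how the modified algorithm keeps the last network whose validation error was still improving.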
\n\nTraining   Without Cross-Validation                                    With Cross-Validation \nSet Size   Hidden Units    Type I Error    Type II Error                Hidden Units    Type I Error    Type II Error \nS in %     Mean    Std     Mean    Std     Mean    Std                  Mean    Std     Mean    Std     Mean    Std \n25         5.1     1.9     24.64   7.2     16.38   6.4                  1.5     1.3     20.19   5.4     12.11   4.7 \n33         6.2     2.2     20.24   8.4     17.27   5.5                  1.8     1.0     18.24   5.5     12.40   4.1 \n50         7.4     2.0     18.30   7.4     18.65   6.4                  1.8     0.9     17.41   5.6     15.04   5.2 \n67         9.7     2.7     15.78   6.5     18.05   7.1                  1.7     1.1     14.32   5.8     14.08   5.5 \n75         10.4    2.7     14.54   7.6     16.85   7.3                  1.8     1.3     13.27   7.0     13.84   5.4 \n90         11.2    2.9     10.33   7.2     17.73   8.3                  1.6     1.2     9.77    9.4     15.47   5.1 \n\nTable 1: A Comparison of Nets With and Without Cross-Validation. \n\n5 Results \n\nIn this section we present some preliminary results from our classification experiments. First, we provide a comparison between the multilayer networks developed with and without cross-validation. Next, we compare the different classifiers in terms of their classification accuracy. Since a neural network's performance can be affected by the weight vector used to initialize the network, we repeated the training experiment 25 times with different initial weight vectors for each training set. This resulted in a total of 250 training trials for each value of S. The results reported here for the neural network classifiers represent summary statistics over 250 experiments. \n\nThe performance of the classifiers is reported in terms of classification errors. There are two types of classification errors that a classifier can make: a Type I error occurs when the classifier identifies a low fault-prone (Category I) module as a high fault-prone (Category II) module; a Type II error is produced when a high fault-prone module is identified as a low fault-prone module. 
From a software manager's point of view, these classification errors have different implications. A Type I misclassification results in a waste of test resources (because modules that are less fault-prone may be tested longer than normally required). On the other hand, a Type II misclassification results in releasing products of inferior quality. From a reliability point of view, a Type II error is a more serious error than a Type I error. \n\nNo. of Patterns            Error Statistics \nS      Training   Test    Gaussian          Perceptron        Multilayer Nets \nin %   Set        Set     Mean    Std       Mean    Std       Mean    Std \n\nType I Error Statistics \n25     50         86      13.16   4.7       16.17   5.5       20.19   5.4 \n33     66         77      11.44   4.0       11.74   3.9       18.24   5.5 \n50     101        57      12.45   3.2       11.58   3.2       17.41   5.6 \n67     136        37      9.46    4.1       10.14   3.9       14.32   5.8 \n75     152        28      8.57    5.4       9.15    5.8       13.27   7.0 \n90     182        12      14.17   7.9       4.03    4.3       9.77    9.4 \n\nType II Error Statistics \n25     50         67      15.61   4.2       15.98   7.8       12.11   4.7 \n33     66         60      15.46   4.6       15.78   6.6       12.40   4.1 \n50     101        45      16.01   5.1       16.97   6.8       15.04   5.2 \n67     136        30      16.00   5.4       16.11   7.6       14.08   5.5 \n75     152        23      17.39   5.8       18.39   6.3       13.84   5.4 \n90     182        9       21.11   6.3       19.11   5.6       15.47   5.1 \n\nTable 2: A Summary of Type I and Type II Error Statistics. \n\nTable 1 compares the complexity and the performance of the multilayer networks developed with and without cross-validation. Columns 2 through 7 represent the size and the performance of the networks developed by Cascade-Correlation without cross-validation. The remaining six columns correspond to the networks constructed with cross-validation. 
Hidden unit statistics for the networks suggest that the growth of the network can be constrained by adding a cross-validation check during the output layer training. The corresponding error statistics for both the Type I and Type II errors suggest that an improvement in classification accuracy can be achieved by cross-validating the size of the networks. \n\nTable 2 illustrates the preliminary results for the different classifiers. The first two columns in Table 2 represent the size of the training set in terms of S as a percentage of all patterns and the number of patterns, respectively. The third column represents the number of test patterns in Categories I (1st half) and II (2nd half). The remaining six columns represent the error statistics for the three classifiers in terms of percentage mean errors and standard deviations. The percentage errors were obtained by dividing the number of misclassifications by the total number of test patterns in that category. The Type I error statistics in the first half of the table suggest that the Gaussian and perceptron classifiers may be better than multilayer networks at early stages of the software development cycle. However, the performance advantage of the Gaussian classifier is not consistent across all values of S. The neural network classifiers seem to improve their performance with an increase in the size of the training set. Among the neural networks, the perceptron classifier seems to perform better classification than a multilayer net. However, the Type II error statistics in the second half of the table suggest that a multilayer network classifier may provide a better classification of Category II modules than the other two classifiers. This is an important result from the reliability perspective. 
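The two percentage error rates used throughout this section can be computed as sketched below, with each rate normalized by the number of test patterns actually in that category, as described above; the function and data are illustrative.

```python
def error_rates(actual, predicted):
    """Percentage Type I and Type II error rates.

    Type I: low fault-prone (1) misclassified as high fault-prone (2).
    Type II: high fault-prone (2) misclassified as low fault-prone (1).
    Each count is divided by the number of test patterns actually in
    that category, matching the normalization used for Table 2.
    """
    n1 = sum(1 for a in actual if a == 1)
    n2 = sum(1 for a in actual if a == 2)
    t1 = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 2)
    t2 = sum(1 for a, p in zip(actual, predicted) if a == 2 and p == 1)
    return 100.0 * t1 / n1, 100.0 * t2 / n2

# Toy labels: one Category I module and one Category II module misclassified.
actual    = [1, 1, 1, 1, 2, 2, 2, 2, 2]
predicted = [1, 2, 1, 1, 2, 2, 1, 2, 2]
type1, type2 = error_rates(actual, predicted)
```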
\n\n6 Conclusion and Work in Progress \n\nWe demonstrated the applicability of neural network classifiers for identifying fault-prone software modules. We compared the classification efficacy of three different pattern classifiers using a data set from a commercial software system. Our preliminary empirical results are encouraging in that there is a role for multilayer feed-forward networks either during the software development cycle of a subsequent release or for a similar product. \n\nThe cross-validation implemented in our study is a simple heuristic for constraining the size of the networks constructed by the Cascade-Correlation algorithm. Though this improved the performance of the resulting networks, it should be cautioned that cross-validation may be needed only if the training patterns exhibit certain characteristics. In other circumstances, the networks may have to be constructed using the entire training set. At this stage we have not performed a complete analysis of what characteristics of the training samples would require cross-validation for constraining the network growth. Nor have we used other sophisticated structure reduction techniques. We are currently exploring different loss functions and structure reduction techniques. \n\nThe Cascade-Correlation algorithm always constructs a deep network. Each additional hidden unit develops an internal representation that is a higher order sigmoidal computation than those of previously added hidden units. Such a complex internal representation may not be appropriate in a classification application such as the one studied here. We are currently exploring alternatives for constructing shallow networks within the Cascade-Correlation framework. \n\nAt this stage, we have not performed any analysis of how the internal representations of a multilayer network correlate with the input metrics. 
This is currently being studied. \n\nReferences \n\nAhmed, S. and G. Tesauro (1989). \"Scaling and Generalization in Neural Networks: A Case Study\", Advances in Neural Information Processing Systems 1, pp. 160-168, D. Touretzky, ed., Morgan Kaufmann. \n\nCrawford, S. G., McIntosh, A. A. and D. Pregibon (1985). \"An Analysis of Static Metrics and Faults in C Software\", The Journal of Systems and Software, Vol. 5, pp. 37-48. \n\nFahlman, S. E. and C. Lebiere (1990). \"The Cascade-Correlation Learning Architecture\", Advances in Neural Information Processing Systems 2, pp. 524-532, D. Touretzky, ed., Morgan Kaufmann. \n\nGaffney Jr., J. E. (1984). \"Estimating the Number of Faults in Code\", IEEE Trans. on Software Eng., Vol. SE-10, No. 4, pp. 459-464. \n\nKarunanithi, N., Whitley, D. and Y. K. Malaiya (1992). \"Prediction of Software Reliability Using Connectionist Models\", IEEE Trans. on Software Eng., Vol. 18, No. 7, pp. 563-574. \n\nKarunanithi, N. (1993). \"Identifying Fault-Prone Software Modules Using Connectionist Networks\", Proc. of the 1st Int'l Workshop on Applications of Neural Networks to Telecommunications (IWANNT'93), pp. 266-272, J. Alspector et al., ed., Lawrence Erlbaum. \n\nKhoshgoftaar, T. M., Lanning, D. L. and A. S. Pandya (1993). \"A Neural Network Modeling Methodology for the Detection of High-Risk Programs\", Proc. of the 4th Int'l Symp. on Software Reliability Eng., pp. 302-309. \n\nLind, R. K. and K. Vairavan (1989). \"An Experimental Investigation of Software Metrics and Their Relationship to Software Development Effort\", IEEE Trans. on Software Eng., Vol. 15, No. 5, pp. 649-653. \n\nLipow, M. (1982). \"Number of Faults Per Line of Code\", IEEE Trans. on Software Eng., Vol. SE-8, No. 4, pp. 437-439. \n\nMunson, J. C. and T. M. Khoshgoftaar (1992). \"The Detection of Fault-Prone Programs\", IEEE Trans. on Software Eng., Vol. 18, No. 5, pp. 423-433. 
\n\nNilsson, N. J. (1990). The Mathematical Foundations of Learning Machines, Morgan Kaufmann, Chapters 2 and 3. \n\nRodriguez, V. and W. T. Tsai (1987). \"A Tool for Discriminant Analysis and Classification of Software Metrics\", Information and Software Technology, Vol. 29, No. 3, pp. 137-149. \n\nShen, V. Y., Yu, T., Thebaut, S. M. and T. R. Paulsen (1985). \"Identifying Error-Prone Software: An Empirical Study\", IEEE Trans. on Software Eng., Vol. SE-11, No. 4, pp. 317-323.", "award": [], "sourceid": 741, "authors": [{"given_name": "N.", "family_name": "Karunanithi", "institution": null}]}