
Investigating the Capability of Object-Oriented Metrics for Fault Proneness

Submitted in partial fulfillment of the requirements for the degree of

Master of Technology

by

Santosh Singh Rathore

(Roll no: 1120103)

under the guidance of

Dr. Atul Gupta

Computer Science & Engineering

PDPM INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,

DESIGN AND MANUFACTURING JABALPUR

2013


Approval Sheet

This thesis entitled Investigating the Capability of Object-Oriented Metrics for Fault Proneness, submitted by Santosh Singh Rathore (1120103), is approved for partial fulfillment of the requirements for the degree of Master of Technology in Computer Science and Engineering.

Examining Committee

................................................

................................................

................................................

Guide

................................................

................................................

................................................

Chairman

................................................

Date .......................... ................................................

Place ......................... ................................................


Certificate

This is to certify that the thesis entitled "Investigating the Capability of Object-Oriented Metrics for Fault Proneness", submitted by Santosh Singh Rathore, Roll No. 1120103, in partial fulfillment of the requirements for the award of Master of Technology Degree in Computer Science and Engineering at PDPM Indian Institute of Information Technology, Design and Manufacturing Jabalpur, is an authentic work carried out by him under my supervision and guidance.

To the best of my knowledge, the matter embodied in the thesis has not been submitted elsewhere to any other university/institute for the award of any other degree.

(Atul Gupta) February 6, 2013

Associate Professor

Computer Science & Engineering Discipline

PDPM Indian Institute of Information Technology, Design and Manufacturing Jabalpur

India-482005


Acknowledgments

Foremost, I would like to express my sincere gratitude to my supervisor Dr. Atul Gupta for his continuous support of my M.Tech study and research, and for his valuable guidance, patience, motivation, enthusiasm, and immense knowledge. His approach towards software engineering will always be a valuable learning experience for me. His guidance helped me throughout the research and the writing of this thesis. I could not have imagined having a better supervisor for my M.Tech study. His dedication, professionalism and hard work have been, and shall remain, a source of inspiration throughout my life.

My deepest gratitude goes to my family for their unflagging love and support throughout my life; this thesis would simply have been impossible without them. I thank my parents for inspiring, encouraging and fully supporting me. I would also like to thank Pratibha and Deepika (my sisters) for bringing a light inside me and always filling me with the enthusiasm to do my work with complete effort and dedication.

I would like to thank Mr. Saurabh Tiwari and Mr. Deepak Banthia who, as good friends, were always willing to help and give their best suggestions. It would have been a lonely lab without them. I would also like to give my sincere thanks to Mr. Amaltas Khan, Mr. Amit Dhama, Mr. Arpit Gupta, Mr. Ravindra Singh and to my batch mates for their support and for being there always, no matter what. I would like to thank IIITDM and the Department of Computer Science for providing me with such a congenial environment, labs and other resources.

Santosh Singh Rathore


Abstract

Software fault prediction is used to streamline the efforts of software quality assurance (SQA) activities by identifying the more faulty modules first. It is typically done by training a prediction model over some known project data augmented with fault information, and subsequently using the prediction model to predict faults for unseen projects. However, the earlier efforts in fault prediction were based on classifying software modules as faulty or non-faulty. Such a prediction does not provide enough logistics to identify and fix the faults in the software system. Fault prediction can be more useful if, besides predicting whether software modules are faulty or non-faulty, their fault densities can also be predicted accurately. In this thesis, we aim to investigate the relationship between object-oriented (OO) metrics and their capability of predicting fault densities in object-oriented software. As a follow-up, we investigate two important and related issues relevant to fault prediction. First, how to select a subset of OO metrics that are significantly correlated with fault proneness? Subsequently, how to use this subset of metrics to predict fault densities in a given software system? Here, we present an approach to identify a subset containing software metrics with significant fault-correlation and then use this identified subset with count models to predict fault densities over the subsequent releases of the software system. To select significant metrics, we first evaluate each metric independently for its potential to predict faults by performing Univariate Logistic Regression analysis. Next, we perform Spearman's correlation and Multivariate Linear Regression analysis between the selected significant metrics to further refine the metrics subset for improved performance. The identified metrics subset is then used with the count models to predict fault densities. The results of the prediction were evaluated using confusion matrix parameters and a cost-benefit model. Our results suggest that, among the five count models used, negative binomial regression (NBR) analysis produced the best performance for fault densities prediction. Its predictive accuracy was the highest compared to the other count models. The results of the cost-benefit analysis also confirmed that the prediction model based on negative binomial regression was the most cost-effective compared to the other count models used in the study.


Table of Contents

Approval I

Certificate II

Acknowledgments III

Abstract IV

List of Figures IX

List of Tables XI

List of Symbols XIII

Abbreviations XIV

1 Introduction 1

1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Thesis Organization . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Related Work 6

2.1 Object-Oriented Metrics . . . . . . . . . . . . . . . . . . . . . . . 6

2.1.1 Metrics suites for object-oriented software . . . . . . . . . 7

2.1.2 Validation of object-oriented metrics . . . . . . . . . . . . 15

2.2 Public Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.3 Evaluation Measures . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4 Subset Selection of Object-Oriented Metrics for Fault Prediction . 22

2.5 Fault Prediction Studies . . . . . . . . . . . . . . . . . . . . . . . 24


2.5.1 Binary class classification of the faults . . . . . . . . . . . 24

2.5.2 Number of faults and the fault densities prediction . . . . 27

2.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 A Framework for Subset Selection of Object-Oriented Metrics for Fault Proneness 29

3.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 34

3.2.1 Metrics set used for investigation . . . . . . . . . . . . . . 34

3.2.2 Dependent variable . . . . . . . . . . . . . . . . . . . . . . 35

3.2.3 Project datasets . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2.4 Research questions . . . . . . . . . . . . . . . . . . . . . . 35

3.2.5 Experimental execution . . . . . . . . . . . . . . . . . . . 36

3.2.6 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2.6.1 Univariate logistic regression analysis . . . . . . . 37

3.2.6.2 Correlation analysis between metrics . . . . . . . 39

3.2.6.3 Multivariate linear regression analysis . . . . . . 40

3.2.6.4 Validation of prediction models over the successive releases . . . 41

3.2.7 Threats to validity . . . . . . . . . . . . . . . . . . . . . . 43

3.2.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46

4 A Count Model Based Analysis to Predict Fault Densities in Software Modules 48

4.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.1.1 Selection of fault-correlated metrics . . . . . . . . . . . . . 50

4.1.2 Count model analysis . . . . . . . . . . . . . . . . . . . . . 51

4.1.3 Evaluation of count models . . . . . . . . . . . . . . . . . 51

4.1.4 Cost-benefit model . . . . . . . . . . . . . . . . . . . . . . 52

4.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 54

4.2.1 Metrics set used for the experiment . . . . . . . . . . . . . 54

4.2.2 Project dataset . . . . . . . . . . . . . . . . . . . . . . . . 55

4.2.3 Count models . . . . . . . . . . . . . . . . . . . . . . . . . 56

4.2.3.1 Poisson regression model . . . . . . . . . . . . . . 56

4.2.3.2 Negative binomial regression model . . . . . . . . 57


4.2.3.3 Zero-inflated count model . . . . . . . . . . . . . 58

4.2.3.4 Generalized negative binomial regression model . 58

4.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

4.2.4.1 Prediction of the number of faults and fault densities 59

4.2.4.2 Evaluating the results of five count models . . . . 62

4.2.4.3 Prediction of the number of faults and the fault densities in the modules ranked as top 20% . . . 63

4.2.4.4 Cost-benefit analysis . . . . . . . . . . . . . . . . 64

4.2.5 Threats to validity . . . . . . . . . . . . . . . . . . . . . . 67

4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5 An Application of the Count Models to Predict Fault Densities With Binary Fault Classification 71

5.1 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

5.1.1 Subset selection of fault-correlated metrics . . . . . . . . . 72

5.1.2 Count model analysis . . . . . . . . . . . . . . . . . . . . . 72

5.1.3 Evaluation of count models . . . . . . . . . . . . . . . . . 73

5.1.4 Cost-benefit model . . . . . . . . . . . . . . . . . . . . . . 73

5.2 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.1 Metrics set used for the experiment . . . . . . . . . . . . . 73

5.2.2 Experimental data . . . . . . . . . . . . . . . . . . . . . . 74

5.2.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74

5.2.3.1 Prediction of the number of faults and the fault densities . . . 75

5.2.3.2 Evaluating the results of the five count models . . 77

5.2.3.3 Prediction of the number of faults and the fault densities in modules ranked as top 20% . . . 79

5.2.3.4 Cost-benefit analysis . . . . . . . . . . . . . . . . 79

5.2.4 Threats to validity . . . . . . . . . . . . . . . . . . . . . . 82

5.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

6 Conclusions and Future Work 84

References 86

Publications 93


List of Figures

1.1 Thesis organization . . . 3
3.1 Framework of proposed approach . . . 31
3.2 Results of the validation of prediction models constructed using the original set of metrics and using four machine-learning techniques . . . 44
3.3 Results of the validation of prediction models constructed using the identified subset of metrics and using four machine-learning techniques . . . 45
4.1 Overview of the proposed approach . . . 50
4.2 Result of the predicted number of faults using count models (PROP1-PROP6) . . . 59
4.3 Result of the predicted number of faulty modules using count models (PROP1-PROP6) . . . 60
4.4 Result of the fault densities prediction using count models (PROP1-PROP6) . . . 61
4.5 Comparison of count models using various confusion matrix criteria (PROP1-PROP6) . . . 62
4.6 Cost-benefit model for the count models (PROP1-PROP6) . . . 66
5.1 Result of the predicted number of faults using count models (PROP1-PROP6) . . . 75
5.2 Result of the predicted number of faulty modules using count models (PROP1-PROP6) . . . 76
5.3 Result of the fault densities prediction using count models (PROP1-PROP6) . . . 77
5.4 Comparison of count models using various confusion matrix criteria (PROP1-PROP6) . . . 78


5.5 Cost-benefit model for the count models (PROP1-PROP6) . . . . 80


List of Tables

2.1 CK metrics suite [15] . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.2 MOODS metrics suite [32] . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Wei Li metrics suite [45] . . . . . . . . . . . . . . . . . . . . . . . 9

2.4 Lorenz & Kidd metrics suite [48] . . . 9

2.5 Briand metrics suite [11] . . . . . . . . . . . . . . . . . . . . . . . 10

2.6 Bansiya’s metrics suite [4] . . . . . . . . . . . . . . . . . . . . . . 10

2.7 Other metrics suites . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.8 Summary of empirical study on object-oriented metrics . . . . . . 15

2.9 Datasets used in the study . . . . . . . . . . . . . . . . . . . . . . 20

2.10 Confusion matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.1 Datasets used for study . . . . . . . . . . . . . . . . . . . . . . . . 35

3.2 Univariate logistic regression analysis- Camel 1.0 to 1.4 . . . . . . 37

3.3 Univariate logistic regression analysis- Ivy 1.0 to 1.4 . . . . . . . 38

3.4 Univariate logistic regression analysis- Velocity 1.4 to 1.5 . . . . . 38

3.5 Univariate logistic regression analysis- Xalan 2.4 to 2.5 . . . . . . 38

3.6 Univariate logistic regression analysis- Xerces 1.2 to 1.3 . . . . . 39

3.7 Reduced metrics subset after ULR analysis . . . . . . . . . . . . . 39

3.8 Spearman’s correlation analysis over Camel project dataset . . . . 40

3.9 Spearman’s correlation analysis over Ivy project dataset . . . . . 40

3.10 Spearman’s correlation analysis over Velocity project dataset . . . 40

3.11 Spearman’s correlation analysis over Xalan project dataset . . . . 41

3.12 Spearman’s correlation analysis over Xerces project dataset . . . . 41

3.13 Multivariate linear regression analysis over Camel project datasets 41

3.14 Multivariate linear regression analysis over Ivy project datasets . 42

3.15 Multivariate linear regression analysis over Velocity project datasets 42

3.16 Multivariate linear regression analysis over Xalan project datasets 42


3.17 Multivariate linear regression analysis over Xerces project datasets 43

3.18 Resulting subset of metrics after MLR analysis . . . 43

4.1 Fault removal cost of testing techniques (in staff-hours per defect) 52

4.2 Fault identification efficiencies of different testing phases . . . . . 52

4.3 Identified metrics for each release of the PROP dataset . . . . . . 55

4.4 Detail of PROP project dataset used for study . . . . . . . . . . . 56

4.5 Percentage of faults contained in the modules ranked as top 20% (T=Training set) . . . 64
4.6 Percentage of fault density contained in the modules ranked as top 20% of modules (Fault density = faults/100 lines of code) . . . 65

5.1 Identified metrics for each release of the PROP dataset . . . . . . 74

5.2 Datasets used for the study . . . 74

5.3 Percentage of faults contained in the modules ranked as top 20% (T=Training set) . . . 79
5.4 Percentage of fault density contained in the modules ranked as top 20% of modules (Fault density = faults/100 lines of code) . . . 80


List of Symbols

Cf Normalized fault removal cost in field

Ci Initial setup cost of used fault prediction approach

Cs Normalized fault removal cost in system testing

Cu Normalized fault removal cost in unit testing

Mp Percentage of modules unit tested

δs Fault identification efficiency of system testing

δu Fault identification efficiency of unit testing


Abbreviations

Acc Accuracy

AUC Area under the ROC curve

Ecost Estimated fault removal cost of the software when we use fault prediction

FNR False negative rate

FP False positive

FPR False positive rate

NEcost Normalized estimated fault removal cost of the software when we use fault prediction

PR Precision

Tcost Estimated fault removal cost of the software without the use of fault prediction

TN True negative

TP True positive

ULR Univariate logistic regression

MLR Multivariate linear regression

NBRM Negative binomial regression model

PRM Poisson regression model

ZIP Zero-inflated Poisson regression model

GNBR Generalized negative binomial regression model

ZIN Zero-inflated negative binomial regression model

OO Object-oriented


Chapter 1

Introduction

Software quality assurance activities consist of monitoring and controlling the software development process to ensure the desired software quality at a lower cost [69]. They may include the application of formal code inspections, code walkthroughs, software testing, and fault prediction. Software fault prediction is a technique to identify fault-prone software modules by using some underlying properties of the software system. It is typically performed by training a prediction model over some known project data augmented with fault information, and subsequently using the prediction model to predict faults for unseen projects.

The underlying theory of software fault prediction is that a module currently under development is likely to be fault prone if a module with similar characteristics in an earlier project (or release) developed in the same environment was found to be faulty. In this case, the early detection of faulty modules can be useful to streamline the efforts applied in the later phases of software development by better focusing the quality assurance efforts on those modules.

The potential of software fault prediction to identify faulty software modules early in the development life cycle has gained considerable attention over the last two decades. The earlier fault prediction studies used a wide range of classification algorithms to predict the faultiness of software modules. Different experimental studies provide only a limited ability to comprehend each algorithm's strengths and weaknesses [37]. The prediction accuracy of fault-prediction techniques was found to be considerably low, ranging from 70 to 85 percent, with a high misclassification error rate [66] [18] [29]. An important concern related to fault prediction is the lack of suitable performance evaluation measures that would assess the capability of fault prediction models [37]. Another concern is the unequal distribution of the fault data, which may lead to biased learning [51]. Moreover, issues such as which software properties/metrics to include, how context affects fault prediction, the cost-effectiveness of fault prediction, and fault densities prediction need further understanding and investigation before the results of fault prediction can be put to practice.

1.1 Motivation

Fault prediction models are generally constructed by identifying the relationship between structural measures of the software, such as coupling, cohesion, and complexity, and faults. These models quantitatively describe how these internal structural properties are related to relevant external system qualities such as fault proneness.

However, there are some critical issues that need to be resolved before the fault prediction results can be used to guide the quality assurance process. One important concern is the difficulty of knowing which software metrics are significantly correlated with fault proneness; this issue has not been adequately investigated [23]. Some of the metrics may contain redundant information or, worse, may adversely affect the fault proneness prediction of other metrics. Earlier studies in this regard have confirmed that a high number of features (attributes) may lead to lower classification accuracy and higher misclassification errors [42] [59]. High-dimensional data can also be a serious problem for many classification algorithms due to its high computational cost and memory usage [47].

The other issue concerns using the fault prediction results in practice. Many of the earlier fault prediction studies were based on a binary fault classification model, i.e., a module is considered either faulty or non-faulty. There are several issues with this binary class classification. The binary class classification of software modules does not provide enough logistics to streamline the effort needed to identify and fix the faults in the software system. In addition, even if the performance of the prediction model is reported to be excellent, the interpretation of the findings is hard to put into the proper usability context, i.e., identification of the actual number of faults.


1.2 Objectives

The objective of this thesis work is to present an approach to identify a subset of object-oriented (OO) metrics that show significant fault-correlation, and subsequently to use this subset to train various count models for fault densities prediction. In this thesis, we aim to investigate the two important and related issues, mentioned above, with respect to fault prediction. In this regard, we frame our research questions as follows:

RQ1: How to select a subset of OO metrics that are significantly correlated with fault proneness?

RQ2: How to use this subset of metrics to predict fault densities in a given software system?

1.3 Thesis Organization

The overall structure of the thesis is illustrated in Figure 1.1. The content can broadly be divided into three major sections: the background of the research, which includes the literature review; the research contribution; and the future scope of the proposed work.

Figure 1.1: Thesis organization

Chapter 2: This chapter summarizes the concepts relevant to the fault prediction study. Specifically, we present a detailed survey of existing object-oriented metrics, including the earlier empirical studies of these metrics, details of the public datasets used in our experimental study, various model evaluation techniques, and a literature review of the earlier studies related to software fault prediction. We draw our general arguments and findings at the end of the chapter.

Chapter 3: We present an approach for identifying a metrics subset consisting of metrics with significant fault-correlation. The metrics subset selection process is undertaken in three steps. In the first step, we assess the fault proneness of each metric separately by performing Univariate Logistic Regression (ULR) analysis and select those metrics having significant fault correlation. In the next step, we analyze the pairwise correlation among the selected metrics by performing Spearman's correlation analysis. In the last step, we construct Multivariate Linear Regression (MLR) models to further reduce the metrics and identify a group of metrics that are more significant for fault proneness. Finally, we evaluate the performance of the selected metrics subset against the original project metrics suite. Our results demonstrated that the identified metrics subset produced an improved fault prediction performance compared to the original project metrics suite.
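To make the first two selection steps concrete, the following is a minimal sketch (illustrative only, not the exact procedure or tooling used in this work). It assumes a pandas DataFrame whose columns are OO metrics plus a binary fault label named "faulty", and uses statsmodels for the univariate logistic regressions and SciPy for Spearman's correlation.

```python
import statsmodels.api as sm
from scipy.stats import spearmanr

def univariate_significant_metrics(data, metrics, label="faulty", alpha=0.05):
    """Step 1: fit one logistic regression per metric and keep the metrics
    whose coefficient is statistically significant."""
    selected = []
    for metric in metrics:
        X = sm.add_constant(data[[metric]])           # intercept + single metric
        model = sm.Logit(data[label], X).fit(disp=0)  # univariate logistic regression
        if model.pvalues[metric] < alpha:             # significance of the metric
            selected.append(metric)
    return selected

def drop_highly_correlated(data, metrics, threshold=0.8):
    """Step 2: inspect pairwise Spearman correlations among the selected
    metrics and drop one metric from every highly correlated pair."""
    kept = list(metrics)
    for i, m1 in enumerate(metrics):
        if m1 not in kept:
            continue
        for m2 in metrics[i + 1:]:
            rho, _ = spearmanr(data[m1], data[m2])
            if abs(rho) > threshold and m2 in kept:
                kept.remove(m2)                        # keep the first of the pair
    return kept
```

The third step (MLR-based reduction) can then be applied to the metrics returned by these two functions.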

Chapter 4: Fault prediction can be more useful if, besides predicting whether software modules are faulty or non-faulty, their fault densities can also be predicted accurately. We used the identified subset of significant fault-correlated metrics with various count models to predict fault densities. The results of the prediction are evaluated using confusion matrix parameters and a cost-benefit model. Our results suggested that, among the five count models used, negative binomial regression (NBR) analysis produced the best performance for fault prediction.
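As an illustration of the kind of count model used in that chapter, the following is a minimal sketch (the file names, the "bug" fault-count column, the "loc" column, and the chosen predictors are assumptions for illustration, not the thesis's actual setup) that fits a negative binomial regression with statsmodels and derives fault densities.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical training and test releases with per-class OO metrics and fault counts.
train = pd.read_csv("prop_release1.csv")
test = pd.read_csv("prop_release2.csv")
predictors = ["wmc", "cbo", "rfc", "loc"]   # assumed fault-correlated subset

X_train = sm.add_constant(train[predictors])
X_test = sm.add_constant(test[predictors])

# Negative binomial regression for fault counts (handles over-dispersion
# better than a plain Poisson model).
nbr = sm.GLM(train["bug"], X_train,
             family=sm.families.NegativeBinomial()).fit()
predicted_faults = nbr.predict(X_test)

# Fault density expressed as faults per 100 lines of code, as in the thesis.
predicted_density = 100 * predicted_faults / test["loc"]
```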

Chapter 5: We extend the approach of fault densities prediction presented in the previous chapter to evaluate the effectiveness of the count models when the subset of significant fault-correlated metrics is identified by classifying the faultiness of the software modules into binary classes, i.e., faulty and non-faulty. This analysis helps to decide whether the nature of the fault classification (i.e., binary class classification or multi-class classification) used for selecting significant fault-correlated metrics affects the result of the fault densities prediction. We observed that the results are similar to those found in the previous chapter. However, the numbers of faults and the fault densities predicted by the count models here are lower and fit their actual values more closely compared to the values predicted by the count models in the previous chapter.

Chapter 6: We conclude our work in this chapter and discuss the future scope of research in this area.

1.4 Summary

Software fault prediction is a technique to identify the faults in software modules without executing them. It aims to help the software validation and verification process by targeting the quality assurance effort at the faulty modules. However, some issues associated with the fault prediction process, such as the identification of a metrics subset consisting of metrics with significant fault-correlation and the prediction of software fault densities, need to be resolved before its practical use in software quality assurance can be ensured. In this chapter, we highlighted the issues in the fault prediction process, stated the objectives of the thesis, and summarized, in the organization of the thesis, how these objectives are pursued.


Chapter 2

Related Work

In this chapter, we present a detailed survey of existing object-oriented metrics, including the earlier empirical studies of these metrics. We then present details of the public datasets used in our experimental study, various model evaluation techniques, and a literature review of the earlier studies related to software fault prediction.

2.1 Object-Oriented Metrics

Software quality assurance aims to develop quality software that meets the customer's requirements with the desired quality and is easy to maintain. In order to assess and improve software quality during the development process, developers and managers need to measure the software design. For this purpose, software metrics have been proposed. By using metrics, a software project can be quantitatively analyzed and its quality can be evaluated. Generally, each metric is associated with some structural property of the software, such as coupling, cohesion, or inheritance, and is used as an indicator of an external quality attribute, such as reliability, maintainability, or fault-proneness [4].

Many object-oriented (OO) metrics suites have been proposed to capture the structural properties of a software system. Chidamber and Kemerer proposed a software metrics suite for object-oriented software in 1991, known as the CK metrics suite [15]. Later on, several other metrics suites were proposed by various authors. Harrison and Counsell presented the MOOD metrics suite [32], Wei Li et al. proposed a metrics suite for maintainability [45], and Lorenz and Kidd [48], Briand et al. [11], Marchesi [50], Bansiya et al. [4], and Judith Barnard [6] have also proposed their own metrics suites. All these metrics suites contain static metrics. Meanwhile, Yacoub et al. [71] and Arisholm et al. [2] have proposed dynamic metrics suites that capture the dynamic behavior of the software.

2.1.1 Metrics suites for object-oriented software

(1) C&K Metrics Suite: Chidamber and Kemerer defined a set of metrics known as the CK metrics suite. Later on, they revised these metrics and proposed an improved version [15]. This metrics suite contains six metrics, namely WMC, DIT, NOC, CBO, RFC, and LCOM, which are given in Table 2.1.

Table 2.1: CK metrics suite [15]

Coupling Between Object classes (CBO): CBO for a class is a count of the number of other classes to which it is coupled.
Lack of Cohesion in Methods (LCOM): LCOM is the number of method pairs that do not share a field minus the number of method pairs that do.
Depth of Inheritance Tree (DIT): DIT is the measure of the depth of inheritance of a class.
Response For a Class (RFC): RFC is the number of methods of the class plus the number of methods called by any of these methods.
Weighted Method Count (WMC): WMC is the sum of the complexities of all the methods defined in a class.
Number of Children (NOC): NOC measures the number of immediate descendants of the class.
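As a toy illustration of how two of these metrics can be derived from a class hierarchy, the following sketch (a simplified assumption, not how metric extraction tools actually work) represents the inheritance structure as a child-to-parent map and computes DIT and NOC.

```python
# Toy inheritance structure: each class maps to its direct parent (None = root).
parents = {
    "Object": None,
    "Shape": "Object",
    "Circle": "Shape",
    "Square": "Shape",
    "RoundedSquare": "Square",
}

def dit(cls):
    """Depth of Inheritance Tree: number of ancestors up to the root."""
    depth = 0
    while parents[cls] is not None:
        cls = parents[cls]
        depth += 1
    return depth

def noc(cls):
    """Number of Children: count of immediate descendants."""
    return sum(1 for parent in parents.values() if parent == cls)

print(dit("RoundedSquare"))  # 3 (Square -> Shape -> Object)
print(noc("Shape"))          # 2 (Circle and Square)
```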

(2) MOODS Metrics Suite: This metrics suite [32] provides measures of the structural characteristics of OO programming. It includes six metrics, MHF, AHF, MIF, AIF, PF, and CF, which are defined in Table 2.2.


Table 2.2: MOODS metrics suite [32]

Method Hiding Factor (MHF): MHF is the ratio of the sum of the invisibility of all methods defined in all classes to the total number of methods defined in the system under consideration.
Attribute Hiding Factor (AHF): AHF is the ratio of the sum of the invisibility of all attributes defined in all classes to the total number of attributes defined in the system under consideration.
Method Inheritance Factor (MIF): MIF is the ratio of the sum of the inherited methods in all classes of the system under consideration to the total number of available methods (locally defined plus inherited) for all classes.
Attribute Inheritance Factor (AIF): AIF is the ratio of the sum of inherited attributes in all classes of the system under consideration to the total number of available attributes for all classes.
Polymorphism Factor (PF): PF is the ratio of the actual number of possible different polymorphic situations for a class to the maximum number of possible distinct polymorphic situations for that class.
Coupling Factor (CF): CF is the ratio of the actual number of non-inheritance couplings in the system to the maximum possible number of couplings (both inheritance and non-inheritance related).
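To make the ratio-style MOOD measures concrete, here is a minimal sketch (toy data and an assumed per-class representation, purely for illustration) that computes MIF from counts of inherited and locally defined methods.

```python
# Toy per-class counts: (inherited methods, locally defined methods).
classes = {
    "Shape": (0, 4),
    "Circle": (4, 2),
    "Square": (4, 3),
}

# MIF = sum of inherited methods / sum of available methods (inherited + defined).
inherited = sum(inh for inh, _ in classes.values())
available = sum(inh + defined for inh, defined in classes.values())
mif = inherited / available
print(f"MIF = {mif:.2f}")  # 8 / 17, roughly 0.47
```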

(3) Wei Li & Henry Metrics Suite: Wei Li et al. [45] evaluated the C&K metrics using Kitchenham's metric evaluation framework and found some deficiencies and ambiguities in the definitions of these metrics. For example, CBO implies that all couples are considered equal; however, a more complete form of object coupling, depending on several circumstances, needs to be defined. Accordingly, they proposed a more comprehensive metrics suite consisting of six new metrics: Coupling Through Inheritance, Coupling Through Message Passing, Coupling Through ADT, Number of Local Methods, and two size metrics, SIZE1 and SIZE2. Four of them can be used for measuring coupling and cohesion, as defined in Table 2.3.


Table 2.3: Wei Li metrics suite [45]

Coupling Through Inheritance: A measure of the inheritance coupling of a class.
Coupling Through Message Passing (CTM): The number of send statements defined in a class.
Coupling Through ADT (Abstract Data Type) (CTA): Measures the coupling that occurs due to access of ADTs.
Number of Local Methods (NOM): The total number of local methods defined in a class.
SIZE1: Number of semicolons in a class.
SIZE2: Number of attributes + number of local methods.

(4) Lorenz and Kidd's Metrics Suite: Lorenz and Kidd [48] defined ten metrics in their metrics suite, classified into size metrics, inheritance metrics, and internal metrics, as given in Table 2.4.

Table 2.4: Lorenz & Kidd metrics suite [48]

PIM: Counts the total number of public instance methods in a class. Public methods are those that are available as services to other classes.
NIM: Counts all the public, protected, and private methods defined in a class.
NIV: Counts the total number of instance variables in a class. Instance variables include private and protected variables available to the instances.
NCM: Counts the total number of class methods in a class. A class method is a method that is global to its instances.
NCV: Counts the total number of class variables in a class.
NMO: Counts the total number of methods overridden by a subclass. A subclass is allowed to define a method with the same name as a method in one of its super-classes; this is called overriding the method.
NMI: Counts the total number of methods inherited by a subclass.
NMA: Counts the total number of methods defined in a subclass.
SIX: The ratio of (number of overridden methods * hierarchy nesting level) to the total number of methods.
APPM: The ratio of the total number of method parameters to the total number of methods.

(5) Briand et al. Metrics Suite: Briand et al. defined metrics at the class level to measure the coupling that occurs due to the interactions between classes. These metrics are given in Table 2.5.

Table 2.5: Briand metrics suite [11]

IFCAIC, ACAIC, OCAIC, FCAEC, DCAEC, OCAEC, IFCMIC, ACMIC, OCMIC, FCMEC, DCMEC, OCMEC, IFMMIC, AMMIC, OMMIC, FMMEC, DMMEC, OMMEC

These coupling measures count the number of interactions between classes. The measures distinguish the relationship between classes (friendship, inheritance, none), different types of interactions, and the locus of impact of the interaction. The acronym of each measure indicates the type of interaction counted. The first one or two letters indicate the relationship (A: coupling to ancestor classes; D: descendants; F: friend classes; IF: inverse friends, i.e., classes that declare a given class c as their friend; O: others, i.e., none of the other relationships). The next two letters indicate the type of interaction: CA: there is a class-attribute interaction between two classes c and d if c has an attribute of type d; CM: there is a class-method interaction between two classes c and d if class c has a method with a parameter of type class d; MM: there is a method-method interaction between two classes c and d if c invokes a method of d, or if a method of class d is passed as a parameter (function pointer) to a method of class c. The last two letters indicate the locus of impact: IC: import coupling, which counts all the interactions for a class c where c is using another class; EC: export coupling, which counts all the interactions for a class d where d is the used class.

(6) Bansiya et al. Metrics Suite: Bansiya et al. [4] proposed eleven metrics, which can be applied at the class level. They are given in Table 2.6.

Table 2.6: Bansiya’s metrics suite [4]

Investigating the capability of Object-oriented Metrics for Fault Proneness

Page 26: Thesis  work on TFT

2.1 Object-Oriented Metrics 11

DAM It is the ratio of total number of private attributes in a class

to the total number of attributes defines in a class.

DCC It is a count of the total number of classes that a class di-

rectly related.

CIS This metric is a count of the number of public methods in a

class.

MOA The Measure of Aggregation metric is a count of the number

of data declarations whose types are user-defined type.

MFA The Measure of Functional Abstraction metric is the ratio

of the number of methods inherited by a class to the total

number of methods accessible by member methods of the

class.

DSC This metric is a count of the total number of classes in the

design.

NOH This metric is a count of the number of class hierarchies in

the design.

ANA This metrics value signifies the average number of classes

from which a class inherits information.

CAM It is a sum of the interactions of a method’s parameters with

the maximum independent set of all parameter types in a

class.

NOP It is a count of the methods that can exhibit polymorphic

behavior.

NOM It is a sum of the total number of methods defines in class.

Other Metrics Suites: Besides the metrics suites listed above, some other metrics suites have also been proposed. They are listed in Table 2.7.


Table 2.7: Other metrics suites

Luis Fernandez, Rosalia Pena [24]. Metrics: Sensitive Class Cohesion Metric (SCOM). Significance: SCOM is normalized to produce values in the range [0...1]. It is more sensitive when calculating cohesion than the previously stated metrics, and it has an analytical threshold. The metric is simple, precise, general, and able to be automated, which are important properties for applicability to large-size software systems.

Letha Etzkorn, Harry Delugach [21]. Metrics: Logical Relatedness of Methods (LORM), LORM2, LORM3, Class Domain Complexity (CDC), Relative Class Domain Complexity (RCDC), Class Interface Complexity (CIC), Semantic Class Interface Definition Entropy (SCIDE). Significance: This metrics suite provides a high-level, semantic, domain-oriented view of object-oriented software compared to the traditional syntactically oriented view, and therefore it can be more accurate in many cases than syntactic metrics.

Andrian Marcus, Denys Poshyvanyk [56]. Metrics: The Conceptual Cohesion of Classes (C3), Lack of Conceptual Similarity Between Methods (LCSM), the conceptual coupling of a class. Significance: C3 and LCSM help to identify special cases, such as wrappers or classes that have several concepts implemented in a set of classes. The stated coupling metrics capture new dimensions in coupling measurement compared to existing structural metrics.

Gui Gui, Paul D. Scott [28]. Metrics: Weighted Transitive Coupling (WTCoup), Weighted Transitive Cohesion (WTCoh). Significance: These metrics possess two significant characteristics. First, they use a numeric measure of the degree of coupling or similarity between entities rather than a binary quantity. Second, they include indirect coupling mediated by intervening entities. The proposed coupling and cohesion metrics are very good predictors of the number of lines of code required to make simple modifications to Java components retrieved from the internet.

Bela Ujhazi, Rudolf Ferenc, Denys Poshyvanyk and Tibor Gyimothy [9]. Metrics: Conceptual Coupling Between Object classes (CCBO), Conceptual Lack of Cohesion on Methods (CLCOM5). Significance: These metrics assume that the methods and classes of object-oriented software are connected in more than one way, and that the most explored and evaluated sets of relations among methods and classes are based on data and control dependencies. The proposed metrics rely on parameterized conceptual similarities among methods, which assumes specifying a threshold for the operational measures.

Judith Barnard [6]. Metrics: Calls to Foreign classes (CBO), Depth of Inheritance (DIT), Meaningful Description (MDc), Meaningful Name (MNc). Significance: These reusability metrics provide a value of reusability for a class irrespective of the programming language and can be used to guide the programmer to write more reusable code.

Meghan Revelle, Malcom Gethers and Denys Poshyvanyk [58]. Metrics: Structural Feature Coupling (SFC), Structural Feature Coupling prime (SFC'), Conceptual Similarity between Methods (CSM), Conceptual Similarity Between a Method and a Feature (CSMF), Textual Feature Coupling (TFC), Maximum Textual Feature Coupling (TFCmax), Hybrid Feature Coupling (HFC). Significance: These metrics capture feature-level coupling using structural and textual information. They are useful since they are good predictors of fault-proneness. Additionally, they have an application in feature-level impact analysis to determine whether a change made to one feature may have undesirable effects on other features.

Jehad Al Dallal, Lionel C. Briand [17]. Metrics: Similarity-based Class Cohesion (SCC), Direct Method Invocation (DMI), Method Invocation (MI), Direct Attribute Type (DAT), Attribute Type (AT) matrix. Significance: These metrics account for all types of interactions between class members: method-method interactions, attribute-attribute interactions, attribute-method interactions, and method-method-invocation interactions. Both direct and transitive interactions are considered. These metrics are most useful for predicting fault occurrence in statistical terms.

Sherif M. Yacoub, Hany H. Ammar and Tom Robinson [71]. Metrics: Export object coupling, Import object coupling. Significance: This suite provides a set of dynamic metrics used to measure the design at an early phase of software development. These metrics can be used to measure the run-time properties of a software system.

Arisholm et al. [2]. Metrics: IC_OD, IC_OM, IC_OC, IC_CD, IC_CM, IC_CC, EC_OD, EC_OM, EC_OC, EC_CD, EC_CM, EC_CC. Significance: These metrics include most OO features (inheritance, polymorphism and dynamic binding) to measure the accurate behavior of an OO program. They may be used for various purposes, such as focusing or supporting documentation on those parts of a system that are more likely to undergo change, or making use of design patterns to better anticipate change.

2.1.2 Validation of object-oriented metrics

Many researchers proposed various metrics suites during 1990-92 for the measurement of object-oriented software, but not all of these offered a theoretical or an empirical validation. Chidamber and Kemerer proposed an object-oriented metrics suite consisting of metrics to measure the characteristics of object-oriented software [15]. These metrics were tested and evaluated by many authors. A summary of the empirical studies related to these metrics is given in Table 2.8. The first column of the table indicates the reference of the authors who performed the validation of the OO metrics. The second column refers to the external quality attributes targeted by the authors in their studies. The third column indicates the set of metrics used for the study, followed by the results of the study in the last column.

Table 2.8: Summary of empirical studies on object-oriented metrics

Author | Variable analyzed | Metrics used | Results
Basili et al. [7] | Fault proneness | All C&K metrics | All metrics except LCOM were good predictors of fault proneness.
Briand et al. [52] | Fault proneness | CBO, RFC, LCOM | All metrics were significantly correlated with fault proneness.
Tang et al. [64] | Fault proneness | WMC, RFC | Both metrics were correlated with fault proneness.
Briand et al. [13] | Fault proneness | All C&K metrics | All metrics except LCOM were correlated with fault proneness.
El Emam et al. [20] | Fault proneness | All C&K metrics | All metrics were correlated with fault proneness.
Chidamber et al. [16] | Productivity, design effort | All C&K metrics | CBO and LCOM were associated with productivity and design effort.
Binkley et al. [10] | Maintenance code change | CBO, NOC | Only CBO was correlated with code change.
Wei Li et al. [45] | Maintenance effort | All C&K metrics | All metrics except CBO were correlated with maintenance effort.
Mohammad Alshayeb and Wei Li [1] | Design effort, maintenance effort | CTA, CTM, NLM | None of the metrics was found to be significant for predicting maintenance effort in software development.
Wei Li, Raed Shatnawi [61] | Error proneness | CTA, CTM, NLM | All metrics were significantly associated with error proneness and found to be good predictors of class error probability in all error severity categories.
Wei Li, Huaming Zhang [62] | Error proneness | CTA, CTM | Both metrics were associated with error proneness.
Wei Li, Sallie Henry [45] | Maintenance effort | DAT, MPC, SIZE1, SIZE2, NOM | All metrics were correlated with maintenance effort; the SIZE metrics accounted for a large portion of the total variance in maintenance effort.
Hector M. Olague et al. [53] | Error proneness, fault proneness | All MOOD metrics | None of the metrics was correlated with the prediction of fault proneness.
P.M. Shanthi et al. [60] | Error proneness | All MOOD metrics | All metrics were associated with error proneness.
Ayaz Farooq [22] | System size | All MOOD metrics | None of the MOOD metrics was associated with prediction of the system size.
Lorenz and Kidd [48] | Static characteristics of a design | All metrics of Lorenz's metrics suite | A large number of instances increases coupling and reduces reuse; a deeper class hierarchy indicates poor subclass performance. NIM, SIX, NCM, NIV and NMO were significant predictors of quality attributes.
Briand et al. [12] | Fault proneness | All metrics of the C&K and Briand metrics suites | The coupling metrics were found to be important predictors of faults; more specifically, the impact of export coupling on fault-proneness is weaker than that of import coupling.
Briand et al. [52] | Fault proneness | All metrics of the C&K and Briand metrics suites | Many of the coupling, cohesion, and inheritance measures appear to capture similar dimensions in the data. Coupling and inheritance measures are strongly related to the probability of fault detection in a class; cohesion measures showed little relevance to fault proneness.
Emam et al. [20] | Fault proneness | All Briand metrics | Of all the metrics, OCAEC, ACMIC and OCMEC tend to be associated with fault-proneness.
Bansiya [4] | Design quality assessment | All the Bansiya metrics | CAMC was shown to be an effective predictor of class cohesiveness; the study built a model for evaluating the overall quality of an OO software system based on its internal design properties and showed that the metrics used were significant for design quality assessment.
Arisholm et al. [3] | Fault proneness | Complexity metrics | LOC and WMC were significant predictors of fault proneness.
Zhou et al. [72] | Fault proneness | Complexity metrics | LOC and WMC were better fault predictors than the SDMC and AMC metrics.
Mahmoud et al. [19] | Fault proneness at the package level | Three metrics suites (Martin, MOOD, CK) | The Martin metrics suite was more accurate than the MOOD and CK suites.

There have been few studies investigating OO metrics for fault prediction. Moreover, they do not seem to provide any consolidated results. These studies yielded mixed results: some of them confirmed the predictive capabilities of the metrics, while others prompted questions about them. By observing these studies, we found that:

• The earlier reported studies used different approaches to validate the metrics and used different metrics suites. As a result, no standard and widely accepted metrics have emerged.

• Most of the authors carried out their studies using the C&K metrics suite, while other metrics were not adequately investigated. Therefore, further investigation and validation of these metrics are needed to ensure their usability for fault prediction.

• Most of the authors used OO metrics for fault prediction without evaluating their potential for fault-correlation. However, it is also necessary to investigate the relationships between the metrics to determine a subset of significant fault-correlated metrics for improved fault prediction.

2.2 Public Datasets

The datasets used in our study have been collected from the PROMISE data repository [57]. The PROMISE data repository contains datasets for defect prediction, effort estimation and text mining. Currently, it comprises 23 datasets, but this number is constantly growing. The fault data were collected during the requirements, design, development, unit testing, integration testing, system testing, beta release, controlled release, and general release phases of each release of the software system and were recorded in a database associated with the software. Therefore, these datasets can be used to validate the performance of various fault-prediction techniques. In the experiments of this thesis work, we used six fault datasets, with their twenty-two releases, from the PROMISE data repository. All the software project datasets used have been implemented in the Java programming language. Each dataset contains the information of twenty OO metrics available at the class level, along with the fault information (number of faults) for each instance (class). Most of the twenty metrics are object-oriented class metrics, such as those defined in the metrics suites discussed above [15].

A detailed description of the datasets is tabulated in Table 2.9. This table contains six columns. The first column contains the name of the project dataset. The second column shows the number of instances (classes) present in each dataset. The third column shows the number of non-commented lines of code (LOC). The fourth column corresponds to the number of faulty instances out of all the instances in the dataset. The fifth column corresponds to the number of non-faulty instances out of all the instances in the dataset. The last column shows the percentage of faults.

2.3 Evaluation Measures

Once a fault prediction model has been constructed, we need to evaluate it to find out its capability of fault prediction. Confusion matrix parameters help with this by reporting the performance of the prediction model.

A confusion matrix shows how the predictions are made by the model. The rows correspond to the known class of the data, i.e., the labels in the data. The columns correspond to the predictions made by the model. Table 2.10 shows the confusion matrix for a binary class classification of the faults. All the measures below can be derived from the confusion matrix.

Accuracy:
The prediction accuracy of a fault-prediction technique is measured as

(1) In this thesis, we use the terms instance, class and module interchangeably. All of them refer to a class of an object-oriented software system.


Table 2.9: Datasets used in the study

Project Name | No. of Instances | LOC | No. of Faulty Instances | No. of Non-Faulty Instances | Percentage of Faults

Camel 1.0 340 19632 13 327 3.80%

Camel 1.2 609 36792 215 394 35.03%

Camel 1.4 873 49007 144 729 16.49%

Camel 1.6 966 57996 187 779 19.35%

Xalan-2.4 724 225088 109 615 15.05%

Xalan-2.5 804 304860 386 418 48.00%

Xalan-2.6 886 411737 410 456 46.27%

Xerces 1.2 441 159254 70 371 15.87%

Xerces 1.3 454 167095 68 386 14.97%

Xerces 1.4 589 141180 435 154 78.85%

Ivy 1.1 111 27292 61 50 54.94%

Ivy 1.5 241 59286 14 227 5.08%

Ivy 2.0 352 87359 39 313 11.07%

Velocity 1.4 196 51513 146 50 74.49%

Velocity 1.5 214 53141 140 74 65.42%

Velocity 1.6 229 57012 76 153 33.18%

PROP1 18472 3816692 2739 15733 14.82%

PROP2 23015 3748585 2432 20583 10.56%

PROP3 10275 1604319 1180 9095 11.48%

PROP4 8719 1508381 841 7878 9.60%

PROP5 8517 1081625 1299 7218 15.25%

PROP6 661 97570 66 595 9.90%
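As an example of how such a dataset can be inspected, the following is a minimal sketch (the file name and the "bug" column name are assumptions about the PROMISE CSV layout, not guaranteed) that loads one release with pandas and recomputes the faulty-instance percentage shown in Table 2.9.

```python
import pandas as pd

# Hypothetical PROMISE release file with one row per class and a 'bug' count column.
df = pd.read_csv("camel-1.0.csv")

n_instances = len(df)
n_faulty = (df["bug"] > 0).sum()        # classes with at least one recorded fault
percentage_faulty = 100 * n_faulty / n_instances

print(f"instances={n_instances}, faulty={n_faulty}, "
      f"faulty%={percentage_faulty:.2f}")  # should roughly match the last column of Table 2.9
```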


Table 2.10: Confusion matrix

                        Defect Present: No       Defect Present: Yes
Defect Predicted: No    TN = True Negative       FN = False Negative
Defect Predicted: Yes   FP = False Positive      TP = True Positive

Accuracy = (TN + TP) / (TN + TP + FN + FP)    (2.1)

False positive rate (FPR):

It is measured as the ratio of modules incorrectly predicted as faulty to all non-faulty modules. False alarm and type-1 error are similar to FPR.

FPR = FP / (TN + FP)    (2.2)

False negative rate (FNR):

It is measured as the ratio of modules incorrectly predicted as non-faulty to all faulty modules. Type-2 error is similar to FNR.

FNR = FN / (TP + FN)    (2.3)

Precision:

It is measured as the ratio of modules correctly predicted as faulty to all modules predicted as faulty.

Precision = TP / (TP + FP)    (2.4)

Recall:

It is measured as the ratio of modules correctly predicted as faulty to all faulty modules. Probability of detection (PD) is another name for recall.

Recall = TP / (TP + FN)    (2.5)

F-measure:

It is measured as the harmonic mean of precision and recall.

F-measure = (2 * Precision * Recall) / (Precision + Recall)    (2.6)


ROC curve:

An ROC curve provides a visualization of the tradeoff between the ability to correctly predict fault-prone modules (PD) and the rate of fault-free modules incorrectly predicted as fault-prone (PF). The area under the ROC curve (denoted AUC) is a numeric performance evaluation measure used to compare fault-prediction techniques. In an ROC curve, the best performance corresponds to high PD and low PF.
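As an illustration only (not part of the original study), a minimal Python sketch of computing these measures from predictions is shown below. The variable names y_true, y_pred and y_score are ours, and scikit-learn's roc_auc_score expects predicted probabilities or scores rather than hard labels.

# Minimal sketch (assumed names, not from the thesis): confusion-matrix measures of Section 2.3.
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1, 0, 1]                   # actual labels: 1 = faulty, 0 = non-faulty
y_pred  = [0, 1, 1, 0, 0, 1, 0, 1]                   # hard predictions from some model
y_score = [0.2, 0.7, 0.9, 0.4, 0.1, 0.8, 0.3, 0.6]   # predicted fault probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy  = (tn + tp) / (tn + tp + fn + fp)                  # Eq. 2.1
fpr       = fp / (tn + fp)                                   # Eq. 2.2
fnr       = fn / (tp + fn)                                   # Eq. 2.3
precision = tp / (tp + fp)                                   # Eq. 2.4
recall    = tp / (tp + fn)                                   # Eq. 2.5
f_measure = 2 * precision * recall / (precision + recall)    # Eq. 2.6
auc       = roc_auc_score(y_true, y_score)                   # area under the ROC curve

print(accuracy, fpr, fnr, precision, recall, f_measure, auc)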

2.4 Subset Selection of Object-Oriented Metrics for Fault Prediction

An important practical issue with fault datasets is that they contain too many metrics (attributes). Simply put, not all metrics are likely to be necessary for accurate classification, and including them in the prediction model may in fact lead to a worse model [42] [59]. Some work has been reported on solving this subset selection problem in order to identify the significant software metrics.

Guyon et al. [30] highlighted the key approaches used for attribute selection, including feature construction, feature ranking, multivariate feature selection, efficient search methods and feature validity assessment methods. They concluded that sophisticated wrapper or embedded methods improve predictive performance compared to simple variable ranking methods such as correlation methods, but the improvements are not always significant: domains with large numbers of input variables suffer from the curse of dimensionality, and multivariate methods may overfit the prediction model.

Harman et al. [31] provided a comprehensive survey of the studies related to

search based software engineering. They identified research trends and relationships between the techniques applied and the applications to which they have been applied, and highlighted gaps in the literature and avenues for further research.

Rodriguez et al. [59] performed an investigation using feature selection algorithms

with three filter models and three wrapper models over five software project

datasets. They concluded that the reduced datasets maintained the prediction

capability with fewer attributes than the original datasets. In addition, while it


was stated that the wrapper model was better than the filter model, it came at

a high computational cost.

Liu and Yu [47] provided a survey of feature selection algorithms and presented an integrated approach to intelligent feature selection. Their study introduced the concepts and algorithms of feature selection, surveyed existing feature selection algorithms for classification and clustering, and grouped and compared different algorithms within a categorizing framework based on search strategies, evaluation criteria and data mining tasks, revealing unattempted combinations and providing guidelines for selecting feature selection algorithms. They stated that as data mining develops and expands to new application areas, feature selection also faces new challenges that need to be researched further.

Khoshgoftaar et al. [42] reported a study of selecting software metrics for de-

fect prediction. Their study focused on the problem of attribute selection in the

context of software quality estimation. They presented a comparative investi-

gation for evaluating their proposed hybrid attribute selection approach. Their

results demonstrated that the automatic hybrid search algorithm performed the

best among the feature subset selection methods. Moreover, performances of the

defect prediction models either improved or remained unchanged when over 85%

of the software metrics were eliminated.

The studies listed above investigated the subset selection problem using filter- and wrapper-based approaches. There are some issues associated with these approaches.

• Wrappers use a search algorithm to search through the space of possible attribute subsets and evaluate each subset by running a model on it. They are generally computationally expensive and carry a risk of overfitting the model [68].

• The subset obtained by wrapper methods lacks generality, since it is tied to the bias of the classifier used in the evaluation function.

• Filters are similar to wrappers in the search approach, but instead of evaluating subsets against a model, a simpler filter criterion is evaluated. Since they assess the structural properties of the data rather than being tied to a particular classifier, their results are more general.


• Filter methods may effectively return the full attribute set as the optimal solution. This forces the user to select an arbitrary cutoff on the number of attributes to be used for model building.

• Earlier studies were based on cross-validation rather than an independent test dataset. Most of the studies used a single release of the software for investigation and for validating the results. This raises the problem of multiple comparisons, which makes it hard to generalize the results to real-world consequences.

• Evaluating a model on a single release of the dataset often produces models that are smaller (simpler) than the real models underlying the data.

2.5 Fault Prediction Studies

Software fault prediction is a technique to identify the fault-prone modules before

the testing phase by using the underlying properties of the software. It aims to

streamline the efforts to be applied in the later phases of software development.

Typically, the fault-prediction is done by training a prediction model over some

known project data augmented with fault information, and subsequently using

the prediction model to predict faults for unseen projects. Existing studies in software fault prediction mainly focus on two perspectives: binary class classification of faults, and prediction of the number of faults or fault densities.

2.5.1 Binary class classification of the faults

This type of fault prediction classifies the modules of a software system into two classes, i.e., either faulty or non-faulty. To construct such fault prediction models, two methods have generally been used: supervised learning and unsupervised learning. The two are used in different application contexts. When a new system without any previous release is built, unsupervised learning needs to be adopted to predict fault-prone subsystems. After some subsystems have been tested and put into operation, these pre-release subsystems can be used as training data to build fault prediction models for new subsystems; in this case supervised learning is used. The difference between supervised and unsupervised learning lies in the status of the training data's class labels: if they are unknown, the learning is unsupervised; otherwise, it is supervised.

There have been many efforts reported earlier to predict fault proneness of the

software modules in terms of modules being faulty or non-faulty [18] [51] [35].

The authors of these studies have used different techniques such as Genetic Pro-

gramming, Decision Trees, Neural Networks, Naive Bayes, Fuzzy Logic, Logistic

Regression etc. for predicting faultiness of the software modules [14].

S. S. Gokhale et al. [63] performed a fault prediction study over an industrial

dataset using Regression Tree and Density modeling techniques to build their

fault prediction models. They found that the Regression Tree based prediction model produced higher prediction accuracy and a lower misclassification rate compared to the Density-based prediction model.

Lan Guo et al. [29] carried out an empirical investigation using Dempster-Shafer

(D-S) belief network, Logistic Regression and Discriminant Analysis based tech-

niques over KC2 NASA dataset. They evaluated the prediction models by using

various performance measurement parameters and concluded that the accuracy of the D-S belief network based prediction model was higher than that of the Logistic Regression and Discriminant Analysis based models.

A. Koru et al. [44] reported a study of fault prediction using J48 and Kstar

techniques on public datasets. They suggested that it is better to perform defect

prediction on data that belong to large modules. They found that defect prediction using class-level metrics produced better performance compared to method-level metrics.

Venkata U.B. Challagulla et al. [66] performed a comparative study using various

machine learning techniques like- Linear Regression, Pace Regression, Support

Vector Regression, Neural Network, Support Vector Logistic Regression, Neural

Network for discrete goal field, Logistic Regression, Naive Bayes, Instance Based

Learning, J48 Tree, and 1-Rule. They have used four public datasets and eval-

uated the potential of the prediction models using various parameters. They showed that the combination of 1R and Instance-based Learning gives better prediction accuracy, and that size and complexity metrics alone are not sufficient for efficient fault prediction.


B. Turhan et al. [65] built fault prediction models using the Naive Bayes machine learning technique. They used seven NASA datasets and suggested that the independence assumption of Naive Bayes was not harmful for defect prediction on datasets with PCA preprocessing. They found that assigning weights to static code-level metrics can significantly increase the performance of fault prediction models.

Elish et al. [18] compared the performance of Support Vector Machines (SVMs) with various other machine learning techniques over the NASA datasets and stated that the performance of SVM is in general better than, or similar to, the other machine learning techniques. Kanmani et al. [41] investigated Probabilistic Neural Networks (PNN) and Back-propagation Neural Networks (BPN) using fault data collected from student projects and found that the performance of PNN is better than that of BPN. Menzies et al. [51] empirically investigated Naive Bayes with a logNum filter for fault proneness and found that Naive Bayes with the logNum filter was the best fault prediction model among the prediction models used. Huihua Lu et al. [35] investigated Random Forest and FTF techniques for fault prediction over the NASA datasets and found that the semi-supervised technique outperforms the supervised technique.

Catal et al. [14] presented a literature review of fault-prediction studies from 1990 to 2009. They reviewed the results of previous studies as well as discussed the current trends in fault prediction. They concluded that, until then, no study in the literature had investigated the impact of fault prediction on the software development process. They also highlighted that coming up with a method to assess the effectiveness of fault-prediction studies, if adopted in a software project, would be helpful for the software community.

These studies show that a lot of research has been done in the field of software fault prediction. However, most of these studies have resulted in a high misclassification rate (normally 15 to 35%) with lower classification accuracy (normally 70 to 85%). This shows the need for more specific studies showing the effect of fault prediction on software quality. In this thesis, we address one of the major and complex problems in software fault prediction studies, i.e., how to determine the best possible subset of OO metrics that produces an improved performance of fault prediction. As a solution, we propose an approach for determining such a subset of OO metrics for fault prediction.


2.5.2 Number of faults and the fault densities prediction

There have been few efforts examining the fault proneness of software modules

in terms of predicting the fault density or the number of faults in a given mod-

ule [54] [36] [27].

Graves et al. reported a study using the fault history of the software modules [27].

They performed their study over a large telecommunication system consisting of

1.5 million lines of code and considered different file characteristics. They found

that the module size and other software complexity metrics were generally poor

predictors of fault likelihood. The best predictors were the combinations of a

module′s age, the changes made to the module, and the ages of the changes.

Ostrand et al. [54] used negative binomial regression (NBR) analysis to predict fault proneness of software modules. In their study, an NBR model was developed and used to predict the expected number of faults and the fault density in every module of the next release of the system. The prediction models were based on the number of lines of code, faults and the modification history of the software modules. They applied this prediction model to two large industrial systems and found that the NBR model was very accurate in identifying the fault proneness of the software. In another study [54], the same authors compared three different variations of LOC-based NBR models to predict fault densities. They used the NBR model to predict the number of faults in each file of the software, sorted the files in decreasing order of the number of faults they contained, and then selected the first 20% of the files. They found the model to be accurate in predicting the faults in the top 20% of the files.
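As an illustration of this kind of count-model workflow (our own sketch, not the actual models of [54]), the fragment below fits a negative binomial regression on a couple of assumed metric columns and ranks modules of the next release by their predicted number of faults; the file names and the loc, churn and bug column names are hypothetical placeholders.

# Illustrative sketch only: negative binomial count model + top-20% ranking.
import pandas as pd
import statsmodels.api as sm

train = pd.read_csv("release_n.csv")            # hypothetical file: metrics + "bug" fault counts
X = sm.add_constant(train[["loc", "churn"]])    # assumed predictor columns
nbr = sm.GLM(train["bug"], X, family=sm.families.NegativeBinomial()).fit()

nxt = pd.read_csv("release_n_plus_1.csv")       # hypothetical next release
expected = nbr.predict(sm.add_constant(nxt[["loc", "churn"]]))

# Keep the 20% of modules with the highest expected number of faults for focused testing.
top20 = nxt.assign(expected_faults=expected).nlargest(int(0.2 * len(nxt)), "expected_faults")
print(top20[["expected_faults"]].head())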

Janes et al. [36] reported a study of using NBR analysis to predict fault proneness.

They investigated the relation between object-oriented metrics and class defects in a real-time telecommunication system. They built different prediction models

and found the zero-inflated negative binomial regression model to be the most

accurate for fault prediction.

Recently, Liguo et al. [46] performed a case study using NBR analysis to predict fault proneness in an open source software system. They compared the performance of the NBR model with a binary regression model and found that, in predicting fault-prone modules, the NBR model could not outperform binary regression, but they suggested that NBR is effective in predicting multiple errors in one module.


Kehan Gao et al. [25] reported a comprehensive study of count models for fault

prediction over a full-scale industrial software system. They concluded that

among the different count models, the zero-inflated negative binomial and the

hurdle negative binomial models demonstrated a better correlation with fault

proneness.

These studies show that some earlier efforts have been made to predict fault densities, but they did not provide enough evidence to establish the significance of the count models for predicting fault densities. Moreover, the selection of a count model for optimal performance is still equivocal. Ostrand et al. [54] applied the NBR model to predict the number of faults and the fault densities in each file of the software. They made use of the change history and the LOC metric of the files to determine faults, without establishing the appropriateness of these metrics for the NBR model. Kehan et al. [25] reported a comprehensive study of eight count models for fault prediction. They evaluated the quality of the fitted count models using hypothesis testing and goodness-of-fit parameters. However, no evaluation was provided to assess the potential of the count models to predict fault densities.

2.6 Summary

This chapter presented a brief introduction to the concepts related to our study. In particular, we gave a description of the object-oriented metrics suites proposed by different authors, along with the empirical studies that validated these metrics suites. Later, we discussed the studies related to software fault prediction, the measures used to evaluate the performance of fault-prediction techniques, and the available public dataset repositories. We also summarized the studies on subset selection of significant metrics and framed a background for the same.


Chapter 3

A Framework for Subset Selection of Object-Oriented Metrics for Fault Proneness

Software metrics aim to provide measurements that could be helpful for assessing the quality of a software system with the desired accuracy and at a lower cost. However, the difficulty lies in knowing which metrics actually capture the important quality attributes of a class, such as fault proneness. Some efforts have been reported to validate these class-level object-oriented metrics with respect to fault proneness [1] [26] [52] [64] [61]. These studies yielded mixed results, with some studies confirming the predictive capabilities of the metrics and others prompting questions about them [23]. In their study [43], Kitchenham reported the limitations of earlier metrics studies. The study suggested that the results of empirical studies are not easily comprehensible; the context of metrics validation and the relationship of metrics with fault proneness were not properly investigated. There is a possibility that some of the metrics depend on the project characteristics. Some of them contain redundant information, or do not add any new information, or, worse, have an adverse effect on the other metrics.

In this chapter, we aim to investigate the relationship of existing class-level object-oriented metrics with fault proneness over multiple releases of a software system, in order to identify the metrics producing significant fault-correlation. The metrics subset selection process is undertaken in three steps. In the first step, we assess the fault proneness of each metric separately by performing Univariate Logistic Regression (ULR) analysis and select those metrics having higher fault correlation. In the next step, we analyze the pairwise correlation among the selected metrics by performing Spearman's correlation analysis. Each time a higher correlation between a pair of metrics is observed, we check the performance of the metrics individually and in combination for fault prediction, and select one of the metrics or keep both of them, whichever produces the better fault prediction result. In the last step, we construct Multivariate Linear Regression (MLR) models to further reduce the metrics and identify a group of metrics that are more significant for fault proneness.

Finally, we use the identified metrics subset for fault prediction to estimate the

overall accuracy and misclassification errors over the subsequent releases of the

same project datasets that are used for investigation. We use confusion matrix

criteria: Accuracy, Precision, Recall and AUC (area under the ROC curve) to

evaluate the performance of prediction models. To perform our investigation,

we used five datasets namely: Camel, Xalan, Xerces, Ivy, and Velocity, available

publicly in PROMISE data repository with their multiple successive releases [57].

The rest of the chapter is organized as follows. Section 3.1 presents the approach of the experimental investigation to identify a subset of metrics that significantly correlate with fault proneness. In Section 3.2, we present the experimental setup, which includes information about the datasets, the metrics (independent variables) and the dependent variable used for investigation, and the results of the investigation, followed by threats to validity. We discuss the implications of our results in Section 3.3.

3.1 The Approach

In this section, we present our approach to evaluate the potential of object-oriented (OO) metrics for fault proneness. We have constructed an algorithm, OO subset, that takes the original set of metrics as input and evaluates each metric individually and in conjunction with the other metrics to determine a subset of significant fault-correlated metrics for an improved performance of fault prediction. An overview of the proposed approach is illustrated in Figure 3.1:

Algorithm: OO subset()

// An Algorithm for subset selection of fault-correlated OO metrics.

Initialization: X = [x0, x1, x2, ..., xn] is a vector of independent variables. Y = [y0] is a vector of the dependent variable.

[Figure 3.1: Framework of the proposed approach — Data Set → Univariate Logistic Regression Analysis (identifies the correlation of each metric with fault proneness) → Cross-Correlation Analysis between the Significant Metrics (identifies metrics that are highly correlated with each other) → Multivariate Linear Regression Analysis of the Significant Metrics (finds the subset of significant metrics for improved fault prediction) → Validation of the Resulting Metrics to Estimate their Overall Prediction Accuracy]

Declaration: Create empty vectors ULR = [], SR = [], MLR = [] to store the intermediate output of the logistic regression, Spearman's correlation and multivariate linear regression analyses, respectively.

Begin:

1. for each element xi of the vector X, do 0≤i≤n, n is the number of independent

variables.

1.1 perform univariate logistic regression analysis of xi with Y.

1.2 Store the value of regression coefficient, odds ratio and p-value.

1.3 If value of odds ratio ≥1 && p-value≤.05 && regression coff.>0, then

1.4 add the element to the vector ULR.

End if

End for

2. Extract each element from the vector ULR and perform its correlation analysis

with all other elements of the vector and store their correlation values.

3. for each pair of elements,


3.1 if correlation≥.7, then

3.2 check their individual fault-correlation values and combined fault-correlation

value.

3.3 if individual performance of the elements is greater than their combined per-

formance, then

3.4 keep the element with the higher fault-correlation value and discard the other; otherwise, keep both elements. Add the retained element(s) to the vector SR.

End if

End if

End for

4. for each element of the vector SR, do

4.1 perform a multivariate linear regression analysis,

4.2 for each element selected by the linear prediction model,

4.3 add it to the vector MLR.

End for

5. Combine the elements of MLR for each release of the project.

6. The resulting vector is the subset of metrics significant to fault proneness.

End

This algorithm determines a subset of OO metrics that produce significant fault-correlation for a given software system. It takes the vector of metrics (independent variables) as input and assesses the potential of each metric for fault proneness. First, we initialize a vector X with the independent variables and a vector Y with the dependent variable (fault proneness). Second, we declare three empty vectors, ULR, SR and MLR, to store the results of the intermediate steps. We then analyze the elements of vector X. In each intermediate step, some metrics are dropped based on that step's analysis results. The output of the algorithm is a subset of metrics for each project consisting of the metrics with significant fault-correlation.
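For concreteness, a minimal Python sketch of the three analysis steps is given below. This is our own illustration, not the tooling used in the thesis: the DataFrame layout, the "faulty" column name and the min_weight cutoff in the MLR step are assumptions, and the individual-versus-combined performance check of step 3 is simplified to comparing individual fault correlations.

# Illustrative sketch only: ULR filter -> Spearman redundancy check -> MLR subset.
# Assumes a pandas DataFrame whose columns are the OO metrics plus a binary "faulty" label.
import numpy as np
import statsmodels.api as sm
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression

def ulr_filter(data, metrics, target="faulty"):
    """Step 1: keep metrics with positive coefficient, p-value <= .05 and odds ratio >= 1."""
    kept = []
    for m in metrics:
        fit = sm.Logit(data[target], sm.add_constant(data[[m]])).fit(disp=0)
        coef, p = fit.params[m], fit.pvalues[m]
        if coef > 0 and p <= 0.05 and np.exp(coef) >= 1:
            kept.append(m)
    return kept

def drop_redundant(data, metrics, target="faulty", threshold=0.7):
    """Step 2: for each highly correlated pair, keep the metric with the higher fault
    correlation (a simplification of the individual-vs-combined check in the algorithm)."""
    kept = list(metrics)
    for a in metrics:
        for b in metrics:
            if a < b and a in kept and b in kept:
                rho, _ = spearmanr(data[a], data[b])
                if abs(rho) >= threshold:
                    fa = abs(spearmanr(data[a], data[target])[0])
                    fb = abs(spearmanr(data[b], data[target])[0])
                    kept.remove(b if fa >= fb else a)
    return kept

def mlr_subset(data, metrics, target="faulty", min_weight=1e-3):
    """Step 3: fit a multivariate linear model and keep metrics with non-negligible weights
    (min_weight is an assumed cutoff; the thesis relies on the regression's own selection)."""
    reg = LinearRegression().fit(data[metrics], data[target])
    return [m for m, w in zip(metrics, reg.coef_) if abs(w) > min_weight]

# Usage: subset = mlr_subset(df, drop_redundant(df, ulr_filter(df, metric_columns)))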

Our proposed approach consists of four steps:

(1) Perform a ULR analysis to evaluate each metric separately for fault prone-

ness.

This step evaluates each OO metric separately for fault proneness. Here we perform binary univariate logistic regression (ULR) analysis by considering fault proneness as the dependent variable and the metrics as independent variables. To check the level of significance of each metric, we use three parameters of the LR model: (i) the regression coefficient, which shows the amount of correlation of each metric with fault proneness, (ii) the significance level (p-value), which shows the significance of the correlation, and (iii) the odds ratio, which represents the change in odds when the value of an independent variable increases by one. This step results in a subset of metrics that are significant for fault prediction.

(2) Perform a pairwise Spearman's correlation analysis between the significant metrics.

This step determines the correlation between pairs of metrics. Here, we perform a pairwise Spearman's correlation analysis among the significant metrics and check for both positive and negative correlations. Each time a higher correlation between a pair of metrics is observed, we check the performance of these metrics individually and in combination for fault prediction and select either one metric or the pair of metrics, whichever performs better. If the fault-correlation of an individual metric is poorer than that of the pair, we drop that metric, and we continue this process until all the metrics showing higher correlation have been examined. The remaining metrics are significant for further analysis. The significance level of the correlation is tested at the 95% confidence level (p-value ≤ 0.05) and the degree of correlation is measured using the Hopkins criteria [34]. The outcome of this step is a subset of metrics that are significantly correlated with fault proneness and are not redundant with each other.

(3) Determine a subset of metrics for an improved performance of fault prediction models.

There is a possibility that some metrics still remain in the subset due to their dependency on other metrics and that the subset can be further reduced. To investigate this issue, we construct Multivariate Linear Regression models. This analysis determines the best possible subset of metrics that can predict fault proneness when used in combination. In each model, a subset of metrics is selected and all the others are discarded. At the end, this analysis yields a subset of metrics that are more significant for predicting faults.

(4) Evaluate the resulting subset of metrics to estimate their overall prediction accuracy.

Finally, we construct fault prediction models to investigate the capability of the obtained metrics subset for fault proneness. The construction of prediction models is carried out on the subsequent release of the same software systems that are used in the above investigation of metrics selection. We use four machine-learning techniques, namely: Naive Bayes, Logistic Regression, Random Forest and IBK. The aim of this step is to estimate the overall predictive accuracy of the metrics, rather than to identify the best fault prediction technique. For this reason, the choice of fault prediction techniques is orthogonal to the intended contribution. To investigate the fault prediction capability of the different metric subsets, we use confusion matrix criteria, namely: Accuracy, Precision, Recall and Area under the ROC curve (AUC) [69].

3.2 Experimental Evaluation

In this section, we present our experimental study, including the experimental setup, information about the datasets, the metrics (independent variables), the dependent variable, and the set of research questions used for investigation.

3.2.1 Metrics set used for investigation

We have used existing class-level OO metrics to perform our investigation: nineteen measures of coupling, cohesion, inheritance, encapsulation and complexity of an OO software system. Since we focus on investigating the fault proneness of a given class, we selected only those metrics that are available at the class level [26]. One more reason to select only these metrics is that they are all present in the datasets collected from the PROMISE data repository, which encouraged us to incorporate them in our study. The metrics used for the study are as follows:

WMC, CBO, RFC, DIT, NOC, IC, CBM, CA, CE, MFA, LCOM, LCOM3, CAM,

MOA, NPM, DAM, AMC, LOC and max CC (abbreviated as CC). A detailed description of these metrics is given in the related work (see Section 2.1, Chapter 2).


3.2.2 Dependent variable

This study investigates the relationship between OO metrics and fault proneness. Therefore, we select a measure of fault proneness as the dependent variable. In this study, due to the requirements of the statistical techniques, we define fault proneness as a binary variable, which means a class is marked as either faulty or non-faulty. We mark a class as faulty if at least one fault is found in it, and as non-faulty if no fault is found. When we move from one release to the subsequent release of a software system, fault proneness is defined by the faults identified in that release.
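A minimal sketch of this labeling step is shown below; the fault-count column name "bug" is an assumption about the PROMISE-style CSV layout, not fixed by the thesis.

# Minimal sketch (assumed file and column names): derive the binary fault-proneness label.
import pandas as pd

data = pd.read_csv("camel-1.6.csv")             # hypothetical dataset file
data["faulty"] = (data["bug"] > 0).astype(int)  # 1 = at least one fault, 0 = non-faulty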

3.2.3 Project datasets

The datasets used in our study have been collected from the publicly available PROMISE data repository [57]. These datasets contain OO metrics and the faults found in the software modules during testing and after release. The proportion of faulty modules varies between approximately 3% and 74% in these datasets. We have used five projects, namely Camel, Xalan, Xerces, Ivy, and Velocity, with their sixteen successive releases to perform our study and investigate our results [40]. All the datasets contain the same nineteen metrics. The size of the datasets varies from one to another. The names of the datasets with their subsequent releases are given in Table 3.1. A detailed description of these datasets is given in Section 2.2 of Chapter 2.

Table 3.1: Datasets used for study

Camel 1.0, Camel 1.2, Camel 1.4, Camel 1.6, Xalan-2.4, Xalan-2.5,

Xalan-2.6, Xerces 1.2, Xerces 1.3, Xerces 1.4, Ivy 1.1, Ivy 1.5, Ivy 2.0,

Velocity 1.4, Velocity 1.5 and Velocity 1.6

3.2.4 Research questions

The objective of this experiment is to identify the best possible subset of metrics with significant fault-correlation. We followed the GQM approach [8], where we framed a set of research questions, which were investigated by obtaining the relevant measures. The research questions are as follows.

RQ 3.1: Does there exist a subset of object-oriented metrics that is significantly correlated with fault proneness?

This question aims to evaluate the metrics to test their relationship with fault proneness. We evaluate each metric individually for its correlation with fault proneness.

RQ 3.2: Do existing object-oriented metrics show a high correlation with each other?

This question tests whether existing class-level metrics are correlated with each other. Here we check the metrics for both positive and negative correlations to identify a subset for an improved performance.

RQ 3.3: Does the identified subset of metrics improve the overall prediction accuracy and reduce the misclassification errors compared to the original set of metrics?

This question investigates the performance of the metrics subset in predicting fault proneness.

The first two research questions evaluate the metrics for fault prediction in order to determine a subset of metrics that results in improved accuracy. Therefore, the significance of the first two questions is to support the investigation of the third question.

3.2.5 Experimental execution

To perform our investigation, we have used all five datasets with their multiple successive releases listed in Table 3.1. In order to incorporate multiple releases, we used an incremental training and testing strategy: we first test the prediction models on the first release of the software; next we train the models on the first release and test them on the second release; then we train the models on the first two releases and test them on the third release, and we continue in this way until all the subsequent releases have been incorporated. All experiments were performed using the well-known machine learning tool WEKA [49].
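As an illustration of this growing-window strategy (our own sketch in Python with scikit-learn rather than the WEKA setup actually used; the file names, the "bug" column and the excluded non-metric columns are assumptions), one iteration trains on all earlier releases and tests on the next one:

# Illustrative sketch of incremental train/test across releases (assumed file layout).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score

releases = ["camel-1.0.csv", "camel-1.2.csv", "camel-1.4.csv", "camel-1.6.csv"]
frames = [pd.read_csv(r) for r in releases]
metric_cols = [c for c in frames[0].columns if c not in ("name", "version", "bug")]

for i in range(1, len(frames)):
    train, test = pd.concat(frames[:i]), frames[i]          # earlier releases vs. next release
    y_train = (train["bug"] > 0).astype(int)
    y_test = (test["bug"] > 0).astype(int)

    model = RandomForestClassifier(random_state=0).fit(train[metric_cols], y_train)
    pred = model.predict(test[metric_cols])
    prob = model.predict_proba(test[metric_cols])[:, 1]

    print(releases[i],
          accuracy_score(y_test, pred),
          precision_score(y_test, pred, zero_division=0),
          recall_score(y_test, pred),
          roc_auc_score(y_test, prob))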

3.2.6 Results

This section presents a detailed description of the experimental results. We started the statistical analysis with nineteen metrics. As we progressed through the analysis steps, we eventually dropped some of them based on the intermediate analysis results.

3.2.6.1 Univariate logistic regression analysis

The results of the Univariate Logistic Regression (ULR) analysis are summarized in Tables 3.2 to 3.6. The column Metrics shows the independent variable used in the ULR. The columns Coff., p-value and Odd ratio state the estimated regression coefficient, the statistical significance of the coefficient and the odds ratio of each metric. As discussed above, we selected only those metrics that have a positive regression coefficient, a p-value less than .05 and an odds ratio greater than 1. In each version of the datasets, some metrics were found to be significant for fault prediction while other metrics were not shown to be relevant for fault proneness. Moreover, as we moved from one release to the next, the set of associated metrics changed: some earlier selected metrics were deselected and some new metrics were added. One possible reason for this is that the nature of the metrics depends on the characteristics of the project, and as we move from one release to another, these characteristics may change. We selected only those metrics that were significantly correlated with fault proneness in all releases of the project. The resulting metrics after the ULR analysis are given in Table 3.7.

Table 3.2: Univariate logistic regression analysis- Camel 1.0 to 1.4

Metrics Camel 1.0 Camel 1.0/1.2 Camel 1.0/1.2/1.4

Coff. p-value Odd ratio Coff. p-value Odd ratio Coff. p-value Odd ratio

WMC 0.045 0.017 1.046 0.027 0 1.027 0.037 0 1.038

DIT -0.34 0.237 0.712 -0.097 0.129 0.907 -0.003 0.944 0.997

NOC 0.145 0.007 1.156 0.101 0.001 1.106 0.099 0 1.104

CBO 0.03 0.002 1.031 0.014 0.002 1.014 0.015 0 1.015

RFC 0.02 0.044 1.02 0.012 0 1.012 0.017 0 1.017

LCOM 0.001 0.326 1.001 0.001 0.035 1.001 0.001 0.003 1.001

CA 0.028 0.003 1.028 0.012 0.005 1.012 0.013 0 1.013

CE 0.053 0.242 1.054 0.015 0.256 1.015 0.025 0.004 1.025

NPM 0.047 0.021 1.048 0.033 0 1.034 0.043 0 1.044

LCOM3 0.423 0.271 1.527 0.14 0.178 1.15 0 1 1

LOC 0.002 0.071 1.002 0.002 0 1.002 0.002 0 1.002

DAM -0.302 0.611 0.739 -0.187 0.244 0.83 0.079 0.523 1.082

MOA 0.291 0.141 1.338 0.15 0.011 1.162 0.177 0 1.194

MFA -1.7 0.06 0.183 -0.106 0.575 0.9 0.046 0.724 1.048


CAM -2.534 0.075 0.079 -0.584 0.053 0.558 -1.26 0 0.282

IC -0.727 0.245 0.483 -0.125 0.363 0.883 0.157 0.109 1.17

CBM -0.682 0.243 0.506 0.092 0.106 1.097 0.147 0 1.58

AMC -0.017 0.638 0.983 0.007 0.238 1.007 0.011 0.014 1.011

CC 0.339 0.277 1.403 0.445 0.001 1.561 0.404 0 1.498

Table 3.3: Univariate logistic regression analysis- Ivy 1.0 to 1.4

Metrics Ivy 1.0 Ivy 1.0/1.4

Coff. p-value Odd ratio Coff. p-value Odd ratio

WMC 0.108 0.002 1.114 0.03 0.005 1.03

DIT 0 1 1 -0.087 0.397 0.917

NOC 0.036 0.826 1.036 0.162 0.066 1.176

CBO 0.144 0.001 1.154 0.025 0.011 1.025

RFC 0.04 0 1.041 0.013 0 1.013

LCOM 0.012 0.027 1.012 0 0.177 1

CA 0.037 0.224 1.038 0.012 0.286 1.012

CE 0.23 0 1.259 0.093 0 1.097

NPM 0.111 0.006 1.117 0.035 0.007 1.035

LCOM3 -0.184 0.533 0.832 -0.042 0.832 0.959

LOC 0.003 0.012 1.003 0.001 0.002 1.001

DAM 0.373 0.372 1.452 0.143 0.613 1.154

MOA 0.284 0.173 1.328 0.184 0.093 1.202

MFA 0.235 0.683 1.264 -0.425 0.228 0.653

CAM -3.272 0.001 0.038 -3.12 0 0.044

IC 0.557 0.095 1.746 0.214 0.161 1.239

CBM 0.323 0.081 1.381 0.124 0.107 1.132

AMC 0.03 0.023 1.031 0.017 0.001 1.017

CC 0.119 0.658 1.127 0.349 0.032 1.418

Table 3.4: Univariate logistic regression analysis- Velocity 1.4 to 1.5

Metrics velocity 1.4 Velocity 1.4/1.5

Coff. p-value Odd ratio Coff. p-value Odd ratio

WMC -0.013 0.211 0.987 0.004 0.616 1.004

DIT -1.03 0 0.357 -0.496 0 0.609

NOC 0.005 0.927 1.005 0.067 0.409 1.069

CBO 0.02 0.243 1.02 0.03 0.014 1.031

RFC -0.01 0.09 0.99 0.004 0.344 1.004

LCOM -0.001 0.306 0.999 0 0.573 1

CA 0.03 0.19 1.03 0.016 0.182 1.016

CE -0.02 0.364 0.981 0.023 0.185 1.023

NPM 0.007 0.711 1.007 0.023 0.108 1.024

LCOM3 0.194 0.406 1.214 -0.398 0.011 0.672

LOC -0.001 0.22 0.999 0 0.627 1

DAM -0.428 0.212 0.652 0.5 0.033 1.649

MOA 0 1 1 0.323 0.03 1.381

MFA -1.145 0.004 0.318 -0.816 0.001 0.442

CAM -0.824 0.232 0.438 -0.871 0.067 0.419

IC -1.152 0 0.316 -0.667 0 0.513

CBM -0.729 0 0.482 -0.418 0 0.658

AMC -0.032 0 0.969 -0.002 0.537 0.998

CC -0.187 0.117 0.83 0.044 0.632 1.045

Table 3.5: Univariate logistic regression analysis- Xalan 2.4 to 2.5

Metrics Xalan 2.4 Xalan 2.4/2.5

Coff. p-value Odd ratio Coff. p-value Odd ratio

WMC 0.037 0 1.038 0.025 0 1.025

DIT -0.039 0.576 0.962 0.058 0.104 1.06

NOC 0.041 0.202 1.042 0.045 0.027 1.046

CBO 0.018 0 1.018 0.009 0.001 1.009

RFC 0.023 0 1.023 0.015 0 1.015

LCOM 0.001 0 1.001 0 0 1

CA 0.013 0.009 1.013 0.008 0.02 1.008


CE 0.042 0 1.042 0.018 0.001 1.018

NPM 0.035 0 1.035 0.024 0 1.024

LCOM3 -0.385 0.011 0.681 -0.194 0.013 0.823

LOC 0.001 0 1.001 0.001 0 1.001

DAM 0.534 0.015 1.705 0.206 0.075 1.229

MOA 0.187 0 1.205 0.131 0 1.14

MFA -0.615 0.008 0.541 -0.115 0.352 0.891

CAM -3.559 0 0.028 -0.674 0.001 0.51

IC 0.228 0.011 1.256 0.114 0.002 1.12

CBM 0.079 0 1.083 0.057 0 1.058

AMC 0.01 0 1.01 0.001 0.018 1.001

CC 0.5 0 1.649 0.275 0 1.317

Table 3.6: Univariate logistic regression analysis- Xerces 1.2 to 1.3

Metrics Xerces 1.2 Xerces 1.2/1.3

Coff. p-value Odd ratio Coff. p-value Odd ratio

WMC 0.017 0.058 1.017 0.027 0 1.027

DIT -0.229 0.036 0.795 -0.11 0.137 0.895

NOC 0.012 0.728 1.012 0.021 0.363 1.021

CBO 0.012 0.421 1.012 0.034 0 1.034

RFC 0.009 0.009 1.009 0.015 0 1.015

LCOM 0.001 0.045 1.001 0.001 0 1.001

CA 0.006 0.73 1.006 0.017 0.15 1.017

CE 0.046 0.066 1.047 0.108 0 1.115

NPM 0.014 0.299 1.014 0.016 0.089 1.016

LCOM3 -0.175 0.371 0.84 -0.751 0 0.472

LOC 0 0.029 1 0 0 1

DAM 0.474 0.118 1.607 1.119 0 3.06

MOA 0.086 0.037 1.09 0.146 0 1.157

MFA -0.539 0.09 0.583 -0.043 0.843 0.958

CAM 0.305 0.565 1.356 -0.702 0.071 0.495

IC -0.012 0.949 0.988 0.543 0 1.72

CBM 0.042 0.266 1.043 0.11 0 1.117

AMC -0.001 0.696 0.999 0.003 0.032 1.003

CC -0.086 0.507 0.917 0.075 0.278 1.078

Table 3.7: Reduced metrics subset after ULR analysis

Camel WMC, CBO, RFC, NOC, NPM, CA

Xalan WMC, CBO, RFC, LCOM, CA, CE, LOC, NPM, MOA, CC

Xerces WMC, CBO, RFC, LCOM, LOC, MOA

Ivy WMC, CBO, RFC, CE, NPM, LOC, CAM, AMC

Velocity DIT, MFA, CC, IC

3.2.6.2 Correlation analysis between metrics

The results of Spearman's correlation analysis are summarized in Tables 3.8 to 3.12. It is observed from the tables that WMC, NPM, RFC, LOC and AMC were highly correlated with each other, which shows a strong structural association between these metrics. The CBO metric was not correlated with the CA and CE metrics, which shows that we need separate measures for import and export coupling, as CBO is not able to capture these aspects of coupling. The RFC-LOC and WMC-LOC pairs were correlated at a very high level. The correlation value of WMC and RFC was higher than that of LOC; this suggests that RFC and WMC are good indicators of class complexity and that we do not need the LOC measure separately for measuring class size. The MFA and CAM metrics were negatively correlated with most of the metrics, which indicates that these metrics are not significant for fault proneness; as a consequence, we dropped them from further analysis.

Table 3.8: Spearman’s correlation analysis over Camel project dataset

wmc noc cbo rfc ca npm

wmc 1 0.134 0.566 0.888 0.244 0.918

noc 0.134 1 0.191 0.097 0.299 0.093

cbo 0.566 0.191 1 0.588 0.618 0.44

rfc 0.888 0.097 0.588 1 0.152 0.74

ca 0.244 0.299 0.618 0.152 1 0.237

npm 0.918 0.093 0.44 0.74 0.237 1

Table 3.9: Spearman’s correlation analysis over Ivy project dataset

wmc cbo rfc ce npm loc cam amc

wmc 1 0.493 0.802 0.396 0.95 0.75 -0.783 0.37

cbo 0.493 1 0.472 0.377 0.437 0.401 -0.505 0.23

rfc 0.802 0.472 1 0.467 0.711 0.966 -0.763 0.807

ce 0.396 0.377 0.467 1 0.34 0.417 -0.408 0.335

npm 0.95 0.437 0.711 0.34 1 0.651 -0.722 0.27

loc 0.75 0.401 0.966 0.417 0.651 1 -0.75 0.866

cam -0.783 -0.505 -0.763 -0.408 -0.722 -0.75 1 -0.514

amc 0.37 0.23 0.807 0.335 0.27 0.866 -0.514 1

Table 3.10: Spearman’s correlation analysis over Velocity project dataset

dit mfa cbm cc

dit 1 0.898 0.645 -0.264

mfa 0.898 1 0.584 -0.461

cbm 0.645 0.584 1 -0.101

cc -0.264 -0.461 -0.101 1

3.2.6.3 Multivariate linear regression analysis

By eliminating the metrics showing a high correlation with other metrics, we identify a subset of metrics that are individually significant for predicting fault proneness and do not confound with each other. This does not mean that we have the best set of metrics to be used in combination. To select the best possible subset of independent metrics, we constructed multivariate linear regression (MLR) models. The results of the multivariate linear regression analysis


Table 3.11: Spearman’s correlation analysis over Xalan project dataset

wmc cbo rfc lcom ca ce npm loc moa amc cc

wmc 1 0.474 0.841 0.583 0.448 0.367 0.937 0.742 0.48 0.262 0.576

cbo 0.474 1 0.62 0.374 0.517 0.787 0.466 0.43 0.381 0.286 0.415

rfc 0.841 0.62 1 0.51 0.301 0.643 0.771 0.863 0.547 0.578 0.648

lcom 0.583 0.374 0.51 1 0.32 0.3 0.534 0.393 0.087 0.088 0.287

ca 0.448 0.517 0.301 0.32 1 0.067 0.426 0.226 0.195 -0.069 0.245

ce 0.367 0.787 0.643 0.3 0.067 1 0.347 0.469 0.439 0.437 0.392

npm 0.937 0.466 0.771 0.534 0.426 0.347 1 0.64 0.42 0.169 0.498

loc 0.742 0.43 0.863 0.393 0.226 0.469 0.64 1 0.491 0.783 0.629

moa 0.48 0.381 0.547 0.087 0.195 0.439 0.42 0.491 1 0.311 0.381

amc 0.262 0.286 0.578 0.088 -0.069 0.437 0.169 0.783 0.311 1 0.491

cc 0.576 0.415 0.648 0.287 0.245 0.392 0.498 0.629 0.381 0.491 1

Table 3.12: Spearman’s correlation analysis over Xerces project dataset

wmc cbo rfc lcom loc moa

wmc 1 0.313 0.898 0.797 0.672 0.432

cbo 0.313 1 0.485 0.099 0.602 0.526

rfc 0.898 0.485 1 0.632 0.855 0.535

lcom 0.797 0.099 0.632 1 0.341 0.205

loc 0.672 0.602 0.855 0.341 1 0.583

moa 0.432 0.526 0.535 0.205 0.583 1

have been summarized in Tables 3.13 to 3.17. An empty cell in a table indicates that the corresponding metric was not one of the independent variables selected by the regression procedure in the corresponding model.

Table 3.13: Multivariate linear regression analysis over Camel project datasets

Datasets Constant WMC NOC CBO CA MOA CC

Camel 1.0 -0.04 0.008 0.011 - 0.003 - -

Camel 1.0/1.2 0.110 -0.03 0.025 0.029 0.027 0.041 0.079

Camel 1.0/1.2/1.4 0.129 -0.02 0.020 0.03 -0.03 0.030 0.044

The reduced subset of metrics after the MLR analysis for each project is given in Table 3.18.

3.2.6.4 Validation of prediction models over the successive releases

This section summarizes the validation results over the Camel 1.6, Xalan 2.6, Xerces 1.4, Ivy 2.0 and Velocity 1.6 project datasets. Figure 3.2 shows the accuracy, precision, recall and AUC values of the prediction models constructed using all nineteen metrics. Figure 3.3 shows the accuracy, precision, recall and AUC values of the prediction models built using the identified subset of metrics.


Table 3.14: Multivariate linear regression analysis over Ivy project datasets

Dataset Constant WMC CBO CE LOC AMC

ivy 1.1 0.2977 0.0173 -0.0549 0.0722 -0.0014 0.0086

Ivy 1.1/1.4 0.4364 -0.0067 -0.167 0.018 -0.0003 0.0031

Table 3.15: Multivariate linear regression analysis over Velocity project datasets

Dataset Constant DIT MFA CAM IC

Velocity 1.4 1.111 -0.3023 0.4943 - -0.2344

Velocity 1.4/1.5 0.7983 -0.1489 0.0194 - -0.1042

Shatnawi & Li [61] stated that as a system evolves, it becomes increasingly difficult (inaccurate) to identify error-prone classes. We also experienced the same scenario. By examining the figures, we observed that the models constructed using the identified subset of metrics produced the desired prediction accuracy, comparable with the models built by considering the original set of metrics. This confirms the ability of the selected metrics to predict faults in the subsequent releases of the software systems and shows that these metrics can be significant for predicting fault-prone modules over unseen project data. The AUC values were above the level of no discrimination in all the cases (0.5 < AUC ≤ 0.6: no discrimination, 0.6 < AUC ≤ 0.7: poor discrimination, 0.7 < AUC ≤ 0.8: good discrimination, 0.8 < AUC ≤ 0.9: excellent discrimination, 0.9 < AUC ≤ 1: outstanding discrimination). This shows that the identified subset of metrics resulted in reduced misclassification errors. Based on these findings, we answer our research questions.

RQ 3.1: We found that there exists a different subset of metrics for each project that is significantly correlated with fault proneness (Table 3.7). For individual metrics, we found that CBO, RFC, and the import and export coupling metrics are equally important for predicting faults. Among the complexity metrics, LOC, CC and WMC were selected by each prediction model, while the cohesion metrics were not found to be significant for fault proneness. This leads us to conclude that there exists a

Table 3.16: Multivariate linear regression analysis over Xalan project datasets

Dataset Constant WMC CBO LCOM CA CE LOC MOA CC

Xalan 2.4 0.0165 0.0101 -0.0082 -0.0001 0.0088 0.0049 - -0.022 -

Xalan 2.4/2.5 0.0859 - -0.0033 - 0.004 -0.0044 0.0001 -0.0121 0.0242


Table 3.17: Multivariate linear regression analysis over Xerces project datasets

Dataset Constant WMC CBO LCOM LOC MOA

Xerces 1.2 0.2454 -0.0092 -0.0517 0.0001

Xerces 1.2/1.3 0.2563 -0.0029 -0.0398

Table 3.18: Resulted subset of metrics after MLR analysis

Camel WMC, CBO, NOC, NPM, CA

Xalan WMC, CBO, LCOM, CA, CE, LOC, MOA, CC

Xerces WMC, RFC, LCOM, LOC, MOA, CBO

Ivy WMC, CBO, CE, LOC, AMC

Velocity DIT, MFA, IC

subset of metrics that is significantly correlated with faults.

RQ 3.2: From Tables 3.8 to 3.12, we found that some metrics have a high correlation with other metrics. This shows that, in order to judge the capability of each metric separately, it is necessary to eliminate the collinearity among the metrics, so that the models built on these metrics are accurate and do not suffer from collinearity.

RQ 3.3: Comparing Figures 3.2 and 3.3, we found that the identified subset of metrics produced improved prediction accuracy and reduced misclassification errors. Therefore, we can say that the obtained subset of metrics improves the accuracy of fault prediction.

3.2.7 Threats to validity

Experiments are always associated with potential risks that can affect their findings. We present the possible risks as various validity threats and highlight our efforts to mitigate them. They are as follows.

Construct validity: This concerns the quality of the choices of the independent and dependent variables, which affect the quality of the experimental findings. It includes questions such as: are we actually measuring what we intend to measure? Here, we are interested in identifying a subset of OO metrics for


[Figure 3.2: four panels (Accuracy, Precision, Recall, AUC) for the NB, LR, IBK and RF models over Camel 1.6, Xalan 2.6, Xerces 1.4, Ivy 2.0 and Velocity 1.6, considering all metrics]

Figure 3.2: Results of the validation of prediction models constructed using the original set of metrics and four machine-learning techniques

fault proneness. Therefore, we select the faults found in a given class as the dependent variable. The distinction between development-phase faults and post-release evolution-phase faults of a system is significant, and the effectiveness of the metrics might vary with variations in how the faults are collected.

Internal validity: Internal validity concerns the causal relation between two variables. It asks questions such as: are the cause and effect related? Does the cause precede the effect in time? Are there any plausible alternatives that can affect the outcome of the experiment? We are interested in identifying a subset of metrics that are better fault-correlated for a given class. Therefore, we have used the OO metrics that are available at the class level for our experimental study. To incorporate these metrics into our study, we have used datasets available in the PROMISE data repository. All these values may vary with organizational benchmarks.

Conclusion validity: This validity checks for appropriate data collection and analysis. We calculated the various confusion matrix parameters of each project using the WEKA tool, and we used standard statistical data analysis, including graphical methods. Our results produced a different subset of metrics


[Figure 3.3: four panels (Accuracy, Precision, Recall, AUC) for the same models and datasets, using the subset of metrics obtained after cross-correlation and MLR analysis]

Figure 3.3: Results of the validation of prediction models constructed using the identified subset of metrics and four machine-learning techniques

for each of the projects that we have used in our experimental investigation. One needs to understand the characteristics and the distribution of the dataset before selecting the metrics for a project from a new domain.

External validity: This concerns the potential threats when we try to generalize the obtained causal relationship beyond what was studied. This is the most important aspect of an experimental study, and it requires great care and restraint to address the related threats. Our models are built and evaluated on datasets available in public data repositories. Systems developed inside an organization may follow a different effort pattern. Therefore, the results of our investigation need to be considered in their context only; we do not suggest generalizing our research results to arbitrary project categories.

3.2.8 Discussion

The validation of OO metrics for fault proneness has already been undertaken by some researchers. However, earlier studies made use of a single release of the software system and performed cross-validation analysis to investigate and validate their results. Using multiple releases of the software system, identifying the metrics from previous releases and testing them on a later release, has not been validated much. In this experimental study, we have used Logistic Regression analysis to evaluate each metric independently. The reason for choosing Logistic Regression is that it does not require the data to follow a normal distribution. We check each metric against the three parameters of the regression, i.e., the regression coefficient, the p-value and the odds ratio, to select the significant metrics. Furthermore, the metrics resulting from the ULR analysis are investigated using Spearman's correlation and multivariate linear regression analysis to determine the best possible subset of metrics that produces an improved performance of software fault prediction.

Based on the above correlation and regression analyses, we can draw a number of conclusions. We found that class-level OO metrics are significantly correlated with fault proneness. In comparing our results with the results of previous studies on open source systems, we found that our results agree in part with the results obtained by [53]. We found their results encouraging because their study used an open source system, thus making the common part of our research findings more reliable.

It can be noticed from the investigation that it is possible to identify a subset of metrics that is significantly correlated with fault proneness. However, we observed that the results of the five datasets are not the same: the subset of metrics was different for each project. One possible reason is that the difference between the metrics subsets depends on the project characteristics and is affected by the specific project domain. In our investigation, WMC, CBO, RFC, LOC, CA and CE are the metrics that were significantly correlated with fault proneness in most of the datasets. Our study identified a reduced subset of metrics for improved fault prediction performance over the successive releases of the software.

3.3 Summary

Validation of object-oriented (OO) metrics for predicting software fault proneness is essential to ensure their practical use in building fault prediction models for OO software systems. Some of the OO metrics have previously been shown to be relevant for predicting fault proneness; still, the other metrics have not been validated much, except by the authors who proposed them. For this purpose, we investigated the relationship of existing class-level OO metrics with the fault proneness of OO software systems to determine a subset of metrics that are better correlated with faults.

We evaluated the metrics individually as well as in conjunction with other metrics to determine a subset of significant metrics. Further, we investigated these metrics over the subsequent releases of the same software to estimate their overall prediction accuracy. Our results suggest that it is possible to identify a subset of the available metrics, and this identified subset is able to predict fault proneness with higher accuracy and reduced misclassification errors.


Chapter 4

A Count Model Based Analysis

to Predict Fault Densities in

Software Modules

Software fault prediction is a technique to identify the fault-prone modules before

the testing phase by using the underlying properties of the dataset. It aims to

streamline the testing and verification efforts to be applied in the later phases

of software development. Typically, fault prediction is done by training a prediction model on one part of the known fault data and measuring its performance against the other part of the fault data.

There have been many efforts comparing the performance of fault-prediction techniques on different project datasets using various performance evaluation criteria. However, many earlier fault prediction studies were based on the classification of fault data into two classes, namely faulty and non-faulty. There are several issues with this binary classification. For example, even if the performance of the prediction model is reported as excellent, the findings are hard to put into a proper usability context, i.e., identification of the actual number of faults. The binary classification of software modules as either faulty or non-faulty does not provide enough guidance to streamline the efforts that would ensure the identification of faults in the software system.

The main motivation behind software fault prediction is to identify and predict faults accurately, so that the effort required to find and fix them is minimized. Hence, the idea of software fault density prediction is more useful as it assigns an expected number of faults to each module of the software. This prediction


can help the software quality assurance team to optimize the testing efforts by targeting the modules having a larger number of faults.

The fault datasets available in software data repositories contain much unnecessary information, thereby making them difficult to use for fault prediction. There is also a possibility that some of the metrics depend on the project characteristics. Therefore, the major issue is identifying the subset of software metrics that show significant fault-correlation.

In this chapter, we initially identify a subset of the project metrics suite containing the metrics significant for fault-correlation by performing Multivariate Linear Regression (MLR) analysis. Subsequently, we use this subset of metrics with the count models to predict fault densities. We performed our experimental investigation using five different count models and six successive releases of a software project dataset available in the PROMISE data repository [57]. The built count models assign an expected number of faults and fault densities to each module of the software. To predict fault densities, the count models were trained on the prior releases of the software project and tested on the later release. The results of the prediction were evaluated using confusion matrix parameters and a cost-benefit model.

The rest of the chapter is organized as follows. Section 4.1 describes the data analysis approach, including reviews of the count models, the cost-benefit model and the subset selection process. Section 4.2 describes the experimental evaluation, including the datasets, the metrics (independent variables), the dependent variable and the results of our investigation, followed by the threats to validity. Section 4.3 discusses the implications of our results.

4.1 The Approach

The proposed approach involves the initial identification of a subset of the project metrics suite that shows significant fault-correlation. Subsequently, the identified subset of metrics is used with the count models to predict fault densities (in our study, fault density = faults/100 lines of code). The built count models are validated using confusion matrix parameters and a cost-benefit model. In the following subsections,


we present the details of each step of the approach. An overview of the proposed approach is given in Figure 4.1.

Figure 4.1: Overview of the proposed approach [flow: Fault Dataset -> Subset Selection of Fault-correlated Metrics -> Construction of Count Models for Fault Densities Prediction -> Prediction of Number of Faults and Fault Densities -> Evaluating the Results of Five Count Models -> Cost-benefit Analysis]

4.1.1 Selection of fault-correlated metrics

To determine a subset of project metrics, we use Multivariate Linear Regression (MLR) analysis with the backward selection approach. MLR attempts to model the relationship between two or more independent variables and a dependent variable by fitting a linear equation to the observed data. Every value of the independent variable x is associated with a value of the dependent variable y. The regression line for n independent variables x1, x2, ..., xn is defined as µy = β0 + β1x1 + β2x2 + ... + βnxn. This line describes how the mean response µy changes with the independent variables. The observed values of y vary about their means µy and are assumed to have the same standard deviation σ. Backward selection is a search technique in MLR which starts by considering all the independent variables, tests the significance of each variable using a chosen model comparison criterion, and deletes the variable (if any) whose removal does not significantly worsen the model. This process is repeated until no further improvement is


possible [69].
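As an illustration of this step, the sketch below (ours, with assumed variable names) performs backward elimination for a multivariate linear regression with statsmodels, repeatedly dropping the least significant metric until every remaining predictor is significant at the chosen level.

# Backward selection for multivariate linear regression (illustrative sketch).
# Assumptions: X is a pandas DataFrame of OO metrics, y is the per-module fault count.
import statsmodels.api as sm

def backward_select(X, y, threshold=0.05):
    selected = list(X.columns)
    while selected:
        model = sm.OLS(y, sm.add_constant(X[selected])).fit()
        pvals = model.pvalues.drop("const")        # ignore the intercept
        worst = pvals.idxmax()                     # least significant remaining metric
        if pvals[worst] < threshold:               # everything still significant: stop
            break
        selected.remove(worst)
    return selected

# subset = backward_select(metrics_df, fault_counts)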

We carried out our investigation using class-level object-oriented (OO) metrics as independent variables and the fault proneness of a class as the dependent variable. We used all six releases of the PROP dataset and performed MLR analysis to identify the significant fault-correlated metrics. We computed this subset for each release of the fault dataset incrementally. After identifying the metrics for each release of the PROP dataset, we combined them (took their union) to compute the resulting metrics subset for later use. Subsequently, this identified metrics subset is used with the count models for fault densities prediction.

4.1.2 Count model analysis

The identified subset of metrics is used to construct the count models. A count model is a form of regression analysis used to model data where the dependent variable is a count. All count models aim to explain the number of occurrences of an event. We built count models over all six releases of the software by training the model on the earlier releases and testing it on the later release. The benefit of training the model on earlier releases is that the model contains the historical information of the domain, which can help the count models better predict fault densities. The faultiness of the modules was selected as the dependent variable for the analysis. Since the number of faults in each release of the software has a high variance, we performed a square root transformation to reduce the influence of outlier values and took the logarithmic transformation of the LOC metric. These transformations helped us better fit the model in terms of the log likelihood ratio.
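A minimal sketch of this data preparation is given below; the column names `bug` and `loc` are our assumptions about the release DataFrames, not the exact PROMISE field names.

# Variance-stabilizing transformations applied before fitting the count models
# (illustrative sketch; `bug` and `loc` are assumed column names).
import numpy as np
import pandas as pd

def prepare_release(df):
    out = df.copy()
    out["bug_sqrt"] = np.sqrt(out["bug"])          # square-root transform of the fault count
    out["log_loc"] = np.log1p(out["loc"])          # logarithmic transform of the LOC metric
    return out

# Train on the earlier releases and test on the later one:
# train = prepare_release(pd.concat([prop1, prop2]))
# test  = prepare_release(prop3)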

4.1.3 Evaluation of count models

Once these count models have been constructed, we can use the confusion matrix parameters to evaluate their potential for fault densities prediction. Every count model assigns an expected number of faults to each module of the software system. We therefore used this information as a fault prediction: every module predicted to contain one or more faults is marked as faulty, and every module predicted to contain zero faults is marked as non-faulty. These values can then serve as the values of TP, FP, FN and TN (described in


chapter 2). We use these values to calculate the elements of the confusion matrix

(i.e., Accuracy, Precision and Recall) in order to evaluate the overall accuracy of

the count models.
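The sketch below (ours, with placeholder names) shows how the expected fault counts produced by a fitted count model can be turned into a binary faulty/non-faulty prediction and scored with the usual confusion matrix measures.

# Converting expected fault counts into a binary prediction and scoring it
# (illustrative sketch; `model` is any fitted statsmodels count model, and X_test
# must contain the same columns, including the constant, used for training).
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate_counts(model, X_test, actual_faults):
    expected = model.predict(X_test)                        # expected faults per module
    pred_faulty = (np.rint(expected) >= 1).astype(int)      # one or more predicted faults
    true_faulty = (np.asarray(actual_faults) >= 1).astype(int)
    return {"accuracy":  accuracy_score(true_faulty, pred_faulty),
            "precision": precision_score(true_faulty, pred_faulty, zero_division=0),
            "recall":    recall_score(true_faulty, pred_faulty, zero_division=0),
            "f_measure": f1_score(true_faulty, pred_faulty, zero_division=0)}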

4.1.4 Cost-benefit model

When using a fault density prediction model, we need a cost-benefit model that quantifies the fault removal cost at different phases of software development. This cost-benefit model can help to put the results of fault densities prediction in a proper usability context. Essentially, the framework provides an estimate of the effort saved by using the results of the fault densities prediction in the subsequent phases of software development.

Jiang et al. [38] introduced the cost curve, a measure estimating the cost effectiveness of a classification technique, to evaluate the performance of a fault-prediction technique. They concluded that cost characteristics must be considered when selecting the best prediction technique. Deepak et al. [5] proposed a cost evaluation framework that accounts for the realistic fault removal costs of different testing phases, along with their fault identification efficiencies. In our study, we have used their concept of a cost evaluation framework to construct our cost-benefit model.

Table 4.1: Fault removal cost of testing techniques (in staff-hours per defect)

Type Lowest Mean Median Highest

Unit 1.5 3.46 2.5 6

System 2.82 8.37 6.2 20

Field 3.9 27.24 27 66.6

Table 4.2: Fault identification efficiencies of different testing phases

Type Lowest Median Highest

Unit 0.1 0.25 0.5

System 0.25 0.5 0.65

The constraints that their framework includes are:


(1) Fault removal cost varies with the testing phase.

(2) It is not possible to identify 100% of the faults in a specific testing phase.

(3) It is practically not feasible to perform unit testing on all modules.

We have used the normalized fault removal costs suggested by Wagner et al. [67] and the fault identification efficiencies of the different testing phases from the study of Capers Jones [39] to formulate the cost-benefit model. The normalized costs are summarized in Table 4.1, and the efficiencies of the testing phases are summarized in Table 4.2. Wilde et al. [70] stated that more than fifty percent of the modules are very small in size, hence unit testing of these modules is unfruitful; we have therefore included this value (0.5) as the threshold for unit testing in our framework.

Equation 4.1 shows the proposed cost evaluation framework to estimate the overall fault removal cost. Equation 4.2 shows the minimum fault removal cost without the use of the count model. The normalized fault removal cost is shown in Equation 4.3.

Ecost = Ci + Cu ∗ NoF ∗ δu + δs ∗ Cs ∗ (1 − δu) ∗ NoF + (1 − δs) ∗ (1 − δu) ∗ Cf ∗ NoF    (4.1)

Tcost = Mp ∗ Cu ∗ TM + δs ∗ Cs ∗ (1 − δu) ∗ NoF + (1 − δs) ∗ Cf ∗ (1 − δu) ∗ NoF    (4.2)

NEcost = Ecost / Tcost    (4.3)

where:
Ecost - Estimated fault removal cost of the software when the count model is used for fault prediction.
Tcost - Estimated fault removal cost of the software without the use of the count model.
NEcost - Normalized estimated fault removal cost of the software when the count model is used.
Ci - Initial setup cost of the fault-prediction technique used.
Cu - Normalized fault removal cost in unit testing.
Cs - Normalized fault removal cost in system testing.
Cf - Normalized fault removal cost in field testing.
Mp - Percentage of modules unit tested.


TM - Total modules.

NoF - Total number of faults.

δu - Fault identification efficiency of unit testing.

δs - Fault identification efficiency of system testing.

We will see below how this framework helps to estimate a normalized cost for the count models, in order to determine their economic viability. For our study, we use the median values of the fault removal costs of the testing techniques and of the fault identification efficiencies of the different testing phases.
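To illustrate how Equations 4.1-4.3 are applied, the sketch below plugs in the median values from Tables 4.1 and 4.2 (Cu = 2.5, Cs = 6.2, Cf = 27 staff-hours per defect; δu = 0.25, δs = 0.5) together with Mp = 0.5; the initial setup cost Ci is a placeholder that would normally come from the organization, and NoF is the number of faults predicted by the count model under evaluation.

# Cost-benefit model of Equations 4.1-4.3 (illustrative sketch using the median
# values of Tables 4.1 and 4.2; Ci is a placeholder assumption).
def estimated_cost(NoF, Ci=0.0, Cu=2.5, Cs=6.2, Cf=27.0, du=0.25, ds=0.5):
    # Ecost (Eq. 4.1): fault removal cost when the count model guides testing
    return (Ci + Cu * NoF * du
            + ds * Cs * (1 - du) * NoF
            + (1 - ds) * (1 - du) * Cf * NoF)

def testing_cost(NoF, TM, Mp=0.5, Cu=2.5, Cs=6.2, Cf=27.0, du=0.25, ds=0.5):
    # Tcost (Eq. 4.2): fault removal cost without the count model
    return (Mp * Cu * TM
            + ds * Cs * (1 - du) * NoF
            + (1 - ds) * Cf * (1 - du) * NoF)

def normalized_cost(NoF, TM):
    # NEcost (Eq. 4.3); values below 1 indicate model-guided testing is cheaper
    return estimated_cost(NoF) / testing_cost(NoF, TM)

# Example with the PROP3 figures from Table 4.4: normalized_cost(NoF=1640, TM=10275)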

4.2 Experimental Evaluation

In this section, we present an experimental study to evaluate the performance

of count models for fault densities prediction. We have used five different count

models namely: Poisson Regression model, Negative Binomial Regression model,

Zero-Inflated Poisson Regression model, Generalized Negative Binomial Regres-

sion model and Zero-Inflated Negative Binomial Regression model, over the six

successive releases of a software project dataset consisting of nineteen class-level object-oriented metrics. In this study, we investigated the prediction of fault densities and the number of faults for a given module. Therefore, we selected a measure of fault proneness as the dependent variable. The fault proneness

of a class is the probability that a class contains a fault, given the metrics for

that class. It is a key factor for monitoring and controlling the quality of the

software.

4.2.1 Metrics set used for the experiment

To perform our experimental investigation, we have used nineteen measures of the coupling, cohesion, inheritance, encapsulation and complexity of an object-oriented software system. They are as follows: WMC, CBO, RFC, DIT, NOC, IC, CBM, CA, CE, MFA, LCOM, LCOM3, CAM, MOA, NPM, DAM, AMC, LOC and CC (CC is the same as max CC in the PROP dataset). For each release of the PROP dataset, we performed Multivariate Linear Regression (MLR) analysis to test whether each of the nineteen metrics would be a significant predictor in the count models. The MLR analysis yields the subset of significant metrics corresponding


to each release of the dataset. The criterion used to select a metric is that it should appear in at least 50% of the six releases of the dataset. Based on this selection criterion, only eleven metrics (WMC, NOC, CBO, CA, CE, NPM, LOC, CAM, DAM, LCOM3 and AMC) were selected for further analysis. The results of this analysis are summarized in Table 4.3.

Table 4.3: Identified metrics for each release of the PROP dataset

Dataset    Metrics identified
PROP1      CBO, CE, LOC, DAM, CAM, LCOM3, LCOM and WMC
PROP2      RFC, CBO, WMC, LCOM3, NOC, CE, IC, DAM, CAM, AMC, LOC, NPM, MOA and CA
PROP3      LOC, CBO, NOC, LCOM3, CC, MOA, CAM, DAM, MFA, DIT, AMC and WMC
PROP4      RFC, CBO, LCOM3, NOC, CAM, NPM, LOC, IC, AMC, WMC, CC, CE, CA, LCOM, DIT, DAM and MFA
PROP5      CBO, NOC, LOC, LCOM3, WMC, NPM, LCOM and CE
PROP6      DIT, CBO, CA, CE, LCOM3, LOC, DAM and IC
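The appearance-frequency criterion itself is easy to express; the sketch below (ours, with hypothetical per-release subsets) keeps every metric that shows up in at least half of the releases. Note that this is only the counting step: the final subset in our study also depends on the regression results for each release.

# Counting step of the selection criterion: keep the metrics that appear in at
# least 50% of the per-release subsets (illustrative sketch).
from collections import Counter

def frequent_metrics(per_release, min_fraction=0.5):
    counts = Counter(m for subset in per_release.values() for m in subset)
    needed = min_fraction * len(per_release)
    return sorted(m for m, c in counts.items() if c >= needed)

# Example with hypothetical subsets for three releases:
# frequent_metrics({"R1": {"CBO", "LOC"}, "R2": {"CBO", "WMC"}, "R3": {"LOC", "CBO"}})
# -> ['CBO', 'LOC']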

4.2.2 Project dataset

We have used the PROP dataset with its six successive releases to perform our study and to evaluate our results [40]. The PROP dataset is one of the largest datasets available in the PROMISE data repository. This dataset was collected from a software project developed inside an organization (commercial software) and written in the Java programming language. Each release of the dataset consists of fault data of one or more versions of the project. For example, the PROP1 dataset corresponds to versions 4, 40, 85, 121, 157 and 185; similarly, the other releases correspond to other versions of the project. Each version of the project contains some modification of the functionality, but not all versions introduce major changes. Therefore, we group such versions into one dataset, and the versions that address major changes are grouped into different releases, i.e., PROP1, PROP2, etc. For each release, the same nineteen metrics have been calculated and recorded with respect to the software modules. The size of the dataset varies from one release to another, but for all the releases we collected the same nineteen metrics. A detailed description of the dataset is given in Table 4.4.


Table 4.4: Detail of PROP project dataset used for study
Version no.    Total no. of instances    No. of faulty instances    Total no. of faults
PROP1          18472                     2739                       5493
PROP2          23015                     2432                       4096
PROP3          10275                     1180                       1640
PROP4          8719                      841                        1362
PROP5          8517                      1299                       1930
PROP6          661                       66                         79

4.2.3 Count models

A count model, such as Poisson regression or negative binomial regression, is a form of regression analysis used to model data where the dependent variable is a count. All count models aim to explain the number of occurrences, or counts, of an event. The counts themselves have a variance that increases with the mean of the distribution [46]. Count models inherit the basic idea of linear regression by assigning a regression coefficient to each variable that shows its contribution to the dependent variable while keeping the other independent variables constant. These models retain all the power of linear regression models but extend the analysis to predict the mean of variables that cannot reasonably be assumed to be normally distributed.

In this subsection, we describe the different count models used for the experimental investigation.

4.2.3.1 Poisson regression model

Poisson regression is the standard or base count response regression model. It is

based on the Poisson probability distribution, which is the fundamental method

used for modeling count response data. It assumes that the dependent variable

Y has a Poisson distribution, and assumes the logarithm of its expected value

can be modeled by a linear combination of independent variables.

Let Yi equal the number of faults (dependent variable) observed in file i and


Xi be a vector of independent variables for the ith observation. Given Xi, assume Yi is Poisson distributed with the probability density function (PDF)

Pr(Yi | Xi) = e^(−µi) µi^Yi / Yi!    (4.4)

where µi is the mean value of the dependent variable Yi. To ensure that the expected value µi is nonnegative, the link function, which expresses the relationship between the expected value and the independent variables, has the form [33]

µi = E(Yi | Xi) = e^(Xi′β)    (4.5)

where β = [β0, β1, β2, ..., βk] denotes the vector of regression coefficients and Xi′ represents the transpose of Xi, which is equal to [1, Xi].
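Such a model can be fitted, for example, with a Poisson generalized linear model; the sketch below (ours, with X_train, y_train, X_test and loc_test as placeholder names for the prepared training data, test metrics and test-module LOC) estimates β and returns the expected number of faults µi and the fault density for each module of the test release.

# Fitting a Poisson regression count model and predicting expected faults
# (illustrative sketch; X_train, y_train, X_test and loc_test are placeholders).
import statsmodels.api as sm

X_tr = sm.add_constant(X_train)
X_te = sm.add_constant(X_test, has_constant="add")

poisson_model = sm.GLM(y_train, X_tr, family=sm.families.Poisson()).fit()
expected_faults = poisson_model.predict(X_te)        # mu_i = exp(X_i' beta) per module
fault_density = 100.0 * expected_faults / loc_test   # faults per 100 lines of code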

4.2.3.2 Negative binomial regression model

Negative binomial models have been derived from two different origins. First, and initially, the negative binomial can be thought of as a Poisson-gamma mixture designed to model overdispersed Poisson count data. Conceived of in this manner, estimation usually takes the form of a maximum likelihood Newton-Raphson type algorithm. This parametrization estimates both the mean parameter and the ancillary or heterogeneity parameter α.

In the context of our prediction model, we can view the negative binomial regression (NBR) model as follows: let Yi equal the number of faults observed in file i and Xi be the vector of OO metrics for that module. The NBR model specifies that Yi, given Xi, has a Poisson distribution with mean µi.

The variance of the negative binomial regression model is given as

Var(Yi | Xi) = µi(1 + αµi) = e^(Xi′β) (1 + α e^(Xi′β))    (4.6)

If α = 0, the negative binomial distribution reduces to a Poisson distribution. The NBR model is appropriate when the data exhibit overdispersion; the dispersion parameter α allows for the kind of concentration observed for fault counts.
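In the same setting, a negative binomial model can be fitted directly; a minimal sketch (ours, reusing the placeholder names from above) is shown below. The estimated dispersion α indicates how strongly the fault counts are overdispersed.

# Negative binomial regression: estimates beta and the dispersion alpha jointly
# (illustrative sketch; X_train, y_train and X_test are placeholders).
import statsmodels.api as sm

nb_model = sm.NegativeBinomial(y_train, sm.add_constant(X_train)).fit(disp=0)
alpha = nb_model.params["alpha"]                     # heterogeneity (dispersion) parameter
expected_faults_nb = nb_model.predict(sm.add_constant(X_test, has_constant="add"))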


4.2.3.3 Zero-inflated count model

The Poisson and negative binomial distributions define an expected number of zero counts for a given value of the mean: the greater the mean, the fewer zero counts are expected. However, fault data normally come with a high percentage of zero counts, which makes them hard to fit with the Poisson or negative binomial distribution. To counter this problem, the zero-inflated Poisson (ZIP) and zero-inflated negative binomial (ZINB) models have been developed [33]. The data are assumed to come from a mixture of two distributions, where the structural zeros from a binary distribution are mixed with the non-negative integer outcomes (including zeros) from a count distribution.

The ZINB model is similar to the ZIP model. The only difference is that, in the case of the ZINB model, the negative binomial distribution is used for the non-perfect modules group, as compared to the Poisson distribution used in the ZIP model [33]. The general form of the zero-inflated model is given below.

ln(µi) = Xi′β    (4.7)

logit(ψi) = ln(ψi / (1 − ψi)) = Xi′γ    (4.8)

where ψi is the probability that module i belongs to the structural-zero group, and β and γ are the coefficient vectors to be estimated.
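statsmodels also offers zero-inflated counterparts of these models; the hedged sketch below (ours, reusing the placeholder matrices and simply letting the same covariates drive the zero-inflation logit) shows how ZIP and ZINB models could be fitted and used to predict expected fault counts.

# Zero-inflated Poisson / negative binomial models (illustrative sketch; reusing
# the same covariates for the zero-inflation part is our own simplifying assumption).
import statsmodels.api as sm
from statsmodels.discrete.count_model import (ZeroInflatedPoisson,
                                              ZeroInflatedNegativeBinomialP)

X_tr = sm.add_constant(X_train)
X_te = sm.add_constant(X_test, has_constant="add")

zip_model = ZeroInflatedPoisson(y_train, X_tr, exog_infl=X_tr,
                                inflation="logit").fit(maxiter=200, disp=0)
zinb_model = ZeroInflatedNegativeBinomialP(y_train, X_tr, exog_infl=X_tr,
                                           inflation="logit").fit(maxiter=200, disp=0)

zip_expected = zip_model.predict(X_te, exog_infl=X_te)   # expected faults per module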

4.2.3.4 Generalized negative binomial regression model

The generalized negative binomial regression (gnbreg) has been found useful in fitting over-dispersed as well as under-dispersed count data. A gnbreg is a type of negative binomial regression in which the heterogeneity parameter itself can be parameterized. It allows a generalization of the scalar overdispersion parameter such that parameter estimates can be calculated showing how the model predictors comparatively influence overdispersion. The generalized negative binomial regression has been formulated as

NB-G = µ + αµ^p    (4.9)

where p is a third parameter to be estimated.


4.2.4 Results

Figure 4.2: Result of the predicted number of faults using count models (PROP1-PROP6) [bar chart; one group of bars per release; series: Actual, NBRM, P, ZIP, GNBR, ZIN; y-axis: number of faults]

This subsection presents a detailed description of the experimental results. First, we discuss the prediction of the number of faults and the fault densities using the various count models. Next, we compare the overall accuracy and effectiveness of the count models using confusion matrix criteria. Finally, we present the results of the cost-benefit analysis of the count models to evaluate them from an economic standpoint.

4.2.4.1 Prediction of the number of faults and fault densities

The built count models assign an expected number of faults and fault densities to each module of the software. In each scenario, the count model is built on one or more prior releases of the software and is evaluated on the latest release. For example, the count model based on releases PROP1 and PROP2 is evaluated on release PROP3; the exception is release PROP1, where training and testing were both performed on the same dataset due to the unavailability of any prior release. A similar procedure has been followed for all the count models.

Figures 4.2 and 4.3 show the number of faults and the number of faulty modules predicted by each count model. The figures consist of a


graph corresponding to each release of the project dataset, showing the predicted number of faults and the number of faulty modules. The blue bar in the figures shows the actual number of faults and faulty modules contained in the release of the project dataset; this is the optimal value, i.e., finding all the faults and faulty modules in each release. The other bars show the comparison of the prediction results among the count models. The quality of each model is measured in terms of how close its predicted value comes to the actual value.

Figure 4.3: Result of the predicted number of faulty modules using count models (PROP1-PROP6) [bar chart; one group of bars per release; series: Actual, NBRM, P, ZIP, GNBR, ZIN; y-axis: number of faulty instances]

Comparing the results from PROP1 to PROP6, we can observe that the number of faults and faulty modules predicted by the negative binomial regression count model is the closest fit to the actual number of faults. This result is consistent across all the releases of the PROP dataset, except PROP4, where the values predicted by the Poisson regression count model are the closest fit to the actual values. The second most accurate prediction model is the Poisson (P) regression model, except for PROP6, where ZIP is the second best predictor. The rest of the models assign higher values than the actual value, which questions their viability for predicting fault densities. Overall, these results are very encouraging regarding the closeness of the predictions made by the count models to the actual number of faults discovered.

Figure 4.4 shows the results of fault densities predicted by each count model. This


information can be useful if a practitioner wants to know which module is likely to contain the highest number of faults relative to the size of the module. The figure shows the sum of the actual fault densities in the project datasets (blue bars) and the fault densities predicted by the count models. For the sake of simplicity, we normalized the values of the predicted fault densities: the actual value is converted to 1 and all other values are normalized with respect to it by dividing them by the actual value. Therefore, the bar corresponding to each count model shows the amount by which the predicted fault densities increase or decrease with respect to the actual value. The predictive capability of each count model is measured by how close its predictions are to the values of the actual curve. It can be seen from the figure that the fault densities predicted by the NBR model and the P model are the closest fit to the actual values, except on the PROP4 dataset, where NBR is the third best performer, and on PROP6, where P is the third best performer. The values of the other count models are much further from the actual values. This confirms the potential of the NBR model and the P model to predict the fault densities in the software system.

Figure 4.4: Result of the fault densities prediction using count models (PROP1-PROP6) [bar chart; normalized fault density per release; series: Actual, NBRM, P, ZIP, GNBR, ZIN]


Figure 4.5: Comparison of count model using various confusion matrix criteria (PROP1-PROP6) [six panels, one per release; each panel compares NBRM, P, ZIP, GNBR and ZIN on Accuracy, Precision, Recall and F-measure]

4.2.4.2 Evaluating the results of five count models

The results of the previous section show the potential of count models to predict the number of faults and faulty modules, but this raises the question of how effective the count models are at identifying the actually faulty modules. If we mark every module predicted to contain one or more faults as faulty and all other modules predicted to contain zero faults as non-faulty, then these values can serve as the values of TP, FP, FN and TN, and the count models can also be used for the binary classification of modules, i.e., fault prone or not fault prone. These values are used to calculate the elements of the confusion matrix (i.e., Accuracy, Precision and Recall) to


evaluate the overall accuracy of the count models. This is particularly important because it is possible that a prediction model predicts a number of faults close to the actual value but skips the actually faulty modules and raises false alarms by predicting faults in non-faulty modules.

Figure 4.5 shows the results of the count models under various confusion matrix criteria. Prediction accuracy, precision, recall and F-measure are the most commonly used parameters to evaluate prediction models. Here, we used all four parameters and built comparison graphs for all six releases of the PROP project dataset (sub-figures 1 to 6 in Figure 4.5, corresponding to the six releases of the PROP project).

From the figures, we can observe that, in general, the prediction accuracy of the NBR model is higher than that of all other count models, except on PROP5, where its accuracy is lower. The prediction accuracy of the NBR model varies from 75% to 85%. Similarly, for precision, NBR outperforms the other models except on PROP6; the precision of NBR varies between 18% and 33%. The recall of the NBR model is lower than that of the other count models and generally varies between 25% and 75%. One possible reason is that the other count models assign higher numbers of faults than the NBR model, and as a result their recall values increase by some amount. To address this issue, we examined the F-measure, which shows the trade-off between false positives and false negatives. The F-measure of the NBR model is again higher than that of the other count models and varies between 23% and 32%. This shows that the NBR model has the potential to predict fault-prone modules and reduce misclassification errors. The F-measure values of the other models are, in general, lower than those of the NBR model.

4.2.4.3 Prediction of the number of faults and the fault densities in

the modules ranked as top 20%

As mentioned by Ostrand et al. [55], although the individual fault counts predicted for each file generally do not exactly match the actual fault counts, the great majority of the actual faults occur in the set of files at the top of the listing. To evaluate our results in this context, we sorted the modules according to their predicted fault counts and used the modules ranked in the top 20 percent to determine


the faults they contain. Tables 4.5 and 4.6 contain the percentage of the predicted number of faults and of the fault densities captured by the top 20% of modules for the six releases of the PROP dataset, for all five count models. The results for the predicted number of faults for PROP1 to PROP6 are given in Table 4.5. From the table, it is clear that these modules contain between 55% and 72% of the faults, with an overall average of approximately 67% in the case of the NBR model. For the other count models, the average percentage is between 64% and 66%, which is close to the NBR model. A similar process was followed for predicting fault densities. Table 4.6 contains the percentage of fault densities predicted by the count models for PROP1 to PROP6. The top 20% of the modules contain between 60% and 96% of the fault densities, with an average of 72% in the case of the NBR model, excluding the values 34% and 100%, which each occur once. For the other count models, the average percentage varies between 54% and 56%. Comparing our results with those of [55], we found that only the NBR model was able to predict the fault densities of the given modules significantly well when considering the top 20% of the total modules.
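The top-20% computation can be reproduced with a few lines; a sketch (ours, with placeholder names for the predicted and actual fault vectors) is shown below.

# Share of actual faults captured by the modules ranked in the top 20% by a
# model's predicted fault count (illustrative sketch).
import numpy as np

def top20_fault_share(predicted_faults, actual_faults, fraction=0.20):
    predicted = np.asarray(predicted_faults)
    actual = np.asarray(actual_faults)
    k = max(1, int(round(fraction * len(actual))))
    top_idx = np.argsort(predicted)[::-1][:k]        # modules with the highest predictions
    return actual[top_idx].sum() / actual.sum()      # fraction of all faults captured

# e.g. top20_fault_share(expected_faults_nb, y_test) gives a value comparable to Table 4.5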

Table 4.5: Percentage of faults contained in the modules ranked as top 20% (T = training set)
Model   PROP1 (T: 1)   PROP2 (T: 1,2)   PROP3 (T: 1,2,3)   PROP4 (T: 1,2,3,4)   PROP5 (T: 1,2,3,4,5)   PROP6     Average
NBR     70%            66.90%           66.56%             71.26%               71.63%                 54.83%    66.85%
P       58.31%         57.08%           63.72%             75.24%               75.35%                 58.53%    64.07%
ZIP     56.80%         58.17%           60.81%             75.23%               80.46%                 54.21%    64.28%
GNBR    61.11%         62.74%           63.79%             77.08%               80%                    49.07%    65.63%
ZIN     70%            67.19%           66.60%             76.19%               75.84%                 38.38%    65.70%

4.2.4.4 Cost-benefit analysis

The results obtained through the cost-benefit analysis experiments are shown in Figure 4.6. We have used the predicted number of faults to calculate the estimated cost of each count model for fault densities prediction. The value of Tcost was calculated to show the actual cost incurred under the


normal testing process. This served as the optimal point for calculating the NEcost of each count model. Figure 4.6 shows the values of NEcost of each model for fault densities prediction.

Table 4.6: Percentage of fault density contained in the modules ranked as top 20% of modules (fault density = faults/100 lines of code)
Model   PROP1 (T: 1)   PROP2 (T: 1,2)   PROP3 (T: 1,2,3)   PROP4 (T: 1,2,3,4)   PROP5 (T: 1,2,3,4,5)   PROP6     Average
NBR     96.20%         59.14%           34.66%             80.06%               60.67%                 100%      71.78%
P       82.80%         48.66%           33.69%             73.11%               52.28%                 43.70%    55.70%
ZIP     68.39%         44.57%           38.33%             74.74%               59.59%                 34.93%    53.42%
GNBR    57.91%         36.62%           36.80%             76.71%               61.38%                 48.88%    53.05%
ZIN     57.57%         38.59%           37.38%             76.34%               62.61%                 51.85%    54.05%

From Figure 4.6, it can be seen that, in general, except for the PROP4 dataset, the NEcost of the NBR model is lower than that of the other count models, while all the other count models generally have NEcost values greater than that of the NBR model. The cost value of the P model is the second best, except for PROP6. These results imply that it is more economical to build the prediction model based on negative binomial regression in order to reduce the overall cost of testing.

Since the cost-benefit model used here accounts for the faults found in each phase of software testing as well as the faults that remain and seep into the next phases, it provides significant guidance about the cost effectiveness of the model, along with the effectiveness, efficiency and accuracy of the prediction model. These results confirm and strengthen our prediction model from an economic standpoint. Based on our results, we make the following observations.

• Each of the count models assigned an expected number of faults and fault densities to each module of the software. We used the actual values for comparison with the predicted values, and we observed that the value predicted by each count model does not exactly match the actual value and varies from one release to another.


Figure 4.6: Cost-benefit model for the count models (PROP1-PROP6) [bar chart; normalized estimated fault removal cost (NEcost) per release; series: NBR, P, ZIP, GNBR, ZIN]

Therefore, we checked the number of faults occurring in the modules ranked in the top 20%. We found that the majority of the faults occurred in these modules: the NBR model achieved an average of approximately 67%, while for the other models the average percentage is between 64% and 66%. This shows that the count models have the potential to predict fault densities.

• The NBR model in general provided higher prediction accuracy, except on the PROP4 dataset. The other four count models had similar prediction accuracy to each other, lower than that of the NBR model.

• The precision of the NBR model outperformed that of the other count models except on PROP6. The precision of NBR varies between 18% and 33%.

• For recall, the NBR model has lower values than the other count models. In general, GNBR and ZIN have the highest recall values. One possible reason is that the other count models assign higher numbers of faults than the NBR model, and as a result their recall values increase by some amount.


• For the F-measure, the NBR model showed a higher value than the other count models. The second highest value was shown by the P model; the rest of the models produced lower F-measure values. This result shows that the prediction model based on NBR reduced the misclassification rate by a significant amount.

• The final model selection also considered the performance under the cost-benefit model. Once again, we found that the NEcost incurred by the NBR model is much lower than that of the other count models, irrespective of the dataset, except for the PROP4 dataset.

These results suggest that the NBR model is the most suitable for predicting fault densities.

4.2.5 Threats to validity

In this section, we critically examine the possible threats to the validity of our experimental findings. We also highlight the factors affecting the validity of the cost evaluation framework that we have used to measure the performance of the count models in predicting fault densities for multiple releases of the software. The validity considerations can be grouped into the following categories:

Construct validity: The effectiveness of a count model is measured as Ecost, the estimated fault removal cost. The framework is developed considering the costs incurred to rectify faults in the later phases of software development if they are not identified before testing. In our cost evaluation framework, the unit testing cost of faulty and non-faulty modules is the same, and the testing cost of a particular phase is the same for all modules, i.e., finding a fault in a 100 LOC module costs the same as finding a fault in a 1000 LOC module. We selected eleven fault-significant metrics out of nineteen by keeping those metrics that appeared in at least 50% of the six releases of the dataset.

The framework parameters used in our cost-benefit analysis have been taken from different sources reported in the literature. For example, the cost parameters (i.e., the values of Cu, Cs and Cf ) were taken from Wagner [67]. The fault identification efficiencies (i.e., the values of δu and δs) were taken from Jones [39], and the value of Mp (percentage of modules unit tested) is taken from


the study of Wilde et al. [70]. However, one can substitute these parameters with organization-specific benchmarks to ensure the practical use of the cost-benefit analysis. One can also use other criteria to select the significant metrics, and the results may vary with the choice of criterion.

Internal validity: Our experimental study involves the use of the statistical analysis tools Weka and Stata and data collected from a publicly available software data repository. The fault densities and their distribution depend on the fault data; any bias in these data may influence our findings.

Conclusion validity: We have used the statistics of previous versions to calculate the estimated false positives, false negatives and true positives. These estimated values may differ from the actual values. Here, we compared the values of Ecost with the unit testing cost to decide whether count models are useful. Our results are specific to the versions of the datasets included in the study.

External validity: We do not suggest generalizing our research results to arbitrary project categories because our results show variation in the metrics set when the examined project changes. Our models are built and validated on datasets available in public data repositories. A system developed within a single organization may follow a different effort pattern. One needs to take the underlying pattern of the software into account before applying our approach.

4.3 Discussion

The approach proposed in this chapter suggests an effective use of a subset of the project metrics suite with the count models to predict fault densities. To evaluate the performance of the count models, a set of experiments was carried out. Each count model assigned an expected fault count and fault density to each module of the software system, and the same models were used to predict the fault densities.

We evaluated the performance of the count models using confusion matrix based performance measures. The NBR model in general provided higher prediction accuracy than the other count models for fault densities prediction. The other

Page 84: Thesis  work on TFT

4.4 Summary 69

four count models were found to have lower prediction accuracy compare to NBR

model. On the other hand, NBR model has lower recall values compare to the

other count models. In general, GNBR and ZIN have the highest recall values.

For the value of F-measure, it is observed that NBR model showed higher value

as compare to other count models. The rest of the model produced lower value

of F-measure. Only contradictory results found in PROP4 dataset, where pre-

diction model based on Poisson regression analysis produced the higher accuracy

compare to negative binomial regression (NBR) based model. One possible rea-

son of is that the PROP4 dataset has a minimum number of faults contained

(9.60%). It means that PROP4 has a higher number of zeros, which may lead

to the poor results for it for NBR model. This left a question of effectiveness of

negative binomial regression based model for the software projects that has less

number of faulty modules. Overall, these results show that NBR count model

produced a higher prediction accuracy and reduced the misclassification rate by

a significant amount. These result show that count model based on NBR pro-

duced a higher prediction accuracy and reduced the misclassification rate by a

significant amount.

In this chapter, we have used a cost-benefit model for validating the count models. This analysis aimed to assess the economic viability of the models for fault densities prediction. In this framework, we used the values of the cost parameters from the study of Wagner and the efficiencies of the various testing phases from the study of Jones. We used these values due to the unavailability of organizational benchmarks. These values may not be realistic, but our main contribution is to provide a cost evaluation measure that can assess the cost effectiveness of fault-prediction techniques when they are used in the development process. Changes in the framework parameters only change the resulting threshold values.

4.4 Summary

Count models such as negative binomial regression have the potential to predict the fault densities of software modules by assigning an expected number of faults that best represents the fault occurrence process of the given software. In this chapter, we investigated the performance of five count models


in predicting the fault densities of software modules. The investigation was performed on six releases of a publicly available project dataset. Confusion matrix based evaluation parameters and a cost-benefit framework were used to evaluate the capability of these count models.

Our results suggest that, among all five count models, negative binomial regression showed the best performance for fault prediction; its predictive accuracy is higher than that of the other count models. Contradictory results were shown only in the case of recall, where the NBR model provided lower values. The results of the cost-benefit analysis also confirmed that negative binomial regression is the most cost-effective of the count models. Our aim is to provide a benchmark to estimate the fault removal cost for a newer version when we train the count model with historical information. In future, this work could be generalized further to assess the effectiveness of fault-prediction techniques more globally.


Chapter 5

An Application of the Count

Models to Predict Fault Densities

With Binary Fault Classification

In the previous chapter, we presented an approach to fault densities prediction based on count model analysis, where we initially identified a subset of the project metrics suite containing the metrics significant for fault-correlation by performing Multivariate Linear Regression (MLR) analysis, choosing the fault density of a software module as the dependent variable. Subsequently, we used this subset of metrics with the count models to predict fault densities. In this chapter, we evaluate the effectiveness of the count models when we identify the subset of significant fault-correlated metrics by classifying the faultiness of the software modules into a binary classification, i.e., faulty and non-faulty. This analysis will help to decide whether the nature of the fault classification (i.e., binary class classification versus multi-class classification) used for the selection of significant fault-correlated metrics affects the results of the fault densities prediction.

The rest of the chapter is organized as follows. Section 5.1 describes the data analysis approach and the subset selection process. Section 5.2 describes the experimental evaluation, including the datasets, the metrics (independent variables), the dependent variable and the results of our investigation, followed by the threats to validity. The implications of our results are discussed in Section 5.3.


5.1 The Approach

A software metric that is found to be significantly fault-correlated for modules classified into a binary classification (i.e., faulty or non-faulty) is likely to also be significant for fault proneness if the modules have the same or similar structural properties and fault occurrence. Therefore, the binary classification of faults can serve in the same way as the fault densities information of the software modules. We use this assumption to identify a subset of significant fault-correlated metrics. To select the significant metrics, we use the approach described in chapter 3. The proposed approach involves the initial identification of a subset of the project metrics suite that shows significant fault-correlation. Subsequently, the identified subset of metrics is used with the count models to predict fault densities. The built count models are then validated using confusion matrix parameters and a cost-benefit model.

5.1.1 Subset selection of fault-correlated metrics

We use the approach described in chapter 3 to determine a subset of significant fault-correlated project metrics. We carry out an investigation using class-level object-oriented (OO) metrics as independent variables and the binary classification of the software modules as the dependent variable. For each release of the PROP dataset, we perform a three-step analysis (Univariate Logistic Regression, Spearman's Correlation and Multivariate Linear Regression) to test whether each of the nineteen metrics would be a significant predictor in the count models. This analysis yields the subset of significant metrics corresponding to each release of the dataset. The criterion used to select a metric is that it should appear in at least 50% of the six releases of the dataset.

5.1.2 Count model analysis

The identified subset of metrics is used to construct the count models. We built the count models over all six releases of the software project by training the model on the earlier releases and testing it on the later release. The fault densities of the modules were selected as the dependent variable for the count model analysis. Since the number of faults in each release of the software has a high variance,


we performed a square root transformation to reduce the influence of outlier values and took the logarithmic transformation of the LOC metric. These transformations helped us better fit the model in terms of the log likelihood ratio.

5.1.3 Evaluation of count models

Once these count models have been constructed, we use the various confusion matrix parameters to evaluate their potential for fault densities prediction. Each count model assigns an expected number of faults to each module of the software system; if we use that information as a fault prediction, we can calculate the elements of the confusion matrix to evaluate the overall accuracy of the count models.

5.1.4 Cost-benefit model

We have used the same cost-benefit model as in the previous chapter to quantify the fault removal cost of the different count models.

5.2 Experimental Evaluation

In this section, we present an experimental study to evaluate the performance

of the count models for fault densities prediction. We have used five different

count models namely: Poisson Regression model, Negative Binomial Regression

model, Zero-Inflated Poisson Regression model, Generalized Negative Binomial

Regression model and Zero-Inflated Negative Binomial Regression model, over six

successive releases of a software project dataset consisting of nineteen class-level object-oriented metrics. In this study, we investigated the prediction of fault densities and the number of faults for a given module. Therefore, we selected a measure of fault proneness as the dependent variable.

5.2.1 Metrics set used for the experiment

To perform our experimental investigation, we have used the nineteen measures

of coupling, cohesion, inheritance, encapsulation and complexity of OO software


system. They are as follows: WMC, CBO, RFC, DIT, NOC, IC, CBM, CA, CE, MFA, LCOM, LCOM3, CAM, MOA, NPM, DAM, AMC, LOC and CC. We have applied our subset selection approach to these metrics. Based on the selection criterion, only nine metrics (WMC, DIT, CBO, RFC, CA, CE, LOC, AMC and CC) were selected for further analysis. The results of this analysis are summarized in Table 5.1.

Table 5.1: Identified metrics for each release of the PROP dataset

Dataset    Metrics identified

PROP1 CBO, RFC, WMC, AMC, CC, CE, LOC and DIT

PROP2 CBO, RFC, CA, CE, AMC, CC and DIT

PROP3 CBO, RFC, CA, CE, WMC, AMC, CC, DIT and LOC

PROP4 CBO, RFC, CA, CE, WMC, AMC, CC, DIT and LOC

PROP5 CBO, NPM, LOC, MOA, LCOM3, MFA and CC

PROP6 CA, CE, IC, WMC, LCOM3, LOC and MFA

5.2.2 Experimental data

We have used the PROP dataset with its six successive releases to perform our experimental study and to evaluate our results. For each release, the same nineteen metrics have been calculated and recorded with respect to the software modules. The size of the dataset varies from one release to another, but for all the releases we collected the same nineteen metrics. The releases are listed in Table 5.2. A detailed description of this dataset can be found in the previous chapter.

Table 5.2: Datasets used for the study

PROP1, PROP2, PROP3, PROP4, PROP5 and PROP6

5.2.3 Results

This subsection presents a detailed description of the experimental results. First, we discuss the prediction of the number of faults and the fault densities using the various count models. Next, we compare the overall accuracy and effectiveness of the count models using confusion matrix criteria. Finally, we


present the results of the cost-benefit analysis of the prediction models obtained by the different count models to evaluate them from an economic standpoint.

Figure 5.1: Result of the predicted number of faults using count models (PROP1-PROP6) [bar chart; one group of bars per release; series: Actual, NBRM, P, ZIP, GNBR, ZIN; y-axis: number of faults]

In each scenario, the count models are built on one or more prior releases of the software and evaluated on the latest release. For example, the prediction model based on releases 1 and 2 is evaluated on release 3; the exception is release PROP1, where training and testing were both performed on the same dataset due to the unavailability of any prior release. A similar procedure has been followed for all the count models.

5.2.3.1 Prediction of the number of faults and the fault densities

The count models assigned an expected number of faults and fault densities to each module of the software. Figures 5.1 and 5.2 show the number of faults and the number of faulty modules predicted by each count model. The figures contain a graph for each release of the project dataset showing the predicted number of faults and the number of faulty modules. The blue bar in the figures shows the actual number of faults and faulty modules contained in the release of the project dataset; this is the optimal value, i.e., finding all the faults and faulty instances in each release. The other bars show


the comparison of the prediction results among the count models. The accuracy of each count model is measured in terms of how close its predicted value comes to the actual value. Comparing the results from PROP1 to PROP6, we can see that the negative binomial regression (NBR) model is the closest fit to the actual number of faults. This result is consistent across all the releases of the PROP dataset, except PROP2, where the predicted number of faults and faulty modules is higher than the actual value. The second most accurate prediction model is the Poisson (P) regression model. The rest of the models assign higher values than the actual value, which questions their viability for predicting fault proneness. These results were very encouraging regarding the closeness of the count model predictions to the actual number of faults discovered.

Figure 5.2: Result of the predicted number of faulty modules using count models (PROP1-PROP6) [bar chart; one group of bars per release; series: Actual, NBRM, P, ZIP, GNBR, ZIN; y-axis: number of faulty instances]

Figure 5.3 shows the fault densities predicted by each count model. This information can be useful if a practitioner wants to know which modules are likely to contain the highest number of faults relative to their size. The figure shows the sum of the actual fault densities in each project dataset (blue bars) together with the fault densities predicted by the count models. The predictive capability of each count model is measured by how closely its predictions follow the actual values. It can be observed from the figure that the fault densities predicted by the NBR and P models are the closest fit to the actual values, while the values produced by the other count models are higher than the actual ones. This confirms the potential of the NBR and P models for predicting fault densities in a software system.


Figure 5.3: Result of the fault densities prediction using the count models (PROP1-PROP6). [Bar chart: one group of bars per release, PROP1-PROP6; y-axis is the summed fault density; series: Actual, NBRM, P, ZIP, GNBR, ZIN.]

5.2.3.2 Evaluating the results of the five count models

The results of the previous section showed the potential of the count models to predict the number of faults and the number of faulty modules. In this subsection, we evaluate the effectiveness of the count models by measuring their overall prediction accuracy and misclassification errors using various confusion matrix parameters.
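For illustration, the sketch below shows how these confusion matrix criteria can be computed from the output of a count model, assuming a module is treated as predicted-faulty when its expected fault count reaches an assumed cut-off of 0.5; scikit-learn is used here only for convenience and is not part of our tool chain.

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    def confusion_matrix_criteria(actual_faults, expected_faults, threshold=0.5):
        # A module is actually faulty if it contains at least one fault, and is
        # treated as predicted-faulty when its expected fault count reaches the
        # (assumed) threshold.
        y_true = (np.asarray(actual_faults) > 0).astype(int)
        y_pred = (np.asarray(expected_faults) >= threshold).astype(int)
        return {
            "accuracy": 100 * accuracy_score(y_true, y_pred),
            "precision": 100 * precision_score(y_true, y_pred, zero_division=0),
            "recall": 100 * recall_score(y_true, y_pred, zero_division=0),
            "f_measure": 100 * f1_score(y_true, y_pred, zero_division=0),
        }

    # e.g. criteria for the NBR model on the PROP3 test release:
    # confusion_matrix_criteria(test["bug"], nbr_faults)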

Figure 5.4 shows the accuracy, precision, recall, and F-measure of the count models. The figure contains one graph for each release of the PROP project dataset (sub-figures 1 to 6 in Figure 5.4, corresponding to the six releases). From the figure we can see that, in general, the prediction accuracy of the NBR model is higher than that of all the other count models, except on PROP2, where its accuracy is lower. The prediction accuracy of the NBR model varies from 75% to 85%. The recall of the NBR model is lower than that of the other count models and generally varies between 25% and 75%, because the NBR model predicts some faulty modules incorrectly. The F-measure reflects the trade-off between false positives and false negatives. The F-measure of the NBR model is again higher than that of the other count models, which indicates that the NBR model has the potential to predict fault densities while keeping the misclassification errors low. The F-measures of the other models are, in general, lower than that of the NBR model.

Figure 5.4: Comparison of the count models using various confusion matrix criteria (PROP1-PROP6). [Six bar charts, one per release PROP1-PROP6; each chart plots accuracy, precision, recall, and F-measure (0-100%) for the NBRM, P, ZIP, GNBR, and ZIN models.]


5.2.3.3 Prediction of the number of faults and the fault densities in

modules ranked as top 20%

Tables 5.3 and 5.4 show the percentage of the faults and of the fault densities captured by the top 20% of modules in the six releases of the PROP dataset, for all five count models. The predicted-fault results for PROP1 to PROP6 are given in Table 5.3. The table shows that, for the NBR model, the top 20% of modules contain between 53% and 100% of the faults, with an overall average of 86%. For the other models the average lies between 36% and 46%, which is very low compared to the NBR model. A similar procedure was followed for the fault densities. Table 5.4 shows the percentage of the fault densities captured by the count models for PROP1 to PROP6. For the NBR model, the top 20% of modules contain between 72% and 100% of the fault densities, with an average of 91%. For the other models the average varies between 38% and 50%.
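The following small sketch illustrates how such a top-20% figure can be computed from the predicted and actual values of a release; applied to fault densities instead of fault counts, it reproduces the Table 5.4 style of analysis.

    import numpy as np

    def percent_in_top_20(actual, predicted):
        # Rank modules by predicted value (faults or fault density), take the top
        # 20%, and report the share of the actual total that falls in that slice.
        actual = np.asarray(actual, dtype=float)
        order = np.argsort(-np.asarray(predicted, dtype=float))
        top = order[:int(np.ceil(0.20 * len(order)))]
        return 100.0 * actual[top].sum() / actual.sum()

    # e.g. percent_in_top_20(test["bug"], nbr_faults)      -> Table 5.3 style
    #      percent_in_top_20(actual_density, nbr_density)  -> Table 5.4 style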

Table 5.3: Percentage of faults contained in the modules ranked as top 20% (T = training set)

Model   PROP1        PROP2         PROP3         PROP4         PROP5         PROP6     Average
        (T: prop1)   (T: prop1-2)  (T: prop1-3)  (T: prop1-4)  (T: prop1-5)
NBRM    100%         53%           100%          91.68%        100%          96.26%    86.59%
P       25%          30.41%        43.66%        39.49%        51.40%        30.50%    36.74%
ZIP     21%          20.16%        50.42%        34.42%        30.94%        33.93%    31.81%
GNBR    72.02%       48.46%        43.51%        44.28%        38.89%        27.72%    45.64%
ZIN     30.78%       41.41%        51.90%        41.71%        47.77%        37.03%    41.76%

5.2.3.4 Cost-benefit analysis

The results of the cost-benefit analysis are shown in Figure 5.5. We used the predicted number of faults to calculate the estimated cost (Ecost) of fault prediction for each count model. The value of Tcost was calculated to represent the cost incurred under the normal testing process; it serves as the baseline against which the NEcost of each count model is calculated. Figure 5.5 shows the resulting NEcost values for each of the prediction models.


Table 5.4: Percentage of fault density contained in the modules ranked as top 20% (fault density = faults per 100 lines of code; T = training set)

Model   PROP1        PROP2         PROP3         PROP4         PROP5         PROP6     Average
        (T: prop1)   (T: prop1-2)  (T: prop1-3)  (T: prop1-4)  (T: prop1-5)
NBRM    78%          72%           100%          100%          100%          97.30%    91.21%
P       91.38%       32.57%        28.78%        26.99%        20.24%        31.91%    38.64%
ZIP     72.29%       45.61%        33.48%        36.46%        32.84%        20.95%    40.27%
GNBR    88.16%       94.60%        34.38%        32.47%        30.97%        19.64%    50.03%
ZIN     46.25%       43.78%        34.69%        32.79%        36.48%        34.11%    38.01%

From Figure 5.5 it can be seen that, in general, the NEcost of the NBR model is less than 1, except on the PROP2 dataset. All the other models generally have an NEcost greater than 1, with the exception of the P model, whose NEcost is less than 1 for two datasets, PROP1 and PROP2. These results imply that it is more economical to build the prediction model on negative binomial regression in order to reduce the overall cost of testing.
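The sketch below gives a simplified illustration of an NEcost-style computation. It assumes that Ecost sums a unit-testing cost for every module the model flags plus a higher removal cost for the faults the model misses, and that Tcost is the cost of unit-testing every module; the actual framework and its parameter values are those described in the previous chapter, and the costs below are placeholders only.

    import numpy as np

    UNIT_TEST_COST = 1.0      # assumed cost of unit-testing one module
    ESCAPED_FAULT_COST = 5.0  # assumed (higher) cost of a fault that escapes testing

    def ne_cost(actual_faults, expected_faults, threshold=0.5):
        actual = np.asarray(actual_faults, dtype=float)
        flagged = np.asarray(expected_faults) >= threshold   # modules the model would test
        missed = actual[~flagged].sum()                      # faults escaping to later phases
        ecost = flagged.sum() * UNIT_TEST_COST + missed * ESCAPED_FAULT_COST
        tcost = len(actual) * UNIT_TEST_COST                 # normal testing: test every module
        return ecost / tcost                                 # NEcost < 1 => cheaper than Tcost

    # e.g. ne_cost(test["bug"], nbr_faults)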

Figure 5.5: Cost-benefit model for the count models (PROP1-PROP6). [Bar chart: one group of bars per release, PROP1-PROP6; y-axis is NEcost (0-5); series: NBR, P, ZIP, GNBR, ZIN.]


Based on the above results, we make the following observations.

• The NBR model in general provided higher prediction accuracy than the other count models for fault density prediction. Its performance was poor on the PROP2 dataset, where it assigned higher values for the number of faults and faulty modules than the actual ones. The PROP2 dataset contains the largest number of software modules while containing comparatively few faults, so one possible reason is that the performance of the NBR model was affected by the high skewness of this dataset. For the rest of the datasets, the other four count models were similar in their prediction accuracy, which was lower than that of the NBR model.

• For recall, the NBR model has lower values than the other count models. In general, GNBR and ZIN have the highest recall values.

• For the F-measure, the NBR model showed higher values than the other count models, with the P model showing the second highest values; the remaining count models produced lower F-measure values. These results show that the prediction model based on NBR reduces the misclassification errors by a significant amount.

• The final model selection also took the performance on the cost-benefit model into account. Once again, we found that the NEcost incurred by the NBR model is much lower than that of the other count models across the datasets, except PROP2.

These results are similar to those we observed in the previous chapter, where the significant fault-correlated metrics were selected using the fault densities of the software modules. However, the numbers of faults and the fault densities predicted by the count models here are lower and fit the actual values more closely than the values predicted by the count models in the previous chapter. The cost-benefit analysis also confirms this finding: the cost incurred by the count models here is lower than the earlier values. These results indicate that the performance of the count models is not greatly affected by the process used to select the significant fault-correlated metrics. Once the significant metrics are identified, they can be used with the count models for fault density prediction.


5.2.4 Threats to validity

In this section, we critically examine the possible threats to the validity of our experimental findings. We also highlight the factors affecting the validity of the cost evaluation framework that we used to measure the performance of the count models in predicting fault densities over multiple releases of the software. The validity considerations can be grouped into the following categories:

Construct validity: As discussed in the previous chapter, the effectiveness of a count model is measured as Ecost, the estimated fault removal cost. The framework is developed considering the costs incurred to remove faults in the later phases of software development if they are not identified during testing. In our cost evaluation framework, the unit-testing cost of faulty and non-faulty modules is taken to be the same. The framework parameters used in our cost-benefit analysis have been taken from different sources reported in the literature. We selected nine fault-significant metrics out of nineteen by keeping those metrics that were found significant more than 50% of the time across the six releases of the dataset. Other criteria could also be used to select the significant metrics, and the results may vary with the choice of criterion.
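The selection rule can be sketched as follows, assuming that each release yields a set of metrics flagged as significantly fault-correlated; the per-release sets in the example are hypothetical.

    from collections import Counter

    def select_stable_metrics(per_release_significant, min_fraction=0.5):
        # Keep a metric only if it is flagged significant in more than the given
        # fraction of the releases.
        counts = Counter(m for release in per_release_significant for m in release)
        n = len(per_release_significant)
        return sorted(m for m, c in counts.items() if c / n > min_fraction)

    # Hypothetical per-release results:
    # releases = [{"cbo", "rfc", "wmc"}, {"cbo", "rfc"}, {"cbo", "loc"}, ...]
    # select_stable_metrics(releases)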

Internal validity: Our experimental study involves the use of the statistical analysis tools Weka and Stata, and data collected from a publicly available software data repository. The fault densities and their distribution depend on the recorded fault data, so any bias in these data may influence our findings.

Conclusion validity: We used the statistics of the binary classification of faults to select the significant fault-correlated metrics, and subsequently used these selected metrics with the count models to predict fault densities. Our results are specific to the versions of the datasets included in the study.

External validity: We do not suggest generalizing our results to arbitrary categories of projects, because our results show that the selected metrics set varies when the examined project changes. Our models are built and validated on datasets available in public data repositories; systems developed within an organization may follow different effort patterns. One needs to take the underlying characteristics of the software into account before applying our approach.


5.3 Summary

In this chapter, we presented an application of the count models to fault density prediction, in which the significant fault-correlated metrics were identified by classifying the faultiness of the software modules into two classes, i.e., faulty and non-faulty.

Our results were consistent with those of the previous chapter. The negative binomial regression model showed the best performance for fault density prediction, with the highest predictive accuracy among the count models. The only contradictory results occurred for the recall values, where the NBR model provided lower values. The results of the cost-benefit analysis also confirmed that negative binomial regression was the most cost-effective of the count models. The performance of the count models improved somewhat for fault density prediction, but the improvement does not greatly change their relative performance. These results suggest that the process used to select the fault-significant metrics does not affect the performance of the count models; the main requirement is that the input dataset is not stuffed with unnecessary information that could degrade their performance.


Chapter 6

Conclusions and Future Work

Software metrics can be helpful for assessing the quality of a software system with the desired accuracy. However, the difficulty lies in knowing the right set of metrics that actually capture important quality attributes of a class, such as fault proneness. Validating object-oriented metrics as predictors of software fault proneness is essential to ensure their practical use in fault prediction for object-oriented software systems. In this thesis, we investigated the relationship of existing class-level object-oriented metrics with the fault proneness of software systems. As a follow-up, we presented an approach to identify a subset of software metrics with significant fault correlation, and then used this subset with the count models to predict fault densities over subsequent releases of the software system.

We performed two sets of experimental investigations using project fault datasets taken from the PROMISE data repository that provide object-oriented metrics at the class level. The first set of investigations evaluated the performance of the selected metrics subset against the original project metric suite. The second set of investigations used the identified metrics subset with various count models to predict fault densities.

In the first set of investigations, we identified the metrics subset consisting of metrics with significant fault correlation. We performed this investigation over five software project datasets, namely Camel, Xalan, Xerces, Ivy, and Velocity, with their multiple successive releases. We used the confusion matrix criteria Accuracy, Precision, Recall, and AUC (area under the ROC curve) to estimate the overall prediction accuracy and misclassification errors of the prediction models. Our results demonstrated that the identified metrics subset produced improved fault prediction performance compared to the original project metrics suite.

We performed our second set of investigations using five different count models over six successive releases of the PROP software project dataset, available in the PROMISE data repository. The prediction results were evaluated using confusion matrix parameters and a cost-benefit model. Our results suggested that, among the five count models used, the negative binomial regression (NBR) model produced the best performance for fault density prediction, with the highest predictive accuracy. The results of the cost-benefit analysis also confirmed that the prediction model based on negative binomial regression was the most cost-effective of the count models. Although the NBR model produced lower recall values, the F-measure established that the NBR model offered the best trade-off between precision and recall among the five count models used.

In this thesis, we used one approach to select a subset of significant fault-correlated metrics. However, many other subset selection techniques, for example wrappers, filters, or PCA, are also available and need to be investigated for their potential to identify a metrics subset. Since our subset selection approach identified a different subset of metrics for each project dataset, we wish to investigate other approaches or techniques that may identify a generalized subset of metrics by accounting for the inevitable differences that exist across projects and systems.

In the future, we intend to explore alternative approaches to investigate and validate our results, in order to further strengthen or revise the arguments made in this thesis. We also plan to collect more software project datasets to enhance the applicability of the presented approach in real-world settings.



Publications

• Santosh Singh Rathore and Atul Gupta, "Validating the Effectiveness of Object Oriented Metrics over Multiple Releases for Predicting Fault Proneness". In Proceedings of the Nineteenth Asia-Pacific Software Engineering Conference (APSEC'12), Hong Kong, pp. 270-275, 4-7 Dec 2012. DOI: 10.1109/APSEC.2012.148.

• Santosh Singh Rathore and Atul Gupta, "Investigating Object-Oriented Design Metrics to Predict Fault-Proneness of Software Modules". In Proceedings of the Sixth International Conference on Software Engineering (CONSEG'12), Indore, India, 5-7 Sep 2012. DOI: 10.1109/CONSEG.2012.6349484.

• Saurabh Tiwari, Santosh Singh Rathore, Abhinav Singh, Abhijeet Singh and Atul Gupta, "An Approach to Generate Actor-Oriented Activity Charts from Use Case Requirements". In Proceedings of the Nineteenth Asia-Pacific Software Engineering Conference (APSEC'12), Hong Kong, pp. 350-355, 4-7 Dec 2012. DOI: 10.1109/APSEC.2012.149.

• Saurabh Tiwari, Santosh Rathore, Sudhanshu Gupta, Vaibhav Gagote and Atul Gupta, "Analysis of Use Case Requirements using SFTA and SFMEA Techniques". In Proceedings of the Seventeenth International Conference on Engineering of Complex Computer Systems (ICECCS'12), Paris, France, pp. 29-38, 18-20 July 2012. DOI: 10.1109/ICECCS.2012.10.

• Santosh Singh Rathore and Atul Gupta, "Using Negative Binomial Regression Analysis to Predict Fault Densities in Software Modules". In the Seventeenth International Conference on Evaluation and Assessment in Software Engineering (EASE'13), Brazil. (Submitted).