
  • The Economics of Fault Prediction

Submitted in partial fulfillment of the requirements for the degree of

    Master of Technology

    by

    Deepak Banthia

    (1010102)

    under the guidance of

    Dr. Atul Gupta

    Computer Science & Engineering

    INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,

    DESIGN AND MANUFACTURING JABALPUR, INDIA

    2012

  • Approval Sheet

    This thesis entitled The Economics of Fault Prediction submitted by

Deepak Banthia (1010102) is approved for partial fulfillment of the requirements for the degree of Master of Technology in Computer Science and

    Engineering.

    Examining Committee

    ................................................

    ................................................

    ................................................

    Guide

    ................................................

    ................................................

    ................................................

    Chairman

    ................................................

    Date .......................... ................................................

    Place ......................... ................................................

  • Certificate

    This is to certify that the work contained in the thesis entitled, The Economics

    of Fault Prediction, submitted by Deepak Banthia (Roll No. 1010102) in

partial fulfillment of the requirements for the degree of Master of Technology in

    Computer Science and Engineering, has been carried out under my supervision and

    that this work has not been submitted elsewhere.

    (Atul Gupta) ............ , 2012

    Associate Professor

    Computer Science & Engineering Discipline

    Indian Institute of Information Technology, Design and Manufacturing Jabalpur

    Jabalpur, India.

  • Acknowledgments

This thesis would not have been possible without the sincere help and contributions of several people. I would like to use this opportunity to express my sincere gratitude to them.

Firstly, I would like to thank God, with whose blessings I could turn my idea into reality. I express my deep sense of gratitude towards my mentor and thesis supervisor Dr. Atul Gupta for his valuable guidance, moral support and

    constant encouragement throughout the thesis. His approach towards software

    engineering will always be a valuable learning experience for me. No words can

    express my feelings towards him for taking such a keen interest in my academics

and personal welfare. His dedication, professionalism and hard work have been

    and shall be a source of inspiration throughout my life.

The contributions of a mother to the success of her child can be neither measured nor directly repaid. To such a mother, who is but a manifestation of the divine virtues of the Earth, this report is one petite offering. Thank you, parents, for all the liberty, prosperity, confidence and discipline showered on me. This thesis would not have been completed without the motivation and blessings of my parents. My fiancée (Nisha) brought a light inside me and always filled me with enthusiasm and vigour to do my jobs with complete effort and dedication. Thanks to her for accompanying me all the way and for her unflinching help and support in all my endeavours. I would like to thank my uncles Mr. Hem Kumar Banthia and Mr. Khagendra Kumar Banthia for their encouragement throughout my studies. Along with them, I also received energy and motivation from my sisters for my career. I would also like to give my sincere thanks to Mr. Amaltas Khan, Mr. Arpit Gupta, Mr. Ravindra Singh, Mr. Santosh Singh Rathore and Mr. Saurabh Tiwari for their support and being there always, no matter what.

I thank the CSE fraternity at IIITDM Jabalpur and extend my special thanks to my batch mates.

    Jabalpur Deepak Banthia

    ..........., 2012


  • Abstract

Fault-prediction techniques aim to predict fault-prone software modules in order to streamline the efforts to be applied in the later phases of software development. Normally, the effectiveness of a fault-prediction technique is demonstrated by training it on a part of some known fault data and measuring its performance against the other part of the fault data. There have been many efforts comparing the performance of various fault-prediction techniques on different project datasets. However, invariably most of these studies have also recorded high misclassification rates (normally 15 to 35%), besides not-so-high accuracy figures (normally 70 to 85%). This raises serious concerns about the viability of these techniques. In this thesis, we first present a brief summary of the results of some of the earlier studies undertaken in fault prediction and argue about their usefulness. As a follow-up, we then investigate two important and related research questions regarding the viability of fault prediction. First, for a given project, are the fault prediction results useful? In case of an affirmative answer, we then look at how to choose a fault-prediction technique for an overall improved performance in terms of cost-effectiveness. Here, we propose an adaptive cost evaluation framework that incorporates cost drivers for various fault removal phases and performs a cost-benefit analysis for the misclassification of faults. We then used this framework to investigate the usefulness of various fault prediction techniques in two different settings. The first part of the investigation consisted of performance evaluation of five major fault-prediction techniques on nineteen public datasets. Here, we found fault prediction useful for projects with a percentage of faulty modules less than a certain threshold, and there was no single technique that could provide the best results in all cases, i.e. for all nineteen project datasets. In the other part of the investigation, and as a practical use of the proposed framework, we have demonstrated that the fault information of the previous versions of the software can be effectively used to predict fault proneness in the current version of the software. Here, we found fault prediction useful when the difference between inter-version fault rates was below a certain threshold. Also, the usability of fault prediction was found to reduce with an increase in the inter-version fault rate.


  • Contents

Approval
Certificate
Acknowledgments
Abstract
List of Figures
List of Tables
List of Symbols
Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Organization
  1.4 Summary

2 Related Work
  2.1 Fault Prediction Models
  2.2 Public Datasets
  2.3 Evaluation Measures
    2.3.1 Numerical measures
    2.3.2 Graphical evaluation measures
  2.4 Fault Prediction Studies
  2.5 Estimating Cost of Fault Prediction
  2.6 Summary

3 Fault Prediction Results: How Useful Are They?
  3.1 Issues in Fault Prediction
  3.2 A Proposed Model for Evaluating Fault Prediction Efficiency
    3.2.1 General arguments
    3.2.2 Evaluation model
  3.3 Revisiting Fault Prediction Results
  3.4 Summary

4 A Cost Evaluation Framework
  4.1 The Evaluation Framework
  4.2 Experimental Study
    4.2.1 Experimental setup
    4.2.2 Experiment execution
    4.2.3 Results
    4.2.4 Experiment findings
    4.2.5 Threats to validity
    4.2.6 Discussion
  4.3 Summary

5 An Application of Cost Evaluation Framework for Multiple Releases
  5.1 The Procedure
  5.2 Experimental Study
    5.2.1 Experimental setup
    5.2.2 Experiment execution
    5.2.3 Results
    5.2.4 Threats to validity
  5.3 Summary

6 Conclusions and Future Work

References
Publications
Index


  • List of Figures

1.1 Thesis structure
3.1 Cost statistics for faulty modules
3.2 Cost statistics for non-faulty modules
4.1 Decision chart representation to evaluate the estimated Ecost
4.2 Value of NEcost for category 1 when u = 0.25 and s = 0.5
4.3 Value of NEcost for category 2 when u = 0.25 and s = 0.5
4.4 Value of NEcost for category 3 when u = 0.25 and s = 0.5
4.5 Cost characteristics of used fault-prediction techniques when u = 0.5 and s = 0.65
4.6 Cost characteristics of used fault-prediction techniques when u = 0.25 and s = 0.5
4.7 Cost characteristics of used fault-prediction techniques when u = 0.15 and s = 0.25
5.1 Decision chart representation to evaluate the estimated Ecost
5.2 Value of Ecost for Jedit versions when u = 0.25 and s = 0.5

  • List of Tables

2.1 Datasets used in the study
2.2 Confusion matrix
2.3 Fault Prediction Studies
3.1 NASA datasets
3.2 Experiment results for dataset CM1
3.3 Experiment results for dataset kc1
3.4 Experiment results for dataset kc2
3.5 Experiment results for dataset pc1
4.1 Removal costs of test techniques (in staff-hours per defect) [52]
4.2 Fault identification efficiencies of different test phases [26]
4.3 Used projects from NASA [1] and PROMISE data repository [2]
4.4 Categorization of projects based on the fraction of faulty modules
4.5 Result of experiment for PC1 (1109)
4.6 Result of experiment for AR1 (121)
4.7 Result of experiment for NW1 (403)
4.8 Result of experiment for KC3 (458)
4.9 Result of experiment for CM1 (498)
4.10 Result of experiment for PC3 (1563)
4.11 Result of experiment for ARC (234)
4.12 Result of experiment for PC4 (1458)
4.13 Result of experiment for KC1 (2109)
4.14 Result of experiment for AR4 (107)
4.15 Result of experiment for JM1 (10885)
4.16 Result of experiment for KC2 (522)
4.17 Result of experiment for Camel 1.6 (858)
4.18 Result of experiment for Ant 1.6 (351)
4.19 Result of experiment for Ant 1.7 (493)
4.20 Result of experiment for MC2 (161)
4.21 Result of experiment for J-edit 3.2 (272)
4.22 Result of experiment for Lucene 2.0 (195)
4.23 Result of experiment for J-edit 4.0 (274)
5.1 Used projects from PROMISE data repository [2]
5.2 Prediction results for Ant 1.6
5.3 Prediction results for Ant 1.7 when fault prediction model trained using Ant 1.6
5.4 Results of experiment to calculate the Ecost for Ant 1.7 using information of Ant 1.6
5.5 Prediction results for Jedit 4.0 (3 cross-validation)
5.6 Results of experiment to calculate the Ecost for Jedit 4.1 using information of Jedit 4.0
5.7 Prediction results for Jedit 4.0 and Jedit 4.1 (3 cross-validation)
5.8 Results of experiment to calculate the Ecost for Jedit 4.2 using information of Jedit 4.0 and 4.1
5.9 Prediction results for Jedit 4.0, Jedit 4.1 and Jedit 4.2 (3 cross-validation)
5.10 Results of experiment to calculate the Ecost for Jedit 4.3 using information of Jedit 4.0, 4.1 and 4.2
A1 Details of used metrics
A2 Metrics used in datasets


  • List of Symbols

Cf Normalized fault removal cost in field

    Ci Initial setup cost of used fault prediction approach

    Cs Normalized fault removal cost in system testing

    Cu Normalized fault removal cost in unit testing

    Mp Percentage of modules unit tested

s Fault identification efficiency of system testing

u Fault identification efficiency of unit testing

  • Abbreviations

    Acc Accuracy

    AUC Area Under the Curve

    Ecost Estimated Fault Removal Cost of the software when we

    use fault prediction

EFN Estimated number of False Negatives

    EFP Estimated number of False Positives

    ETP Estimated number of True Positives

    FN False Negative

    FNR False Negative Rate

    FP False Positive

    FPR False Positive Rate

    NEcost Normalized Estimated fault removal cost of the software

    when we use fault prediction

    NPV Negative Predictive Value

    PD Probability of Detection

    PF Probability of False Alarm

    PPV Positive Predictive Value

    PR Precision

Tcost Estimated fault removal cost of the software without the use of fault prediction

    TN True Negative

    TP True Positive

  • Chapter 1

    Introduction

Software fault prediction has become an important area of research in the arena of the Software Development Life Cycle. It has the potential to aid in ensuring the desired software quality as well as to achieve an economical development process. The potential of fault prediction is backed by its ability to identify the fault-prone software modules before the actual testing process begins. This helps in obtaining the desired software quality in optimum time, with optimized cost and effort.

Most of the major development organizations spend a lot of time and effort on research in the field of quality assurance activities. But the practical usage of fault prediction is equivocal. This indicates that there is a need for further research in this field that would emphasize how it is applicable in the quality assurance process.

    1.1 Motivation

The software quality assurance process focuses on the identification and quick removal of faults from the artifacts that are generated and subsequently used in the development of software. Fault prediction can help in this by identifying the fault-prone modules in the early stages of the development life cycle, which can then lead to a more streamlined effort being applied. The fault-proneness information not only points to the need for increased quality monitoring during development but also provides important advice for undertaking suitable verification and validation activities that eventually lead to improving the effectiveness and efficiency of the fault finding process.

Fault prediction is a process to predict the fault-prone software modules without executing them. Conventionally, fault prediction is done by applying machine-learning techniques over project datasets. The effectiveness of a fault-prediction technique is demonstrated by training it on a part of some known fault data and measuring its performance against the other part of the fault data. Recently, several software project data repositories became publicly available, such as the NASA Metrics Data Program [1] and the PROMISE Data Repository [2]. Availability of these public datasets has encouraged undertaking more investigations and their replications. A wide range of fault-prediction techniques has been applied to demonstrate their effectiveness on these datasets [19][8][28][49][38].

However, there are certain crucial issues which need to be resolved before the results of such prediction can be incorporated in practice. An important concern is related to the lack of suitable performance evaluation measures that would assess the economics of fault prediction if adopted in the software development process [6]. Another concern is about the typical prediction accuracy of a fault-prediction technique, which is found to be considerably low, ranging from 70 to 85 percent [32][19][20], compared to the high accuracy results obtained in other fields like image recognition, spam filters, etc. Yet another concern can be attributed to the unequal distribution of fault data, which may lead to biased learning. We know from experience that fault distributions typically emulate the Pareto principle, and hence the accuracy figures obtained from fault prediction can be grossly misleading, as a fault-prediction technique can produce high accuracy results by mostly classifying non-faulty modules as non-faulty.

The key functionality of fault prediction is to identify the highest possible number of faults with the least possible resources. However, the concerns mentioned above pose serious threats to the fault prediction results being used to streamline quality assurance activities undertaken during software development. We need to investigate further what these results mean and whether they can be used economically in the software development process.



    1.2 Objectives

The main objective of this thesis work is to propose a cost evaluation framework that helps to put the results of a fault-prediction technique in proper perspective. If the results of the fault prediction are to be used in the development process, the framework can provide an estimate of the savings in the efforts applied in subsequent phases of the software development. Specifically, we aim to answer, for a given project dataset, whether fault prediction would help, and if yes, how to choose a fault-prediction technique that would yield the optimal results.

With this dissertation, we will investigate:

Q1: For a given project, would fault prediction economically help in software development?

Q2: If yes, then how to select a fault-prediction technique for overall optimum performance?

    1.3 Thesis Organization

The overall structure of this thesis is illustrated in Figure 1.1. The content can broadly be divided into three major sections, namely Background Research, Research Contribution and Research Prospects.

Figure 1.1: Thesis structure

Chapter 2 summarizes the concepts which are relevant to this study. In particular, fault prediction models, details of the public datasets used in our experimental study, model evaluation techniques and a literature review of previous related studies are given in this chapter.

In chapter 3, we present an insight into the economy of fault prediction. In particular, we first revisit the results of some of the previous fault prediction studies on the basis of the economics of faults. Then, we refine the criteria based on fault misclassification and again measure the performance of the above-said fault-prediction techniques on the basis of cost effectiveness. We used four NASA MDP datasets to perform our study. Here, our results suggested that simple techniques like IBK perform better over most of the datasets.

In chapter 4, we propose a cost evaluation framework that can help to answer both questions using limited fault data. Essentially, the framework can provide an estimate of the savings in the efforts applied by using the results of the fault prediction in subsequent phases of the software development. To construct the cost evaluation framework, we accounted for the typical fault removal cost of different testing phases [52], along with their fault identification efficiency [26]. The first question can be answered by comparing the fault removal cost in both cases, i.e. with and without the use of fault prediction.
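Before the full framework is developed in Chapter 4, the comparison can be pictured with a deliberately simplified sketch. The two cost formulas below, the module counts and all cost figures are our own illustrative assumptions (reusing the symbols Ci, Cu, Cs, Cf, u and s from the List of Symbols); they are not the thesis's actual equations.

```python
# Hypothetical, simplified cost comparison: is fault prediction worth it here?
# All formulas and numbers below are illustrative assumptions, not the
# framework derived in Chapter 4.

def tcost(faults, Cs, Cf, s):
    """Removal cost without prediction: a fraction s of faults is caught in
    system testing, the rest escape to the field."""
    return Cs * s * faults + Cf * (1 - s) * faults

def ecost(tp, fp, fn, Ci, Cu, Cs, Cf, u, s):
    """Removal cost with prediction: predicted-faulty modules (TP + FP) get
    extra unit testing, which catches a fraction u of the TP faults; missed
    faults (FN plus unit-test escapes) fall through to system test and field."""
    escaped = fn + (1 - u) * tp
    return Ci + Cu * (tp + fp) + Cs * s * escaped + Cf * (1 - s) * escaped

t = tcost(faults=80, Cs=6.2, Cf=27.0, s=0.5)
e = ecost(tp=60, fp=40, fn=20, Ci=50.0, Cu=2.5, Cs=6.2, Cf=27.0, u=0.25, s=0.5)
print("Tcost=%.1f Ecost=%.1f -> %s" % (t, e, "use prediction" if e < t else "skip prediction"))
```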

Here, we investigated the usefulness of fault-prediction techniques based on the proposed framework by using limited fault data. The investigation consisted of performance evaluation of five major fault-prediction techniques on nineteen public datasets. We used five well-known fault-prediction techniques, namely Random Forest, J48 (C4.5 decision tree), Neural Network, K-means Clustering and IBK (K-nearest neighbors). These datasets provide a wide range of percentages of faulty modules (varying from 7 to 49 percent). We categorized these datasets into three categories based on the fault information. We used the WEKA machine learning tool to perform all listed experiments. The results of this study suggested that fault prediction can be useful for projects with a percentage of faulty modules less than a certain threshold (in our case, it varied from 21% to 42% over the specified range of testing phase efficiencies). Also, there was no single technique that could provide the best results in all cases.

In chapter 5, we show the application of the proposed cost framework over multiple subsequent releases of a software system. We evaluated the fault removal cost of the current version of the software using the fault information available from its previous versions. This estimated fault removal cost then helps to decide whether fault prediction is useful or not for the current version. To answer both research questions, we investigated the usefulness of fault-prediction techniques based on the framework on successive versions of two different software systems, namely Ant and Jedit. Here, we found fault prediction useful when the difference between inter-version fault rates was below a certain threshold (in our case, it was 2%). Also, the usability of fault prediction was found to reduce with an increase in the inter-version fault rate. Here, the difference between inter-version fault rates denotes the difference between the percentages of faulty modules present in successive versions.

Finally, we conclude the contribution of our research in Chapter 6. The future prospects of our research are also discussed in the same chapter.

    1.4 Summary

Fault-prediction techniques are used to identify faults in software code without executing it. Thus they have the potential to help the validation and verification process by accurately identifying faults. They may also contribute to an economical software development process. But most organizations still do not consider fault-prediction techniques, even though their potential has been validated in a number of studies. This indicates that there is a need for further research in this field that would emphasize how fault prediction can improve the quality assurance process. In this chapter, we highlighted the issues in the fault prediction arena and summarized our work, which tries to put fault prediction results in the correct perspective, i.e. cost effectiveness.


  • Chapter 2

    Related Work

In this chapter, we summarize the concepts which are relevant to this study. In particular, fault prediction models, details of the public datasets used in the research study, model evaluation techniques and a literature review of previous related studies are given.

    2.1 Fault Prediction Models

Fault prediction allows testers to deploy their resources more effectively and efficiently, which would potentially result in higher quality products and lower costs. Fault prediction is typically performed by applying various machine learning algorithms to known properties learned from project fault datasets. The typical way of predicting faults in software modules is to use software metrics and fault data (collected from previous releases or similar projects) to construct a fault-prediction model, which is then used to predict the fault proneness of new modules. For example, a module examined by a fault-prediction technique is classified as faulty if its feature (metric) values are similar to those of a faulty module that was used to train the technique.
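To make the similarity idea concrete, the following minimal sketch uses a k-nearest-neighbours classifier (the same idea as the IBK technique used later in the thesis); scikit-learn is our assumption here, and the module metric values are made up for illustration.

```python
# Minimal k-nearest-neighbours sketch of metric-similarity based prediction.
# The module metrics below are invented for illustration only.
from sklearn.neighbors import KNeighborsClassifier

# Each row: [lines_of_code, cyclomatic_complexity] for a module of a past release.
train_X = [[120, 14], [900, 55], [60, 4], [75, 6], [1100, 70]]
train_y = [0, 1, 0, 0, 1]  # 1 = the module turned out to be faulty

knn = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)

# A new module whose metrics resemble the known faulty modules is predicted faulty.
print(knn.predict([[950, 60]]))  # -> [1]
```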

Many techniques have been proposed to estimate the fault-proneness of a software module, including clustering, Decision Trees, Neural Networks, Dempster-Shafer Belief Networks, Random Forest and Quad Tree based K-Means [19][20][8][27][28][9][49].

Different Approaches for Fault Prediction Models


A project manager needs to make sure a project meets its timetable and budget without loss of quality. In order to help project managers make such decisions, fault prediction models play an important role in allocating software quality assurance resources. Existing research on software fault-proneness models focuses on predicting faults from two perspectives:

The number of faults or fault density: This kind of technique predicts the number of faults (or the fault density) in a module or a component. These models typically use data from historical versions (or pre-release parts) and predict the faults in the new version (or the newly developed parts). For example, the fault data from historical releases can be used to predict faults in updated releases [46][33][50][23].

Classification: Classification predicts which modules (components) contain faults and which do not. The goal of this kind of prediction is to distinguish fault-free subsystems from faulty ones. This allows project managers to focus resources on fixing the faulty subsystems.

There are two methods to classify fault-prone modules as distinct from fault-free modules: supervised learning and unsupervised learning. Both are used in different situations. When a new system without any previous release is built, unsupervised learning needs to be adopted in order to predict fault-prone subsystems among the newly developed subsystems (modules, components, or classes). After some subsystems have been tested and put into operation, these pre-release subsystems can be used as training data to build software fault prediction models that predict for new subsystems. This is when supervised learning can be used. The difference between supervised and unsupervised learning lies in the status of the training data's class labels: if they are unknown, the learning is unsupervised; otherwise, it is supervised.

Supervised Learning. Learning is called supervised because the method operates under supervision, being provided with the actual outcome for each of the training examples. Supervised learning requires known fault measurement data (i.e. the number of faults, fault density, or whether a module is fault-prone) for the training data. Usually, fault measurement data from previous versions [46], pre-release data [44], or a similar project [29] can act as training data for predicting new projects (subsystems).

Most research reported in fault prediction, including the experiments in this dissertation, uses supervised learning. The result of supervised learning is easier to judge than that of unsupervised learning, which probably helps to explain why there are abundant reports on supervised learning in the literature and few on unsupervised learning. As in most research conducted in fault prediction, a dataset with all known classes is divided into training data and testing data: the classes for the training data are provided to a machine learning algorithm, while the testing data acts as the validation set and is used to judge the trained models. The success rate on the test data gives an objective measure of how well the machine learning algorithm performs. Repeating this process multiple times with randomly divided training and testing sets is standard data mining practice, called cross-validation. As in other data mining research, randomization, cross-validation and bootstrapping are the standard statistical procedures for fault prediction in software engineering.
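As an illustration of this train/test protocol, the sketch below runs 10-fold cross-validation of a supervised fault predictor. The thesis experiments used WEKA, so this scikit-learn version, the file name and the "defective" column are assumptions made purely for illustration.

```python
# Supervised fault prediction with 10-fold cross-validation (illustrative;
# the dataset path and column name are assumptions).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("kc1.csv")                 # module metrics + known fault labels
X = data.drop(columns=["defective"])          # static code metrics
y = data["defective"]                         # class: faulty or not (training labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Each fold: train on 9/10 of the modules, validate on the held-out 1/10.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("mean cross-validated accuracy: %.3f" % scores.mean())
```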

Unsupervised Learning. Sometimes we may not have fault data, or only very few modules may have previous fault data. For example, if a new project is being developed or previous fault data has not been collected, supervised learning approaches do not work because we have no labeled training data. Therefore, unsupervised learning approaches such as clustering methods may be applied. However, research on this approach is seldom reported. As far as the author is aware, Zhong et al. [55][56] were the first group to investigate this in fault prediction. They used Neural-Gas and K-means clustering to group software modules into several clusters, with the help of human experts to label each cluster as fault-prone or not fault-prone. Their results indicate promising potential for this unsupervised learning method.
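A minimal sketch of this clustering idea is shown below. It is not Zhong et al.'s exact procedure: the heuristic that stands in for their human experts (flagging the cluster with the higher mean complexity as fault-prone), the file name and the column name are all assumptions.

```python
# Unsupervised fault-proneness labelling via K-means (illustrative sketch).
import pandas as pd
from sklearn.cluster import KMeans

X = pd.read_csv("new_project_metrics.csv")   # unlabeled module metrics (assumed file)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Stand-in for the human expert: treat the cluster whose modules have the
# higher mean cyclomatic complexity as the fault-prone one (assumed column).
fault_prone = X.groupby(labels)["cyclomatic_complexity"].mean().idxmax()
print("modules flagged fault-prone:", int((labels == fault_prone).sum()))
```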

    2.2 Public Datasets

Several software project data repositories have become publicly available, such as the NASA Metrics Data Program [1] and the PROMISE Data Repository [2]. NASA MDP is a software project metrics repository provided by NASA and is available to users through their website. NASA MDP stores and organizes the software metrics data and the associated fault data at the module level. Currently, there are thirteen project datasets available. All NASA MDP datasets are also available in the PROMISE public repository. There are ninety-four defect datasets available in PROMISE. These datasets can therefore be used to validate the performance of various fault-prediction techniques. In the experiments of this thesis work, we used twenty-three public datasets from the NASA and PROMISE data repositories.

    Table 2.1: Datasets used in the study

Project    Faulty (%)    Number of Modules    Language    Source

    Jedit 4.3 2.23 492 Java PROMISE

    pc1 6.94 1109 C NASA MDP

    ar1 7.44 121 C PROMISE

    nw1 7.69 403 C NASA MDP

    kc3 9.34 458 Java NASA MDP

    cm1 9.84 498 C NASA MDP

    pc3 10.24 1563 C NASA MDP

    Arc 11.54 234 C++ PROMISE

    pc4 12.21 1458 C NASA MDP

    kc1 15.46 2109 C++ NASA MDP

    Jedit 4.2 13.07 367 Java PROMISE

    ar4 18.69 107 C PROMISE

    jm1 19.35 10885 C NASA MDP

    kc2 20.5 522 C++ NASA MDP

    camel1.6 21.91 858 Java PROMISE

    ant1.6 26.21 351 Java PROMISE

    Jedit4.0 24.5 306 Java PROMISE

    Jedit4.1 25.32 312 Java PROMISE

    ant1.7 27.79 493 Java PROMISE

    mc2 32.3 161 C++ NASA MDP

    jedit 3.2 33.09 272 Java PROMISE

    lucene2.0 46.67 195 Java PROMISE

    jedit 4.0 m 48.9 274 Java PROMISE

The details of these datasets are tabulated in Table 2.1. These datasets correspond to different programming languages and have different numbers of software metrics, varying from eight to forty. The descriptions of the used datasets along with their metrics are given in the Appendix (Appendix Table A1 and Appendix Table A2).

    2.3 Evaluation Measures

In this section, we summarize the various evaluation measures used by researchers to evaluate the performance of a fault-prediction technique. These measures can be broadly classified into two major categories: numerical measures and graphical measures.

    2.3.1 Numerical measures

All numerical measures can be derived from the confusion matrix. A confusion matrix contains information about the actual and predicted classifications made by a fault-prediction technique. Table 2.2 shows the confusion matrix for a two-class classification.

Table 2.2: Confusion matrix

                          Defect Present: No      Defect Present: Yes
Defect Predicted: No      TN = True Negative      FN = False Negative
Defect Predicted: Yes     FP = False Positive     TP = True Positive

Accuracy:
The prediction accuracy of a fault-prediction technique is measured as

    Accuracy = (TN + TP) / (TN + TP + FN + FP)        (2.1)

False positive rate (FPR):
It is measured as the ratio of modules incorrectly predicted as faulty to all non-faulty modules. False alarm rate and Type I error are equivalent to FPR.

    FPR = FP / (TN + FP)        (2.2)

False negative rate (FNR):
It is measured as the ratio of modules incorrectly predicted as non-faulty to all faulty modules. Type II error is equivalent to FNR.

    FNR = FN / (TP + FN)        (2.3)

Precision:
It is measured as the ratio of modules correctly predicted as faulty to all modules predicted as faulty.

    Precision = TP / (TP + FP)        (2.4)

Recall:
It is measured as the ratio of modules correctly predicted as faulty to all faulty modules. Probability of detection (PD) is equivalent to recall.

    Recall = TP / (TP + FN)        (2.5)

F-measure:
It is measured as the harmonic mean of precision and recall [36].

    F-measure = (2 × Precision × Recall) / (Precision + Recall)        (2.6)

G-mean:
The G-mean indices are defined in expressions (2.7) and (2.8). G-mean1 is the square root of the product of the probability of detection (PD) and precision. G-mean2 is the square root of the product of PD and specificity [35].

    G-mean1 = sqrt(PD × Precision)        (2.7)

    G-mean2 = sqrt(PD × Specificity)        (2.8)

J-coefficient (J-coeff):
It characterizes the performance of a prediction technique as the balance between detection and false alarms [51].

    J-coeff = PD − PF        (2.9)

When J-coeff is 0, the probability of detecting a faulty module is equal to the false alarm rate. When J-coeff is greater than 0, PD is greater than PF. J-coeff = 1 represents perfect classification, while J-coeff = −1 is the worst case, in which all modules are predicted inaccurately.
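The sketch below implements measures (2.1)-(2.9) directly from the confusion-matrix counts; the only added assumption is the usual definition of specificity, TN / (TN + FP), which the text uses but does not spell out.

```python
# Numerical evaluation measures (2.1)-(2.9) computed from TP, FP, TN, FN.
from math import sqrt

def prediction_measures(tp, fp, tn, fn):
    recall = tp / (tp + fn)            # (2.5), also PD
    pf = fp / (tn + fp)                # (2.2), FPR / probability of false alarm
    precision = tp / (tp + fp)         # (2.4)
    specificity = tn / (tn + fp)       # assumed standard definition
    return {
        "accuracy":  (tn + tp) / (tn + tp + fn + fp),                # (2.1)
        "fpr":       pf,                                             # (2.2)
        "fnr":       fn / (tp + fn),                                 # (2.3)
        "precision": precision,                                      # (2.4)
        "recall":    recall,                                         # (2.5)
        "f_measure": 2 * precision * recall / (precision + recall),  # (2.6)
        "g_mean1":   sqrt(recall * precision),                       # (2.7)
        "g_mean2":   sqrt(recall * specificity),                     # (2.8)
        "j_coeff":   recall - pf,                                    # (2.9)
    }

# Example with made-up counts: 30 faulty modules caught, 50 missed.
print(prediction_measures(tp=30, fp=20, tn=400, fn=50))
```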

    2.3.2 Graphical evaluation measures

Graphical measures depict the relationship between two or more numerical measures. Like the numerical measures, the graphical measures can also be derived from the confusion matrix.

ROC curve [54]:
An ROC curve provides a visualization of the tradeoff between the ability to correctly predict fault-prone modules (PD) and the rate of incorrectly predicted fault-free modules (PF). The area under the ROC curve (denoted AUC) is a numeric performance evaluation measure used to compare fault-prediction techniques. In an ROC curve, the best performance corresponds to high PD and low PF.

PR curve [14]:
A PR curve provides a visualization of the tradeoff between precision and recall. In a PR curve, the x-axis represents recall and the y-axis precision. Recall is another term for PD. In a PR curve, the best performance corresponds to high PD and high precision.

Cost curve [15]:
A cost curve provides a visualization of the cost of misclassification. It describes the performance of a fault-prediction technique on the basis of the cost of misclassification. Its y-axis represents the normalized expected misclassification cost, which indicates the difference between the maximum and minimum costs of misclassifying faulty modules. The x-axis represents the probability cost function.
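As a small illustration of the ROC/AUC evaluation just described, the sketch below scores a hold-out split by PD and PF at varying thresholds; scikit-learn, the file name and the "defective" column are assumptions (the thesis itself used WEKA).

```python
# ROC-curve / AUC evaluation of a fault predictor on a hold-out set (illustrative).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

data = pd.read_csv("pc1.csv")                       # assumed metrics file with labels
X, y = data.drop(columns=["defective"]), data["defective"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Predicted probability of the faulty class, swept over thresholds for the curve.
probs = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
pf, pd_rate, _ = roc_curve(y_te, probs)             # x-axis: PF, y-axis: PD
print("AUC = %.3f" % roc_auc_score(y_te, probs))
```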


    2.4 Fault Prediction Studies

In this section, we present a brief summary of some of the fault prediction studies which are relevant to our study. In particular, we summarize studies on fault-prediction techniques, some useful review articles, and research papers relevant to the cost effectiveness of fault prediction. The summarized studies are shown in Table 2.3.

These studies show that a lot of research has been done in the field of fault prediction. But more specific studies are required that show the effect of fault prediction on software quality and its economics. In this thesis, we address one of the major and complex problems in software fault prediction studies, namely how to compare the performance of different fault-prediction techniques effectively. As a solution, we propose a cost evaluation framework which compares performance on the basis of the resultant fault removal cost.

Table 2.3: Fault Prediction Studies

1. Victor R. Basili, Lionel C. Briand, and Walcelio L. Melo (1996) [5]
   Techniques: logistic regression (univariate and multivariate regression)
   Evaluation metrics: regression coefficient, p-value
   Datasets: private datasets (8 datasets)
   Conclusions: 1. They found that the C&K metrics were useful for predicting class fault-proneness during the early phases of the development life-cycle. 2. They concluded that, on their dataset, the C&K metrics were better predictors than traditional code metrics.

2. S. S. Gokhale and M. R. Lyu (1997) [17]
   Techniques: regression tree, density modeling techniques
   Evaluation metrics: accuracy, Type I and Type II error
   Datasets: private dataset (medical imaging system)
   Conclusions: 1. They found that the regression tree based technique has higher prediction accuracy than the density technique. 2. It has a lower misclassification rate compared to the density based technique.

3. T. Khoshgoftaar and N. Seliya (2002) [32]
   Techniques: CART-LS, CART-LAD and S-PLUS
   Evaluation metrics: average absolute error (aae) and average relative error (are)
   Datasets: private dataset (from a large telecommunication system)
   Conclusions: 1. They concluded that the performance of CART-LAD was better than the other two techniques. 2. S-PLUS trees had poor predictive accuracy.

4. Lan Guo, Bojan Cukic and Harshinder Singh (2003) [19]
   Techniques: Dempster-Shafer (D-S) belief network, logistic regression and discriminant analysis
   Evaluation metrics: specificity, sensitivity, overall prediction accuracy, probability of false alarm, effort
   Datasets: KC2
   Conclusions: 1. The accuracy of D-S belief networks was found to be higher than that of logistic regression and discriminant analysis.

5. Lan Guo, Yan Ma, Bojan Cukic, and Harshinder Singh (2004) [20]
   Techniques: logistic regression, discriminant analysis, Decision Tree, Rule Set, Boosting, Logistic, Kernel Density, Naive Bayes, J48, IBK, IB1, Voted Perceptron, Hyper Pipes, ROCKY
   Evaluation metrics: accuracy, probability of detection
   Datasets: CM1, JM1, KC1, KC2 and PC1
   Conclusions: 1. Random Forest generally achieves higher overall prediction accuracy and defect detection rate than the others. 2. Compared different machine learning models.

6. T. Menzies, J. DiStefano, A. Orrego, R. Chapman (2004) [42]
   Techniques: Naive Bayes and J48
   Evaluation metrics: accuracy, precision, probability of detection and probability of false alarm
   Datasets: CM1, JM1, KC1, KC2 and PC1
   Conclusions: 1. They concluded that the performance of Naive Bayes is better than the J48 algorithm. 2. They stated that accuracy is not a useful parameter for evaluation. 3. They suggested the use of fault prediction in addition to inspection for a better quality assurance activity.

7. A. Koru and Hongfang Liu (2005) [34]
   Techniques: J48 and KStar
   Evaluation metrics: F-measure, precision and recall
   Datasets: CM1, JM1, KC1, KC2 and PC1
   Conclusions: 1. They suggested it is better to perform defect prediction on data that belong to large modules. 2. They showed that defect prediction performed using class-level metrics gives better performance compared to method-level metrics.

8. Venkata U.B. Challagulla, Farokh B. Bastani, I-Ling Yen (2005) [13]
   Techniques: linear regression, Pace regression, support vector regression, neural network for a continuous goal field, support vector logistic regression, neural network for a discrete goal field, logistic regression, Naive Bayes, instance based learning, J48 tree, and 1-Rule
   Evaluation metrics: mean absolute error
   Datasets: CM1, JM1, KC1 and PC1
   Conclusions: 1. Evaluated the performance of different prediction models. 2. Showed that the combination of 1R and instance-based learning gives better prediction accuracy. 3. Also showed that size and complexity metrics are not sufficient for efficient fault prediction.

9. Tibor Gyimothy, Rudolf Ferenc, and Istvan Siket (2005) [21]
   Techniques: logistic regression (univariate and multivariate regression), decision tree and neural network
   Evaluation metrics: precision, correctness and completeness
   Datasets: Mozilla 1.0 to Mozilla 1.6
   Conclusions: 1. Presented a toolset to calculate the OO metrics from C++ software. 2. Showed how fault-proneness changed over seven versions of Mozilla.

10. U.B. Challagulla, B. Bastani, I. Yen (2006) [12]
    Techniques: Memory Based Reasoning (MBR) technique
    Evaluation metrics: accuracy, probability of detection (PD) and probability of false alarm (PF)
    Datasets: CM1, JM1, KC1 and PC1
    Conclusions: 1. They concluded that if accuracy is the only criterion, then simple MBR with Euclidean distance performs better than the other used techniques. 2. They proposed a framework that can be used to derive the optimal configuration which gives the best performance for a given defect dataset.

11. Yan Ma, Lan Guo and Bojan Cukic (2006) [39]
    Techniques: logistic regression, discriminant analysis, Decision Tree, Rule Set, Boosting, Kernel Density, Naive Bayes, J48, IBK, IB1, Voted Perceptron, VF1, Hyper Pipes, ROCKY, Random Forest, modified Random Forest
    Evaluation metrics: probability of detection, accuracy, precision, G-mean1, G-mean2, F-measure
    Datasets: CM1, JM1, KC1, KC2 and PC1
    Conclusions: 1. Proposed a novel methodology based on variants of the random forest algorithm which is more robust than random forest. 2. Compared different machine learning models.

12. T. Menzies, J. Greenwald and A. Frank (2007) [43]
    Techniques: Naive Bayes, J48 and log filtering techniques
    Evaluation metrics: probability of detection (PD) and probability of false alarm (PF)
    Datasets: CM1, KC3, KC4, MW1, PC1, PC2, PC3 and PC4
    Conclusions: 1. They showed that data mining of static code attributes to learn defect predictors is useful. 2. They concluded that the used predictors were useful for prioritizing a resource-bound exploration of code that has to be inspected.

13. S. Kanmani, Rhymend Uthariaraj, Sankaranarayanan, P. Thambidurai (2007) [27]
    Techniques: Back Propagation Neural Network, Probabilistic Neural Network, discriminant analysis and logistic regression
    Evaluation metrics: Type I, Type II and overall misclassification rate
    Datasets: PC1, PC2, PC3, PC4, PC5 and PC6
    Conclusions: 1. Probabilistic Neural Networks outperform Back Propagation Neural Networks in predicting the fault proneness of object-oriented software.

14. Zhan Li, Marek Reformat (2007) [37]
    Techniques: support vector machine, C4.5, multilayer perceptron and Naive Bayes classifier
    Evaluation metrics: sensitivity, specificity and accuracy
    Datasets: JM1 and KC1
    Conclusions: 1. The performance of the proposed methodology, SimBoost, was found to be better than conventional techniques. 2. The authors proposed fuzzy labels for classification purposes.

15. Naeem Seliya, Taghi M. Khoshgoftaar (2007) [47]
    Techniques: Expectation Maximization, C4.5
    Evaluation metrics: Type I, Type II and overall error rate
    Datasets: KC1, KC2, KC3 and JM1
    Conclusions: 1. EM-based semi-supervised classification improves the performance of software quality models.

16. Yue Jiang, Bojan Cukic and Yan Ma (2008) [25]
    Techniques: Naive Bayes, Logistic, IB1, J48, Bagging
    Evaluation metrics: all available evaluation techniques; in addition, introduced the cost curve
    Datasets: CM1, JM1, KC1, KC2, KC4, MC2, PC1 and PC5
    Conclusions: 1. Selection of the best prediction model cannot be made without considering software cost characteristics.

17. Olivier Vandecruys, David Martens, Bart Baesens, Christophe Mues, Manu De Backer and Raf Haesen (2008) [50]
    Techniques: AntMiner+, C4.5, logistic regression and support vector machine
    Evaluation metrics: accuracy, specificity and sensitivity
    Datasets: KC1, PC1 and PC4
    Conclusions: 1. The authors argued that the intuitiveness and comprehensibility of the AntMiner+ model was superior to the compared models.

18. B. Turhan and A. Bener (2009) [48]
    Techniques: Naive Bayes
    Evaluation metrics: probability of detection (PD) and probability of false alarm (PF)
    Datasets: CM1, KC3, KC4, MW1, PC1, PC2, PC3 and PC4
    Conclusions: 1. They showed that the independence assumption of Naive Bayes was not harmful for defect prediction in datasets with PCA preprocessing. 2. They showed that assigning weights to static code attributes can significantly increase the prediction performance.

19. Huihua Lu, Bojan Cukic, Mark Culp (2011) [38]
    Techniques: Random Forest, FTF
    Evaluation metrics: probability of detection and the Area Under the Receiver Operating Characteristic Curve (AUC)
    Datasets: JM1, KC1, PC1, PC3 and PC4
    Conclusions: 1. The semi-supervised technique outperforms the corresponding supervised technique.

20. P.S. Bishnu and V. Bhattacherjee (2011) [9]
    Techniques: K-Means, Catal et al. two-stage approach (CT), single-stage approach (CS), Naive Bayes and linear discriminant analysis
    Evaluation metrics: false positive rate, false negative rate and error
    Datasets: AR3, AR4, AR5, SYD1 and SYD2
    Conclusions: 1. The overall error rate of the QDK algorithm was found comparable to the other compared techniques.

    2.5 Estimating Cost of Fault Prediction

Software fault prediction attracts significant attention as it can offer guidance to software verification and validation activities. Over the past few years, many organizations have made their datasets, containing software metrics and the respective fault information, publicly available. The availability of these datasets encourages researchers to validate the performance of various machine learning techniques in predicting the fault proneness of software modules. Many research studies have also been performed to evaluate the performance of these fault-prediction techniques. But it seems that they ignored the impact of fault misclassification on the economics of software development. Certifying a considerable number of faulty modules as non-faulty raises serious concerns, as it may increase the development cost owing to the higher cost of removing those faults in later phases. Hence, a more viable evaluation measure would favor techniques which tend to reduce the fault removal cost.

Many studies have used different criteria to evaluate the performance of the various fault-prediction techniques under investigation. Some of the used criteria are accuracy, precision, recall and mean absolute error, but these criteria do not consider the cost parameters of software development. A few studies have since presented cost measures to evaluate the cost effectiveness of fault prediction. In this section, we summarize the studies which measure the cost effectiveness of fault prediction and relate them to our work.

Jiang et al. [25] used various metrics to measure the performance of fault-prediction techniques. They then introduced the cost curve, a measure to estimate the cost effectiveness of a classification technique, to evaluate the performance of a fault-prediction technique. They drew the conclusion that cost characteristics must be considered to select the best prediction technique.

Jiang et al. [24] addressed a more general problem, in which they observed that the cost implications of false positives and false negatives are different. They analyzed the benefits of fault-prediction techniques which incorporate misclassification cost in the development of the prediction model. They performed 11 experiments with different costs for false positives and false negatives on 13 datasets. They concluded that cost-sensitive modeling does not improve the overall performance of fault-prediction techniques. Nevertheless, explicit information about misclassification cost makes it easier for software managers to select the most appropriate technique.

Mende et al. [41] pointed out that traditional prediction techniques typically ignore the effort needed to fix the faults, i.e., they do not distinguish between a predicted fault in a small module and a predicted fault in a large module. They therefore introduced a performance measure (popt) that takes the size of the modules into account when measuring the performance of a fault-prediction technique. They performed their study on thirteen NASA datasets. They concluded that their results indicate the need for further research to improve existing prediction models, not only with more sophisticated classification algorithms, but also by searching for better performance measures.

Mende et al. [40] proposed two strategies, namely AD (effort-aware binary prediction) and DD (effort-aware prediction based on defect density), to include the notion of effort awareness in fault-prediction techniques. The first strategy, AD, is applicable to any probabilistic classifier, while DD is applicable only to regression algorithms. They evaluated these strategies on fifteen publicly available datasets. They concluded that both strategies improve the cost effectiveness of fault-prediction techniques significantly, in the statistical and practical sense.

Arisholm et al. [3] presented a study performed in an industrial setting where they tried to build fault prediction models to efficiently predict faults in a Java system having multiple versions. They also proposed a cost performance measure (CE), a variation of lift charts where the x-axis contains the ratio of lines of code instead of modules. They concluded that the popular confusion matrix criterion is not clearly related to cost-effectiveness.

Catal et al. [11] presented a literature review of fault-prediction studies from 1990 to 2009. They reviewed the results of previous studies as well as discussed current trends. Bell et al. [6] presented a challenge paper and discussed some important issues regarding the impact of fault-prediction studies on testing and other efforts. They concluded that no study had yet investigated the impact of fault prediction on the software development process. They also highlighted that coming up with a method to assess the effectiveness of fault-prediction studies, if adopted in a software project, would be helpful for the software community.

Jiang et al. [25] used the cost curve to show the cost effectiveness of fault-prediction studies, but they assumed the same misclassification costs for each module, which might be unreasonable in practice. Mende et al. [41] introduced a new performance measure, popt, that accounts for module size when evaluating the performance of a fault-prediction technique, whereas in our framework the fault removal cost of a particular phase is the same for all modules. Jiang et al. [24] examined the cost impact of fault misclassifications over eleven different (arbitrarily chosen) values for the cost of false positives and false negatives. These values were taken as the same for all phases of software development, which is not a practical assumption. In this thesis, we propose a new cost evaluation framework which overcomes this limitation by using organization-wide cost information and computing the estimated fault removal costs based on the phase in which faults are identified. Wagner et al. [52] summarized the fault removal cost for different testing stages. Jones et al. [26] summarized the fault identification efficiency of different testing phases. We have used these parameters to compute the estimated fault removal cost for a specific fault-prediction technique, which eventually helped us to decide its applicability in a more precise way.

    2.6 Summary

In this chapter, we presented a brief summary of the concepts related to our study. In particular, we have shown the conventional way of performing fault prediction, the measures used to evaluate the performance of fault-prediction


techniques, and a brief summary of the publicly available dataset repositories. We also summarized the studies related to this thesis work, framing a background for it.


Chapter 3

Fault Prediction Results: How Useful Are They?

In this chapter, we give an insight into the cost economics of fault prediction. In particular, we revisit the results of some earlier fault prediction studies to account for fault misclassification. We first investigate how different authors measured the performance of their presented fault-prediction techniques. Then, we refine the performance evaluation criteria based on fault misclassification and revisit the outcomes of these fault-prediction techniques.

In our study, we used fifteen research papers based on public datasets, along with their outcomes and measurement criteria (see Table 2.3). The remainder of this chapter is organized as follows. Section 3.1 discusses the issues in fault prediction. Section 3.2 presents a new model for evaluating the fault prediction performance of a technique based on cost economics. Section 3.3 presents a revision of fault prediction results based on the presented evaluation model, and Section 3.4 summarizes our findings.

    3.1 Issues in Fault Prediction

An economical software development process requires the identification and removal of faults in the early stages of development. Fault-prediction techniques are used to predict fault-prone modules in the software. Predicting faults correctly may help in reducing the effort applied in the later stages of


    testing.

    But building an accurate prediction model is a challenging task because the

    dataset being used may have noisy content and may contain outliers [7]. It

is hard to find a suitable measure that can provide a reliable estimation of the

    various characteristics of the software system [6]. This makes the study of

    fault prediction much more involved, as we are dealing with many alternative

    and imprecise measures to compute the same software characteristic.

    It has been found that the number of faulty modules represents only a small

fraction of the total number of modules in the software. This observation, in particular, is critically important for putting the results obtained by a fault-prediction technique in the correct perspective. With few faulty modules in the dataset, a high prediction accuracy may result simply from classifying the majority of non-faulty modules as non-faulty. However, our main concern is the identification of faulty modules rather than non-faulty ones. Simply considering accuracy can therefore sometimes be misleading.

Many efforts have been made to evaluate the performance of fault-prediction techniques. However, they tend to ignore the impact of fault misclassification on the economics of software development. For instance, a high number of false positives requires unnecessary extra effort to scan modules which are non-faulty. On the other hand, a high number of false negatives leaves too many faulty modules out of the scanner, so the technique does not seem to help either. This calls for choosing a technique that predicts fewer false negatives, even if it tends to be less accurate and/or yields a higher number of false positives. Therefore, we revisited the results of previous fault prediction studies on the basis of fault misclassification.

3.2 A Proposed Model for Evaluating Fault Prediction Efficiency

    Here, we present a performance evaluation model, which evaluates the perfor-

    mance of fault-prediction techniques in the context of economics.


Figure 3.1 and Figure 3.2 show the cost statistics for faulty and non-faulty modules, respectively. If a faulty module is predicted as faulty, it requires unit-level testing effort; but if it is predicted as non-faulty, it requires extra effort in later development stages to remove the same fault (see Figure 3.1). However, if a non-faulty module is incorrectly predicted as faulty, it requires extra effort at the time of unit testing (see Figure 3.2). We used both of the above observations to compare the performance of fault-prediction techniques in our presented evaluation model.

Figure 3.1: Cost statistics for faulty modules

Figure 3.2: Cost statistics for non-faulty modules


    3.2.1 General arguments

Based on the above investigations and observations, we found a need to use prediction techniques that try to minimize false negatives, even at the cost of increasing false positives and compromising some accuracy. Accordingly, we

    present a model to evaluate the performance of fault-prediction techniques.

    The presented model tends to prioritize the performance of a fault-prediction

    technique based on three criteria, namely, false negative rate, false positive

    rate and prediction accuracy.

    The general arguments to measure the performance of a fault-prediction tech-

    nique are

    1. False negatives are critically important for the overall reduction in the

testing and maintenance cost of the system, and hence are to be minimized.

    2. False positives are to be reduced but can be compromised if they help to

    reduce false negatives.

    3. Similarly, prediction accuracy can also be compromised if it helps to reduce

    false negatives.

    3.2.2 Evaluation model

We now quantify our arguments towards finding the best technique. Here we discuss how we select a technique as the best one from the perspective of economic software development. The defined model is given below:

1. Choose as the best technique the one having the least FNR value, provided the difference in FPR remains within a threshold.

2. If two or more techniques have nearly the same FNR value, then choose as the best technique the one having the least FPR value.

3. If two or more techniques have nearly equal FNR and FPR values, then choose as the best technique the one having the maximum accuracy.

We define the above three-step evaluation model to compare the performance of fault-prediction techniques so that the selected technique requires minimum effort for fault removal.
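The selection rule can be expressed directly in code. The following is a minimal sketch in Python; the tolerance values used to decide when two FNR or FPR values are "nearly the same" are illustrative assumptions, not values prescribed by the model, and the example techniques are hypothetical.

```python
# Illustrative sketch of the three-step selection model.
# The tolerance thresholds below are assumptions for demonstration only.

FNR_TOL = 0.05   # FNR values within this range are treated as "nearly the same"
FPR_TOL = 0.05   # likewise for FPR

def select_best(results):
    """results: list of dicts with keys 'name', 'fnr', 'fpr', 'acc'."""
    # Step 1: keep techniques whose FNR is within FNR_TOL of the minimum FNR.
    min_fnr = min(r['fnr'] for r in results)
    candidates = [r for r in results if r['fnr'] - min_fnr <= FNR_TOL]

    # Step 2: among these, keep techniques whose FPR is within FPR_TOL of the minimum FPR.
    min_fpr = min(r['fpr'] for r in candidates)
    candidates = [r for r in candidates if r['fpr'] - min_fpr <= FPR_TOL]

    # Step 3: break remaining ties by the highest accuracy.
    return max(candidates, key=lambda r: r['acc'])

# Example with three hypothetical techniques:
techniques = [
    {'name': 'A', 'fnr': 0.69, 'fpr': 0.06, 'acc': 87.95},
    {'name': 'B', 'fnr': 0.69, 'fpr': 0.11, 'acc': 83.53},
    {'name': 'C', 'fnr': 0.94, 'fpr': 0.04, 'acc': 87.55},
]
print(select_best(techniques)['name'])  # -> 'A'
```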


Wagner et al. [52] presented the quality economics of defect-detection techniques and the impact of uncovered faults on software cost as well as quality. This study supports our presented model. Use of this evaluation model helps to determine the impact of fault prediction on the software cost due to undetected faults.

    3.3 Revisiting Fault Prediction Results

There have been various studies in the field of software fault prediction. In our analysis, we used the studies performed on public datasets. Table 2.3 summarizes the detailed studies of different authors, together with the evaluation measures they used and the conclusions they drew. We observed that authors used various evaluation measures to compare the performance of different fault-prediction techniques, which made the comparison even more complicated. Moreover, the performance of a technique varies with the dataset used. Therefore, we revisited the results of the earlier fault-prediction studies (Table 2.3) over four NASA MDP [1] datasets (Table 3.1), incorporating the above-mentioned performance measures, i.e., false negatives and false positives. All reported experiments utilized technique implementations from the WEKA data-mining tool [53]. All performance measurements were generated by threefold cross-validation.

    Table 3.1: NASA datasets

    Project # modules % with defects Language

    CM1 496 9.80% C

    KC1 2,109 15.50% C++

    KC2 520 20.40% C++

    PC1 1,109 6.90% C

A high FNR shows that many faults remain undetected under the scanner of the fault-prediction technique, and hence has a high impact on software quality as well as on the testing and maintenance cost. At the same time, a high FPR requires more effort for unit testing.
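For reference, the FNR, FPR and accuracy values reported in Tables 3.2 to 3.5 follow directly from the confusion-matrix counts. A minimal sketch (the IBK row for CM1 in Table 3.2 is used as the example input):

```python
def rates(tp, tn, fp, fn):
    """Return (accuracy %, FNR, FPR) from confusion-matrix counts."""
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    fnr = fn / (fn + tp)   # faulty modules predicted as non-faulty
    fpr = fp / (fp + tn)   # non-faulty modules predicted as faulty
    return acc, fnr, fpr

# IBK on CM1 (Table 3.2): TP=15, TN=423, FP=26, FN=34
acc, fnr, fpr = rates(tp=15, tn=423, fp=26, fn=34)
print(round(acc, 2), round(fnr, 2), round(fpr, 2))  # 87.95 0.69 0.06
```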

    Overall, this suggests that for the development of economic and high quality


software, we should choose a technique that predicts fewer false negatives, even if it tends to be less accurate and/or predicts a higher number of false positives. For our analysis, we combined the results of various authors (mentioned in Table 2.3) with the results of our presented model (FNR and FPR) in Tables 3.2 to 3.5. Then, we interpreted the performance of these techniques in accordance with our model.

We have evaluated the performance of these techniques over four NASA datasets, viz. CM1, KC1, KC2 and PC1. We used the WEKA [53] data mining tool to run all the experiments. The interpretation is as follows:

For dataset CM1 (Table 3.2), the techniques IBK, IB1 and Naïve Bayes have similar false negative rate (FNR) values, but Naïve Bayes has a higher false positive rate (FPR) than the other two. Since Step 2 of our model compares the FPR values, and IBK and IB1 have similar FPR values, both are equally good when compared to the other techniques.

For dataset KC1 (Table 3.3), the techniques IBK, IB1 and Classification via Clustering have similar FNR values, but IBK has the least FPR value; hence it outperforms all other techniques and can be considered the best for this dataset.

For dataset KC2 (Table 3.4), the techniques Bayesian Logistic Regression and Voted Perceptron have the least FNR values, but their FPR values are very high, so they are not effective because almost all modules are predicted as faulty. Hence, we consider the Decision Stump technique as the best technique.

    For dataset PC1 (Table 3.5), techniques IBK and IB1 have similar FNR values

    but IB1 has slightly more false positives, so IBK is considered to be the best

    one for the dataset PC1.

Generalizing over these four datasets, our results show that the IBK technique is the best among all the techniques considered.


    Table 3.2: Experiment results for dataset CM1

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 87.55 3 433 16 46 0.94 0.04 0.16 0.06 0.09

    Simple Logistic 89.76 1 446 3 48 0.98 0.01 0.25 0.02 0.04

    SMO 89.76 0 447 2 49 1 0 0 0 0

    Voted Perceptron 89.96 0 448 1 49 1 0 0 0 0

    IBK 87.95 15 423 26 34 0.69 0.06 0.37 0.31 0.33

    IB1 87.95 15 423 26 34 0.69 0.06 0.37 0.31 0.33

    Bagging 89.96 0 448 1 49 1 0 0 0 0

Classification via Regression 89.56 3 443 6 46 0.94 0.01 0.33 0.06 0.1

    Dagging 89.96 0 448 1 49 1 0 0 0 0

    Stacking 90.16 0 449 0 49 1 0 0 0 0

    Hyper pipes 89.56 0 446 3 49 1 0.01 0 0 0

    Decision Table 90.16 0 449 0 49 1 0 0 0 0

    PART 89.96 1 447 2 48 0.98 0 0.33 0.02 0.04

    Jrip (RIPPER) 89.56 1 445 4 48 0.98 0.01 0.2 0.02 0.04

    J 48 89.96 4 439 10 45 0.92 0.02 0.29 0.08 0.13

    Random Forest 89.76 6 441 8 43 0.88 0.02 0.43 0.12 0.19

    Decision Stump 90.16 0 449 0 49 1 0 0 0 0

    BF tree 89.96 1 447 2 48 0.98 0 0.33 0.02 0.04

Naïve Bayes 83.53 15 401 48 34 0.69 0.11 0.24 0.31 0.27

Bayesian Logistic Regression 90.16 0 449 0 49 1 0 0 0 0

    Logistic 88.15 8 431 18 41 0.84 0.04 0.31 0.16 0.21

Classification via Clustering 84.14 13 406 43 36 0.73 0.1 0.23 0.27 0.25

    Grading 90.16 0 449 0 49 1 0 0 0 0

    Zero r 90.16 0 449 0 49 1 0 0 0 0

Table 3.3: Experiment results for dataset KC1

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 85.78 69 1740 43 257 0.79 0.02 0.62 0.21 0.32

    Simple Logistic 85.63 66 1740 43 260 0.8 0.02 0.61 0.2 0.3

    SMO 84.64 9 1776 7 317 0.97 0 0.56 0.03 0.05

    Voted Perceptron 81.79 117 1608 175 209 0.64 0.1 0.4 0.36 0.38

    IBK 84.45 134 1647 136 192 0.59 0.08 0.5 0.41 0.45

    IB1 83.36 134 1624 159 192 0.59 0.09 0.46 0.41 0.43

    Bagging 85.92 78 1734 49 248 0.76 0.03 0.61 0.24 0.34

Classification via Regression 85.4 63 1738 45 263 0.81 0.03 0.58 0.19 0.29

    Dagging 84.83 12 1777 6 314 0.96 0 0.67 0.04 0.07

    Stacking 84.54 0 1783 0 326 1 0 0 0 0

    Hyper pipes 85.07 13 1781 2 313 0.96 0 0.87 0.04 0.08

    Decision Table 84.73 43 1744 39 283 0.87 0.02 0.52 0.13 0.21

    PART 85.02 50 1743 40 276 0.85 0.02 0.56 0.15 0.24

    Jrip (RIPPER) 84.68 84 1702 81 242 0.74 0.05 0.51 0.26 0.34

    J 48 85.21 96 1701 82 230 0.71 0.05 0.54 0.29 0.38

    Random Forest 85.25 92 1706 77 234 0.72 0.04 0.54 0.28 0.37

    Decision Stump 84.54 0 1783 0 326 1 0 0 0 0

    BF tree 85.25 40 1758 25 286 0.88 0.01 0.62 0.12 0.2

Naïve Bayes 82.46 120 1619 164 206 0.63 0.09 0.42 0.37 0.39

Bayesian Logistic Regression 84.73 13 1774 9 313 0.96 0.01 0.59 0.04 0.07

    Logistic 85.3 70 1729 54 256 0.79 0.03 0.56 0.21 0.31

Classification via Clustering 81.79 129 1596 187 197 0.6 0.1 0.41 0.4 0.4

    Grading 84.54 0 1783 0 326 1 0 0 0 0

    Zero r 84.54 0 1783 0 326 1 0 0 0 0


Table 3.4: Experiment results for dataset KC2

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 83.14 39 395 20 68 0.64 0.05 0.66 0.36 0.47

    Simple Logistic 82.95 40 393 22 67 0.63 0.05 0.65 0.37 0.47

    SMO 83.52 26 410 5 81 0.76 0.01 0.84 0.24 0.38

    Voted Perceptron 24.52 106 22 393 1 0.01 0.95 0.21 0.99 0.35

    IBK 79.12 50 363 52 57 0.53 0.13 0.49 0.47 0.48

    IB1 76.25 51 347 68 56 0.52 0.16 0.43 0.48 0.45

    Bagging 83.72 50 387 28 57 0.53 0.07 0.64 0.47 0.54

Classification via Regression 82.57 45 386 29 62 0.58 0.07 0.61 0.42 0.5

    Dagging 81.8 17 410 5 90 0.84 0.01 0.77 0.16 0.26

    Stacking 79.5 0 415 0 107 1 0 0 0 0

    Hyper pipes 81.99 19 409 6 88 0.82 0.01 0.76 0.18 0.29

    Decision Table 82.57 45 386 29 62 0.58 0.07 0.61 0.42 0.5

    PART 80.84 32 390 25 75 0.7 0.06 0.56 0.3 0.39

    Jrip (RIPPER) 83.52 58 378 37 49 0.46 0.09 0.61 0.54 0.57

    J 48 81.42 46 379 36 61 0.57 0.09 0.56 0.43 0.49

    Random Forest 81.8 48 379 36 59 0.55 0.09 0.57 0.45 0.5

    Decision Stump 78.93 80 332 83 27 0.25 0.2 0.49 0.75 0.59

    BF tree 82.57 50 381 34 57 0.53 0.08 0.6 0.47 0.52

Naïve Bayes 83.52 45 391 24 62 0.58 0.06 0.65 0.42 0.51

Bayesian Logistic Regression 20.88 107 2 413 0 0 1 0.21 1 0.34

    Logistic 82.38 47 383 32 60 0.56 0.08 0.59 0.44 0.51

Classification via Clustering 81.03 70 353 62 37 0.35 0.15 0.53 0.65 0.59

    Grading 79.5 0 415 0 107 1 0 0 0 0

    Zero r 79.5 0 415 0 107 1 0 0 0 0

Table 3.5: Experiment results for dataset PC1

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 93.6 18 1020 12 59 0.77 0.01 0.6 0.23 0.34

    Simple Logistic 92.79 5 1024 8 72 0.94 0.01 0.38 0.06 0.11

    SMO 93.15 1 1032 0 76 0.99 0 1 0.01 0.03

    Voted Perceptron 91.61 0 1016 16 77 1 0.02 0 0 0

    IBK 92.43 34 991 41 43 0.56 0.04 0.45 0.44 0.45

    IB1 92.25 34 989 43 43 0.56 0.04 0.44 0.44 0.44

    Bagging 92.88 6 1024 8 71 0.92 0.01 0.43 0.08 0.13

Classification via Regression 92.79 3 1026 6 74 0.96 0.01 0.33 0.04 0.07

    Dagging 93.06 1 1031 1 76 0.99 0 0.5 0.01 0.03

    Stacking 93.06 0 1032 0 77 1 0 0 0 0

    Hyper pipes 92.52 2 1024 8 75 0.97 0.01 0.2 0.03 0.05

    Decision Table 92.7 5 1023 9 72 0.94 0.01 0.36 0.06 0.11

    PART 92.43 1 1024 8 76 0.99 0.01 0.11 0.01 0.02

    Jrip (RIPPER) 92.88 7 1023 9 70 0.91 0.01 0.44 0.09 0.15

    J 48 92.7 11 1017 15 66 0.86 0.01 0.42 0.14 0.21

    Random Forest 92.9666 20 1011 21 57 0.74 0.02 0.49 0.26 0.34

    Decision Stump 92.88 2 1028 4 75 0.97 0 0.33 0.03 0.05

    BF tree 92.7 4 1024 8 73 0.95 0.01 0.33 0.05 0.09

Naïve Bayes 89.36 24 967 65 53 0.69 0.06 0.27 0.31 0.29

Bayesian Logistic Regression 93.06 0 1032 0 77 1 0 0 0 0

    Logistic 92.06 8 1013 19 69 0.9 0.02 0.3 0.1 0.15

Classification via Clustering 89.81 19 977 55 57 0.75 0.05 0.26 0.25 0.25

    Grading 93.06 0 1032 0 77 1 0 0 0 0

    Zero r 93.06 0 1032 0 77 1 0 0 0 0


    3.4 Summary

Software fault prediction attracts significant attention as it can offer guidance to software verification and validation activities. Over the past few years, many organizations have publicly provided datasets describing module metrics and their fault content. The availability of these datasets encourages researchers to perform fault prediction studies using several machine learning techniques. In this chapter, we studied the outcomes of some of the

    earlier studies undertaken in this area. We found that they have used various

    criteria to evaluate the performance of a given technique. In most of the cases,

    these studies have used prediction accuracy to show how good a technique is.

However, they seem to ignore the impact of the fault misclassification rate in judging the overall performance of the various fault-prediction techniques. Certifying a considerable number of faulty modules as non-faulty raises serious concerns, given that faulty modules are themselves small in number compared to non-faulty modules. A more viable evaluation criterion is to favor techniques which tend to reduce false negatives even if they compromise on false positives and/or prediction accuracy.

We have re-analyzed the results of earlier studies and refined their outcomes based on our presented model. Our contribution in this chapter is to refine the way the best technique is selected. We also identify the need for an evaluation measure that provides specific information about how cost-economic fault-prediction techniques are and what their fundamental limitations are.


Chapter 4

    A Cost Evaluation Framework

In the previous chapter, we investigated the impact of fault misclassification on software economics and quality. In this chapter, we quantify the fault removal cost in different stages of software development when fault prediction is used, and answer both research questions.

Specifically, we propose a cost evaluation framework that can help to put the results of fault prediction in a proper usability context. Essentially, the framework provides an estimate of the savings in effort obtained by using the results of fault prediction in subsequent phases of software development. To construct the framework, we accounted for realistic fault removal costs of different testing phases [52], along with their fault identification efficiencies [26]. We have used this framework to investigate two important and related research questions: for a given project dataset, would fault prediction help? And if yes, how should a fault-prediction technique be chosen to yield optimal results? The first question can be answered by comparing the fault removal cost in both cases, i.e., with and without the use of fault prediction.

    The remainder of this chapter is organized as follows. In Section 4.1, we present

    our proposed cost evaluation framework. Section 4.2 presents an experimental

    study to investigate the usefulness of fault-prediction techniques using our

    proposed framework. We discuss the implications of using our framework in

Section 4.3, and a summary is given in Section 4.4.


    4.1 The Evaluation Framework

In the previous chapter, we highlighted the need for a cost evaluation measure that compares the performance of fault-prediction techniques on the basis of their economics. Jones [30] states that 30-40 percent of the development cost is spent on quality assurance and fault removal. Since fault-prediction techniques are used to predict fault-prone modules early in the development life cycle, they can help in reducing the cost incurred on testing and maintenance.

Here, we construct a cost evaluation framework which accounts for the realistic cost required to remove a fault and computes the estimated fault removal cost for a specific fault-prediction technique. The constraints we accounted for in our framework include:

    (1) Fault removal cost varies with testing phases.

(2) It is not possible to identify 100% of faults in a specific testing phase.

(3) It is practically not feasible to perform unit testing on all modules.

We have used the normalized fault removal costs suggested by Wagner et al. [52] to formulate our cost evaluation framework, but these costs may vary from one organization to another and also depend on various characteristics of the project. The normalized costs are summarized in Table 4.1. The fault identification efficiencies for the different testing phases are taken from the study of Jones [26] and are summarized in Table 4.2. Wilde et al. [45] stated that more than fifty percent of modules are very small in size, hence unit testing these modules is unfruitful. We have included this value (0.5) as the threshold for unit testing in our framework.

Table 4.1: Removal costs of test techniques (in staff-hours per defect) [52]

    Type Lowest Mean Median Highest

    Unit 1.5 3.46 2.5 6

    System 2.82 8.37 6.2 20

    Field 3.9 27.24 27 66.6


Table 4.2: Fault identification efficiencies of different test phases [26]

    Type Lowest Median Highest

    Unit 0.1 0.25 0.5

    System 0.25 0.5 0.65

Figure 3.1 and Figure 3.2 show the cost statistics for faulty and non-faulty modules, respectively. Software modules which are predicted as faulty (true positives and false positives) by the fault-prediction technique require verification and testing cost at the module level, i.e., a cost equal to the unit testing cost (Cu in our study). Since 100% identification of faults in a specific testing phase is not possible, some of the correctly predicted faulty modules (true positives) remain undetected in unit testing. Faulty modules which are predicted as non-faulty (false negatives), together with the correctly predicted faulty modules which remain undetected in unit testing, are detected in later stages and require a fault removal cost equal to that of system testing or field testing (Cs and Cf, respectively, in our case). The testing techniques used in our framework, along with their respective fault removal costs and efficiencies, can vary from organization to organization. Equation 4.1 shows the proposed cost evaluation framework to estimate the overall fault removal cost. Equation 4.2 shows the minimum fault removal cost without the use of fault prediction. The normalized fault removal cost and its interpretation are shown in Equation 4.3.

Ecost = Ci + Cu · (FP + TP) + s · Cs · (FN + (1 − u) · TP) + (1 − s) · Cf · (FN + (1 − u) · TP)    (4.1)

Tcost = Mp · Cu · TM + s · Cs · (1 − u) · FM + (1 − s) · Cf · (1 − u) · FM    (4.2)

NEcost = Ecost / Tcost;   if NEcost < 1, fault prediction is useful;   if NEcost ≥ 1, use unit testing    (4.3)

    Where, Ecost - Estimated fault removal cost of the software when we use fault

    prediction.

    The Economics of Fault Prediction

  • 4.1 The Evaluation Framework 32

Tcost - Estimated fault removal cost of the software without the use of fault

    prediction.

    NEcost- Normalized Estimated fault removal cost of the software when we use

    fault prediction.

Ci - Initial setup cost of the fault-prediction technique used.

    Cu - Normalized fault removal cost in unit testing.

    Cs - Normalized fault removal cost in system testing.

Cf - Normalized fault removal cost in field testing.

    Mp - Percentage of modules unit tested.

    FP - Number of false positives.

    FN - Number of false negatives.

    TP - Number of true positives.

    TM - Total modules.

    FM - Total number of faulty modules.

u - Fault identification efficiency of unit testing.

s - Fault identification efficiency of system testing.
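Equations 4.1 to 4.3 can be transcribed directly into code. The following is a minimal Python sketch; the variable names mirror the definitions above, and the function merely evaluates the formulas for a given confusion matrix and cost/efficiency setting.

```python
def estimated_costs(tp, fp, fn, tm, fm,
                    c_i, c_u, c_s, c_f, m_p, u, s):
    """Evaluate Equations 4.1-4.3.

    tp, fp, fn    : confusion-matrix counts for the prediction technique
    tm, fm        : total modules and total faulty modules
    c_i           : initial setup cost of the prediction technique
    c_u, c_s, c_f : fault removal costs in unit, system and field testing
    m_p           : fraction of modules unit tested without prediction
    u, s          : fault identification efficiencies of unit and system testing
    """
    escaped = fn + (1 - u) * tp          # faults that escape unit testing
    e_cost = (c_i + c_u * (fp + tp)
              + s * c_s * escaped
              + (1 - s) * c_f * escaped)                      # Eq. 4.1

    t_cost = (m_p * c_u * tm
              + s * c_s * (1 - u) * fm
              + (1 - s) * c_f * (1 - u) * fm)                 # Eq. 4.2

    ne_cost = e_cost / t_cost                                 # Eq. 4.3
    return e_cost, t_cost, ne_cost
```

By Equation 4.3, an NEcost below 1 indicates that using the fault-prediction technique is expected to be cheaper than unit testing the usual fraction of modules, while a value of 1 or more suggests unit testing should be preferred.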

Our cost evaluation framework considers a more practical scenario, in which the undetected faults are traced through the later testing phases and the corresponding fault removal cost is evaluated based on organization-specific statistics. This makes the proposed framework a more viable performance measure than the other measures.

In our experiment, we used the values of Cu, Cs and Cf summarized in Table 4.1. u and s denote the fault identification efficiencies of unit testing and system testing, respectively; their values are taken from the survey report "Software Quality in 2010" by Capers Jones [26]. Mp denotes the fraction of modules unit tested; its value is taken from the study of Wilde [45]. We have generalized the framework so that it can be applied to any organization or software with its specific values of Cu, Cs, Cf, Mp, u and s. Our aim is to provide a benchmark to approximate the overall fault removal cost. It is clear from our framework that if a technique has a high number of false negatives and/or false positives, it results in a higher fault removal cost. When this approximated cost exceeds the cost of unit testing all the modules, we suggest testing all the modules at the unit level instead of using fault prediction (Equation 4.3).
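To make the decision rule concrete, the short usage example below (an illustration only, assuming Ci = 0 and the median values from Tables 4.1 and 4.2) applies the sketch from this section to the IBK result for CM1 reported in Table 3.2.

```python
# Illustrative use of the estimated_costs sketch with the median values from
# Tables 4.1 and 4.2, Mp = 0.5, and an assumed setup cost Ci = 0.
# Confusion-matrix counts are the IBK row for CM1 in Table 3.2;
# TM is taken as TP + TN + FP + FN = 498 and FM as TP + FN = 49.
e, t, ne = estimated_costs(tp=15, fp=26, fn=34, tm=498, fm=49,
                           c_i=0, c_u=2.5, c_s=6.2, c_f=27.0,
                           m_p=0.5, u=0.25, s=0.5)
print(round(ne, 2))  # ~0.69, i.e. below 1, so fault prediction pays off here
```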


    4.2 Experimental Study

In this section, we present an experimental study to investigate the usefulness of fault-prediction techniques using our cost evaluation framework. In our study, we applied five popular fault-prediction techniques [19][20][27][25][22] to 19 projects from the NASA MDP [1] and PROMISE [2] repositories. As these nineteen projects cover a significant range of the percentage of faulty modules (varying from 7 to 49 percent), they are sufficient for our investigation. We used the WEKA machine learning tool to perform all listed experiments.

    4.2.1 Experimental setup

We have used the NASA MDP [1] and PROMISE [2] datasets listed in Table 4.3 to evaluate the impact of a fault-prediction technique on the fault removal cost using our proposed framework (Ecost). The metrics in these datasets describe projects which vary in size as well as in complexity. The number of software metrics in these datasets varies from eight to forty. We further classify these datasets on the basis of the percentage of faulty modules present, as shown in Table 4.4.

To illustrate the effectiveness of our framework, we have used five well-known

    fault-prediction techniques. Our goal is to demonstrate the cost evaluation

    framework and suggest when to use fault prediction, rather than identifying

    the "best" fault-prediction technique. For this reason, the choice of fault-

    prediction technique is orthogonal with respect to the intended contribution.

    The fault-prediction techniques which we selected for our study are Random

    Forest, J48 (C4.5 decision tree), Neural Network, K-means Clustering and IBK

    (K-nearest neighbours). These algorithms represent a broad range of machine

learning techniques. All reported experiments utilized technique implementations from the well-known software package WEKA [53]. All performance measurements were generated by threefold cross-validation.
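The experiments in this thesis were run in WEKA; as a rough, library-agnostic illustration of the same workflow, the sketch below uses scikit-learn (an assumption, not the tool used in this work) to obtain the pooled confusion matrix of a classifier under threefold cross-validation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def pooled_confusion(X, y, clf, n_splits=3, seed=0):
    """Pool TP/TN/FP/FN counts over a stratified k-fold cross-validation."""
    tn = fp = fn = tp = 0
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        a, b, c, d = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
        tn, fp, fn, tp = tn + a, fp + b, fn + c, tp + d
    return tp, tn, fp, fn

# X: array of module metrics, y: 1 for faulty and 0 for non-faulty modules,
# loaded from a dataset such as CM1. For example:
# tp, tn, fp, fn = pooled_confusion(X, y, RandomForestClassifier(random_state=0))
```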


Table 4.3: Projects used from the NASA [1] and PROMISE [2] data repositories

    Project Faulty (%) Number of Modules

    pc1 6.94 1109

    ar1 7.44 121

    nw1 7.69 403

    kc3 9.34 458

    cm1 9.84 498

    pc3 10.24 1563

    Arc 11.54 234

    pc4 12.21 1458

    kc1 15.46 2109

    ar4 18.69 107

    jm1 19.35 10885

    kc2 20.5 522

    camel1.6 21.91 858

    ant1.6 26.21 351

    ant1.7 27.79 493

    mc2 32.3 161

    jedit 3.2 33.09 272

    lucene2.0 46.67 195

    jedit 4.0 m 48.9 274

    Table 4.4: Categorization of projects based on the fraction of faulty modules

    Category Faults (%