
  • The Economics of Fault Prediction

Submitted in partial fulfillment of the requirements for the degree of

    Master of Technology

    by

    Deepak Banthia

    (1010102)

    under the guidance of

    Dr. Atul Gupta

    Computer Science & Engineering

    INDIAN INSTITUTE OF INFORMATION TECHNOLOGY,

    DESIGN AND MANUFACTURING JABALPUR, INDIA

    2012

  • Approval Sheet

    This thesis entitled The Economics of Fault Prediction submitted by

Deepak Banthia (1010102) is approved for partial fulfillment of the requirements for the degree of Master of Technology in Computer Science and

    Engineering.

    Examining Committee

    ................................................

    ................................................

    ................................................

    Guide

    ................................................

    ................................................

    ................................................

    Chairman

    ................................................

    Date .......................... ................................................

    Place ......................... ................................................

  • Certificate

    This is to certify that the work contained in the thesis entitled, The Economics

    of Fault Prediction, submitted by Deepak Banthia (Roll No. 1010102) in

partial fulfillment of the requirements for the degree of Master of Technology in

    Computer Science and Engineering, has been carried out under my supervision and

    that this work has not been submitted elsewhere.

    (Atul Gupta) ............ , 2012

    Associate Professor

    Computer Science & Engineering Discipline

    Indian Institute of Information Technology, Design and Manufacturing Jabalpur

    Jabalpur, India.

  • Acknowledgments

This thesis would not have been possible without the sincere help and contributions of several people. I would like to use this opportunity to express my sincere gratitude to them.

Firstly, I would like to thank God, with whose blessings I could turn my idea into reality. I express my deep sense of gratitude towards my mentor and thesis supervisor Dr. Atul Gupta for his valuable guidance, moral support and

    constant encouragement throughout the thesis. His approach towards software

    engineering will always be a valuable learning experience for me. No words can

    express my feelings towards him for taking such a keen interest in my academics

and personal welfare. His dedication, professionalism and hard work have been

    and shall be a source of inspiration throughout my life.

The contributions of a mother to the success of her child can be neither measured nor directly repaid. To such a mother, who is but a manifestation of the divine virtues of the Earth, this report is one petite offering. Thank you, parents, for all the liberty, prosperity, confidence and discipline showered on me. This thesis would not have been completed without the motivation and blessings of my parents. My fiancée (Nisha) brought a light inside me and always filled me with enthusiasm and vigour to do my jobs with complete effort and dedication. Thanks to her for accompanying me all the way and for her unflinching help and support in all my endeavours. I would like to thank my uncles Mr. Hem Kumar Banthia and Mr. Khagendra Kumar Banthia for their encouragement throughout my studies. Along with them, I also received energy and motivation from my sisters for my career. I would also like to give my sincere thanks to Mr. Amaltas Khan, Mr. Arpit Gupta, Mr. Ravindra Singh, Mr. Santosh Singh Rathore and Mr. Saurabh Tiwari for their support and being there always, no matter what.

I thank the CSE fraternity at IIITDM Jabalpur and extend my special thanks to my batch mates.

    Jabalpur Deepak Banthia

    ..........., 2012


  • Abstract

Fault-prediction techniques aim to predict fault-prone software modules in order to streamline the efforts to be applied in the later phases of software development. Normally, the effectiveness of a fault-prediction technique is demonstrated by training it on a part of some known fault data and measuring its performance against the other part of the fault data. There have been many efforts comparing the performance of various fault-prediction techniques on different project datasets. However, invariably most of these studies have also recorded high misclassification rates (normally 15 to 35%), besides not-so-high accuracy figures (normally 70 to 85%). This raises serious concerns about the viability of these techniques. In this thesis, we first present a brief summary of the results of some of the earlier studies undertaken in fault prediction and argue about their usefulness. As a follow-up, we then investigate two important and related research questions regarding the viability of fault prediction. First, for a given project, are the fault prediction results useful? In case of an affirmative answer, we then look at how to choose a fault-prediction technique for an overall improved performance in terms of cost-effectiveness. Here, we propose an adaptive cost evaluation framework that incorporates cost drivers for various fault removal phases and performs a cost-benefit analysis for the misclassification of faults. We then used this framework to investigate the usefulness of various fault prediction techniques in two different settings. The first part of the investigation consisted of performance evaluation of five major fault-prediction techniques on nineteen public datasets. Here, we found fault prediction useful for projects with a percentage of faulty modules less than a certain threshold, and there was no single technique that could provide the best results in all cases, i.e. for all nineteen project datasets. In the other part of the investigation, and as a practical use of the proposed framework, we have demonstrated that the fault information of the previous versions of the software can be effectively used to predict fault proneness in the current version of the software. Here, we found fault prediction useful when the difference between inter-version fault rates was below a certain threshold. Also, the usability of fault prediction was found to reduce with an increase in the inter-version fault rate.


  • Contents

Approval
Certificate
Acknowledgments
Abstract
List of Figures
List of Tables
List of Symbols
Abbreviations

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Thesis Organization
  1.4 Summary

2 Related Work
  2.1 Fault Prediction Models
  2.2 Public Datasets
  2.3 Evaluation Measures
    2.3.1 Numerical measures
    2.3.2 Graphical evaluation measures
  2.4 Fault Prediction Studies
  2.5 Estimating Cost of Fault Prediction
  2.6 Summary

3 Fault Prediction Results: How Useful Are They?
  3.1 Issues in Fault Prediction
  3.2 A Proposed Model for Evaluating Fault Prediction Efficiency
    3.2.1 General arguments
    3.2.2 Evaluation model
  3.3 Revisiting Fault Prediction Results
  3.4 Summary

4 A Cost Evaluation Framework
  4.1 The Evaluation Framework
  4.2 Experimental Study
    4.2.1 Experimental setup
    4.2.2 Experiment execution
    4.2.3 Results
    4.2.4 Experiment findings
    4.2.5 Threats to validity
    4.2.6 Discussion
  4.3 Summary

5 An Application of Cost Evaluation Framework for Multiple Releases
  5.1 The Procedure
  5.2 Experimental Study
    5.2.1 Experimental setup
    5.2.2 Experiment execution
    5.2.3 Results
    5.2.4 Threats to validity
  5.3 Summary

6 Conclusions and Future Work

References
Publications
Index


  • List of Figures

1.1 Thesis structure
3.1 Cost statistics for faulty modules
3.2 Cost statistics for non-faulty modules
4.1 Decision chart representation to evaluate the estimated Ecost
4.2 Value of NEcost for category 1 when u = 0.25 and s = 0.5
4.3 Value of NEcost for category 2 when u = 0.25 and s = 0.5
4.4 Value of NEcost for category 3 when u = 0.25 and s = 0.5
4.5 Cost characteristics of used fault-prediction techniques when u = 0.5 and s = 0.65
4.6 Cost characteristics of used fault-prediction techniques when u = 0.25 and s = 0.5
4.7 Cost characteristics of used fault-prediction techniques when u = 0.15 and s = 0.25
5.1 Decision chart representation to evaluate the estimated Ecost
5.2 Value of Ecost for Jedit versions when u = 0.25 and s = 0.5

  • List of Tables

2.1 Datasets used in the study
2.2 Confusion matrix
2.3 Fault Prediction Studies
3.1 NASA datasets
3.2 Experiment results for dataset CM1
3.3 Experiment results for dataset kc1
3.4 Experiment results for dataset kc2
3.5 Experiment results for dataset pc1
4.1 Removal costs of test techniques (in staff-hours per defect) [52]
4.2 Fault identification efficiencies of different test phases [26]
4.3 Used projects from NASA [1] and PROMISE data repository [2]
4.4 Categorization of projects based on the fraction of faulty modules
4.5 Result of experiment for PC1 (1109)
4.6 Result of experiment for AR1 (121)
4.7 Result of experiment for NW1 (403)
4.8 Result of experiment for KC3 (458)
4.9 Result of experiment for CM1 (498)
4.10 Result of experiment for PC3 (1563)
4.11 Result of experiment for ARC (234)
4.12 Result of experiment for PC4 (1458)
4.13 Result of experiment for KC1 (2109)
4.14 Result of experiment for AR4 (107)
4.15 Result of experiment for JM1 (10885)
4.16 Result of experiment for KC2 (522)
4.17 Result of experiment for Camel 1.6 (858)
4.18 Result of experiment for Ant 1.6 (351)
4.19 Result of experiment for Ant 1.7 (493)
4.20 Result of experiment for MC2 (161)
4.21 Result of experiment for J-edit 3.2 (272)
4.22 Result of experiment for Lucene 2.0 (195)
4.23 Result of experiment for J-edit 4.0 (274)
5.1 Used projects from PROMISE data repository [2]
5.2 Prediction results for Ant 1.6
5.3 Prediction results for Ant 1.7 when fault prediction model trained using Ant 1.6
5.4 Results of experiment to calculate the Ecost for Ant 1.7 using information of Ant 1.6
5.5 Prediction results for Jedit 4.0 (3 cross-validation)
5.6 Results of experiment to calculate the Ecost for Jedit 4.1 using information of Jedit 4.0
5.7 Prediction results for Jedit 4.0 and Jedit 4.1 (3 cross-validation)
5.8 Results of experiment to calculate the Ecost for Jedit 4.2 using information of Jedit 4.0 and 4.1
5.9 Prediction results for Jedit 4.0, Jedit 4.1 and Jedit 4.2 (3 cross-validation)
5.10 Results of experiment to calculate the Ecost for Jedit 4.3 using information of Jedit 4.0, 4.1 and 4.2
A1 Details of used metrics
A2 Metrics used in datasets


  • List of Symbols

Cf Normalized fault removal cost in field

    Ci Initial setup cost of used fault prediction approach

    Cs Normalized fault removal cost in system testing

    Cu Normalized fault removal cost in unit testing

    Mp Percentage of modules unit tested

s Fault identification efficiency of system testing

u Fault identification efficiency of unit testing

  • Abbreviations

    Acc Accuracy

    AUC Area Under the Curve

    Ecost Estimated Fault Removal Cost of the software when we

    use fault prediction

EFN Estimated number of False Negatives

    EFP Estimated number of False Positives

    ETP Estimated number of True Positives

    FN False Negative

    FNR False Negative Rate

    FP False Positive

    FPR False Positive Rate

    NEcost Normalized Estimated fault removal cost of the software

    when we use fault prediction

    NPV Negative Predictive Value

    PD Probability of Detection

    PF Probability of False Alarm

    PPV Positive Predictive Value

    PR Precision

Tcost Estimated fault removal cost of the software without the use of fault prediction

    TN True Negative

    TP True Positive

  • Chapter 1

    Introduction

Software fault prediction has become an important area of research in the arena of the Software Development Life Cycle. It has the potential to aid in ensuring the desired software quality as well as to achieve an economical development process. The potential of fault prediction is backed by its ability to identify the fault-prone software modules before the actual testing process begins. This helps in obtaining the desired software quality in optimum time, with optimized cost and effort.

Most of the major development organizations spend a lot of time and effort on research in the field of quality assurance activities. But the practical usage of fault prediction is equivocal. This indicates that there is a need for further research in this field that would emphasize how it is applicable in the quality assurance process.

    1.1 Motivation

The software quality assurance process focuses on the identification and quick removal of faults from the artifacts that are generated and subsequently used in the development of software. Fault prediction can help in this by identifying the fault-prone modules in the early stages of the development life cycle, which can then lead to a more streamlined effort being applied. The fault-proneness information not only points to the need for increased quality monitoring during development but also provides important advice for undertaking suitable verification and validation activities that eventually lead to improving the effectiveness and efficiency of the fault finding process.

Fault prediction is a process to predict the fault-prone software modules without executing them. Conventionally, fault prediction is done by applying machine-learning techniques over project datasets. The effectiveness of a fault-prediction technique is demonstrated by training it on a part of some known fault data and measuring its performance against the other part of the fault data. Recently, several software project data repositories became publicly available, such as the NASA Metrics Data Program [1] and the PROMISE Data Repository [2]. Availability of these public datasets has encouraged undertaking more investigations and their replications. A wide range of fault-prediction techniques has been applied to demonstrate their effectiveness on these datasets [19][8][28][49][38].

However, there are certain crucial issues which need to be resolved before the results of such prediction can be incorporated in practice. An important concern is related to the lack of suitable performance evaluation measures that would assess the economics of fault prediction if adopted in the software development process [6]. Another concern is about the typical prediction accuracy of a fault-prediction technique, which is found to be considerably low, ranging from 70 to 85 percent [32][19][20], compared to the high accuracy results obtained in other fields like image recognition, spam filters, etc. Yet another concern can be attributed to the unequal distribution of fault data, which may lead to biased learning. We know from experience that fault distributions typically emulate the Pareto principle, and hence the accuracy figures obtained from fault prediction can be grossly misleading, as a fault-prediction technique can produce high accuracy results by mostly classifying non-faulty modules as non-faulty.

The key functionality of fault prediction is to identify the highest possible number of faults with the least possible resources. However, the concerns mentioned above pose serious threats to the fault prediction results being used to streamline quality assurance activities undertaken during software development. We need to investigate further what these results mean and whether they can be used economically in the software development process.



    1.2 Objectives

The main objective of this thesis work is to propose a cost evaluation framework that helps to put the results of a fault-prediction technique in proper perspective. If the results of the fault prediction are to be used in the development process, the framework can provide an estimate of the savings in the efforts applied in subsequent phases of the software development. Specifically, we aim to answer, for a given project dataset, whether fault prediction would help, and if yes, how to choose a fault-prediction technique that would yield the optimal results.

With this dissertation, we will investigate:

Q1: For a given project, would fault prediction economically help in software development?

Q2: If yes, then how to select a fault-prediction technique for overall optimum performance?

    1.3 Thesis Organization

The overall structure of this thesis is illustrated in Figure 1.1. The content can broadly be divided into three major sections, namely Background Research, Research Contribution and Research Prospects.

Figure 1.1: Thesis structure

Chapter 2 summarizes the concepts which are relevant to this study. In particular, fault prediction models, details of the public datasets used in our experimental study, model evaluation techniques and a literature review of previous related studies are given in this chapter.

In chapter 3, we present an insight into the economy of fault prediction. In particular, we first revisit the results of some of the previous fault prediction studies on the basis of the economics of faults. Then, we refine the criteria based on fault misclassification and again measure the performance of the above-said fault-prediction techniques on the basis of cost effectiveness. We used four NASA MDP datasets to perform our study. Here, our results suggested that simple techniques like IBK perform better over most of the datasets.

In chapter 4, we propose a cost evaluation framework that can help to answer both questions using limited fault data. Essentially, the framework can provide an estimate of the savings in the efforts applied by using the results of the fault prediction in subsequent phases of the software development. To construct the cost evaluation framework, we accounted for the typical fault removal cost of different testing phases [52], along with their fault identification efficiency [26]. The first question can be answered by comparing the fault removal cost in both cases, i.e. with and without the use of fault prediction.
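Before the full framework is developed in Chapter 4, the comparison can be pictured with a deliberately simplified sketch. The two cost formulas below, the module counts and all cost figures are our own illustrative assumptions (reusing the symbols Ci, Cu, Cs, Cf, u and s from the List of Symbols); they are not the thesis's actual equations.

```python
# Hypothetical, simplified cost comparison: is fault prediction worth it here?
# All formulas and numbers below are illustrative assumptions, not the
# framework derived in Chapter 4.

def tcost(faults, Cs, Cf, s):
    """Removal cost without prediction: a fraction s of faults is caught in
    system testing, the rest escape to the field."""
    return Cs * s * faults + Cf * (1 - s) * faults

def ecost(tp, fp, fn, Ci, Cu, Cs, Cf, u, s):
    """Removal cost with prediction: predicted-faulty modules (TP + FP) get
    extra unit testing, which catches a fraction u of the TP faults; missed
    faults (FN plus unit-test escapes) fall through to system test and field."""
    escaped = fn + (1 - u) * tp
    return Ci + Cu * (tp + fp) + Cs * s * escaped + Cf * (1 - s) * escaped

t = tcost(faults=80, Cs=6.2, Cf=27.0, s=0.5)
e = ecost(tp=60, fp=40, fn=20, Ci=50.0, Cu=2.5, Cs=6.2, Cf=27.0, u=0.25, s=0.5)
print("Tcost=%.1f Ecost=%.1f -> %s" % (t, e, "use prediction" if e < t else "skip prediction"))
```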

Here, we investigated the usefulness of fault-prediction techniques based on the proposed framework by using limited fault data. The investigation consisted of performance evaluation of five major fault-prediction techniques on nineteen public datasets. We used five well-known fault-prediction techniques, namely Random Forest, J48 (C4.5 decision tree), Neural Network, K-means Clustering and IBK (K-nearest neighbors). These datasets provide a wide range of percentages of faulty modules (varying from 7 to 49 percent). We categorized these datasets into three categories based on the fault information. We used the WEKA machine learning tool to perform all listed experiments. The results of this study suggested that fault prediction can be useful for projects with a percentage of faulty modules less than a certain threshold (in our case, it varied from 21% to 42% over the specified range of testing phase efficiencies). Also, there was no single technique that could provide the best results in all cases.

In chapter 5, we show the application of the proposed cost framework over multiple subsequent releases of a software system. We evaluated the fault removal cost of the current version of the software using the fault information available from its previous versions. This estimated fault removal cost then helps to decide whether fault prediction is useful or not for the current version. To answer both research questions, we investigated the usefulness of fault-prediction techniques based on the framework on successive versions of two different software systems, namely Ant and Jedit. Here, we found fault prediction useful when the difference between inter-version fault rates was below a certain threshold (in our case, it was 2%). Also, the usability of fault prediction was found to reduce with an increase in the inter-version fault rate. Here, the difference between inter-version fault rates denotes the difference between the percentages of faulty modules present in successive versions.

Finally, we conclude the contribution of our research in Chapter 6. The future prospects of our research are also discussed in the same chapter.

    1.4 Summary

Fault-prediction techniques are used to identify faults in software code without executing it. Thus they have the potential to help the validation and verification process by accurately identifying faults. They may also contribute to an economical software development process. But most organizations still do not consider fault-prediction techniques, even though their potential has been validated in a number of studies. This indicates that there is a need for further research in this field that would emphasize how fault prediction can improve the quality assurance process. In this chapter, we highlighted the issues in the fault prediction arena and summarized our work, which tries to put fault prediction results in the correct perspective, i.e. cost effectiveness.


  • Chapter 2

    Related Work

In this chapter, we summarize the concepts which are relevant to this study. In particular, fault prediction models, details of the public datasets used in the research study, model evaluation techniques and a literature review of previous related studies are given.

    2.1 Fault Prediction Models

Fault prediction allows testers to deploy their resources more effectively and efficiently, which would potentially result in higher quality products and lower costs. Fault prediction is typically performed by applying various machine learning algorithms to known properties learned from project fault datasets. The typical way of predicting faults in software modules is to use software metrics and fault data (collected from previous releases or similar projects) to construct a fault-prediction model, which is then used to predict the fault proneness of new modules. For example, a module examined by a fault-prediction technique is classified as faulty if its feature (metric) values are similar to those of a faulty module that was used to train the technique.
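To make the similarity idea concrete, the following minimal sketch uses a k-nearest-neighbours classifier (the same idea as the IBK technique used later in the thesis); scikit-learn is our assumption here, and the module metric values are made up for illustration.

```python
# Minimal k-nearest-neighbours sketch of metric-similarity based prediction.
# The module metrics below are invented for illustration only.
from sklearn.neighbors import KNeighborsClassifier

# Each row: [lines_of_code, cyclomatic_complexity] for a module of a past release.
train_X = [[120, 14], [900, 55], [60, 4], [75, 6], [1100, 70]]
train_y = [0, 1, 0, 0, 1]  # 1 = the module turned out to be faulty

knn = KNeighborsClassifier(n_neighbors=3).fit(train_X, train_y)

# A new module whose metrics resemble the known faulty modules is predicted faulty.
print(knn.predict([[950, 60]]))  # -> [1]
```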

Many techniques have been proposed to estimate the fault-proneness of a software module, including clustering, Decision Trees, Neural Networks, Dempster-Shafer Belief Networks, Random Forest and Quad Tree based K-Means [19][20][8][27][28][9][49].

Different Approaches for Fault Prediction Models


A project manager needs to make sure a project meets its timetable and budget without loss of quality. In order to help project managers make such decisions, fault prediction models play an important role in allocating software quality assurance resources. Existing research on software fault-proneness models focuses on predicting faults from two perspectives:

The number of faults or fault density: This kind of technique predicts the number of faults (or the fault density) in a module or a component. These models typically use data from historical versions (or pre-release parts) and predict the faults in the new version (or the newly developed parts). For example, the fault data from historical releases can be used to predict faults in updated releases [46][33][50][23].

Classification: Classification predicts which modules (components) contain faults and which do not. The goal of this kind of prediction is to distinguish fault-free subsystems from faulty ones. This allows project managers to focus resources on fixing the faulty subsystems.

There are two methods to classify fault-prone modules as distinct from fault-free modules: supervised learning and unsupervised learning. Both are used in different situations. When a new system without any previous release is built, unsupervised learning needs to be adopted in order to predict fault-prone subsystems among the newly developed subsystems (modules, components, or classes). After some subsystems have been tested and put into operation, these pre-release subsystems can be used as training data to build software fault prediction models that predict for new subsystems. This is when supervised learning can be used. The difference between supervised and unsupervised learning lies in the status of the training data's class labels: if they are unknown, the learning is unsupervised; otherwise, it is supervised.

Supervised Learning. Learning is called supervised because the method operates under supervision, being provided with the actual outcome for each of the training examples. Supervised learning requires known fault measurement data (i.e. the number of faults, fault density, or whether a module is fault-prone) for the training data. Usually, fault measurement data from previous versions [46], pre-release data [44], or a similar project [29] can act as training data for predicting new projects (subsystems).

Most research reported in fault prediction, including the experiments in this dissertation, uses supervised learning. The result of supervised learning is easier to judge than that of unsupervised learning, which probably helps to explain why there are abundant reports on supervised learning in the literature and few on unsupervised learning. As in most research conducted in fault prediction, a dataset with all known classes is divided into training data and testing data: the classes for the training data are provided to a machine learning algorithm, while the testing data acts as the validation set and is used to judge the trained models. The success rate on the test data gives an objective measure of how well the machine learning algorithm performs. Repeating this process multiple times with randomly divided training and testing sets is standard data mining practice, called cross-validation. As in other data mining research, randomization, cross-validation and bootstrapping are the standard statistical procedures for fault prediction in software engineering.
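As an illustration of this train/test protocol, the sketch below runs 10-fold cross-validation of a supervised fault predictor. The thesis experiments used WEKA, so this scikit-learn version, the file name and the "defective" column are assumptions made purely for illustration.

```python
# Supervised fault prediction with 10-fold cross-validation (illustrative;
# the dataset path and column name are assumptions).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

data = pd.read_csv("kc1.csv")                 # module metrics + known fault labels
X = data.drop(columns=["defective"])          # static code metrics
y = data["defective"]                         # class: faulty or not (training labels)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
# Each fold: train on 9/10 of the modules, validate on the held-out 1/10.
scores = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("mean cross-validated accuracy: %.3f" % scores.mean())
```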

Unsupervised Learning. Sometimes we may not have fault data, or only very few modules may have previous fault data. For example, if a new project is being developed or previous fault data has not been collected, supervised learning approaches do not work because we have no labeled training data. Therefore, unsupervised learning approaches such as clustering methods may be applied. However, research on this approach is seldom reported. As far as the author is aware, Zhong et al. [55][56] were the first group to investigate this in fault prediction. They used Neural-Gas and K-means clustering to group software modules into several clusters, with the help of human experts to label each cluster as fault-prone or not fault-prone. Their results indicate promising potential for this unsupervised learning method.
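A minimal sketch of this clustering idea is shown below. It is not Zhong et al.'s exact procedure: the heuristic that stands in for their human experts (flagging the cluster with the higher mean complexity as fault-prone), the file name and the column name are all assumptions.

```python
# Unsupervised fault-proneness labelling via K-means (illustrative sketch).
import pandas as pd
from sklearn.cluster import KMeans

X = pd.read_csv("new_project_metrics.csv")   # unlabeled module metrics (assumed file)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Stand-in for the human expert: treat the cluster whose modules have the
# higher mean cyclomatic complexity as the fault-prone one (assumed column).
fault_prone = X.groupby(labels)["cyclomatic_complexity"].mean().idxmax()
print("modules flagged fault-prone:", int((labels == fault_prone).sum()))
```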

    2.2 Public Datasets

Several software project data repositories have become publicly available, such as the NASA Metrics Data Program [1] and the PROMISE Data Repository [2]. NASA MDP is a software project metrics repository provided by NASA and is available to users through their website. NASA MDP stores and organizes the software metrics data and the associated fault data at the module level. Currently, there are thirteen project datasets available. All NASA MDP datasets are also available in the PROMISE public repository. There are ninety-four defect datasets available in PROMISE. These datasets can therefore be used to validate the performance of various fault-prediction techniques. In the experiments of this thesis work, we used twenty-three public datasets from the NASA and PROMISE data repositories.

    Table 2.1: Datasets used in the study

Project    Faulty (%)    Number of Modules    Language    Source

    Jedit 4.3 2.23 492 Java PROMISE

    pc1 6.94 1109 C NASA MDP

    ar1 7.44 121 C PROMISE

    nw1 7.69 403 C NASA MDP

    kc3 9.34 458 Java NASA MDP

    cm1 9.84 498 C NASA MDP

    pc3 10.24 1563 C NASA MDP

    Arc 11.54 234 C++ PROMISE

    pc4 12.21 1458 C NASA MDP

    kc1 15.46 2109 C++ NASA MDP

    Jedit 4.2 13.07 367 Java PROMISE

    ar4 18.69 107 C PROMISE

    jm1 19.35 10885 C NASA MDP

    kc2 20.5 522 C++ NASA MDP

    camel1.6 21.91 858 Java PROMISE

    ant1.6 26.21 351 Java PROMISE

    Jedit4.0 24.5 306 Java PROMISE

    Jedit4.1 25.32 312 Java PROMISE

    ant1.7 27.79 493 Java PROMISE

    mc2 32.3 161 C++ NASA MDP

    jedit 3.2 33.09 272 Java PROMISE

    lucene2.0 46.67 195 Java PROMISE

    jedit 4.0 m 48.9 274 Java PROMISE

The details of these datasets are tabulated in Table 2.1. These datasets correspond to different programming languages and have different numbers of software metrics, varying from eight to forty. The descriptions of the used datasets along with their metrics are given in the Appendix (Appendix Table A1 and Appendix Table A2).

    2.3 Evaluation Measures

In this section, we summarize the various evaluation measures used by researchers to evaluate the performance of a fault-prediction technique. These measures can be broadly classified into two major categories: numerical measures and graphical measures.

    2.3.1 Numerical measures

All numerical measures can be derived from the confusion matrix. A confusion matrix contains information about the actual and predicted classifications made by a fault-prediction technique. Table 2.2 shows the confusion matrix for a two-class classification.

Table 2.2: Confusion matrix

                          Defect Present: No      Defect Present: Yes
Defect Predicted: No      TN = True Negative      FN = False Negative
Defect Predicted: Yes     FP = False Positive     TP = True Positive

Accuracy:
The prediction accuracy of a fault-prediction technique is measured as

    Accuracy = (TN + TP) / (TN + TP + FN + FP)        (2.1)

False positive rate (FPR):
It is measured as the ratio of modules incorrectly predicted as faulty to all non-faulty modules. False alarm rate and Type I error are equivalent to FPR.

    FPR = FP / (TN + FP)        (2.2)

False negative rate (FNR):
It is measured as the ratio of modules incorrectly predicted as non-faulty to all faulty modules. Type II error is equivalent to FNR.

    FNR = FN / (TP + FN)        (2.3)

Precision:
It is measured as the ratio of modules correctly predicted as faulty to all modules predicted as faulty.

    Precision = TP / (TP + FP)        (2.4)

Recall:
It is measured as the ratio of modules correctly predicted as faulty to all faulty modules. Probability of detection (PD) is equivalent to recall.

    Recall = TP / (TP + FN)        (2.5)

F-measure:
It is measured as the harmonic mean of precision and recall [36].

    F-measure = (2 × Precision × Recall) / (Precision + Recall)        (2.6)

G-mean:
The G-mean indices are defined in expressions (2.7) and (2.8). G-mean1 is the square root of the product of the probability of detection (PD) and precision. G-mean2 is the square root of the product of PD and specificity [35].

    G-mean1 = sqrt(PD × Precision)        (2.7)

    G-mean2 = sqrt(PD × Specificity)        (2.8)

J-coefficient (J-coeff):
It characterizes the performance of a prediction technique as the balance between detection and false alarms [51].

    J-coeff = PD − PF        (2.9)

When J-coeff is 0, the probability of detecting a faulty module is equal to the false alarm rate. When J-coeff is greater than 0, PD is greater than PF. J-coeff = 1 represents perfect classification, while J-coeff = −1 is the worst case, in which all modules are predicted inaccurately.
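The sketch below implements measures (2.1)-(2.9) directly from the confusion-matrix counts; the only added assumption is the usual definition of specificity, TN / (TN + FP), which the text uses but does not spell out.

```python
# Numerical evaluation measures (2.1)-(2.9) computed from TP, FP, TN, FN.
from math import sqrt

def prediction_measures(tp, fp, tn, fn):
    recall = tp / (tp + fn)            # (2.5), also PD
    pf = fp / (tn + fp)                # (2.2), FPR / probability of false alarm
    precision = tp / (tp + fp)         # (2.4)
    specificity = tn / (tn + fp)       # assumed standard definition
    return {
        "accuracy":  (tn + tp) / (tn + tp + fn + fp),                # (2.1)
        "fpr":       pf,                                             # (2.2)
        "fnr":       fn / (tp + fn),                                 # (2.3)
        "precision": precision,                                      # (2.4)
        "recall":    recall,                                         # (2.5)
        "f_measure": 2 * precision * recall / (precision + recall),  # (2.6)
        "g_mean1":   sqrt(recall * precision),                       # (2.7)
        "g_mean2":   sqrt(recall * specificity),                     # (2.8)
        "j_coeff":   recall - pf,                                    # (2.9)
    }

# Example with made-up counts: 30 faulty modules caught, 50 missed.
print(prediction_measures(tp=30, fp=20, tn=400, fn=50))
```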

    2.3.2 Graphical evaluation measures

Graphical measures depict the relationship between two or more numerical measures. Like the numerical measures, the graphical measures can also be derived from the confusion matrix.

ROC curve [54]:
An ROC curve provides a visualization of the tradeoff between the ability to correctly predict fault-prone modules (PD) and the rate of incorrectly predicted fault-free modules (PF). The area under the ROC curve (denoted AUC) is a numeric performance evaluation measure used to compare fault-prediction techniques. In an ROC curve, the best performance corresponds to high PD and low PF.

PR curve [14]:
A PR curve provides a visualization of the tradeoff between precision and recall. In a PR curve, the x-axis represents recall and the y-axis precision. Recall is another term for PD. In a PR curve, the best performance corresponds to high PD and high precision.

Cost curve [15]:
A cost curve provides a visualization of the cost of misclassification. It describes the performance of a fault-prediction technique on the basis of the cost of misclassification. Its y-axis represents the normalized expected misclassification cost, which indicates the difference between the maximum and minimum costs of misclassifying faulty modules. The x-axis represents the probability cost function.
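As a small illustration of the ROC/AUC evaluation just described, the sketch below scores a hold-out split by PD and PF at varying thresholds; scikit-learn, the file name and the "defective" column are assumptions (the thesis itself used WEKA).

```python
# ROC-curve / AUC evaluation of a fault predictor on a hold-out set (illustrative).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

data = pd.read_csv("pc1.csv")                       # assumed metrics file with labels
X, y = data.drop(columns=["defective"]), data["defective"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Predicted probability of the faulty class, swept over thresholds for the curve.
probs = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
pf, pd_rate, _ = roc_curve(y_te, probs)             # x-axis: PF, y-axis: PD
print("AUC = %.3f" % roc_auc_score(y_te, probs))
```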


    2.4 Fault Prediction Studies

In this section, we present a brief summary of some of the fault prediction studies which are relevant to our study. In particular, we summarize studies on fault-prediction techniques, some useful review articles, and research papers relevant to the cost effectiveness of fault prediction. The summarized studies are shown in Table 2.3.

These studies show that a lot of research has been done in the field of fault prediction. But more specific studies are required that show the effect of fault prediction on software quality and its economics. In this thesis, we address one of the major and complex problems in software fault prediction studies, namely how to compare the performance of different fault-prediction techniques effectively. As a solution, we propose a cost evaluation framework which compares performance on the basis of the resultant fault removal cost.

Table 2.3: Fault Prediction Studies

1. Victor R. Basili, Lionel C. Briand, and Walcelio L. Melo (1996) [5]
   Techniques: logistic regression (univariate and multivariate regression)
   Evaluation metrics: regression coefficient, p-value
   Datasets: private datasets (8 datasets)
   Conclusions: 1. They found that the C&K metrics were useful for predicting class fault-proneness during the early phases of the development life-cycle. 2. They concluded that, on their dataset, the C&K metrics were better predictors than traditional code metrics.

2. S. S. Gokhale and M. R. Lyu (1997) [17]
   Techniques: regression tree, density modeling techniques
   Evaluation metrics: accuracy, Type I and Type II error
   Datasets: private dataset (medical imaging system)
   Conclusions: 1. They found that the regression tree based technique has higher prediction accuracy than the density technique. 2. It has a lower misclassification rate compared to the density based technique.

3. T. Khoshgoftaar and N. Seliya (2002) [32]
   Techniques: CART-LS, CART-LAD and S-PLUS
   Evaluation metrics: average absolute error (aae) and average relative error (are)
   Datasets: private dataset (from a large telecommunication system)
   Conclusions: 1. They concluded that the performance of CART-LAD was better than the other two techniques. 2. S-PLUS trees had poor predictive accuracy.

4. Lan Guo, Bojan Cukic and Harshinder Singh (2003) [19]
   Techniques: Dempster-Shafer (D-S) belief network, logistic regression and discriminant analysis
   Evaluation metrics: specificity, sensitivity, overall prediction accuracy, probability of false alarm, effort
   Datasets: KC2
   Conclusions: 1. The accuracy of D-S belief networks was found to be higher than that of logistic regression and discriminant analysis.

5. Lan Guo, Yan Ma, Bojan Cukic, and Harshinder Singh (2004) [20]
   Techniques: logistic regression, discriminant analysis, Decision Tree, Rule Set, Boosting, Logistic, Kernel Density, Naive Bayes, J48, IBK, IB1, Voted Perceptron, Hyper Pipes, ROCKY
   Evaluation metrics: accuracy, probability of detection
   Datasets: CM1, JM1, KC1, KC2 and PC1
   Conclusions: 1. Random Forest generally achieves higher overall prediction accuracy and defect detection rate than the others. 2. Compared different machine learning models.

6. T. Menzies, J. DiStefano, A. Orrego, R. Chapman (2004) [42]
   Techniques: Naive Bayes and J48
   Evaluation metrics: accuracy, precision, probability of detection and probability of false alarm
   Datasets: CM1, JM1, KC1, KC2 and PC1
   Conclusions: 1. They concluded that the performance of Naive Bayes is better than the J48 algorithm. 2. They stated that accuracy is not a useful parameter for evaluation. 3. They suggested the use of fault prediction in addition to inspection for a better quality assurance activity.

7. A. Koru and Hongfang Liu (2005) [34]
   Techniques: J48 and KStar
   Evaluation metrics: F-measure, precision and recall
   Datasets: CM1, JM1, KC1, KC2 and PC1
   Conclusions: 1. They suggested it is better to perform defect prediction on data that belong to large modules. 2. They showed that defect prediction performed using class-level metrics gives better performance compared to method-level metrics.

8. Venkata U.B. Challagulla, Farokh B. Bastani, I-Ling Yen (2005) [13]
   Techniques: linear regression, Pace regression, support vector regression, neural network for a continuous goal field, support vector logistic regression, neural network for a discrete goal field, logistic regression, Naive Bayes, instance based learning, J48 tree, and 1-Rule
   Evaluation metrics: mean absolute error
   Datasets: CM1, JM1, KC1 and PC1
   Conclusions: 1. Evaluated the performance of different prediction models. 2. Showed that the combination of 1R and instance-based learning gives better prediction accuracy. 3. Also showed that size and complexity metrics are not sufficient for efficient fault prediction.

9. Tibor Gyimothy, Rudolf Ferenc, and Istvan Siket (2005) [21]
   Techniques: logistic regression (univariate and multivariate regression), decision tree and neural network
   Evaluation metrics: precision, correctness and completeness
   Datasets: Mozilla 1.0 to Mozilla 1.6
   Conclusions: 1. Presented a toolset to calculate the OO metrics from C++ software. 2. Showed how fault-proneness changed over seven versions of Mozilla.

10. U.B. Challagulla, B. Bastani, I. Yen (2006) [12]
    Techniques: Memory Based Reasoning (MBR) technique
    Evaluation metrics: accuracy, probability of detection (PD) and probability of false alarm (PF)
    Datasets: CM1, JM1, KC1 and PC1
    Conclusions: 1. They concluded that if accuracy is the only criterion, then simple MBR with Euclidean distance performs better than the other used techniques. 2. They proposed a framework that can be used to derive the optimal configuration which gives the best performance for a given defect dataset.

11. Yan Ma, Lan Guo and Bojan Cukic (2006) [39]
    Techniques: logistic regression, discriminant analysis, Decision Tree, Rule Set, Boosting, Kernel Density, Naive Bayes, J48, IBK, IB1, Voted Perceptron, VF1, Hyper Pipes, ROCKY, Random Forest, modified Random Forest
    Evaluation metrics: probability of detection, accuracy, precision, G-mean1, G-mean2, F-measure
    Datasets: CM1, JM1, KC1, KC2 and PC1
    Conclusions: 1. Proposed a novel methodology based on variants of the random forest algorithm which is more robust than random forest. 2. Compared different machine learning models.

12. T. Menzies, J. Greenwald and A. Frank (2007) [43]
    Techniques: Naive Bayes, J48 and log filtering techniques
    Evaluation metrics: probability of detection (PD) and probability of false alarm (PF)
    Datasets: CM1, KC3, KC4, MW1, PC1, PC2, PC3 and PC4
    Conclusions: 1. They showed that data mining of static code attributes to learn defect predictors is useful. 2. They concluded that the used predictors were useful for prioritizing a resource-bound exploration of code that has to be inspected.

13. S. Kanmani, Rhymend Uthariaraj, Sankaranarayanan, P. Thambidurai (2007) [27]
    Techniques: Back Propagation Neural Network, Probabilistic Neural Network, discriminant analysis and logistic regression
    Evaluation metrics: Type I, Type II and overall misclassification rate
    Datasets: PC1, PC2, PC3, PC4, PC5 and PC6
    Conclusions: 1. Probabilistic Neural Networks outperform Back Propagation Neural Networks in predicting the fault proneness of object-oriented software.

14. Zhan Li, Marek Reformat (2007) [37]
    Techniques: support vector machine, C4.5, multilayer perceptron and Naive Bayes classifier
    Evaluation metrics: sensitivity, specificity and accuracy
    Datasets: JM1 and KC1
    Conclusions: 1. The performance of the proposed methodology, SimBoost, was found to be better than conventional techniques. 2. The authors proposed fuzzy labels for classification purposes.

15. Naeem Seliya, Taghi M. Khoshgoftaar (2007) [47]
    Techniques: Expectation Maximization, C4.5
    Evaluation metrics: Type I, Type II and overall error rate
    Datasets: KC1, KC2, KC3 and JM1
    Conclusions: 1. EM-based semi-supervised classification improves the performance of software quality models.

16. Yue Jiang, Bojan Cukic and Yan Ma (2008) [25]
    Techniques: Naive Bayes, Logistic, IB1, J48, Bagging
    Evaluation metrics: all available evaluation techniques; in addition, introduced the cost curve
    Datasets: CM1, JM1, KC1, KC2, KC4, MC2, PC1 and PC5
    Conclusions: 1. Selection of the best prediction model cannot be made without considering software cost characteristics.

17. Olivier Vandecruys, David Martens, Bart Baesens, Christophe Mues, Manu De Backer and Raf Haesen (2008) [50]
    Techniques: AntMiner+, C4.5, logistic regression and support vector machine
    Evaluation metrics: accuracy, specificity and sensitivity
    Datasets: KC1, PC1 and PC4
    Conclusions: 1. The authors argued that the intuitiveness and comprehensibility of the AntMiner+ model was superior to the compared models.

18. B. Turhan and A. Bener (2009) [48]
    Techniques: Naive Bayes
    Evaluation metrics: probability of detection (PD) and probability of false alarm (PF)
    Datasets: CM1, KC3, KC4, MW1, PC1, PC2, PC3 and PC4
    Conclusions: 1. They showed that the independence assumption of Naive Bayes was not harmful for defect prediction in datasets with PCA preprocessing. 2. They showed that assigning weights to static code attributes can significantly increase the prediction performance.

19. Huihua Lu, Bojan Cukic, Mark Culp (2011) [38]
    Techniques: Random Forest, FTF
    Evaluation metrics: probability of detection and the Area Under the Receiver Operating Characteristic Curve (AUC)
    Datasets: JM1, KC1, PC1, PC3 and PC4
    Conclusions: 1. The semi-supervised technique outperforms the corresponding supervised technique.

20. P.S. Bishnu and V. Bhattacherjee (2011) [9]
    Techniques: K-Means, Catal et al. two-stage approach (CT), single-stage approach (CS), Naive Bayes and linear discriminant analysis
    Evaluation metrics: false positive rate, false negative rate and error
    Datasets: AR3, AR4, AR5, SYD1 and SYD2
    Conclusions: 1. The overall error rate of the QDK algorithm was found comparable to the other compared techniques.

    2.5 Estimating Cost of Fault Prediction

Software fault prediction attracts significant attention as it can offer guidance to software verification and validation activities. Over the past few years, many organizations have made their datasets, containing software metrics and the respective fault information, publicly available. The availability of these datasets encourages researchers to validate the performance of various machine learning techniques in predicting the fault proneness of software modules. Many research studies have also been performed to evaluate the performance of these fault-prediction techniques. But it seems that they ignored the impact of fault misclassification on the economics of software development. Certifying a considerable number of faulty modules as non-faulty raises serious concerns, as it may increase the development cost owing to the higher cost of removing those faults in later phases. Hence, a more viable evaluation measure would favor techniques which tend to reduce the fault removal cost.

Many studies have used different criteria to evaluate the performance of the various fault-prediction techniques under investigation. Some of the used criteria are accuracy, precision, recall and mean absolute error, but these criteria do not consider the cost parameters of software development. A few studies have since presented cost measures to evaluate the cost effectiveness of fault prediction. In this section, we summarize the studies which measure the cost effectiveness of fault prediction and relate them to our work.

Jiang et al. [25] used various metrics to measure the performance of fault-prediction techniques. They then introduced the cost curve, a measure to estimate the cost effectiveness of a classification technique, to evaluate the performance of a fault-prediction technique. They drew the conclusion that cost characteristics must be considered to select the best prediction technique.

Jiang et al. [24] addressed a more general problem, in which they observed that the cost implications of false positives and false negatives are different. They analyzed the benefits of fault-prediction techniques which incorporate misclassification cost in the development of the prediction model. They performed 11 experiments with different costs for false positives and false negatives on 13 datasets. They concluded that cost-sensitive modeling does not improve the overall performance of fault-prediction techniques. Nevertheless, explicit information about misclassification cost makes it easier for software managers to select the most appropriate technique.

Mende et al. [41] pointed out that traditional prediction techniques typically ignore the effort needed to fix the faults, i.e., they do not distinguish between a predicted fault in a small module and a predicted fault in a large module. They therefore introduced a performance measure (popt) that takes the size of the modules into account when measuring the performance of a fault-prediction technique. They performed their study on thirteen NASA datasets. They concluded that their results indicate the need for further research to improve existing prediction models, not only with more sophisticated classification algorithms, but also by searching for better performance measures.

Mende et al. [40] proposed two strategies, namely AD (effort-aware binary prediction) and DD (effort-aware prediction based on defect density), to include the notion of effort awareness in fault-prediction techniques. The first strategy, AD, is applicable to any probabilistic classifier, while DD is applicable only to regression algorithms. They evaluated these strategies on fifteen publicly available datasets. They concluded that both strategies improve the cost effectiveness of fault-prediction techniques significantly, in the statistical and practical sense.

Arisholm et al. [3] presented a study performed in an industrial setting where they tried to build fault prediction models to efficiently predict faults in a Java system having multiple versions. They also proposed a cost performance measure (CE), a variation of lift charts where the x-axis contains the ratio of lines of code instead of modules. They concluded that the popular confusion matrix criterion is not clearly related to cost-effectiveness.

Catal et al. [11] presented a literature review of fault-prediction studies from 1990 to 2009. They reviewed the results of previous studies as well as discussed current trends. Bell et al. [6] presented a challenge paper and discussed some important issues regarding the impact of fault-prediction studies on testing and other efforts. They concluded that no study had yet investigated the impact of fault prediction on the software development process. They also highlighted that coming up with a method to assess the effectiveness of fault-prediction studies, if adopted in a software project, would be helpful for the software community.

Jiang et al. [25] used the cost curve to show the cost effectiveness of fault-prediction studies, but they assumed the same misclassification costs for each module, which might be unreasonable in practice. Mende et al. [41] introduced a new performance measure, popt, that accounts for module size when evaluating the performance of a fault-prediction technique, whereas in our framework the fault removal cost of a particular phase is the same for all modules. Jiang et al. [24] examined the cost impact of fault misclassifications over eleven different (arbitrarily chosen) values for the cost of false positives and false negatives. These values were taken as the same for all phases of software development, which is not a practical assumption. In this thesis, we propose a new cost evaluation framework which overcomes this limitation by using organization-wide cost information and computing the estimated fault removal costs based on the phase in which faults are identified. Wagner et al. [52] summarized the fault removal cost for different testing stages. Jones et al. [26] summarized the fault identification efficiency of different testing phases. We have used these parameters to compute the estimated fault removal cost for a specific fault-prediction technique, which eventually helped us to decide its applicability in a more precise way.

    2.6 Summary

In this chapter, we presented a brief summary of the concepts related to our study. In particular, we have shown the conventional way of performing fault prediction, the measures used to evaluate the performance of fault-prediction


techniques, and a brief summary of the publicly available dataset repositories. We also summarized the studies related to this thesis work, framing a background for it.


Chapter 3

Fault Prediction Results: How Useful Are They?

In this chapter, we give an insight into the cost economics of fault prediction. In particular, we revisit the results of some earlier fault prediction studies to account for fault misclassification. We first investigate how different authors measured the performance of their presented fault-prediction techniques. Then, we refine the performance evaluation criteria based on fault misclassification and revisit the outcomes of these fault-prediction techniques.

In our study, we used fifteen research papers based on public datasets, along with their outcomes and measurement criteria (see Table 2.3). The remainder of this chapter is organized as follows. Section 3.1 discusses the issues in fault prediction. Section 3.2 presents a new model for evaluating the fault prediction performance of a technique based on cost economics. Section 3.3 presents a revision of fault prediction results based on the presented evaluation model, and Section 3.4 summarizes our findings.

    3.1 Issues in Fault Prediction

An economical software development process requires the identification and removal of faults in the early stages of development. Fault-prediction techniques are used to predict fault-prone modules in the software. Predicting faults correctly may help in reducing the effort applied in the later stages of


    testing.

    But building an accurate prediction model is a challenging task because the

    dataset being used may have noisy content and may contain outliers [7]. It

is hard to find a suitable measure that can provide a reliable estimation of the

    various characteristics of the software system [6]. This makes the study of

    fault prediction much more involved, as we are dealing with many alternative

    and imprecise measures to compute the same software characteristic.

    It has been found that the number of faulty modules represents only a small

fraction of the total number of modules in the software. This observation, in particular, is critically important for putting the results obtained by a fault-prediction technique in the correct perspective. With few faulty modules in the dataset, a high prediction accuracy may result simply from classifying the majority of non-faulty modules as non-faulty. However, our main concern is the identification of faulty modules rather than non-faulty ones. Simply considering accuracy can therefore sometimes be misleading.

Many efforts have been made to evaluate the performance of fault-prediction techniques. However, they tend to ignore the impact of fault misclassification on the economics of software development. For instance, a high number of false positives requires unnecessary extra effort to scan modules which are non-faulty. On the other hand, a high number of false negatives leaves too many faulty modules out of the scanner, so the technique does not seem to help either. This calls for choosing a technique that predicts fewer false negatives, even if it tends to be less accurate and/or yields a higher number of false positives. Therefore, we revisited the results of previous fault prediction studies on the basis of fault misclassification.

3.2 A Proposed Model for Evaluating Fault Prediction Efficiency

    Here, we present a performance evaluation model, which evaluates the perfor-

    mance of fault-prediction techniques in the context of economics.


Figure 3.1 and Figure 3.2 show the cost statistics for faulty and non-faulty modules, respectively. If a faulty module is predicted as faulty, it requires unit-level testing effort; but if it is predicted as non-faulty, it requires extra effort in later development stages to remove the same fault (see Figure 3.1). However, if a non-faulty module is incorrectly predicted as faulty, it requires extra effort at the time of unit testing (see Figure 3.2). We used both of the above observations to compare the performance of fault-prediction techniques in our presented evaluation model.

Figure 3.1: Cost statistics for faulty modules

Figure 3.2: Cost statistics for non-faulty modules


    3.2.1 General arguments

Based on the above investigations and observations, we found a need to use prediction techniques that try to minimize false negatives, even at the cost of increasing false positives and compromising some accuracy. Accordingly, we

    present a model to evaluate the performance of fault-prediction techniques.

    The presented model tends to prioritize the performance of a fault-prediction

    technique based on three criteria, namely, false negative rate, false positive

    rate and prediction accuracy.

    The general arguments to measure the performance of a fault-prediction tech-

    nique are

    1. False negatives are critically important for the overall reduction in the

testing and maintenance cost of the system, and hence are to be minimized.

    2. False positives are to be reduced but can be compromised if they help to

    reduce false negatives.

    3. Similarly, prediction accuracy can also be compromised if it helps to reduce

    false negatives.

    3.2.2 Evaluation model

We now quantify our arguments towards finding the best technique. Here we discuss how we select a technique as the best one from the perspective of economic software development. The defined model is given below:

1. Choose as the best technique the one having the least FNR value, provided the difference in FPR remains within a threshold.

2. If two or more techniques have nearly the same FNR value, then choose as the best technique the one having the least FPR value.

3. If two or more techniques have nearly equal FNR and FPR values, then choose as the best technique the one having the maximum accuracy.

We define the above three-step evaluation model to compare the performance of fault-prediction techniques so that the selected technique requires minimum effort for fault removal.
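The selection rule can be expressed directly in code. The following is a minimal sketch in Python; the tolerance values used to decide when two FNR or FPR values are "nearly the same" are illustrative assumptions, not values prescribed by the model, and the example techniques are hypothetical.

```python
# Illustrative sketch of the three-step selection model.
# The tolerance thresholds below are assumptions for demonstration only.

FNR_TOL = 0.05   # FNR values within this range are treated as "nearly the same"
FPR_TOL = 0.05   # likewise for FPR

def select_best(results):
    """results: list of dicts with keys 'name', 'fnr', 'fpr', 'acc'."""
    # Step 1: keep techniques whose FNR is within FNR_TOL of the minimum FNR.
    min_fnr = min(r['fnr'] for r in results)
    candidates = [r for r in results if r['fnr'] - min_fnr <= FNR_TOL]

    # Step 2: among these, keep techniques whose FPR is within FPR_TOL of the minimum FPR.
    min_fpr = min(r['fpr'] for r in candidates)
    candidates = [r for r in candidates if r['fpr'] - min_fpr <= FPR_TOL]

    # Step 3: break remaining ties by the highest accuracy.
    return max(candidates, key=lambda r: r['acc'])

# Example with three hypothetical techniques:
techniques = [
    {'name': 'A', 'fnr': 0.69, 'fpr': 0.06, 'acc': 87.95},
    {'name': 'B', 'fnr': 0.69, 'fpr': 0.11, 'acc': 83.53},
    {'name': 'C', 'fnr': 0.94, 'fpr': 0.04, 'acc': 87.55},
]
print(select_best(techniques)['name'])  # -> 'A'
```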


Wagner et al. [52] presented the quality economics of defect-detection techniques and the impact of uncovered faults on software cost as well as quality. This study supports our presented model. Use of this evaluation model helps to determine the impact of fault prediction on the software cost due to undetected faults.

    3.3 Revisiting Fault Prediction Results

There have been various studies in the field of software fault prediction. In our analysis, we used the studies performed on public datasets. Table 2.3 summarizes the detailed studies of different authors, together with the evaluation measures they used and the conclusions they drew. We observed that authors used various evaluation measures to compare the performance of different fault-prediction techniques, which made the comparison even more complicated. Moreover, the performance of a technique varies with the dataset used. Therefore, we revisited the results of the earlier fault-prediction studies (Table 2.3) over four NASA MDP [1] datasets (Table 3.1), incorporating the above-mentioned performance measures, i.e., false negatives and false positives. All reported experiments utilized technique implementations from the WEKA data-mining tool [53]. All performance measurements were generated by threefold cross-validation.

    Table 3.1: NASA datasets

    Project # modules % with defects Language

    CM1 496 9.80% C

    KC1 2,109 15.50% C++

    KC2 520 20.40% C++

    PC1 1,109 6.90% C

A high FNR shows that many faults remain undetected under the scanner of the fault-prediction technique, and hence has a high impact on software quality as well as on the testing and maintenance cost. At the same time, a high FPR requires more effort for unit testing.
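For reference, the FNR, FPR and accuracy values reported in Tables 3.2 to 3.5 follow directly from the confusion-matrix counts. A minimal sketch (the IBK row for CM1 in Table 3.2 is used as the example input):

```python
def rates(tp, tn, fp, fn):
    """Return (accuracy %, FNR, FPR) from confusion-matrix counts."""
    acc = 100.0 * (tp + tn) / (tp + tn + fp + fn)
    fnr = fn / (fn + tp)   # faulty modules predicted as non-faulty
    fpr = fp / (fp + tn)   # non-faulty modules predicted as faulty
    return acc, fnr, fpr

# IBK on CM1 (Table 3.2): TP=15, TN=423, FP=26, FN=34
acc, fnr, fpr = rates(tp=15, tn=423, fp=26, fn=34)
print(round(acc, 2), round(fnr, 2), round(fpr, 2))  # 87.95 0.69 0.06
```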

    Overall, this suggests that for the development of economic and high quality


software, we should choose a technique that predicts fewer false negatives, even if it tends to be less accurate and/or predicts a higher number of false positives. For our analysis, we combined the results of various authors (mentioned in Table 2.3) with the results of our presented model (FNR and FPR) in Tables 3.2 to 3.5. Then, we interpreted the performance of these techniques in accordance with our model.

We have evaluated the performance of these techniques over four NASA datasets, viz. CM1, KC1, KC2 and PC1. We used the WEKA [53] data mining tool to run all the experiments. The interpretation is as follows:

For dataset CM1 (Table 3.2), the techniques IBK, IB1 and Naïve Bayes have similar false negative rate (FNR) values, but Naïve Bayes has a higher false positive rate (FPR) than the other two. Since Step 2 of our model compares the FPR values, and IBK and IB1 have similar FPR values, both are equally good when compared to the other techniques.

For dataset KC1 (Table 3.3), the techniques IBK, IB1 and Classification via Clustering have similar FNR values, but IBK has the least FPR value; hence it outperforms all other techniques and can be considered the best for this dataset.

For dataset KC2 (Table 3.4), the techniques Bayesian Logistic Regression and Voted Perceptron have the least FNR values, but their FPR values are very high, so they are not effective because almost all modules are predicted as faulty. Hence, we consider the Decision Stump technique as the best technique.

    For dataset PC1 (Table 3.5), techniques IBK and IB1 have similar FNR values

    but IB1 has slightly more false positives, so IBK is considered to be the best

    one for the dataset PC1.

Generalizing over these four datasets, our results show that the IBK technique is the best among all the techniques considered.


    Table 3.2: Experiment results for dataset CM1

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 87.55 3 433 16 46 0.94 0.04 0.16 0.06 0.09

    Simple Logistic 89.76 1 446 3 48 0.98 0.01 0.25 0.02 0.04

    SMO 89.76 0 447 2 49 1 0 0 0 0

    Voted Perceptron 89.96 0 448 1 49 1 0 0 0 0

    IBK 87.95 15 423 26 34 0.69 0.06 0.37 0.31 0.33

    IB1 87.95 15 423 26 34 0.69 0.06 0.37 0.31 0.33

    Bagging 89.96 0 448 1 49 1 0 0 0 0

Classification via Regression 89.56 3 443 6 46 0.94 0.01 0.33 0.06 0.1

    Dagging 89.96 0 448 1 49 1 0 0 0 0

    Stacking 90.16 0 449 0 49 1 0 0 0 0

    Hyper pipes 89.56 0 446 3 49 1 0.01 0 0 0

    Decision Table 90.16 0 449 0 49 1 0 0 0 0

    PART 89.96 1 447 2 48 0.98 0 0.33 0.02 0.04

    Jrip (RIPPER) 89.56 1 445 4 48 0.98 0.01 0.2 0.02 0.04

    J 48 89.96 4 439 10 45 0.92 0.02 0.29 0.08 0.13

    Random Forest 89.76 6 441 8 43 0.88 0.02 0.43 0.12 0.19

    Decision Stump 90.16 0 449 0 49 1 0 0 0 0

    BF tree 89.96 1 447 2 48 0.98 0 0.33 0.02 0.04

Naïve Bayes 83.53 15 401 48 34 0.69 0.11 0.24 0.31 0.27

Bayesian Logistic Regression 90.16 0 449 0 49 1 0 0 0 0

    Logistic 88.15 8 431 18 41 0.84 0.04 0.31 0.16 0.21

Classification via Clustering 84.14 13 406 43 36 0.73 0.1 0.23 0.27 0.25

    Grading 90.16 0 449 0 49 1 0 0 0 0

    Zero r 90.16 0 449 0 49 1 0 0 0 0

Table 3.3: Experiment results for dataset KC1

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 85.78 69 1740 43 257 0.79 0.02 0.62 0.21 0.32

    Simple Logistic 85.63 66 1740 43 260 0.8 0.02 0.61 0.2 0.3

    SMO 84.64 9 1776 7 317 0.97 0 0.56 0.03 0.05

    Voted Perceptron 81.79 117 1608 175 209 0.64 0.1 0.4 0.36 0.38

    IBK 84.45 134 1647 136 192 0.59 0.08 0.5 0.41 0.45

    IB1 83.36 134 1624 159 192 0.59 0.09 0.46 0.41 0.43

    Bagging 85.92 78 1734 49 248 0.76 0.03 0.61 0.24 0.34

Classification via Regression 85.4 63 1738 45 263 0.81 0.03 0.58 0.19 0.29

    Dagging 84.83 12 1777 6 314 0.96 0 0.67 0.04 0.07

    Stacking 84.54 0 1783 0 326 1 0 0 0 0

    Hyper pipes 85.07 13 1781 2 313 0.96 0 0.87 0.04 0.08

    Decision Table 84.73 43 1744 39 283 0.87 0.02 0.52 0.13 0.21

    PART 85.02 50 1743 40 276 0.85 0.02 0.56 0.15 0.24

    Jrip (RIPPER) 84.68 84 1702 81 242 0.74 0.05 0.51 0.26 0.34

    J 48 85.21 96 1701 82 230 0.71 0.05 0.54 0.29 0.38

    Random Forest 85.25 92 1706 77 234 0.72 0.04 0.54 0.28 0.37

    Decision Stump 84.54 0 1783 0 326 1 0 0 0 0

    BF tree 85.25 40 1758 25 286 0.88 0.01 0.62 0.12 0.2

Naïve Bayes 82.46 120 1619 164 206 0.63 0.09 0.42 0.37 0.39

Bayesian Logistic Regression 84.73 13 1774 9 313 0.96 0.01 0.59 0.04 0.07

    Logistic 85.3 70 1729 54 256 0.79 0.03 0.56 0.21 0.31

Classification via Clustering 81.79 129 1596 187 197 0.6 0.1 0.41 0.4 0.4

    Grading 84.54 0 1783 0 326 1 0 0 0 0

    Zero r 84.54 0 1783 0 326 1 0 0 0 0


Table 3.4: Experiment results for dataset KC2

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 83.14 39 395 20 68 0.64 0.05 0.66 0.36 0.47

    Simple Logistic 82.95 40 393 22 67 0.63 0.05 0.65 0.37 0.47

    SMO 83.52 26 410 5 81 0.76 0.01 0.84 0.24 0.38

    Voted Perceptron 24.52 106 22 393 1 0.01 0.95 0.21 0.99 0.35

    IBK 79.12 50 363 52 57 0.53 0.13 0.49 0.47 0.48

    IB1 76.25 51 347 68 56 0.52 0.16 0.43 0.48 0.45

    Bagging 83.72 50 387 28 57 0.53 0.07 0.64 0.47 0.54

Classification via Regression 82.57 45 386 29 62 0.58 0.07 0.61 0.42 0.5

    Dagging 81.8 17 410 5 90 0.84 0.01 0.77 0.16 0.26

    Stacking 79.5 0 415 0 107 1 0 0 0 0

    Hyper pipes 81.99 19 409 6 88 0.82 0.01 0.76 0.18 0.29

    Decision Table 82.57 45 386 29 62 0.58 0.07 0.61 0.42 0.5

    PART 80.84 32 390 25 75 0.7 0.06 0.56 0.3 0.39

    Jrip (RIPPER) 83.52 58 378 37 49 0.46 0.09 0.61 0.54 0.57

    J 48 81.42 46 379 36 61 0.57 0.09 0.56 0.43 0.49

    Random Forest 81.8 48 379 36 59 0.55 0.09 0.57 0.45 0.5

    Decision Stump 78.93 80 332 83 27 0.25 0.2 0.49 0.75 0.59

    BF tree 82.57 50 381 34 57 0.53 0.08 0.6 0.47 0.52

Naïve Bayes 83.52 45 391 24 62 0.58 0.06 0.65 0.42 0.51

Bayesian Logistic Regression 20.88 107 2 413 0 0 1 0.21 1 0.34

    Logistic 82.38 47 383 32 60 0.56 0.08 0.59 0.44 0.51

Classification via Clustering 81.03 70 353 62 37 0.35 0.15 0.53 0.65 0.59

    Grading 79.5 0 415 0 107 1 0 0 0 0

    Zero r 79.5 0 415 0 107 1 0 0 0 0

Table 3.5: Experiment results for dataset PC1

Technique name Acc TP TN FP FN FNR FPR Precision Recall F-measure

    Neural Network 93.6 18 1020 12 59 0.77 0.01 0.6 0.23 0.34

    Simple Logistic 92.79 5 1024 8 72 0.94 0.01 0.38 0.06 0.11

    SMO 93.15 1 1032 0 76 0.99 0 1 0.01 0.03

    Voted Perceptron 91.61 0 1016 16 77 1 0.02 0 0 0

    IBK 92.43 34 991 41 43 0.56 0.04 0.45 0.44 0.45

    IB1 92.25 34 989 43 43 0.56 0.04 0.44 0.44 0.44

    Bagging 92.88 6 1024 8 71 0.92 0.01 0.43 0.08 0.13

Classification via Regression 92.79 3 1026 6 74 0.96 0.01 0.33 0.04 0.07

    Dagging 93.06 1 1031 1 76 0.99 0 0.5 0.01 0.03

    Stacking 93.06 0 1032 0 77 1 0 0 0 0

    Hyper pipes 92.52 2 1024 8 75 0.97 0.01 0.2 0.03 0.05

    Decision Table 92.7 5 1023 9 72 0.94 0.01 0.36 0.06 0.11

    PART 92.43 1 1024 8 76 0.99 0.01 0.11 0.01 0.02

    Jrip (RIPPER) 92.88 7 1023 9 70 0.91 0.01 0.44 0.09 0.15

    J 48 92.7 11 1017 15 66 0.86 0.01 0.42 0.14 0.21

    Random Forest 92.9666 20 1011 21 57 0.74 0.02 0.49 0.26 0.34

    Decision Stump 92.88 2 1028 4 75 0.97 0 0.33 0.03 0.05

    BF tree 92.7 4 1024 8 73 0.95 0.01 0.33 0.05 0.09

Naïve Bayes 89.36 24 967 65 53 0.69 0.06 0.27 0.31 0.29

Bayesian Logistic Regression 93.06 0 1032 0 77 1 0 0 0 0

    Logistic 92.06 8 1013 19 69 0.9 0.02 0.3 0.1 0.15

Classification via Clustering 89.81 19 977 55 57 0.75 0.05 0.26 0.25 0.25

    Grading 93.06 0 1032 0 77 1 0 0 0 0

    Zero r 93.06 0 1032 0 77 1 0 0 0 0


    3.4 Summary

Software fault prediction attracts significant attention as it can offer guidance to software verification and validation activities. Over the past few years, many organizations have publicly provided datasets describing module metrics and their fault content. The availability of these datasets encourages researchers to perform fault prediction studies using several machine learning techniques. In this chapter, we studied the outcomes of some of the

    earlier studies undertaken in this area. We found that they have used various

    criteria to evaluate the performance of a given technique. In most of the cases,

    these studies have used prediction accuracy to show how good a technique is.

However, they seem to ignore the impact of the fault misclassification rate in judging the overall performance of the various fault-prediction techniques. Certifying a considerable number of faulty modules as non-faulty raises serious concerns, given that faulty modules are themselves small in number compared to non-faulty modules. A more viable evaluation criterion is to favor techniques which tend to reduce false negatives even if they compromise on false positives and/or prediction accuracy.

We have re-analyzed the results of earlier studies and refined their outcomes based on our presented model. Our contribution in this chapter is to refine the way the best technique is selected. We also identify the need for an evaluation measure that provides specific information about how cost-economic fault-prediction techniques are and what their fundamental limitations are.


Chapter 4

    A Cost Evaluation Framework

In the previous chapter, we investigated the impact of fault misclassification on software economics and quality. In this chapter, we quantify the fault removal cost in different stages of software development when fault prediction is used, and answer both research questions.

Specifically, we propose a cost evaluation framework that can help to put the results of fault prediction in a proper usability context. Essentially, the framework provides an estimate of the savings in effort obtained by using the results of fault prediction in subsequent phases of software development. To construct the framework, we accounted for realistic fault removal costs of different testing phases [52], along with their fault identification efficiencies [26]. We have used this framework to investigate two important and related research questions: for a given project dataset, would fault prediction help? And if yes, how should a fault-prediction technique be chosen to yield optimal results? The first question can be answered by comparing the fault removal cost in both cases, i.e., with and without the use of fault prediction.

    The remainder of this chapter is organized as follows. In Section 4.1, we present

    our proposed cost evaluation framework. Section 4.2 presents an experimental

    study to investigate the usefulness of fault-prediction techniques using our

    proposed framework. We discuss the implications of using our framework in

Section 4.3, and a summary is given in Section 4.4.


    4.1 The Evaluation Framework

In the previous chapter, we highlighted the need for a cost evaluation measure that compares the performance of fault-prediction techniques on the basis of their economics. Jones [30] states that 30-40 percent of the development cost is spent on quality assurance and fault removal. Since fault-prediction techniques are used to predict fault-prone modules early in the development life cycle, they can help in reducing the cost incurred on testing and maintenance.

Here, we construct a cost evaluation framework which accounts for the realistic cost required to remove a fault and computes the estimated fault removal cost for a specific fault-prediction technique. The constraints we accounted for in our framework include:

    (1) Fault removal cost varies with testing phases.

(2) It is not possible to identify 100% of faults in a specific testing phase.

(3) It is practically not feasible to perform unit testing on all modules.

We have used the normalized fault removal costs suggested by Wagner et al. [52] to formulate our cost evaluation framework, but these costs may vary from one organization to another and also depend on various characteristics of the project. The normalized costs are summarized in Table 4.1. The fault identification efficiencies for the different testing phases are taken from the study of Jones [26] and are summarized in Table 4.2. Wilde et al. [45] stated that more than fifty percent of modules are very small in size, hence unit testing these modules is unfruitful. We have included this value (0.5) as the threshold for unit testing in our framework.

Table 4.1: Removal costs of test techniques (in staff-hours per defect) [52]

    Type Lowest Mean Median Highest

    Unit 1.5 3.46 2.5 6

    System 2.82 8.37 6.2 20

    Field 3.9 27.24 27 66.6


Table 4.2: Fault identification efficiencies of different test phases [26]

    Type Lowest Median Highest

    Unit 0.1 0.25 0.5

    System 0.25 0.5 0.65

Figure 3.1 and Figure 3.2 show the cost statistics for faulty and non-faulty modules, respectively. Software modules which are predicted as faulty (true positives and false positives) by the fault-prediction technique require verification and testing cost at the module level, i.e., a cost equal to the unit testing cost (Cu in our study). Since 100% identification of faults in a specific testing phase is not possible, some of the correctly predicted faulty modules (true positives) remain undetected in unit testing. Faulty modules which are predicted as non-faulty (false negatives), together with the correctly predicted faulty modules which remain undetected in unit testing, are detected in later stages and require a fault removal cost equal to that of system testing or field testing (Cs and Cf, respectively, in our case). The testing techniques used in our framework, along with their respective fault removal costs and efficiencies, can vary from organization to organization. Equation 4.1 shows the proposed cost evaluation framework to estimate the overall fault removal cost. Equation 4.2 shows the minimum fault removal cost without the use of fault prediction. The normalized fault removal cost and its interpretation are shown in Equation 4.3.

Ecost = Ci + Cu · (FP + TP) + s · Cs · (FN + (1 − u) · TP) + (1 − s) · Cf · (FN + (1 − u) · TP)    (4.1)

Tcost = Mp · Cu · TM + s · Cs · (1 − u) · FM + (1 − s) · Cf · (1 − u) · FM    (4.2)

NEcost = Ecost / Tcost;   if NEcost < 1, fault prediction is useful;   if NEcost ≥ 1, use unit testing    (4.3)

    Where, Ecost - Estimated fault removal cost of the software when we use fault

    prediction.

    The Economics of Fault Prediction

  • 4.1 The Evaluation Framework 32

Tcost - Estimated fault removal cost of the software without the use of fault

    prediction.

    NEcost- Normalized Estimated fault removal cost of the software when we use

    fault prediction.

Ci - Initial setup cost of the fault-prediction technique used.

    Cu - Normalized fault removal cost in unit testing.

    Cs - Normalized fault removal cost in system testing.

Cf - Normalized fault removal cost in field testing.

    Mp - Percentage of modules unit tested.

    FP - Number of false positives.

    FN - Number of false negatives.

    TP - Number of true positives.

    TM - Total modules.

    FM - Total number of faulty modules.

u - Fault identification efficiency of unit testing.

s - Fault identification efficiency of system testing.
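Equations 4.1 to 4.3 can be transcribed directly into code. The following is a minimal Python sketch; the variable names mirror the definitions above, and the function merely evaluates the formulas for a given confusion matrix and cost/efficiency setting.

```python
def estimated_costs(tp, fp, fn, tm, fm,
                    c_i, c_u, c_s, c_f, m_p, u, s):
    """Evaluate Equations 4.1-4.3.

    tp, fp, fn    : confusion-matrix counts for the prediction technique
    tm, fm        : total modules and total faulty modules
    c_i           : initial setup cost of the prediction technique
    c_u, c_s, c_f : fault removal costs in unit, system and field testing
    m_p           : fraction of modules unit tested without prediction
    u, s          : fault identification efficiencies of unit and system testing
    """
    escaped = fn + (1 - u) * tp          # faults that escape unit testing
    e_cost = (c_i + c_u * (fp + tp)
              + s * c_s * escaped
              + (1 - s) * c_f * escaped)                      # Eq. 4.1

    t_cost = (m_p * c_u * tm
              + s * c_s * (1 - u) * fm
              + (1 - s) * c_f * (1 - u) * fm)                 # Eq. 4.2

    ne_cost = e_cost / t_cost                                 # Eq. 4.3
    return e_cost, t_cost, ne_cost
```

By Equation 4.3, an NEcost below 1 indicates that using the fault-prediction technique is expected to be cheaper than unit testing the usual fraction of modules, while a value of 1 or more suggests unit testing should be preferred.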

Our cost evaluation framework considers a more practical scenario, in which the undetected faults are traced through the later testing phases and the corresponding fault removal cost is evaluated based on organization-specific statistics. This makes the proposed framework a more viable performance measure than the other measures.

In our experiment, we used the values of Cu, Cs and Cf summarized in Table 4.1. u and s denote the fault identification efficiencies of unit testing and system testing, respectively; their values are taken from the survey report "Software Quality in 2010" by Capers Jones [26]. Mp denotes the fraction of modules unit tested; its value is taken from the study of Wilde [45]. We have generalized the framework so that it can be applied to any organization or software with its specific values of Cu, Cs, Cf, Mp, u and s. Our aim is to provide a benchmark to approximate the overall fault removal cost. It is clear from our framework that if a technique has a high number of false negatives and/or false positives, it results in a higher fault removal cost. When this approximated cost exceeds the cost of unit testing all the modules, we suggest testing all the modules at the unit level instead of using fault prediction (Equation 4.3).
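To make the decision rule concrete, the short usage example below (an illustration only, assuming Ci = 0 and the median values from Tables 4.1 and 4.2) applies the sketch from this section to the IBK result for CM1 reported in Table 3.2.

```python
# Illustrative use of the estimated_costs sketch with the median values from
# Tables 4.1 and 4.2, Mp = 0.5, and an assumed setup cost Ci = 0.
# Confusion-matrix counts are the IBK row for CM1 in Table 3.2;
# TM is taken as TP + TN + FP + FN = 498 and FM as TP + FN = 49.
e, t, ne = estimated_costs(tp=15, fp=26, fn=34, tm=498, fm=49,
                           c_i=0, c_u=2.5, c_s=6.2, c_f=27.0,
                           m_p=0.5, u=0.25, s=0.5)
print(round(ne, 2))  # ~0.69, i.e. below 1, so fault prediction pays off here
```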


    4.2 Experimental Study

In this section, we present an experimental study to investigate the usefulness of fault-prediction techniques using our cost evaluation framework. In our study, we applied five popular fault-prediction techniques [19][20][27][25][22] to 19 projects from the NASA MDP [1] and PROMISE [2] repositories. As these nineteen projects cover a significant range of the percentage of faulty modules (varying from 7 to 49 percent), they are sufficient for our investigation. We used the WEKA machine learning tool to perform all listed experiments.

    4.2.1 Experimental setup

We have used the NASA MDP [1] and PROMISE [2] datasets listed in Table 4.3 to evaluate the impact of a fault-prediction technique on the fault removal cost using our proposed framework (Ecost). The metrics in these datasets describe projects which vary in size as well as in complexity. The number of software metrics in these datasets varies from eight to forty. We further classify these datasets on the basis of the percentage of faulty modules present, as shown in Table 4.4.

To illustrate the effectiveness of our framework, we have used five well-known

    fault-prediction techniques. Our goal is to demonstrate the cost evaluation

    framework and suggest when to use fault prediction, rather than identifying

    the "best" fault-prediction technique. For this reason, the choice of fault-

    prediction technique is orthogonal with respect to the intended contribution.

    The fault-prediction techniques which we selected for our study are Random

    Forest, J48 (C4.5 decision tree), Neural Network, K-means Clustering and IBK

    (K-nearest neighbours). These algorithms represent a broad range of machine

learning techniques. All reported experiments utilized technique implementations from the well-known software package WEKA [53]. All performance measurements were generated by threefold cross-validation.
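The experiments in this thesis were run in WEKA; as a rough, library-agnostic illustration of the same workflow, the sketch below uses scikit-learn (an assumption, not the tool used in this work) to obtain the pooled confusion matrix of a classifier under threefold cross-validation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold

def pooled_confusion(X, y, clf, n_splits=3, seed=0):
    """Pool TP/TN/FP/FN counts over a stratified k-fold cross-validation."""
    tn = fp = fn = tp = 0
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in skf.split(X, y):
        clf.fit(X[train_idx], y[train_idx])
        pred = clf.predict(X[test_idx])
        a, b, c, d = confusion_matrix(y[test_idx], pred, labels=[0, 1]).ravel()
        tn, fp, fn, tp = tn + a, fp + b, fn + c, tp + d
    return tp, tn, fp, fn

# X: array of module metrics, y: 1 for faulty and 0 for non-faulty modules,
# loaded from a dataset such as CM1. For example:
# tp, tn, fp, fn = pooled_confusion(X, y, RandomForestClassifier(random_state=0))
```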


Table 4.3: Projects used from the NASA [1] and PROMISE [2] data repositories

    Project Faulty (%) Number of Modules

    pc1 6.94 1109

    ar1 7.44 121

    nw1 7.69 403

    kc3 9.34 458

    cm1 9.84 498

    pc3 10.24 1563

    Arc 11.54 234

    pc4 12.21 1458

    kc1 15.46 2109

    ar4 18.69 107

    jm1 19.35 10885

    kc2 20.5 522

    camel1.6 21.91 858

    ant1.6 26.21 351

    ant1.7 27.79 493

    mc2 32.3 161

    jedit 3.2 33.09 272

    lucene2.0 46.67 195

    jedit 4.0 m 48.9 274

    Table 4.4: Categorization of projects based on the fraction of faulty modules

    Category Faults (%