Survey on Software Defect Prediction (PhD Qualifying Examination Presentation)


Survey on Software Defect Prediction

- PhD Qualifying Examination -

July 3, 2014 Jaechang Nam

Department of Computer Science and Engineering HKUST

Outline

•  Background
•  Software Defect Prediction Approaches
  –  Simple metric and defect estimation models
  –  Complexity metrics and Fitting models
  –  Prediction models
  –  Just-In-Time Prediction Models
  –  Practical Prediction Models and Applications
  –  History Metrics from Software Repositories
  –  Cross-Project Defect Prediction and Feasibility

•  Summary and Challenging Issues

2

Motivation

•  General question of software defect prediction
  –  Can we identify defect-prone entities (source code files, binaries, modules, changes, ...) in advance?
    •  # of defects
    •  buggy or clean
•  Why?
  –  Quality assurance for large software (Akiyama@IFIP ’71)
  –  Effective resource allocation
    •  Testing (Menzies@TSE`07)
    •  Code review (Rahman@FSE’11)

3

Ground Assumption

•  The more complex the software, the more defect-prone it is

4

Two Focuses on Defect Prediction

•  How complex are the software and its development process?
  –  Metrics
•  How can we predict whether the software has defects?
  –  Models built on the metrics

5

Prediction Performance Goal

•  Recall vs. Precision

•  Strong predictor criteria
  –  70% recall and 25% false positive rate (Menzies@TSE`07)
  –  Precision, recall, accuracy ≥ 75% (Zimmermann@FSE`09)

6

Outline

•  Background
•  Software Defect Prediction Approaches
  –  Simple metric and defect estimation models
  –  Complexity metrics and Fitting models
  –  Prediction models
  –  Just-In-Time Prediction Models
  –  Practical Prediction Models and Applications
  –  History Metrics from Software Repositories
  –  Cross-Project Defect Prediction and Feasibility

•  Summary and Challenging Issues

7

Defect Prediction Approaches

[Timeline figure (1970s-2010s), rows Metrics / Models / Others; Metrics: LOC; Models: Simple Model]

Identifying Defect-prone Entities

•  Akiyama's equation (Akiyama@IFIP`71)
  –  # of defects = 4.86 + 0.018 * LOC (LOC = lines of code)
    •  roughly 23 defects per 1 KLOC
    •  derived from actual systems
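A minimal Python sketch of this fitted equation (constants taken from the slide above):

```python
# A minimal sketch of Akiyama's fitted equation (constants from the slide above).
def akiyama_estimated_defects(loc: int) -> float:
    """Estimate the number of defects in a module from its lines of code (LOC)."""
    return 4.86 + 0.018 * loc

# For 1 KLOC this gives 4.86 + 0.018 * 1000 = 22.86, i.e., roughly 23 defects.
print(akiyama_estimated_defects(1000))
```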

•  Limitation
  –  LOC alone is not enough to capture software complexity

9

Defect Prediction Approaches

[Timeline figure (1970s-2010s); Metrics: LOC, Cyclomatic Metric, Halstead Metrics; Models: Simple Model, Fitting Model]

Complexity Metrics and Fitting Models

•  Cyclomatic complexity metric (McCabe`76)
  –  "Logical complexity" of a program, represented by its control flow graph
  –  V(G) = #edges - #nodes + 2
•  Halstead complexity metrics (Halstead`77)
  –  Metrics based on the # of operators and operands
  –  Volume = N * log2(n)
  –  # of defects = Volume / 3000
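A small Python sketch of the two measures as stated above; the edge/node counts and operator/operand counts are assumed to be given, since extracting them from source code is outside the scope of this slide:

```python
import math

# Sketch of the two complexity measures named above, using the formulas from the
# slide. The counts (CFG edges/nodes, operators/operands) are assumed to be given.

def cyclomatic_complexity(num_edges: int, num_nodes: int) -> int:
    """McCabe: V(G) = #edges - #nodes + 2, computed on the control flow graph."""
    return num_edges - num_nodes + 2

def halstead_volume(total_operators_and_operands: int,
                    distinct_operators_and_operands: int) -> float:
    """Halstead: Volume = N * log2(n), with N = total and n = distinct operators/operands."""
    return total_operators_and_operands * math.log2(distinct_operators_and_operands)

def halstead_estimated_defects(volume: float) -> float:
    """Halstead's defect estimate: # of defects = Volume / 3000."""
    return volume / 3000.0
```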

11

Complexity Metrics and Fitting Models

•  Limitation
  –  Do not capture the complexity (amount) of change.
  –  Most studies in the 1970s and early 1980s built fitting models rather than prediction models.
    •  Correlation analysis between metrics and # of defects, typically with linear regression models
    •  The models were not validated on new entities (modules).

12

Defect Prediction Approaches

[Timeline figure (1970s-2010s); Metrics: LOC, Cyclomatic Metric, Halstead Metrics, Process Metrics; Models: Simple Model, Fitting Model, Prediction Model (Regression), Prediction Model (Classification)]

Regression Model

•  Shen et al.'s empirical study (Shen@TSE`85)
  –  Linear regression model
  –  Validated on actual new modules
  –  Metrics
    •  Halstead metrics, # of conditional statements
    •  Process metrics
      –  Delta of complexity metrics between two successive system versions
  –  Measure: between actual and predicted # of defects on new modules
    •  MRE (mean magnitude of relative error): average of |D - D'| / D over all modules
      –  D: actual # of defects, D': predicted # of defects
    •  MRE = 0.48
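A short Python sketch of MRE as defined above (modules with zero actual defects would need special handling; that detail is not covered in the slide):

```python
def mean_magnitude_of_relative_error(actual, predicted):
    """MRE: average of |D - D'| / D over all modules (D = actual, D' = predicted defects).

    Modules with D = 0 would need special handling; that case is ignored here.
    """
    return sum(abs(d - d_hat) / d for d, d_hat in zip(actual, predicted)) / len(actual)
```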

14

Classification Model

•  Discriminant analysis by Munson et al. (Munson@TSE`92)
•  Logistic regression
•  High-risk vs. low-risk modules
•  Metrics
  –  Halstead and cyclomatic complexity metrics
•  Measures
  –  Type I error: false positive rate
  –  Type II error: false negative rate
•  Result
  –  Accuracy: 92% (6 misclassifications out of 78 modules)
  –  Precision: 85%
  –  Recall: 73%
  –  F-measure: 88%
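The slide does not give the underlying confusion matrix, but the measures relate as in the following Python sketch (the counts tp, fp, tn, fn are hypothetical inputs):

```python
def classification_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Relate the measures used above to the counts of a confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # recall = 1 - Type II error
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "type_I_error": fp / (fp + tn),   # false positive rate
        "type_II_error": fn / (fn + tp),  # false negative rate
    }
```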

15


Defect Prediction Process (Based on Machine Learning)

16

[Pipeline figure: software archives → instances with metrics (features) and labels → training instances (after preprocessing) → build a classification/regression model → the model predicts the unknown labels of new instances]
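A minimal sketch of this process, assuming scikit-learn is available and that instances have already been generated from the software archives; the feature values below are purely illustrative:

```python
from sklearn.linear_model import LogisticRegression

# Training instances: metric values (features) plus labels (1 = buggy, 0 = clean).
# The feature values are purely illustrative, e.g., [LOC, # of past changes].
X_train = [[250, 4], [90, 1], [410, 7], [30, 0]]
y_train = [1, 0, 1, 0]

model = LogisticRegression().fit(X_train, y_train)  # "build a model"

# New instances have metric values but unknown ("?") labels; the model predicts them.
X_new = [[300, 5], [40, 1]]
print(model.predict(X_new))
```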

Defect Prediction (Based on Machine Learning)

•  Limitations
  –  Limited resources for process metrics
    •  Error fixes in the unit testing phase were conducted informally by individual developers, so no error information was available for this phase. (Shen@TSE`85)
  –  Existing metrics were not enough to capture the complexity of object-oriented (OO) programs.
  –  Helpful for quality assurance teams, but not for individual developers

17

Defect Prediction Approaches

[Timeline figure (1970s-2010s); Metrics: LOC, Cyclomatic Metric, Halstead Metrics, CK Metrics, Process Metrics, History Metrics; Models: Simple Model, Fitting Model, Prediction Model (Regression), Prediction Model (Classification), Just-In-Time Prediction Model, Practical Model and Applications]

Defect Prediction Approaches

[Same timeline figure as on the previous slide]

Risk Prediction of Software Changes (Mockus@BLTJ`00)

•  Logistic regression
•  Change metrics
  –  LOC added/deleted/modified
  –  Diffusion of the change
  –  Developer experience
•  Result
  –  Both the false positive and false negative rates were 20% in the best case

20

Risk Prediction of Software Changes (Mockus@BLTJ`00)

•  Advantage
  –  Showed a model that is feasible in practice
•  Limitation
  –  Predictions were made only 3 times per week
    •  Not fully just-in-time
  –  Validated on one commercial system (the 5ESS switching system software)

21

BugCache (Kim@ICSE`07)

•  Maintain defect-prone entities in a cache
•  Approach [cache-update figure omitted in this transcript]
•  Result
  –  The top 10% of files account for 73-95% of defects on 7 systems

22

BugCache (Kim@ICSE`07)

•  Advantages
  –  The cache can be updated quickly and at low cost (cf. static models based on machine learning).
  –  Just-in-time: always available whenever QA teams want the list of defect-prone entities
•  Limitations
  –  The cache is not reusable for other software projects.
  –  Designed for QA teams
    •  Applicable only at certain points in time, after a batch of changes (e.g., the end of a sprint)
    •  Still of limited use for individual developers during the development phase

23

Change Classification (Kim@TSE`08)

•  Classification model based on SVM
•  About 11,500 features
  –  Change metadata such as changed LOC and change count
  –  Complexity metrics
  –  Text features from change log messages, source code, and file names
•  Results
  –  78% accuracy and 60% recall on average across 12 open-source projects

24

Change Classification (Kim@TSE`08)

•  Limitations
  –  Heavy model (11,500 features)
  –  Not validated on commercial software products

25

Follow-up Studies

•  Studies addressing these limitations
  –  "Reducing Features to Improve Code Change-Based Bug Prediction" (Shivaji@TSE`13)
    •  With less than 10% of all features, buggy F-measure improves by 21%.
  –  "Software Change Classification using Hunk Metrics" (Ferzund@ICSM`09)
    •  27 hunk-level metrics for change classification
    •  81% accuracy, 77% buggy hunk precision, and 67% buggy hunk recall
  –  "A large-scale empirical study of just-in-time quality assurance" (Kamei@TSE`13)
    •  14 process metrics (mostly from Mockus`00)
    •  68% accuracy, 64% recall on 11 open-source and commercial projects
  –  "An Empirical Study of Just-In-Time Defect Prediction Using Cross-Project Models" (Fukushima@MSR`14)
    •  Median AUC: 0.72

26

Challenges of JIT model

•  Practical validation is difficult
  –  Only 10-fold cross-validation in the current literature
  –  No validation in a realistic scenario
    •  e.g., online machine learning
•  Reviewing a huge change is still difficult
  –  Calls for fine-grained prediction within a change
    •  e.g., line-level prediction

27

Next Steps of Defect Prediction

[Timeline figure (1980s-2020s) highlighting next steps: Online Learning JIT Model and Fine-grained Prediction]

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Defect Prediction in Industry

•  "Predicting the location and number of faults in large software systems" (Ostrand@TSE`05)
  –  Two industrial systems
  –  Recall: 86%
  –  The 20% most fault-prone modules account for 62% of faults

30

Case Study for Practical Model

•  "Does Bug Prediction Support Human Developers? Findings From a Google Case Study" (Lewis@ICSE`13)
  –  No identifiable change in developer behavior after the defect prediction model was deployed
•  Required characteristics, but very challenging to provide
  –  Actionable messages / obvious reasoning

31

Next Steps of Defect Prediction

[Timeline figure (1980s-2020s) highlighting the next step: Actionable Defect Prediction]

Evaluation Measure for Practical Model

•  Measure prediction performance based on code review effort

•  AUCEC (Area Under Cost Effectiveness Curve)

[Cost-effectiveness curve figure: x-axis is the percent of LOC inspected, y-axis is the percent of bugs found; two models M1 and M2 are compared, with axis ticks at 10%, 50%, and 100%]

Rahman@FSE`11, BugCache for inspections: Hit or miss?
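A hedged Python sketch of AUCEC under the usual reading of the curve: files are ranked by predicted risk, and the percentage of bugs found is integrated over the percentage of LOC inspected (the (risk, loc, bugs) triples are assumed inputs):

```python
def aucec(files):
    """Area under the cost-effectiveness curve for a list of (risk, loc, bugs) triples.

    Files are inspected in decreasing order of predicted risk; the '% of bugs found'
    curve is integrated over '% of LOC inspected' with the trapezoidal rule.
    Assumes at least one bug and a positive total LOC.
    """
    ranked = sorted(files, key=lambda f: f[0], reverse=True)
    total_loc = sum(f[1] for f in ranked)
    total_bugs = sum(f[2] for f in ranked)
    area, loc_seen, bugs_seen = 0.0, 0, 0
    for _, loc, bugs in ranked:
        x0, y0 = loc_seen / total_loc, bugs_seen / total_bugs
        loc_seen += loc
        bugs_seen += bugs
        x1, y1 = loc_seen / total_loc, bugs_seen / total_bugs
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoid between consecutive points
    return area
```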

Practical Application

•  What else can we do with defect prediction models?
  –  Test case selection for regression testing (Engstrom@ICST`10)
  –  Prioritizing warnings from FindBugs (Rahman@ICSE`14)

34

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Representative OO Metrics

Metric | Description
WMC    | Weighted Methods per Class (# of methods)
DIT    | Depth of Inheritance Tree (# of ancestor classes)
NOC    | Number of Children
CBO    | Coupling Between Objects (# of coupled classes)
RFC    | Response For a Class (WMC + # of methods called by the class)
LCOM   | Lack of Cohesion in Methods (# of "connected components")

36

•  CK metrics (Chidamber & Kemerer@TSE`94)
•  Prediction performance of CK vs. code metrics (Basili@TSE`96)
  –  F-measure: 70% vs. 60%

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Representative History Metrics

38

Name                             | # of metrics | Metric source | Citation
Relative code change churn       | 8            | SW Repo.*     | Nagappan@ICSE`05
Change                           | 17           | SW Repo.      | Moser@ICSE`08
Change Entropy                   | 1            | SW Repo.      | Hassan@ICSE`09
Code metric churn, Code Entropy  | 2            | SW Repo.      | D'Ambros@MSR`10
Popularity                       | 5            | Email archive | Bacchelli@FASE`10
Ownership                        | 4            | SW Repo.      | Bird@FSE`11
Micro Interaction Metrics (MIM)  | 56           | Mylyn         | Lee@FSE`11

* SW Repo. = version control system + issue tracking system

Representative History Metrics

•  Advantage
  –  Better prediction performance than code metrics

39

[Bar chart: performance improvement (%) of all metrics vs. code complexity metrics, y-axis 0-60%, for Moser`08 (F-measure), Hassan`09 (F-measure), D'Ambros`10 (absolute prediction error), Bacchelli`10, Bird`11, and Lee`11 (Spearman correlation). *Bird`11's results compare two metrics vs. code metrics; no comparison data in Nagappan`05.]

History Metrics

•  Limitations
  –  History metrics do not capture particular program characteristics such as developer social networks, component networks, and anti-patterns.
  –  Not applicable to new projects or projects lacking historical data

40

Defect Prediction Approaches

[Timeline roadmap figure; now also includes Other Metrics, Cross-Project Prediction, Universal Model, and Cross-Project Feasibility]

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Other Metrics

43

Name                      | # of metrics | Metric source                   | Citation
Component network         | 28           | Binaries (Windows Server 2003)  | Zimmermann@ICSE`08
Developer-Module network  | 9            | SW Repo.* + Binaries            | Pinzger@FSE`08
Developer social network  | 4            | SW Repo.                        | Meneely@FSE`08
Anti-pattern              | 4            | SW Repo. + Design patterns      | Taba@ICSM`13

* SW Repo. = version control system + issue tracking system

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Defect Prediction for New Software Projects

•  Universal Defect Prediction Model

•  Cross-Project Defect Prediction

45

Universal Defect Prediction Model (Zhang@MSR`14)

•  Context-aware rank transformation
  –  Transform metric values into ranks from 1 to 10, consistently across all projects
•  Model built from 1,398 projects collected from SourceForge and Google Code
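A simplified sketch of the rank transformation idea only, not the full context-aware scheme from the paper: a raw metric value is mapped to a level from 1 to 10 according to where it falls in a reference distribution:

```python
import math

def rank_transform(value, reference_values):
    """Map a raw metric value to a rank level from 1 to 10.

    The level is the decile of the reference distribution the value falls into;
    the context-aware grouping of projects from the paper is not modeled here.
    """
    frac = sum(1 for v in reference_values if v <= value) / len(reference_values)
    return min(10, max(1, math.ceil(frac * 10)))
```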

46

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Cross-Project Defect Prediction (CPDP)

•  For a new project or a project lacking historical data

48

[Figure: a model is trained on the labeled instances of Project A and then used to predict the unknown labels of Project B]

Only 2% of the 622 cross-project prediction combinations worked. (Zimmermann@FSE`09)

Transfer Learning (TL)

[Figure: in traditional machine learning, a separate learning system is trained for each task; in transfer learning, knowledge from a source learning system is transferred to the target learning system]

Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis

CPDP

50

•  Adopting transfer learning

Transfer learning  | Metric Compensation     | NN Filter                     | TNB                      | TCA+
Preprocessing      | N/A                     | Feature selection, log-filter | Log-filter               | Normalization
Machine learner    | C4.5                    | Naive Bayes                   | TNB                      | Logistic Regression
# of subjects      | 2                       | 10                            | 10                       | 8
# of predictions   | 2                       | 10                            | 10                       | 26
Avg. F-measure     | 0.67 (W: 0.79, C: 0.58) | 0.35 (W: 0.37, C: 0.26)       | 0.39 (NN: 0.35, C: 0.33) | 0.46 (W: 0.46, C: 0.36)
Citation           | Watanabe@PROMISE`08     | Turhan@ESEJ`09                | Ma@IST`12                | Nam@ICSE`13

* NN = nearest neighbor, W = within, C = cross

Metric Compensation (Watanabe@PROMISE`08)

•  Key idea
  –  new target metric value = target metric value * (average source metric value / average target metric value)

51

[Figure: the target data is rescaled, metric by metric, so that it looks like the source data ("Let me transform like source!")]
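A small Python sketch of this compensation, applied column-wise to the target rows (metrics with a zero target mean would need special handling):

```python
def compensate(target_rows, source_rows):
    """Rescale each target metric by the ratio of source mean to target mean.

    Rows are instances, columns are metrics; target metrics with a zero mean
    are not handled in this sketch.
    """
    n_metrics = len(target_rows[0])
    src_mean = [sum(r[j] for r in source_rows) / len(source_rows) for j in range(n_metrics)]
    tgt_mean = [sum(r[j] for r in target_rows) / len(target_rows) for j in range(n_metrics)]
    return [[row[j] * src_mean[j] / tgt_mean[j] for j in range(n_metrics)]
            for row in target_rows]
```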

Metric Compensation (cont.) (Watanabe@PROMISE`08)

52

(Recap: the transfer learning comparison table above, Metric Compensation column.)

NN filter (Turhan@ESEJ`09)

•  Key idea: nearest neighbor filter
  –  Select the 10 nearest source instances for each target instance

53

[Figure: source instances that resemble the target instances ("Hey, you look like me! Could you be my model?") are selected as the new training data]
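A sketch of the filter in Python, using Euclidean distance on the metric values; whether duplicate selections are kept is a detail glossed over here (a set union is used):

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_filter(source_rows, target_rows, k=10):
    """Keep, for every target instance, its k nearest source instances.

    The union of the selected source instances becomes the filtered training set.
    """
    selected = set()
    for t in target_rows:
        nearest = sorted(range(len(source_rows)),
                         key=lambda i: euclidean(source_rows[i], t))[:k]
        selected.update(nearest)
    return [source_rows[i] for i in sorted(selected)]
```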

NN filter (cont.) (Turhan@ESEJ`09)

54

(Recap: the transfer learning comparison table above, NN Filter column.)

Transfer Naive Bayes (Ma@IST`12)

•  Key idea

55

[Figure: source instances similar to the target data ("Hey, you look like me!") receive more weight when the Naive Bayes model is built ("Please consider me more important than other instances")]

Transfer Naive Bayes (cont.) (Ma@IST`12)

•  Transfer Naive Bayes

– New prior probability

– New conditional probability

56

Transfer Naive Bayes (cont.) (Ma@IST`12)

•  How to find source instances similar to the target
  –  A similarity score
  –  A weight value

57

              | F1 | F2 | F3 | F4 | Score (s_i)
Max of target | 7  | 3  | 2  | 5  | -
src. inst. 1  | 5  | 4  | 2  | 2  | 3
src. inst. 2  | 0  | 2  | 5  | 9  | 1
Min of target | 1  | 2  | 0  | 1  | -

k = # of features, s_i = score of source instance i
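A sketch of the similarity score only; the weight formula that turns the score into an instance weight is in Ma et al.'s paper and is not reproduced here:

```python
def similarity_score(source_instance, target_min, target_max):
    """Count how many of the k feature values fall inside the target's [min, max] range."""
    return sum(1 for v, lo, hi in zip(source_instance, target_min, target_max) if lo <= v <= hi)

# Values from the table above: target max = [7, 3, 2, 5], target min = [1, 2, 0, 1].
print(similarity_score([5, 4, 2, 2], [1, 2, 0, 1], [7, 3, 2, 5]))  # 3
print(similarity_score([0, 2, 5, 9], [1, 2, 0, 1], [7, 3, 2, 5]))  # 1
```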

Transfer Naive Bayes (cont.) (Ma@IST`12)

58

(Recap: the transfer learning comparison table above, TNB column.)

TCA+ (Nam@ICSE`13)

•  Key idea – TCA (Transfer Component Analysis)

59

[Figure: the source and target data differ ("Oops, we are different! Let's meet in another world!"); TCA maps both into a new space, producing a new source and a new target]

Transfer Component Analysis (cont.)

•  Feature extraction approach
  –  Dimensionality reduction
  –  Projection
    •  Map the original data into a lower-dimensional feature space

60

[Figures (slides 61-62, repeating the bullets above): the original data in a 2-dimensional feature space is projected onto a 1-dimensional feature space]

TCA (cont.)

[Figures (slides 63-64): scatter plots of source and target domain data before and after applying TCA; from Pan et al.@TNN`10, Domain Adaptation via Transfer Component Analysis]

TCA+ (Nam@ICSE`13)

65

[Figure: with TCA alone, the mapped source and target are "still a bit different"; TCA+ adds a normalization step ("Normalize us together!") before applying TCA]

Normalization Options

•  NoN: No normalization applied

•  N1: Min-max normalization (max=1, min=0)

•  N2: Z-score normalization (mean=0, std=1)

•  N3: Z-score normalization only using source mean and standard deviation

•  N4: Z-score normalization only using target mean and standard deviation
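A small Python sketch of these options, applied to a single feature column (N3 and N4 differ only in whether the source's or the target's mean and standard deviation are passed in; degenerate columns with zero spread are not handled):

```python
def min_max(col):
    """N1: rescale a feature column to [0, 1]."""
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) for v in col]

def z_score(col, mean=None, std=None):
    """N2 when mean/std come from the column itself; N3/N4 when they come from
    the source or target dataset only."""
    if mean is None:
        mean = sum(col) / len(col)
    if std is None:
        std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - mean) / std for v in col]
```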


Preliminary Results using TCA

[Bar chart (F-measure, 0 to 0.8): Baseline vs. NoN, N1, N2, N3, and N4 for Project A → Project B and Project B → Project A. *Baseline: cross-project defect prediction without TCA and normalization]

Preliminary Results using TCA

[Same bar chart as above]

Prediction performance of TCA varies according to different normalization options!

TCA+: Decision Rules

•  Find a suitable normalization option for TCA
•  Steps
  –  #1: Characterize a dataset
  –  #2: Measure the similarity between the source and target datasets
  –  #3: Apply decision rules

69

TCA+: #1. Characterize a Dataset

70

[Figure: each dataset (A and B) is viewed as a set of instances connected by pairwise distances d_{i,j}]

DIST = {d_{i,j} : 1 ≤ i < j ≤ n}, the set of all pairwise distances between the n instances of a dataset

TCA+: #2. Measure Similarity between Source and Target

•  Minimum (min) and maximum (max) values of DIST
•  Mean and standard deviation (std) of DIST
•  The number of instances
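A Python sketch that computes DIST and the five characteristics listed above for one dataset (rows are instances, given as equal-length feature vectors):

```python
import math

def dist_characteristics(rows):
    """Summarize DIST = {d_ij : 1 <= i < j <= n} for one dataset (needs >= 2 rows).

    Uses Euclidean distance (math.dist, Python 3.8+) between instance vectors.
    """
    dist = [math.dist(rows[i], rows[j])
            for i in range(len(rows)) for j in range(i + 1, len(rows))]
    mean = sum(dist) / len(dist)
    std = (sum((d - mean) ** 2 for d in dist) / len(dist)) ** 0.5
    return {"min": min(dist), "max": max(dist), "mean": mean,
            "std": std, "num_instances": len(rows)}
```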

71

TCA+: #3. Decision Rules

•  Rule #1
  –  Mean and std are the same → NoN
•  Rule #2
  –  Max and min are different → N1 (max=1, min=0)
•  Rules #3 and #4
  –  Std and # of instances are different → N3 or N4 (source/target mean=0, std=1)
•  Rule #5
  –  Default → N2 (mean=0, std=1)
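A literal Python rendering of these rules; the paper uses degrees of similarity rather than exact equality, so the comparisons below are a simplification (the inputs are assumed to be the characteristic dictionaries computed in the previous sketch):

```python
def choose_normalization(src, tgt):
    """Pick a normalization option from the dataset characteristics (see previous sketch)."""
    if src["mean"] == tgt["mean"] and src["std"] == tgt["std"]:
        return "NoN"   # Rule #1
    if src["max"] != tgt["max"] and src["min"] != tgt["min"]:
        return "N1"    # Rule #2
    if src["std"] != tgt["std"] and src["num_instances"] != tgt["num_instances"]:
        return "N3 or N4"  # Rules #3 and #4: depends on whose statistics are used
    return "N2"        # Rule #5 (default)
```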

72

TCA+ (cont.) (Nam@ICSE`13)

73

(Recap: the transfer learning comparison table above, TCA+ column.)

Current CPDP using TL

•  Advantages
  –  Prediction performance comparable to within-project prediction models
  –  Benefits from state-of-the-art TL approaches
•  Limitation
  –  The performance of some cross-prediction pairs is still poor (negative transfer).

74

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Feasibility Evaluation for CPDP

•  Solution for negative transfer
  –  A decision tree built from project characteristic metrics (Zimmermann@FSE`09)
    •  e.g., programming language, # of developers, etc.

76

Follow-up Studies

•  "An investigation on the feasibility of cross-project defect prediction" (He@ASEJ`12)
  –  A decision tree built from distributional characteristics of a dataset
    •  e.g., mean, skewness, peakedness, etc.

77

Feasibility for CPDP

•  Challenges in current studies
  –  The decision trees were not evaluated properly.
    •  Just fitting models
  –  Low target prediction coverage
    •  Only 5 out of 34 target projects were feasible for cross-prediction (He@ASEJ`12)

78

Next Steps of Defect Prediction

[Timeline figure (1980s-2020s) highlighting the next step: Cross-Prediction Feasibility Model]

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Cross-prediction Model

•  Common challenge
  –  Current cross-prediction models are limited to datasets with the same set of metrics (the same feature space).
  –  Not applicable to projects with different feature spaces (i.e., different domains)
    •  NASA dataset: Halstead, LOC
    •  Apache dataset: LOC, Cyclomatic, CK metrics

81

Next Steps of Defect Prediction

[Timeline figure (1980s-2020s) highlighting the next step: Cross-Domain Prediction; the roadmap now also shows Data Privacy and Personalized Model]

Other Topics

83

Defect Prediction Approaches

[Timeline roadmap figure; now also includes Data Privacy and Personalized Model]

Other Topics

•  Privacy issues with defect datasets
  –  MORPH (Peters@ICSE`12)
    •  Mutates defect datasets while keeping prediction accuracy
    •  Can accelerate cross-project defect prediction with industrial datasets
•  Personalized defect prediction model (Jiang@ASE`13)
  –  "Different developers have different coding styles, commit frequencies, and experience levels, all of which cause different defect patterns."
  –  Results
    •  Average F-measure: 0.62 (personalized models) vs. 0.59 (non-personalized models)

85

Outline

•  Background
•  Software Defect Prediction Approaches
  –  Simple metric and defect estimation models
  –  Complexity metrics and Fitting models
  –  Prediction models
  –  Just-In-Time Prediction Models
  –  Practical Prediction Models and Applications
  –  History Metrics from Software Repositories
  –  Cross-Project Defect Prediction and Feasibility

•  Summary and Challenging Issues

86

Defect Prediction Approaches

[Timeline roadmap figure repeated; same items as above]

Next Steps of Defect Prediction

[Timeline figure (1980s-2020s) summarizing the next steps: Online Learning JIT Model, Actionable Defect Prediction, Cross-Prediction Feasibility Model, Cross-Domain Prediction, and Fine-grained Prediction]

Thank you!

89