![Page 1: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/1.jpg)
Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects
Erika Camargo and Ochimizu Koichiro
Japan Institute of Science and Technology
ESEM 2009
![Page 2: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/2.jpg)
Contents
1. Abstract
2. Background
3. Problem Analysis
4. Case Study
5. Results
6. Conclusion and Future Work
![Page 3: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/3.jpg)
Abstract
Challenge: to make logistic regression (LR) models that use design-complexity metrics able to predict fault-prone object-oriented classes across software projects.
First attempted solution: simple log data transformations
[Figure: logistic curve of P(y=1), the probability of a fault-prone class, plotted against X, a design-complexity metric]
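The curve on this slide has the form of a univariate logistic model, which can be sketched as follows. The coefficient values `B0` and `B1` are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
import math

def fault_probability(x, b0, b1):
    """Univariate logistic model: P(y=1 | x) = 1 / (1 + exp(-(b0 + b1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))

# Hypothetical coefficients (assumption, not taken from the slides).
B0, B1 = -3.0, 0.2

for metric_value in (0, 15, 30):
    p = fault_probability(metric_value, B0, B1)
    print(f"X = {metric_value:2d} -> P(fault-prone) = {p:.3f}")
```

With these illustrative coefficients the probability crosses 0.5 where b0 + b1 * X = 0, which is the usual cut-off for labelling a class fault-prone.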
![Page 4: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/4.jpg)
Background
• Some design-complexity metrics have been shown to be good predictors of fault-prone classes in LR models
• Among these metrics are the Chidamber & Kemerer (CK) metrics
– 80th and 20th percentiles of the distributions can be used to determine high and low values
– Their thresholds cannot be determined before their use and should be derived and used locally
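Deriving local thresholds from the 20th and 80th percentiles might look like the sketch below. The function name `local_thresholds` and the sample WMC values are hypothetical, and the interpolation method is an assumption, since the slides do not say how the percentiles were computed.

```python
import statistics

def local_thresholds(values):
    """Return (low, high) cut-offs at the 20th and 80th percentiles of one project."""
    # quantiles(n=5) yields the four cut points at 20%, 40%, 60% and 80%.
    cuts = statistics.quantiles(values, n=5, method="inclusive")
    return cuts[0], cuts[3]

# Hypothetical WMC values for the classes of a single project.
wmc = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
low, high = local_thresholds(wmc)
print(f"low values: <= {low}, high values: >= {high}")
```

Because the cut-offs are computed from each project's own distribution, the same metric value can count as high in one project and unremarkable in another.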
![Page 5: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/5.jpg)
Problem Analysis
Can an LR model built with this kind of metric work well on different software projects?
[Figure: P(y=1) against X = number of methods for a small-size and a large-size software project, with the least faulty and most faulty regions marked]
![Page 6: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/6.jpg)
Case Study
1. Data analysis of 7 different projects and application of simple log data transformations.
2. Construction of 3 univariate LR models using a large open source project (1st release of the Mylyn system, with 638 Java classes).
– Independent variables: CK-CBO, CK-RFC, CK-WMC (one per model)
– Dependent variable: defects (from Bugzilla and CVS)
3. Testing of these models on 2 other, smaller projects (with 11 and 13 Java classes).
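Steps 2 and 3 can be sketched as fitting a univariate LR model on one project and counting correct classifications on another. This is a minimal illustration under stated assumptions: the data are made up, the fit uses plain gradient ascent rather than whatever statistical package the authors used, and the 0.5 cut-off is an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_univariate_lr(xs, ys, lr=0.5, epochs=5000):
    """Fit P(y=1|x) = sigmoid(b0 + b1*x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(xs)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = y - sigmoid(b0 + b1 * x)  # residual drives both gradients
            g0 += err
            g1 += err * x
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1

def predict(x, b0, b1, cutoff=0.5):
    """Label a class fault-prone (1) when its predicted probability reaches the cut-off."""
    return 1 if sigmoid(b0 + b1 * x) >= cutoff else 0

# Hypothetical training project: one metric value per class, 1 = fault-prone.
train_x = [0.3, 0.5, 0.6, 0.8, 0.9, 1.0, 1.1, 1.3, 1.4, 1.6]
train_y = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
b0, b1 = fit_univariate_lr(train_x, train_y)

# Hypothetical second, smaller project used only for testing.
test_x = [0.4, 0.7, 1.2, 1.5]
test_y = [0, 0, 1, 1]
correct = sum(predict(x, b0, b1) == y for x, y in zip(test_x, test_y))
print(f"correctly classified: {correct} of {len(test_x)}")
```

Each of the three models in the study would use a single metric (CBO, RFC or WMC) as `x`, which is what makes them univariate.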
![Page 7: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/7.jpg)
Challenge: outliers produce biased regression estimates and reduce the predictive power of regression models
BNS: Banking system (2006) *
CRS: Cruise control system (2005) *
ECS: E-commerce system (2006) *
ELCS: Elevator control system (2003) *
FACS: Factory automation system (2005) *
GMF: Graphical Modeling Framework **
MYL: Mylyn system **
(*) Systems developed by students of JAIST, described in: Hassan Gomaa, Designing Concurrent, Distributed, and Real-Time Applications with UML, Addison-Wesley Object Technology Series, July 2000.
(**) Eclipse project
![Page 8: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/8.jpg)
The RFC data of BNS is more spread out than the data of MYL
![Page 9: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/9.jpg)
![Page 10: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/10.jpg)
Case Study
Solution: simple data transformation using "Log10"
Example:
• The number of outliers is lower
• The data spread is more uniform
LCBO = Log10(CBO + 1)
LTCBO = Log10(CBO + 1) + dm
where dm is the difference between the CBO medias of the Mylyn system and of the system whose data is being transformed
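The two transformations can be sketched as below. Several points are assumptions made for illustration: "medias" is read here as the median (pass `center=statistics.mean` if the mean is intended), dm is taken as reference minus project, and it is computed on the raw CBO values as the slide words it. The CBO values are hypothetical.

```python
import math
import statistics

def log_transform(values):
    """LCBO = log10(CBO + 1), applied to every class."""
    return [math.log10(v + 1) for v in values]

def center_difference(reference, project, center=statistics.median):
    """dm: difference between the central CBO value of the reference system
    and that of the project being transformed (assumed direction and scale)."""
    return center(reference) - center(project)

def shifted_log_transform(values, dm):
    """LTCBO = log10(CBO + 1) + dm."""
    return [lv + dm for lv in log_transform(values)]

# Hypothetical CBO values; the real Mylyn data is not reproduced here.
mylyn_cbo = [0, 2, 3, 5, 8, 13, 40]
project_cbo = [1, 2, 4, 6, 9]

dm = center_difference(mylyn_cbo, project_cbo)
lcbo = log_transform(project_cbo)
ltcbo = shifted_log_transform(project_cbo, dm)
print(dm, lcbo[-1], ltcbo[-1])
```

Adding 1 before taking the logarithm keeps classes with CBO = 0 defined, since Log10(0) does not exist.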
![Page 11: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/11.jpg)
Results
Effects of the log data transformations:
• Elimination of a great number of outliers
• Overall goodness of fit of the 3 models is better
• Discrimination (Most Faulty/Least Faulty)
– All models discriminate well between the Most Faulty and Least Faulty classes of the Mylyn system
– What about using different projects?
![Page 12: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/12.jpg)
Results
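BANKING SYSTEM

| Group | Model | Correct classification (raw data) | Correct classification (log-transformed data) | Effect |
|---|---|---|---|---|
| MF (6 classes) | CBO | 2 | 5 | ↑ |
| MF (6 classes) | RFC | 5 | 5 | = |
| MF (6 classes) | WMC | 6 | 6 | = |
| LF (5 classes) | CBO | 5 | 5 | = |
| LF (5 classes) | RFC | 3 | 3 | = |
| LF (5 classes) | WMC | 4 | 4 | = |
| Both (11 classes) | CBO | 7 | 10 | ↑ |
| Both (11 classes) | RFC | 8 | 8 | = |
| Both (11 classes) | WMC | 10 | 10 | = |

MF: Most Faulty; LF: Least Faulty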
![Page 13: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/13.jpg)
Results
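E-COMMERCE SYSTEM

| Group | Model | Correct classification (raw data) | Correct classification (log-transformed data) | Effect |
|---|---|---|---|---|
| MF (9 classes) | CBO | 3 | 7 | ↑ |
| MF (9 classes) | RFC | 9 | 8 | ↓ |
| MF (9 classes) | WMC | 7 | 6 | ↓ |
| LF (4 classes) | CBO | 4 | 4 | = |
| LF (4 classes) | RFC | 0 | 3 | ↑ |
| LF (4 classes) | WMC | 0 | 4 | ↑ |
| Both (13 classes) | CBO | 7 | 11 | ↑ |
| Both (13 classes) | RFC | 9 | 11 | ↑ |
| Both (13 classes) | WMC | 7 | 10 | ↑ |

MF: Most Faulty; LF: Least Faulty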
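The overall counts from the banking and e-commerce results can be turned into classification rates for easier comparison. The sketch below only restates the "Both" rows of the two tables; the names `RESULTS` and `rates` are introduced here for illustration.

```python
# (total classes, {model: (correct with raw data, correct with log-transformed data)})
RESULTS = {
    "Banking": (11, {"CBO": (7, 10), "RFC": (8, 8), "WMC": (10, 10)}),
    "E-commerce": (13, {"CBO": (7, 11), "RFC": (9, 11), "WMC": (7, 10)}),
}

def rates(total, raw, log_tx):
    """Return the fraction of classes classified correctly before and after the transformation."""
    return raw / total, log_tx / total

for system, (total, models) in RESULTS.items():
    for model, (raw, log_tx) in models.items():
        raw_rate, log_rate = rates(total, raw, log_tx)
        print(f"{system:10s} {model}: {raw_rate:.0%} -> {log_rate:.0%}")
```

With only 11 and 13 classes per test project, a single class moves the rate by roughly 8 to 9 percentage points, so the differences should be read with that granularity in mind.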
![Page 14: Towards Logistic Regression Models for Predicting Fault-prone Code across Software Projects Erika Camargo and Ochimizu Koichiro Japan Institute of Science.](https://reader030.fdocuments.net/reader030/viewer/2022032723/56649d0a5503460f949dd091/html5/thumbnails/14.jpg)
Conclusions and Future work
• CK-CBO, CK-RFC and CK-WMC can have different distributions in different projects
• Simple log transformations seem to improve the prediction ability of LR models, especially when the project measures are not as spread out as those used in the construction of the model.
• Future work: further data exploration and study of data transformations