Improving maintenance efficiency at AstraZeneca through increased ...
Improving Software Maintenance using Unsupervised Machine Learning techniques
-
Upload
valerio-maggio -
Category
Technology
-
view
588 -
download
1
description
Transcript of Improving Software Maintenance using Unsupervised Machine Learning techniques
![Page 1: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/1.jpg)
Dottorato in Scienze Computazionali e Informatiche, XXV Ciclo
Ph.D. Candidate: Valerio Maggio
Thesis Advisors: Dr. Sergio Di MartinoDr. Anna Corazza
June 5th, 2013
UNSUPERVISEDMACHINE LEARNING FOR SOFTWARE MAINTENANCE
![Page 2: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/2.jpg)
THESISG O A L
UNSUPERVISEDMACHINE LEARNING FOR SOFTWARE MAINTENANCE
These solutions exploit (unsupervised) machine learning techniques to mine information from the source code
Define and experimentally evaluate solutions (techniques and prototype tools) for automatic software analysis to support
software maintenance activities.
![Page 3: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/3.jpg)
There exist different types of Software Maintenance.
SOFTWARE MAINTENANCE“A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)
![Page 4: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/4.jpg)
SOFTWARE MAINTENANCE“A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)
Software Maintenance represents the most expensive, time consuming and challenging
phase of the whole development process.
![Page 5: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/5.jpg)
SOFTWARE MAINTENANCE“A software system must be continually adapted during its overall life cycle or it progressively becomes less satisfactory.” (cit. Lehman’s First Law of Software Evolution)
Software Maintenance represents the most expensive, time consuming and challenging
phase of the whole development process.
Software Maintenance could account up to the 85-90% of the total software costs.
![Page 6: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/6.jpg)
ISSUESSOFTWARE MAINTENANCE
“Software Maintenance is about change!” (cit. S. Jarzabek)
Change Analysis
![Page 7: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/7.jpg)
ISSUESSOFTWARE MAINTENANCE
“Software Maintenance is about change!” (cit. S. Jarzabek)
Change AnalysisProgram
Comprehension
The documentation is usually scarce or not up to date!
![Page 8: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/8.jpg)
ISSUESSOFTWARE MAINTENANCE
The source code (usually) represents the most reliable source of information about the system
“Software Maintenance is about change!” (cit. S. Jarzabek)
Change AnalysisProgram
Comprehension
The documentation is usually scarce or not up to date!
![Page 9: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/9.jpg)
ISSUESSOFTWARE MAINTENANCE
The source code (usually) represents the most reliable source of information about the system
“Software Maintenance is about change!” (cit. S. Jarzabek)
Change AnalysisProgram
ComprehensionReverse Engineering
The documentation is usually scarce or not up to date!
![Page 10: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/10.jpg)
REVERSEE N G I N E
E R I N G
• Definition of tools and techniques to support maintenance activities• Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document
• Goal: To aid the comprehension of the system
![Page 11: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/11.jpg)
REVERSEE N G I N E
E R I N G
• Definition of tools and techniques to support maintenance activities• Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document
• Goal: To aid the comprehension of the system
STATIC ANALYSIS
DYNAMIC ANALYSIS
![Page 12: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/12.jpg)
REVERSEE N G I N E
E R I N G
• Definition of tools and techniques to support maintenance activities• Goal: Build higher-level software models in an automatic fashion gathering information from the source code or any other document
• Goal: To aid the comprehension of the system
STATIC ANALYSIS DYNAMIC
ANALYSIS
![Page 13: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/13.jpg)
MACHINEL E A R N
I N G
• Provides computational effective solutions to analyze large data sets
![Page 14: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/14.jpg)
MACHINEL E A R N
I N G
• Provides computational effective solutions to analyze large data sets
• Provides solutions that can be tailored to different tasks/domains
![Page 15: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/15.jpg)
MACHINEL E A R N
I N G
• Provides computational effective solutions to analyze large data sets
• Provides solutions that can be tailored to different tasks/domains
• Requires many efforts in:
• the definition of the relevant information best suited for the specific task/domain
![Page 16: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/16.jpg)
MACHINEL E A R N
I N G
• Provides computational effective solutions to analyze large data sets
• Provides solutions that can be tailored to different tasks/domains
• Requires many efforts in:
• the definition of the relevant information best suited for the specific task/domain
• the application of the learning algorithms to the considered data
![Page 17: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/17.jpg)
UN
SUPE
RVIS
ED L
EARN
ING
• Supervised Learning: • Learn from labelled samples
• Unsupervised Learning:• Learn (directly) from the data
Learn by examples
![Page 18: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/18.jpg)
UN
SUPE
RVIS
ED L
EARN
ING
• Supervised Learning: • Learn from labelled samples
• Unsupervised Learning:• Learn (directly) from the data
Learn by examples
![Page 19: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/19.jpg)
UN
SUPE
RVIS
ED L
EARN
ING
• Supervised Learning: • Learn from labelled samples
• Unsupervised Learning:• Learn (directly) from the data
Learn by examples
(+) No cost of labeling samples
(-) Trade-off imposed on the quality of the data
![Page 20: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/20.jpg)
THESISC O N T RI B U T I
O N S
&
![Page 21: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/21.jpg)
THESISC O N T RI B U T I
O N S
Contributions to three relevant and related open issues in Software Maintenance
&
![Page 22: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/22.jpg)
THESISC O N T RI B U T I
O N S
Contributions to three relevant and related open issues in Software Maintenance
1. SOFTWARE RE-MODULARIZATION
3. CLONE DETECTION
2. SOURCE CODE NORMALIZATION
&
![Page 23: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/23.jpg)
ISSUE#1
SOFTWARERE-MODULARIZATION
![Page 24: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/24.jpg)
Software Classes
UI Process Components
UI Components
Data Access Components
Data Helpers / Utilities
Security
Operational Management
Communications
Business Components
Application Facade
Buisiness Workflows
Messages InterfacesService Interfaces
Re-modularization provides a way to support software maintainers by automat ica l ly grouping together (clustering) “related” software classes
SOFTWARE RE-MODULARIZATION
PROBLEM
S T A T EM E N T
![Page 25: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/25.jpg)
Re-modularization provides a way to support software maintainers by automat ica l ly grouping together (clustering) “related” software classes
SOFTWARE RE-MODULARIZATION
External Systems
Service Consumers
Service
s
Service Interfaces
Messages Interfaces
Cross Cutting
Security
Operational M
anagement
Comm
unications
Data Data Access
ComponentsData Helpers /
Utilities
Pres
entatio
n UI Components
UI Process Components
Busin
ess
Application Facade
Buisiness Workflows
Business Components
Clusters of Software Classes
PROBLEM
S T A T EM E N T
![Page 26: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/26.jpg)
SOUR
CE
CO
DELE
XIC
AL
INFO
RMAT
ION
![Page 27: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/27.jpg)
SOUR
CE
CO
DELE
XIC
AL
INFO
RMAT
ION
![Page 28: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/28.jpg)
SOUR
CE
CO
DELE
XIC
AL
INFO
RMAT
ION
(IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related.
![Page 29: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/29.jpg)
SOUR
CE
CO
DELE
XIC
AL
INFO
RMAT
ION
(IDEA): Exploit the lexical information gathered from the source code to produce clusters of classes that are lexically related.
State of the Art: Information Retrieval (IR) based approaches.
![Page 30: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/30.jpg)
IR IN
DEXI
NG
![Page 31: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/31.jpg)
Implicit assumption: The “same” words are used whenever a particular concept occurs
IR IN
DEXI
NG
![Page 32: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/32.jpg)
1. Tokenization
2. Normalization draw, the, are, null, handl, box, r, rectangl, g, graphic, box, display, box, ...
Draws, the, are, NullHandle, box, r, Rectangle, g, Graphics, box, displayBox, ...
Implicit assumption: The “same” words are used whenever a particular concept occurs
IR IN
DEXI
NG
![Page 33: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/33.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
![Page 34: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/34.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
Class Names
![Page 35: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/35.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
Class Names
Attribute Names
![Page 36: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/36.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
Class Names
Attribute Names
Method Names
Method Names
![Page 37: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/37.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
Class Names
Attribute Names
Method Names
Parameter NamesMethod Names
Parameter Names
![Page 38: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/38.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
Class Names
Attribute Names
Method Names
Parameter Names
Comments
Comments
Method Names
Parameter Names
Comments
![Page 39: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/39.jpg)
SOUR
CE
CO
DE Z
ON
ESLE
XIC
AL
INFO
RMAT
ION
Source Code
Class Names
Attribute Names
Method Names
Parameter Names
Comments
Comments
Method Names
Parameter NamesSource Code
Comments
![Page 40: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/40.jpg)
ZON
E IN
DEXI
NG
![Page 41: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/41.jpg)
ZON
E IN
DEXI
NG
RQ1: Do terms in different Zones provide different contributions?
![Page 42: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/42.jpg)
1. Tokenization
2. Normalization
ZON
E IN
DEXI
NG
Draws, the, are, NullHandle, box, r, draw, g, Graphics, color, displayBox, ...
draw, the, are, null, handl, box, r, draw, g, graphic, color, display, box, ...
RQ1: Do terms in different Zones provide different contributions?
![Page 43: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/43.jpg)
SOUR
CE
CO
DE LEX
ICON JFreeChart JEdit
JUnit Xerces
![Page 44: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/44.jpg)
SOUR
CE
CO
DE L
EXIC
ON JFreeChart
• Good lexicon in every Zone!
![Page 45: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/45.jpg)
SOUR
CE
CO
DE L
EXIC
ON JEdit
• Very good in Comments • Very poor in Method and Parameter names
![Page 46: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/46.jpg)
SOUR
CE
CO
DE L
EXIC
ON JUnit
• Good in Class and Attribute names• No Comments at all
![Page 47: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/47.jpg)
SOUR
CE
CO
DE L
EXIC
ON Xerces
• Poor in Method, Parameter names and Comments• Good in Method Code and Variable names
![Page 48: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/48.jpg)
SOUR
CE
CO
DE L
EXIC
ON JFreeChart JEdit
JUnit XercesRQ2: How to automatically weight the different contributions ?
![Page 49: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/49.jpg)
CONTRIBUTION
SOFTWARE RE-MODULARIZATION
• We propose a source code Zone Indexing
Corazza A., Di Martino, S., Maggio, V., Scanniello, G.Investigating the use of lexical information for software system clustering(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.Combining machine learning and information retrieval techniques for software clustering(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0
![Page 50: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/50.jpg)
CONTRIBUTION
SOFTWARE RE-MODULARIZATION
• We propose a source code Zone Indexing
• An approach to automatically weight the contribution of different terms in Zones
• Maximum-Likelihood Estimation (MLE) Approach
Corazza A., Di Martino, S., Maggio, V., Scanniello, G.Investigating the use of lexical information for software system clustering(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.Combining machine learning and information retrieval techniques for software clustering(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0
![Page 51: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/51.jpg)
CONTRIBUTION
SOFTWARE RE-MODULARIZATION
• We propose a source code Zone Indexing
• An approach to automatically weight the contribution of different terms in Zones
• Maximum-Likelihood Estimation (MLE) Approach
• A variant of the K-medoid clustering algorithm
• Tailored to the software clustering task
Corazza A., Di Martino, S., Maggio, V., Scanniello, G.Investigating the use of lexical information for software system clustering(2011) Proceedings of the European Conference on Software Maintenance and Reengineering, CSMR, art. no. 5741257, pp. 35-44. ISSN: 15345351 ISBN: 978-076954343-7
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.Combining machine learning and information retrieval techniques for software clustering(2012) Communications in Computer and Information Science, 255 CCIS, pp. 42-60. ISSN: 18650929 ISBN: 978-364228032-0
![Page 52: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/52.jpg)
StackedVerticalBarRenderer
LineAndShapeRenderer
K-M
EDO
ID V
ARIA
NTArrayHolder
Plot
LinearPlotFitAlgorithm
LinearPlotFit
HorizontalNumberAxis
NumberAxisPropertyEditPanel
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
DateAxisPropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDatasetSampleXYDatasetThread
VerticalPlot
FitAlgorithm
![Page 53: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/53.jpg)
StackedVerticalBarRenderer
LineAndShapeRenderer
K-M
EDO
ID V
ARIA
NTArrayHolder
Plot
LinearPlotFitAlgorithm
LinearPlotFit
HorizontalNumberAxis
NumberAxisPropertyEditPanel
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
DateAxisPropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDatasetSampleXYDatasetThread
Non-extreme Distribution property
VerticalPlot
FitAlgorithm
![Page 54: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/54.jpg)
StackedVerticalBarRenderer
LineAndShapeRenderer
K-M
EDO
ID V
ARIA
NTArrayHolder
Plot
LinearPlotFitAlgorithm
LinearPlotFit
HorizontalNumberAxis
NumberAxisPropertyEditPanel
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
DateAxisPropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDatasetSampleXYDatasetThread
Non-extreme Distribution property
VerticalPlot
FitAlgorithm
![Page 55: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/55.jpg)
StackedVerticalBarRenderer
LineAndShapeRenderer
K-M
EDO
ID V
ARIA
NTArrayHolder
Plot
LinearPlotFitAlgorithm
LinearPlotFit
HorizontalNumberAxis
NumberAxisPropertyEditPanel
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
DateAxisPropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDatasetSampleXYDatasetThread
Non-extreme Distribution property
VerticalPlot
FitAlgorithm
ExtremeExtreme
![Page 56: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/56.jpg)
StackedVerticalBarRenderer
LineAndShapeRenderer
K-M
EDO
ID V
ARIA
NTArrayHolder
Plot
LinearPlotFitAlgorithm
LinearPlotFit
HorizontalNumberAxis
NumberAxisPropertyEditPanel
PlotFitAlgorithm
ChartFrame
GraphicsHolder
PropertyEditPanel
DateAxisPropertyEditPanel
StandardLegend
LegendItem
FigureLegend
VerticalNumberAxis
VerticalXYDataPlot
DataPlot
SampleXYDatasetSampleXYDatasetThread
Non-extreme Distribution property
VerticalPlot
FitAlgorithm
![Page 57: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/57.jpg)
AUTHORITATIVENESSRE
SULTS
Comparison of Clustering Results with an Authoritative Target Partition
Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces0
0.2
0.4
0.6
0.8
1.0
19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART
FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
![Page 58: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/58.jpg)
AUTHORITATIVENESSRE
SULTS
Comparison of Clustering Results with an Authoritative Target Partition
Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces0
0.2
0.4
0.6
0.8
1.0
19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART
FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
![Page 59: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/59.jpg)
AUTHORITATIVENESSRE
SULTS
Comparison of Clustering Results with an Authoritative Target Partition
Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces0
0.2
0.4
0.6
0.8
1.0
19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART
FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
![Page 60: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/60.jpg)
AUTHORITATIVENESSRE
SULTS
Comparison of Clustering Results with an Authoritative Target Partition
19 target Open Source Java Systems FLAT INDEXING (NO ZONES) = STATE OF THE ART
0
0.2
0.4
0.6
0.8
1.0
Ant Lucene Tomcat Azureus Hibernate ITextdotNet jEdit JFreeChart JFTP JHotDraw JRefactory JUnit LiferayPortal Pmd Synapse TigerEnvelopes Velocity Xalan Xerces
FLAT INDEXING (NO ZONES) ZONE INDEXING ZONE INDEXING + MLE WEIGHTS
![Page 61: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/61.jpg)
ISSUE#2
SOURCE CODENORMALIZATION
![Page 62: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/62.jpg)
SOUR
CE
CO
DELE
XIC
AL
INFO
RMAT
ION
Implicit assumption: The “same” words are used whenever a particular concept occurs
![Page 63: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/63.jpg)
SOUR
CE
CO
DELE
XIC
AL
INFO
RMAT
ION
Implicit assumption: The “same” words are used whenever a particular concept occurs
![Page 64: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/64.jpg)
1. Tokenization
2. Normalization1.5 Identifier Splitting
• Splitting algorithms based on naming conventions
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• NullHandle ==> Null | Handle
• displayBox ==> display | Box
draw, the, are, box, r, rectangl, g, graphic, color, ....
STAT
E O
F TH
E A
RT
TOO
LS
display, box
null, handl
![Page 65: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/65.jpg)
1. Tokenization
2. Normalization1.5 Identifier Splitting
• Splitting algorithms based on naming conventions
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• NullHandle ==> Null | Handle
• displayBox ==> display | Box
draw, the, are, box, r, rectangl, g, graphic, color, ....
STAT
E O
F TH
E A
RT
TOO
LS
display, box
null, handl
![Page 66: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/66.jpg)
1. Tokenization
2. Normalization1.5 Identifier Splitting
• Splitting algorithms based on naming conventions
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• NullHandle ==> Null | Handle
draw, the, are, box, r, rectangl, g, graphic, color, ....
STAT
E O
F TH
E A
RT
TOO
LS
display, box
null, handl
![Page 67: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/67.jpg)
1. Tokenization
2. Normalization1.5 Identifier Splitting
• Splitting algorithms based on naming conventions
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• NullHandle ==> Null | Handle
• displayBox ==> display | Box
draw, the, are, box, r, rectangl, g, graphic, color, ....
STAT
E O
F TH
E A
RT
TOO
LS
null, handl
display, box
![Page 68: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/68.jpg)
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
IDEN
TIFIE
RS S
PLIT
TIN
G
![Page 69: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/69.jpg)
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
IDEN
TIFIE
RS S
PLIT
TIN
G
![Page 70: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/70.jpg)
• camelCase Splitter: r’(?<=!^)([A-Z][a-z]+)’
• drawXORRect ==> drawXOR | Rect
• drawxorrect ==> NO SPLIT
Splitting algorithms based on naming conventions
are not robust enough
IDEN
TIFIE
RS S
PLIT
TIN
G
![Page 71: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/71.jpg)
Splitting algorithms based on naming conventions
are not robust enough
ABBR
EVIAT
IONS
EXP
AN
SIO
N
![Page 72: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/72.jpg)
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• r as for Rectangle
• rect as for Rectangle
ABBR
EVIAT
IONS
EXP
AN
SIO
N
![Page 73: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/73.jpg)
Splitting algorithms based on naming conventions
are not robust enough
• Heavy use of Abbreviations in the source code
• r as for Rectangle
• rect as for Rectangle
ABBR
EVIAT
IONS
EXP
AN
SIO
N
![Page 74: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/74.jpg)
1. Tokenization
CO
DE N
OR
MA
LIZA
TIO
N
2. Normalization1.5 Code Normalization
• AMAP (Hill and Pollock, 2008)
• SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• LINSEN (Corazza, Di Martino and Maggio, 2012)
draw, the, are, null, handl, box, rectangl, graphic, color, display, box,
rectangl,
draw, xor,rectangl
![Page 75: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/75.jpg)
1. Tokenization
CO
DE N
OR
MA
LIZA
TIO
N
2. Normalization1.5 Code Normalization
• AMAP (Hill and Pollock, 2008)
• SAMURAI (Enslen, et.al , 2011) • TIDIER (Guerrouj, et.al , 2011)
• GenTest+Normalize (Lawrie and Binkley, 2011)
• LINSEN (Corazza, Di Martino and Maggio, 2012)
draw, the, are, null, handl, box, rectangl, graphic, color, display, box,
rectangl,
draw,xor,rectangl
![Page 76: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/76.jpg)
CONTRIBUTION
SOURCE CODE NORMALIZATION
• LINSEN: novel technique for code normalization
• Able to both Split Identifiers and Expand possible occurring abbreviations
Corazza, A., Di Martino, S., Maggio, V.LINSEN: An efficient approach to split identifiers and expand abbreviations(2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3
![Page 77: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/77.jpg)
CONTRIBUTION
SOURCE CODE NORMALIZATION
• LINSEN: novel technique for code normalization
• Able to both Split Identifiers and Expand possible occurring abbreviations
• Based on an efficient String Matching technique: Baeza-Yates&Perlberg Algorithm (BYP)
• Exploit different Sources of Information to find the matching words
Corazza, A., Di Martino, S., Maggio, V.LINSEN: An efficient approach to split identifiers and expand abbreviations(2012) IEEE International Conference on Software Maintenance, ICSM, art. no. 6405277, pp. 233-242. ISBN: 978-146732312-3
![Page 78: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/78.jpg)
SKET
CH
OF
THE
ALG
OR
ITH
M
• NODES correspond to characters of the current identifier
Model: Weighted Directed GraphExample: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
![Page 79: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/79.jpg)
SKET
CH
OF
THE
ALG
OR
ITH
M
• ARCS corresponds to matchings between identifier substrings and dictionary words
• NODES correspond to characters of the current identifier
Model: Weighted Directed GraphExample: drawXORRect identifier
0
11
1
9
4
5
7
8
2 63
10
![Page 80: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/80.jpg)
SKET
CH
OF
THE
ALG
OR
ITH
M
• ARCS corresponds to matchings between identifier substrings and dictionary words
• Padding Arcs to ensure the Graph always connected
• NODES correspond to characters of the current identifier
Model: Weighted Directed GraphExample: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3
“w”,C-MAX “X”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
![Page 81: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/81.jpg)
SKET
CH
OF
THE
ALG
OR
ITH
M
• Every Arc is Labelled with the corresponding dictionary word
• Weights represent the “cost” of each matching
• Cost function [c(“word”)] favors longest words and domain-
related information
Model: Weighted Directed GraphExample: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3
“w”,C-MAX “X”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
![Page 82: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/82.jpg)
SKET
CH
OF
THE
ALG
OR
ITH
M
• The final Mapping Solution corresponds to the sequence of labels
in the path with the minimum cost (Djikstra Algorithm)
Model: Weighted Directed GraphExample: drawXORRect identifier
0
11
1
9
4
“r”,C-MAX
“d”,C-MAX
5
7
8
2
“R”,
C-MA
X
6“a”,C-MAX 3
“w”,C-MAX “X”,C-MAX
“O”,C-MAX
“e”,C-MAX10
“c”,C-MAX“t”,C-MAX
“raw”,c(“raw”)“draw”,c(“draw”)
“or”,c(“or”)
“xor”,c(“xor”)
“rectangle”,c(“rectangle”)
“R”,C-MAX
![Page 83: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/83.jpg)
SPLITTING
Accuracy Rates for the comparison withDTW (Madani et. al 2010)
0
0.25
0.5
0.75
1
JhotDraw 5.1 Lynx 2.8.5DTW LINSEN DTW LINSEN
DTW LINSEN
RESULTS
# 1
Accuracy Rates for the comparison with GenTest (Lawrie and Binkley, 2011)
0
0.25
0.5
0.75
1
which 2.20 a2ps 4.14
GenTest LINSEN
DTW O(n3)
![Page 84: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/84.jpg)
Accuracy Rates for the comparison withAMAP (Hill and Pollock, 2008)
0
0.25
0.5
0.75
1
CW DL OO AC PR SL TOTAL (Avg)
AMAPLINSEN
EXPANSIONRESULTS
# 2
CW: Combination WordsDL: Dropped LettersOO: Others
AC: AcronymsPR: PrefixSL: Single Letters
![Page 85: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/85.jpg)
ISSUE#3
CLONEDETECTION
![Page 86: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/86.jpg)
PROBLEM
S T A T EM E N T
CLONE DETECTION
![Page 87: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/87.jpg)
PROBLEM
S T A T EM E N T
CLONE DETECTION
Clones Textual Similarity
![Page 88: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/88.jpg)
PROBLEM
S T A T EM E N T
CLONE DETECTION
Clones Functional Similarity
![Page 89: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/89.jpg)
PROBLEM
S T A T EM E N T
CLONE DETECTION
Clones affect the reliability of the system!
Sneaky Bug!
![Page 90: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/90.jpg)
Duplix
Scorpio
PMD
CCFinder
Dup CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
STAT
E O
F TH
E A
RT
TOO
LS
![Page 91: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/91.jpg)
Duplix
Scorpio
PMD
CCFinder
Dup CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDDText Based Tools:Text is compared line by line
STAT
E O
F TH
E A
RT
TOO
LS
![Page 92: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/92.jpg)
Duplix
Scorpio
PMD
CCFinder
Dup CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Token Based Tools:Token sequences are compared to sequences
STAT
E O
F TH
E A
RT
TOO
LS
![Page 93: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/93.jpg)
Duplix
Scorpio
PMD
CCFinder
Dup CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDDSyntax Based Tools:Syntax subtrees are compared to each other
STAT
E O
F TH
E A
RT
TOO
LS
![Page 94: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/94.jpg)
Duplix
Scorpio
PMD
CCFinder
Dup CPD
Duplix
Shinobi
Clone Detective
Gemini
iClones
KClone
ConQAT
Deckard
Clone Digger
JCCD
CloneDr SimScan
CLICS
NiCAD
Simian
Duploc
Dude
SDD
Graph Based Tools:(sub) graphs are compared to each other
STAT
E O
F TH
E A
RT
TOO
LS
![Page 95: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/95.jpg)
CONTRIBUTION CLONE DETECTION
• Combining different sources of information to improve the effectiveness of the detection process
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8
Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press
![Page 96: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/96.jpg)
CONTRIBUTION CLONE DETECTION
• Combining different sources of information to improve the effectiveness of the detection process
• Structural and Lexical Information
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8
Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press
![Page 97: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/97.jpg)
CONTRIBUTION CLONE DETECTION
• Combining different sources of information to improve the effectiveness of the detection process
• Structural and Lexical Information
• Application of Kernel Methods to Source Code Structures
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8
Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press
![Page 98: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/98.jpg)
CONTRIBUTION CLONE DETECTION
• Combining different sources of information to improve the effectiveness of the detection process
• Structural and Lexical Information
• Application of Kernel Methods to Source Code Structures
• Typically applied in other field such as NLP or Bioinformatics
Corazza, A., Di Martino, S., Maggio, V., Scanniello, G.A Tree Kernel based approach for clone detection(2010) IEEE International Conference on Software Maintenance, ICSM, art. no. 5609715. ISBN: 978-142448629-8
Corazza, A., Di Martino, S., Maggio, V., Moschitti, A., Passerini, A., Scanniello, G., Silvestri F. Using Machine Learning and Information Retrieval Techniques to Improve Software Maintainability(2013) Communications in Computer and Information Science, In Press
![Page 99: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/99.jpg)
CODESTRUCTURES
KER
NEL
S FO
R ST
RUC
TURE
SComputation of the dot product between (Graph) Structures
K( ),
![Page 100: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/100.jpg)
CODESTRUCTURES
KER
NEL
S FO
R ST
RUC
TURE
S
Abstract Syntax Tree (AST) Tree structure representing the syntactic structure of the different instructions of a program (function)
Program Dependencies Graph (PDG)(Directed) Graph structure representing the relationship among the different statement of a program
Computation of the dot product between (Graph) Structures
K( ),
![Page 101: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/101.jpg)
CODESTRUCTURES
KER
NEL
S FO
R ST
RUC
TURE
S
Abstract Syntax Tree (AST) Tree structure representing the syntactic structure of the different instructions of a program (function)
Program Dependencies Graph (PDG)(Directed) Graph structure representing the relationship among the different statement of a program
Computation of the dot product between (Graph) Structures
K( ),
![Page 102: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/102.jpg)
CODEKE
RNEL
FO
R C
LONE
S
![Page 103: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/103.jpg)
<
x y = =
x +
x 1
y -
y 1
while
block
while
block
block
if
>
b a = =
a +
a 1
b -
b 1
>
b 0 =
c 3
CODE ASTKE
RNEL
FO
R C
LONE
S
![Page 104: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/104.jpg)
<
x y = =
x +
x 1
y -
y 1
while
block
while
block
block
if
>
b a = =
a +
a 1
b -
b 1
>
b 0 =
c 3
CODE AST AST KERNELKE
RNEL
FO
R C
LONE
S
<
block
while
= =
block
=
y -
=
x +
+
x 1
-
y 1
<
x y
>
b 0 =
c 3
if
block
>
b a
-
b 1
<block
while
+
a 1
=
b -
=
a +
![Page 105: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/105.jpg)
while
block<
x y
KERNELSFOR CODE
STRUCTURES:AST
KERN
EL F
EATU
RES
![Page 106: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/106.jpg)
while
block<
x y
KERNELSFOR CODE
STRUCTURES:AST
KERN
EL F
EATU
RES Instruction Class (IC)
i.e., LOOP, CALL,CONDITIONAL_STATEMENT
![Page 107: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/107.jpg)
while
block<
x y
KERNELSFOR CODE
STRUCTURES:AST
KERN
EL F
EATU
RES Instruction Class (IC)
i.e., LOOP, CALL,CONDITIONAL_STATEMENT
Instruction (I)i.e., FOR, IF, WHILE, RETURN
![Page 108: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/108.jpg)
while
block<
x y
KERNELSFOR CODE
STRUCTURES:AST
KERN
EL F
EATU
RES Instruction Class (IC)
i.e., LOOP, CALL,CONDITIONAL_STATEMENT
Instruction (I)i.e., FOR, IF, WHILE, RETURN
Context (C)i.e., Instruction Class of the closer statement node
![Page 109: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/109.jpg)
while
block<
x y
KERNELSFOR CODE
STRUCTURES:AST
KERN
EL F
EATU
RES Instruction Class (IC)
i.e., LOOP, CALL,CONDITIONAL_STATEMENT
Instruction (I)i.e., FOR, IF, WHILE, RETURN
Context (C)i.e., Instruction Class of the closer statement node
Lexemes (Ls)Lexical information gathered (recursively) from leaves
![Page 110: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/110.jpg)
while
block<
x y
KERNELSFOR CODE
STRUCTURES:AST
KERN
EL F
EATU
RES
IC = Conditional-ExprI = Less-operatorC = LoopLs= [x,y]
IC = LoopI = while-loopC = Function-BodyLs= [x, y]
Instruction Class (IC)i.e., LOOP, CALL,CONDITIONAL_STATEMENT
Instruction (I)i.e., FOR, IF, WHILE, RETURN
Context (C)i.e., Instruction Class of the closer statement node
Lexemes (Ls)Lexical information gathered (recursively) from leaves
IC = BlockI = while-bodyC = LoopLs= [ x ]
![Page 111: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/111.jpg)
CLONE DETECTION• Comparison with another (pure) AST-based clone detector• Comparison on a system with randomly seeded clones
0
0.25
0.5
0.75
1
Precision Recall F-measure
CloneDigger Tree Kernel Tool
RESULTS
Results refer to clones where code
fragments have been modified by adding/
removing or changing code statements
![Page 112: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/112.jpg)
CONCLUSIONS
![Page 113: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/113.jpg)
CONCLUSIONS
![Page 114: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/114.jpg)
CONCLUSIONS
![Page 115: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/115.jpg)
CONCLUSIONS
![Page 116: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/116.jpg)
CONCLUSIONS
![Page 117: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/117.jpg)
CONCLUSIONSFUTURE RESEARCH DIRECTIONS
• Re-modularization:
• Integrate code normalization algorithm (LINSEN) to lexical analysis
• Code Normalization:
• Investigate the contributions of different sources of information used for disambiguations
• Clone Detection:
• Combine Static (and Kernel methods) with Dynamic Analysis
![Page 118: Improving Software Maintenance using Unsupervised Machine Learning techniques](https://reader034.fdocuments.net/reader034/viewer/2022052315/5485b2a9b47959ce0c8b4ec7/html5/thumbnails/118.jpg)
THANK YOU
June 5th, 2013 - Ph.D. Defense
Valerio MaggioPh.D. Student, University of Naples “Federico II”