Python Meetup Talk 21072009

Mining Software Repositories: Improving software. Pere Urbón Bayes, Data Management Group, Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya. July 2009.

Transcript of Python Meetup Talk 21072009

Page 1: Python Meetup Talk 21072009

Introduction | Data Mining

And the results are | A vision over the present and the future

Mining Software Repositories

Improving software

Pere Urbón Bayes

Data Management Group

Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

[email protected]

July 2009

Pere Urbón Bayes | Mining Software Repositories

Page 2: Python Meetup Talk 21072009

Index

Introduction

Data Mining

The results

The future

Page 3: Python Meetup Talk 21072009

Motivations | The Situation | Objectives

The problem

Companies need highly available and reliable software.

Low-quality software harms both clients and producers.

Unfortunately, avoiding defects is difficult.

Project leaders need to keep an eye on many projects at once.

Software engineers tend not to document software in depth.

The complexity of software projects is growing every day.

Page 4: Python Meetup Talk 21072009

The software development process

Page 5: Python Meetup Talk 21072009

Support tools

Tools used to support software development:

Version Control server.

Bug Tracker server.

Project Management server.

Life cycle management software.

...

These tools store a huge amount of information during the development process. Why not use this information to improve our software?

Page 6: Python Meetup Talk 21072009

Objective and Applications

Objectives:

Analyse the application of data mining techniques to the data stored in support tools, with the aim of improving software quality.

Develop an experimental prototype tool.

Applications:

Reduce the error rate.

Provide a previously unexploited source of documentation.

Provide a new source of support tools for IDEs.

Page 7: Python Meetup Talk 21072009

Introduction | The use of

Data mining

Type of database analysis that attempts to discover useful patterns or relationships in a group of data. The analysis uses advanced statistical methods, such as cluster analysis, and sometimes employs artificial intelligence or neural network techniques. A major goal of data mining is to discover previously unknown relationships among the data, especially when the data come from different databases.

Page 8: Python Meetup Talk 21072009

Methods

Types:

Traditional Data Mining (K-Means, C4.5, Bayesian Networks).

Relational Data Mining (ILP, Markov logic networks, relational Bayesian methods, dependency networks).

Categories:

Clusterers

Classifiers

Association rules

Network models
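
To make the "clusterer" category concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) on one-dimensional data. It is not code from the prototype: the function name `kmeans_1d`, the starting centroids and the sample values (imagined as revisions per file) are all assumptions made for illustration.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Group numbers around the given starting centroids (Lloyd's algorithm)."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged, assignments can no longer change
        centroids = new_centroids
    return centroids, clusters


# Invented sample: number of revisions for six hypothetical files.
revisions = [1, 2, 3, 20, 22, 25]
centers, groups = kmeans_1d(revisions, centroids=[1, 25])
print(groups)
```

A classifier differs in that it needs labelled examples; a clusterer like this one only groups similar items and leaves the interpretation of each group to the analyst.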

Page 10: Python Meetup Talk 21072009

Issue detection

LOC DefectAppearence2Month RevisionsAuthor

LineAddedIRLAdd ReportedI2Month Revision2Month

LineAddedIRLDel Revision3Month Releases

AlterType DefectAppearence3Month ReportedI1Month

AgeMonths ReportedI3Month ReportedIssues

RevisionAge Revision5Month ReportedI5Month

DefectReleases DefectAppearence5Month

Revision1Month DefectAppearance1Month

Question: Does this file contain an undetected error? The exact number of errors can be predicted too.
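
As a rough illustration of how per-file attributes like the ones above can answer that question, the sketch below labels a file by its closest historical example (1-nearest neighbour). This is a stand-in technique, not the algorithm used in the talk, which the slides do not name. The three attribute names come from the slide; every value, the `TRAINING` set and the function names are invented.

```python
from math import sqrt

# Attribute names taken from the slide; a small subset for brevity.
FEATURES = ("LOC", "Revision1Month", "ReportedIssues")

# Invented (attributes, has_defect) pairs standing in for mined history.
TRAINING = [
    ({"LOC": 120, "Revision1Month": 1, "ReportedIssues": 0}, False),
    ({"LOC": 150, "Revision1Month": 0, "ReportedIssues": 1}, False),
    ({"LOC": 900, "Revision1Month": 9, "ReportedIssues": 7}, True),
    ({"LOC": 1100, "Revision1Month": 12, "ReportedIssues": 5}, True),
]


def distance(a, b):
    """Euclidean distance over the selected attributes."""
    return sqrt(sum((a[f] - b[f]) ** 2 for f in FEATURES))


def predict_defect(file_metrics, training=TRAINING):
    """Return the label of the closest training example."""
    _, label = min(training, key=lambda example: distance(file_metrics, example[0]))
    return label


candidate = {"LOC": 1000, "Revision1Month": 10, "ReportedIssues": 6}
print(predict_defect(candidate))
```

The attributes are not scaled here, so LOC dominates the distance; a real experiment would normalise them or use a learner that is insensitive to scale.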

Page 11: Python Meetup Talk 21072009

Other types of objectives

Predict bugs related to a software developer.

Predict bugs in software components.

These techniques could be applied to different topics:

Software understanding.

Software evolution.

Software visualization.

Change propagation.

Impact analysis.

Software complexity.

Fault prediction.

Page 12: Python Meetup Talk 21072009

Error prediction | Software

Error prediction

                        Eclipse Project    Firefox Project
Correctly classified    94.65%             94.822%
Kappa statistic         0.893              0.8883
Precision               0.9465             0.9482
Recall                  0.945              0.949
AUC ROC                 0.9682             0.9808

                        Eclipse-Firefox    Firefox-Eclipse
Correctly classified    82.0065%           87.975%
Kappa statistic         0.5976             0.7595
Precision               0.818              0.894
Recall                  0.82               0.88
AUC ROC                 0.805              0.83
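
For readers unfamiliar with these measures, the sketch below shows how accuracy, precision, recall and Cohen's kappa follow from a binary confusion matrix. The counts are invented, and the function `binary_metrics` is hypothetical. The slide figures may have been computed differently (for example as averages weighted over both classes), so this only illustrates the definitions.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute common evaluation measures from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total          # share of correctly classified files
    precision = tp / (tp + fp)            # predicted defects that were real
    recall = tp / (tp + fn)               # real defects that were found
    # Cohen's kappa: agreement beyond what chance alone would produce.
    chance_yes = ((tp + fp) / total) * ((tp + fn) / total)
    chance_no = ((fn + tn) / total) * ((fp + tn) / total)
    expected = chance_yes + chance_no
    kappa = (accuracy - expected) / (1 - expected)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "kappa": kappa,
    }


# Invented counts: 100 files, half of them defective.
print(binary_metrics(tp=45, fp=5, fn=5, tn=45))
```

Kappa is the useful companion to accuracy here: a classifier that always answers "no defect" can score high accuracy on an unbalanced data set, but its kappa stays at zero.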

Page 13: Python Meetup Talk 21072009

The end App

Page 14: Python Meetup Talk 21072009

Software libraries | An envision

The Prototype

Software being used:

Programming: Java

Database: MySQL and MonetDB.

Data Mining: Weka 3.6 and Proximity 4.3

XML: Apache Xerces 2.9.1

SVN, CVS: svnkit 1.3.0 for SVN; for CVS, the netbeans-cvs lib and a custom RCS file parser.

Presentation: Prefuse Visualization Toolkit and Weka drawing facilities.

Page 15: Python Meetup Talk 21072009

Could Python give us the same?

Machine Learning:

Orange: as of 1.0, this library offers many interesting and useful methods for classification, regression and clustering. The most similar to Weka.

PyML: Only has classifier facilities.

Shogun: Only for Support Vector Machines.

RPy: An interface to R.

Databases:

The most important relational databases are available via DB-API.

ZODB: Zope Object Database.

Metakit: an embedded database without a clearly defined paradigm.

Pygr: Python graph database framework for bioinformatics.
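
As a small illustration of the DB-API point, the sketch below uses `sqlite3` from the standard library, which follows the same connect/cursor/execute pattern as other DB-API drivers. The table layout, the function `revisions_per_file` and the sample rows are assumptions made for the example. The prototype itself used MySQL and MonetDB, whose drivers are not shown here.

```python
import sqlite3


def revisions_per_file(rows):
    """Load (path, author, revision) rows and count revisions per file."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    try:
        cursor = conn.cursor()
        cursor.execute(
            "CREATE TABLE revision (path TEXT, author TEXT, number INTEGER)"
        )
        # Parameterised insert: the driver handles quoting.
        cursor.executemany("INSERT INTO revision VALUES (?, ?, ?)", rows)
        cursor.execute(
            "SELECT path, COUNT(*) FROM revision GROUP BY path ORDER BY path"
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Invented sample history.
history = [
    ("a.py", "alice", 1),
    ("a.py", "bob", 2),
    ("b.py", "alice", 3),
]
print(revisions_per_file(history))
```

Moving to another relational database mostly means changing the `connect` call and the parameter placeholder style, which varies between drivers.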

Page 16: Python Meetup Talk 21072009

Could Python give us the same?

Presentation:

Graph drawing: NetworkX, with nice results. There are some others, but they look incomplete.

GUI: PyQt, wxWindows, PyGTK. It's a matter of taste XD!

SVN, CVS processing:

SVN: pysvn - Python interface to Subversion.

CVS: It seems nothing is available.

GIT: PyGit - Pythonic git bindings targeted towards porcelains.

XML processing can be done using the built-in support and with any SAX or DOM parser.
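
A minimal sketch of the built-in XML route, using `xml.etree.ElementTree` from the standard library on a hand-written sample shaped like the output of `svn log --xml -v`. The authors, paths, messages and the helper names `parse_svn_log` and `revisions_per_path` are invented for illustration. A real run would read the log from the Subversion client or from pysvn instead.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of `svn log --xml -v`; contents are made up.
SAMPLE_LOG = """<log>
  <logentry revision="2">
    <author>alice</author>
    <date>2009-07-02T09:30:00.000000Z</date>
    <paths>
      <path action="M">/trunk/parser.py</path>
      <path action="A">/trunk/tests/test_parser.py</path>
    </paths>
    <msg>Fix tokenizer bug</msg>
  </logentry>
  <logentry revision="1">
    <author>bob</author>
    <date>2009-07-01T18:00:00.000000Z</date>
    <paths>
      <path action="A">/trunk/parser.py</path>
    </paths>
    <msg>Initial import</msg>
  </logentry>
</log>"""


def parse_svn_log(xml_text):
    """Turn the XML log into a list of plain dictionaries."""
    entries = []
    for entry in ET.fromstring(xml_text).findall("logentry"):
        entries.append({
            "revision": int(entry.get("revision")),
            "author": entry.findtext("author"),
            "paths": [path.text for path in entry.findall("paths/path")],
        })
    return entries


def revisions_per_path(entries):
    """Count how many revisions touched each path."""
    counts = {}
    for entry in entries:
        for path in entry["paths"]:
            counts[path] = counts.get(path, 0) + 1
    return counts


print(revisions_per_path(parse_svn_log(SAMPLE_LOG)))
```

Counts like these are the raw material for the per-file revision attributes used in the issue detection step.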

Page 17: Python Meetup Talk 21072009

The future

Known issues:

Data preprocessing performance.

Database performance, is the relational model valid?

Dynamic procedure addition.

The Todo List:

Develop new procedures for related topics, such as software visualization, change support, etc.

Develop more mature software. Python could help in some parts. This software must be easily extensible.

Improve the performance of the whole process.

Page 18: Python Meetup Talk 21072009

The end

Questions?

Pere Urbón Bayes

Data Management Group

Dept. Arquitectura de Computadors

Universitat Politècnica de Catalunya

[email protected]
