Iman Keivanloo, Christopher Forbes, Aseel Hmood, Mostafa Erfani, Christopher Neal, George Peristerakis, Juergen Rilling
MSR 2012 June 2
A Linked Data Platform for Mining Software Repositories
SeCold is a “Wikipedia of source code related facts” produced from over 1,000,000 open source projects.
SeCold main objectives:
(1) establish the fundamental framework (2) perform data analysis
SeCold 2.0 is an ongoing research project (currently in its second year)
Software Analysis Story
Issue Tracker, Source Code, Mailing List, Versioning Control, …
Some output
Some analysis
Software Analysis Story
Issue Tracker, Source Code, Mailing List, Versioning Control, …
Some output
Raw Data -> Extraction Process -> Structured Internal Data Representation -> Analysis Process -> Structured Output
[Source Code Analysis: A Roadmap, FOSE’07]
Sharing
Issue Tracker, Source Code, Mailing List, Versioning Control, …
[Source code analysis: a roadmap, FOSE’07] [Fostering synergies: how … ICSE-SUITE’10]
Integration
Issue Tracker, Source Code, Mailing List, Versioning Control, …
[Diagram: each data source has its own pipeline of Internal Data -> Analysis Process -> Output]
Alignment
Inter-dataset Analysis
How to align?
The Challenge
Same as!
Dataset A Dataset B
History of Data Sharing
TXT
CSV
DATABASES
XML
LINKED DATA
Linked Data is about being …
Online: a URL for each fact!
Standard: uses HTTP, XML, HTML and …
Open: usable by both humans and machines
Not static: data and schema are editable
Graph-based: a graph of triples vs. XML (a tree)
Integrating: integrated/linked on the fly
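To make the "graph of triples" and "linked on the fly" points concrete, here is a minimal sketch in plain Python. It assumes two independently published datasets plus one `owl:sameAs` link between them. The `http://example.org/` namespace, the predicate names, and the `merge` helper are hypothetical and are not part of SeCold.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
EX = "http://example.org/"  # hypothetical namespace, for illustration only

# Each fact is one (subject, predicate, object) triple.
dataset_a = {(EX + "a/Foo.java", EX + "hasLicense", "GPL-2.0")}
dataset_b = {(EX + "b/Foo.java", EX + "hasIssue", EX + "b/issue/42")}
# A third party states that both URLs denote the same file.
links = {(EX + "a/Foo.java", SAME_AS, EX + "b/Foo.java")}


def canonical_ids(triples):
    """Return a function mapping each node to one representative of its sameAs group."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for s, p, o in triples:
        if p == SAME_AS:
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # Pick the lexicographically smaller URL as the representative.
                parent[max(root_s, root_o)] = min(root_s, root_o)
    return find


def merge(*datasets):
    """Union the graphs and rewrite aligned nodes to their representative."""
    all_triples = set().union(*datasets)
    find = canonical_ids(all_triples)
    return {(find(s), p, find(o)) for s, p, o in all_triples if p != SAME_AS}


merged = merge(dataset_a, dataset_b, links)
```

After the merge, the license fact and the issue fact hang off the same node, so a single query can span both datasets without either publisher changing its data.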
SeCold Project: A Linked Data Platform for Mining Software Repositories
1- Vocabulary Set (aka Schema, Data Model, Ontology)
Source Code Ecosystem Ontology Family (SECON): SOCON, VERON, METON, ISSUEON, LICENSON, CLON
SeCold Project
2- URL/ID Generation Schema
A URL for each fact (e.g. a variable definition statement)
http://aseg.cs.concordia.ca/secold/page/type/java/DatasetChangeInfo
Integration Challenge
There are several ways to generate URLs (e.g. random)
REPRODUCIBLE IDENTIFIERS
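One way to read "reproducible identifiers" is that the URL is derived deterministically from the attributes that identify a fact, so two extractors working independently arrive at the same URL. The sketch below illustrates that idea only. The base namespace, the choice of SHA-1, and the attribute list are assumptions for illustration, not SeCold's actual scheme.

```python
import hashlib
from urllib.parse import quote

BASE = "http://example.org/secold/resource/"  # hypothetical namespace, not SeCold's


def reproducible_url(kind, *parts):
    """Build the same URL every time for the same identifying attributes.

    `kind` and every entry of `parts` must be strings.
    """
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    key = "\x1f".join(parts)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return f"{BASE}{quote(kind, safe='')}/{digest}"


# Two independent runs over the same statement agree on its URL.
first = reproducible_url("statement", "projectX", "1.0", "src/Foo.java", "12")
second = reproducible_url("statement", "projectX", "1.0", "src/Foo.java", "12")
other = reproducible_url("statement", "projectX", "1.0", "src/Foo.java", "13")
```

A random identifier would make `first` and `second` differ, which is exactly what breaks later integration. A content-derived one lets datasets link up without any coordination step.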
SeCold Project
3- Baseline Data Publication
General Information (~2,000,000 triples)
Source Code (~2,000,000,000 triples)
Issue Tracker (~30,000,000 triples)
Version Control (~700,000,000 triples)
~1 MILLION PROJECTS
Linked Data Cloud (LOD)
[Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/, as of Sept 2011]
Publication
Life Science
Government
Media
Circle size: triple count
Very large: >1B
Large: 1B-10M
Medium: 10M-500k
Small: 500k-10k
Very small: <10k
SeCold: among the 9 largest datasets in the cloud
SeCold
secold.org
Showcase #1 (Similar Code Search)
Showcase #2, Part 1 (Copyright violation detection)
Source Code of 25K projects
Ninka [A sentence-matching …, ASE’10] -> License per file -> Upload
SeClone [SeClone … ICPC’11 & WCRE’11] -> Line-level fingerprints, Clones (Type 1, 2 and 3) -> Upload
Showcase #2, Part 2 (Copyright violation detection)
Copyright violation detection:
SELECT ?fileA ?fileB
WHERE {
  ?fileA testxi ?fingerprint .
  ?fileB testxi ?fingerprint .
  ?fileA hasLicense ?la .
  ?fileB hasLicense ?lb .
  FILTER (?la != ?lb)
}
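The join behind this query can be sketched in plain Python to show what it computes: files that share a line-level fingerprint but carry different licenses. The function name, the input shapes, and the sample data are hypothetical. Unlike the SPARQL query, which returns each match in both orders, this sketch reports each unordered pair once.

```python
from collections import defaultdict
from itertools import combinations


def license_conflicts(fingerprints, licenses):
    """Return pairs of files that share a fingerprint but differ in license.

    fingerprints: iterable of (file, fingerprint) pairs
    licenses:     dict mapping file -> license name
    """
    files_by_fingerprint = defaultdict(set)
    for file, fingerprint in fingerprints:
        files_by_fingerprint[fingerprint].add(file)

    conflicts = set()
    for files in files_by_fingerprint.values():
        for a, b in combinations(sorted(files), 2):
            # Files without a known license are skipped, as in the query.
            if a in licenses and b in licenses and licenses[a] != licenses[b]:
                conflicts.add((a, b))
    return conflicts


# Hypothetical sample data.
fingerprints = [
    ("a/Util.java", "fp1"), ("b/Util.java", "fp1"),
    ("c/Main.java", "fp2"), ("d/Main.java", "fp2"),
]
licenses = {
    "a/Util.java": "GPL-2.0", "b/Util.java": "Apache-2.0",
    "c/Main.java": "MIT", "d/Main.java": "MIT",
}
result = license_conflicts(fingerprints, licenses)
```

Here only the `Util.java` pair is flagged. The two `Main.java` copies share code but also share a license, so they are not candidates.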
Showcase #3 (Statistical Analysis)
No License: 42%
GPL 2: 17%
All Rights Reserved: 14%
Apache 2: 9%
LGPL 2.1: 12%
BSD: 3%
Mozilla PL 1.0: 1%
MIT: 0%
Apache 1: 0%
Nokos: 0%
Mozilla PL 1.1: 0%
PHP: 0%
Sleepycat: 0%
Artistic: 0%
Shareware: 0%
Patented: 0%
No License: 46%
All Rights Reserved: 13%
GPL 2: 12%
Apache 2: 10%
LGPL 2.1: 9%
BSD: 3%
Mozilla PL 1.0: 3%
MIT: 1%
Apache 1: 1%
BSD: 0%
Mozilla PL 1.1: 0%
PHP: 0%
Sleepycat: 0%
Artistic: 0%
Nokos: 0%
Shareware: 0%
2009
2012