IPA Spring Days 2012
Linguistic diversity in
open-source development
Bogdan Vasilescu
Alexander Serebrenik
Mark van den Brand
Motivation
/ Mathematics and Computer Science PAGE 1 23-4-2012
Lisp
C
C++
Java
Python
Unix shell
HTML
XML
I "speak" Java
I "speak" Python
I "speak" Java and Python
…
If a developer leaves the project, what is the risk of not finding
replacement developers that speak Python?
Motivation
No risk, plenty of other Python
developers to choose from
What about now?
Linguistic diversity
• Greenberg (1956)
• compare geographic regions
• probability that two random individuals do not speak the
same language
Linguistic diversity
• Simple model
• everyone speaks exactly one language
• languages are independent
$$A = 1 - \sum_{\ell \in L} p_\ell^2, \qquad p_\ell = \frac{|S_\ell|}{|P|}$$
Probability that two random individuals do not speak the same language
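The simple model can be illustrated with a minimal Python sketch. It assumes that each entry of a list is the single language one individual speaks, so that the share of a language is its count divided by the population size. The function name and the toy community are hypothetical, not taken from the talk.

```python
from collections import Counter


def diversity_a(speakers):
    """Probability that two random individuals speak different languages.

    `speakers` holds exactly one language per individual (simple model).
    """
    counts = Counter(speakers)
    total = len(speakers)
    # 1 minus the probability that both draws hit the same language.
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


# Hypothetical community of four developers.
community = ["Java", "Java", "Python", "C"]
a_index = diversity_a(community)
```

A community in which everyone speaks the same language gets a diversity of 0; the value grows as speakers spread more evenly over more languages.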
Linguistic diversity
• Related-languages model
• everyone speaks exactly one language
• languages are similar
Probability that two random individuals do not speak the same language
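$$B = 1 - \sum_{\ell, m \in L} p_\ell \, p_m \, \mathrm{sim}(\ell, m)$$
$$\mathrm{sim}(\ell, \ell) = 1, \qquad 0 \le \mathrm{sim}(\ell, m) \le 1, \qquad p_\ell = \frac{|S_\ell|}{|P|}$$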
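As a rough sketch of the related-languages model, the function below weights every pair of languages by a similarity between 0 and 1. The similarity function `toy_sim` and its values are made up for illustration; the talk derives its similarities from StackOverflow data instead.

```python
from collections import Counter


def diversity_b(speakers, sim):
    """Diversity when languages may resemble each other.

    `speakers` holds one language per individual; `sim(l, m)` returns a
    value in [0, 1] with sim(l, l) == 1.
    """
    counts = Counter(speakers)
    total = len(speakers)
    share = {lang: n / total for lang, n in counts.items()}
    return 1.0 - sum(
        share[l] * share[m] * sim(l, m) for l in share for m in share
    )


def toy_sim(l, m):
    """Hypothetical similarity: C and C++ are half-similar, others unrelated."""
    if l == m:
        return 1.0
    return 0.5 if {l, m} == {"C", "C++"} else 0.0


community = ["C", "C", "C++", "Python"]
b_index = diversity_b(community, toy_sim)
```

With a similarity that is 1 for identical languages and 0 otherwise, this reduces to the simple model; any similarity between distinct languages lowers the diversity.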
Linguistic diversity
• Polyglot related-languages model
• everyone speaks at least one language
• languages are similar
Probability that two random individuals do not speak the same language
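$$F = 1 - \sum_{s, t \in \mathcal{P}(L)} \frac{p_s \, p_t}{|s| \, |t|} \sum_{\ell \in s,\, m \in t} \mathrm{sim}(\ell, m), \qquad p_s = \frac{|X_s|}{|P|}$$
$$L = \{A, B, C\}, \qquad \mathcal{P}(L) = \{A, B, C, AB, AC, BC, ABC\}$$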
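The polyglot model can be sketched in Python under one assumption that should be read with care: the similarity of two repertoires is taken here as the average pairwise similarity of their languages, i.e. each individual is imagined to pick one of their languages at random. All names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def nonempty_subsets(languages):
    """All non-empty language repertoires over `languages`."""
    langs = sorted(languages)
    return [
        frozenset(c)
        for r in range(1, len(langs) + 1)
        for c in combinations(langs, r)
    ]


def diversity_f(repertoires, sim):
    """Diversity when each individual speaks a non-empty set of languages.

    Assumption: two repertoires are compared by the average of sim(l, m)
    over all pairs of their languages.
    """
    counts = Counter(frozenset(r) for r in repertoires)
    total = sum(counts.values())
    result = 1.0
    # Repertoires nobody has contribute nothing, so only observed ones are visited.
    for s, n_s in counts.items():
        for t, n_t in counts.items():
            avg = sum(sim(l, m) for l in s for m in t) / (len(s) * len(t))
            result -= (n_s / total) * (n_t / total) * avg
    return result


def identity(l, m):
    return 1.0 if l == m else 0.0


subsets = nonempty_subsets({"A", "B", "C"})
f_index = diversity_f([{"A"}, {"A", "B"}], identity)
```

For three languages there are seven possible repertoires, which is why the number of terms grows quickly with the number of languages.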
Our risk measure
• Probability that two random individuals do not speak the
same language
• Risk of not finding developers that "speak" language k
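$$\mathrm{risk}(k) = 1 - \max_{s \in \mathcal{P}(L)} p_s \, \mathrm{sim}_k(s)$$
$$F = 1 - \sum_{s, t \in \mathcal{P}(L)} \frac{p_s \, p_t}{|s| \, |t|} \sum_{\ell \in s,\, m \in t} \mathrm{sim}(\ell, m)$$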
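A minimal Python sketch of such a risk measure follows, assuming the risk for a language k is one minus the best product of "share of developers with repertoire s" and "similarity of s to k". The helper `toy_sim_k`, which scores a repertoire by its most similar language, and the similarity values are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical pairwise similarities (left language, target language).
TOY_SIM = {("C", "Python"): 0.4, ("Java", "Python"): 0.6}


def toy_sim_k(s, k):
    """Assumed similarity of repertoire s to language k: its best match."""
    return max(1.0 if l == k else TOY_SIM.get((l, k), 0.0) for l in s)


def risk(k, repertoires, sim_k):
    """Risk of not finding developers that 'speak' language k."""
    counts = Counter(frozenset(r) for r in repertoires)
    total = sum(counts.values())
    # A large group with a similar repertoire drives the risk down.
    best = max((n / total) * sim_k(s, k) for s, n in counts.items())
    return 1.0 - best


developers = [{"C"}, {"C"}, {"C", "Java"}, {"Python"}]
python_risk = risk("Python", developers, toy_sim_k)
```

In this toy community the single Python developer still contributes the largest term, so the risk stays high; a larger group speaking a more similar language would lower it.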
StackOverflow.com
User tags
Similarity measure
• Reverend Gonzo: Java, C, C++, C#, Python,…
• Alexander Serebrenik: Prolog, SQL, C++,…
• Bogdan Vasilescu: Python,…
• Jon Skeet: C#, Java, ASP.net, XML,…
• … > 400,000
$$\mathrm{sim}_k(s) = \mathrm{conf}(s \Rightarrow k) = \frac{nBoth}{nLeft}$$
C
Java
• Association rule mining:
• “C => Java”
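The confidence of a rule such as "C => Java" is the fraction of users tagged with the left-hand side who are also tagged with the right-hand side. The sketch below computes it from a mapping of users to tag sets; the user names and tags are invented and do not reflect real StackOverflow profiles.

```python
def confidence(user_tags, left, right):
    """Confidence of the association rule `left => right`.

    `user_tags` maps a user to the set of language tags they are active in;
    `left` is a set of languages, `right` a single language.
    """
    left = frozenset(left)
    n_left = sum(1 for tags in user_tags.values() if left <= tags)
    n_both = sum(
        1 for tags in user_tags.values() if left <= tags and right in tags
    )
    # No user matches the left-hand side: treat the rule as unsupported.
    return n_both / n_left if n_left else 0.0


# Hypothetical users and tags.
user_tags = {
    "user1": {"Java", "C", "Python"},
    "user2": {"C", "SQL"},
    "user3": {"Python"},
    "user4": {"C", "Java"},
}
conf_c_java = confidence(user_tags, {"C"}, "Java")
conf_java_c = confidence(user_tags, {"Java"}, "C")
```

The measure is not symmetric: in this toy data every Java user also uses C, but only two of the three C users use Java.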
Similarity measure - results
• Assembly posts: 44
• Assembly + Java developers: > 1000
When in need of Java developers, ask the Assembly guys
Case study - Emacs
• 1985-2012: C, Emacs Lisp, C++, Java, Lisp, Python, M4, … (26)
Exotic languages
High/low risk
Case study - Emacs
C: spoken by half of the community
+ similar to other languages
low risk
Python: spoken very sporadically
+ similar to other languages
low risk
What is the risk of not finding developers
that speak Python?
Conclusions
• Risk measure
$$\mathrm{risk}(k) = 1 - \max_{s \in \mathcal{P}(L)} p_s \, \mathrm{sim}_k(s)$$
• Similarity measure (StackOverflow)
• “C => Java”
$$\mathrm{sim}_k(s) = \mathrm{conf}(s \Rightarrow k) = \frac{nBoth}{nLeft}$$
Low risk Depends on similarity