A Large Scale Study of Multiple Programming Languages and Code Quality
-
Upload
pavneet-singh-kochhar -
Category
Software
-
view
100 -
download
0
Transcript of A Large Scale Study of Multiple Programming Languages and Code Quality
A Large Scale Study of Multiple Programming Languages
and Code Quality
Pavneet S. Kochhar, Dinusha Wijedasa, and David LoSingapore Management University
{kochharps.2012,dwijedasa,davidlo}@smu.edu.sg
International Conference on Software Analysis, Evolution and Reengineering (SANER 2016)
2
Motivation
• Over 700 languages1 available• Developers often use multiple languages
– To improve performance, productivity and agility– Linux OS kernel mainly written in C Utilities & app. written in C++, Perl, Python
• Concerns– Increase in complexity – Proper interfaces between languages
1. https://en.wikipedia.org/wiki/List_of_programming_languages
3
Study Goal
What is the impact of the usage of multiple languages on bug proneness?
4
Outline
• Study Goal• Empirical Study Methodology
– Data Collection and Basic Statistics – Categorizing Bugs– Statistical Methods
• Research Questions• Findings• Conclusion and Future Work
5
Data Collection• We collect projects hosted on• Most popular projects in 17 languages
– C, C++, C#, Objective-C, Go, Java, JavaScript, CoffeeScript, TypeScript, Ruby, Php, Python, Perl, Clojure, Erlang, Haskell, Scala
• In total, we have 628 popular projects– Linux, Redis, Php-src, Elasticsearch– 85M SLOC, 134K developers, 3.17M commits
6
Basic StatisticsDistribution of GitHub Projects w.r.t Number of Language Usage
Language Usage Number of Projects
% of Projects
Written in a single language
297 47.29%
Written in multiple languages
331 52.71%
Total – 628 projects
7
Basic StatisticsProject DetailsPrimary
LanguageNumber
of Projects
Number of Associated Languages
Number of Developers
SLOC (KLOC)
Period
C 104 9 25,203 17,350
4/1999 - 10/2015
C++ 94 12 6,855 14,179 7/2000 - 10/2015
C# 46 8 4,818 8,848 6/2001 - 10/2015
Objective-C 38 4 2,319 378 7/2008 - 10/2015
Go 41 10 6,297 4,013 3/2008 - 10/2015
8
Basic StatisticsEvolution History Details
Primary Language
Total Commits BugFix CommitsNumber of Commits
Code Churn
(KLOC)
Bug Fixes Code Churn
(KLOC)C 780,153 229,411 258,822 42,736C++ 284,331 250,238 112,591 63,881C# 235,048 138,719 69,974 15,786Objective-C 30,048 16,617 7,561 1,031Go 117,282 33,134 32,371 4,486
9
Categorizing Bugs
• Categorize each bug fix commit – cause and impact
• Keyword matching– 10% of bug fix commits (926,789)– Semi-automated ways (Keywords & manual)
• Supervised classification– To classify remaining 90% of the commits– Randomly selected 270 commits & manually
classified to check precision & recall.
10
Statistical Method
• Negative Binomial Regression (NBR)• To handle multi-collinearity, we compute
variance inflation factors (VIFs)• Dependent variable - Number of bug fix
commits• Control variables - age, size (LOC),
number of developers and total number of commits.
11
Outline
• Study Goal• Empirical Study Methodology
–Data Collection and Basic Statistics –Categorizing Bugs–Statistical Methods
• Research Questions• Findings• Conclusion and Future Work
12
Research Questions
RQ1: Does the use of more languages correlate with higher bug proneness? RQ2: Are some languages more bug prone when they are used with other languages?
RQ3: What are the relationships between multi-language usage and bugs of various categories?
13
Findings
RQ1:Does the use of more languages correlate
with higher bug proneness?
14
RQ1: More Languages More Bugs
Coeff. Std. Error z-val Pr(> |z|)(Intercept) 4.718 0.089 53.19 0.000 ***
age 2.939e-04 3.794e-05 7.75 0.000 ***
size 3.390e-07 1.310e-07 2.59 0.010 **
developers 9.807e-04 1.107e-04 8.86 0.000 ***
commits 6.048e-05 5.177e-06 11.68 0.000 ***
num-langs 0.273 0.029 9.50 0.000 ***
***p < 0.001, **p < 0.01,*p < 0.05
15
RQ1: More Languages More Bugs
Number of languages has a significant impact on increasing
bug proneness.
16
Findings
RQ2:Are some languages more bug prone when
they are used with other languages?
17
Coeff. Std. Error z-val Pr(> |z|)
(Intercept) 5.208 0.074 70.864 < 2e-16 ***age 3.091e-04 3.828e-05 8.075 6.77e-16 ***size 4.143e-07 1.462e-07 2.833 0.005 **developers 7.672e-04 1.068e-04 7.181 6.92e-13 ***commits 6:077e-05 5.286e-06 11.498 < 2e-16 ***C++S -0.550 0.275 -2.005 0.045 *C++M 0.609 0.138 4.403 0.000 ***JavaS -0.568 0.164 -3.473 0.001 ***JavaM 0.530 0.192 2.767 0.006 **TypeScriptS -3.664 0.915 -4.005 0.000 ***TypeScriptM 0.542 0.131 4.139 0.000 ******p < 0.001, **p < 0.01, *p < 0.05; S=Single-language, M=Multi-language
RQ2: Multi-Language Setting
18
Six languages C++, Objective-C, Java, TypeScript, Clojure, and Scala are more defect prone when they are
used with other languages.
RQ2: Multi-Language Setting
19
Findings
RQ3:Does the correlation between number of languages and bug proneness exist for
various bug categories?
20
RQ3: Bug Categories Memory Bugs Concurrency Bugs
Coeff. Std. Error Sig. Coeff. Std. Error Sig.
(Intercept) 0.297 0.148 * 0.265 0.160
age 4.078e-04 6.230e-05 *** 6.296e-05 6.750e-05
size 8.519e-07 2.107e-07 *** 1.978e-07 2.283e-07
developers 4.875e-04 1.782e-04 ** 4.272e-04 1.899e-04 *
commits 4.017e-05 8.335e-06 *** 5.308e-05 8.942e-06 ***
num-langs 0.421 0.047 *** 0.396 0.050 ***
***p < 0.001, **p < 0.01,*p < 0.05
21
Number of languages has a positive and significant impact on the number of bug fixing commits for all categories of bugs – highest
for memory, concurrency and algorithm bugs.
RQ3: Bug Categories
Threats to Validity
22
• Construct Validity• Bug fix commits as a proxy of bug proneness
instead of issue reports.• Semi-automated way to classify bug into its
cause & impact categories.• Internal Validity
• Use of CLOC tool to compute lines of code.• Don’t consider complexity of applications.
• External Validity• Only projects on GitHub.
Conclusion
23
• Increasing the number of languages to implement a project, significantly increases bug proneness.
• Six languages C++, Objective-C, Java, TypeScript, Clojure, and Scala are significantly defect prone when used with other languages.
• The impact of the number of languages to cause higher bug proneness is significantly observed in all bug categories.
• Higher impact in memory, concurrency and algorithm categories.
24
Future Work
• Identify language pairs that are less compatible and more bug prone.
• Identify the types of bugs that affect multi-language programs.
• Expand the study beyond 17 programming languages.
Thank you!
Questions? Comments? Advice?{kochharps.2012,dwijedasa,davidlo}@smu.edu.sg
26
Appendix (Basic Statistics)Other Languages Used with each Primary Language
Primary Language
Languages Used with the Primary Language
C Java, C++, Ruby, PHP, Python, Javascript, Perl, Objective-C, TypeScript
C++ Javascript, C, Java, Objective-C, Python, Ruby, CoffeeScript, PHP, Perl, Haskell, C#, TypeScript
C# C++, C, Javascript, Ruby, Perl, Python, Objective-C, Java
Objective-C Python, Javascript, C, Ruby
Go C++, TypeScript, Javascript, Ruby, C, Python, Perl, Java, PHP, Objective-C
27
AppendixProject Details
Primary Language
#Projects #Associated Languages
#Developers SLOC (KLOC)
Period
C 104 9 25,203 17,350 4/1999 to 10/2015C++ 94 12 6,855 14,179 7/2000 to 10/2015C# 46 8 4,818 8,848 6/2001 to 10/2015Objective-C 38 4 2,319 378 7/2008 to 10/2015Go 41 10 6,297 4,013 3/2008 to 10/2015Java 84 10 5,456 5,004 3/2006 to 10/2015CoffeeScript 44 3 3,930 564 12/2009 to 10/2015JavaScript 245 13 10,536 3,596 3/2006 to 10/2015TypeScript 42 8 4,941 16,792 5/2002 to 10/2015Ruby 87 8 24,184 2,991 1/1998 to 10/2015Php 59 4 12,145 2,959 4/2003 to 10/2015Python 152 5 11,397 1,864 7/2005 to 10/2015Perl 38 5 1,482 297 1/2004 to 10/2015Clojure 46 5 2,171 432 4/2008 to 10/2015Erlang 36 7 2,429 2,670 5/2001 to 10/2015Haskell 37 5 4,158 1,053 4/2004 to 10/2015Scala 42 8 5,698 2,319 2/2003 to 10/2015Summary 628 134,019 85,309 1/1998 to 10/2015
28
AppendixEvolution History Details
Primary Language
Total Commits BugFix Commits#Commits Code Churn
(KLOC)
Bug Fixes Code Churn (KLOC)
C 780,153 229,411 258,822 42,736C++ 284,331 250,238 112,591 63,881C# 235,048 138,719 69,974 15,786Objective-C 30,048 16,617 7,561 1,031Go 117,282 33,134 32,371 4,486Java 156,153 105,063 58,922 21,412CoffeeScript 80,613 25,016 14,267 2,630JavaScript 123,500 45,256 33,498 6,254TypeScript 188,698 167,050 51,970 18,531Ruby 337,457 46,014 67,619 6,577Php 218,751 77,268 73,700 10,221Python 199,974 37,691 59,670 6,332Perl 38,231 7,944 7,649 418Clojure 41,588 5,650 6,994 511Erlang 77,411 24,425 15,579 1,977Haskell 128,149 21,834 24,558 4,612Scala 133,754 70,807 31,044 24,287Summary 3,171,141 1,302,137 926,789 231,682
29
AppendixOther Languages Used with each Primary Language
Primary Language Languages Used with the Primary LanguageC Java, C++, Ruby, PHP, Python, Javascript, Perl, Objective-C, TypeScriptC++ Javascript, C, Java, Objective-C, Python, Ruby, CoffeeScript, PHP, Perl, Haskell, C#,
TypeScriptC# C++, C, Javascript, Ruby, Perl, Python, Objective-C, JavaObjective-C Python, Javascript, C, RubyGo C++, TypeScript, Javascript, Ruby, C, Python, Perl, Java, PHP, Objective-CJava C++, C, Python, Javascript, Clojure, Perl, Ruby, Objective-C, C#, ScalaCoffeeScript Javascript, Ruby, PythonJavaScript C#, PHP, C++, Clojure, CoffeeScript, Ruby, Erlang, Python, Haskell, Java, Go, TypeScript,
ScalaTypeScript C++, Java, Python, Javascript, CoffeeScript, C, Perl, PHPRuby CoffeeScript, Javascript, Python, C++, C, Java, Perl, PHPPhp Javascript, Python, Perl, CPython Javascript, C, C++, CoffeeScript, PHPPerl Javascript, Java, Python, C++, CClojure Javascript, Java, C, Objective-C, RubyErlang C, Python, Javascript, PHP, C++, Java, ClojureHaskell C, JavaScript, Python, Perl, JavaScala Java, Ruby, Javascript, Perl, C#, Python, CoffeeScript, Clojure