Seeing the forest for the trees, UMons 2011

Post on 17-Dec-2014


Slides used during the talk at UMons in November 2011.

Seeing the forest for the trees

Bogdan Vasilescu
b.n.vasilescu@tue.nl
http://www.win.tue.nl/~bvasiles/

Software Engineering and Technology group
Eindhoven University of Technology

November 23, 2011

2/21

/ department of mathematics and computer science

Eindhoven

3/21

Computer Science @TU/e

- Section Model Driven Software Engineering (MDSE)
- Group Software Engineering and Technology (SET)

Mark van den Brand, Alexander Serebrenik

4/21

Interested in ...

- Software evolution
  - Aggregation of code metrics
  - Activity in open-source projects
- Computational geometry

5/21

Aggregation of software metrics

Maintaining a software system is like renovating a house.

Maintainability assessment precedes changing the software.

Metrics are often applied to measure maintainability.

But metrics are defined at a low level (method, class).

We need aggregation techniques.

6/21

Aggregation of software metrics

7/21

Traditional aggregation techniques

Standard summary statistics: mean, median, ...

Red line – mean; blue line – median
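
The gap between the two summary statistics is easy to reproduce. The sketch below uses made-up SLOC values per class (the numbers are illustrative assumptions, not data from the talk): one very large class is enough to pull the mean far away from the median.

```python
from statistics import mean, median

# Hypothetical SLOC per class: many small classes and one very large one.
sloc_per_class = [12, 15, 20, 22, 30, 35, 40, 2500]

avg = mean(sloc_per_class)
mid = median(sloc_per_class)

print(f"mean   = {avg:.2f}")  # dominated by the single 2500-line class
print(f"median = {mid:.2f}")  # describes the typical class, ignores the outlier
```

Neither number alone says how unevenly the code is spread over the classes, which is the gap that inequality indices are meant to fill.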

8/21

Recent trend: Inequality indices

Econometrics: measure/explain the inequality of income or wealth.

Software metrics and econometric variables have distributions with similar shapes.

[Figure: two histograms. (1) "Source Lines of Code: freecol-0.9.4", x-axis: SLOC per class (0 to 3000), y-axis: Frequency. (2) "Household income in Ilocos, Philippines (1998)", x-axis: Income, y-axis: Frequency.]

9/21

Degree of concentration of functionality

Lorenz curve for SLOC in Hibernate 3.6.0-beta4.

[Figure: Lorenz curve; x-axis: % Classes (0.0 to 1.0), y-axis: % SLOC (0.0 to 1.0)]

Measure inequality between:
- individuals (e.g., classes)
- groups (e.g., components)

Often desirable to assess the contribution of the inequality between the groups.
- Decomposable indices
- Root-cause analysis

Marked on the Lorenz curve: $I_{\mathrm{Hoover}}$ and $I_{\mathrm{Gini}} = \frac{A}{A+B} = 2A$, where $A$ and $B$ are the two areas labelled in the plot. The second equality holds because $A + B = \frac{1}{2}$.
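
As a minimal sketch of how both indices can be read off a Lorenz curve, the code below builds the curve from a list of values and derives the Gini index as $2A$ and the Hoover index as the largest vertical gap between the diagonal and the curve. The function names and the sample values are assumptions made for illustration; this is the population (divide by $n$) variant, not necessarily the exact estimator used in the talk.

```python
def lorenz_points(values):
    """Cumulative (population share, value share) points of the Lorenz curve."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    points, running = [(0.0, 0.0)], 0.0
    for i, x in enumerate(xs, start=1):
        running += x
        points.append((i / n, running / total))
    return points


def gini(values):
    """Gini = 2A, with A = 1/2 - B and B the area under the Lorenz curve."""
    pts = lorenz_points(values)
    # Trapezoid rule for the area B under the piecewise-linear curve.
    area_b = sum((x1 - x0) * (y0 + y1) / 2
                 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
    return 2 * (0.5 - area_b)


def hoover(values):
    """Hoover = largest vertical gap between the diagonal and the curve."""
    return max(p - share for p, share in lorenz_points(values))


# Hypothetical SLOC per class.
sloc = [10, 20, 30, 40]
print(gini(sloc), hoover(sloc))
```

For a perfectly even distribution the Lorenz curve coincides with the diagonal, so both functions return 0 (up to rounding).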

10/21

Traceability via decomposability

Which individuals (classes in package) contribute to 80% of the inequality (of SLOC)?

Which class contributes the most to the inequality?
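
One way to make such questions concrete is the additive decomposition of the Theil index into a between-group and a within-group part. The sketch below assumes the standard formula $T = \frac{1}{n}\sum_i \frac{x_i}{\mu}\ln\frac{x_i}{\mu}$ and uses hypothetical package names and SLOC values; it illustrates the idea and is not the tooling used in the talk.

```python
from math import log


def theil(values):
    """Theil index: (1/n) * sum((x/mu) * ln(x/mu)); requires all values > 0."""
    n, mu = len(values), sum(values) / len(values)
    return sum((x / mu) * log(x / mu) for x in values) / n


def theil_decomposition(groups):
    """Split the overall Theil index into (between, within) group parts.

    `groups` maps a group name (e.g., a package) to its positive values.
    """
    all_values = [x for vals in groups.values() for x in vals]
    n, mu = len(all_values), sum(all_values) / len(all_values)
    between = within = 0.0
    for vals in groups.values():
        mu_g = sum(vals) / len(vals)
        share = (len(vals) / n) * (mu_g / mu)  # group's share of the total
        between += share * log(mu_g / mu)
        within += share * theil(vals)
    return between, within


# Hypothetical SLOC per class, grouped by package.
packages = {"core": [100, 300, 800], "util": [20, 30, 50]}
between, within = theil_decomposition(packages)
total = theil([x for vals in packages.values() for x in vals])
print(f"between = {between:.3f}, within = {within:.3f}, total = {total:.3f}")
```

The two parts add up to the overall index, so `between / total` gives the fraction of the inequality that is explained by the grouping into packages.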

11/21

Other properties of inequality indices

Symmetry

Inequality stays the same for any permutation of the population.

12/21

Other properties of inequality indices

Population principle

Inequality does not change if the population is replicated any number of times.
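
Both symmetry and the population principle can be checked numerically. The sketch below does so for the mean logarithmic deviation (MLD), assuming the usual definition $\frac{1}{n}\sum_i \ln\frac{\mu}{x_i}$; the sample values are made up.

```python
from math import log


def mld(values):
    """Mean logarithmic deviation: (1/n) * sum(ln(mu / x)); requires x > 0."""
    mu = sum(values) / len(values)
    return sum(log(mu / x) for x in values) / len(values)


sloc = [10, 20, 30, 140]       # hypothetical SLOC per class

shuffled = [140, 10, 30, 20]   # symmetry: same values, different order
replicated = sloc * 3          # population principle: every value three times

print(mld(sloc), mld(shuffled), mld(replicated))
```

All three printed values agree up to floating-point rounding, because the index depends only on the relative frequencies of the values and not on their order or on the absolute population size.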

13/21

Other properties of inequality indices

Transfers principle

A transfer from a rich man to a poor man (without reversing their position) should decrease inequality.

[Figure: transfer example; values shown on the slide: 20, 36, 45 and 30, 36]
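
A small numeric check of the transfers principle, as a sketch under stated assumptions: it uses the Atkinson index in its common $\varepsilon$-parameterised form, $1 - \frac{1}{\mu}\left(\frac{1}{n}\sum_i x_i^{1-\varepsilon}\right)^{1/(1-\varepsilon)}$, with $\varepsilon = 0.5$ chosen arbitrarily. The values and the helper `transfer` are hypothetical and are not the example from the slide.

```python
def atkinson(values, epsilon=0.5):
    """Atkinson index for epsilon >= 0, epsilon != 1 (assumed parameterisation)."""
    n, mu = len(values), sum(values) / len(values)
    power = 1 - epsilon
    return 1 - (sum(x ** power for x in values) / n) ** (1 / power) / mu


def transfer(values, rich, poor, amount):
    """Move `amount` from index `rich` to index `poor` without reversing them."""
    out = list(values)
    if out[rich] - amount < out[poor] + amount:
        raise ValueError("transfer would reverse the two positions")
    out[rich] -= amount
    out[poor] += amount
    return out


before = [10, 40, 90]                                  # hypothetical values
after = transfer(before, rich=2, poor=0, amount=20)    # gives [30, 40, 70]

print(atkinson(before), atkinson(after))
```

The total stays the same, the richer element stays richer, and the index computed for `after` is lower than for `before`.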

14/21

Other properties of inequality indices

Scale invariance: Gini, Theil, Atkinson, Hoover

Inequality does not change if all values are multiplied by the same constant.
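
The following sketch contrasts a scale-invariant index (Hoover) with the Kolm index, which is instead invariant under adding the same constant to every value. It assumes the common definitions $\frac{\sum_i |x_i - \mu|}{2\sum_i x_i}$ and $\frac{1}{\beta}\ln\left(\frac{1}{n}\sum_i e^{\beta(\mu - x_i)}\right)$; the value of `beta` and the data are illustrative choices.

```python
from math import exp, log


def hoover(values):
    """Hoover index: half the relative mean absolute deviation."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / (2 * sum(values))


def kolm(values, beta=0.01):
    """Kolm index with parameter beta > 0 (beta is an illustrative choice)."""
    mu = sum(values) / len(values)
    return log(sum(exp(beta * (mu - x)) for x in values) / len(values)) / beta


sloc = [10, 20, 30, 140]            # hypothetical SLOC per class
scaled = [3 * x for x in sloc]      # multiply every value by a constant
shifted = [x + 100 for x in sloc]   # add a constant to every value

print("Hoover:", hoover(sloc), hoover(scaled), hoover(shifted))
print("Kolm:  ", kolm(sloc), kolm(scaled), kolm(shifted))
```

Hoover is unchanged by scaling but drops after the shift, while Kolm is unchanged by the shift but grows after scaling.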

15/21

Summary

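Ineq. index                        Sym.  Inv.  Dec.  Pop.  Tra.
$I_{\mathrm{Gini}}$                X     ×     -     X     X
$I_{\mathrm{Theil}}$               X     ×     X     X     X
$I_{\mathrm{MLD}}$                 X     ×     X     X     X
$I_{\mathrm{Hoover}}$              X     ×     -     X     -
$I^{\alpha}_{\mathrm{Atkinson}}$   X     ×     X     X     X
$I^{\beta}_{\mathrm{Kolm}}$        X     +     X     X     X

Columns: Sym. = symmetry, Inv. = invariance (× under multiplication by a constant, + under addition of a constant), Dec. = decomposability, Pop. = population principle, Tra. = transfers principle. X = property holds, - = no mark.

Problems include:
- Domain not always $\mathbb{R}^n$.
- No distinction between all values equal but low, and all values equal but high.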
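
Both problems show up in a few lines. The sketch below uses a naive Theil implementation (an assumption for illustration; some definitions adopt the convention $0 \ln 0 = 0$, which this code does not): it cannot tell uniformly small values from uniformly large ones, and it is undefined as soon as a value is 0.

```python
from math import log


def theil(values):
    """Naive Theil index: (1/n) * sum((x/mu) * ln(x/mu)); needs x > 0."""
    n, mu = len(values), sum(values) / len(values)
    return sum((x / mu) * log(x / mu) for x in values) / n


all_low = [1, 1, 1, 1]
all_high = [1000, 1000, 1000, 1000]
print(theil(all_low), theil(all_high))  # both are 0.0

try:
    theil([0, 5, 10])
    defined_for_zero = True
except ValueError:
    # math.log(0.0) is outside the logarithm's domain.
    defined_for_zero = False
print(defined_for_zero)
```

In practice this means an inequality index is best reported together with a measure of magnitude, such as the mean or the median.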

16/21

Our research

17/21

Which are redundant?

$I_{\mathrm{Gini}}$, $I_{\mathrm{Theil}}$, $I_{\mathrm{MLD}}$, $I_{\mathrm{Atkinson}}$, and $I_{\mathrm{Hoover}}$ always convey the same information.

[Figure: two plot panels, vertical axis from -1.0 to 1.0, one entry per pair of indices.
SLOC: MLD-Hoo (91%), Gin-MLD (89%), The-MLD (91%), Gin-Hoo (90%), Atk-Hoo (92%), The-Hoo (92%), Gin-Atk (90%), MLD-Atk (91%), Gin-The (91%), The-Atk (92%)
DIT: MLD-Hoo (85%), Atk-Hoo (87%), Gin-MLD (87%), The-Hoo (88%), Gin-Atk (88%), Gin-Hoo (89%), Gin-The (88%), The-MLD (88%), The-Atk (88%), MLD-Atk (89%)]

18/21

Is the correlation meaningful?

Superlinear (e.g., $I_{\mathrm{Theil}}$–$I_{\mathrm{Gini}}$) and chaotic (e.g., $I_{\mathrm{Theil}}$–$I_{\mathrm{Kolm}}$) patterns can be observed in the scatter plots.

[Figure: two scatter plots for compiere. (1) "compiere: Theil-Gini. Kendall: 0.94, p-val: 0.00", x-axis: Gini (SLOC), y-axis: Theil (SLOC). (2) "compiere: Theil-Kolm. Kendall: 0.25, p-val: 0.01", x-axis: Kolm (SLOC), y-axis: Theil (SLOC).]
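
Kendall's rank correlation only looks at whether pairs of observations are ordered the same way in both variables, which is why a superlinear but monotone relation still scores high. Below is a minimal pure-Python sketch of the tau-b variant; the per-package index values are invented for illustration, and p-values are not computed.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b: (concordant - discordant) pairs, corrected for ties."""
    concordant = discordant = ties_x = ties_y = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0:
            ties_x += 1
        if dy == 0:
            ties_y += 1
        if dx * dy > 0:
            concordant += 1
        elif dx * dy < 0:
            discordant += 1
    n0 = len(xs) * (len(xs) - 1) // 2  # total number of pairs
    return (concordant - discordant) / sqrt((n0 - ties_x) * (n0 - ties_y))


# Hypothetical index values for four packages.
gini_vals = [0.10, 0.25, 0.40, 0.55]
theil_vals = [0.02, 0.11, 0.30, 0.62]   # grows faster than Gini, same order
kolm_vals = [30, 5, 120, 60]            # ordered differently

print(kendall_tau_b(gini_vals, theil_vals))
print(kendall_tau_b(theil_vals, kolm_vals))
```

The function divides by zero if all values of one variable are tied, so a real analysis would need to guard against that case.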

19/21

Does the aggregation level matter?

Changing the aggregation level to class level does not affect the correlation between various aggregation techniques as measured at package level.

[Figure: six plot panels, vertical axis: Kendall correlation coefficient (-1.0 to 1.0). Panel titles, each appearing twice: "Kendall: Gini - Theil (SLOC) (100%)", "Kendall: Theil - Atkinson (SLOC) (100%)", "Kendall: Theil - MLD (SLOC) (100%)".]

20/21

Does system size matter?

System size does influence the correlation between aggregation techniques, e.g., the $I_{\mathrm{Theil}}$–$I_{\mathrm{Kolm}}$ correlation increases with system size.

[Figure: "hibernate - Kendall(Theil(SLOC), Kolm(SLOC)) (86 releases)"; vertical axis: Cor. coeff. Theil(SLOC) - Kolm(SLOC), 0.0 to 1.0; horizontal axis: Hibernate releases from 0.8.1 to 3.6.0-beta4.]

21/21

References

A. Serebrenik and M. G. J. van den Brand. Theil index for aggregation of software metrics values. In Int. Conf. on Software Maintenance, pages 1–9. IEEE, 2010.

B. Vasilescu. Analysis of advanced aggregation techniques for software metrics. Master’s thesis, Eindhoven, The Netherlands, July 2011.

B. Vasilescu, A. Serebrenik, and M. G. J. van den Brand. By no means: A study on aggregating software metrics. In 2nd International Workshop on Emerging Trends in Software Metrics, Honolulu, Hawaii, USA, 2011.

B. Vasilescu, A. Serebrenik, and M. G. J. van den Brand. You can’t control the unfamiliar: A study on the relations between aggregation techniques for software metrics. In Int. Conf. on Software Maintenance. IEEE, 2011.

22/21

Correlation

Linear correlation can be misleading.

[Figure: four scatter plots (x-axis ticks 5 to 15, y-axis ticks 4 to 12) with the same Pearson coefficient but different rank correlations:
- Pea: 0.816; Ken: 0.963; Spe: 0.990
- Pea: 0.816; Ken: 0.636; Spe: 0.818
- Pea: 0.816; Ken: 0.563; Spe: 0.690
- Pea: 0.816; Ken: 0.426; Spe: 0.5]

[Vas11, VSvdB11a, SvdB10, VSvdB11b]
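
To illustrate how a linear coefficient can understate a perfectly monotone relation, the sketch below compares Pearson's coefficient with Spearman's rank coefficient on synthetic data ($y = 2^x$). The data and helper names are assumptions for illustration, and the rank helper assumes there are no tied values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson's linear correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank of each value, starting at 1; assumes no tied values."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman's coefficient: Pearson's coefficient of the ranks."""
    return pearson(ranks(xs), ranks(ys))


xs = list(range(1, 11))
ys = [2 ** x for x in xs]   # strictly increasing, but far from linear

print(f"Pearson:  {pearson(xs, ys):.3f}")
print(f"Spearman: {spearman(xs, ys):.3f}")
```

Pearson stays well below 1 although `ys` is a deterministic function of `xs`, while the rank-based coefficient reports a perfect monotone association.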
