Analysis of Schema Evolution for Databases in Open-Source Software

MSc Thesis - Ioannis Skoulis <iskoulis@cs.uoi.gr>

Department of Computer Science and Engineering
University of Ioannina, Greece

September 2013

What is Software Evolution?

Software evolution: the change of a software system, over the years and releases, from its initial formation to the point it is withdrawn (it is no longer used, or it has been surpassed by competing software).

E-type systems: software solving a problem or addressing an application in the real world.

What about Schema Evolution?
● Databases also have users with requirements
● Informational capacity must be raised to keep up with the real world
● They are fairly independent from the rest of the software
● Schema changes cause inconsistency in the application (both syntactic and semantic)

What is the Status in Literature?
● Software Evolution
  ○ Theoretical level [Mens04]
  ○ Case studies on proprietary software [LeBr85] (many in the seventies)
  ○ Open Source made things easier [GoTu00], [XiSt05], [WeYL08], [XiCN09]
  ○ Laws on Software Evolution [LeRa03]
● Schema Evolution
  ○ Three main case studies [Sjob93], [PVSV12], [CMDZ13]

What is the Problem?

● We do not have any lead whatsoever as to why and how evolution takes place in a database

What do we do about it?

We try to fill a gap in the literature: there are no published works on whether the laws of software evolution apply to schema evolution.
● Large-scale study on schema evolution
● Collected and processed eight schemas
● Report on measures (size, growth, changes)
● Study the applicability of the laws to databases
● We use concrete measures to do so

Roadmap
● The Laws of Software Evolution
● Experimental Setup
● Adapting the Laws for Schema Evolution
● Conclusion

Laws on Software Evolution
● A set of eight rules on the behavior of software as it evolves
● Derived from a study by M. Lehman of proprietary software (OS/360)
● Almost 40 years of review and evaluation (first three published in 1976)
● Recognized for their useful insights as to what evolves in the lifetime of a software system, and why

Laws on Software Evolution

I. Continuing Change
“An E-type system must be continually adapted or else it becomes progressively less satisfactory.”

II. Increasing Complexity
“As an E-type system is changed its complexity increases and becomes more difficult to evolve unless work is done to maintain or reduce the complexity.”

III. Self Regulation
“Global E-type systems evolution is feedback regulated.”

IV. Conservation of Organizational Stability
“The work rate of an organization evolving an E-type software system tends to be constant over the operational lifetime of that system or phases of that lifetime.”

Laws on Software Evolution

V. Conservation of Familiarity
“In general, the incremental growth of E-type systems is constrained by the need to maintain familiarity.”

VI. Continuing Growth
“The functional capacity of E-type systems must be continually enhanced to maintain user satisfaction over system lifetime.”

VII. Declining Quality
“Unless rigorously adapted and evolved to take into account changes in the operational environment, the quality of an E-type system will appear to be declining.”

VIII. Feedback System
“E-type evolution processes are multi-level, multi-loop, multi-agent feedback systems.”

Roadmap
● The Laws of Software Evolution
● Experimental Setup
● Adapting the Laws for Schema Evolution
● Conclusion

Experimental Setup
For each dataset:
● We gathered DDL files from public repos
● We collected all commits of the database schema on the trunk/master branch
● We ignored all other branches
● We ignored commits of other modules of the project
● Focused on MySQL
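The collection step could be scripted roughly as follows. This is a minimal sketch and not the tooling used in the thesis: it assumes a local clone of the project, a single DDL file path, and the standard `git log` and `git show` commands. The function names (`parse_log`, `schema_commits`, `schema_at`) are hypothetical.

```python
import subprocess
from typing import List, Tuple


def parse_log(output: str) -> List[Tuple[str, int]]:
    """Parse lines of '<commit hash> <unix timestamp>' into (hash, timestamp) pairs."""
    commits = []
    for line in output.splitlines():
        if line.strip():
            sha, ts = line.split()
            commits.append((sha, int(ts)))
    return commits


def schema_commits(repo: str, ddl_path: str, branch: str = "master") -> List[Tuple[str, int]]:
    """List the commits on one branch that touched the DDL file, oldest first."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--format=%H %ct", branch, "--", ddl_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_log(out)


def schema_at(repo: str, sha: str, ddl_path: str) -> str:
    """Return the contents of the DDL file as of the given commit."""
    return subprocess.run(
        ["git", "-C", repo, "show", f"{sha}:{ddl_path}"],
        check=True, capture_output=True, text=True,
    ).stdout
```

Restricting `git log` to one branch and one path mirrors the two "ignore" rules above: other branches and commits to other modules never show up in the result.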

Hecate: SQL schema diff viewer
● Parses DDL files
● Creates a model for the parsed SQL elements
● Computes the differences between two versions of the same schema
● Reports on the diff performed with a variety of metrics
● Exports the transitions that occurred in XML format
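To make the idea of a schema diff concrete, here is a small Python sketch. It is not Hecate's implementation: it assumes the DDL has already been parsed into a plain dictionary model (table name to column name to type), and the change kinds it reports are illustrative names only.

```python
from typing import Dict, List, Tuple

# Assumed model: table name -> {column name -> column type}
Schema = Dict[str, Dict[str, str]]


def diff_schemas(old: Schema, new: Schema) -> List[Tuple[str, str]]:
    """Return (change kind, element) pairs describing the transition old -> new."""
    changes = []
    for table in sorted(set(old) | set(new)):
        if table not in old:
            changes.append(("table_added", table))
        elif table not in new:
            changes.append(("table_deleted", table))
        else:
            old_cols, new_cols = old[table], new[table]
            for col in sorted(set(old_cols) | set(new_cols)):
                name = f"{table}.{col}"
                if col not in old_cols:
                    changes.append(("attribute_added", name))
                elif col not in new_cols:
                    changes.append(("attribute_deleted", name))
                elif old_cols[col] != new_cols[col]:
                    changes.append(("type_changed", name))
    return changes


v1 = {"user": {"id": "INT", "name": "VARCHAR(40)"}, "log": {"id": "INT"}}
v2 = {"user": {"id": "BIGINT", "name": "VARCHAR(40)", "email": "TEXT"}}
changes = diff_schemas(v1, v2)
```

Counting the entries of `changes` per transition gives one simple change metric; a rename would appear here as a deletion plus an addition, which is one reason renames need separate handling.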

Datasets
● Content Management Systems
  ○ MediaWiki, TYPO3, Coppermine, phpBB, OpenCart
● Medical Databases
  ○ Ensembl, BioSQL
● Scientific
  ○ ATLAS Trigger

Roadmap
● The Laws of Software Evolution
● Experimental Setup
● Adapting the Laws for Schema Evolution
● Conclusion

Laws for Schema Evolution
Three main groups for the Laws:
● Feedback-based System
  ○ I. Continuing Change
  ○ VIII. Feedback System
  ○ III. Self Regulation
● Positive feedback
  ○ VI. Continuing Growth
  ○ V. Conservation of Familiarity
  ○ IV. Conservation of Organizational Stability
● Negative feedback
  ○ II. Increasing Complexity
  ○ VII. Declining Quality

I. Continuing Change
“The database schema is continually adapted.”
Evaluation: the database must show signs of evolution as time passes
Metrics: heartbeat of changes over time and over version

[Figure: heartbeat of changes; panels include ATLAS Trigger, BioSQL, Coppermine, Ensembl]

[Figure: Change over time]

[Figure: Change over version. Panels: ATLAS Trigger, BioSQL, Coppermine, Ensembl, OpenCart, phpBB, TYPO3, MediaWiki]

I. Continuing Change
● Databases do change, but not continuously

VIII. Feedback System
“Database schema evolution processes are multi-level, multi-loop, multi-agent feedback systems.”

Evaluation: regression analysis to estimate the size of the database schemata

Metrics: estimated size, effort

$$\hat{S}_i = \hat{S}_{i-1} + \frac{\bar{E}}{\hat{S}_{i-1}^{2}}$$

$$E_i = \frac{s_i - s_a}{\sum_{j=a}^{i-1} \frac{1}{s_j^{2}}}$$
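The size estimate used for this law can be sketched in Python. The sketch assumes the recurrence $\hat{S}_i = \hat{S}_{i-1} + \bar{E} / \hat{S}_{i-1}^{2}$, with per-version effort terms $E_i = (s_i - s_a) / \sum_{j=a}^{i-1} 1/s_j^{2}$. Taking $\bar{E}$ as the plain mean of all $E_i$, and treating `a` as a fixed reference version, are assumptions made for illustration; the slides do not spell out either choice.

```python
from typing import List, Sequence


def effort_terms(sizes: Sequence[float], a: int = 0) -> List[float]:
    """E_i = (s_i - s_a) / sum_{j=a}^{i-1} 1 / s_j^2, for every version i > a."""
    terms = []
    inv_sq_sum = 0.0
    for i in range(a + 1, len(sizes)):
        inv_sq_sum += 1.0 / sizes[i - 1] ** 2  # extend the sum up to j = i - 1
        terms.append((sizes[i] - sizes[a]) / inv_sq_sum)
    return terms


def estimate_sizes(sizes: Sequence[float], a: int = 0) -> List[float]:
    """Estimated sizes for versions a..n-1, starting from the actual size s_a."""
    if len(sizes) - a < 2:
        raise ValueError("need at least two versions from index a onwards")
    terms = effort_terms(sizes, a)
    e_bar = sum(terms) / len(terms)  # assumption: E-bar is the mean of the E_i
    estimates = [float(sizes[a])]
    for _ in range(a + 1, len(sizes)):
        prev = estimates[-1]
        estimates.append(prev + e_bar / prev ** 2)
    return estimates
```

Comparing `estimate_sizes(sizes)` with the actual `sizes` series, version by version, is the kind of check the evaluation describes: the law is considered to hold when the estimated curve tracks the actual one.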

[Figure: Estimated Size. Actual size plotted against the estimates; legend entries include "Est - last 5" and "Est - last 10"]

VIII. Feedback System
● The regression formula for the estimation of size holds

III. Self Regulation
“Database schema evolution is feedback regulated.”
Evaluation: i) indication of patterns in size growth, ii) existence of negative feedback (drop in size and growth locally decreasing), iii) “ripples” in growth
Metrics: size over version, system growth

[Figure: Schema Size (relations)]

[Figure: Schema Growth]

III. Self Regulation
● We see sudden drops
● In all datasets we see increases, especially at the beginning or after large drops (positive feedback)
● Overall, database size increases
● All datasets have periods of stability
● Too many occurrences of zero growth
● No periods of continuous change, but we have small spikes
● Positive growth is immediately followed by negative growth or stability
● Oscillations exist in growth
● We cannot see patterns of smooth growth interrupted by perfective maintenance
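The observations on this law rest on the growth series. A minimal Python sketch of how such a series and two of the indicators mentioned (zero growth, oscillation) could be computed is given below. It assumes growth is the difference in schema size between consecutive versions; the helper names and the sign-flip count as a measure of oscillation are illustrative choices, not the thesis's definitions.

```python
from typing import List, Sequence


def growth(sizes: Sequence[int]) -> List[int]:
    """Growth per transition: size of version i minus size of version i-1."""
    return [b - a for a, b in zip(sizes, sizes[1:])]


def zero_growth_share(sizes: Sequence[int]) -> float:
    """Fraction of transitions in which the schema size did not change."""
    deltas = growth(sizes)
    return sum(1 for d in deltas if d == 0) / len(deltas)


def sign_flips(sizes: Sequence[int]) -> int:
    """Count how often non-zero growth changes sign (a rough oscillation measure)."""
    nonzero = [d for d in growth(sizes) if d != 0]
    return sum(1 for a, b in zip(nonzero, nonzero[1:]) if (a > 0) != (b > 0))


sizes = [10, 12, 12, 12, 9, 11]  # made-up relation counts for six versions
```

For the made-up series above, `growth(sizes)` is `[2, 0, 0, -3, 2]`: two of five transitions are still, and the non-zero growth flips sign twice.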

Laws for Schema Evolution
Three main groups for the Laws:
● Feedback-based System
  ○ I. Continuing Change
  ○ VIII. Feedback System
  ○ III. Self Regulation
● Positive feedback
  ○ VI. Continuing Growth
  ○ V. Conservation of Familiarity
  ○ IV. Conservation of Organizational Stability
● Negative feedback
  ○ II. Increasing Complexity
  ○ VII. Declining Quality

VI. Continuing Growth
“The informational capacity of databases must be continually enhanced to maintain user satisfaction over system lifetime.”

Evaluation: overall expansion trend for the metrics involved
Metrics: number of relations, number of attributes

VI. Continuing Growth
● Phases:
  ○ Stability (unique for databases)
  ○ Smooth expansion
  ○ Abrupt change

V. Conservation of Familiarity
“In general, the incremental growth of database schema is constrained by the need to maintain familiarity.”

Evaluation: i) growth is constant or declining, ii) versions with a significant change in size are followed by small growth
Metrics: schema growth, schema growth rate

[Figure: Schema Growth]

[Figure: Schema Size (relations)]

V. Conservation of Familiarity
● No diminishing trend in growth
● Drop is due to density
● Change is frequent in the beginning
● Large changes and dense periods can occur at any time
● No expansion of growth

We covered the intuitions, but is this OK?

V. Conservation of Familiarity
The growth behaves as expected, but is it because of the need to maintain familiarity?
In databases there are other reasons that might constrain growth:
● Other modules are highly dependent on them
● Effort might be taken to clean and organize a database

IV. Conservation of Organizational Stability
“The work rate of an organization evolving a database schema tends to be constant over the operational lifetime of that schema or phases of that lifetime.”

Evaluation: i) detect phases with constant growth, ii) those phases must be connected with abrupt changes
Metrics: schema growth

IV. Conservation of Organizational Stability
● Growth is bounded to small values
● Almost all values lie in [-2, 2] or [0, 2]
● Few changes
● Zero values are overwhelmingly dominant

Laws for Schema Evolution
Three main groups for the Laws:
● Feedback-based System
  ○ I. Continuing Change
  ○ VIII. Feedback System
  ○ III. Self Regulation
● Positive feedback
  ○ VI. Continuing Growth
  ○ V. Conservation of Familiarity
  ○ IV. Conservation of Organizational Stability
● Negative feedback
  ○ II. Increasing Complexity
  ○ VII. Declining Quality

II. Increasing Complexity
“Efforts to maintain internal quality must be made.”
Evaluation: i) we must identify versions with perfective maintenance, ii) law VIII must hold, iii) the approximate complexity must increase

Metrics:

$$\text{complexity} = \frac{\text{modules handled}}{S_i - S_{i-1}}$$

$$\text{maintenance rate} = \frac{\text{modules handled}}{\text{old size}}$$

[Figure: Complexity]

[Figure: Maintenance Rate (0% to 100%)]

II. Increasing Complexity
● Complexity is dropping rather than rising
● Changes also decline in density over time, so complexity declines
● Maintenance becomes easier
● Complexity is only an estimate

VII. Declining Quality
“Unless rigorously adapted and evolved to take into account changes in the operational environment, the quality of a database schema will appear to be declining.”

Evaluation: holds by logical induction if laws III, VIII, and II hold
Metrics: not possible to measure external quality

We are unsure of the behavior of internal quality, so we are even more reluctant to declare external quality as improving.

Laws for Schema Evolution
Three main groups for the Laws:
● Feedback-based System
  ○ I. Continuing Change
  ○ VIII. Feedback System
  ○ III. Self Regulation
● Positive feedback
  ○ VI. Continuing Growth
  ○ V. Conservation of Familiarity
  ○ IV. Conservation of Organizational Stability
● Negative feedback
  ○ II. Increasing Complexity
  ○ VII. Declining Quality?

Roadmap
● The Laws of Software Evolution
● Experimental Setup
● Adapting the Laws for Schema Evolution
● Conclusion

Conclusions
● High degree of certainty
  • Databases do not grow continuously
  • Changes reduce in density as databases age
  • The size grows overall
  • Regressive formula holds
  • Growth is smaller than in typical software
  • Schema changes follow Zipf’s law
  • Average growth is close to zero

Conclusions
● Requiring further insight
  • Change frequently follows spike patterns
  • Change follows three patterns
    • Stillness
    • Abrupt change
    • Smooth growth
  • Large changes sequenced one after the other
  • Age reduces complexity

Future Work
● Time-related measures
  ○ We have occasions where effort is high or low
  ○ We need better measures of change over time (patterns)
● Detection of “abrupt change”
  ○ Splitting of a lifetime into phases
  ○ Compute running averages over a fixed number of versions
● Identifying perfective maintenance
  ○ Capture renames
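One way the running-average idea could look in Python is sketched below. It is only an illustration of the proposed direction, not a method from the thesis: the window size, the threshold `factor`, and the rule "growth is abrupt when it exceeds `factor` times the mean absolute growth of the preceding window" are all assumptions.

```python
from typing import List, Sequence


def running_average(values: Sequence[float], window: int) -> List[float]:
    """Mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def abrupt_changes(growth: Sequence[float], window: int, factor: float = 3.0) -> List[int]:
    """Indices where |growth| exceeds factor * mean |growth| of the preceding window."""
    flagged = []
    for i in range(window, len(growth)):
        baseline = sum(abs(g) for g in growth[i - window : i]) / window
        if abs(growth[i]) > factor * baseline:
            flagged.append(i)
    return flagged


growth_series = [0, 1, 0, 0, 10, 0]  # made-up growth values per transition
```

With the made-up series above and `window=3`, only index 4 (the jump of 10) is flagged. The flagged indices could then serve as candidate boundaries when splitting a lifetime into phases.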

Future Work
● Complexity
  ○ We lack a representative set of metrics that measure the complexity of a database schema
  ○ Structural complexity may involve:
    • Number of foreign keys of the relational schema
    • Number of relationships of the conceptual schema
  ○ Measuring relations that are semantically related to each other
● More datasets

Reaching the End...

Questions?