Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... ·...

27
Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulis a,1 , Panos Vassiliadis b , Apostolos V. Zarras b a Opera, Helsinki, Finland b University of Ioannina, Ioannina, Hellas Abstract Like all software systems, databases are subject to evolution as time passes. The impact of this evolution can be vast as a change to the schema of a database can affect the syntactic correctness and the semantic validity of all the surrounding applications. In this paper, we have performed a thorough, large-scale study on the evolution of databases that are part of larger open source projects, publicly available through open source repositories. Lehman’s laws of software evolution, a well-established set of observations on how the typical software systems evolve (matured during the last forty years), has served as our guide towards providing insights on the mechanisms that govern schema evolution. Much like software systems, we found that schemata expand over time, under a stabilization mechanism that constraints uncontrolled expansion with perfective maintenance. At the same time, unlike typical software systems, the growth is typically low, with long periods of calmness interrupted by bursts of maintenance and a surprising lack of complexity increase. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed. Tom Robbins, Even Cowgirls Get the Blues 1. Introduction Software evolution is the change of a software system over time, typically performed via a re- markably difficult, complicated and time consum- ing process, software maintenance. Schema evolu- tion is the most important aspect of software evo- lution that pertains to databases, as it can have a tremendous impact to the entire information sys- tem built around the evolving database, severely affecting both developers and end-users. Quite fre- quently, development waits till a ”schema back- bone” is stable and applications are build on top of it. This is due to the ”dependency magnet ” na- ture of databases: a change in the schema of a database may immediately drive surrounding ap- plications to crash (in case of deletions or renam- ings) or be semantically defective or inaccurate (in Email addresses: [email protected] (Ioannis Skoulis), [email protected] (Panos Vassiliadis), [email protected] (Apostolos V. Zarras) 1 Work conducted while in the Univ. of Ioannina the case of information addition, or restructuring). Therefore, discovering laws, patterns and regulari- ties in schema evolution can result in great bene- fits, as we would be able to design databases with a view to their evolution and minimize the impact of evolution to the surrounding applications: (a) by avoiding ”design anti-patterns” leading to cu- mulative complexity for both the database and the surrounding applications, and, (b) by planing ad- ministration and maintenance tasks and resources, instead of just responding to emergencies. In sharp distinction to traditional software sys- tems, and disproportionately to the severity of its implications, database evolution has hardly been studied throughout the entire lifetime of the data management discipline. It is only amazing to find out that, in the history of the discipline, just a handful of studies had been published in the area. The deficit is really amazing in the case of tradi- tional database environments, where only two(!) studies [1], [2] have been published. Apart from amazing, this deficit should also be expected: al- lowing the monitoring, study and eventual publica- tion of the evolution properties of a database would expose the internals of a critical part of the core of an organization’s information system. Fortunately, the open-source movement has provided us with the Preprint submitted to Elsevier March 4, 2015

Transcript of Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... ·...

Page 1: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Growing up with Stability: how Open-Source Relational Databases Evolve

Ioannis Skoulisa,1, Panos Vassiliadisb, Apostolos V. Zarrasb

aOpera, Helsinki, FinlandbUniversity of Ioannina, Ioannina, Hellas

Abstract

Like all software systems, databases are subject to evolution as time passes. The impact of this evolutioncan be vast as a change to the schema of a database can affect the syntactic correctness and the semanticvalidity of all the surrounding applications. In this paper, we have performed a thorough, large-scale studyon the evolution of databases that are part of larger open source projects, publicly available through opensource repositories. Lehman’s laws of software evolution, a well-established set of observations on howthe typical software systems evolve (matured during the last forty years), has served as our guide towardsproviding insights on the mechanisms that govern schema evolution. Much like software systems, we foundthat schemata expand over time, under a stabilization mechanism that constraints uncontrolled expansionwith perfective maintenance. At the same time, unlike typical software systems, the growth is typicallylow, with long periods of calmness interrupted by bursts of maintenance and a surprising lack of complexityincrease.

A truly stable system expects the unexpected, isprepared to be disrupted, waits to be transformed.Tom Robbins, Even Cowgirls Get the Blues

1. Introduction

Software evolution is the change of a softwaresystem over time, typically performed via a re-markably difficult, complicated and time consum-ing process, software maintenance. Schema evolu-tion is the most important aspect of software evo-lution that pertains to databases, as it can have atremendous impact to the entire information sys-tem built around the evolving database, severelyaffecting both developers and end-users. Quite fre-quently, development waits till a ”schema back-bone” is stable and applications are build on topof it. This is due to the ”dependency magnet” na-ture of databases: a change in the schema of adatabase may immediately drive surrounding ap-plications to crash (in case of deletions or renam-ings) or be semantically defective or inaccurate (in

Email addresses: [email protected] (Ioannis Skoulis),[email protected] (Panos Vassiliadis), [email protected](Apostolos V. Zarras)

1Work conducted while in the Univ. of Ioannina

the case of information addition, or restructuring).Therefore, discovering laws, patterns and regulari-ties in schema evolution can result in great bene-fits, as we would be able to design databases witha view to their evolution and minimize the impactof evolution to the surrounding applications: (a)by avoiding ”design anti-patterns” leading to cu-mulative complexity for both the database and thesurrounding applications, and, (b) by planing ad-ministration and maintenance tasks and resources,instead of just responding to emergencies.

In sharp distinction to traditional software sys-tems, and disproportionately to the severity of itsimplications, database evolution has hardly beenstudied throughout the entire lifetime of the datamanagement discipline. It is only amazing to findout that, in the history of the discipline, just ahandful of studies had been published in the area.The deficit is really amazing in the case of tradi-tional database environments, where only two(!)studies [1], [2] have been published. Apart fromamazing, this deficit should also be expected: al-lowing the monitoring, study and eventual publica-tion of the evolution properties of a database wouldexpose the internals of a critical part of the core ofan organization’s information system. Fortunately,the open-source movement has provided us with the

Preprint submitted to Elsevier March 4, 2015

Page 2: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

possibility to slightly change this landscape. Aspublic repositories (git, svn, . . .) keep the entirehistory of revisions of software projects, includingthe schema files of any database internally hostedwithin them, we are now presented with the oppor-tunity to study the version histories of such opensource databases. Hence, within only a few years inthe late ’00’s, several research efforts [3], [4], [5], [6]have studied of schema evolution in open source en-vironments. Those studies, however, focus on thestatistical properties of the evolution and do notprovide details on the mechanism that governs theevolution of database schemata.To contribute towards amending this deficit, the

research goal of this paper involves the identificationof patterns and regularities of schema evolution thatcan help us understand the underlying mechanismthat governs it. To this end, we study the evolu-tion of the logical schema of eight databases, thatare parts of publicly available, open-source softwareprojects (Sec. 3). We have collected and cleansedthe available versions of the database schemata forthe eight case studies, extracted the changes thathave been performed in these versions and, finally,we have come up with usable datasets that we sub-sequently analyzed.Our main tool for this analysis came from the

area of software engineering. In an attempt tounderstand the mechanics behind the evolution ofsoftware and facilitate a smoother, lest disruptivemaintenance process, Meir Lehman and his col-leagues introduced a set of rules in mid seventies[7], also known as the Laws on Software Evolution(Sec. 2). Their findings, that were reviewed andenhanced for nearly 40 years [8], [9], have, sincethen, given an insight to managers, software de-velopers and researchers, as to what evolves in thelifetime of a software system, and why it does so.Other studies (see [10] for a survey) have comple-mented these insights in this field, typically withparticular focus to open-source software projects.In our case, we adapted the laws of software evo-lution to the case of schema evolution and utilizedthem as a driver towards understanding how thestudied schemata evolve. Our findings (Sec. 4) in-dicate that the schemata of open source databasesexpand over time, with long periods of calmnessconnected via bursts of maintenance effort focusedin time, and with significant effort towards the per-fective maintenance of the schema that appears toresult in an unexpected lack of complexity increase.Incremental growth of the schema is typically low

and its volume follows a Zipfian distribution. Inboth the presentations of our results and in ourconcluding notes (Sec. 5) we also demonstrate thatalthough the technical assessment of Lehman’s lawsshows that the typical software systems evolve quitedifferently than database schemata, the essence ofthe laws is preserved: evolution is not about uncon-trolled expansion; on the contrary, there appears tobe a stabilization mechanism that employs perfec-tive maintenance to control the otherwise growingtrend of increase in the information capacity of thedatabase.

Roadmap. In Sec. 2, we summarize Lehman’slaws for the non-expert reader and survey relatedefforts, too. In Sec. 3 we discuss the experimen-tal setup of this study and in Sec. 4, we detail ourfindings. We conclude our deliberations with a sum-mary of our findings and their implications in Sec. 5.

2. Lehman Laws of Software Evolution in aNutshell

Meir M. Lehman and his colleagues, have intro-duced, and subsequently amended, enriched, andcorrected a set of rules on the behavior of softwareas it evolves over time [7], [8], [9]. Lehman’s laws fo-cus on E-type systems, that concern “software solv-ing a problem or addressing an application in thereal-world” [8]. The main idea behind the laws ofevolution for E-type software systems is that theirevolution is a process that follows the behavior ofa feedback-based system. Being a feedback-basedsystem, the evolution process has to balance (a)positive feedback, i.e., the need to adapt to a chang-ing environment and grow to address the need formore functionality, and, (b) negative feedback, i.e.,the need to control, constrain and direct change inways that prevent the deterioration of the main-tainability and manageability of the software. Inthe sequel, we list the definitions of the laws as theyare presented in [9], in a more abstract form thanprevious versions and with the benefit of retrospect,after thirty years of maturity and research findings.

(I) Law of Continuing Change An E-type sys-tem must be continually adapted or else it be-comes progressively less satisfactory in use.

(II) Law of Increasing Complexity As an E-type system is changed its complexity increasesand becomes more difficult to evolve unlesswork is done to maintain or reduce the com-plexity.

2

Page 3: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

(III) Law of Self Regulation Global E-typesystem evolution is feedback regulated.

(IV) Law of Conservation of Organisational StabilityThe work rate of an organisation evolving anE-type software system tends to be constantover the operational lifetime of that system orphases of that lifetime.

(V) Law of Conservation of Familiarity Ingeneral, the incremental growth (growth ratiotrend) of E-type systems is constrained by theneed to maintain familiarity.

(VI) Law of Continuing Growth The func-tional capability of E-type systems mustbe continually enhanced to maintain usersatisfaction over system lifetime.

(VII) Law of Declining Quality Unless rigor-ously adapted and evolved to take into accountchanges in the operational environment, thequality of an E-type system will appear to bedeclining.

(VIII) Law of Feedback System E-type evolu-tion processes are multi-level, multi-loop,multi-agent feedback systems.

Before proceeding with our study, we present a firstapodosis of the laws, taking into consideration boththe wording of the laws, but most importantly theiraccompanying explanations [9].An E-Type software system continuously changes

over time (I) obeying a complex feedback-based evo-lution process (VIII). On the one hand, due to theneed for growth and adaptation that acts as positivefeedback, this process results in an increasing func-tional capacity of the system (VI), produced by agrowth ratio that is slowly declining in the long term(V). The process is typically guided by a pattern ofgrowth that demonstrates its self-regulating nature:growth advances smoothly; still, whenever there areexcessive deviations from the typical, baseline rateof growth (either in a single release, or accumulatedover time), the evolution process obeys the needfor calibrating releases of perfective maintenance,i.e., code restructuring and documentation for bet-ter maintainability and comprehension (expressedvia minor growth and demonstrating negative feed-back) to stop the unordered growth of the system’scomplexity (III). On the other hand, to regulate theever-increasing growth, there is negative feedback inthe system controlling both the overall quality of the

system (VII), with particular emphasis to its inter-nal quality (II). The effort consumed for the aboveprocess is typically constant over phases, with thephases disrupted with bursts of effort from time totime (IV).

2.1. Lehman’s Laws & Related Empirical Studies

Software evolution is an active research field formore than 40 years and concerns different levelsof abstraction, including the software architecture[11], design [12] and implementation [10]. Lehman’stheory of software evolution is the cornerstone ofthe efforts that have been performed all these years.For a detailed historical survey on the evolution ofLehman’s theory and other related works the in-terested reader can refer to [10]. Following, webriefly discuss the milestones and key findings thatresulted from these efforts.

Lehman’s theory of software evolution was firstintroduced in the 70’s. Back then, the theory in-cluded the first three laws, concerning the continu-ous change, the increasing complexity and the selfregulating properties of the software evolution pro-cess [7]. The experimental evidence that producedthese laws was based on a single case study, namelythe OS/360 operating system. During the 70’s andthe 80’s the formulation of the first three laws hasbeen revised, with respect to further results and em-pirical observations that came up [13]. Moreover,Lehman’s theory has been extended with the forthand the fifth law that concerned the issues of orga-nizational stability and conservation of familiarity[13]. In the 90’s, based on additional case stud-ies, the laws have been revised again and extendedwith the last three laws, referring to the continu-ous growth, the declining quality and to the feed-back mechanism that governs the evolution process[14, 8, 15]. Lehman’s theory did not grow sincethen, the set of laws has been stabilized, and mostof the activity around them concerned moderatechanges in their formulation, performed in the 00’s[9].

During all these years there have also been stud-ies by other authors on the validity of the laws[16, 17]. An interesting finding uncovered fromthese efforts is that the behavior of commercial soft-ware differs from that of academic and research soft-ware, with the former kind being much more faith-ful to the laws, compared to the latter two kinds.The partially validity of the laws is also highlightedin [18], along with the need for a more formal frame-

3

Page 4: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

work that would facilitate the assessment of thelaws.The diverse behavior of software concerning the

validity of Lehman’s laws is emphasized in sub-sequent studies that investigated the evolution ofopen source software. Most of these studies foundonly partial support for the validity of the laws.The efforts in this line of research vary from thepioneer studies of Godfrey and Tu [19, 20], fo-cusing mainly on Linux, to large scale studies[21, 22, 23, 24]. The common ground in all thesestudies is that they found support for the laws ofcontinuing change and growth. [23, 25] concludedin the validation of more laws, including the onesof self-regulation and conservation of familiarity.Moreover, [26] revealed that the laws may be validafter a certain point in the software lifecycle. Inparticular, taking a step further from the effortsof Godfrey and Tu, [26] found that after a certainversion the evolution of Linux follows, at least par-tially, most of the laws.

2.2. Empirical Studies on Database Evolution

Being at the very core of most software,databases are also subject to evolution, which con-cerns changes in their contents and, most impor-tantly, their schemas. Database evolution can con-cern (a) changes in the operational environment ofthe database, (b) changes in the content of thedatabases as time passes by, and (c) changes inthe internal structure, or schema, of the database.Schema evolution, itself, can be addressed at (a)the conceptual level, where the understanding ofthe problem domain and its representation via anER schema evolves, (b) the logical level, where themain constructs of the database structure evolve(for example, relations and views in the relationalarea, classes in the object-oriented database area,or (XML) elements in the XML/semi-structuredarea), and, (c) the physical level, involving dataplacement and partitioning, indexing, compression,archiving etc.Interestingly, the related literature on the actual

mechanics of schema evolution includes only a fewcase studies, as the research community would findit very hard to obtain access to monitor databaseschemata for an in depth study over a significantperiod of time. Despite the fact that in our workwe study schema evolution at the logical level ofdatabases in open-source software, here, we proceedto survey all the works we are aware about in thebroader area of schema evolution.

The first paper [1] discusses the evolution of thedatabase of a health management system over a pe-riod of 18 months, monitored by a tool specificallyconstructed for this purpose. A single databaseschema was examined, and the monitoring revealedthat all the tables of the schema were affected andthe schema had a 139% increase in size for relationsand 274% for attributes. The consequences of thisevolution were significantly large as a cumulative45% of all the names that were used in the querieshad to be deleted or inserted.

Fifteen years later, the authors of [3] made ananalysis on the database back-end of MediaWiki,the software that powers Wikipedia. The studyconducted over the versions of four years, revealed a100% increase in schema size, the observation thataround 45% of changes do not affect the informationcapacity of the schema (but are rather index adjust-ments, documentation, etc), and a statistical studyof lifetimes, change breakdown and version com-mits. Special mention should be made to this lineof research [27], as it is based on PRISM (recentlyre-engineered to PRISM++ [28]), a change man-agement tool, that provides a language of SchemaModification Operations (SMO) (that model thecreation, renaming and deletion of tables and at-tributes, and their merging and partitioning) to ex-press schema changes. More importantly, the peo-ple involved in this line of research should be cred-ited for providing a large collection of links2 foropen source projects that include database support.

A work in the area of data warehousing [2] moni-tored the evolution of seven ETL scripts along withthe evolution of the source data. The experimen-tal analysis of the authors is based in a six-monthmonitoring of seven real-world ETL scenarios thatprocess data for statistical surveys. The findingsof the study indicate that schema size and modulecomplexity are important factors for the vulnerabil-ity of an ETL flow to changes. This work has beenpart of an effort to provide what-if analysis facili-ties to the management of schema evolution via theHecataeus tool (see [29, 30]).

Finally, certain efforts studied the evolution ofdatabases, while taking into account the applica-tions that use them. In particular, in [5] the authorsconsidered 4 case studies of embedded databases(i.e., databases tightly coupled with correspondingapplications that rely on them) and studied the

2http://yellowstone.cs.ucla.edu/schema-evolution/index.php/Benchmark Extension

4

Page 5: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

different kinds of changes that occurred in thesecases. Moreover, they performed a respective fre-quency and timing analysis, which showed that thedatabase schemas tend to stabilize over time. In [4],the authors focused on two case studies. The resultsof this effort revealed that database schema changesand that the source code of dependent applicationsdoes not always evolve in sync with changes tothe database schema. [4] further provides a dis-cussion concerning the impact of database schemachanges on the application code. [6] takes a stepfurther with an empirical study of the co-evolutionof database schemas and code. This effort inves-tigated ten case studies. The results indicate thatdatabase schemas evolve frequently during the ap-plication lifecycle, with schema changes implying asignificant amount of code level modifications.

2.3. Novelty With Respect to the State of the Art

Going beyond the related literature on soft-ware evolution, in general, and database evolu-tion, in particular, our CAiSE’14 paper [31] inves-tigated for the first time patterns and regularitiesof database evolution, based on Lehman’s laws. Tothis end, we conducted a large scale case study ofeight databases, that are parts open-source softwareprojects. This paper extends our prior work withfurther details concerning the intuition and the rel-evance of the laws in the case of databases, themetrics that have been used in the literature forthe assessment of the laws, and the metrics thatwe employed in the case of databases. More im-portantly, we provide detailed presentations of theresults and thorough discussions of our findings.

3. Experimental Setup of the Study

Datasets. We have studied eight databaseschemata from open-source software projects. Fig-ure 1 lists the datasets along with some interestingproperties.ATLAS3 is a particle physics experiment at the

Large Hadron Collider at CERN, Geneva, Switzer-land with the goal of learning about the basic forcesthat have shaped our universe. ATLAS Trigger isthe software responsible for filtering the immensedata (40 TB per second) collected by the Colliderand storing them in its Oracle database.

3http://atlas.web.cern.ch/Atlas/Collaboration/

BioSQL4 is a generic relational model coveringsequences, features, sequence and feature anno-tation, a reference taxonomy, and ontologies (orcontrolled vocabularies) from various sources suchas GenBank or Swissport. While originally con-ceived as a local relational store for GenBank,the project has since become a collaboration plat-form between the Open Bioinformatics Founda-tion (OBF) projects (including BioPerl, BioPython,BioJava, and BioRuby). The goal is to build a suf-ficiently generic schema for persistent storage of se-quences, features, and annotation in a way that isinteroperable between these Bio* projects.

Ensembl is a joint scientific project between theEuropean Bioinformatics Institute (EBI)5 and theWellcome Trust Sanger Institute (WTSI)6 whichwas launched in 1999 in response to the imminentcompletion of the Human Genome Project. Thegoal of Ensembl was to automatically annotate thethree billion base pairs of sequences of the genome,integrate this annotation with other available bio-logical data and make all this publicly available viathe web. Since the launch of the website, manymore genomes have been added to Ensembl andthe range of available data has also expanded toinclude comparative genomics, variation and regu-latory data.

MediaWiki7 was first introduced in early 2002 bythe Wikimedia Foundation along with Wikipedia,and hosts Wikipedia’s content since then. As anopen source system (licensed under the GNU GPL)written in PHP, it was also adopted by many com-panies and is used in thousands of websites both asa knowledge management system, and for collabo-rative group projects.

Coppermine8 is a photo gallery web application.OpenCart9 is an open source shopping cart system.PhpBB10 (PHP Bulletin Board) is an Internet fo-rum package written in PHP. TYPO311 is a webcontent management framework based on PHP. Allthese platforms are highly rated and used.

Dataset Collection and Processing. A firstcollection of links to available datasets was made bythe authors of [3], [27]12; for this, these authors de-

4http://www.biosql.org/wiki/Main Page5https://www.ebi.ac.uk/6https://www.sanger.ac.uk/7https://www.mediawiki.org/wiki/MediaWiki8http://coppermine-gallery.net/9http://www.opencart.com

10https://www.phpbb.com/11http://typo3.org/12http://data.schemaevolution.org

5

Page 6: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 1: The datasets employed in our study

serve honorable credit. We isolated eight databasesthat appeared to be alive and used (as already men-tioned, some of them are actually quite prominent).For each dataset, we have gathered the schema ver-sions (DDL files) that were available at June 2013,directly from public source code repositories (cvs,svn, git) for the eight data sets listed in Fig. 1.We have targeted main development branches andtrunks to maximize the validity of the gathered re-sources. We are interested only on changes of thedatabase part of the project as they are integratedin the trunk of the project. Hence, we collected allthe commits of the trunk or master branch thatwere available at the time, and ignored all otherbranches of the project, as well as any commits ofother modules of the project that did not affect thedatabase.

For all of the projects, we focused on their re-lease for MySQL (except ATLAS Trigger, availableonly for Oracle). Those files where then renamedwith their filenames matching to the date (in stan-dard UNIX time) the commit was made. The fileswere then processed in sequential pairs from ourtool, Hecate, to give us in a fully automated way(a) the differences between two subsequent com-mits and (b) the measures we needed to conductthis study. Attributes are marked as altered if theyexist in both versions and their type or participa-tion in their tables’s primary key changed. Tablesare marked as altered if they exist in both ver-sions and their contents have changed (attributesinserted/deleted/altered).

All the datasets used, along with ourtool-suite for managing the evolution ofdatabases can be found in our group’s git:https://github.com/DAINTINESS-Group.

4. Assessing the Laws for Schema Evolution

The laws of software evolution where developedand reshaped over forty years. Explaining each lawin isolation from the others is precarious, as it riskslosing the deeper essence and inter-dependencies ofthe laws [9]. To this end, in this section, we organizethe laws in three thematic areas of the overall evolu-tion management mechanism that they reveal. Thefirst group of laws discusses the existence of a feed-back mechanism that constrains the uncontrolledevolution of software. The second group discussesthe properties of the growth part of the system, i.e.,the part of the evolution mechanism that accountsfor positive feedback. The third group of laws dis-cusses the properties of perfective maintenance thatconstrains the uncontrolled growth, i.e., the part ofthe evolution mechanism that accounts for negativefeedback. To quantitatively support our study, weutilize the following measures:

• Schema size of a version: The number of tablesof a schema version.

• Schema Growth: The difference between theschema size of two (typically subsequent) ver-sions (i.e., new - old).

• Heartbeat : a sequence of tuples, one per tran-sition, with the count of the events that oc-curred during this transition. In the con-text of this paper, for each transition betweentwo subsequent versions, we produce a tupleof measures including Table Insertions, Ta-ble Deletions, Attribute Insertions, AttributeDeletions, Attribute Alternations (change ofdata type), Attributes Inserted at Table For-mation, Attribute Deletions at Table Removal.To clarify, Attribute Insertions concern addi-tions of attributes to an existing table, whereasAttributes Inserted at Table Formation con-cern the number of attributes generated when-ever a new table is born. Attribute Deletionsconcern deletions from a table that continuesto exist, whereas Attribute Deletions at Ta-ble Removal concern attributes that are re-moved whenever their containing table is re-moved. We sum up these measures per transi-tion, to produce the Heartbeat of the lifetimeof the data set.

We would like to remind the reader that we studythe evolution of the logical schema of databases in

6

Page 7: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 2: Combined demonstration of heartbeat (number of changes per version) and schema size (no. of tables). The left axissignifies the amount of change and the right axis the number of tables.

7

Page 8: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 3: Combined demonstration of heartbeat (continued)

8

Page 9: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 4: Combined demonstration of heartbeat (continued)

open-source software. In all our deliberations, wetake the above context as granted and avoid re-peating it for reasons of better presentation of ourresults.

4.1. Is There a Feedback-based System for SchemaEvolution?

4.1.1. Law of continuing change (Law I)

The first law argues that the system continuouslychanges over time.

An E-type system must be continuallyadapted or else it becomes progressivelyless satisfactory in use.

The main idea behind this law is simple: as thereal world environment evolves, the software that is

intended to address its problems has to evolve too.If this does not happen, the system becomes lesssatisfactory.

Metrics for the assessment of the law’s validity.To establish the law, one needs to show that thesoftware shows signs of evolution as time passes.Possible metrics from the field of software engi-neering [23] include (a) the cumulative number ofchanges and (b) the breakdown of changes overtime.

Assessment. To validate the hypothesis that thelaw of continuing change holds, we study the heart-beat of the schema’s life (see Fig. 2 to 4 for a com-bined demonstration of heartbeat and schema size).

With the exception of BioSQL that appearedto be “sleeping” for some years and was later re-

9

Page 10: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

activated, in all other cases, we have changes (some-times moderate, sometimes even excessive) over theentire lifetime of the database schema. An impor-tant observation stemming from the visual inspec-tion of our change-over-time data, is that the termcontinually in the law’s definition is challenged: weobserve that database schema evolution happens inbursts, in grouped periods of evolutionary activity,and not as a continuous process! Take into accountthat the versions with zero changes are versionswhere either commenting and beautification takesplace, or the changes do not refer to the informationcapacity of the schema (relations, attributes andconstraints) but rather, they concern the physicallevel properties (indexes, storage engines, etc) thatpertain to performance aspects of the database.Can we state that this stillness makes the schema

“unsatisfactory” (referring back to the wording ofthe first law by Lehman)? We believe that theanswer to the question is negative: since the sys-tem hosting the database continues to be in use,user dissatisfaction would actually call for continu-ous growth of the database, or eventual rejection ofthe system. This does not happen. On the otherhand, our explanation relies on the reference natureof the database in terms of software architecture: ifthe database evolves, the rest of the code, which isbasically using the database (and not vice versa),breaks.Overall, if we account for the exact wording of

the law, we conclude that the law partially holds.

4.1.2. Law of self-regulation (Law III).

The third law of software evolution is known asthe law of ”self regulation” and comes with a laconicdefinition.

Global E-type system evolution is feed-back regulated.

The main idea behind this law is that the systemunder development is actually a feedback-regulatedsystem: development and maintenance take placeand there is positive and negative feedback to thesystem. As the clients of the system request morefunctionality, the system grows in size to addressthis demand; at the same time, as the system grows,corrective and perfective maintenance has to takeplace to remove bugs and improve the internal qual-ity of the software (reduced complexity, increasedunderstandability) [8].Thus, the system’s growth cannot continually

evolve with the same rate; on the contrary, what

Figure 5: Growth (tables) over version id for all the datasets

one expects is to see a typical “baseline” growth,interrupted with releases of perfective maintenance.This trend is so strong, that, in the long run, thesystem’s size demonstrates what the authors of [8]call ”cyclic effects” and the authors of [9] call ”pat-terns of growth”.

Metrics for the assessment of the law’s validity.Whereas the law simply states that the evolution ofsoftware is feedback regulated, its experimental val-idation in the area of software systems is typicallysupported by the observation of a recurring pat-tern of smooth expansion of the system’s size thatis interrupted with releases with size reductions orabrupt growth. Moreover, due to a previous word-ing of the law (e.g., see [8]) that described change tofollow a normal distribution, the experimental as-sessment included the validation of whether growthdemonstrates oscillations around an average value[7, 8, 9]. The ripples in the size of the system are

10

Page 11: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 6: Growth (attributes) over version id for all the datasets; we measure attribute growth as the difference in thetotal number of attributes of all tables, between two subse-quent versions

assumed to indicate the existence of feedback inthe system: positive feedback results in the sys-tem’s expansion and negative feedback involves per-fective maintenance coming with reduced rate ofgrowth (which is not due to functional growth butre-engineering towards better code quality) – if notwith system shrinking (due to removal of unneces-sary parts or their merging with other parts).

Assessment. We organize the discussion of ourfindings around size and growth, both of whichdemonstrate some patterns, although not the onesexpected by the previous literature.

Size. The evolution of size can be observed inFig. 2 to 4. Concerning the issue of a recurring,fundamental pattern of smooth expansion, inter-rupted with abrupt changes or, more generally, ver-sions of perfective maintenance, we have to say

that we simply cannot detect the behaviour thatLehman did (contrast Fig. 2, 3, 4 to the respec-tive figures of articles [7] and [8]): in sharp contrastto the smooth baseline growth that Lehman hashighlighted, the evolution of the size of the stud-ied database schemata provides a landscape witha large variety of sequences of the following threefundamental behaviors.

• In all schemata, we can see periods of increase,especially at the beginning of their lifetime orafter a large drop in the schema size. This isan indication of positive feedback, i.e., the needto expand the schema to cover the informationneeds of the users – especially since the overalltrend in almost all of the studied databases isto see an increase in the schema size as timepasses.

• In all schemata, there are versions with dropsin schema size. Those drops are typically sud-den and steep and usually take place in shortperiods of time. Sometimes, in fact, thesedrops are of significantly larger size than thetypical change. We can safely say that the ex-istence of these drops in the schema size in-dicates perfective maintenance and thus, theexistence of a negative feedback mechanism inthe evolution process.

• In all schemata, there are periods of calmness,i.e., periods of non-modification to the logi-cal structure of the schema. This is especiallyevident if one observes the heartbeat of thedatabase, where changes are grouped to veryspecific moments in time.

Growth and its oscillations. Growth (i.e.,the difference in the size between two subsequentversions) comes with common characteristics in alldatasets. In most cases, growth is small (typicallyranging within 0 and 1). As Fig. 5 demonstrates,we have too many occurrences of zero growth, typ-ically iterating between small non-zero growth andzero growth. Due to perfective maintenance, wealso have negative values of growth (less than thepositive ones). We do not have a constant flowof versions where the schema size is continuouslychanging; rather, we have small spikes between oneand zero. Thus, we have to state that the growthcomes with a pattern of spikes. Due to this charac-teristic, the average value is typically very close tozero (on the positive side) in all datasets, both for

11

Page 12: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 7: Frequency of change values for Ensembl attributes

tables and attributes. There are few cases of largechange too; we forward the reader to Law V for adiscussion and to Fig. 9 for a graphical depiction oftheir characteristics.

The oscillations of growth demonstrates otherpatterns too: it is quite frequent, especially at theattribute level, to see sequences of oscillations oflarge size: i.e., an excessive positive delta followedimmediately by an excessive negative growth (seeFig. 9). We do, however, observe the oscillationsbetween positive and negative values (remember,the average value is very close to zero), much moreon the positive side, however, with several occasionsof excessive negative growth (clearly demonstratingperfective maintenance).

We would like to put special emphasis to the ob-servation that change is small . In terms of tables,growth is mostly bounded in small values. Thisis not directly obvious in the charts, because theyshow the ripples; however, almost all numbers arein the range of [-2..2] – in fact, mostly in the range[0..2]. Few abrupt changes occur. In terms of at-tributes (Fig. 6), the numbers are higher, of course,and depend on the dataset. Typically, those valuesare bounded within [-20,20]. However, the devia-tions from this range are not many.

In the course of our deliberations, we have ob-served a pattern common in all datasets: there isa Zipfian model in the distribution of frequencies.Observe Fig. 7 that comes with two parts, both de-picting how often a growth value appears in theattributes of Ensemble. The x-axis keeps the deltasize and the y-axis the number of occurrences ofthis delta. In the upper part we include zeros in thecounting (343 occurrences out of 528 data points)and in the lower part we exclude them (to showthat the power law does not hold only for the mostpopular value). We observe that there is a smallrange of deltas, between -2 and 4 that takes up 450changes out of the 528. This means that, despitethe large outliers, change is strongly biased towardssmall values close to zero.

In fact, both phenomena observed here, i.e., (a)the bounded small change around zero, (b) followinga Zipfian distribution of frequencies, constitute twoof the patterns that are global to all datasets andwithout any exceptions whatsoever.

Despite the fact that change does not follow thepattern of baseline smooth growth of Lehman andthe fact that change obeys a Zipfian distributionwith a peak at zero, we believe that the presence offeedback in the evolution process is clear; thus thelaw holds.

4.1.3. Law of feedback system (Law VIII)

The eighth law of software evolution is known asthe law of “Feedback System”.

E-type evolution processes are multi-level,multi-loop, multi-agent feedback systems.

The main idea around this law refers to the factthat original “observation has shown that the sys-tem behaves as self-stabilizing feedback system”[14]. There is a big discussion in the literature onvarious components and actors whose interactionslimit and guide the possible ways via which the sys-tem can evolve. We refer the interested reader to [9]for this. From our part, we do not presume to fullyknow the mechanics that constraint the growth ofa database schema. However, we can focus to thepart that there is indeed a mechanism that stabi-lizes the tendency for uninterrupted growth of theschema – and in fact we can try to assess whetherthis is a regressive mechanism whose behavior canbe generally estimated.

Metrics for the assessment of the law’s validity.To assume the law as valid we need to establish

12

Page 13: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

that it is possible to simulate the evolution of theschema size via an accurate formula. Following [8]and [15], we will perform regression analysis to es-timate number of relations for each version of theschema. We adopt the formulas found at [8] and [15]on the relationship of the new size of the system asa function of the previous size of it, adapted viaan ”inverse square” feedback effect. The respectiveformula is:

Si = Si−1 +E

S2i−1

E = avg(Ei), i = 1 . . . n

(1)

where S refers to the estimated system size and Eis a model parameter approximating effort. Relatedliterature [8] suggests computing E as the averagevalue of individual Ei, one per transition. To es-timate these individual effort approximations, Ei,the authors of [8] suggest two alternative formulae:

Ei = (si − si−1) · s2i−1 (2)

Ei =si − s1∑i−1j=1

1s2j

(3)

Assessment. We now move on to discuss whatseems to work and what not for the case of schemaevolution. We will use the OpenCart data set asa reference example; however, all datasets demon-strate exactly the same behavior.The main challenge with formula (1) is the esti-

mation of E. As a first step, we have generalizedthe formulae (2) and (3) via a parameterized ex-pression:

Ei =si − sα∑i−1j=α

1s2j

(4)

where si refers to the actual size of the schema atversion i and α refers to the version from whichcounting starts. The model of [8] comes with twovalues for α, specifically (i) α = i − 1 for formula2, and (ii) α = 1 for formula 3. The essence of theformula is that, to compute Ei, we use α previousversions to estimate effort.Then we began our assessment. First, we as-

sessed the formulae of [8]. In this case, we com-pute the average E of the individual Ei over the en-tire dataset. We employ four different values for α,specifically i−1 (last version), 1 (for the entire dataset) and 5 and 10 for the respective past versions.

We depict the result in Fig. 8(top), where the actualsize is represented by the blue solid line. The resultsindicate that the approximation modestly succeedsin predicting an overall increasing trend for all fourcases, and, in fact, all four approximations targetedtowards predicting an increasing tendency that theactual schema does not demonstrate. At the sametime, all four approximations fail to capture the in-dividual fluctuations within the schema lifetime.

Then, we tried to improve on this result, and in-stead of computing E as the total average over allthe values of the dataset, we compute it as the run-ning average (not assuming a global average, buttuning the average effort with every added release).In this case, depicted in Fig. 8(middle), the resultsare less satisfactory than our first attempt.

After these attempts, we decided to alter thecomputation of E again. A better estimation oc-curred when we realized that back in 1997 peopleconsidered that the parameter E was constant overthe entire lifetime of the project; however, later ob-servations (see [9]) led to the revelation that theproject was split in phases. So, for every versioni, we compute E as an average over the last τ Ej

values, with small values for τ (1/5/10) – contrastthis to the previous two attempts where E was com-puted as a total average over the entire dataset (i.e.,constant for all versions) or a running average fromthe beginning of the versions till the current one.

So, the main formula of the law is restated (andactually generalized), by replacing a global parame-

ter E with a varying parameter Eithat can change

per version (thus the superscript notation signifiesthe value of the effort estimation at version i). Theversions used for this calculation are within therange [τs, τe].

Si = Si−1 +E

i

S2i−1

, Ei= avgτ

e

j=τs(Ej) (5)

For the three simulation attempts that we haverun, we have the following configurations:

Method Values for [τs, τe]global average τs: 1 τe: nrunning average τs: 1 τe: i− 1

last τ v. (τ ∈ {1, 5, 10}) τs: i-τ τ e: i− 1

We also decided to use the last 5 or 10 versionsto compute Ei, i.e., α is 5 or 10. This has alreadybeen used in the past experiments too.

13

Page 14: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 8: Actual and estimated schema size for OpenCart viaa total(top), running(middle) or bounded(bottom) averagesof individual Ei

As we can see in Fig. 8(bottom), the idea of com-puting the average E with a short memory of 5 or 10versions produced extremely accurate results. Thisholds for all data sets. This observation also sug-gests that, if the phases that [9] mentioned actually

exist for the case of database schema, they are re-ally small, or non-existent, and a memory of 5-10versions is enough to produce very accurate results.The fact that this works with τ=1, and in fact, bet-ter than the other approximations is puzzling andcounters the existence of phases.

We do not have a convincing theory as to whythe formula works. We understand that there areno constants in the feedback system and in fact,the feedback mechanism needs a second feedbackloop, with a short memory for estimating the modelparameter E. In plain words, this signifies thatboth size and effort approximation are intertwinedin a multi-level feedback mechanism.

Overall, the evolution of the database schema ap-pears to obey the behavior of a feedback-based mech-anism, as the schema size of a certain version of thedatabase can be accurately estimated via a regres-sive formula that exploits the amount of changes inrecent, previous versions.

4.2. Properties of Growth for Schema Evolution

Growth occurs as positive feedback to the system,in an attempt to expand the system with more func-tionality, or address new assumptions that make itsoperation acceptable, e.g., new user requirements,an evolving operational environment, etc. In thissubsection, we study the properties of the growth.

4.2.1. Law of continuing growth (Law VI).

The sixth law of software evolution is known asthe law of “Continuing Growth”.

The functional capability of E-type sys-tems must be continually enhanced tomaintain user satisfaction over system life-time.

The sixth law resembles the first law (continuingchange) at a fist glance; however, as explained in[8], these two laws cover different phenomena. Thefirst law refers to the necessity of a software systemto adapt to a changing world. The sixth law refersto the fact that a system cannot include all theneeded functionality in a single version; thus, dueto non-elastic time and resource constraints, severaldesired functionalities of the system are excludedfrom a version. As time passes, these functionali-ties are progressively blended in the system, alongwith the new requirements stemming from the firstlaw’s context of an evolving world. As [9] eloquently

14

Page 15: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

states “the former is primarily concerned with func-tional and behavioural change, whereas the latterleads, in general, directly to additions to the exist-ing system and therefore to its growth”.Metrics for the assessment of the law’s validity.

Possible metrics for the sixth law that come fromthe software engineering community [23] include:LOC, number of definitions (of types, functions andglobal variables) and number of modules. We ex-press again a point of concern here: it is impossibleto discern, from this kind of ”black-box” measure-ments, the percentage of change that pertains to thecontext of the law of continuing growth. Ideally, oneshould count the number of recorded ”ToDo” func-tionalities blended within each version. However,we do recognize that this task is extremely hard toautomate at a large scale. In our case, as we mainlyrefer to information capacity rather than physicallevel schema properties, we can utilize the schemasize as a safe measure of observing ”additions to theexisting system”.Assessment. In all occasions, the schema size in-

creases in the long run (Fig. 2 to 4). We frequentlyobserve some shrinking events in the timeline ofschema growth in all data sets. However, all datasets demonstrate the tendency to grow over time.At the same time, we also have differences from

traditional software systems: as with Law I, theterm ”continually” is questionable. As alreadymentioned (refer to Law III and Fig. 2 to 4), changecomes with frequent (and sometimes long) periodsof calmness, where the size of the schema does notchange (or changes very little). Calmness is clearlya phenomenon not encountered in the study of tra-ditional software systems by Lehman and acquiresextra importance if one also considers that in ourstudy we have isolated only the commits to thefiles with the database schema and not the com-mits to the entire information system that uses it:this means that there are versions of the system,for which the schema remained stable while the sur-rounding code changed.Therefore we can conclude that the law holds

(the information capacity of the database schemais enhanced in the long run), albeit modified to ac-commodate the particularities of database schemata(changes are not continuous but rather, they comewithin large periods of calmness).

4.2.2. Law of conservation of familiarity (Law V).

The fifth law of software evolution is known asthe law of ”Conservation of Familiarity”.

In general, the incremental growth(growth ratio trend) of E-type systems isconstrained by the need to maintain famil-iarity.

As the system evolves, all the stakeholders thatare associated to it (developers, users, managers,etc) must spend effort to understand and actually,master its content and functionality. Wheneverthere is excessive growth in a version, the feedbackmechanism tends to diminish the growth in sub-sequent versions, so that the change’s contents areabsorbed by people. Interestingly, whereas the orig-inal form of the law refers to a constant (”statisti-cally invariant”) rate, the new version of the lawis accompanied by explanations strongly indicat-ing a ”long term decline in incremental growth andgrowth ratio . . . of all release-based systems studied”[9]. This result came as experimental evidence fromthe observation of several systems, accompanied bythe anecdotal evidence of a growing imbalance involume in favor of corrective versus adaptive main-tenance. [23] and [15] also give a corollary of the lawstating that versions with high volume of changesare followed by versions performing corrective orperfective maintenance.

Metrics for the assessment of the law’s valid-ity. [9] gives a large list of possible metrics: ob-jects, lines of code, modules, inputs and outputs, in-terconnections, subsystems, features, requirements,and so on. [23] propose metrics that include: (i) thegrowth of the system, (ii) the growth ratio of thesystem, and (iii) the number of changes performedin each version. We align with these tactics and usethe schema growth of the involved datasets.

To validate the law we need to establish the fol-lowing facts:

• The growth of the schema is not increasing overtime; in fact, it is –at best- constant or, morerealistically, it declines over time/version. Aquestion, typically encountered in the litera-ture, is: ”What is the effect of age over thegrowth and the growth ratio of the schema?”Is it slowly declining, constant or oblivious toage? To address this question, we produce alinear interpolation of the growth per data setto show its overall trend (Fig. 5).

• Another question of interest in the relatedliterature is: ”What happens after excessivechanges? Do we observe small ripples ofchange, showing the absorbing of the change’s

15

Page 16: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 9: Different patterns of change in attribute growth ofMediawiki (over version-id, concealed for fig. clarity)

impact in terms of corrective maintenance anddeveloper acquaintance with the new version ofthe schema?” In this case, the pattern we cantry to establish is that abrupt changes are fol-lowed by versions where developers absorb theimpact of the change and produce minor modi-fications/corrections, thus resulting in versionswith small growth following the version withsignificant difference in size.

Assessment. Before proceeding, we would liketo remind the reader on the properties of growth,discussed in Law III of self-regulation: the changesare small, come with spike patterns between zeroand non-zero deltas and the average value of growthis very close to zero (from the positive side).

Concerning the ripples after large changes, wecan detect several patterns. Observe Fig. 9, de-picting attribute growth for the MediaWiki dataset.Due to the fact that this involves the growth of at-tributes, the phenomena are amplified compared tothe case of tables. Reading from right to left, we cansee that there are indeed cases where a large spikeis followed by small or no changes (case 1). How-ever, within the small pool of large changes thatexist overall, it is quite frequent to see sequencesof large oscillations one after the other, and quitefrequently being performed around zero too (case2). In some occurrences, we see both (case 3).

Concerning the effect of age, we do not see a di-minishing trend in the values of growth; however,age results to a reduction in the density of changesand the frequency of non-zero values in the spikes.

This explains the drop of the growth in almost allthe studied data sets (Fig. 5): the linear interpola-tion drops; however, this is not due to the decreaseof the height of the spikes, but due to the decreaseof their density.

The heartbeat of the systems tells a similar story:typically, change is quite more frequent in the be-ginning, despite the fact that existence of largechanges and dense periods of activities can occurin any period of the lifetime. Fig. 2 to 4 clearlydemonstrate this by combining schema size and ac-tivity. This trend is typical for almost all of thestudied databases. phpBB is the only exception,demonstrating increased activity in its latest ver-sions with the schema size oscillating between 60and 63 tables, which is actually a very small differ-ence (as all figures are fitted to show the lines asclearly as possibly, they can be deceiving as to theamount of change – phpBB is such a case).

Concerning the validity of the law, we believethat the law is possible but not confirmed. The lawstates that the growth is constrained by the needto maintain familiarity. However, the peculiarity ofdatabases, compared to typical software systems, isthat there can be other good reasons to constraingrowth, such as the high degree of dependence ofother modules from the database. Therefore, con-servation of familiarity, although important, cannotsolely justify the limited growth. The extent of thecontribution of each reason is unclear.

4.2.3. Law of conservation of organizational stabil-ity (Law IV).

The fourth law of software evolution is known asthe law of ”Conservation of Organizational Stabil-ity” also known as law of the ”invariant work rate”.

The work rate of an organization evolv-ing an E-type software system tends to beconstant over the operational lifetime ofthat system or phases of that lifetime.

This is the only law with a fundamental changebetween the two editions of 1996 and 2006. Theprevious form of the law did not recognize phasesin the lifetime of a project (”The average effectiveglobal activity rate in an evolving E-type systemis invariant over product lifetime”). Plainly put,the law states that the impact of any managerialactions to improve productivity is balanced by theincreasing complexity of software as time passes aswell as the role of forces external to the software(availability of resources, personnel, etc).

16

Page 17: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Metrics for the assessment of the law’s validity.As [23] excellently states, it is very hard to as-sess effort from the data that we can typical ac-quire from a project, as ”effort does not equateprogress”. Therefore, we can only approximate thework rate by observing the published versions of asystem. Possible metrics [23] include (i) the numberof changes per version, (ii) the average number ofchanges per day, and (iii) change and growth ratios.

To validate the law of conservation of organi-zational stability, we need to establish that theproject’s lifetime is divided in phases, each of which(a) demonstrates a constant growth, and, (b) is con-nected to the next phase with an abrupt change.Moreover, abrupt changes should occur from timeto time and not all the time (resulting in extremelyshort phases).

Assessment. If we focus on the essence of the law,we can safely say that it does not hold. The heart-beats of Fig. 2 to 4 and the arbitrary sequencingof spikes and calmness (Fig. 5, 9) make it impossi-ble to speak about constant growth, even in phases.The open-source nature of our cases plays a role tothat too [23].

4.3. Perfective Maintenance for Schema Evolution

Lehman has indicated the battle between twoantagonizing processes over a fixed amount of re-sources for the maintenance of software [14]: onthe one hand, the need to evolve the system (“sys-tem growth”) and on the other the “anti-regressive”effort to attack the growing complexity of the sys-tem. To achieve this, perfective maintenance mustbe performed from time to time, in order to re-move redundant code, to restructure code for bettermaintainability and comprehension, to documentthe code, etc. As [9] puts it: “these activities haveminor or no impact in functionality, performanceor other properties of the software in execution”.In this subsection, we are interested in the perfec-tive maintenance part and we adopt the [32] defini-tion (emphasis is ours): “modification of a softwareproduct after delivery to provide enhancements forusers, improvement of program documentation, andrecoding to improve software performance, main-tainability or other software attributes”.

4.3.1. Law of increasing complexity (Law II).

The second law of software evolution is known asthe law of “increasing complexity”.

As an E-type system is changed its com-plexity increases and becomes more diffi-cult to evolve unless work is done to main-tain or reduce the complexity.

The law states that complexity increases withage, unless effort is taken to prevent this. The ra-tionale behind verifying the law dictates the ob-servation of (a) an increasing trend in complexityof a software system, battled by (b) a perfectivemaintenance activity that attempts to reduce it anddemonstrated by drops in the system size and rateof expansion.

Metrics for the assessment of the law’s validity.Since we will ultimately resort to measurementsfor verifying the law, before proceeding further, weneed to confront a fundamental problem: the law’sdefinition -as it stands- requires a more precise def-inition of complexity. Unfortunately, complexity isa meta-property, practically involving a wide spec-trum of specific measureable properties of software.To give an example, Fenton and Pfleegler [33] men-tion four kinds of complexity: (i) problem com-plexity (computational complexity of the underly-ing problem), (ii) algorithmic complexity (of the al-gorithm eventually implemented to solve the prob-lem), (iii) structural complexity (typically measuredas the control flow or class hierarchy or modularitystructure) and (iv) cognitive complexity (measur-ing the effort required to understand the software).Lehman and Ramil [9] take a more process-orientedapproach and refer to application and functionalcomplexity, specification and requirements complex-ity, architectural complexity, design and implemen-tation complexity and structural complexity.

Unfortunately, all the above are very hard to de-fine and measure, especially if measurement is tobe performed on evidence automatically extractedfrom electronic logs or version management sys-tems. The automatic isolation of the subset ofchanges that pertain to perfective maintenance isan interesting and vast topic of research; for the mo-ment, however, it appears that we will have to re-sort to approximations. Related literature is basedon such approximations (see for example, [34]). No-tably, in the latest of Lehman’s series of papers, thelaw is supported via rationalization: the complexityincrease that age brings to a system is consideredresponsible for the decline of the growth ratio overtime (laws V and VI).

To surpass all these difficulties, we will try to as-sess the validity of the law based on the combination

17

Page 18: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

of the following observations:First, we will focus on the essence of the law: ulti-

mately, the law requires identifying releases or ver-sions where perfective maintenance is performed.To actually achieve with 100% certainty wouldrequire some project management documentationthat this is performed. Thus, we resort to the clos-est possible approximation and try to detect ver-sions with drops in the size and the growth of thesystem. Assuming that the overall trend of the sys-tem is to grow, the existence of such points fromtime to time will give a strong indication of thelaw.A second indication for the validity of the law is

the respect of the VIII law of feedback, i.e., the ex-istence of a regressive formula to which the size ofthe system conforms. The validity of this law wouldstrongly insinuate the existence of a feedback-basedsystem and therefore, the existence of negative feed-back as discussed in this second law of evolution.Third, we take a definition already found in

Lehman [7] and [34] and attempt to approximatethe measurement of complexity as the fraction ofthe evolution-affected relations (i.e., the number ofrelations modified or added to the schema) betweentwo subsequent versions of the schema over the dif-ference in the number of relations of the involvedversions. This formula approximates how much ef-fort has been invested in expanding the system overthe actual difference achieved (large values demon-strate too much effort for too small change). So,for each transition, we approximate the complex-ity of the original schema by dividing the extent ofthe involved changes over the actual increment ofthe schema size. To understand this better, assumethat we compare two transitions with the same de-nominator (i.e., difference in number of relations);if one transition had more relations updated thanthe other, it means we paid more effort for thistransition, and thus, we assume that the startingcomplexity is higher. More precisely, we divide theeffort (number of relations that we modified in anyway in a revision), by the growth (size of the resultin that revision). In case the denominator is zero,we have no escape than to define complexity as zero(which is another approximation we cannot avoid).

complexityi ∼relations handled

|Si − Si−1|(6)

Assessment. Related literature typically speaksfor increasing complexity [7], [8], [9], [23], althoughthere have been counterarguments for the case of

Figure 10: Complexity for Coppermine and Ensembl (overversion-id, concealed for clarity)

open source software [35]. In our case, in all thedatasets but phpBB, complexity, as defined in theprevious paragraph, does not increase13 (see Fig. 10,where a linear interpolation of complexity is alsodepicted). The phenomenon must be coupled withthe drop in change density (Law V) and althoughwe cannot provide undisputable explanation, we of-fer the synergy of two causes: (a) the increasing de-pendence of the surrounding code to the databasethat makes developers more cautious to performschema changes as they incur higher maintenancecosts, and, (b) the success of the perfective main-tenance, which results in a clean schema, requiringless corrective maintenance in the future.

Although we cannot confirm or disprove the lawbased on undisputed objective measurements, wehave indications that the second law partially holds,albeit with completely different connotations thanthe ones reported by Lehman for typical softwaresystems: in the case of database schemata, com-plexity, when measured as the fraction of expansioneffort over actual growth, drops.

4.3.2. Law of declining quality (Law VII)

The seventh law of software evolution is knownas the law of “Declining Quality”.

Unless rigorously adapted and evolved totake into account changes in the opera-tional environment, the quality of an E-type system will appear to be declining.

The main idea behind this law concerns the factthat the software will each time be based on as-sumptions on the user requirements or the realworld environment that will progressively be in-valid. As assumptions are invalidated, action mustbe undertaken to maintain the affected software

13In our CAiSE’14 paper [31], we erroneously refer toBioSQL instead of phpBB

18

Page 19: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

parts in order to reflect the actual user needs. Thus,the ageing of the system, along with the increasein complexity, also calls for a reestablishment ofassumptions and functionalities to serve the users’needs. [14] specifically refers to the external qualityof a software system, practically expressing a sys-tem’s quality as ‘user satisfaction’. However, thispoint of view is drastically different in [9], wherethe viewpoint on quality is generalized to all pos-sible kinds of quality an organization might deemnecessary (based on the viewpoint of users, man-agers, developers, each carrying his own interpreta-tion and measures).Metrics for the assessment of the law’s valid-

ity. Possible metrics [23] for the internal qualityof typical software systems include: (i) the numberof known defects associated with each version, (ii)defect density for each version, (iii) percentage ofmodules whose bodies have been changed. Muchlike the authors of [23], however, we are not reallyin a position to fully automate the accurate mea-surement of external quality as perceived by the endusers, the management, etc. It is noteworthy thatLehman and Fernandez-Ramil [15], [9] avoid givingany other support to the law than a logical proof: asthe system expands over time, its complexity risesand thus the addressing of user requirements andremoval of defects becomes more and more difficult,unless work is done to confront the phenomenon(”the decline in software quality with age, appearsto relate to a growth in complexity that must beassociated with ageing”).Assessment. We follow [9] and use logical induc-

tion to assess whether the law holds; specifically,we can assume that the law holds if it is stronglyestablished that the laws of feedback (III, VIII) andcomplexity (II) hold.We have already demonstrated that the ratio-

nale behind complexity increase is not supportedby our observations. At the same time, we can-not assess schema quality with undisputed means.Therefore, we cannot confirm or disprove the lawbased on undisputed objective measurements.

4.4. Threats to Validity

In this subsection, we discuss threats to the va-lidity of our conclusions. We structure our deliber-ations around three kinds of validity threats, specif-ically, construct validity, assessing the appropriate-ness of our measures, internal validity, assessing thepossibility that cause-effect relationships are pro-duced on an erroneous interpretation of causality,

and external validity, assessing the extent to whichour results can be generalized.

4.4.1. Construct Validity

Construct validity concerns the appropriatenessof the employed measures for the theoretical con-structs they purportedly assess. In our case, toassess construct validity, we review the appropri-ateness of the metrics used for each law, also witha view to the metrics used in the studies of softwareevolution. Fig. 11 summarizes our assessment.

I. Continuing change. As the goal is to establishthe continuity of change, the usage of the (accu-rately measured) heartbeat raises no concern aboutits appropriateness and the validity of our results.

II.Increasing complexity. The main metric to as-sess this law is the schema complexity. As we men-tioned before, we do not have a way to accuratelymeasure the complexity of a database schema assimilar studies have done with software’s complex-ity. We approximate the complexity with the effortspent between two schema versions divided by theincrement in size between those versions. The latercan be accurately measured but this is not the casewith the effort. Effort cannot be measured fromthe data that we have extracted for the databasesthat we studied. The only accurate way to measureeffort would be to have the actual man-hours thatevery developer has spent in the development of thedatabase. Moreover, given the fact that databasesare parts of larger software ecosystems, the possi-bility of accurately assessing effort would requirea measure able to differentiate the work done onthe database and the work pertaining to the rest ofthe software system – a possibility which we dimquite slim, in fact. On the other hand, the reason-ing behind the formula used makes sense and it isconsistent with the related literature. Overall, thecomplexity, as we approximate it, poses a threat toour construct validity that we cannot ignore; to alarge extent, this is also due to the abstract wordingof the law. This is also the reason why we are veryskeptic towards verifying the validity of the law inthe case of schema evolution. Future work needs tobe invested in the area for a more solid groundingof automated complexity assessment.

III. Self regulation. To assess this law, we usedschema size and growth as measures. Both met-rics can be accurately measured. The usage of themeasure is consistent with the bibliography and theintuition behind the law.

19

Page 20: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 11: Summary of measures employed per law and their appropriateness

IV. Conservation of organizational stability. Theinvolved metric in order to assess this law is thework rate (and the existence of periods duringwhich it remains constant). As previously men-tioned, work rate cannot be easily measured, basedon the available information To this end, we pri-marily use schema growth as an approximation ofoutput, and secondarily the heartbeat as an ap-proximation of activity, both of which are accurate.Overall, we are satisfied with our choice, as it ap-pears that this is the best possible approximationwe can get from automatically extracted data; atthe same time, we have to acknowledge that it isan approximation and not an undisputed measure-ment of the work rate.V. Conservation of familiarity. The metric used

for the assessment of this law is growth, which isaccurately measured. On the other hand, we haveno way to indisputably know the exact mechanicsbehind the observations; hence, despite the accu-racy of the observations, the law requires furtherelaboration.VI. Continuing growth. For this law, we em-

ployed schema size again, which is accurately ex-tracted by our tools and fit for assessing the law.VII. Declining quality. As schema quality is not

clearly defined in the area of databases, the assess-ment of quality via metrics requires specific stud-ies on the topic, before we are able to converge toa widely accepted solution. Rationalization aboutthe law has typically been used in the related liter-ature as a solution to the problem.VIII. Feedback system. The main measure we

used for assessing this law, is the estimated sizeof the database schema. This measure has previ-ously been used in the case of software evolution,again with an approximation for the measurement

of effort. However, the regression formula used isconsistent with its usage in the bibliography (al-beit with novelty in terms of the memory of thefeedback) and all the results in all data sets aresurprisingly consistent. Therefore, we believe thatthe specific formulae used pose no threat to validity,although a better understanding of the mechanicsbehind the feedback mechanism have to be part offuture studies.

4.4.2. Internal Validity

Internal validity refers to the case where a conclu-sion on the behavior of a dependent variable is madeas a cause-effect relationship with an independentvariable. We are very careful to treat our observa-tions only as such and avoid relating the observedphenomena with specific causes without supportingevidence.

Having said that, we extend the discussion, asthe observant reader might be tempted to intro-duce a cause-effect relationship between age (as acause) and the following phenomena: (a) droppingdensity of change, (b) dropping complexity, and (c)size growth in the long run. We conjecture (butcannot prove) that we could attribute the behav-ior of density and complexity to the existence ofa confounding variable: schema quality, improvingover time due to perfective maintenance and caus-ing the observed behavior. Still, this remains tobe proved with undisputed data and metrics. Forsize, the confounding variable is user requirementsfor more information capacity; although reasonableenough (in our minds, practically certain), this isalso a topic to be proved indisputably by dedicatedstudies.

20

Page 21: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

4.4.3. External Validity

External validity refers to the possibility of gen-eralizing the findings of a study to a broader con-text. Concerning the external validity of our study,we repeat that its context concerns the study ofthe evolution of the logical schema of databases inopen-source software. We avoid generalizing ourfindings to databases operating in closed environ-ments and we stress that our study has focusedonly on the logical structure of databases, avoidingphysical properties (let alone instance-level obser-vations).Concerning the validity of our study within this

context, we believe we have provided a safe, rep-resentative experiment. In this study, we have tar-geted a significant number of database schemas thatserve different purposes in the real world and comewith a quite broad range of time spans. Concerningthe time span, the schemas collected had an ade-quate number of versions from rather few (40) toquite many (500+). Despite these degrees of vari-ability, our findings are consistent in practically allof the datasets (with few exceptions that we men-tioned). Thus we believe that the case of logicaldatabase schema in open source software is well rep-resented.On the other hand, we would be hesitant to gen-

eralize our findings in databases in closed softwareor outside the scope of the logical schema. Open-source software comes with a larger developmentcommunity, and less control on the development ef-fort. This is not the case for closed software, es-pecially when dealing with mission critical compo-nents like databases. At the same time, we have notworked with the information concerning the physi-cal schema or the extension of the studied databasesand thus, we would take the opportunity to warnthe reader not to generalize the results outside thescope of a schema’s information capacity as ex-pressed by the logical-level schema.

5. Discussion

In this section, we summarize fundamental ob-servations and patterns that have been detected inour study. We intentionally avoid the term law,as we do not have unshakeable evidence for theirexplanation: apart from the empirical grounding,due a very large amount of datasets that obey thesame patterns (which we believe we have fairly at-tained), we would require an undisputed rational-

ized grounding, i.e., a clear explanation of the un-derlying mechanism that guides them, also estab-lished on measured, undisputed facts.

In case the reader has skipped our discussionof threats to validity, we clarify once more thatthe context under which our observations are madeconcerns the study of the evolution of the logicalschema of databases in open-source software. Inall our subsequent deliberations, we take the abovecontext as granted and avoid repeating it for rea-sons of better presentation of our results.

Before proceeding, however, to our conclusions,we devote the first part of this subsection to a dis-cussion on the validity of the problem per se.

5.1. Does the problem make sense in the first place?

We start with a fundamental inquiry : is it mean-ingful to try assessing Lehman’s laws for schemaevolution in the first place? Does it make senseto try to observe evolutionary patterns in the wayschemata evolve by following Lehman’s method andlaws?

Surely, there are fundamental differences betweenthe general case of E-type software systems anddatabases in open-source systems. First, whereassoftware systems export functionality to their users,databases, on the contrary, export information ca-pacity, i.e., the ability to store data and answerqueries. Second, databases are not complete andindependent software systems but parts of largerinformation systems. Is it then meaningful, to pur-sue this research?

Again, let us revisit the fundamental lessonlearned by Lehman’s laws: software systems arecomplex, multi-level systems, involving severalstakeholders, that have to evolve or face eviction;this evolution is governed by the antagonism be-tween (a) positive feedback, pushing the system toadapt to new environments and add new functional-ities according to the users’ needs, and, (b) negativefeedback, that constrains the uncontrolled growthand complexity of the system, by imposing perfec-tive maintenance actions that result in an improved,more maintainable internal structure of the system.

Can we replace the term ’software systems’ with’database’ in the above wording? We believe wecan, and the fundamental reason is that the an-tagonism between positive and negative feedback isthere too. On the one hand, a database schemahas to obey the part of the positive feedback andits moderators need to adapt, tune and expand it

21

Page 22: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

over time (and this concerns all kinds of databases,as well as the ones involved in open-source soft-ware). This concerns both the expansion due touser requirements concerning the availability of in-formation and the adaptation to new environments.At the same time, growth cannot be unconstrained:developers of open-source software are highly sen-sitive and attentive when it comes to database-related code, as changes in the database can in-cur both syntactic and semantic failures. Thus,it would be reasonable to expect that leaving theschema grow without any complexity control, espe-cially in an open-source environment where devel-opers are not organized in a strict hierarchy, canresult to maintenance nightmares. The a-posterioriobservations verify this intuition: we do observeschema size contractions, where renamings, restruc-turings and removal of tables and attributes are ev-ident in an attempt to keep schemata clean, under-standable and well-structured.Are databases, then, mini E-type systems with

a life of their own? We should be clear that we donot postulate that databases can be completely iso-lated from the rest of their surrounding ecosystem.Still, studying schema evolution in an attempt todiscover regularities and patterns is certainly worththe effort, given the high degree of dependence ofthe rest of the code over the database structure.With the benefit of the hindsight, we do believethat considering the laws of Lehman as a startingpoint for the study of schema evolution has been alegitimate and rewarding effort as it revealed bothcommonalities (mainly due to the same fundamen-tal feedback mechanism) and differences (due to thespecificities of the database case) with the generaltheory of Lehman’s laws.

5.2. Major Findings

In this section, we provide a critical discussion ofour findings, accompanied by concise summaries,where we also annotate each of our observationswith reference to the law where we have discussedit in detail. Fig. 12 further distils these findings ina single table.

5.2.1. Is the process of schema evolution behavinglike a feed-back based system? (hypothesis ofthe feedback-based process)

We believe that we can indeed claim that schemaevolution is guided by a feedback based mechanism.Positive feedback brings the need to increase the

information capacity of the database, resulting inexpansion of the number of relations and attributesover time. At the same time, there is negative feed-back too, from the need to do some house-cleaningof the schema for redundant attributes or restruc-turing to enhance schema quality. We have alsoobserved that the inverse square models for theprediction of size expansion holds for all the eightschemata that we have studied. However, we donot come with a good explanation as to why thisholds. The supporting observations in this contextcan be listed as follows:

• As an overall trend, the information capacityof the database schema is enhanced – i.e., thesize grows in the long term (VI).

• The existence of perfective maintenance is evi-dent in almost all datasets with the existence ofrelation and attributes removals, as well as ob-servable drops in growth and size of the schema(sometimes large ones). In fact, growth fre-quently oscillates between positive and nega-tive values (III).

• The schema size of a certain version of thedatabase can be accurately estimated via a re-gressive formula that exploits the amount ofchanges in recent, previous versions (VIII).

As in all feedback-based systems, the negativefeedback prevents the uncontrolled growth and re-tains the quality of the schema at a high level,allowing thus the subsequent releases to operatesmoothly. This is a true sign of stability : the systemis maintained adequately to minimize the effects ofits unavoidable subsequent modifications and con-tinue evolving smoothly.

Overall, we can state: Schema evolution demon-strates the behavior of a stable, feedback-regulatedsystem, as the need for expanding its informationcapacity to address user needs is controlled via per-fective maintenance that retains quality; this antag-onism restrains unordered expansion and brings sta-bility.

5.2.2. Hypothesis of schema size expansion (andproperties of its growth)

The size of the schema expands over time, albeitwith versions of perfective maintenance due to thenegative feedback. As already mentioned, the in-verse square model seems to work.

The growth of the database schema does not fol-low a pattern of smooth growth – even considering

22

Page 23: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

Figure 12: Summary of research questions, findings and validity over the datasets

23

Page 24: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

the amendment where phases of constant growthare assumed. The expansion is mainly character-ized by three kinds of phases, including (i) abruptchange (positive and negative), (ii) smooth growth,and, (iii) calmness (meaning large periods of nochange, or very small changes). We observe that inthe case of schema evolution, the schema’s growth(i.e., its change from one version to the follow-ing) mainly occurs with spikes oscillating betweenzero and non-zero values. The changes are typi-cally small, following a Zipfian distribution of oc-currences, with high frequencies in deltas that in-volved small values of change, close to zero.

At the same time, in contrast to the case of soft-ware systems, we observe a very strong inclinationto avoid changes to the database schema. Changein the database impacts surrounding code, so thechange is constrained by the need to minimize thisimpact. So, we frequently see versions with nochange to the information capacity of the schemaand large time periods where the schema is still (oralmost still). Bear in mind that we monitor onlythe subset of versions that pertain to the databaseschema and ignored any versions where the infor-mation system surrounding the database changedwhile the schema remained the same. This enforcesour argument for the tendency towards stillness.

Although we do not believe conservation of famil-iarity to be the only cause, we see that the feedbackmechanism of the evolution demonstrates a reduc-tion in the density of changes as the schema ages..We also observe unexpected patterns of changeswith sequences of high spikes, sometimes oscillat-ing around zero. Such patterns require further in-vestigation for their verification and explanation.The average growth is close to zero, and with thetendency to drop as time passes, not due to the di-minishing of the (already small) deltas, wheneverthey occur, but mainly due to the diminishing oftheir density.

Concerning the size of the system, our support-ing evidence has been already summarized via lawsVI and VIII (see the previous paragraph). Con-cerning the heartbeat of the system, our support-ing evidence for the above statements can be listedas follows:

• The database is not continuously adapted, butrather, alterations occur from time to time (I).

• Change does not follow patterns of constantbehaviour (IV).

• Age results in a reduction of the density ofchanges to the database schema in most cases(V).

Concerning the growth of the system, our sup-porting evidence for the above statements can belisted as follows:

• Growth is typically small in the evolution ofdatabase schemata, compared to traditionalsoftware systems (III). The distribution of oc-currences of the amount of schema change fol-lows a Zipfian distribution, with a predominantamount of zero growth in all data sets. Plainlyput, there is a very large amount of versionswith zero growth, both in the case of attributesand in the case of tables. The rest of the fre-quently occurring values are close to zero, too.

• The average value of growth is typically closeto zero (although positive) (III) and drops withtime, mainly due to the drop in change density(V).

5.2.3. Hypothesis of perfective maintenance to fightcomplexity and user dissatisfaction

We also believe that there is sufficient evidenceto support the claim that perfective maintenance ispart of the process. This is mainly demonstrated bythe drops in the schema size as well as the drops inactivity rate and growth with age. In fact, growthfrequently oscillates between positive and negativevalues (III). Thus, based on simple reasoning, onecan accept the wording of Lehman’s laws on nega-tive feedback, as they both state that quality (in-ternal and external) declines unless confronted.

However, despite the adoption of the hypothesisfor a feedback-based mechanism, we cannot adoptthe corroborating observations of the related liter-ature for software systems that accompany the twolaws of negative feedback (II and VII). In the sys-tems we have studied we observe that age results ina reduction of the complexity to the database schema(II), although we need to remember that the mea-surement of complexity is an approximation. Theinterpretation of the observation is that perfectivemaintenance seems to do a really good job and com-plexity drops with age (in sharp contrast to what isobserved in the related literature for software sys-tems where more and more effort is devoted to bat-tle complexity). Also, in the case of schema evo-lution, activity is typically less frequent with age.Although one can attribute this to the inefficacy of

24

Page 25: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

the approximating measure, we anticipate that itshould mainly be attributed to the truth lying inthe essence of law II: ‘complexity increases unlesswork is done to reduce it’. We conjecture that dueto the criticality of the database layer in the overallinformation system, this process is done with careand achieves the reduction of complexity over time,coming hand in hand with the strong tendency to-wards minimum or no changes to the schema.

As for law VII, as already mentioned, we areeven more hesitant to adopt it, as we are already indoubt towards internal quality and have no actualevidence as to what happens with external quality.

Overall: although our research seems to keep thenegative feedback laws in place in the case of schemaevolution, this is done with (a) a degree of uncer-tainty and (b) with the strong indication of funda-mental differences with E-type program evolution.We would not be surprised if future research estab-lishes with more certainty that the feedback mecha-nism for schema evolution improves the quality andcomplexity of a database as time passes.

5.3. Opportunities for Future Work

There are several opportunities for follow-upwork. As one would normally expect, verifying thefindings of this study with more datasets can fur-ther solidify our confidence to them. The exten-sion of this work to evolution histories of propri-etary databases in closed environments, over largeperiods of time, would be of extreme value; albeitone can only be pessimistic on the possibility of ob-taining such data and being able to publish them.Novel developments in database technology allowthe extension of this kind of study to non-relationaldata too. This includes all kinds of semi-structureddata (evolution of XML data alone is a vast areaof research, where the nesting of the elements pro-vides transformations of the schema that are notpresent in the relational case), but also, the so-called ”NoSQL” data, where structures like graphsand text evolve over time. In the latter case, theidentification of patterns in the evolution of thedata at the instance level is clearly a challengingtopic of research.

A second large area of research concerns the iden-tification of patterns in the correlation of the evo-lution of the database and the evolution of the sur-rounding applications. This involves both the align-ment of the application code to the new schemaand, as a reviewer of this paper has pointed out,

possible workarounds in the code to avoid modi-fying the database. Even more challenging is therelationship of user requirements to database evo-lution. Remember that in order to be able to comeup with results in long histories with many ver-sions, automated processing of the available datais paramount. The possibility of automating theprocessing of tickets, bug reports and to-do lists ina way that can be correlated to the subsequent evo-lution of the database is a topic with a significantamount of technical challenge.

At the same time, the techniques used in thisstudy provide opportunities for improvement. Afirst area of future research concerns the findingsof ageing and complexity (Law II). We need to es-tablish better measures for complexity of databaseschemata and see how this complexity behaves overtime. Similar considerations hold for estimating ef-fort and work-rate by exploiting the available infor-mation in the software repositories as automaticallyas possibly.

Finally, one should also recognize that thesearch for more patterns than the ones offered byLehman’s laws, via traditional or novel pattern de-tection mechanisms, is another important possi-bility for future work. Already, the observationof patterns of growth (Laws III and V), or pat-terns in the heartbeat of the evolution, are openissues worth investigating. Going further than that,identifying which tables are more liable to changein the future and how, or how the effort aroundschema evolution can be planned in advance bystudying the available data are research questionswith great value both for developers, who can tailorthe code to be as loosely-coupled as possibly to themost unstable parts of the database, and projectmanagers, who can estimate where change will bedirected. We hope that in the context of suchendeavors, the publicly available datasets of thispaper (https://github.com/DAINTINESS-Group)can serve the research community.

Acknowledgment. We would like to thank theanonymous reviewers of both [31] and this paper fortheir useful comments.

This research has been co-financed by the Eu-ropean Union (European Social Fund - ESF) andGreek national funds through the Operational Pro-gram ”Education and Lifelong Learning” of theNational Strategic Reference Framework (NSRF)- Research Funding Program: Thales. Investingin knowledge society through the European SocialFund.

25

Page 26: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

6. References

[1] D. Sjøberg, Quantifying Schema Evolution, Informationand Software Technology 35 (1) (1993) 35–44.

[2] G. Papastefanatos, P. Vassiliadis, A. Simitsis, Y. Vas-siliou, Metrics for the Prediction of Evolution Impactin ETL Ecosystems: A Case Study, Journal on DataSemantics 1 (2) (2012) 75–97.

[3] C. Curino, H. J. Moon, L. Tanca, C. Zaniolo, SchemaEvolution in Wikipedia: Toward a Web InformationSystem Benchmark, in: Proceedings of 10th Interna-tional Conference on Enterprise Information Systems(ICEIS), 2008.

[4] D.-Y. Lin, I. Neamtiu, Collateral Evolution of Applica-tions and Databases, in: Proceedings of the Joint Inter-national and Annual ERCIM Workshops on Principlesof Software Evolution and Software Evolution Work-shops (IWPSE), 2009, pp. 31–40.

[5] S. Wu, I. Neamtiu, Schema evolution analysis for em-bedded databases, in: Proceedings of the 27th IEEEInternational Conference on Data Engineering Work-shops (ICDEW), 2011, pp. 151–156.

[6] D. Qiu, B. Li, Z. Su, An Empirical Analysis of theCo-evolution of Schema and Code in Database Ap-plications, in: Proceedings of the 9th Joint Meetingof the European Software Engineering Conference andthe ACM SIGSOFT Symposium on the Foundations ofSoftware Engineering (ESEC/FSE), 2013, pp. 125–135.

[7] L. A. Belady, M. M. Lehman, A Model of Large Pro-gram Development, IBM Systems Journal 15 (3) (1976)225–252.

[8] M. M. Lehman, J. C. Fernandez-Ramil, P. Wernick,D. E. Perry, W. M. Turski, Metrics and Laws of Soft-ware Evolution - The Nineties View, in: Proceedingsof the 4th IEEE International Software Metrics Sym-posium (METRICS), 1997, pp. 20–34.

[9] M. M. Lehman, J. C. Fernandez-Ramil, Software Evolu-tion and Feedback: Theory and Practice, Wiley, 2006,Ch. Rules and Tools for Software Evolution Planningand Management.

[10] I. Herraiz, D. Rodriguez, G. Robles, J. M. Gonzalez-Barahona, The Evolution of the Laws of Software Evo-lution: A Discussion Based on a Systematic LiteratureReview, ACM Computing Surveys 46 (2) (2013) 1–28.

[11] M. Wermelinger, Y. Yu, A. Lozano, Design Principlesin Architectural Evolution: A Case Study, in: Proceed-ings of the 24th IEEE International Conference on Soft-ware Maintenance (ICSM), 2008, pp. 396–405.

[12] Z. Xing, E. Stroulia, Analyzing the Evolutionary His-tory of the Logical Design of Object-Oriented Software,IEEE Transactions on Software Engineering 31 (10)(2005) 850–868.

[13] M. Lehman, Programs, Life Cycles, and Laws of Soft-ware Evolution, Proceedings of the IEEE 68 (9) (1980)1060–1076.

[14] M. M. Lehman, Laws of Software Evolution Revisited,in: Proceedings of 5th European Workshop on SoftwareProcess Technology, (EWSPT), 1996, pp. 108–124.

[15] M. M. Lehman, J. C. Fernandez-Ramil, D. E. Perry,On Evidence Supporting the FEAST Hypothesis andthe Laws of Software Evolution, in: Proceedings of the5th IEEE International Software Metrics Symposium(METRICS), 1998, pp. 84–88.

[16] S. S. Pirzada, A Statistical Examination of the Evolu-

tion of the Unix System, Ph.D. thesis, Imperial College,University of London (1988).

[17] N. T. Siebel, S. Cook, M. Satpathy, D. Rodrıguez, Lat-itudinal and Longitudinal Process Diversity, Journal ofSoftware Maintenance Research and Practice 15 (1).

[18] M. J. Lawrence, An Examination of Evolution Dynam-ics, in: Proceedings of the 6th International Conferenceon Software Engineering (ICSE), 1982, pp. 188–196.

[19] M. W. Godfrey, Q. Tu, Evolution in Open Source Soft-ware: A Case Study, in: Proceedings of the 16thIEEE International Conference on Software Mainte-nance (ICSM), 2000, pp. 131–142.

[20] M. W. Godfrey, Q. Tu, Growth, Evolution, and Struc-tural Change in Open Source Software, in: Proceedingsof the 4th International Workshop on Principles of Soft-ware Evolution (IWPSE), 2001, pp. 103–106.

[21] G. Robles, J. J. Amor, J. M. Gonzalez-Barahona, I. Her-raiz, Evolution and Growth in Large Libre SoftwareProjects, in: Proceedings of the 8th International Work-shop on Principles of Software Evolution (IWPSE),2005, pp. 165–174.

[22] S. Koch, Software Evolution in Open Source Projects:a Large-scale Investigation, Journal of Software Main-tenance and Evolution 19 (6) (2007) 361–382.

[23] G. Xie, J. Chen, I. Neamtiu, Towards a Better Under-standing of Software Evolution: An Empirical Studyon Open Source Software, in: Proceedings of the 25thIEEE International Conference on Software Mainte-nance (ICSM), 2009, pp. 51–60.

[24] I. Herraiz, G. Robles, J. M. Gonzalez-Barahon, Com-parison Between SLOCs and Number of Files As SizeMetrics for Software Evolution Analysis, in: Proceed-ings of the 10th European Conference on SoftwareMaintenance and Reengineering (CSMR), 2006, pp.206–213.

[25] R. Vasa, Growth and Change Dynamics in Open SourceSoftware Systems, Ph.D. thesis, Swinburn Univ. ofTechnology, Australia (2010).

[26] A. Israeli, D. G. Feitelson, The Linux Kernel as a CaseStudy in Software Evolution, Journal of Systems andSoftware 83 (3) (2010) 485–501.

[27] C. A. Curino, H. J. Moon, C. Zaniolo, GracefulDatabase Schema Evolution: the PRISM Workbench,Proceedings of the VLDB Endowment 1 (2008) 761–772.

[28] C. Curino, H. J. Moon, A. Deutsch, C. Zaniolo,Automating the Database Schema Evolution Process,VLDB Journal 22 (1) (2013) 73–98.

[29] G. Papastefanatos, P. Vassiliadis, A. Simitsis, Y. Vassil-iou, Policy-Regulated Management of ETL Evolution,Journal on Data Semantics 13 (2009) 147–177.

[30] G. Papastefanatos, P. Vassiliadis, A. Simitsis, Y. Vas-siliou, HECATAEUS: Regulating Schema Evolution, in:Proceedings of the 26th IEEE International Conferenceon Data Engineering (ICDE), 2010, pp. 1181–1184.

[31] I. Skoulis, P. Vassiliadis, A. Zarras, Open-SourceDatabases: Within, Outside, or Beyond Lehman’s Lawsof Software Evolution?, in: Proceedings of 26th Inter-national Conference on Advanced Information SystemsEngineering (CAiSE), 2014, pp. 379–393.

[32] IEEE, Guide to the Software Engineer-ing Body of Knowledge (v. 3.0), IEEEComputer Society, 2014, available athttp://www.computer.org/portal/web/swebok, Re-trieved at 08 July 2014.

26

Page 27: Growing up with Stability: how Open-Source Relational ...pvassil/publications/2014_CAiSE/... · Growing up with Stability: how Open-Source Relational Databases Evolve Ioannis Skoulisa,1,

[33] N. E. Fenton, S. L. Pfleeger, Software Metrics - A Prac-tical and Rigorous Approach, International Thomson,1996.

[34] M. M. Lehman, J. F. Ramil, Software Evolution,in: STRL Annual Distinguished Lecture, De Mont-fort Univ., Leicester, 20 Dec. 2001, 2001, available athttp://www.eis.mdx.ac.uk/staffpages/mml/feast2/papers.html,http://www.eis.mdx.ac.uk/staffpages/mml/feast2/papers/pdf/690c.pdf,http://www.eis.mdx.ac.uk/staffpages/mml/feast2/papers/pdf/jfr103c.pdf.

[35] J. Fernandez-Ramil, A. Lozano, M. Wermelinger,A. Capiluppi, Empirical Studies of Open Source Evolu-tion, in: Software Evolution, 2008, pp. 263–288.

27