Object-Oriented Class Maintainability Prediction Using Internal
Quality Attributes
Jehad Al Dallal
Department of Information Science
Kuwait University
P.O. Box 5969, Safat 13060, Kuwait
Abstract
Context: Class maintainability is the likelihood that a class can be easily modified. Before
releasing an object-oriented software system, it is impossible to know with certainty
when, where, how, and how often a class will be modified. At that stage, this likelihood
can be estimated using the internal quality attributes of a class, which include cohesion,
coupling, and size. To reduce future class maintenance effort and cost, developers
are encouraged to carefully test and thoroughly document classes with low
maintainability before releasing the object-oriented system.
Objective: We empirically study the relationship between internal class quality attributes
(size, cohesion, and coupling) and an external quality attribute (class maintainability).
Using statistical techniques, we also construct models based on the selected internal
attributes to predict class maintainability.
Method: We consider classes of three open-source systems. For each class, we account
for two actual maintainability indicators, the number of revised lines of code and the
number of revisions in which the class was involved. Using 19 internal quality measures,
we empirically explore the impact of size, cohesion, and coupling on class
maintainability. We also empirically investigate the abilities of the measures, considered
both individually and combined, to estimate class maintainability. Statistically based
prediction models are constructed and validated.
Results: Our results demonstrate that classes with better qualities (i.e., higher cohesion
values and lower size and coupling values) have better maintainability (i.e., are more
likely to be easily modified) than classes with worse qualities. Most of the considered
measures are shown to be predictors of the considered maintainability indicators to some
degree. The abilities of the considered internal quality measures to predict class
maintainability are improved when the measures are combined using optimized
multivariate statistical models.
Conclusion: The prediction models can help software engineers locate classes with low
maintainability. These classes must be carefully tested and well documented.
Keywords: internal and external quality attributes, quality measures, class cohesion,
class coupling, class size, class maintainability, class revisions, object-oriented software.
1. Introduction
Maintenance is a key stage in the software life cycle, and it starts when the software
product is delivered. During the maintenance stage, the software product is modified to
correct faults, improve performance or other attributes, or adapt the product to a modified
environment (Mamone 1994). Software maintenance constitutes the largest share of the
total cost of producing software applications. Some studies estimated that maintenance
requires up to 80% of the total cost (Ahn et al., 2003). To reduce maintenance costs, the
earlier development stages must produce code that can be easily understood and
modified (Erdil et al., 2003). Maintainability is
defined as "the ease with which a software system or component can be modified" (IEEE
1990).
One reason for the shift in software development toward the use of object-oriented (OO)
technology is the belief that OO code has high quality and maintainability (Briand et al.
1997b). Because the central construct of OO development is the class, classes are
expected to be high-quality units that can be easily maintained. Maintainability aspects
are measurable after performing maintenance tasks. Once a class is revised, the time and
cost of this specific revision are measurable. Alternatively, measures correlated to
maintenance time and cost can be applied to estimate the maintenance time or cost. In
this paper, we consider two such existing maintenance measures: the number of revised
lines of code (LOC) (Li and Henry 1993) and the number of revisions in which the class
was involved during the maintenance history (Dagpinar and Jahnke 2003). We selected
these two measures for two main reasons. The first reason is that the number of revisions
and revised LOC indicate two different maintenance aspects that are of interest to
software engineers. The former measure quantifies the maintenance rate. Software
engineers and practitioners prefer classes with lower maintenance rates over those with
higher rates because code that undergoes more revisions becomes less organized, less
understandable, and more fault-prone (Erdil et al., 2003). The number of revised LOC is
found to correlate with both maintenance cost (Granja-Alvarez and Barranco-Garcia 1997)
and maintenance effort measured in units of time (Hayes et al., 2004), where both cost
and effort are key factors for software engineers. The second reason for selecting these
two maintenance measures is that they are measurable in software systems with reported
maintenance histories, which occurs for some systems available on-line. This allows us to
perform the required empirical study.
The two considered maintenance measures may be correlated to some degree, but they
different. A class can be involved in many revisions, but it may have relatively few
revised LOC. In contrast, a class can be involved in few revisions but have many revised
LOC.
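The two indicators can be made concrete with a small computation. The following is a minimal sketch (not from the paper) that derives both indicators from a per-class revision log; the log format and field names are hypothetical.

```python
from collections import defaultdict

def maintainability_indicators(revision_log):
    """revision_log: iterable of (class_name, lines_added, lines_deleted, lines_changed)."""
    revisions = defaultdict(int)    # number of revisions per class
    revised_loc = defaultdict(int)  # number of revised LOC per class
    for cls, added, deleted, changed in revision_log:
        revisions[cls] += 1
        # A changed line counts as a deletion plus an addition
        # (the Li and Henry 1993 convention described later in the paper).
        revised_loc[cls] += added + deleted + 2 * changed
    return revisions, revised_loc

# Hypothetical log: "Parser" is involved in two revisions, "Lexer" in one.
log = [("Parser", 10, 2, 3), ("Parser", 0, 5, 1), ("Lexer", 4, 0, 0)]
revs, loc = maintainability_indicators(log)
```

Note how the two indicators diverge: a class can appear in many revisions yet accumulate few revised LOC, and vice versa.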
Class maintainability, i.e., the likelihood that a class may be easily modified, is a key
class quality attribute, and classes should be designed to be maintainable. Based on the
above discussion, classes with many revisions or revised LOC are less maintainable (i.e.,
expected to be more difficult to modify) than those with few revisions or revised LOC.
The two considered maintenance measures are thus actually class maintainability
indicators. Maintainability, like other important class qualities (e.g., reusability and
reliability) belongs to a set of software attributes known as external software attributes.
These attributes are directly relevant to users and practitioners (Fenton and Pfleeger 1997,
Morasca 2009).
Classes with low maintainability must be carefully tested to reduce their fault-proneness
and well documented to improve their understandability when future maintenance tasks
are performed. In addition, software developers may refactor classes with low
maintainability to enhance their maintainability before releasing the system. The
refactoring of classes by their developers at earlier stages is preferable to having other
maintenance programmers refactor them later, because system developers are generally
more knowledgeable about the system than external maintenance programmers
(although the quality of the available documentation also affects the size of this
advantage). Therefore, the refactoring of classes by their developers potentially reduces the
maintenance time and cost. Software engineers are thus interested in identifying classes
with low maintainability when the system is complete and not yet released. As with other
external attributes, however, many factors may affect class maintainability in addition to
the factors that depend on knowledge of the class artifacts (e.g., source code). These
factors are typically unknown at the class development stage and cannot be measured
solely based on knowledge of the class or even the software system to which the class
belongs. For example, it is impossible to foresee the evolutions that the system will
undergo and the future modifications to the environment to which the system will be
adapted. Therefore, class maintainability cannot be measured before the actual
maintenance process is performed, but it can be estimated.
Class size, cohesion, and coupling are internal software attributes that can be measured
after the system is developed and before it is released, and they may be related to various
external software quality attributes, including maintainability, reusability, and reliability
(Lee and Chang 2000). If a relationship exists between these internal quality attributes
and maintainability, developers could rely on measures that quantify the internal quality
attributes of classes to estimate class maintainability.
Without empirical validation, or with only limited empirical support based on a few
classes (which raises questions about the generality of the obtained results), several researchers (e.g.,
Briand et al. 1993, Li and Henry 1993, Lee and Chang 2000, Sheldon et al. 2002,
Chaumun et al. 2002, Dagpinar and Jahnke 2003, Aggarwal et al. 2006, Zhou and Leung
2007, Li-jin et al. 2009, Elish and Elish 2009) suggested using existing or newly
proposed internal quality measures to predict maintainability. Some related empirical
studies (e.g., Li and Henry 1993, Dagpinar and Jahnke 2003, Zhou and Leung 2007, Li-
jin et al. 2009, Elish and Elish 2009) considered a few internal quality attributes or did
not investigate the impact of individual measures on maintainability; therefore, their
results cannot be used to decide whether the cohesion, coupling, and size quality
attributes each has a negative or positive impact on class maintainability. Instead of
examining actual revisions performed on the considered systems during their
maintenance history, some related empirical studies (e.g., Kabaili et al. 2001, Chaumun et
al. 2002) were based on experimentally revising the considered systems and exploring the
relationship between some internal quality measures and artifacts based on the
experimental revisions. The key limitation of such studies is the fact that they depend on
the experimental revisions, which might not be representative of the actual revisions.
Finally, some empirical studies (e.g., Briand et al. 1999b, Briand et al. 2001, Gyimothy et
al. 2005, Olague et al. 2007, and Marcus et al. 2008) investigated the abilities of some
quality measures to predict a specific aspect related to maintenance, namely, fault-
proneness, but they did not consider other maintenance types, including those performed
to enhance system performance or adapt the system to a modified environment.
In this paper, we extend these studies by empirically investigating relationships between
19 internal class attribute measures, including some of the most common size, cohesion,
and coupling measures in the literature, and the two maintainability indicators described above.
In the empirical study, we used classes selected from three open-source Java systems, and
we collected their actual maintenance data, which was available on-line. We discuss how
to apply logistic regression analysis (Hosmer and Lemeshow 2000), a widely applied
statistical technique in experiment-based research, to predict class maintainability. Using
this technique, we propose models to estimate maintainability. Some models are based on
individual measures, and others are based on combinations of measures. These models
can be applied to predict class maintainability in advance, i.e., after the system is
developed but before it is released. Classes determined to be potentially less maintainable
should be carefully tested, documented, and possibly refactored.
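As a rough illustration of the modeling approach described above, the following sketch fits a binary logistic regression model whose independent variables are hypothetical size, cohesion, and coupling values and whose dependent variable flags classes that turned out to require many revisions. Plain gradient descent keeps the sketch dependency-free; the paper relies on standard statistical tooling, and all data here is invented.

```python
import math

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit w, b for P(y=1 | x) = 1 / (1 + exp(-(w.x + b))) by gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # prediction error
            gw = [g + err * xj for g, xj in zip(gw, xi)]
            gb += err
        w = [wj - lr * g / n for wj, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b

def predict(w, b, x):
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical training data: [size, cohesion, coupling] per class,
# label 1 = low maintainability (the class needed many revisions).
X = [[500, 0.2, 9], [120, 0.8, 2], [700, 0.1, 12], [90, 0.9, 1]]
y = [1, 0, 1, 0]
Xs = [[s / 1000, c, k / 20] for s, c, k in X]  # crude feature scaling
w, b = fit_logistic(Xs, y)

risk = predict(w, b, [0.60, 0.15, 0.50])  # large, low-cohesion, highly coupled
safe = predict(w, b, [0.10, 0.90, 0.05])  # small, cohesive, loosely coupled
```

Classes whose predicted probability exceeds a chosen threshold would be flagged for extra testing, documentation, and possible refactoring.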
Our results show that most considered quality measures are statistically significant
predictors of the two considered class maintainability indicators. Specifically, the results
suggest that there is a negative relationship between the maintainability external quality
attribute and each of the size and coupling internal quality attributes. In other words, the
results indicate that classes with larger sizes and higher coupling values are less
maintainable (i.e., more difficult to modify) than those with smaller sizes and lower
coupling values. The results also suggest that there is a positive relationship between the
maintainability external quality attribute and the cohesion internal quality attribute;
namely, the classes with higher cohesion values are more maintainable (i.e., easier to
modify) than those with lower cohesion values. When we considered the quality
measures in combination, the statistically constructed maintainability prediction models
show that the different quality attributes are complementary in predicting class
maintainability. As expected, when considering the measures in combination, the
constructed models become more statistically stable and provide better maintainability
indicators than most models based on individual measures.
The major contributions of this paper include the following:
1. Explaining how to apply logistic regression analysis to predict class
maintainability.
2. Investigating the relationship between size, cohesion, and coupling quality
attributes and class maintainability.
3. Exploring the abilities of several size, cohesion, and coupling measures,
considered individually, to predict class maintainability.
4. Constructing models, based on combinations of measures, that have practical
ability to predict class maintainability.
This paper is organized as follows. Section 2 reviews the basic concepts regarding
internal and external software attributes, and Section 3 reviews related work. Section 4
provides an overview of the software systems and the descriptive statistics that
characterize them. Section 5 describes the statistical techniques used in the data analyses.
Sections 6 and 7 report and discuss the univariate and multivariate regression analyses
and their results. Section 8 discusses validity threats to the empirical study. Finally,
Section 9 concludes the paper and outlines possible future work.
2. Internal vs. external attributes
Researchers have divided class quality attributes into internal and external categories
(Morasca 2009). External quality attributes are those that indicate class quality based on
factors that cannot be measured using only knowledge of the software artifacts (Fenton
and Pfleeger 1997). In addition to the artifacts, the software engineer must consider their
environment and the interactions between the artifacts and environment. For example, the
maintainability of an OO class depends on many elements, such as the class itself, the
experience of the team in charge of system maintenance, system age, and the
environment to which the system must adapt. For example, an older system is
likely to require more maintenance effort than a younger system because the system size
is likely to increase with age and the modified code is likely to be less organized and
understandable. Therefore, knowledge of the class alone is insufficient to quantify its
maintainability: it is difficult to anticipate the future effort required to maintain the class,
and class maintainability cannot be measured unless the class is actually maintained.
Conversely, internal class quality attributes, e.g., size, cohesion, and coupling, can be
measured based only on class artifact knowledge. Quantifying internal attributes is much
easier than quantifying external attributes; for example, class size can be measured by
counting the number of LOC. However, software practitioners are not interested in
internal quality attributes unless they are used to indicate external quality attributes such
as maintainability and reusability (Morasca 2009). For example, class cohesion is worth
measuring only if it is believed or has been shown to be related to (1) an external attribute
of the same artifact (e.g., the maintainability of the class) or some other related artifacts,
such as the class test suite, or (2) a software process attribute (e.g., the cost required to
develop the class).
In this empirical study, we consider the measurement of one external attribute, namely,
maintainability, and three internal attributes, namely, size, cohesion, and coupling.
Because of their different natures, the considered attributes are quantified differently as
follows.
2.1. Internal attributes: size, cohesion, and coupling
Software size has typically been considered a key attribute of several software
products, including OO classes (Briand et al. 1999b, Briand et al. 2001). Software size is
empirically found to influence several qualities of interest, such as a software product’s
fault-proneness (e.g. Briand et al., 2001, Gyimothy et al., 2005, Aggarwal et al., 2007)
and reusability (e.g. Al Dallal and Morasca 2012).
Class cohesion is an intra-class property that refers to the extent to which class members
are related. The literature has proposed several class cohesion measures (Briand et al.
1998). These measures use different formulas applied during either the high- or low-level
design phases. High-level design (HLD) measures, such as those proposed by Briand et al.
(1999b), Bansiya et al. (1999), Counsell et al. (2006), and Al Dallal and Briand (2010),
require information available during the HLD phase, e.g., the types of attributes and
method parameters. Low-level design (LLD) measures, such as those proposed by
Chidamber and Kemerer (1991), Bieman and Kang (1995), Chen et al. (2002), Badri and
Badri (2004), Wang (2005), Bonja and Kidanmariam (2006), Fernández and Peña (2006),
Al Dallal (2012b), and Al Dallal and Briand (2012), require information available during
the LLD phase, e.g., the attributes referenced by the methods.
Coupling is an inter-class property that refers to the degree to which a class is related to
other classes. The literature has proposed several measures to determine class coupling
(Briand et al. 1999a), and these measures consider different class aspects such as the
types of attributes (Li and Henry 1993), the types of parameters (Briand et al. 1997), and
invoked methods (Chidamber and Kemerer 1991, Chidamber and Kemerer 1994, Li and
Henry 1993, Lee et al. 1995, Gui and Scott 2009).
The usual notion of measure as defined in measurement theory, i.e., a function that
associates a value with each entity (Krantz et al. 1971, Roberts 1979), can be used for the
above internal attributes (Morasca 2009). Moreover, the corresponding measures must
comply with the representation condition of measurement theory or weaker conditions,
such as those defined in axiomatic approaches (Weyuker 1988, Briand et al. 1996,
Morasca 2008). Section 3.1 presents further details about the specific internal software
attribute measures used in this study.
2.2 An external attribute: maintainability
Morasca (2009) discussed several reasons for the unsuitability of using the same notion
of measure as that defined in measurement theory to quantify external attributes such as
maintainability. The first reason is that the quantification of external attributes depends
on the entity under study and several additional factors. For example, it is not true that
software product maintainability is a function only of the software product itself. The
definition of a measure given in measurement theory (Krantz et al. 1971, Roberts 1979),
which states that a measure is a function that associates a value with an entity, is thus not
suitable for external attributes. The second reason why using measurement theory is
unsuitable for external attributes is that defining attributes with their measures causes a
logical problem. This problem occurs because attributes exist prior to when and
independent of how they are measured, and defining a measure logically follows the
defining of the attribute it purports to measure; otherwise, attribute classification
inconsistencies may occur. The third reason for the inappropriateness of using
measurement theory for external attributes is that measurement theory is applied to define
deterministic measures, which do not exist for external attributes, as many variables
affect them (the “environment”) in addition to the specific entity.
In this paper, we therefore follow the suggestion of Morasca (2009) to use probabilities
and probabilistic models to estimate our external attribute of interest, namely, OO class
maintainability. These estimation models use size, cohesion, and coupling measures as
independent variables and estimate the probability that an OO class will be frequently
revised or will require costly revisions (indicated by the number of revised LOC).
Building these probability estimation models requires the collection of data regarding
actual revisions performed on OO classes during the maintenance phase. We thus use the
number of revisions in which the class was involved and the number of revised (i.e.,
added, deleted, or changed) LOC during the maintenance phase as measures to indicate
class maintainability.
3. Related work
In this section, we provide an overview of several existing internal quality class measures
for OO systems and other related work on the measurement of software quality. We also
review several existing maintainability indicators and provide an overview of the existing
research that has theoretically or empirically discussed and studied the relationship
between internal class attributes and maintainability.
3.1. Internal class attributes
Researchers have proposed several measures to assess different internal class quality
attributes, such as size, cohesion, and coupling. The proposed size measures capture
different aspects of size. For example, Number of Methods (NOM) measures the amount of
functionality that a class provides, Number of Attributes (NOA) measures the amount of
data necessary for the class to function, and the lines of code (LOC) parameter measures
the size in terms of statements. Chidamber and Kemerer (1994) proposed the Weighted
Methods per Class (WMC) parameter as a complexity measure whose value is obtained
by summing the complexities of all methods defined in a class. Most authors who
previously used WMC assumed that each method has a complexity of one and thus
assumed that NOM and WMC are equivalent (Chaumun et al. 2002).
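As an illustration of these size measures (a sketch, not from the paper, which studies Java and Classic-Ada systems), the following counts NOM, NOA, and LOC for a small Python class using the standard-library ast module, and takes WMC = NOM under the common unit-complexity assumption:

```python
import ast

SRC = '''
class Stack:
    def __init__(self):
        self.items = []
    def push(self, x):
        self.items.append(x)
    def pop(self):
        return self.items.pop()
'''

tree = ast.parse(SRC)
cls = next(n for n in ast.walk(tree) if isinstance(n, ast.ClassDef))
methods = [n for n in cls.body if isinstance(n, ast.FunctionDef)]

# Attributes: names assigned via "self.<name> = ..." anywhere in the methods.
attrs = set()
for m in methods:
    for node in ast.walk(m):
        if isinstance(node, ast.Assign):
            for t in node.targets:
                if (isinstance(t, ast.Attribute)
                        and isinstance(t.value, ast.Name)
                        and t.value.id == "self"):
                    attrs.add(t.attr)

nom = len(methods)  # Number of Methods
noa = len(attrs)    # Number of Attributes
loc = len([line for line in SRC.strip().splitlines() if line.strip()])  # lines of code
wmc = nom           # WMC with each method assigned a complexity of one
```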
Cohesion refers to the extent to which components in a software module are related
(Bieman and Kang 1998). Several cohesion measures have been proposed for functions
in procedural programs (e.g., Bieman and Kang 1998, Meyers and Binkley 2007, Sarkar
et al. 2007, Al Dallal 2009) and classes in object-oriented programs (e.g., Chidamber and
Kemerer 1991, Li and Henry 1993, Chidamber and Kemerer 1994, Bieman and Kang
1995, Briand et al. 1998, Briand et al. 1999b, Bansiya et al. 1999, Chen et al. 2002, Yang
2002, Chae et al. 2004, Etzkorn et al. 2004, Badri and Badri 2004, Wang 2005,
Fernandez and Pena 2006, Counsell et al. 2006, Badri et al. 2008, Al Dallal and Briand
2010, Al Dallal and Briand 2012, Al Dallal 2012b). Based on a justified criterion, as
discussed in Section 4.3, we consider eight cohesion measures, including Coh, CAMC,
TCC, LCC, LSCC, SCOM, PCCC, and OLn, as defined in Table 1. The selected cohesion
measures are well studied, both theoretically and empirically (Briand et al. 1999b, Briand
et al. 2001, Al Dallal 2010, Al Dallal 2011a, Al Dallal 2011b, Al Dallal 2012a, Al Dallal
2012c, Al Dallal 2013).
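To make a few of these cohesion measures concrete, the following sketch (not from the paper) computes Coh, TCC, and LSCC for a hypothetical class given a map of each method to the set of attributes it references, following the definitions in Table 1:

```python
from itertools import combinations

# Hypothetical class: three methods, three attributes, and the attributes
# each method references (directly or transitively).
refs = {"m1": {"a", "b"}, "m2": {"b", "c"}, "m3": {"c"}}
attrs = {"a", "b", "c"}
k, l = len(refs), len(attrs)

# Coh: number of method-attribute references relative to the maximum k*l.
coh = sum(len(s) for s in refs.values()) / (k * l)

# TCC: relative number of method pairs sharing at least one attribute.
pairs = list(combinations(refs.values(), 2))
tcc = sum(1 for s, t in pairs if s & t) / len(pairs)

# LSCC: sum over attributes of x_i*(x_i - 1), normalized by l*k*(k - 1),
# where x_i is the number of methods referencing attribute i.
x = {a: sum(1 for s in refs.values() if a in s) for a in attrs}
lscc = sum(v * (v - 1) for v in x.values()) / (l * k * (k - 1))
```

Here m1 and m2 share attribute b, and m2 and m3 share attribute c, so two of the three method pairs are connected.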
Table 1: Definitions of the considered class cohesion measures (adapted from Al Dallal
2012a)
Coupling refers to the relatedness among system components. In OO systems, coupling
can be measured at the class or system level. At the class level, developers are concerned
with measuring the extent to which the class is coupled with other classes. At the system
level, developers are concerned with measuring the total coupling of a software system
(Gui and Scott 2009). Researchers have proposed several coupling measures to assess
class coupling (e.g., Briand et al. 1997, Chidamber and Kemerer 1991, Chidamber and
Class Cohesion Measure Definition/Formula
Coh (Briand et al. 1998) Coh = a/(kl), where k is the number of methods, l is the number of attributes, and
a is the number of method-attribute references, i.e., the number of pairs (m, j) such
that method m references attribute j.
Cohesion Among Methods in
a Class (CAMC) (Counsell et
al. 2006)
CAMC = a/(kl), where l is the number of distinct parameter types, k is the number
of methods, and a is the summation of the number of distinct parameter types of
each method in the class. Note that this formula is applied in the model that does
not include the self-parameter type used in all methods.
Tight Class Cohesion (TCC)
(Bieman and Kang 1995)
TCC = Relative number of directly connected pairs of methods, where two
methods are directly connected if they are both directly connected to the same
attribute. A method m is directly connected to an attribute when the attribute
appears within the method's body or within the body of a method invoked by
method m, either directly or transitively.
Loose Class Cohesion (LCC)
(Bieman and Kang 1995)
LCC = Relative number of directly or transitively connected pairs of methods,
where two methods are transitively connected if they are both directly or indirectly
connected to the same attribute. A method m, directly connected to an attribute j, is
indirectly connected to an attribute i when there is a method directly or transitively
connected to both attributes i and j.
Low-level Design Similarity-
based Class Cohesion
(LSCC) (Al Dallal and
Briand 2012)
LSCC(C) = 0 if k = 0 or l = 0; 1 if k = 1; otherwise
[Σ(i=1 to l) x_i(x_i - 1)] / [l·k·(k - 1)],
where l is the number of attributes, k is the number of methods, and x_i is the
number of methods that reference attribute i.
Sensitive Class Cohesion
Metric (SCOM) (Fernandez
and Pena 2006)
SCOM = Ratio of the sum of the similarities between all pairs of methods to the
total number of pairs of methods. The similarity between methods i and j is
defined as:
Similarity(i, j) = [|I_i ∩ I_j| / min(|I_i|, |I_j|)] · [|I_i ∪ I_j| / l],
where I_i is the set of attributes referenced by method i and l is the number of
attributes.
Path Connectivity Class
Cohesion (PCCC) (Al Dallal
2012b)
PCCC(C) = 0 if l = 0 and k > 1; 1 if (l > 0 and k = 0) or k = 1; otherwise
NSP(G_C) / NSP(FG_C),
where NSP(G) is the number of simple paths in graph G, G_C is the reference graph
of class C, FG_C is the corresponding fully connected graph, and a simple path is a
path in which each node occurs at most once.
OLn (Yang 2002) OLn = The average strength of the attributes, where the strength of an attribute is
the average strength of the methods that reference it. The strength of a method is
initially set to 1 and is recomputed, in each iteration, as the average strength of the
attributes that it references; n is the number of iterations used to compute OLn.
Kemerer 1994, Li and Henry 1993, Lee et al. 1995, Kabaili et al. 2001). Based on the
types of coupling considered by the measures, as in Section 4.3, we consider eight
coupling measures, including CBO, CBO_IUB, CBO_U, RFC, MPC, DAC1, DAC2, and
OCMEC, defined in Table 2. Briand et al. (1999a) studied the theoretical validation of
most of these coupling measures.
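To make a few of the coupling measures concrete, the following sketch (not from the paper) computes DAC1, DAC2, MPC, and CBO_U from a hypothetical summary of a class A: the types of its attributes and the method invocations made by its methods.

```python
# Hypothetical summary of class A.
attr_types = ["Logger", "Logger", "Cache", "int"]  # types of A's attributes
calls = [("Logger", "log"), ("Cache", "get"), ("Logger", "log")]  # invocations in A
builtin = {"int"}  # primitive types do not count as class coupling

# DAC1: attributes whose types are other classes.
dac1 = sum(1 for t in attr_types if t not in builtin)

# DAC2: distinct classes used as attribute types.
dac2 = len({t for t in attr_types if t not in builtin})

# MPC: total number of method invocations in A.
mpc = len(calls)

# CBO_U: distinct (non-inherited) classes used by A's methods.
cbo_u = len({c for c, _ in calls})
```

The same class thus scores differently under each measure: repeated uses of Logger inflate DAC1 and MPC but are counted once by DAC2 and CBO_U.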
Table 2: Definitions of the considered class coupling measures
3.2. Indicating object-oriented maintainability
Software maintenance is categorized into four types: corrective, adaptive, perfective, and
preventive (Erdil et al. 2003). Corrective maintenance addresses the correction of faults
when the system does not behave according to its specifications. Adaptive maintenance is
applied to a system to adapt it to new environments without affecting its functionality.
Perfective maintenance extends system functionality and improves the provided services.
Preventive maintenance performs activities such as code refactoring to enhance system
maintainability. Corrective maintenance is considered the traditional maintenance type,
whereas the other three types of maintenance are referred to as software evolution.
Several measures have been proposed to measure different maintenance aspects,
including the number of revised LOC (Li and Henry 1993), number of revisions
(Dagpinar and Jahnke 2003), and pieces of code potentially affected by the revised code
(Kabaili et al. 2001, Chaumun et al. 2002, Xia and Srikanth 2004). Some researchers
studied the correlation between some maintenance measures and maintenance effort.
Among several measures empirically considered, Hayes et al. (2004) found that the
number of revised LOC is strongly correlated to maintenance effort measured in units of
time. Granja-Alvarez and Barranco-Garcia (1997) showed that the cost of maintenance is
correlated with the number of revised LOC.
Several researchers have discussed or empirically investigated the correlations between
maintenance and internal quality measures. Without validation, Briand et al. (1993)
Class Coupling Measure Definition/Formula
Coupling Between Object
Classes (CBO) (Chidamber
and Kemerer 1994)
CBO = Number of classes, excluding the inherited classes, to which the class is
coupled. A class A is coupled to another class B if the methods of class A use
attributes or methods of class B, or vice versa. Thus, CBO = CBO_IUB + CBO_U.
CBO is Used by (CBO_IUB)
(Kabaili et al. 2001)
CBO_IUB of class A = Number of classes, excluding the inherited classes, that
use the attributes or methods of class A.
CBO Using (CBO_U)
(Kabaili et al. 2001)
CBO_U of class A = Number of classes, excluding the inherited classes, that are
used by the methods of class A.
Response for a class (RFC)
(Chidamber and Kemerer
1994)
RFC of class A = Number of methods in class A + Number of distinct methods of
the other classes directly invoked by the methods of class A.
Message Passing Coupling
(MPC) (Li and Henry 1993)
MPC of class A = Number of method invocations in class A.
Data Abstraction Coupling
(DAC1) (Li and Henry 1993)
DAC1 of class A = Number of attributes, in class A, whose types are of other
classes.
DAC2 (Li and Henry 1993) DAC2 of class A = Number of distinct classes used as types of the attributes of
class A.
OCMEC (Briand et al. 1997) OCMEC of class A = Number of distinct classes used as types of the parameters of
the methods in class A.
suggested a set of high-level design-based cohesion and coupling measures that can be
used to estimate OO system maintainability.
Li and Henry (1993) used two Classic-Ada systems with 39 and 70 classes to investigate
quality measures that predict maintainability. They used depth of inheritance (DIT),
number of children classes (NOC), MPC, lack-of-cohesion (LCOM), RFC, DAC, WMC,
NOM, number of semicolons, and NOA as quality measures, and the number of revised
LOC per class during its maintenance history was used as a maintenance measure.
Adding or deleting a line is counted as a single line change, and a change in line content
is counted as both a deletion and an addition. The results obtained by applying the linear
regression statistical technique confirmed that there is a strong correlation between the
quality measures, considered in combination, and class maintainability, indicated by the
number of revised LOC. The empirical study has two main limitations. First, it considers
only a few classes, which raises questions about the generality of the obtained results. The
second limitation is that their work did not investigate the abilities of the individual
measures to predict maintainability; therefore, the results cannot be used to determine
whether each of the cohesion, coupling, and size quality attributes has a negative or
positive impact on class maintainability.
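The revised-LOC counting convention Li and Henry used (an added or deleted line counts once; a change in line content counts as a deletion plus an addition) can be sketched with a plain line diff. This is an illustration, not their actual tooling.

```python
import difflib

def revised_loc(old_lines, new_lines):
    """Count revised LOC between two versions of a class."""
    count = 0
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":
            count += j2 - j1                 # added lines count once
        elif op == "delete":
            count += i2 - i1                 # deleted lines count once
        elif op == "replace":
            count += (i2 - i1) + (j2 - j1)   # changed content: deletion + addition
    return count

old = ["int x;", "x = 1;", "print(x);"]
new = ["int x;", "x = 2;", "print(x);", "return x;"]
# One changed line (counts 2) plus one added line (counts 1).
```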
Without validation, Lee and Chang (2000) proposed an equation that uses existing
complexity measures to estimate the maintainability of OO software.
Using three C++ systems, Kabaili et al. (2001) investigated whether cohesion can predict
the changeability of an OO system (i.e., the ability of the system to absorb changes).
They identified the possible changes that can be performed on an OO system, performed
some of these changes on the considered systems, and analyzed the impact of these
changes on the systems. The impact of a change is defined as the number of classes that
the change affects. Finally, they studied the correlation between the impact of the change
and the LCC and LCOM values. They found a weak correlation between the cohesion
values and impact of change. They argued that the unexpected correlation result occurred
because LCC and LCOM are frequently misleading cohesion indicators. Chaumun et al.
(2002) performed a similar analysis to explore the correlation between the method
signature change and the WMC measure, and they concluded that the correlation is weak.
The studies by Kabaili et al. (2001) and Chaumun et al. (2002) have two main limitations.
First, they considered few internal quality measures, which raises questions about their
generality with regard to the correlation between each of the cohesion and size quality
attributes and changeability. Second, they did not account for the actual changes
performed on the considered systems during their maintenance history; therefore, the
performed experimental changes may not be representative of the actual changes.
Sheldon et al. (2002) extended the NOC and number of descendant classes (NOD)
measures to better estimate the maintainability of the class inheritance hierarchy.
However, they did not empirically validate the extended measures.
Dagpinar and Jahnke (2003) investigated the prediction of maintainability using quality
measures of OO systems. In their empirical study, two size measures (corresponding to
LOC and NOM), two inheritance measures (DIT and NOC), a cohesion measure (LCC),
and a set of coupling measures were applied to two Java systems with 27 and 180 classes.
They categorized the coupling measures as either import or export coupling, where
import coupling measures the extent to which the class of interest uses instances of other
classes, and export coupling measures the extent to which instances of the class of
interest are used by other classes. They collected logs reporting
three years of maintenance history for each considered system and considered the number
of class revisions during its maintenance history as the maintenance measure. They
applied univariate linear regression analysis to explore the abilities of the individual
measures to predict maintainability and multivariate linear regression analysis on
combinations of measures to construct a maintainability prediction model. Their results
indicated that size and import coupling measures are significant maintainability
predictors, while inheritance, cohesion, and export coupling measures are not. This study
has a generality limitation due to the relatively low number of selected classes and
measures. That is, the considered systems, with their low numbers of classes, do not represent
real projects. To obtain truly conclusive results, researchers must consider data from real
projects (Genero et al. 2005). In addition, it might be inaccurate to generalize the
conclusion regarding the relationship between maintainability and an internal quality
attribute using a single measure or a small number of measures. For example, to get conclusive results
regarding the relationship between cohesion and maintainability, researchers must
consider multiple measures that consider different cohesion aspects and follow different
cohesion measuring approaches.
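The import/export distinction described above can be sketched with a small routine. The dependency pairs and class names below are purely hypothetical; actual measures such as those of Dagpinar and Jahnke operate on parsed source code rather than ready-made pairs.

```python
from collections import defaultdict

def coupling_counts(uses):
    """Given (client, supplier) pairs meaning 'client uses an instance of
    supplier', count import and export coupling per class.
    Import coupling of C: number of distinct classes whose instances C uses.
    Export coupling of C: number of distinct classes that use instances of C."""
    imp = defaultdict(set)
    exp = defaultdict(set)
    for client, supplier in uses:
        if client != supplier:          # self-references do not couple a class to another
            imp[client].add(supplier)
            exp[supplier].add(client)
    return ({c: len(s) for c, s in imp.items()},
            {c: len(s) for c, s in exp.items()})

# Hypothetical dependency pairs for illustration only.
uses = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "A")]
imp, exp = coupling_counts(uses)
print(imp.get("A", 0), exp.get("A", 0))  # A imports from 2 classes and is used by 1
```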
Aggarwal et al. (2006), Zhou and Leung (2007), Li-jin et al. (2009), and Elish and Elish
(2009) applied different statistical techniques to construct maintainability prediction
models using the same data collected by Li and Henry (1993), and they reached the same
conclusions. Consequently, these studies share the same limitations indicated for the Li
and Henry study.
Rizvi and Khan (2010) performed an empirical study that investigates the relationship
between class diagram maintainability and class diagram understandability and
modifiability. The study used values of understandability, modifiability, maintainability,
and eleven size and structural complexity measures, previously collected through
controlled experiments on 28 class diagrams. They applied multivariate linear
regression to construct models to estimate class diagram understandability and
modifiability using the eleven measures and to estimate class diagram maintainability
using the understandability and modifiability attributes. Class diagram maintainability was
found to be positively and strongly correlated with both understandability and modifiability,
and a corresponding significant maintainability model was constructed. The study indirectly
explored the relationship between maintainability on one side and size and complexity
attributes on the other side, but it did not investigate the relationship between
maintainability and other quality attributes such as cohesion and coupling.
Several measures have been proposed to predict the maintainability of non-object-
oriented systems such as service-oriented (Perepletchikov et al. 2007), Web (Chae et al.
2007), and functional systems (Ahn et al. 2003). Some researchers (Briand et al. 1998,
Briand et al. 2001, Gyimothy et al. 2005, and Marcus et al. 2008) were also interested in
investigating the abilities of some size, cohesion, and coupling quality measures to
predict a specific aspect related to maintenance, namely, fault proneness. Benestad et al.
(2006) performed a survey on the assessment of OO class maintainability.
In this paper, we use the same maintainability indicators proposed by Li and Henry (1993)
and Dagpinar and Jahnke (2003), and we rely on the results of the existing empirical
studies (Granja-alvarez and Barranco-garcia 1997, Hayes et al. 2004) regarding the
relationship between the considered maintainability indicators and the maintenance cost
and effort. The number and sizes of the systems considered in this paper are larger than
those considered in similar studies (e.g., Li and Henry 1993, Dagpinar and Jahnke 2003,
Aggarwal et al. 2006, Zhou and Leung 2007, Li-jin et al. 2009, Elish and Elish 2009). In
addition, in this paper, we considered a larger number of internal quality measures than
those considered in similar studies (e.g., Li and Henry 1993, Kabaili et al. 2001,
Chaumun et al. 2002, Dagpinar and Jahnke 2003, Aggarwal et al. 2006, Zhou and Leung
2007, Li-jin et al. 2009, Elish and Elish 2009). In contrast to the studies performed by
Kabaili et al. (2001) and Chaumun et al. (2002), in this paper, we accounted for actual
changes performed on the considered systems during their maintenance history. Finally,
this paper shows the application of logistic regression, a statistical technique which was
not applied by any of the surveyed papers, to predict several maintainability aspects.
4. Descriptive statistics
The empirical study considered three systems to explore the prediction of internal quality
measures for class maintainability. In this section, we describe the considered systems
and the data collection process. We also provide descriptive statistics of the considered
internal and external quality measures.
4.1. The software systems
In this empirical study, we considered three open-source Java software systems from
different domains, including Art of Illusion version 2.4.1 (Illusion 2012), FreeMind
version 0.8.0 (FreeMind 2012), and JabRef version 1.8 (JabRef 2012). The first system,
Art of Illusion, is a 3D modeling, rendering, and animation studio system. The second
system, FreeMind, is a hierarchical editing system. The third system, JabRef, is a
graphical application for managing bibliographical databases. These systems were
selected from http://sourceforge.net. Regarding the selection criteria, these systems had to
(1) be implemented using Java, (2) be relatively large in terms of the number of classes,
(3) be from different domains, (4) have available source code and maintenance
repositories, and (5) be relatively old versions that were actively maintained over a
considerable period of time. The variety of sizes and domains of the systems allows
commenting on the generality of the obtained results.
4.2. Maintenance data collection
In this empirical study, we considered two actual class maintenance measures: the
number of revisions in which the class was involved and the number of LOC revised
during the considered maintenance history. As Li and Henry (1993) suggested, a line
addition or deletion is considered a single line modification, and a change in line content
is counted as a deletion and an addition. We considered only concrete classes and ignored
abstract classes and interfaces because they do not have defined values for most of the
considered measures.
We collected maintenance data for classes in the considered software systems from
publicly available revision repositories in which their maintenance histories are
maintained and managed. The developers of the considered systems used two different
on-line Version Control System (VCS) tracking systems to track source code changes.
The changes, called revisions, are due to either detected faults or required evolutions. In
this empirical study, we did not differentiate between the different maintenance types, as
we are concerned with all maintenance tasks.
The VCS system revisions for the Art of Illusion and JabRef systems are organized using
the revision time stamp, whereas the VCS system revisions for the FreeMind system are
organized using the system package hierarchy. For the former organization method, each
revision is associated with the revision date and a report that includes the revision
description and a list of classes involved in this revision. For each class, the system
reports the revised code for that revision and identifies the differences between the
previous and current class versions, including the added, changed, and deleted lines of
code. For the latter organization method, the tracking system provides the package tree
hierarchy in which classes represent leaf nodes and provides the maintenance history of
any selected class. The history reports all class revisions, and, for each revision, the
history reports revision identification, date, description, and revised pieces of code (i.e.,
added, changed, and deleted lines of code).
The selected Art of Illusion version was issued four and a half years ago, and it was the
most recent system among those systems considered in this empirical study. The second
and third systems were maintained for six and a half years and eight years, respectively.
For each system, we collected the maintenance data reported during the entire
maintenance period, starting from the issuing date and ending on the date on which the
data were collected. Because of the different system ages, we performed an empirical
analysis on each system alone and could thus provide a general comment about the
impact of system age on the analysis results.
For each considered software system, we created three corresponding empty files
(henceforth called the maintenance repository) to record, for each considered class, the
added, changed, and deleted lines of code during the selected maintenance history
period. For each software
system, we manually traced each revision reported in the VCS tracking system; copied
the added, changed, and deleted lines of code during that revision to the corresponding
files of the maintenance repository; and headed the pasted lines of code with a comment
that indicates the revision identifier (for reference purposes). While tracing the VCS
tracking systems, we accumulatively collected the added, changed, and deleted lines of
code for each considered class and built our own maintenance repository.
Tracing the time-stamp-based VCS requires browsing the reported modifications for each
class involved in each individual revision. The time required to collect maintenance data
for systems that use time-stamp-based VCS thus depends on the numbers of revisions,
classes involved in each revision, and revised LOC in each class revision. Conversely,
tracing the package-hierarchy-based VCS requires browsing the tree-organized package
hierarchy from the root package to each leaf node representing a class in the considered
system. After reaching the class link, one must trace the list of class revisions and obtain
the reported modifications for each revision. The time required to collect maintenance
data for systems that use package-hierarchy-based VCS thus depends on the package
hierarchy complexity (i.e., the lengths of the paths from the root to the leaves of the
hierarchical tree and the number of nodes in the tree), the total number of revisions
performed on each class, and the number of revised LOC in each class revision.
A research assistant with a B.Sc. in computer science and nine years of experience in
software development activities manually traced the VCS tracking systems of the three
considered software systems and collected the data in the maintenance repository. The
author of the current paper randomly selected 10% of the classes, checked the correctness
of the work performed by the research assistant, and found that the maintenance data
collection was performed properly for the selected classes, thus increasing the confidence
that the collected data match what is reported in the VCS tracking system.
We developed our own Java tool to parse the three maintenance repository files that
included the added, changed, and deleted lines of code in each considered class. The tool
counted and reported, in an Excel sheet, the number of revisions and the number of
added, changed, and deleted lines of code for each class. A single class revision can
include some added, changed, and deleted lines of code. The data associated with such a
class revision are thus distributed among the three files in the maintenance repository. As
a result, our maintenance repository includes the same maintenance data that are reported
in the VCS tracking system, but organizes the maintenance data differently in a way that
simplifies the required maintenance data collection process. For each class in the original
version of a considered system (i.e., the version identified in Section 4.1), our
maintenance repository includes every added, deleted, and changed line of code during
the history of the class, from the date at which the original version of the class was issued
until the date on which the empirical study was performed. In other words, for each class in an
original version of a considered system, our maintenance repository reports all detailed
changes in every subsequent version of the system up to the most recent one. To avoid
mistakenly counting a revision whose data are distributed among the three files as three
revisions, our tool compared the revision identifiers added as comments in the
maintenance repository files and counted such a revision as a single revision. We
followed the convention that adding or deleting a line is
counted as a single line change, and a change in line content is counted as both a deletion
and an addition (Zhou and Leung 2007, Elish and Elish 2009).
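As a sketch of the counting conventions just described, the following routine tallies revised LOC and deduplicates revisions that span the added, changed, and deleted files. The per-class log format here, lists of (revision identifier, number of lines) entries, is a hypothetical stand-in for the repository files in which revised lines are headed by a revision-identifier comment.

```python
def count_maintenance(added, changed, deleted):
    """Count revisions and revised LOC for one class.
    Convention (Li and Henry 1993): an added or deleted line is one line
    change; a changed line counts as a deletion plus an addition."""
    revised_loc = 0
    revision_ids = set()
    for rev_id, n in added:
        revised_loc += n          # each added line: one line change
        revision_ids.add(rev_id)
    for rev_id, n in deleted:
        revised_loc += n          # each deleted line: one line change
        revision_ids.add(rev_id)
    for rev_id, n in changed:
        revised_loc += 2 * n      # each changed line: a deletion plus an addition
        revision_ids.add(rev_id)
    # A revision that appears in several files is still a single revision.
    return len(revision_ids), revised_loc

n_rev, loc = count_maintenance(added=[("r1", 5)],
                               changed=[("r1", 3), ("r2", 2)],
                               deleted=[("r2", 1)])
print(n_rev, loc)  # 2 revisions; 5 + 1 + 2*(3+2) = 16 revised LOC
```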
For each considered system, Table 3 reports the number of concrete classes, number of
LOC, and number and percentage of revised classes. Table 4 lists descriptive statistics for
each actual maintenance measure. The mean numbers of revisions and revised LOC,
shown in Table 4, indicate that FreeMind, with a maintenance age of six and a half years,
was the most actively maintained system among the three systems, although JabRef was
maintained longer (i.e., maintenance age of eight years). Figure 1 presents the number of
classes that feature each value of the number of revisions maintenance measure, and
Figure 2 shows the number of classes that exhibit each percentage range of the number of
revised LOC maintenance measures. For example, Figure 1 shows that 327 of the Illusion
classes, 253 of the FreeMind classes, and 191 of the JabRef classes were not involved in
any revision. Figure 2 shows that 73 of the Illusion classes had a percentage p of the
maximum number of revised LOC among the Illusion classes (i.e., 259 LOC, as shown in
Table 4), where 0%<p≤10%. As 10% of the maximum number of revised LOC among
the Illusion classes is 25.9, 73 of the Illusion classes had a number n of revised LOC,
where 0<n≤25.9. Figure 2 does not show the number of classes that did not have any
revised LOC during the selected maintenance history; such a number can be obtained
from Figure 1. For example, the number of Illusion classes that did not have any revised
LOC is the same as the number of Illusion classes that were not involved in any revisions
(327 classes).
Table 3: The descriptions of the Java systems in the dataset

System   | No. of concrete classes | LOC  | No. of revised classes
Illusion | 430                     | 72 K | 103 (24%)
FreeMind | 363                     | 64 K | 110 (30%)
JabRef   | 306                     | 41 K | 115 (38%)

Table 4: Descriptive statistics for the actual maintenance measures

Metric             | System   | Min | Max  | 25% | Med | 75% | Mean  | Std. Dev.
No. of revisions   | Illusion | 0   | 9    | 0   | 0   | 0   | 0.44  | 1.09
No. of revisions   | FreeMind | 0   | 55   | 0   | 0   | 1   | 2.69  | 7.18
No. of revisions   | JabRef   | 0   | 31   | 0   | 0   | 2   | 1.30  | 3.22
No. of revised LOC | Illusion | 0   | 259  | 0   | 0   | 0   | 7.89  | 30.00
No. of revised LOC | FreeMind | 0   | 1852 | 0   | 0   | 3   | 55.06 | 184.74
No. of revised LOC | JabRef   | 0   | 918  | 0   | 0   | 6   | 16.01 | 64.53

Figure 1: Class distribution of the number of revisions measure
Figure 2: Class distribution of the percentage of revised LOC measure
Figure 1 shows that a relatively high percentage of classes were not involved in any
revision and that a low percentage of classes were involved in relatively many revisions.
Figure 2 shows that a low percentage of classes had a large percentage of revised LOC
during their selected maintenance history period. There are several practical implications
of these observations. For example, the results indicate that, instead of spending equal
documentation effort on all classes under development, software developers can
provide more detailed documentation for the relatively low percentage of classes that are
expected to be highly revised during the maintenance history period. Such detailed
documentation is expected to reduce the code understanding effort required during the
maintenance stage. To achieve this goal, software developers need models to predict
classes with low maintainability before performing actual maintenance. These models
must be constructed based on the artifacts available during the software development
stage. This paper investigates the abilities of selected software internal quality measures
to predict classes with low maintainability. These measures are the independent variables
used to construct the required models. The measures are summarized in Section 2, and
their descriptive statistics are provided below.
4.3. Independent variables
Researchers have proposed many measures to quantify the internal quality attributes that
are considered in this empirical study. For the size attribute, we considered the three
measures that similar studies most commonly consider: LOC, NOM, and NOA. Existing
cohesion measures consider different cohesion aspects and apply different approaches to
measure class cohesion. We identified four main approaches, including (1) measuring
cohesion based on counting the number of distinct attributes accessed using the methods
of the class of interest, (2) measuring cohesion based on counting the number of cohesive
method pairs, (3) measuring cohesion based on quantifying the similarity degree between
each pair of methods according to the number of commonly accessed attributes, and (4)
measuring cohesion based on the connectivity degree between the methods and attributes
of the class of interest. To more comprehensively address the cohesion measuring
approaches, we selected two existing cohesion measures for each identified measuring
approach. That is, we selected Coh and CAMC to address the first approach, TCC and
LCC to address the second approach, LSCC and SCOM to address the third approach,
and PCCC and OL2 to address the fourth approach. Most of the selected measures satisfy
the necessary cohesion measure properties (Al Dallal 2010, 2012b, Al Dallal and Briand
2012).
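As an illustration of the second measuring approach (counting cohesive method pairs), the following is a simplified TCC-style sketch. The actual TCC also counts indirect attribute use through method invocation, which is omitted here, and the value returned for single-method classes is an assumed convention rather than the exact rule of Al Dallal (2011a).

```python
from itertools import combinations

def tcc_like(method_attrs):
    """Simplified TCC-style cohesion: the fraction of method pairs that
    directly share at least one attribute. method_attrs maps each method
    of a class to the set of attributes it accesses. Result is in [0, 1]."""
    methods = list(method_attrs)
    pairs = list(combinations(methods, 2))
    if not pairs:
        # Single-method classes are otherwise undefined; returning 1.0 is an
        # assumed convention, not necessarily the modification of Al Dallal (2011a).
        return 1.0
    cohesive = sum(1 for m1, m2 in pairs if method_attrs[m1] & method_attrs[m2])
    return cohesive / len(pairs)

# Hypothetical class: three methods over attributes x and y.
print(tcc_like({"getX": {"x"}, "setX": {"x"}, "getY": {"y"}}))  # 1 of 3 pairs share an attribute
```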
We selected eight coupling measures that address different coupling aspects. We selected
CBO and its two extensions, CBO_IUB and CBO_U, because they consider the coupling
caused by accessing attributes and methods across different classes. RFC and MPC were
selected because they consider the coupling caused by method invocations. DAC1 and
DAC2 were selected because they account for the coupling caused by the attribute types
of the class of interest. Finally, we selected OCMEC because it considers the coupling
caused by the parameter types. CBO_IUB considers the coupling caused by the use of the
elements of the class of interest by other classes (i.e., import coupling), CBO_U
considers the coupling caused by the class of interest's use of elements of other classes
(i.e., export coupling), and CBO accounts for both import and export coupling. We
selected measures to cover different measuring approaches to interpret the results when
some measures are found to be significant maintainability predictors and others are not.
We developed our own Java tool (QMT 2013) to automate the size, cohesion, and
coupling measurement processes using the selected measures. For each class in the
considered systems (the versions that are identified in Section 4.1), the tool analyzed the
Java source code; extracted the required data; calculated the size, cohesion, and coupling
values using the 19 considered measures; and reported the results in an Excel
spreadsheet. Some selected measures had undefined values for some classes; for
example, TCC and LCC were originally undefined when the class of interest had a single
method. For all such cases, the tool set the measure value according to the
recommendations proposed by Al Dallal (2011a), which modified the measures such that
they were always applicable. This modification allowed us to apply the empirical study to
all considered classes, which made the results more general. We applied the considered
measures on the original versions of the classes because our goal was to investigate
whether the collected values of the measures are statistically related to the number of
revisions and number of revised lines of code in the future history of the classes. This
application allowed us to explore the prediction abilities of the measures.
We applied the boxplot statistical technique (Rousseeuw et al. 1999) to the collected
quality data to detect outliers. A few outliers were detected for the size and coupling
measures. However, we did not exclude any collected data because we found that
removing outliers did not lead to significant differences in the final analysis results.
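The boxplot rule used here for outlier detection can be sketched as follows; the sample values are hypothetical.

```python
import statistics

def boxplot_outliers(values, k=1.5):
    """Flag outliers by the standard boxplot rule: values outside
    [Q1 - k*IQR, Q3 + k*IQR], with k = 1.5 by convention."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lo or v > hi]

# A hypothetical size-measure sample with one extreme value.
print(boxplot_outliers([3, 5, 4, 6, 5, 4, 98]))  # the extreme value is flagged
```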
For Illusion classes, Table 5 lists descriptive statistics for the cohesion, coupling, and
size measures, including the minimum, 25% quartile, mean, median, 75% quartile,
maximum value, and standard deviation. The corresponding results for the FreeMind and
JabRef classes are reported in Appendix A (Tables A.1 and A.2).
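The reported statistics can be reproduced with a short routine such as the following. The input values are hypothetical, and since the paper does not state whether the sample or population standard deviation is used, the sample version is assumed here.

```python
import statistics

def describe(values):
    """Descriptive statistics as reported in the tables: minimum, 25%
    quartile, median, 75% quartile, maximum, mean, and standard deviation."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values),
        "25%": q1,
        "med": med,
        "75%": q3,
        "max": max(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation (assumption)
    }

# Hypothetical measure values for illustration.
stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
print(stats["med"], stats["mean"])
```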
Table 5: Descriptive statistics of the 19 considered independent measures for Illusion
classes
5. Data analysis techniques used in the empirical study

We performed an empirical study to investigate the practical ability of the size, cohesion,
and coupling measures to predict class maintainability. Based on the problems addressed,
we applied univariate and multivariate logistic regression statistical techniques to analyze
the collected data and build the maintainability prediction models. Logistic regression
(Hosmer and Lemeshow 2000) is a standard and mature statistical method based on
maximum likelihood estimation. This method is widely applied to predict other OO class
external quality attributes, such as class fault-proneness (e.g., Briand et al. 1998, Briand
et al. 2001b, Gyimothy et al. 2005, Marcus et al. 2008, Al Dallal and Briand 2012) and
class reusability (Al Dallal and Morasca 2012). Although we could have used other
analysis methods, including those discussed by Briand and Wust (2002), Subramanyam
and Krishnan (2003), and Arisholm et al. (2010), they are outside the scope of this paper.
The logistic regression model is univariate if it features only one independent variable
and multivariate if it includes several independent variables. In this case study, we
explored the abilities of the 19 considered quality measures to predict several
maintenance-dependent variables. Univariate regression is applied to study the
maintenance prediction capability of each measure separately, whereas multivariate
regression is applied to study the combined maintenance prediction of several measures.
An overview of the applied statistical techniques and model factors is provided in this
section.
Quality attribute | Measure | Min | Max  | 25%   | Med    | 75%    | Mean   | Std. Dev.
Size              | NOM     | 1   | 91   | 4.00  | 8.00   | 15.00  | 11.74  | 11.73
Size              | NOA     | 0   | 114  | 2.00  | 6.00   | 12.00  | 8.53   | 9.97
Size              | LOC     | 6   | 2767 | 51.25 | 106.50 | 219.75 | 189.60 | 270.42
Cohesion          | Coh     | 0   | 1    | 0.21  | 0.38   | 0.74   | 0.48   | 0.32
Cohesion          | CAMC    | 0   | 1    | 0.00  | 0.00   | 0.27   | 0.24   | 0.41
Cohesion          | TCC     | 0   | 1    | 0.20  | 0.49   | 1.00   | 0.52   | 0.37
Cohesion          | LCC     | 0   | 1    | 0.27  | 0.70   | 1.00   | 0.61   | 0.39
Cohesion          | LSCC    | 0   | 1    | 0.05  | 0.15   | 0.56   | 0.33   | 0.37
Cohesion          | SCOM    | 0   | 1    | 0.12  | 0.32   | 0.88   | 0.45   | 0.37
Cohesion          | PCCC    | 0   | 1    | 0.00  | 0.00   | 1.00   | 0.34   | 0.46
Cohesion          | OL2     | 0   | 1    | 0.00  | 0.00   | 0.24   | 0.24   | 0.41
Coupling          | CBO     | 0   | 208  | 3.00  | 7.00   | 12.00  | 11.86  | 19.31
Coupling          | CBO_IUB | 0   | 207  | 0.00  | 1.00   | 3.00   | 6.05   | 18.60
Coupling          | CBO_U   | 0   | 26   | 2.00  | 4.00   | 8.00   | 5.82   | 5.18
Coupling          | RFC     | 0   | 413  | 8.00  | 21.00  | 46.75  | 32.90  | 38.81
Coupling          | MPC     | 0   | 1739 | 12.00 | 41.50  | 94.00  | 81.26  | 144.58
Coupling          | DAC1    | 0   | 55   | 1.00  | 2.00   | 5.00   | 4.20   | 6.21
Coupling          | DAC2    | 0   | 19   | 1.00  | 2.00   | 4.00   | 2.79   | 3.15
Coupling          | OCMEC   | 0   | 22   | 2.00  | 4.00   | 7.00   | 4.94   | 3.90
5.1. Dependent Variables
In logistic regression, explanatory or independent variables are used to explain and
predict dependent variables. A dependent variable can only take discrete values and is
binary when we predict the classes expected to be involved in revisions or exhibit
considerable revised LOC. In this empirical study, we considered three practical
problems: (1) predicting the classes expected to be involved in at least one revision, (2)
predicting the classes expected to be frequently revised, and (3) predicting the classes
expected to exhibit a considerable amount of revised LOC. During the software
development phase, developers spend considerable time and effort testing and
documenting the software. It should be more efficient for software engineers to spend
more testing and documenting time and effort on the classes expected to be frequently
revised than on those expected to be less frequently revised. Software engineers are thus
advised to concentrate less on testing and documenting the classes predicted not to
require revision; they are instead advised to focus more on testing and documenting the
classes predicted to be involved in many revisions and require considerable maintenance
costs. Because the average value is a typical representative statistical value, we
considered the class to be involved in a considerable number of revisions if the actual
number of revisions performed on the class during the considered maintenance history
period is greater than the average number of revisions among all classes involved in
revisions. Similarly, the class is considered to require considerable maintenance costs (in
terms of the number of revised LOC) if the actual number of revised LOC during the
considered maintenance history is greater than the average number of revised LOC
among all classes involved in revisions.
Based on the three considered practical problems, we considered the following three
dependent variables in the logistic regression analysis:
A revised class (RC) is a class that was involved in at least one revision during the
considered maintenance effort. The RC value was set to "1" when the class was
involved in one or more revisions; otherwise, the RC value was set to "0".
A frequently revised class (FRC) is a class that was involved in a number of
revisions that is greater than the average number of revisions among all revised
classes. The FRC value of such a class was set to "1"; otherwise, the FRC value
was set to "0".
A costly revised class (CRC) is a class whose number of revised LOC during the
maintenance history was greater than the average number of revised LOC among
all revised classes. The CRC value of such a class was set to "1"; otherwise, the
CRC value was set to "0". We defined the classes with relatively high numbers
of revised LOC as costly revised classes because these classes are expected to
require costly maintenance (Granja-alvarez and Barranco-garcia 1997).
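The discretization of the three dependent variables can be sketched as follows. The class names and counts are hypothetical; the FRC and CRC thresholds are the averages over revised classes only, as defined above.

```python
def discretize(revisions, revised_loc):
    """Derive the binary dependent variables (RC, FRC, CRC) per class.
    revisions / revised_loc: dicts mapping class name -> count."""
    revised = [c for c, r in revisions.items() if r > 0]
    avg_rev = sum(revisions[c] for c in revised) / len(revised)
    avg_loc = sum(revised_loc[c] for c in revised) / len(revised)
    labels = {}
    for c in revisions:
        rc = 1 if revisions[c] >= 1 else 0           # involved in at least one revision
        frc = 1 if revisions[c] > avg_rev else 0     # above-average number of revisions
        crc = 1 if revised_loc[c] > avg_loc else 0   # above-average revised LOC
        labels[c] = (rc, frc, crc)
    return labels

# Hypothetical classes: A untouched, B lightly revised, C heavily revised.
revs = {"A": 0, "B": 2, "C": 10}
locs = {"A": 0, "B": 5, "C": 200}
labels = discretize(revs, locs)
print(labels)
```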
Table 6 shows the distribution of classes that have different values for the three
dependent variables in the three considered systems. For example, the table shows that
the RC values of 327 (76%) Illusion classes were set to "0" (i.e., they were not involved
in any revision) and the RC values of the rest of the Illusion classes (i.e., 103 classes)
were set to "1" (i.e., they were involved in some revisions).
Table 6: Number and percentage of classes for each value of our discretized maintenance
variables
5.2. Classification performance
The logistic regression analysis results in a prediction model that uses the following
equation:
\[
\pi(X_1, X_2, \ldots, X_n) = \frac{1}{1 + e^{-(C_0 + C_1 X_1 + C_2 X_2 + \ldots + C_n X_n)}}
\]
In our context, π represents the probability that the class is expected to be revised, to be
frequently revised, or to require costly revisions when using RC, FRC, or CRC,
respectively, as the dependent variable. The Xi's are the quality measures, and the Ci
coefficients are estimated by maximizing a likelihood function (i.e., they are obtained
using logistic regression analysis) (Hosmer and Lemeshow 2000). In univariate regression
analysis, only one quality measure is used as the independent variable, and the prediction
equation becomes as follows:
\[
\pi(X) = \frac{1}{1 + e^{-(C_0 + C_1 X)}}
\]
For each model in our analyses, we report:
Intercept (denoted by c0 in the tables): intercept value estimated using logistic
regression analysis.
Coefficients (denoted by c1 in the tables): coefficient values of the independent
variables estimated using logistic regression analysis.
In practice, the π of each class of interest must be calculated. The software engineer must
pay more attention to classes with relatively high π values because these classes are
candidates for being revised, being frequently revised, or requiring costly maintenance. A
threshold t must be set for π to classify the classes accordingly and assess the
classification performance of a probability estimation model. All classes whose estimated
probability is less than or equal to t are classified as having an estimated value of the
discretized maintenance variable of Y = 0, and Y = 1 otherwise. This classification allows
us to obtain a 2×2 contingency table, as shown in Table 7. For example, cell [0,0]
contains the number of classes whose estimated and actual Y values are both 0. The sums
across the rows provide the number of Estimated Negatives and Positives,
and the sums across the columns produce the number of Actual Negatives and Positives.
System   | Value | RC          | FRC         | CRC
Illusion | 0     | 327 (76%)   | 388 (90.2%) | 406 (94.4%)
Illusion | 1     | 103 (24%)   | 42 (9.8%)   | 24 (5.6%)
FreeMind | 0     | 253 (69.7%) | 324 (89.3%) | 334 (92%)
FreeMind | 1     | 110 (30.3%) | 39 (10.7%)  | 29 (8%)
JabRef   | 0     | 191 (62.4%) | 278 (90.8%) | 279 (91.2%)
JabRef   | 1     | 115 (37.6%) | 28 (9.2%)   | 27 (8.8%)
Table 7: Contingency table
The determination of a threshold t for classification purposes is a subjective choice, and it
may dramatically change the classification results. Setting t to be "0" results in the
classification of all classes as having a discretized estimated value of 1; therefore, no
class is classified as Estimated Negative. Conversely, setting t to "1" causes all classes to
be classified as having a discretized estimated value of 0; therefore, no class is classified
as Estimated Positive. Table 6 shows that the distributions of the three dependent
maintenance variables are concentrated (Morasca 2004). This observation clearly
demonstrates the inadequacy of using a 50% classification threshold because this
threshold would be too far from the true proportion of actual positives. The proportion of
actual positives may be a better choice than the default classification threshold (0.5) for
assessing the actual classification strength of an estimation model because the former
threshold uses information available from the field instead of relying on the arbitrary
threshold of 0.5. In our analyses, we set t to be the proportion of actual positives in the
considered data set, i.e., a class is classified as having a discretized maintenance variable
value of 1 if its estimated probability is greater than this proportion. For example, when
considering the Illusion classes and RC as a dependent variable, according to the values
given in Table 6, we set t to 0.24.
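The threshold rule above can be sketched as follows. The function names are ours, not the paper's, and the counts are the Illusion/RC values from Table 6 (103 of 430 classes revised), used purely for illustration.

```python
def proportion_threshold(actual):
    """Threshold t = proportion of actual positives (Y = 1) in the data set."""
    return sum(actual) / len(actual)

def classify(probabilities, t):
    """A class is Estimated Positive (Y = 1) only when its estimated
    probability pi exceeds the threshold t; otherwise Y = 0."""
    return [1 if p > t else 0 for p in probabilities]

# Illustrative: for Illusion/RC, 103 of 430 classes were revised (Table 6).
actual = [1] * 103 + [0] * 327
t = proportion_threshold(actual)             # 103/430, about 0.2395
estimated = classify([0.10, 0.30, 0.24], t)  # -> [0, 1, 1]
```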
Based on this contingency table, we considered the following classification performance
indicators in our empirical study:
Precision (denoted in the results tables as P) = True Positives/Estimated Positives;
this indicator is not defined if there are no estimated positives;
Recall (denoted R) = True Positives/Actual Positives;
Inverse precision (denoted IP) = True Negatives/Estimated Negatives; this
indicator is not defined if there are no estimated negatives;
Inverse recall (denoted IR) = True Negatives/Actual Negatives.
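Given the cell counts of a Table 7-style contingency table, the four indicators follow directly from the definitions above. This is a minimal sketch with hypothetical counts; indicators whose denominator is zero are returned as None, matching the "not defined" cases.

```python
def classification_indicators(tn, fn, fp, tp):
    """Precision, recall, inverse precision, and inverse recall from the
    2x2 contingency table (tn/fn/fp/tp = true/false negatives/positives).
    Indicators with a zero denominator are undefined and returned as None."""
    est_pos, est_neg = tp + fp, tn + fn
    act_pos, act_neg = tp + fn, tn + fp
    return {
        "P":  tp / est_pos if est_pos else None,   # precision
        "R":  tp / act_pos if act_pos else None,   # recall
        "IP": tn / est_neg if est_neg else None,   # inverse precision
        "IR": tn / act_neg if act_neg else None,   # inverse recall
    }

# Hypothetical counts for illustration only.
ind = classification_indicators(tn=300, fn=20, fp=27, tp=83)
# P = 83/110, R = 83/103, IP = 300/320, IR = 300/327
```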
Precision and recall (Olson and Delen 2008) are used to indicate the performance of the
model in correctly predicting the discretized maintenance variable Y = 1, whereas inverse
precision and inverse recall (Powers 2007) are applied to demonstrate the performance of
the model in correctly predicting the discretized maintenance variable Y = 0. These four
classification performance indicators depend on the value of the probability threshold
selected for classification. To evaluate the performance of a prediction model regardless
of any particular threshold, we instead used the receiver operating characteristic (ROC)
curve (Hosmer and Lemeshow 2000). In this study, the ROC curve is a graphical plot of
the ratio of classes correctly classified with a maintenance variable of 1 versus the ratio
of classes incorrectly classified with a maintenance variable of 1 at different thresholds.
The area under the ROC curve (AUC) represents the ability of the model to correctly
rank classes based on the considered maintenance variable. A 100% ROC area represents
a perfect model that correctly classifies all classes, and larger ROC areas indicate that the
model is better at classifying classes. The AUC is often considered a better evaluation
criterion than standard precision and recall, as selecting a threshold is always somewhat
subjective. We applied the following general rules to assess the classification
performance according to the AUC value (Hosmer and Lemeshow 2000): AUC=0.5
means that the classification is not good, 0.5<AUC<0.6 means that the classification is
poor, 0.6≤AUC<0.7 means that the classification is fair, 0.7≤AUC<0.8 means that the
classification is acceptable, 0.8≤AUC<0.9 means that the classification is excellent, and
AUC≥0.9 means that the classification is outstanding. Thresholds based on the ROC
analysis for the selected measures are considered practical if they fall at least within the
acceptable range (Shatnawi et al., 2010). A measure might be found to be a statistically
significant maintainability predictor (p-value<0.05), but it could be determined to be an
impractical predictor according to the AUC.
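The AUC can be computed without tracing the full curve, using its equivalent rank-statistic formulation: the probability that a randomly chosen positive class receives a higher estimated probability than a randomly chosen negative one. This is our sketch, not the authors' implementation; the category labels follow the Hosmer and Lemeshow scale quoted above.

```python
def auc(probabilities, actual):
    """AUC as the Mann-Whitney statistic: the fraction of (positive,
    negative) pairs ranked correctly; ties count one half."""
    pos = [p for p, y in zip(probabilities, actual) if y == 1]
    neg = [p for p, y in zip(probabilities, actual) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_category(a):
    """Interpretation scale of Hosmer and Lemeshow (2000) used in the text."""
    if a >= 0.9: return "outstanding"
    if a >= 0.8: return "excellent"
    if a >= 0.7: return "acceptable"
    if a >= 0.6: return "fair"
    if a > 0.5:  return "poor"
    return "no discrimination"
```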
5.3. Goodness-of-fit
To explore the goodness-of-fit of the constructed univariate and multivariate regression
models, we used the following indicators:
R2 = (L0 - LL)/L0 (Hosmer and Lemeshow 2000), where LL is the log-likelihood
of the model that includes the independent variables and L0 is the log-likelihood
of a model with no independent variables. R2 represents the proportion of the
log-likelihood of the constant model that is explained by including the
independent variables. The value of R2 ranges between 0 and 1. For technical
reasons, high values of R2 are rare, even for accurate models.
Mean squared error (MSE), which is the average of the squared differences
between the probability values estimated by a logistic regression model and the
actual values of the discretized maintenance variable used.
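A minimal sketch of both indicators, assuming (as is standard for this likelihood-ratio R2) that the constant model predicts the proportion of positives for every class; the probability values used in any example are illustrative.

```python
import math

def log_likelihood(probs, actual):
    """Log-likelihood of binary outcomes under estimated probabilities."""
    return sum(math.log(p) if y == 1 else math.log(1.0 - p)
               for p, y in zip(probs, actual))

def r_squared(probs, actual):
    """R2 = (L0 - LL)/L0, where LL is the log-likelihood of the fitted
    model and L0 that of the constant (no-covariate) model."""
    p0 = sum(actual) / len(actual)
    l0 = log_likelihood([p0] * len(actual), actual)
    ll = log_likelihood(probs, actual)
    return (l0 - ll) / l0

def mse(probs, actual):
    """Mean squared difference between the estimated probabilities and
    the actual 0/1 values of the discretized maintenance variable."""
    return sum((p - y) ** 2 for p, y in zip(probs, actual)) / len(actual)
```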
5.4. Model Validation
To more realistically assess the constructed models' predictive capacities, we used V-
cross-validation, a procedure in which a data set is partitioned into V sub-samples. The
regression model is then built and evaluated V times. Each time, a different sub-sample is
used to evaluate the classification performance, and the remaining sub-samples are
used as training data to build the regression model. We applied the V-cross-validation
technique to build each univariate and multivariate regression model considered in this
empirical study. When building each model, we applied 10-times 10-fold cross-validation
(i.e., ten repetitions of 10-fold cross-validation).
To provide evidence that the multivariate models have, in practice, reasonable
performance in predicting the considered maintenance factors, in addition to applying V-
cross-validation, we also validated the multivariate regression models by exploring their
performances when they are applied to classes other than those used to build the models.
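The validation procedure above can be sketched as follows. The `build_model` and `evaluate` arguments are placeholders of our own (not the paper's code) standing in for the logistic regression fit and the classification-performance computation.

```python
import random

def cross_validate(data, build_model, evaluate, folds=10, repeats=10, seed=0):
    """Repeated V-fold cross-validation: each repetition shuffles the
    classes and partitions them into `folds` sub-samples; each sub-sample
    is used once for evaluation while the remaining ones form the
    training set. Returns the average evaluation score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        idx = list(range(len(data)))
        rng.shuffle(idx)
        for f in range(folds):
            test_idx = set(idx[f::folds])              # evaluation sub-sample
            train = [data[i] for i in idx if i not in test_idx]
            model = build_model(train)
            scores.append(evaluate(model, [data[i] for i in test_idx]))
    return sum(scores) / len(scores)
```

With `repeats=10` and `folds=10` this matches the 10-times 10-fold setup used in the study.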
6. Univariate Regression Analysis Results
For each considered system, we constructed three families of univariate regression
models, one for each discretized maintenance variable: RC, FRC, and CRC. Each model
family contains a single univariate model for each independent variable that was found to
be statistically significant. This analysis aims to investigate the relation between the
maintainability external quality attribute, indicated by the three maintenance variables,
and the three considered internal quality attributes, i.e., size, cohesion, and coupling,
indicated by the 19 considered measures. The results of the univariate regression analysis
are presented in tables, and their characteristics are discussed in the following order:
Statistically significant independent variables: We present the independent
variables found to be statistically significant.
Direction of independent variable impact: We discuss whether the statistically
significant independent variables influence the estimated probability positively or
negatively, where independent variables with a positive (negative) coefficient
influence the estimated probability positively (negatively).
Goodness-of-fit: We discuss the obtained MSE and R2.
Classification performance: We present the precision, recall, inverse precision,
inverse recall, and AUC.
6.1. Univariate analysis of the RC
Table 8 reports the results of the univariate RC prediction models based on the Illusion
classes. Appendix B provides the results based on classes in the other two systems
(Tables B.1 and B.2). The reported results lead to the following observations.
Table 8: Univariate results for the RC using Illusion classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -1.743 0.046 0.173 0.051 < 0.0001 0.345 0.476 0.813 0.716 0.646
NOA -1.636 0.052 0.173 0.042 < 0.0001 0.319 0.447 0.801 0.700 0.591
LOC -1.660 0.002 0.171 0.061 < 0.0001 0.418 0.495 0.831 0.783 0.687
Coh -0.660 -1.106 0.179 0.018 0.004 0.300 0.718 0.842 0.471 0.626
CAMC -0.484 -1.901 0.179 0.026 0.001 0.294 0.660 0.824 0.502 0.610
TCC -0.587 -1.191 0.178 0.031 <0.001 0.290 0.583 0.807 0.550 0.620
LCC -0.782 -0.633 0.182 0.010 0.027 0.285 0.515 0.795 0.593 0.596
LSCC -0.933 -0.724 0.181 0.011 0.030 0.273 0.757 0.826 0.364 0.638
SCOM -0.652 -1.243 0.176 0.030 <0.001 0.292 0.709 0.833 0.459 0.643
PCCC -0.856 -1.084 0.177 0.034 <0.001 0.296 0.816 0.870 0.388 0.668
OL2 -0.968 -0.955 0.179 0.021 0.004 0.276 0.854 0.865 0.294 0.570
CBO -1.393 0.019 0.178 0.024 0.002 0.371 0.379 0.803 0.798 0.665
CBO_IUB -1.236 0.012 0.184 0.010 0.033 0.295 0.175 0.770 0.869 0.510
CBO_U -1.810 0.102 0.173 0.050 < 0.0001 0.356 0.563 0.831 0.679 0.670
RFC -1.891 0.020 0.166 0.086 < 0.0001 0.422 0.602 0.855 0.740 0.724
MPC -1.591 0.005 0.170 0.064 < 0.0001 0.430 0.476 0.829 0.801 0.713
DAC -1.535 0.082 0.172 0.045 < 0.0001 0.370 0.456 0.815 0.755 0.609
DAC2 -1.715 0.179 0.169 0.058 < 0.0001 0.393 0.447 0.818 0.783 0.606
OCMEC -1.777 0.116 0.176 0.037 < 0.0001 0.353 0.524 0.823 0.697 0.614
Statistically significant independent variables: Except for the TCC measure
applied to FreeMind classes, the univariate results show that each of the
considered size, cohesion, and coupling measures was found to be a statistically
significant RC predictor.
Direction of independent variable impact: Each of the considered size and
coupling measures appears to have a positive impact on RC, and a class with
higher values of these measures has a higher probability of being revised during
the maintenance phase. The considered cohesion measures seem to negatively
affect RC, which implies that a class with higher cohesion measure values has a
lower probability of being revised during the maintenance phase.
Goodness-of-fit: The MSE values are relatively small. Such an observation is
expected because of the concentration of the RC distribution (Table 6), which
implies that the probability estimates are also concentrated. The R2 values are
often relatively low, which implies that the models have low goodness-of-fit.
Classification performance: Due to the concentration of the distribution on RC =
0, most models exhibit low precision and recall values, i.e., the models have a low
percentage of True Positives among the observations that actually have or are
estimated to have RC = 1. Conversely, the percentage of True Negatives among the
Estimated Negatives and Actual Negatives is relatively high for most models.
This observation indicates that most models are good predictors for classes that
have low probabilities of being revised during the maintenance phase. Most AUC
values do not exceed the “fair” category, which indicates that most obtained
models are not practical classifiers for the revised and unrevised classes; a few
models, such as those based on MPC and RFC, were always found to be
acceptable classifiers.
6.2. Univariate analysis of the FRC
Table 9 presents the results of the univariate models based on Illusion classes using the
FRC as the discretized maintenance variable. Appendix B (Tables B.3 and B.4) provides
the corresponding results for the classes of the other two considered systems. The results
lead to the following observations.
Statistically significant independent variables: All considered size measures
appear to be statistically significant predictors of FRC. In addition, most
considered cohesion and coupling measures are always found to be statistically
significant FRC predictors. The Coh, LSCC, and CBO_U measures appear to be
statistically significant FRC predictors for classes in two of the three considered
systems. Finally, TCC and LCC are always found to be nonsignificant FRC
predictors.
Direction of independent variable impact: As in RC, all statistically significant
size and coupling measures positively affect FRC, and all statistically significant
cohesion measures negatively affect FRC.
Goodness-of-fit: The MSE values are smaller than those obtained for RC, and the
R2 values are mostly higher than those obtained for RC. This observation implies
that the FRC models exhibit better goodness-of-fits than RC models.
Classification performance: The precision values of the statistically significant
models are quite low, and they are always lower than those of the RC models. In
many cases, the models' recall values are lower than those of the RC models.
Conversely, the inverse precision values of the statistically significant models are
quite high, and they are always higher than those of the RC models. In many
cases, the inverse recall values are higher than those of the RC models. These
observations imply that the statistically significant FRC models can be trusted
more when they estimate that a class will not be frequently revised than when
they predict that a class will be frequently revised. Several models have AUC
values in the “excellent” category, and the remaining statistically significant
models have AUC values in either the "fair" or "acceptable" ranges. In many
cases, the obtained models are practical classifiers for classes that are frequently
revised and those that are not.
Table 9: Univariate results for the FRC using Illusion classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -3.010 0.053 0.082 0.083 < 0.0001 0.222 0.524 0.940 0.802 0.696
NOA -3.029 0.074 0.079 0.100 < 0.0001 0.218 0.571 0.944 0.778 0.657
LOC -2.982 0.003 0.079 0.129 < 0.0001 0.262 0.524 0.942 0.840 0.773
Coh -1.749 -1.091 0.087 0.014 0.057 0.124 0.690 0.934 0.472 0.636
CAMC -1.069 -3.669 0.085 0.061 <0.001 0.162 0.762 0.957 0.575 0.698
TCC -1.873 -0.730 0.088 0.010 0.103 0.101 0.476 0.905 0.539 0.580
LCC -2.203 -0.033 0.088 0.000 0.936 NA 0.000 0.902 1.000 0.554
LSCC -2.042 -0.601 0.088 0.006 0.216 0.109 0.714 0.923 0.369 0.633
SCOM -1.774 -1.156 0.087 0.021 0.023 0.131 0.738 0.943 0.469 0.638
PCCC -1.904 -1.377 0.086 0.038 0.005 0.127 0.857 0.959 0.361 0.734
OL2 -2.022 -1.201 0.087 0.023 0.030 0.116 0.881 0.955 0.276 0.628
CBO -2.584 0.024 0.085 0.054 <0.001 0.184 0.333 0.921 0.840 0.700
CBO_IUB -2.402 0.020 0.085 0.038 0.001 0.173 0.214 0.913 0.889 0.616
CBO_U -2.897 0.097 0.087 0.044 <0.001 0.176 0.571 0.939 0.711 0.696
RFC -3.235 0.023 0.081 0.141 < 0.0001 0.243 0.643 0.953 0.784 0.804
MPC -2.867 0.006 0.081 0.125 < 0.0001 0.280 0.548 0.945 0.848 0.811
DAC -2.904 0.117 0.079 0.120 < 0.0001 0.247 0.429 0.933 0.858 0.697
DAC2 -3.202 0.254 0.075 0.137 < 0.0001 0.197 0.548 0.939 0.758 0.693
OCMEC -3.263 0.171 0.083 0.084 < 0.0001 0.182 0.524 0.935 0.745 0.699
6.3. Univariate analysis of the CRC
Table 10 contains the results of the univariate models based on Illusion classes using the
CRC as the discretized maintenance variable. Appendix B (Tables B.5 and B.6) provides
the corresponding results based on classes in the other two systems. The reported results
lead to the following observations.
Table 10: Univariate results for the CRC using Illusion classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -3.556 0.047 0.051 0.074 0.000 0.115 0.458 0.961 0.791 0.744
NOA -3.838 0.083 0.045 0.147 < 0.0001 0.141 0.458 0.963 0.835 0.725
LOC -3.737 0.003 0.047 0.182 < 0.0001 0.221 0.625 0.975 0.869 0.885
Coh -1.773 -2.838 0.051 0.062 0.004 0.097 0.792 0.979 0.564 0.716
CAMC -1.745 -3.464 0.052 0.049 0.008 0.095 0.708 0.972 0.601 0.680
TCC -2.411 -0.898 0.053 0.013 0.126 0.053 0.375 0.942 0.603 0.586
LCC -2.726 -0.170 0.053 0.001 0.751 0.043 0.167 0.940 0.778 0.563
LSCC -2.305 -2.336 0.052 0.046 0.021 0.086 0.875 0.984 0.448 0.703
SCOM -1.971 -2.716 0.051 0.071 0.003 0.099 0.833 0.982 0.549 0.706
PCCC -2.400 -2.832 0.051 0.077 0.015 0.082 0.958 0.993 0.367 0.764
OL2 -2.467 -762.6 0.051 0.090 <0.001 0.078 1.000 1.000 0.303 0.639
CBO -2.956 0.009 0.053 0.007 0.200 0.091 0.167 0.948 0.901 0.686
CBO_IUB -2.866 0.005 0.053 0.002 0.533 0.091 0.125 0.947 0.926 0.611
CBO_U -3.458 0.090 0.052 0.035 0.008 0.110 0.625 0.969 0.702 0.686
RFC -3.956 0.023 0.049 0.169 < 0.0001 0.167 0.625 0.974 0.815 0.840
MPC -3.624 0.006 0.048 0.178 < 0.0001 0.246 0.583 0.973 0.894 0.877
DAC -3.612 0.117 0.047 0.152 < 0.0001 0.182 0.500 0.967 0.867 0.734
DAC2 -3.803 0.236 0.047 0.131 < 0.0001 0.133 0.458 0.963 0.823 0.697
OCMEC -4.147 0.201 0.050 0.117 < 0.0001 0.140 0.708 0.977 0.744 0.782
Statistically significant independent variables: As in FRC, the TCC and LCC
models always appear to be statistically nonsignificant CRC predictors. A size
measure, NOA, and a coupling measure, DAC, were found to be statistically
significant CRC predictors using only classes in the Illusion system. The cohesion
measures PCCC and OL2 and the coupling measures CBO, CBO_IUB, and CBO_U
were found to be statistically significant CRC predictors using classes of two of
the three considered systems. The remaining size, cohesion, and coupling
measures (10 measures) always appear to be statistically significant CRC
predictors.
Direction of independent variable impact: As with both the RC and FRC, all
statistically significant size and coupling measures positively affect the CRC, and
all statistically significant cohesion measures negatively affect the CRC.
Goodness-of-fit: With few exceptions, the MSE values are somewhat smaller than
those obtained for the RC and FRC. In many cases, the R2 values of the models
obtained for the CRC appear to be slightly higher than those obtained for the RC
and FRC. These observations indicate that most statistically significant models
obtained for the CRC have better goodness-of-fit values than those obtained for
the RC and FRC.
Classification performance: The precision values of the statistically significant
models are quite low for all considered measures, most recall and inverse recall
values are greater than 50%, and the inverse precision values of the statistically
significant models are always high (i.e., greater than 90%). These observations
imply that the statistically significant CRC models can be trusted more when they
estimate the classes that do not require costly revisions than when they predict the
classes that do. Most obtained models have AUC values in either the "acceptable"
or "excellent" categories, and the remaining statistically significant models have
AUC values in the "fair" range. This observation implies that most obtained
statistically significant CRC models are practical classifiers for classes that
require or do not require costly revisions.
6.4. Univariate analyses: discussion
The above results show that the considered independent variables can be used to
construct statistically significant univariate predictor models. The statistically significant
size and coupling measures were consistently found to positively affect all three
discretized maintenance variables, whereas the statistically significant cohesion measures
appeared to negatively affect the considered maintenance variables. These observations
indicate that classes with better quality (i.e., smaller sizes, lower coupling, and higher
cohesion) potentially have higher maintainability (i.e., are less frequently revised and
have fewer revised LOC) than those with worse quality.
Besides the importance of the univariate analyses in exploring the direction and strength
of the relationship between the considered internal quality attributes and maintainability
factors, we used the univariate analysis results to select the measures to be included in
the construction of the multivariate prediction models. As illustrated in the next section,
in the forward model construction approach, we used the AUC values obtained in the
univariate analysis to decide the order in which the measures are included in the
constructed models. Measures with higher AUC values were given higher priority for
inclusion in the constructed models.
7. Multivariate analysis
This multivariate analysis aims to construct practical and optimized models that include
multiple quality measures to predict the three considered dependent maintenance
variables, namely, RC, FRC, and CRC. The analysis explores how well the models of
measures used in combination can predict the required dependent maintenance variables.
The inclusion of all measures in the model can produce the best result for the AUC value.
However, this strategy results in a model with a relatively high MSE, which means that
the model is highly dependent on the data set. This result contradicts our goal of having a
general model that can perform well with any given data set. In addition, a model that
includes all considered measures is difficult to use in practice because it exhausts the
software engineer’s time by applying all measures. To solve these problems, a set of
measures must be selected to construct an optimized model. These measures must be
selected to reduce the multicollinearity in the model, i.e., the existence of highly
correlated measures (Hosmer and Lemeshow 2000). Therefore, the constructed model
does not necessarily include the measures found to be the best individual predictors;
depending on the correlations between them, the measures existing together may increase
the multicollinearity in the model. Two general approaches are followed to construct an
optimized model: forward and backward selection (Hosmer and Lemeshow 2000). In the
forward selection approach, the model construction process starts with no variables in the
model, and the variables are added individually if they are statistically significant in the
prediction model. In the backward selection approach, the model construction process
starts with a model that includes all independent variables, and the statistically
nonsignificant variables are sequentially removed from the model. Different researchers
have used different statistical criteria to select the variables to add or delete.
In this research, we tried both construction approaches. The first experiment was based
on the backward selection approach. We selected the p-value to be the criterion for
removing measures from the model. In each step, we applied multivariate regression
analysis and removed the measure that had the highest p-value from the model. The
process continued until each of the remaining measures had a p-value above α (0.05).
This backward-based experiment does not take the AUC values into account. The second
experiment was based on the forward selection approach. First, we ordered the measures
in a descending fashion according to the AUC reported in Section 6 (last column in
Tables 8, 9, and 10). On the basis of this order, in each step, a measure was added to the
model and the regression analysis was performed again using the measures that existed in
the model at that moment. If the added measure was found to have a p-value greater than
0.05, the measure would be removed from the model. In addition, if the added
measure caused a measure already in the model to become insignificant (p-value>0.05),
the measure would be deleted from the model. Because our forward-based model-
construction process considered the AUC values of the individual measures, we noticed
that the resulting models had AUC values that were better than those of the models
constructed using the backward approach. At the same time, the multicollinearities in the
forward models were relatively low due to the consideration of the p-values while
constructing the models. Therefore, because of space limitations, in this section we only
report, discuss, and compare the results of the forward-based models.
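The forward procedure described above can be sketched as follows. The `fit` argument is a placeholder of ours for the multivariate logistic regression, returning a p-value per measure in the candidate model; we read the deletion rule as dropping the newly added measure when it makes an existing one nonsignificant, which is our interpretation of the text. The mock p-values in the usage example are illustrative only.

```python
def forward_select(measures_by_auc, fit, alpha=0.05):
    """AUC-ordered forward selection: measures are tried in descending
    order of their univariate AUC. A measure is kept only if it is
    statistically significant (p <= alpha) and does not push a measure
    already in the model above alpha."""
    model = []
    for m in measures_by_auc:
        p_values = fit(model + [m])
        if p_values[m] > alpha:
            continue                 # the added measure is not significant
        if any(p_values[v] > alpha for v in model):
            continue                 # it breaks an already-included measure
        model = model + [m]
    return model

# Illustrative mock: fixed p-values per measure, ordered by univariate AUC.
_p = {"RFC": 0.001, "MPC": 0.2, "CBO": 0.03, "TCC": 0.004}
selected = forward_select(["RFC", "MPC", "CBO", "TCC"],
                          fit=lambda ms: {m: _p[m] for m in ms})
# MPC is rejected (p = 0.2 > 0.05); RFC, CBO, and TCC are retained.
```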
We tested the multicollinearity in the models by obtaining the variance inflation factor
(VIF) (O’Brien 2007), a widely used measure of the multicollinearity of a variable with
other variables in the model. We used the rule of thumb that a VIF value above four
indicates a multicollinearity problem. We used the Mahalanobis distance (Barnett and Lewis 1994)
to detect outliers in the models that include multiple independent variables. However, we
found that the removal of outliers did not lead to significant differences in the final
multivariate regression analysis results.
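The VIF check can be sketched as below: each variable is regressed (by ordinary least squares, with an intercept) on the remaining variables, and VIF = 1/(1 - R^2). This is our own small-scale sketch, not the tooling used in the study.

```python
def _solve(a, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col and m[r][col]:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def vif(columns):
    """Variance inflation factor of each variable (given as a list of
    equal-length columns): regress it on the others and return
    1 / (1 - R^2). Values above four signal multicollinearity."""
    out = []
    n = len(columns[0])
    for j, y in enumerate(columns):
        xs = [[1.0] * n] + [c for k, c in enumerate(columns) if k != j]
        xtx = [[sum(a * b for a, b in zip(r, c)) for c in xs] for r in xs]
        xty = [sum(a * b for a, b in zip(r, y)) for r in xs]
        beta = _solve(xtx, xty)
        fit = [sum(b * x[i] for b, x in zip(beta, xs)) for i in range(n)]
        ybar = sum(y) / n
        ss_res = sum((a - b) ** 2 for a, b in zip(y, fit))
        ss_tot = sum((a - ybar) ** 2 for a in y)
        out.append(ss_tot / ss_res)      # = 1 / (1 - R^2)
    return out
```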
We constructed three models, one for each considered variable, namely, RC, FRC, and
CRC. The models were constructed using classes in the Illusion system. This system had
the largest number of considered classes; therefore, its models should be more general
than those constructed using the other systems. However, we constructed the prediction
models using the classes in each of the other two systems and reported the results in
Appendix C. As discussed in Section 5.4, we applied two validation approaches. In the
first approach, we applied the V-cross-validation statistical technique. In the second
approach, we explored the classification performance of the three constructed models
when applied to classes in the other two systems. In this case, we applied the equation of
the multivariate model constructed using the Illusion classes to each class in the other two
systems. We used the threshold suggested for the Illusion system on the π values obtained
for the classes in the other two systems to determine whether each class was estimated to
have low maintainability. This process resulted in the 2×2 contingency table, as in Table
7, for the classes of each system. Therefore, we could identify the precision, recall,
inverse precision, and inverse recall values for each system. These values were used to
evaluate the classification performance of the model constructed using the Illusion
classes when applied to the classes of the other two systems. To simplify this evaluation,
we calculated the weighted average precision (WAP) and weighted average recall (WAR)
using the following formulas:
WAP = (precision * (FN + TP) + inverse precision * (TN + FP)) / (TN + FP + FN + TP), and
WAR = (recall * (FN + TP) + inverse recall * (TN + FP)) / (TN + FP + FN + TP).
The WAP and WAR values are still dependent on the selected threshold for the model,
but they are more representative of the confusion matrix values than the traditional
precision and recall values. Finally, we calculated the FMeasure, defined as the harmonic
mean of WAP and WAR, and used it as a single representative value to compare the
classification performance of the model when applied to the original data set (Illusion
classes) to that obtained when applying the model to the validation data sets.
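Under hypothetical cell counts, WAP, WAR, and the FMeasure follow directly from the formulas above; the weights are the numbers of actual positives (FN + TP) and actual negatives (TN + FP). Note that, algebraically, WAR reduces to overall accuracy, (TN + TP)/total.

```python
def weighted_performance(tn, fn, fp, tp):
    """WAP, WAR, and their harmonic mean (FMeasure), weighting precision/
    recall by actual positives and their inverse counterparts by actual
    negatives, as in the formulas above."""
    total = tn + fp + fn + tp
    pos, neg = fn + tp, tn + fp              # actual positives / negatives
    p, r = tp / (tp + fp), tp / pos          # precision, recall
    ip, ir = tn / (tn + fn), tn / neg        # inverse precision / recall
    wap = (p * pos + ip * neg) / total
    war = (r * pos + ir * neg) / total       # equals (tp + tn) / total
    return wap, war, 2 * wap * war / (wap + war)

# Hypothetical counts for illustration only.
wap, war, fmeasure = weighted_performance(tn=300, fn=20, fp=27, tp=83)
```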
7.1. Multivariate analysis of the RC
Tables 11 and 12 present the results for the statistically significant multivariate model
using RC as the discretized maintenance variable. The model included three measures.
The first row in Table 12 shows the classification performance of the model when applied
to the original data set used to construct the model. The second and third rows provide
the classification performance of the model when applied to the classes of the other two
systems.
Table 11: RC-based forward multivariate regression model.
Metric Coefficient p-value VIF
Intercept -1.484 < 0.0001 -
RFC 0.018 < 0.0001 1.069
CBO 0.014 0.022 1.047
TCC -1.079 0.003 1.026
Model: MSE = 0.164, R2 = 0.117, AUC = 0.747

Table 12: Multivariate analysis of the RC: classification performance and validation
System P R IP IR FMeasure
Illusion 0.412 0.680 0.873 0.694 0.725
FreeMind 0.511 0.209 0.726 0.913 0.680
JabRef 0.679 0.461 0.728 0.869 0.713

The results in Tables 11 and 12 lead to the following observations.
Statistically significant independent variables: The model has three statistically
significant measures. Two of them are coupling measures (RFC and CBO), and
one is a cohesion measure (TCC). However, the coefficient values show that the
contribution of the TCC measure to the model is much higher than those of the
coupling measures. No statistically significant independent variable is a size
measure. The model is free of multicollinearity problems (i.e., all VIF values
are less than four).
Direction of independent variable impact: The coupling measures still positively
affect RC, and the cohesion measure still negatively affects RC.
Goodness-of-fit: The MSE (0.164) is slightly smaller than the values obtained for
the univariate models, which were already small. The R2 value (0.117) is much
larger than those obtained in the univariate models, which implies a much better
goodness-of-fit.
Classification performance: The model still features a low precision value.
However, the classification performance values of the model are better than most
of those obtained using univariate models. The model has an AUC of 0.747
(acceptable), which is higher than any AUC values obtained using RC univariate
models.
Validation: The model has higher precision and inverse recall values and lower
recall and inverse precision values when applied to other data sets. However, it is
normal to have worse classification performance results when applying the model
to FreeMind and JabRef classes, as these classes are not within the data set used
to construct the model. As discussed in Section 5.2, these results depend on the
selected classification threshold. Overall, the FMeasure values show that the
model, when applied to other classes, has a classification performance similar to
that obtained when the model is applied to its original data set. This observation
provides confidence that the model behaves well when applied in practice to
predict the classes estimated to be involved in revisions.
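As an illustration of applying the Table 11 model to a new class, the estimated probability follows the standard logistic form pi = 1/(1 + e^(-z)) used by logistic regression (an assumption about notation; the coefficients and the t = 0.24 threshold are taken from Tables 11 and 6). The measure values in the example are hypothetical.

```python
import math

def rc_probability(rfc, cbo, tcc):
    """Estimated probability that a class will be revised, using the RC
    multivariate model of Table 11 (built on Illusion classes):
    logit(pi) = -1.484 + 0.018*RFC + 0.014*CBO - 1.079*TCC."""
    z = -1.484 + 0.018 * rfc + 0.014 * cbo - 1.079 * tcc
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical class: RFC = 40, CBO = 10, TCC = 0.2. A class is flagged
# (Y = 1) when pi exceeds the Illusion RC threshold t = 0.24 (Table 6).
pi = rc_probability(rfc=40, cbo=10, tcc=0.2)
flagged = pi > 0.24
```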
7.2. Multivariate analysis of the FRC
Tables 13 and 14 present the results of the statistically significant multivariate model
using FRC as the discretized maintenance variable. The model included four measures.
Table 13: FRC-based forward multivariate regression model.
Metric Coefficient p-value VIF
Intercept -4.690 < 0.0001 -
RFC 0.021 0.001 2.029
CBO 0.022 0.002 1.065
DAC 0.079 0.008 1.925
Coh 1.742 0.022 1.342
Model: MSE = 0.076, R2 = 0.209, AUC = 0.836

Table 14: Multivariate analysis of the FRC: classification performance and validation
System P R IP IR FMeasure
Illusion 0.304 0.667 0.944 0.785 0.855
FreeMind 0.278 0.128 0.901 0.960 0.852
JabRef 0.341 0.536 0.950 0.896 0.878

The results in Tables 13 and 14 lead to the following observations.
Statistically significant independent variables: The model has four statistically
significant measures. Three of them are coupling measures (RFC, CBO, and
DAC), and one is a cohesion measure (Coh), which had a much higher
contribution to the model than the coupling measures. As in the RC multivariate
model, no statistically significant independent variable in the FRC model is a size
measure, and the model is free of multicollinearity problems.
Direction of independent variable impact: As in the corresponding univariate
models, the coupling measures positively affect FRC in the multivariate model.
However, the cohesion measure features different impact directions than the
univariate models, which is likely due to the association between the Coh and
coupling measures in the multivariate model.
Goodness-of-fit: The MSE value (0.076) is smaller than most values obtained in
the univariate models. There is a sharp increase in the R2 value (0.209), which
implies that this model has a much better goodness-of-fit than the univariate
models.
Classification performance: The multivariate model has a better precision value
than any univariate model and better recall, inverse precision, and inverse recall
values than most univariate models. The model also has an AUC value of 0.836
(excellent), which is higher than any obtained using univariate FRC models.
Validation: The validation results show that the FRC prediction model
constructed using the Illusion class data set has lower precision, recall, and
inverse precision values, and it has a higher inverse recall value when applied to
FreeMind classes than to its original data set. However, the prediction model was
found to have better precision, inverse precision, and inverse recall values for
JabRef classes than for the original data set. The FMeasure values show that the
constructed model has a relatively high classification performance when applied
to data sets other than the data set from which it was constructed. This
observation suggests that the constructed model is potentially useful in practice.
7.3. Multivariate analysis of the CRC
Tables 15 and 16 report the results of the statistically significant multivariate model using
CRC as the discretized maintenance variable. The model included two measures.
Table 15: CRC-based forward multivariate regression model.
Metric Coefficient p-value VIF
Intercept -3.991 < 0.0001 -
LOC 0.002 0.001 1.619
DAC 0.072 0.014 1.619
Model: MSE = 0.047, R2 = 0.212, AUC = 0.871

Table 16: Multivariate analysis of the CRC: classification performance and validation
System P R IP IR FMeasure
Illusion 0.200 0.667 0.977 0.842 0.880
FreeMind 0.300 0.103 0.926 0.979 0.892
JabRef 0.276 0.296 0.931 0.925 0.871
The results in Tables 15 and 16 lead to the following observations.
Statistically significant independent variables: The model has two statistically
significant measures. One is a size measure, LOC, and the other is a coupling
measure, DAC, with the LOC measure having a higher contribution to the model
than the DAC measure. No statistically significant independent variable in the
CRC model is a cohesion measure. The model is free of multicollinearity
problems.
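The multicollinearity check is typically based on the variance inflation factor (VIF). For a model with exactly two predictors, such as LOC and DAC here, both predictors share the same VIF, 1/(1 - r²), where r is the Pearson correlation between them, which is consistent with the identical VIF values in Table 15. A minimal sketch with illustrative data (the function name and values are ours):

```python
def vif_two_predictors(x1, x2):
    """VIF for each predictor in a two-predictor regression model.

    With two predictors, regressing either one on the other yields
    R^2 = r^2 (the squared Pearson correlation), so both predictors
    share the same VIF = 1 / (1 - r^2); values near 1 indicate
    little collinearity.
    """
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    v1 = sum((a - m1) ** 2 for a in x1)
    v2 = sum((b - m2) ** 2 for b in x2)
    r_squared = cov * cov / (v1 * v2)
    return 1.0 / (1.0 - r_squared)

# Illustrative LOC-like and DAC-like values (not real system data)
vif = vif_two_predictors([120, 80, 300, 45, 210], [3, 1, 6, 0, 4])
```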
Direction of independent variable impact: As in the corresponding univariate
models, LOC and DAC positively affect the CRC in the multivariate model.
Goodness-of-fit: The MSE value (0.047) is smaller than most values obtained in
the univariate models. The R2 value (0.212) is much higher than those obtained in
the univariate models, which implies a much better goodness-of-fit.
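Given the coefficients in Table 15, the fitted CRC model is a logistic function of LOC and DAC. A sketch of evaluating it (coefficients taken from Table 15; the function name and inputs are ours):

```python
import math

def crc_probability(loc, dac):
    """Predicted probability that a class falls in the high-CRC category,
    using the Table 15 coefficients: intercept -3.991, LOC 0.002, DAC 0.072."""
    logit = -3.991 + 0.002 * loc + 0.072 * dac
    return 1.0 / (1.0 + math.exp(-logit))

# Both coefficients are positive, so larger or more coupled classes
# receive higher predicted probabilities.
p_small = crc_probability(loc=100, dac=2)
p_large = crc_probability(loc=1000, dac=10)
```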
Classification performance: The multivariate model has better precision, recall,
inverse precision, and inverse recall values than most univariate models. The
model also has an AUC value of 0.871 (excellent), which is higher than any value
obtained using the univariate CRC models, with the exception of the MPC model.
Validation: As expected, the precision, recall, inverse precision, and
inverse recall values are sometimes higher when the CRC model is applied to the
original data set than to other systems. However, the FMeasure values indicate
that the classification performance of the model, when applied to systems other
than the system used to construct the model, is potentially high and can be trusted.
7.4. Multivariate analyses: discussion
The multivariate regression analysis shows that it is possible to construct statistically
significant multivariate models with better goodness-of-fit values and classification
performances than those obtained using univariate analysis. The constructed multivariate
models are more general than the univariate models because the former models consider
different quality aspects related to the problem of interest, as well as because a single
measure cannot capture all such aspects. For the Illusion classes, the RC and FRC
prediction models each included cohesion and coupling measures, and the CRC model
included size and coupling measures. The RC prediction models constructed using
FreeMind and JabRef classes included size, cohesion, and coupling measures, and the
corresponding FRC and CRC prediction models included cohesion and coupling measures. This
observation shows that the considered quality attributes are complementary for predicting
the considered maintainability variables. Most excluded measures were found to be
statistically significant predictors when used alone to predict the maintenance variables in
the univariate analysis. However, these measures were excluded because of the
associations between them and the measures retained in the models. Although every model
includes a coupling measure, whereas size and cohesion measures are not always included,
the coupling measures contributed much less than the other measures retained in the models.
Practically, this observation implies that software developers must pay attention to three
quality attributes, namely, size, coupling, and cohesion, to improve different class
maintainability aspects when developing OO classes.
Typically, multivariate models are general in their applicability. This expectation is
confirmed by the validation results, which demonstrated that the constructed models had
reasonable classification performances for the classes of systems other than those used to
construct the models.
8. Threats to validity
The empirical study presented here is subject to several threats to internal and external
validity related to the systems to which the study was applied and to the selected
measures. These threats may restrict the generality and limit the interpretation of our results.
8.1. Internal validity
The collected maintenance data greatly depend on the considered ages of the systems.
The chances for a class to be revised are expected to increase with time because systems
typically evolve and more faults are detected over time. However, one of our criteria for
selecting systems is that the systems must have been actively revised over a reasonable
maintenance history. We selected systems of different ages and found that their results
led to the same general conclusions, which gave us confidence that the collected
maintenance data were reliable. The maintenance data, which were available online for
the three selected systems, were collected from the CVS systems under the assumption
that all revisions performed during the maintenance history were reported in the CVS
systems. Although this empirical study does not consider many existing size, cohesion,
and coupling measures, the selected ones cover many measurement approaches and
quality aspects. Although different lines of code can have different maintenance costs and
efforts, we considered the modified lines of code equally because it is difficult to measure
the exact maintenance cost and effort for each modified line of code. It is important to
note that, in this paper, we did not use the number of revised lines of code and number of
revisions to measure the maintenance cost and effort but to estimate them. We followed
the convention that an addition or a deletion of a line of code is considered a single line
change, and a change in line content is considered both a deletion and an addition.
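Under this convention, the revised-LOC count for a revision can be tallied as in the following sketch (the function and its inputs are ours, for illustration):

```python
def count_revised_loc(added, deleted, modified):
    """Revised LOC under the paper's counting convention:
    an added or deleted line is one change; a line whose content
    changed counts as both a deletion and an addition (two changes)."""
    return added + deleted + 2 * modified

# e.g., a revision adding 3 lines, deleting 2, and editing 4 in place
revised = count_revised_loc(added=3, deleted=2, modified=4)
```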
8.2. External validity
The first external threat to validity is that all considered systems are implemented in Java.
Other object-oriented programming languages, e.g., C++, have features that differ from
those in Java, such as allowing multiple inheritance and destructor declarations. No
considered measure accounts for inherited attributes and methods; therefore, the
inheritance issue is not expected to alter the results and conclusions drawn in this paper.
However, the inclusion of destructors can affect the class size, cohesion, and coupling
values, which can thus also affect the ability of measures to predict class maintainability.
However, the effect of including or excluding special methods (i.e., constructors,
destructors, and access methods) on the quality measurement is outside the scope of this
paper. In this paper, we instead investigated the impact of the considered existing
measures, as originally defined, on class maintainability.
The second external threat to validity is that all three considered systems are open-source
systems, which may not be representative of all industrial domains. However, the use of
open-source systems in empirical studies is a common practice in the research
community (Mockus et al. 2002, Lavazza et al. 2012). Although differences in design
quality and reliability between open-source and industrial systems have been investigated
(e.g., Samoladas et al. 2003, Samoladas et al. 2008, Spinellis et al. 2009), there is no clear
and general result on which we can rely. To compare maintainability of the selected
systems with those of other systems, including proprietary and open-source systems, one
can use the Software Improvement Group (SIG) system, which includes a repository of
hundreds of systems (Baggen et al. 2012). The SIG system compares a system of interest
with those in the repository in terms of certain aspects of maintainability and ranks the
system accordingly. However, applying the SIG system may require an interaction
with the developers of the system of interest, which might be infeasible for the selected
open-source systems.
The third external threat to validity is that the selected systems may not be representative
in terms of their class numbers and sizes. However, the selected systems are not artificial
examples. The number of considered systems and their sizes in terms of LOC and the
number of classes are also comparable with those considered in similar empirical studies
(Briand et al., 2002, Counsell et al., 2006, Marcus et al., 2008). In our empirical analyses,
we paid considerable attention to factors related to the significance of the collected data
and results, as discussed in Sections 5, 6, and 7. The small p-values obtained for most
considered measures indicate that, for the considered classes, there is sufficient
evidence to support the obtained conclusions.
The fourth external threat to validity is that we applied our own tool, rather than an
existing one, to analyze the Java classes and obtain the measure values. We did not find an
existing tool that automates all of the measures considered in this paper. Therefore, using
an existing tool requires reverse engineering the tool and extending it to consider the
additional measures, which might require more development time than developing a tool
from scratch. However, we applied other existing tools such as CKJM (CKJM 2013) and
compared the results of the common measures with ours. We found most of the
corresponding values identical, which gave us confidence about the results of our tool.
To generalize our results, different systems written in different programming languages,
selected from different domains, and including both real-life and large-scale systems
should be considered in similar large-scale evaluations.
9. Conclusions and future work
This paper investigates the relationships between three key class internal quality
attributes, i.e., size, coupling, and cohesion, and class maintainability, which is an
external quality attribute of interest to software practitioners. We empirically explored the
abilities of 19 selected size, coupling, and cohesion quality measures, considered both
individually and in combination, to predict two class maintainability aspects, namely, the
number of revisions performed on the class and the number of revised LOC during the
maintenance phase. The empirical study involved classes from three open-source Java
systems. We applied univariate logistic regression analysis to explore both the abilities of
the individual measures to predict class maintainability and the relationships between the
individual measures and considered maintainability aspects. We also used multivariate
logistic regression analysis to study the abilities of measure combinations to predict class
maintainability and construct the corresponding practical models.
The univariate regression analysis results showed that most considered measures are
significant predictors of both considered maintainability aspects, with prediction
abilities ranging from poor to excellent. The results generally
provided empirical evidence that, with regard to both considered class maintainability
aspects, the size and coupling quality attributes have positive impacts, and the cohesion
quality attribute has a negative impact.
From a practical perspective, the results indicate that developers can enhance the
maintainability (i.e., reduce the maintenance efforts and cost) of their developed classes
by reducing the classes' sizes and coupling and increasing their cohesion. When applying
the constructed multivariate models, developers must pay more attention to increasing
cohesion and decreasing size than decreasing coupling because, as discussed in Section 7,
size and cohesion have higher contributions to the constructed models in terms of their
considered maintainability aspects than coupling.
The multivariate regression analysis results showed that the combination of measures that
measure different quality attributes resulted in stable, optimized prediction models. The
constructed multivariate models were better than most univariate models in their abilities
to predict both class maintainability aspects. The reported results indicate that, in practice,
applying the constructed multivariate models is more appealing than applying the
univariate models.
In practice, the constructed prediction models can be automated and integrated into OO
programming editors to estimate class maintainability once the system is developed. That
is, the modules in our tool that obtain the values of the measures included in the
multivariate models can be integrated with a Java editor. For each class
in a newly developed system, the modified editor can obtain the values of the measures
and the corresponding probability that the class will require costly and frequent revisions
by applying the equations of the multivariate models. In this case, developers can revise
the code of the classes with low maintainability. Software engineers can also spend more
time testing classes with low maintainability to reduce the chances of detecting faults in
these classes during the maintenance phase. Finally, software developers are encouraged
to document classes with low maintainability well to reduce the time required during the
maintenance phase to understand the code and perform the required revisions.
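As a sketch of this workflow, an editor plug-in could score each newly developed class with the multivariate CRC model (coefficients from Table 15) and flag those whose predicted probability of costly revisions exceeds a chosen cutoff; the 0.5 threshold and class data below are illustrative, not values from the paper:

```python
import math

def flag_low_maintainability(classes, threshold=0.5):
    """Return (name, probability) pairs for classes predicted to need
    costly revisions, using the Table 15 CRC coefficients.
    `classes` is an iterable of (name, loc, dac) tuples; the 0.5
    default cutoff is a hypothetical choice, not from the paper."""
    flagged = []
    for name, loc, dac in classes:
        logit = -3.991 + 0.002 * loc + 0.072 * dac
        probability = 1.0 / (1.0 + math.exp(-logit))
        if probability >= threshold:
            flagged.append((name, probability))
    return flagged

# Hypothetical classes: one very large and highly coupled, one small
flagged = flag_low_maintainability([("GodClass", 2000, 10), ("Helper", 100, 2)])
```

Flagged classes would then be candidates for extra testing, documentation, or refactoring before release.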
This empirical study can be extended by considering other direct maintenance aspects
such as the actual maintenance time and cost. More and larger industrial systems can be
used in a similar empirical study to validate or invalidate the obtained results. In previous
empirical studies, we explored the impact of including or excluding special methods (e.g.,
constructors and access methods) (Al Dallal 2012a) and transitive relationships caused by
method invocations (Al Dallal 2013) in cohesion measurement on the abilities of
cohesion measures to predict faulty classes. In future work, we plan to perform a similar
empirical study to investigate the impact of the same factors on the abilities of the
cohesion measures to predict class maintainability.
Acknowledgments
The author would like to acknowledge the support of this work by Kuwait University
Research Grant WI03/11. In addition, the author would like to thank Anas Abdin for
assisting in collecting the required data.
References
K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, Application of artificial neural
network for predicting maintainability using object-oriented metrics, Proceedings of
World Academy of Science, Engineering and Technology, 2006, Vol. 15, pp. 285-289.
K. Aggarwal, Y. Singh, A. Kaur, and R. Malhotra, Investigation effect of design metrics
on fault proneness in object-oriented systems, Journal of Object Technology, 2007, 6(10),
pp. 127-141.
Y. Ahn, J. Suh, S. Kim, and H. Kim, The software maintenance project effort estimation
model based on function points, Journal of Software Maintenance Evolution: Research
and Practice, 2003, Vol. 15, pp. 71-85.
J. Al Dallal, Software similarity-based functional cohesion metric, IET Software, 2009,
3(1), pp. 46-57.
J. Al Dallal, Mathematical validation of object-oriented class cohesion metrics,
International Journal of Computers, 4(2), 2010, pp. 45-52.
J. Al Dallal, Improving the applicability of object-oriented class cohesion metrics,
Information and Software Technology, 2011a, 53(9), pp. 914-928.
J. Al Dallal, Measuring the discriminative power of object-oriented class cohesion
metrics, IEEE Transactions on Software Engineering, 2011b, 37(6), pp. 788-804.
J. Al Dallal, Incorporating transitive relations in low-level design-based class
cohesion measurement, Software—Practice & Experience, 2013, 43(6), pp. 685-704.
J. Al Dallal, The impact of accounting for special methods in the measurement of object-
oriented class cohesion on refactoring and fault prediction activities, Journal of Systems
and Software, 2012a, 85(5), pp. 1042-1057.
J. Al Dallal, Fault prediction and the discriminative powers of connectivity-based object-
oriented class cohesion metrics, Information and Software Technology, 2012b, 54(4), pp.
396-416.
J. Al Dallal, Constructing models for predicting extract subclass refactoring opportunities
using object-oriented quality metrics, Information and Software Technology, 2012c,
54(10), pp. 1125-1141.
J. Al Dallal and L. Briand, An object-oriented high-level design-based class cohesion
metric, Information and Software Technology, 2010, 52(12), pp. 1346-1361.
J. Al Dallal and L. Briand, A Precise method-method interaction-based cohesion metric
for object-oriented classes, ACM Transactions on Software Engineering and
Methodology (TOSEM), 2012, 21(2), pp. 8:1-8:34.
J. Al Dallal and S. Morasca, Predicting object-oriented class reuse-proneness using
internal quality attributes, Empirical Software Engineering, in press, 2012.
E. Arisholm, L. Briand, and E. Johannessen. A systematic and comprehensive
investigation of methods to build and evaluate fault prediction models, Journal of
Systems and Software, 2010, 83(1), pp. 2-17.
L. Badri and M. Badri, A Proposal of a new class cohesion criterion: an empirical study,
Journal of Object Technology, 3(4), 2004, pp. 145-159.
L. Badri, M. Badri, and A. Gueye, Revisiting class cohesion: an empirical investigation
on several systems, Journal of Object Technology, 2008, 7(6), pp. 55-75.
R. Baggen, J. Correia, K. Schill, and J. Visser, Standardized code quality benchmarking
for improving software maintainability, Software Quality Journal, 2012, 20(2), pp. 287-
307.
J. Bansiya, L. Etzkorn, C. Davis, and W. Li, A class cohesion metric for object-oriented
designs, Journal of Object-Oriented Program, 11(8), 1999, pp. 47-52.
V. Barnett and T. Lewis, Outliers in Statistical Data, John Wiley and Sons, 3rd edition,
1994, 584 pp.
H. Benestad, B. Anda, and E. Arisholm, Assessing software product maintainability
based on class-level structured measures, 7th International Conference on Product-
Focused Software Process Improvement (PROFES), 2006, pp. 94-111.
J. Bieman and B. Kang, Cohesion and reuse in an object-oriented system, Proceedings of
the 1995 Symposium on Software reusability, Seattle, Washington, United States, 1995,
pp. 259-262.
J. Bieman and B. Kang, Measuring design-level cohesion, IEEE Transactions on
Software Engineering, 24(2), 1998, pp. 111-124.
C. Bonja and E. Kidanmariam, Metrics for class cohesion and similarity between
methods, Proceedings of the 44th Annual ACM Southeast Regional Conference,
Melbourne, Florida, 2006, pp. 91-95.
L. C. Briand, C. Bunse, J. W. Daly, and C. Differding, An experimental comparison of
the maintainability of object-oriented and structured design documents, Empirical
Software Engineering, 1997b, Vol 2, pp. 291-312.
L. C. Briand, J. Daly, and J. Wust, A unified framework for cohesion measurement in
object-oriented systems, Empirical Software Engineering - An International Journal,
3(1), 1998, pp. 65-117.
L. C. Briand, J. Daly, and J. Wust, A unified framework for coupling measurement in
object-oriented systems, IEEE Transactions on Software Engineering, 25(1), 1999a, pp.
91-121.
L. Briand , P. Devanbu, and W. Melo, An investigation into coupling measures for C++,
Proceedings of the 19th International Conference on Software Engineering, Boston,
Massachusetts, United States, 1997, p.412-421.
L. Briand, S. Morasca, V. Basili, Property-based software engineering measurement,
IEEE Transactions on Software Engineering, 1996, 22(1), pp. 68–86.
L. C. Briand , S. Morasca , and V. R. Basili, Defining and validating measures for object-
based high-level design, IEEE Transactions on Software Engineering, 25(5), 1999b, pp.
722-743.
L. C. Briand , S. Morasca , and V. R. Basili, Measuring and assessing maintainability at
the end of high level design, IEEE Conference on Software Maintenance, Montreal,
Canada, 1993, pp. 88-97.
L. Briand and J. Wust, Empirical studies of quality models in object-oriented systems,
Advances in Computers, Academic Press, 2002, pp. 97-166.
L. C. Briand, J. Wüst, and H. Lounis, Replicated Case Studies for Investigating Quality
Factors in Object-Oriented Designs, Empirical Software Engineering, 6(1), 2001, pp. 11-
58.
CKJM — Chidamber and Kemerer Java Metrics, http://www.spinellis.gr/sw/ckjm/,
accessed in January 2013.
H. S. Chae, T. Y. Kim, W. Jung, and J. Lee, Using metrics for estimating maintainability
of web applications: an empirical study, 6th IEEE/ACIS International Conference on
Computer and Information Science, 2007.
H. S. Chae, Y. R. Kwon, and D. Bae, Improving cohesion metrics for classes by
considering dependent instance variables, IEEE Transactions on Software Engineering,
30(11), 2004, pp. 826-832.
M. A. Chaumun, H. Kabaili, R. K. Keller, and F. Lustman, A change impact model for
changeability assessment in object-oriented software systems, Science of Computer
Programming, 2002, 45(2), pp. 155-174.
Z. Chen, Y. Zhou, and B. Xu, A novel approach to measuring class cohesion based on
dependence analysis, Proceedings of the International Conference on Software
Maintenance, 2002, pp. 377-384.
S.R. Chidamber and C.F. Kemerer, Towards a Metrics Suite for Object-Oriented Design,
Object-Oriented Programming Systems, Languages and Applications (OOPSLA), Special
Issue of SIGPLAN Notices, 26(10), 1991, pp. 197-211.
S.R. Chidamber and C.F. Kemerer, A Metrics suite for object Oriented Design, IEEE
Transactions on Software Engineering, 20(6), 1994, pp. 476-493.
S. Counsell, S. Swift, and J. Crampton, The interpretation and utility of three cohesion
metrics for object-oriented design, ACM Transactions on Software Engineering and
Methodology (TOSEM), 15(2), 2006, pp.123-149.
M. Dagpinar and J. H. Jahnke, Predicting maintainability with object-oriented metrics –
an empirical comparison, Proceedings of the 10th Working Conference on Reverse
Engineering, 2003.
M. Elish and K. Elish, Application of treenet in predicting object-oriented software
maintainability: a comparative study, 13th European Conference on Software Maintenance
and Reengineering (CSMR '09), 2009, pp. 69-78.
K. Erdil, E. Finn, K. Keating, J. Meattle, S. Park, and D. Yoon, Software maintenance as
part of the software life cycle, Comp180: Software Engineering Project, Department of
Computer Science, Tufts University, 2003.
L. Etzkorn, S. Gholston, J. Fortune, C. Stein, D. Utley, P. Farrington, and G. Cox, A
comparison of cohesion metrics for object-oriented systems, Information and Software
Technology, 46(10), 2004, pp. 677-687.
N. Fenton and S. Pfleeger, Software Metrics: A Rigorous & Practical Approach, ITP, 2nd
edition, 1997.
L. Fernández, and R. Peña, A sensitive metric of class cohesion, International Journal of
Information Theories and Applications, 13(1), 2006, pp. 82-91.
FreeMind, http://freemind.sourceforge.net/, accessed March 2012.
M. Genero, M. Piattini, and C. Calero, A survey of metrics for UML class diagrams,
Journal of Object Technology, 2005, 4(9), pp. 59-92.
J. Granja-Alvarez and M. J. Barranco-Garcia, A method for estimating maintenance cost
in a software project: a case study, Journal of Software Maintenance: Research and
Practice, 9(3), 1997, pp. 161-175.
G. Gui and P. D. Scott, Measuring software component reusability by coupling and
cohesion metrics, Journal of Computers, 4(9), 2009, pp. 797-805.
T. Gyimothy, R. Ferenc, and I. Siket, Empirical validation of object-oriented metrics on
open source software for fault prediction, IEEE Transactions on Software Engineering,
3(10), 2005, pp. 897-910.
J. Hayes, S. C. Patel, and L. Zhao, A metrics-based software maintenance effort model,
In Proceedings of the 8th European Conference on Software Maintenance and
Reengineering, Tampere, Finland, 2004, pp. 254-258.
D. Hosmer and S. Lemeshow, Applied Logistic Regression, John Wiley and Sons, 2000.
IEEE, IEEE standard glossary of software engineering terminology, IEEE Std 610.12-
1990, Institute of Electrical and Electronics Engineering, 1990.
Illusion, http://sourceforge.net/projects/aoi/, accessed March 2012.
JabRef, http://sourceforge.net/projects/jabref/, accessed March 2012.
H. Kabaili, R. Keller, and F. Lustman, Class cohesion as predictor of changeability: an
empirical study, L'Objet, Hermes Science Publications, 2001, 7(4), pp. 515-534.
D. Krantz, R. Luce, P. Suppes, A. Tversky, Foundations of Measurement, Vol. 1,
Academic Press, San Diego, 1971.
L. Lavazza, S. Morasca, D. Taibi, and D. Tosi, An empirical investigation of perceived
reliability of open source Java programs, Proceedings of the 27th Symposium On Applied
Computing, SAC '12, 2012.
Y. Lee and K. Chang, Reusability and maintainability metrics for object-oriented
software, Proceedings of the 38th annual on Southeast regional conference, USA, 2000.
Y. Lee, B. Liang, S. Wu, and F. Wang, Measuring the coupling and cohesion of an
object-oriented program based on information flow, In Proceedings of International
Conference on Software Quality, Maribor, Slovenia, 1995, pp. 81-90.
W. Li and S.M. Henry, Object-oriented metrics that predict maintainability, Journal of
Systems and Software, 1993, 23(2), pp. 111-122.
W. Li-jin, H. Xin-xin, N. Zheng-yuan, K. Wen-hua, Predicting object-oriented software
maintainability using projection pursuit regression, 1st International Conference on
Information Science and Engineering (ICISE), 2009, pp. 3827-3830.
S. Mamone, The IEEE standard for software maintenance, SIGSOFT SE Notes, 1994,
19(1), pp. 75-76.
A. Marcus, D. Poshyvanyk, and R. Ferenc, Using the conceptual cohesion of classes for
fault prediction in object-oriented systems, IEEE Transactions on Software Engineering,
34(2), 2008, pp. 287-300.
T. Meyers and D. Binkley, An empirical study of slice-based cohesion and coupling
metrics, ACM Transactions on Software Engineering Methodology, 17(1), 2007, pp. 2-
27.
A. Mockus, R. Fielding, and J. Herbsleb, Two case studies of open source software
development: Apache and Mozilla, ACM Trans. Softw. Eng. Methodol., 2002, 11(3), pp.
309-346.
S. Morasca, On the definition and use of aggregate indices for nominal, ordinal, and other
scales, 10th IEEE International Software Metrics Symposium (METRICS 2004), Chicago,
IL, USA, 2004, pp. 46-57.
S. Morasca, Refining the axiomatic definition of internal software attributes, Proceedings
of the 2nd ACM-IEEE International Symposium on Empirical Software Engineering and
Measurement, Kaiserslautern, Germany, 2008, pp. 188–197.
S. Morasca, A probability-based approach for measuring external attributes of software
artifacts, Proceedings of the 3rd International Symposium on Empirical Software
Engineering and Measurement, 2009, USA, pp. 44-55.
R. O'Brien, A caution regarding rules of thumb for variance inflation factors, Quality and
Quantity, Vol. 41, No. 5, 2007, pp. 673-690.
H. Olague, L. Etzkorn, S. Gholston, and S. Quattlebaum, Empirical validation of three
software metrics suites to predict fault-proneness of object-oriented classes developed
using highly iterative or agile software development processes, IEEE Transactions on
Software Engineering, 2007, 33(6), pp. 402-419.
D. Olson and D. Delen, Advanced Data Mining Techniques, Springer, 1st edition, 2008.
M. Perepletchikov, C. Ryan, and K. Frampton, Cohesion metrics for predicting
maintainability of service-oriented software, IEEE Seventh International Conference on
Quality Software, 2007.
D. Powers, Evaluation: from precision, recall and F-factor to ROC, School of Informatics
and Engineering, Flinders University, Technical report SIE-07-001, 2007.
QMT: Quality Measuring Tool, http://www.cfw.kuniv.edu/drjehad/research.htm,
accessed May 2013.
S. Rizvi and R. Khan, Maintainability estimation model for object-oriented software in
design phase (MEMOOD), Journal of Computing, 2010, 2(4), pp. 26-32.
F. Roberts, Measurement theory with applications to decisionmaking, utility, and the
social sciences, Encyclopedia of Mathematics and its Applications, Vol. 7, Addison-
Wesley, 1979.
P. Rousseeuw, I. Ruts, and J. Tukey, The bagplot: a bivariate boxplot, The American
Statistician, Vol. 53, No. 4, 1999, pp. 382–387.
R. Shatnawi, W. Li, J. Swain, and T. Newman, Finding software metrics threshold values
using ROC curves, Journal of Software Maintenance and Evolution: Research and
Practice, 2010, Vol. 22, No. 1, pp. 1-16.
I. Samoladas, S. Bibi, I. Stamelos, and G.L. Bleris, Exploring the quality of free/open-
source software: a case study on an ERP/CRM system, 9th Panhellenic Conference in
Informatics, Thessaloniki, Greece, 2003.
I. Samoladas, G. Gousios, D. Spinellis, and I. Stamelos, The SQO-OSS quality model:
measurement based open-source software evaluation, Open Source Development,
Communities and Quality, 275, 2008, pp. 237-248.
S. Sarkar, G. Rama, A. Kak, API-based information-theoretic metrics for measuring the
quality of software modularization, IEEE Transactions on Software Engineering, 2007,
33(1), pp. 14-32.
F. Sheldon, K. Jerath, and H. Chung, Metrics for maintainability of class inheritance
hierarchies, Journal of Software Maintenance and Evolution: research and Practice,
2002, Vol. 14, pp. 1-14.
D. Spinellis, G. Gousios, V. Karakoidas, P. Louridas, P. J. Adams, I. Samoladas, and I.
Stamelos, Evaluating the quality of open source software, Electronic Notes in Theoretical
Computer Science, 233, 2009, pp. 5-28.
R. Subramanyam and M. S. Krishnan, Empirical analysis of CK metrics for object-
oriented design complexity: implications for software defects, IEEE Transactions on
Software Engineering, 2003, 29(4), pp. 297-310.
J. Wang, Y. Zhou, L. Wen, Y. Chen, H. Lu, and B. Xu, DMC: a more precise cohesion
measure for classes. Information and Software Technology, 47(3), 2005, pp. 167-180.
E. Weyuker, Evaluating software complexity measures, IEEE Transactions on Software
Engineering, 1988, 14(9), pp. 1357–1365.
F. Xia and P. Srikanth, A change impact dependency measure for predicting the
maintainability of source code, IEEE Proceedings of the 28th Annual International
Computer Software and Applications Conference, 2004.
X. Yang, Research on Class Cohesion Measures, M.S. Thesis, Department of Computer
Science and Engineering, Southeast University, 2002.
Y. Zhou and H. Leung, Predicting object-oriented software maintainability using
multivariate adaptive regression splines, Journal of Systems and Software, 2007, 80(8),
pp. 1349-1361.
Appendix A: Descriptive statistics using FreeMind and JabRef classes
Table A.1: Descriptive statistics of the 19 considered independent measures for FreeMind
classes
Quality attribute  Metric    Min  Max  25%    Med    75%    Mean   Std. Dev.
Size               NOM       1    126  4.00   8.00   10.00  8.68   9.03
                   NOA       0    48   1.00   2.00   4.00   3.11   4.05
                   LOC       7    989  39.00  68.00  87.50  82.15  82.64
Cohesion           Coh       0    1    0.13   0.29   0.75   0.42   0.36
                   CAMC      0    1    0.37   0.46   0.54   0.47   0.19
                   TCC       0    1    0.00   0.17   0.98   0.39   0.41
                   LCC       0    1    0.00   0.30   1.00   0.44   0.43
                   LSCC      0    1    0.01   0.06   0.50   0.29   0.39
                   SCOM      0    1    0.02   0.10   0.83   0.33   0.41
                   PCCC      0    1    0.00   0.00   1.00   0.31   0.44
                   OL2       0    1    0.00   0.00   0.14   0.23   0.41
Coupling           CBO       0    90   2.00   4.00   5.00   4.79   7.88
                   CBO_IUB   0    58   1.00   2.00   2.00   2.55   5.69
                   CBO_U     0    58   1.00   2.00   3.00   2.25   3.57
                   RFC       0    185  7.00   11.00  24.00  17.77  18.56
                   MPC       0    536  10.00  17.00  37.00  32.10  48.57
                   DAC1      0    42   1.00   2.00   3.00   2.56   3.29
                   DAC2      0    19   1.00   2.00   3.00   2.05   1.79
                   OCMEC     0    13   2.00   2.00   4.00   3.01   1.89
Table A.2: Descriptive statistics of the 19 considered independent measures for JabRef
classes
Quality attribute  Metric    Min  Max  25%    Med    75%    Mean   Std. Dev.
Size               NOM       1    54   2.00   4.00   7.00   5.46   5.61
                   NOA       0    135  0.00   2.00   6.00   5.02   10.79
                   LOC       7    686  23.25  50.50  94.75  79.73  93.20
Cohesion           Coh       0    1    0.28   0.65   1.00   0.62   0.36
                   CAMC      0    1    0.00   0.00   1.00   0.41   0.48
                   TCC       0    1    0.00   0.49   1.00   0.50   0.44
                   LCC       0    1    0.00   0.62   1.00   0.54   0.45
                   LSCC      0    1    0.07   0.49   1.00   0.52   0.43
                   SCOM      0    1    0.19   0.60   1.00   0.59   0.40
                   PCCC      0    1    0.00   0.33   1.00   0.51   0.47
                   OL2       0    1    0.00   0.00   1.00   0.41   0.48
Coupling           CBO       0    298  1.00   4.00   7.00   7.91   22.17
                   CBO_IUB   0    287  0.00   1.00   2.00   4.45   20.28
                   CBO_U     0    49   1.00   2.00   5.00   3.45   4.62
                   RFC       0    117  5.00   10.00  24.00  17.05  18.70
                   MPC       0    583  6.00   16.00  48.00  40.42  62.56
                   DAC1      0    131  0.00   1.00   4.00   4.26   10.00
                   DAC2      0    22   0.00   1.00   3.00   2.41   3.33
                   OCMEC     0    15   1.00   2.00   3.00   2.42   1.88
Appendix B: Univariate regression analysis results for FreeMind and JabRef classes
Table B.1: Univariate results for RC using FreeMind classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -1.565 0.083 0.199 0.054 < 0.0001 0.374 0.582 0.760 0.577 0.634
NOA -1.341 0.160 0.199 0.046 < 0.0001 0.441 0.373 0.744 0.794 0.665
LOC -1.606 0.009 0.196 0.064 < 0.0001 0.466 0.436 0.762 0.783 0.654
Coh -0.464 -0.944 0.208 0.018 0.006 0.361 0.709 0.782 0.455 0.567
CAMC 0.491 -2.961 0.202 0.044 < 0.0001 0.376 0.618 0.769 0.553 0.646
TCC -1.001 0.415 0.212 0.005 0.129 0.373 0.509 0.746 0.628 0.579
LCC -1.168 0.725 0.209 0.017 0.007 0.384 0.555 0.760 0.613 0.602
LSCC -0.562 -1.063 0.206 0.025 0.002 0.345 0.782 0.789 0.356 0.550
SCOM -0.606 -0.738 0.209 0.014 0.015 0.346 0.755 0.780 0.379 0.537
PCCC -0.604 -0.828 0.207 0.020 0.004 0.343 0.764 0.780 0.364 0.569
OL2 -0.648 -0.930 0.207 0.021 0.004 0.337 0.836 0.800 0.285 0.586
CBO -1.158 0.068 0.203 0.031 0.004 0.394 0.391 0.736 0.739 0.610
CBO_IUB -1.006 0.066 0.205 0.022 0.007 0.604 0.291 0.748 0.917 0.585
CBO_U -1.211 0.170 0.206 0.026 0.005 0.373 0.373 0.727 0.727 0.594
RFC -1.837 0.054 0.183 0.116 < 0.0001 0.522 0.536 0.796 0.787 0.720
MPC -1.545 0.022 0.189 0.097 < 0.0001 0.523 0.518 0.791 0.794 0.716
DAC -1.356 0.200 0.200 0.046 <0.001 0.458 0.491 0.771 0.747 0.667
DAC2 -1.755 0.426 0.192 0.077 < 0.0001 0.472 0.464 0.769 0.775 0.672
OCMEC -2.097 0.397 0.191 0.090 < 0.0001 0.536 0.473 0.782 0.822 0.698
Table B.2: Univariate results for RC using JabRef classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -0.955 0.081 0.225 0.031 0.001 0.540 0.470 0.704 0.759 0.666
NOA -1.069 0.126 0.204 0.097 < 0.0001 0.678 0.530 0.750 0.848 0.743
LOC -1.223 0.009 0.207 0.089 < 0.0001 0.674 0.539 0.752 0.843 0.746
Coh 0.532 -1.753 0.217 0.065 < 0.0001 0.538 0.670 0.767 0.654 0.671
CAMC 0.286 -1.475 0.229 0.028 0.001 0.503 0.670 0.752 0.602 0.640
TCC -0.082 -0.885 0.228 0.026 0.001 0.450 0.591 0.697 0.565 0.600
LCC -0.142 -0.697 0.231 0.017 0.009 0.437 0.539 0.677 0.581 0.578
LSCC 0.373 -1.838 0.206 0.099 < 0.0001 0.533 0.704 0.779 0.628 0.702
SCOM 0.487 -1.778 0.210 0.085 < 0.0001 0.546 0.670 0.770 0.665 0.692
PCCC 0.311 -1.789 0.203 0.112 < 0.0001 0.538 0.730 0.793 0.623 0.692
OL2 0.160 -1.905 0.201 0.120 < 0.0001 0.532 0.800 0.827 0.576 0.681
CBO -1.494 0.178 0.184 0.158 < 0.0001 0.660 0.591 0.768 0.817 0.805
CBO_IUB -0.741 0.085 0.221 0.052 0.005 0.577 0.357 0.685 0.843 0.657
CBO_U -1.680 0.354 0.180 0.193 < 0.0001 0.648 0.609 0.773 0.801 0.800
RFC -1.485 0.057 0.194 0.142 < 0.0001 0.624 0.548 0.746 0.801 0.765
MPC -1.084 0.015 0.204 0.091 < 0.0001 0.677 0.565 0.762 0.838 0.766
DAC1 -1.050 0.146 0.203 0.098 < 0.0001 0.659 0.522 0.744 0.838 0.744
DAC2 -1.114 0.251 0.205 0.096 < 0.0001 0.606 0.522 0.734 0.796 0.731
OCMEC -1.826 0.528 0.198 0.130 < 0.0001 0.571 0.626 0.761 0.717 0.734
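The univariate models in Tables B.1 through B.6 are binary logistic regressions: for a measure value x, the reported coefficients c0 and c1 give the predicted probability p = 1 / (1 + e^-(c0 + c1·x)) that a class is change-prone. A minimal sketch applying the RFC row of Table B.1 (c0 = -1.837, c1 = 0.054); the function name and the example RFC value of 50 are ours, not from the paper:

```python
import math

def revision_probability(c0, c1, x):
    """Univariate logistic model: predicted probability that a class is
    change-prone, given logit(p) = c0 + c1 * x."""
    return 1.0 / (1.0 + math.exp(-(c0 + c1 * x)))

# Coefficients for RFC from Table B.1 (RC indicator, FreeMind classes).
C0_RFC, C1_RFC = -1.837, 0.054

# A hypothetical class with RFC = 50 (example value, not from the paper).
p = revision_probability(C0_RFC, C1_RFC, 50)
print(round(p, 3))  # about 0.703, above the usual 0.5 classification cutoff
```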
Table B.3: Univariate results for FRC using FreeMind classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -3.292 0.113 0.085 0.135 < 0.0001 0.277 0.462 0.930 0.855 0.747
NOA -2.486 0.101 0.095 0.038 0.006 0.215 0.513 0.930 0.775 0.739
LOC -3.514 0.014 0.081 0.180 < 0.0001 0.296 0.538 0.938 0.846 0.806
Coh -1.347 -2.497 0.094 0.070 0.001 0.143 0.692 0.931 0.500 0.651
CAMC -0.306 -4.364 0.092 0.073 < 0.0001 0.159 0.641 0.932 0.593 0.702
TCC -2.096 -0.054 0.096 0.000 0.896 NA 0.000 0.893 1.000 0.482
LCC -2.400 0.590 0.096 0.009 0.133 0.128 0.487 0.907 0.602 0.578
LSCC -1.607 -3.425 0.093 0.084 0.003 0.157 0.872 0.966 0.438 0.624
SCOM -1.639 -2.209 0.094 0.061 0.002 0.140 0.795 0.944 0.414 0.604
PCCC -1.627 -4.090 0.092 0.110 0.006 0.157 0.923 0.978 0.404 0.659
OL2 -1.826 -2.814 0.094 0.064 0.011 0.133 0.923 0.967 0.275 0.620
CBO -2.416 0.051 0.094 0.042 0.004 0.400 0.410 0.929 0.926 0.737
CBO_IUB -2.368 0.074 0.093 0.047 0.001 0.410 0.410 0.929 0.929 0.680
CBO_U -2.266 0.059 0.097 0.012 0.105 0.155 0.436 0.913 0.713 0.645
RFC -3.488 0.059 0.079 0.190 < 0.0001 0.261 0.615 0.945 0.790 0.802
MPC -3.099 0.023 0.082 0.175 < 0.0001 0.319 0.590 0.945 0.849 0.815
DAC1 -2.580 0.153 0.095 0.050 0.003 0.271 0.487 0.932 0.843 0.742
DAC2 -3.181 0.427 0.091 0.099 < 0.0001 0.222 0.615 0.941 0.741 0.748
OCMEC -3.603 0.413 0.084 0.113 < 0.0001 0.206 0.513 0.929 0.762 0.685
Table B.4: Univariate results for FRC using JabRef classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -2.964 0.099 0.081 0.066 0.001 0.198 0.607 0.950 0.752 0.715
NOA -2.606 0.047 0.082 0.056 0.007 0.186 0.464 0.936 0.795 0.739
LOC -2.954 0.006 0.079 0.082 < 0.0001 0.194 0.500 0.940 0.791 0.758
Coh -1.251 -2.030 0.080 0.066 0.001 0.154 0.750 0.959 0.586 0.706
CAMC -1.489 -1.608 0.082 0.026 0.030 0.131 0.714 0.948 0.522 0.666
TCC -1.943 -0.801 0.083 0.016 0.091 0.126 0.679 0.942 0.525 0.581
LCC -1.977 -0.653 0.083 0.012 0.145 0.118 0.607 0.932 0.543 0.578
LSCC -1.578 -1.824 0.080 0.067 0.001 0.147 0.786 0.962 0.540 0.700
SCOM -1.491 -1.663 0.081 0.057 0.002 0.154 0.750 0.959 0.586 0.688
PCCC -1.801 -1.234 0.082 0.040 0.010 0.128 0.714 0.947 0.511 0.652
OL2 -1.883 -1.395 0.081 0.045 0.009 0.129 0.786 0.956 0.468 0.627
CBO -3.648 0.139 0.062 0.321 < 0.0001 0.400 0.786 0.976 0.881 0.883
CBO_IUB -2.911 0.111 0.067 0.226 < 0.0001 0.433 0.464 0.946 0.939 0.830
CBO_U -3.550 0.251 0.068 0.207 < 0.0001 0.241 0.714 0.964 0.773 0.812
RFC -3.254 0.041 0.077 0.123 < 0.0001 0.207 0.607 0.951 0.766 0.760
MPC -2.742 0.008 0.081 0.063 0.001 0.268 0.679 0.962 0.813 0.753
DAC1 -2.575 0.050 0.082 0.052 0.014 0.210 0.464 0.939 0.824 0.732
DAC2 -2.852 0.170 0.081 0.070 0.000 0.206 0.464 0.938 0.820 0.728
OCMEC -3.791 0.484 0.073 0.138 < 0.0001 0.224 0.536 0.946 0.813 0.741
Table B.5: Univariate results for CRC using FreeMind classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -3.635 0.109 0.065 0.147 < 0.0001 0.231 0.517 0.953 0.850 0.780
NOA -2.637 0.054 0.074 0.012 0.087 0.151 0.483 0.944 0.763 0.750
LOC -3.918 0.014 0.062 0.203 < 0.0001 0.262 0.586 0.960 0.856 0.854
Coh -1.761 -2.166 0.072 0.052 0.005 0.119 0.793 0.964 0.488 0.640
CAMC -0.551 -4.639 0.072 0.079 < 0.001 0.127 0.690 0.956 0.590 0.723
TCC -2.405 -0.100 0.074 0.000 0.833 0.075 0.552 0.913 0.410 0.502
LCC -2.781 0.688 0.073 0.012 0.126 0.117 0.621 0.947 0.593 0.599
LSCC -2.023 -2.448 0.072 0.055 0.013 0.109 0.897 0.976 0.362 0.613
SCOM -2.010 -1.944 0.072 0.047 0.011 0.108 0.828 0.964 0.404 0.598
PCCC -1.886 -9.451 0.070 0.128 0.109 0.130 0.966 0.993 0.440 0.700
OL2 -2.131 -5.115 0.072 0.078 0.143 0.104 0.966 0.989 0.275 0.701
CBO -2.706 0.043 0.078 0.036 0.007 0.375 0.517 0.957 0.925 0.777
CBO_IUB -2.693 0.069 0.076 0.047 0.002 0.359 0.483 0.954 0.925 0.744
CBO_U -2.546 0.041 0.078 0.006 0.223 0.109 0.414 0.933 0.707 0.646
RFC -3.810 0.056 0.063 0.195 < 0.0001 0.238 0.690 0.968 0.808 0.834
MPC -3.426 0.022 0.064 0.183 < 0.0001 0.284 0.655 0.966 0.856 0.841
DAC1 -2.658 0.073 0.074 0.015 0.065 0.161 0.655 0.959 0.704 0.742
DAC2 -3.090 0.267 0.072 0.048 0.003 0.167 0.621 0.957 0.731 0.744
OCMEC -3.933 0.405 0.066 0.111 < 0.0001 0.155 0.517 0.947 0.754 0.706
Table B.6: Univariate results for CRC using JabRef classes
Measure c0 c1 MSE R2 p-value P R IP IR AUC
NOM -2.950 0.092 0.079 0.058 0.002 0.186 0.593 0.950 0.749 0.726
NOA -2.457 0.020 0.083 0.012 0.108 0.214 0.444 0.940 0.842 0.713
LOC -2.950 0.006 0.078 0.073 <0.001 0.227 0.556 0.950 0.817 0.758
Coh -1.394 -1.793 0.079 0.052 0.003 0.141 0.704 0.953 0.584 0.683
CAMC -1.311 -2.107 0.080 0.043 0.006 0.160 0.778 0.966 0.606 0.703
TCC -1.987 -0.790 0.080 0.015 0.101 0.114 0.630 0.936 0.527 0.577
LCC -2.030 -0.623 0.080 0.010 0.171 0.101 0.519 0.923 0.556 0.572
LSCC -1.613 -1.850 0.078 0.068 0.002 0.140 0.778 0.962 0.538 0.687
SCOM -1.585 -1.528 0.079 0.048 0.005 0.147 0.741 0.959 0.584 0.670
PCCC -1.670 -1.921 0.077 0.083 0.001 0.143 0.815 0.967 0.527 0.694
OL2 -1.835 -1.954 0.078 0.074 0.003 0.129 0.815 0.963 0.470 0.624
CBO -3.067 0.072 0.064 0.190 <0.001 0.383 0.667 0.965 0.896 0.885
CBO_IUB -2.669 0.050 0.070 0.126 0.001 0.429 0.333 0.937 0.957 0.779
CBO_U -3.686 0.265 0.066 0.227 < 0.0001 0.279 0.630 0.959 0.842 0.830
RFC -3.152 0.036 0.077 0.095 < 0.0001 0.203 0.593 0.952 0.774 0.745
MPC -2.735 0.007 0.080 0.053 0.002 0.250 0.593 0.955 0.828 0.750
DAC1 -2.436 0.019 0.083 0.010 0.141 0.189 0.370 0.933 0.846 0.721
DAC2 -2.878 0.166 0.077 0.067 <0.001 0.190 0.444 0.938 0.817 0.714
OCMEC -3.629 0.426 0.076 0.112 < 0.0001 0.224 0.556 0.950 0.814 0.763
Appendix C: Multivariate regression analysis results for FreeMind and JabRef classes
Table C.1: RC-based forward multivariate regression model for FreeMind classes.
Metric Coefficient p-value VIF
Intercept -1.237 0.001 -
RFC 0.039 0.000 1.876
OCMEC 0.326 0.003 1.978
OL2 -1.163 0.002 1.536
NOM -0.109 0.008 1.667
CBO_IUB 0.085 0.005 1.023
Model fit: MSE = 0.171, R2 = 0.245, AUC = 0.824
Table C.2: RC-based forward multivariate regression model for JabRef classes.
Metric Coefficient p-value VIF
Intercept -3.035 < 0.0001 -
RFC 0.064 < 0.0001 1.009
SCOM -2.063 0.003 1.009
Model fit: MSE = 0.074, R2 = 0.253, AUC = 0.807
Table C.3: FRC-based forward multivariate regression model for FreeMind classes.
Metric Coefficient p-value VIF
Intercept -4.419 < 0.0001 -
CBO 0.123 < 0.0001 1.024
RFC 0.034 0.001 1.024
Model fit: MSE = 0.058, R2 = 0.383, AUC = 0.913
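A fitted forward model is applied the same way as the univariate ones, except that the logit is the intercept plus a weighted sum of the selected measures. A minimal sketch using the Table C.3 coefficients (intercept -4.419, CBO 0.123, RFC 0.034); the function name and the example metric values are ours, not from the paper:

```python
import math

def model_probability(intercept, coefs, values):
    """Multivariate logistic model: logit(p) = intercept + sum(ci * xi)."""
    z = intercept + sum(c * x for c, x in zip(coefs, values))
    return 1.0 / (1.0 + math.exp(-z))

# Table C.3 (FRC indicator, FreeMind classes): intercept, then CBO and RFC.
INTERCEPT = -4.419
COEFS = [0.123, 0.034]  # CBO, RFC

# Hypothetical class with CBO = 10 and RFC = 40 (example values, not from the paper).
p = model_probability(INTERCEPT, COEFS, [10, 40])
print(round(p, 3))  # about 0.138
```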
Table C.4: FRC-based forward multivariate regression model for JabRef classes.
Metric Coefficient p-value VIF
Intercept -3.375 < 0.0001 -
RFC 0.058 < 0.0001 1.009
SCOM -2.251 0.017 1.009
Model fit: MSE = 0.059, R2 = 0.237, AUC = 0.848
Table C.5: CRC-based forward multivariate regression model for FreeMind classes.
Metric Coefficient p-value VIF
Intercept -2.392 < 0.0001 -
RFC 0.046 < 0.0001 1.125
DAC2 0.330 0.000 1.125
Model fit: MSE = 0.173, R2 = 0.151, AUC = 0.742
Table C.6: CRC-based forward multivariate regression model for JabRef classes.
Metric Coefficient p-value VIF
Intercept -3.448 < 0.0001 -
CBO 0.063 0.014 1.328
CBO_U 0.201 < 0.0001 1.400
OL2 -1.530 0.043 1.081
Model fit: MSE = 0.063, R2 = 0.310, AUC = 0.865
Jehad Al Dallal received his PhD in Computer Science from the University of Alberta, Canada, where he was granted the award for best PhD researcher. He is currently an Associate Professor and chairman of the Department of Information Science at Kuwait University. Dr. Al Dallal has completed several research projects in the areas of software testing, software metrics, and communication protocols, and has published more than 70 papers in refereed journals and conference proceedings. He has been involved in developing more than 20 software systems, and has served as a technical committee member for several international conferences and as an associate editor for several refereed journals.