Implementation of Conceptual Cohesion of Classes to Predict Faults in Object Oriented Systems

Abstract- Developing quality software is a central concern of the software industry. High cohesion in the classes of a system is one of the significant factors in producing quality software: software with high cohesion is easier to understand, maintain and reuse, and improves productivity. Current practice in software development measures cohesion largely from the structural information in the source code, such as attribute references in methods. This paper presents the implementation of a new measure for the cohesion of classes in object-oriented software systems, based on the analysis of the unstructured information embedded in the source code, such as comments and identifiers. The measure, Conceptual Cohesion of Classes (C3), draws on the mechanisms used to measure textual coherence in cognitive psychology and computational linguistics. The paper presents the principles and the technology that support the C3 measurement.
Keywords: Software quality metrics, Software quality, Latent semantic indexing (LSI), Textual coherence, Software cohesion.

I. INTRODUCTION
Software quality may be defined as conformance to explicitly stated functional and performance requirements, explicitly documented development standards and implicit characteristics that are expected of all professionally developed software. Some of the issues that affect code quality include readability, low complexity, and ease of maintenance, testing, debugging, modification and portability. Cohesion is a measure of the degree to which the elements of a module in a software product belong together, and it is considered imperative from a conceptual point of view. Most approaches to measuring cohesion are automated, as it is impractical to measure the cohesion of classes in large systems manually. The measures in use therefore deal with information that can be extracted automatically from software and analyzed by automated tools, which means they ignore less structured information in the software (for example, textual information). Cohesion is usually measured on structural information extracted solely from the source code (for example, attribute references in methods and method calls), which captures the degree to which the elements of a class belong together from a structural point of view.

The measure for class cohesion presented here, named the Conceptual Cohesion of Classes (C3), captures the conceptual aspects of class cohesion: it measures how strongly the methods of a class relate to each other conceptually. The conceptual relation between methods is based on the principle of textual coherence. This paper interprets the implementations of methods as elements of discourse. Many aspects of a discourse contribute to coherence, including co-reference, causal relationships, connectives, and signals. Source code is far from a natural language, and many aspects of natural language discourse do not exist in source code or need to be redefined. The rules of discourse are also different from those of natural language. Some of the existing metrics to measure cohesion are the LCOM metrics and the TCC and LCC metrics:
LCOM metrics (Lack of Cohesion of Methods): This group of metrics aims to detect problem classes. A high LCOM value means low cohesion.
TCC and LCC metrics (Tight and Loose Class Cohesion): This group of metrics aims to tell good cohesion from bad. With these metrics, large values are good and low values are bad.
In contrast to these structural metrics, this investigation measures the cohesion of classes on the basis of the unstructured information in the code.

1Rakshith V, 2Ramesh Sagar V N, 3Sandeep A, 4Varun B, 5Suma V

1,2,3,4,5Department of Information Science and Engineering, Dayananda Sagar College of Engineering, VTU, Bangalore, India

International Journal of Systems, Algorithms & Applications (IJSAA)

Volume 2, Issue ICTM 2011, February 2012, ISSN Online: 2277-2677

ICTM 2011 | June 8-9, 2011 | Hyderabad | India

II. LITERATURE SURVEY
Author, Jehad Al Dallal (2010) suggests the use of a discrimination metric. The discrimination metric measures the probability that a cohesion metric will produce distinct cohesion values for classes with the same number of attributes and methods but different Connectivity Patterns of Cohesive Interactions (CPCIs). A highly discriminating cohesion metric is more desirable because it exhibits a lower chance of inappropriately considering classes to be cohesively equal when they have different CPCIs [1].

Authors, Bela Ujhazi, et al. (2010) present two novel conceptual metrics for measuring coupling and cohesion in software systems. The first, Conceptual Coupling Between Object classes (CCBO), is based on the well-known CBO coupling metric, while the other, Conceptual Lack of Cohesion on Methods (CLCOM5), is based on the LCOM5 cohesion metric. One of the advantages of the proposed conceptual metrics is that they can be computed in a simpler (and in many cases, programming language independent) way as compared to some of the structural metrics [2].

Authors, Lalji Prasad, et al. (2009) define a new set of operational measures for the conceptual coupling of classes that have been empirically studied and are theoretically valid. In this paper, they show that these metrics capture new dimensions in coupling measurement, compared to existing structural metrics [3].

Authors, Sukainah Husein, et al. (2009) introduce their view of coupling and cohesion metrics and their implementation approach. Coupling and cohesion metrics are calculated by considering a number of relationships introduced by several researchers. Based on these relationships, some sets of metrics were chosen and implemented [4].

Authors, Andrian Marcus, et al. (2008) propose a new measure for the cohesion of classes in OO software systems based on the analysis of the unstructured information embedded in the source code, such as comments and identifiers. The measure, named the Conceptual Cohesion of Classes (C3), is inspired by the mechanisms used to measure textual coherence in cognitive psychology and computational linguistics. The paper presents the principles and the technology that stand behind the C3 measure [5].

Authors, Richard Barker, et al. (2007) present the first large-scale empirical study of object-oriented cohesion metrics. Their results show that, by and large, applications have similar distributions of measurements according to any given metric, but that the distributions can be quite different across metrics. This provides useful information for the ongoing empirical validation efforts for cohesion metrics [6].

Authors, Andrian Marcus and Denys Poshyvanyk (2005) propose a new set of measures for the cohesion of individual classes within an OO software system, based on the analysis of the semantic information embedded in the source code, such as comments and identifiers. They present a case study on open source software which compares the new measures with an extensive set of existing metrics. They further discuss and analyze the differences and similarities among the approaches and results [7].

III. RESEARCH DECISIONS
The class of structural metrics is the most investigated category of cohesion metrics. It includes the Lack of Cohesion in Methods family (LCOM, LCOM1, LCOM2, LCOM3, LCOM4 and LCOM5), Coh, TCC (Tight Class Cohesion) and LCC (Loose Class Cohesion).
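As a rough illustration of the attribute-sharing metrics named above, the following Python sketch computes LCOM1, LCOM2, LCOM5 and Coh from the definitions given in Table 1. The input format (a dictionary mapping each method name to the set of attributes it references) and the function names are assumptions made for this example; they are not taken from the implementation described in this paper.

```python
from itertools import combinations


def _pair_counts(method_attrs):
    """Return (P, Q): method pairs sharing no attribute, and pairs sharing one or more."""
    sets = [set(attrs) for attrs in method_attrs.values()]
    pairs = list(combinations(sets, 2))
    p = sum(1 for x, y in pairs if not (x & y))
    return p, len(pairs) - p


def lcom1(method_attrs):
    """LCOM1: number of pairs of methods that do not share attributes."""
    return _pair_counts(method_attrs)[0]


def lcom2(method_attrs):
    """LCOM2: P - Q if that is non-negative, otherwise 0."""
    p, q = _pair_counts(method_attrs)
    return max(p - q, 0)


def _akl(method_attrs, attributes):
    """Return (a, k, l) as defined for LCOM5 and Coh in Table 1."""
    attributes = set(attributes)
    k = len(method_attrs)
    l = len(attributes)
    a = sum(len(set(attrs) & attributes) for attrs in method_attrs.values())
    return a, k, l


def lcom5(method_attrs, attributes):
    """LCOM5 = (a - kl) / (l - kl); undefined when the denominator is zero."""
    a, k, l = _akl(method_attrs, attributes)
    if l - k * l == 0:
        raise ValueError("LCOM5 is undefined for k == 1 or l == 0")
    return (a - k * l) / (l - k * l)


def coh(method_attrs, attributes):
    """Coh = a / kl."""
    a, k, l = _akl(method_attrs, attributes)
    if k * l == 0:
        raise ValueError("Coh is undefined for a class without methods or attributes")
    return a / (k * l)


# Hypothetical class with three attributes and four methods.
account = {
    "deposit": {"balance"},
    "withdraw": {"balance"},
    "rename": {"owner"},
    "audit": {"log", "balance"},
}
print(lcom1(account), lcom2(account))
print(lcom5(account, {"balance", "owner", "log"}), coh(account, {"balance", "owner", "log"}))
```

For the hypothetical class above, three of the six method pairs share an attribute, so P = Q = 3, and the methods reference a = 5 attributes in total with k = 4 and l = 3.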

The dominating philosophy behind this category of metrics considers class variable referencing and sharing between methods as contributing to the degree to which the methods of a class belong together. Most structural metrics define and measure relationships among the methods of a class based on this principle. Cohesion is seen as dependent on the number of pairs of methods that share instance or class variables, one way or another. The differences among the structural metrics lie in the definition of the relationships among methods, the system representation and the counting mechanism. Somewhat different in this class of metrics are LCOM5 and Coh, which consider that cohesion is directly proportional to the number of instance variables in a class that are referenced by the methods in that class.

Table 1: Definitions of a few class cohesion metrics

| Class cohesion metric | Definition/Formula |
|---|---|
| Lack of Cohesion of Methods (LCOM1) (Chidamber and Kemerer 1991) | LCOM1 = number of pairs of methods that do not share attributes. |
| LCOM2 (Chidamber and Kemerer 1994) | LCOM2 = P - Q if P - Q >= 0, and 0 otherwise, where P = number of pairs of methods that do not share attributes and Q = number of pairs of methods that share attributes. |
| LCOM3 (Li and Henry 1993) | LCOM3 = number of connected components in the graph that represents each method as a node and the sharing of at least one attribute as an edge. |
| LCOM4 (Hitz and Montazeri 1995) | Similar to LCOM3, with additional edges used to represent method invocations. |
| LCOM5 (Henderson-Sellers 1996) | LCOM5 = (a - kl)/(l - kl), where l is the number of attributes, k is the number of methods, and a is the summation of the number of distinct attributes accessed by each method in a class. |
| Coh (Briand et al. 1998) | Coh = a/kl, where a, k, and l have the same definitions as above. |
| Tight Class Cohesion (TCC) (Bieman and Kang 1995) | TCC = relative number of directly connected pairs of methods, where two methods are directly connected if they are directly connected to the same attribute. A method m is directly connected to an attribute when the attribute appears within the method's body or within the body of a method invoked by method m directly or transitively. |
| Loose Class Cohesion (LCC) (Bieman and Kang 1995) | LCC = relative number of directly or transitively connected pairs of methods, where two methods are transitively connected if they are directly or indirectly connected to the same attribute. A method m, directly connected to an attribute j, is indirectly connected to an attribute i when there is a method directly or transitively connected to both attributes i and j. |

LCOM4 measures the number of "connected components" in a class. A connected component is a set of related methods (and class-level variables). There should be only one such component in each class. If there are two or more components, the class should be split into that many smaller classes. Two methods a and b are related if they both access the same class-level variable or if one of them calls the other. Having determined the related methods, we draw a graph linking the related methods to each other. LCOM4 equals the number of connected groups of methods. LCOM4 = 1 indicates a cohesive class; LCOM4 >= 2 indicates a problem, and the corresponding class should be split into smaller classes.

Outline of Latent Semantic Indexing
LSI is a corpus-based statistical method for inducing and representing aspects of the meanings of words and passages (of natural language) reflective of their usage in large bodies of text. LSI is based on a vector space model (VSM), as it generates a real-valued vector description for documents of text. Results have shown that LSI captures significant portions of the meaning not only of individual words but also of whole passages, such as sentences, paragraphs, and short essays. The central concept of LSI is that the information about the contexts in which a particular word appears or does not appear provides a set of mutual constraints that determines the similarity of meaning of sets of words to each other.
LSI was originally developed in the context of information retrieval (IR) as a way of overcoming the problems with polysemy and synonymy that occur with VSM approaches. Some words appear in the same contexts, and an important part of word usage patterns is blurred by accidental and inessential information. The method used by LSI to capture the essential semantic information is dimension reduction: selecting the most important dimensions from a co-occurrence matrix (words by context) decomposed using singular value decomposition (SVD). As a result, LSI offers a way of assessing semantic similarity between any two samples of text in an automatic, unsupervised way. LSI relies on an SVD of a matrix (word × context) derived from a corpus of natural text that pertains to knowledge in the particular domain of interest. According to the mathematical formulation of LSI, the term combinations that occur less frequently in the given document collection tend to be precluded from the LSI subspace. LSI thus performs "noise reduction," as less frequently co-occurring terms are less mutually related and, therefore, less sensible. The formalism behind SVD is rather complex and too lengthy to be presented here. Once the documents are represented in the LSI subspace, the user can compute similarity measures between documents by the cosine between their corresponding vectors or by their length. These measures can be used for clustering similar documents together to identify "concepts" and "topics" in the corpus. This type of usage is typical for text analysis tasks. Uses of LSI in software engineering are presented and discussed in our previous work.

The designers and the programmers of a software system often think about a class as a set of responsibilities that approximate the concept from the problem domain implemented by the class, as opposed to a set of method-attribute interactions. Information that gives clues about domain concepts is encoded in the source code as comments and identifiers. Among the existing cohesion metrics for OO software, the Logical Relatedness of Methods (LORM) and the Lack of Conceptual Cohesion in Methods (LCSM) are the only ones that use this type of information to measure the conceptual similarity of the methods in a class. The philosophy behind this class of metrics, into which our work falls, is that a cohesive class is a crisp implementation of a problem or solution domain concept. Hence, if the methods of a class are conceptually related to each other, the class is cohesive. The difficult problem here is how conceptual relationships can be defined and measured. LORM uses natural language processing techniques for the analysis needed to measure the conceptual similarity of methods and represents a class as a semantic network. LCSM uses the same information, indexed with LSI, and represents classes as graphs that have methods as nodes. It uses a counting mechanism similar to LCOM.
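The graph view used by LCSM (methods as nodes, related methods linked) mirrors the LCOM4 computation described earlier in this section. A minimal Python sketch of that counting step is given below. It assumes the attribute references and method calls have already been extracted; the function name `lcom4`, the input format and the example class are hypothetical and serve only to illustrate the connected-component count.

```python
def lcom4(method_attrs, calls=()):
    """Count connected groups of methods.

    method_attrs: dict mapping method name -> set of class-level variables it accesses.
    calls: iterable of (caller, callee) pairs between methods of the same class.
    Two methods are linked if they access the same variable or one calls the other.
    """
    parent = {m: m for m in method_attrs}

    def find(m):
        # Follow parent links to the representative, compressing the path on the way.
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every method to the first method seen using the same variable.
    first_user = {}
    for method, attrs in method_attrs.items():
        for attr in attrs:
            if attr in first_user:
                union(method, first_user[attr])
            else:
                first_user[attr] = method

    # Link callers and callees; calls to methods outside the class are ignored.
    for caller, callee in calls:
        if caller in parent and callee in parent:
            union(caller, callee)

    return len({find(m) for m in method_attrs})


# Hypothetical class: "rename" shares nothing with the other methods.
methods = {
    "deposit": {"balance"},
    "withdraw": {"balance"},
    "rename": {"owner"},
    "audit": {"log"},
}
print(lcom4(methods, calls=[("audit", "withdraw")]))
```

A union-find structure is used here only because it keeps the sketch short; a depth-first search over the method graph would give the same count.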

IV. DATA FLOW DIAGRAM
The Java program is given as input to the module. From this input the module extracts two kinds of data: variables and methods on the one hand, and valuable comments on the other. This data is used in the rest of the implementation. The variables and methods are considered to be structured information. The valuable comments obtained from the program are considered to be unstructured information. The structured information is processed using the LCOM5 formula. The unstructured information is processed using the LSI technique and by the application of the vector calculation metric. The results thus obtained are analyzed and interpreted to provide the output.
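The extraction step can be approximated with a few regular expressions. The Python sketch below is only an illustration under simplifying assumptions: it does not parse Java properly (for instance, comment markers inside string literals would be misread), the keyword list is partial, and the names `extract_information` and `split_identifier` are hypothetical.

```python
import re

# Line comments and block comments; DOTALL lets block comments span lines.
COMMENT_RE = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
IDENT_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# Partial list of Java keywords and literals to discard (assumption: enough for a demo).
JAVA_KEYWORDS = {
    "class", "public", "private", "protected", "static", "final", "void",
    "int", "long", "double", "float", "boolean", "char", "byte", "short",
    "return", "if", "else", "for", "while", "new", "this", "import",
    "package", "null", "true", "false",
}


def split_identifier(name):
    """Split an identifier on underscores and camelCase boundaries, lowercased."""
    spaced = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", name).replace("_", " ")
    return [part.lower() for part in spaced.split()]


def extract_information(source):
    """Separate identifiers (structured) from comment words (unstructured)."""
    comments = COMMENT_RE.findall(source)
    code = COMMENT_RE.sub(" ", source)
    identifiers = [t for t in IDENT_RE.findall(code) if t not in JAVA_KEYWORDS]
    comment_words = [
        w.lower() for c in comments for w in re.findall(r"[A-Za-z]+", c)
    ]
    return {"identifiers": identifiers, "comment_words": comment_words}


java_source = """class Account {
  // current balance
  private int accountBalance;
  /* add money */
  void deposit(int amount) { accountBalance += amount; }
}
"""
info = extract_information(java_source)
print(info["identifiers"])
print(info["comment_words"])
```

Splitting identifiers such as `accountBalance` into `account` and `balance` lets identifier terms and comment words share one vocabulary when they are later indexed.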






Investigation Analysis Work
The conceptual similarity between documents is measured via the cosine or inner product between the corresponding vectors (i.e., methods), which increases as more words are shared. This underlying mechanism supports the idea of measuring conceptual coupling and cohesion in software based on word matching from identifiers and comments. The source code of the software system is parsed and transformed into a corpus of textual documents, where each document corresponds to the implementation of a method. The LSI technique described above takes the corpus as input and creates a term-by-document matrix, which captures the dispersion and co-occurrence of terms in class methods. SVD is then used to construct a subspace, referred to as the LSI subspace. All methods from this matrix are represented as vectors in the LSI subspace. The cosine similarity between two vectors is used as a measure of conceptual similarity between two methods and is taken to indicate the conceptual information shared by the two methods in the context of the entire software system.

MODULES
I. Retrieving the structured information.
II. Check the availability of structured information for your source code.
III. Apply the LCOM5 formula for structured information.
IV. Analyze the comments, i.e. unstructured information.
V. Index searching.
VI. Apply the conceptual similarity formula.
VII. Comparison.
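The indexing and similarity steps (corpus, term-by-document matrix, SVD, cosine) can be sketched in Python as follows. This is an illustrative sketch, not the implementation used in this work: it assumes NumPy is available, uses raw term counts without any term weighting, and the function names and the four toy "methods" are invented for the example.

```python
import numpy as np


def term_document_matrix(docs):
    """Build a term-by-document count matrix from tokenised documents (one per method)."""
    vocab = sorted({term for doc in docs for term in doc})
    row = {term: i for i, term in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for term in doc:
            matrix[row[term], j] += 1.0
    return matrix, vocab


def lsi_document_vectors(matrix, k):
    """Project documents into a k-dimensional LSI subspace via truncated SVD."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(k, len(s))
    # Each document is a column of diag(s_k) @ Vt_k; return one row per document.
    return (vt[:k, :] * s[:k, None]).T


def cosine_matrix(vectors):
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero vectors at similarity 0
    unit = vectors / norms
    return unit @ unit.T


# Toy corpus: two file-handling methods and two drawing methods (hypothetical).
methods = [
    "open file read file".split(),
    "read file close file".split(),
    "draw pixel color".split(),
    "color pixel screen".split(),
]
A, vocab = term_document_matrix(methods)
similarity = cosine_matrix(lsi_document_vectors(A, k=2))
print(np.round(similarity, 3))
```

With k = 2 the two file-handling methods collapse onto one direction of the subspace and the two drawing methods onto the other, so similarity is high within each group and close to zero across groups. The choice of k and of a term-weighting scheme would need tuning on a real corpus.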

Assessment of the new cohesion measure
In order to evaluate the measure, two case studies were conducted. The goal of the first case study was to determine whether the C3 measure captures additional dimensions of cohesion measurement when compared to existing structural cohesion measures. Our hypothesis is that, given the nature of the information and the counting mechanism employed by C3, it should capture different aspects of class cohesion than existing structural measures. Existing research has shown that cohesion measures can be used as good indicators of the fault proneness of classes in OO systems. In the second case study, C3 is compared with existing metrics, and combinations of C3 with existing cohesion metrics are compared with combinations of structural metrics alone, to assess whether they provide better results in predicting faults in classes. Our assumption is that combining C3 with structural cohesion metrics should give a more complete indicator of cohesion (given that they capture different aspects of it); hence, it should be a better indicator of fault proneness than combinations of structural metrics alone.

V. RESULT
Software with high cohesion improves the understanding, productivity, maintenance and reuse of the product. Conceptual cohesion of classes is the mechanism that has been implemented here, building on the measurement of textual coherence. The measure for class cohesion, named the Conceptual Cohesion of Classes (C3), captures the conceptual aspects of class cohesion: it measures how strongly the methods of a class relate to each other conceptually. Many aspects of a discourse contribute to coherence, including co-reference, causal relationships, connectives, and signals. The method used by LSI to capture the essential semantic information is dimension reduction: selecting the most important dimensions from a co-occurrence matrix (words by context) decomposed using singular value decomposition (SVD). As a result, LSI offers a way of assessing semantic similarity between any two samples of text in an automatic, unsupervised way.
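To show how pairwise method similarities can be aggregated into a single class-level value, the Python sketch below follows the C3 definition of Marcus et al. [5] as we understand it: the average cosine similarity over all method pairs of a class, with negative averages clamped to zero. The function names, the handling of single-method classes and the example vectors are assumptions made for this illustration.

```python
from itertools import combinations
from math import sqrt


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(method_vectors):
    """Average cosine similarity over all distinct pairs of method vectors."""
    pairs = list(combinations(method_vectors, 2))
    if not pairs:
        # Assumption: the measure is treated as undefined for fewer than two methods.
        raise ValueError("at least two methods are needed")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def c3(method_vectors):
    """Conceptual cohesion of a class: the average similarity, clamped at zero."""
    return max(average_similarity(method_vectors), 0.0)


# Hypothetical LSI vectors for the three methods of one class.
vectors = [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(c3(vectors))
```

In this example only one of the three method pairs is conceptually similar, so the class scores one third; a class whose methods all point in the same direction of the LSI subspace would score 1.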

VI. CONCLUSION
Classes in object-oriented systems, written in different programming languages, contain identifiers and comments which reflect concepts from the domain of the software system. This information can be used to measure the cohesion of software. To extract this information for cohesion measurement, Latent Semantic Indexing can be used in a manner similar to measuring the coherence of natural language texts. This paper defines the conceptual cohesion of classes, which captures new and complementary dimensions of cohesion compared to a host of existing structural metrics. Principal component analysis of measurement results on three open source software systems statistically supports this. Faults in classes can be predicted better using a combination of structural and conceptual cohesion metrics than using combinations of structural metrics alone. Highly cohesive classes need to have a design that ensures a strong coupling among their methods and a coherent internal description. Overall, the results indicate that C3 (Conceptual Cohesion of Classes) is a useful indicator of an external property of classes in OO systems, namely their fault proneness.

REFERENCES

[1] Jehad Al Dallal, “Measuring the Discriminative Power of Object-Oriented Class Cohesion Metrics”, IEEE, 2010.
[2] Béla Újházi, Rudolf Ferenc, Denys Poshyvanyk and Tibor Gyimóthy, “New Conceptual Coupling and Cohesion Metrics for Object-Oriented Systems”, Working Conference on Source Code Analysis and Manipulation, 2010.
[3] Lalji Prasad, Aditi Nagar, “Experimental Analysis of Different Metrics (Object-Oriented and Structural) of Software”, First International Conference on Computational Intelligence, Communication Systems and Networks, 2009.
[4] Sukainah Husein, Alan Oxley, “A Coupling and Cohesion Metrics Suite for Object-Oriented Software”, International Conference on Computer Technology and Development, 2009.
[5] Andrian Marcus, Denys Poshyvanyk, Rudolf Ferenc, “Using the Conceptual Cohesion of Classes for Fault Prediction in Object-Oriented Systems”, IEEE Transactions on Software Engineering, 2008.
[6] Richard Barker, Ewan Tempero, “A Large-Scale Empirical Comparison of Object-Oriented Cohesion Metrics”, 14th Asia-Pacific Software Engineering Conference, 2007.
[7] Andrian Marcus, Denys Poshyvanyk, “The Conceptual Cohesion of Classes”, 21st IEEE International Conference on Software Maintenance (ICSM’05), 2005.





