Fuzzy Retrieval for Software Reuse

Erin Colvin and Donald H. Kraft
School of Computer Science, Colorado Technical University, 4435 Chestnut Street, Colorado Springs, CO 80907. Email: [email protected], [email protected]

Abstract

Finding software for reuse is a problem that programmers face. Reusing code that has been proven to work can increase any programmer's productivity, benefit corporate productivity, and increase the stability of software programs. This paper shows that fuzzy retrieval has improved retrieval performance over typical Boolean retrieval. Various methods of fuzzy information retrieval implementation and their use for software reuse are examined, along with the fundamentals of designing a fuzzy information retrieval system for software reuse. Future research options and necessary data storage systems are explored.

Introduction

The main goal of any programmer is to develop and deliver high quality software applications that meet a customer's needs. For years, programmers have searched the web and other open source libraries for software components to reuse instead of creating an application from scratch (Thummalapenta, 2011). The number of open-source software libraries on the web has increased as well. Software reuse has been a proven and effective tool for developers to meet time-to-market deadlines and produce a solid, error-free piece of software (Krueger, 1992). Software libraries are full of software components that have been created and stored with proven track records. Mockus (2007) found that 50% of the code being created for production used code from previous programs. The challenge is being able to find an already-written piece of software when needed. When searching for data, most users try to quantify what they are looking for in as few words as possible. Then it is up to the system to yield the results that it determines to be the best match (Bordogna, Carrara, & Pasi, 1992). Searching for data can be done in many ways; searching for software requires an alternative approach, since the programming language, the behavior of the software, or the intended purpose may require different search terms.

Many information retrieval (IR) applications today use a Boolean match; a document either contains the searched term, or it doesn't. The IR system returns a list of documents in no particular order, requiring the user to discern which match is the closest fit. Fuzzy logic attaches a weighted value to those matches based on the degree of match, yielding a better chance of meeting the user's goals (Bordogna, Carrara, & Pasi, 1992).

By searching for software and reusing code that has already been tested and verified to work, programmers can reduce development and test time. Less development and test time means a faster time to market. Software reuse can involve one of four types of reusable artifacts: data reuse, architecture reuse, design reuse, and program reuse (Aziz & North, 2007). If software returned from a search can be listed by degree of match, a user will have a better choice of which software component to use. The goal of this study is to implement algorithms using fuzzy logic that have a higher success rate of returning a better match of software that can be reused.

It is well documented in the literature that in order for software to be reused, it has to be found, and current information retrieval algorithms are ineffective in finding quality software components in an efficient manner, because they are either run on the Internet over open source software, constructed using a Boolean logic based matching system, or both (Prieto-Diaz, 1991; Yao, Etzkorn, & Virani, 2008).


Using term weights, we have developed a fuzzy search methodology implementing the MMM, P-norm and Paice algorithms. This study is an extension of the 1991 study by Maarek, using the same data corpus and the same means of generating an accurate measure of precision. These algorithms should produce a better matched list of returned values than the standard Boolean logic method: looking at the degree of membership should yield a higher rate of successful matches than the Boolean method, and a higher success rate in returned software components should result in a higher reuse rate. This study is also the first to test extended Boolean functions with actual data.

Already described is the need for software reuse in today's programming industry and some of the reasons why it is not widely used in everyday practice. One reason is the need for an accurate look-up or retrieval system that can quickly and accurately return an appropriate software component. There are many methods of implementing information retrieval; the one at the focus of this study uses fuzzy sets. Because fuzzy set theory offers a wider range of possible matches, it should deliver the most accurate result to a user's query.

Software Reuse

Software reuse was first introduced by McIlroy at a NATO conference in 1968 (McIlroy, 1968). McIlroy understood the importance of creating a solid software component and the need to create an inventory system that would allow these components to be widely accessible to different machines and users. McIlroy said then that in order for reuse to become widespread, a standardized library is needed to store and index software components (McIlroy, 1968). This has been widely discussed throughout the literature as the primary issue hindering software reuse from becoming an industry standard today (Mili, Mili, & Mili, 1995).

There are two major benefits to software reuse: "1. Those components that have already been tested provide higher guarantees of robustness and reliability in any future implementation and 2. Component reuse should lead to faster development times and lower costs" (Gibb, McCartan, O'Donnell, Sweeney, & Leon, 2000, p. 212). With the increasing demands for software development and the inability of programmers to keep up, software reuse is a practice that can reduce development time and lead to increased stability of a system (Yao, Etzkorn, & Virani, 2008). Software reuse was first introduced as a way to minimize creation time and help build a more stable system with components that have been previously created and tested (Krueger, 1992). Charles Krueger states that although this practice was introduced in the 1960s, it is still not widely used in software engineering today. Krueger goes on to say that software reuse can be defined as the "direct reuse of components or code, the abstraction of ideas, or adaptation of software to fit the needs of others" (1992, p. 131). Software reuse also reduces effort and development time, which decreases time to market for certain systems. By decreasing time to market, companies can increase productivity and decrease the development time of new or improved systems (Keswani, Joshi, & Jatain, 2014).

Reusing a software component will also improve the quality of that component: the more times a component gets reused, the more chances there are for bugs to be found and fixed. The initial cost of creating the component can also be recouped in just a few reuses (Keswani, Joshi, & Jatain, 2014; Vishal, Chander, & Kundu, 2012).

Mili, Mili and Mili (1995) pointed out that in 1984, 60% of software created could have been standardized and reused. Quantifying the amount of software that is actually reused has posed a problem for researchers. Mockus (2007) found that more than 50% of the open source software files available for his study had been used in more than one program, based on a file being present in more than one program. This covers only open source code that is widely available on the internet; Mockus did not look at proprietary software libraries within companies. In total, Mockus looked through 13.2 million open source code files and found that 52% of the files had been shared at least once. This means that any file shared on an open source website has about a 50% chance of being reused (Mockus, 2007). The demand for quality software and the sustainability of code that has already been tested was the number one driving factor found in that study.


Sandhu, Kaur and Singh (2009) looked at reusing software from currently active systems, which they said had not been previously studied; most software reusability work is based on older versions of software that are stored and no longer in use. They observed that in order for programmers to reuse software, they must first be able to find it useful (Sandhu, Kaur, & Singh, 2009). They found that to get a more accurate measure of reuse, the domain must also be taken into consideration. They devised a neural network that would automatically evaluate the reusability of object oriented software components. Their metric was successful in measuring the reusability of software, but was not optimal compared to other similar studies (Sandhu, Kaur, & Singh, 2009).

It has been shown that if programmers can find good quality software, they are more likely to reuse the code segment. In a study by Haefliger et al., programmers were asked specifically what drives them to reuse code and what means they are most likely to use to find it. The biggest drivers for searching for software components included not wanting to rewrite large sections of code like a sort or search, wanting to understand current code implementations, code repair, and lastly resource constraints like lack of time and testing resources (Haefliger, von Krogh, & Spaeth, 2008, p. 183; Sim, Clarke, & Holt, 1998). In agreement with the study by Haefliger et al., a study by Agresti (2011) found that programmers saw a 26% increase in productivity by reusing old code, and that overall, programmers are willing to reuse code no matter the time constraint: good code is good code. An interesting discovery from this study was that programmers did not believe in the "if you want something done right, do it yourself" mentality, but that one of the biggest deterrents to reusing code is a lack of documentation of what the code actually does (Agresti, 2011).

Boolean Logic in IR

In 1960, Maron and Kuhns introduced a novel technique to solve the library indexing issue by defining an index for a document as a unique tag that identifies the information in that document. Using this index, data is easier and faster to search through. Although they did not have electronic versions of documents, Maron and Kuhns's system was still effective in retrieving information from the library. This system of searching for information using only a small piece of data representative of the entire entry has been the foundation for other information retrieval systems (Maron & Kuhns, 1960).

Boolean logic is simple and clean. Every result returned has the same chance of matching the query as the next. The problem with that, in today's world of data storage, is that the returned lists can be very large, and the list is ordered first match first, not best match first (Baeza-Yates & Ribeiro-Neto, 2011). Even though a weighted system would return a more accurate list, the Boolean retrieval model is still the most popular among search based algorithms (Bordogna & Pasi, 1993). Most searches today incorporate some form of weighted system to reduce the amount of returned data and to limit the number of relevant returns.

A Boolean system gives a value of 1 to a document that contains the queried term and a value of 0 to a document that does not. This is based on only one occurrence of the term, so documents containing the term a different number of times get returned the same way (Baeza-Yates & Ribeiro-Neto, 2011). Therefore, a document that contains the searched term 100 times is returned with the same priority as a document that contains the searched term once. Although this is a fast way to search through documents, the resulting list of matches presented to the user can be deceiving (Baeza-Yates & Ribeiro-Neto, 2011).

Boolean logic has a number of disadvantages, including the fact that the size of the output is hard to control or even predict. Because matching is done regardless of the number of times a term appears in a document, the output is not initially ranked by how well a document matches the queried term. Therefore, choosing the document that best meets the user's query is left to the user (Salton, Fox, & Wu, 1983).
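To make the contrast concrete, the following is a minimal sketch (our illustration, not code from any cited system) of strict Boolean scoring, in which every matching document receives the same score of 1 regardless of term frequency:

import java.util.List;
import java.util.Set;

public class BooleanMatch {

    // Strict Boolean AND: a document scores 1 if it contains every query term, else 0.
    // Term frequency is ignored, so 100 occurrences tie with a single occurrence.
    static int scoreAnd(Set<String> docTerms, List<String> queryTerms) {
        return docTerms.containsAll(queryTerms) ? 1 : 0;
    }

    // Strict Boolean OR: a document scores 1 if it contains any query term, else 0.
    static int scoreOr(Set<String> docTerms, List<String> queryTerms) {
        return queryTerms.stream().anyMatch(docTerms::contains) ? 1 : 0;
    }

    public static void main(String[] args) {
        Set<String> doc = Set.of("move", "file", "copy");
        System.out.println(scoreAnd(doc, List.of("move", "file"))); // 1
        System.out.println(scoreOr(doc, List.of("sum", "sort")));   // 0
    }
}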


Fuzzy Sets and Extended Boolean Logic in IR

Miyamoto (1990) says that an IR system using fuzzy logic takes a query as input and returns a list of documents as output, and that how the documents are scored is a measure of the degree of match of a document to a query. Bordogna and Pasi (1993) created a fuzzy linguistic approach with generalized Boolean IR, which they say incorporates the accuracy of a Boolean match with a fuzzy model. In their study, they replace numeric fuzzy weights such as 0.8 with linguistic values such as very important. They say that by replacing the numeric value with a qualitative descriptor, they can get a better sense of how a term matches a user's query. They found that this allowed the user to calculate recall and precision much more easily, without having to quantify a specific number for the degree of importance of a term in a document (Bordogna & Pasi, 1993).

A study by Bordogna, Carrara and Pasi extended the Boolean information retrieval methodology to better satisfy a user's query using a weighted system (Bordogna, Carrara, & Pasi, 1992). The authors used a retrieval status value (RSV), obtained by combining the resulting weights from the function F: D × T → [0,1] (Bordogna, Carrara, & Pasi, 1992; Herrera-Viedma, 2001), where D is the set of all documents, T is the set of terms, and F is a function of the occurrences of a term from T in a document from D (Bordogna, Carrara, & Pasi, 1992). The RSV is used to represent closeness to the ideal document (Herrera-Viedma, 2001). A value of 0 for an index term means there is no match for the indexed term, and a value of 1 indicates a perfect match (Kraft, Bordogna, & Pasi, 1998). Based on this value, a constraint system is set up that interprets the value as the degree of relevance to the desired ideal. The results are then listed in descending order from perfect match.

The other statistical models are the vector and probabilistic models (Srinivasan, Ruiz, Kraft, & Chen, 2001). The vector model assigns a non-binary weight to indexed terms and calculates a degree of similarity between the query and each stored document. The resulting list of matched documents is sorted and presented in descending order; this allows documents that are only a partial match to be returned to the user. The formal definition of the vector model is such that "the weight w_ij associated with a term-document pair (k_i, d_j) is non-negative and non-binary" (Baeza-Yates & Ribeiro-Neto, 2011, p. 77). The probabilistic model looks at the documents and, once it finds a match, assigns a probability that the user will find the document relevant; instead of assigning a degree of match, it looks at the statistics of probability. Baeza-Yates and Ribeiro-Neto say "given a query q, the probabilistic model assigns to each document d_j, as a measure of its similarity to the query, the ratio P(d_j relevant-to q) / P(d_j non-relevant-to q), which computes the odds of the document d_j being relevant to the query q" (2011, p. 80).

Lotfi Zadeh is credited with the creation of fuzzy set theory back in 1965, though parts of the theory can be traced back to the 1920s. Fuzzy set theory is described as a cross between Boolean logic and multi-valued set theory (Miller, 1996). Zadeh says of fuzzy set theory that "as a system becomes more complex, the need to describe it with precision becomes less important" (Miller, 1996, p. 29). Fuzzy sets allow a degree of match, or varying scale of relationship, that is otherwise not included in mathematics.

A fuzzy set is defined by Klir and Yuan (1995) as assigning every point in a space a value based on the grade of membership of each point to the set. For example, "a fuzzy set representing our concept of sunny might assign a degree of membership of 1 to a cloud cover of 0%, .8 to a cloud cover of 20%, .4 to a cloud cover of 30% and 0 to a cloud cover of 75%" (Klir & Yuan, 1995, p. 491). By assigning a degree of match, there is more flexibility in data searches for finding a better match.
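As an illustration, here is a minimal sketch of such a membership function for the sunny concept; the linear interpolation between the grades Klir and Yuan list is our assumption for illustration, not part of their definition:

public class SunnyMembership {

    // Cloud-cover percentages and the membership grades listed for "sunny".
    private static final double[] COVER = {0, 20, 30, 75};
    private static final double[] GRADE = {1.0, 0.8, 0.4, 0.0};

    // Degree of membership of a given cloud cover in the fuzzy set "sunny",
    // linearly interpolated between the listed points (an illustrative choice).
    static double sunny(double cloudCover) {
        if (cloudCover <= COVER[0]) return GRADE[0];
        if (cloudCover >= COVER[COVER.length - 1]) return GRADE[GRADE.length - 1];
        for (int i = 1; i < COVER.length; i++) {
            if (cloudCover <= COVER[i]) {
                double t = (cloudCover - COVER[i - 1]) / (COVER[i] - COVER[i - 1]);
                return GRADE[i - 1] + t * (GRADE[i] - GRADE[i - 1]);
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        System.out.println(sunny(20)); // 0.8, as in Klir and Yuan's example
        System.out.println(sunny(50)); // an interpolated grade between 0.4 and 0.0
    }
}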

Fuzzy logic allows a system to be set up in a way that groups similar terms together, the way the human mind summarizes data. For example, if the linguistic values are young, old, and infant, the linguistic variable would be age. Another way to describe this grouping is as a membership function. Using linguistic values or membership functions, a series of rules can be defined. These rules are fundamental to the Fuzzy Dependency and Command Language, or FDCL.


The FDCL, unlike Fuzzy Prolog, is not a fuzzified version of a standard programming language. Like all languages, FDCL is defined by its semantics and syntax (Zadeh, 1994). In a typical Boolean retrieval model, a document collection is queried with term A and term B. Using AND results in only the documents with both terms present being given a value of 1, with the remaining documents assigned 0. If the OR operator is used, documents that have neither term are assigned the value of 0 and the remaining documents are given a value of 1.

Salton, Fox, and Wu (1983) look at the vector-processing retrieval model and assign similarity values using the Euclidean distance from the point (1,1) for AND queries, because (1,1) is the point where a document contains both terms, making it the ideal location for a perfect match. Documents that contain one term but not both are scored using term weights d_A for term A and d_B for term B: for an AND query the distance is

√((1 − d_A)^2 + (1 − d_B)^2)

and for the OR operator the distance from the point (0,0) is

√((d_A − 0)^2 + (d_B − 0)^2)

(Salton, Fox, & Wu, 1983). With a maximum possible distance of √2, it is clear that a weighted system will provide a closer match than a Boolean logic model. The results of their vector-processing retrieval model showed a 172% improvement in recall and precision over a Boolean logic retrieval model on the same data.
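A small sketch of these distance computations follows; mapping the distances into a [0,1] similarity by normalizing with the maximum distance √2 is our reading of the text, not an equation Salton, Fox, and Wu state explicitly here:

public class VectorProcessing {

    // Distance from the ideal point (1,1) for an AND query; smaller is better.
    static double andDistance(double dA, double dB) {
        return Math.sqrt(Math.pow(1 - dA, 2) + Math.pow(1 - dB, 2));
    }

    // Distance from the origin (0,0) for an OR query; larger is better.
    static double orDistance(double dA, double dB) {
        return Math.sqrt(dA * dA + dB * dB);
    }

    public static void main(String[] args) {
        double dA = 0.9, dB = 0.4; // hypothetical term weights for terms A and B
        double max = Math.sqrt(2); // maximum possible distance
        // Normalized similarities in [0,1] (the normalization is our illustrative assumption).
        System.out.printf("AND similarity: %.3f%n", 1 - andDistance(dA, dB) / max);
        System.out.printf("OR similarity:  %.3f%n", orDistance(dA, dB) / max);
    }
}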

The vector-processing model is effective in finding a successful match with more precision than a Boolean system, but if there is more than one term it becomes impossible to determine which term gets the highest precedence. For example, if the search is Information AND Retrieval AND Software AND Reuse, any document with the word reuse will get the same degree of match as a document containing the word retrieval, which may not be what the user desires. Bookstein introduced a model that adds weights not only to the searched result list, but also to the queried words. He suggests scaling the retrieval status value (RSV) by the query term weight. For example, suppose the term reuse has the membership values {(d1, 1), (d2, 0.8), (d3, 0)}, where each d is a document and the number is its membership for reuse. If the request becomes reuse with a query weight of 0.5, then the retrieved set becomes {(d1, 0.5), (d2, 0.4), (d3, 0)} (Bookstein, 1980).
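In code, Bookstein's query weighting amounts to scaling each document's membership value by the query term weight, as in this small sketch (the document names are hypothetical):

import java.util.LinkedHashMap;
import java.util.Map;

public class WeightedQuery {
    public static void main(String[] args) {
        // Membership of each document in the fuzzy set for the term "reuse".
        Map<String, Double> reuse = new LinkedHashMap<>();
        reuse.put("d1", 1.0);
        reuse.put("d2", 0.8);
        reuse.put("d3", 0.0);

        double queryWeight = 0.5; // the request "reuse" with query weight 0.5
        // Scale each membership value by the query term weight.
        reuse.forEach((doc, mu) ->
            System.out.printf("%s -> %.2f%n", doc, mu * queryWeight));
    }
}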

Bookstein's approach is the foundation for Buell and Kraft's paper (1981), which looks at a weighted retrieval model. Buell and Kraft replace the standard 0 and 1 values assigned to an index term with a continuous membership value, calculated over the set of documents (D*) crossed with the set of indexed keywords (I*) (indexed keywords are keywords from a document that get added to the index file), resulting in a membership value between 0 and 1: F: D* × I* → [0,1] (Buell & Kraft, 1981). The calculation of membership then becomes how much a document is about a query term, as opposed to whether or not the term exists in the document. Their model also addresses how the system handles membership for multiple query terms.

Reasons for the study

So far this study has made a clear case for why the Boolean model is not the best for searching data in which a term can appear multiple times in a document. It provides a good result if term frequency is not important, but when searching for a term in a document these days, the document that contains the term the most is usually more relevant and should be returned first. The vector-space model does a good job of returning a list of documents that match the query, ordered by how often a term appears in each document, but what if the terms have an order of preference and termA needs to be returned before termB? The extended Boolean models do a good job of adding weights to the terms and to the documents, but no one has used these formulas with actual data to prove that they actually do a better job of finding a match. Fuzzy logic has shown that using a membership function and assigning a value between 0 and 1 to a match can result in a better list of relevant items. This study has two goals: to show that using a membership function is the best way to search, and that it is the best way to search for software so the software can be reused. The first step is finding a search method that works best for all software. Can one search be used for all software? Is using a membership function a better way to search than a Boolean approach when searching actual data?


The Study

Three other models used for information retrieval that are considered a fuzzy approach, because they include term weights, are the MMM (mixed min and max), Paice and P-norm models. The MMM model is based on work by Zadeh and holds that "an element has a varying degree of membership to a given set instead of the traditional membership choice", but it looks only at the min and max document weights for the index terms (Frakes & Baeza-Yates, 1992, p. 395). The MMM model is based on fuzzy set theory and says that each indexed term has a fuzzy set associated with it; the weight of a document with respect to an index term is considered to be the degree of membership.

Term frequency and inverse document frequency are calculated to determine the most frequent words in a document and which document contains a word most frequently (Fox & Sharan, 1986). This indexing value helps search algorithms quickly look up a term and find the correlated documents that contain it. This term weighting function was first introduced in 1972 as a way to rank documents for information retrieval systems. The basic formula for idf says that given a set of N documents in which a term t_i occurs in n_i of the documents, idf(t_i) = log(N / n_i) (Robertson, 2004). The frequency of a term in a given document is then multiplied by the idf to get the tfidf number (Robertson, 2004). The tfidf value is used as the term weight in this research.

Using the term frequency calculation (tfidf) as the term weight, the similarity functions in the MMM model are calculated by:

SIM(Q_or, D) = C_or1 * max(tfidf of all queried terms) + C_or2 * min(tfidf of all queried terms)

SIM(Q_and, D) = C_and1 * min(tfidf of all queried terms) + C_and2 * max(tfidf of all queried terms)

where Q is the query, D is the document with index-term weights tfidf, and the C values are coefficients for "softness", "since we would like to give the maximum of the document weights more importance while considering an or query and the minimum more importance while considering an and query" (Frakes & Baeza-Yates, 1992, p. 396); usually C_or2 is just 1 − C_or1 and C_and2 is calculated as 1 − C_and1. It was found that C_or1 performed best at 0.6 and C_and1 performed best at 0.3. For purposes of this research we use the values C_or1 = 0.6, C_or2 = 0.4, C_and1 = 0.3, C_and2 = 0.7.
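A minimal sketch of the MMM similarity, with the softness coefficients left as parameters (this is our illustration, not the study's program; the demonstration uses the coefficient values stated above):

import java.util.Arrays;

public class MmmSimilarity {

    // MMM similarity: a weighted mix of the minimum and maximum tfidf weights
    // of the queried terms within one document.
    static double mmm(double[] tfidfs, double cMin, double cMax) {
        double min = Arrays.stream(tfidfs).min().orElse(0);
        double max = Arrays.stream(tfidfs).max().orElse(0);
        return cMin * min + cMax * max;
    }

    public static void main(String[] args) {
        double[] weights = {2.17, 0.95}; // hypothetical tfidf weights of two query terms
        // AND query: C_and1 = 0.3 on the min, C_and2 = 0.7 on the max.
        System.out.println(mmm(weights, 0.3, 0.7));
        // OR query: C_or1 = 0.6 on the max, C_or2 = 0.4 on the min.
        System.out.println(mmm(weights, 0.4, 0.6));
    }
}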

The Paice model was proposed by Paice in 1984 and is also based on fuzzy set theory. Similar to the MMM model, the Paice model looks at the weighted index terms in the document, but it does not stop at the min and the max as the MMM model does: the Paice similarity considers all of the weights of the document. The Paice similarity is calculated by

SIM(Q, D) = (Σ_{i=1..n} r^(i−1) * d_i) / (Σ_{i=1..n} r^(i−1))

where n is the number of query terms, r is a constant coefficient, and d_i is the tfidf weight of the document for the i-th query term, with the weights considered in ascending order for an AND query and in descending order for an OR query. For an OR query, Q = (A1 or A2 or ... or An); for an AND query, Q = (A1 and A2 and ... and An), where A1 is query term 1, and so forth.
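A small sketch of this calculation (our illustration under the r values used later in this paper, not the study's program):

import java.util.Arrays;

public class PaiceSimilarity {

    // Paice similarity: weights sorted ascending for AND (descending for OR),
    // each multiplied by r^(i-1) and normalized by the sum of those coefficients.
    static double paice(double[] tfidfs, double r, boolean isAnd) {
        double[] d = tfidfs.clone();
        Arrays.sort(d);              // ascending order, as used for AND queries
        if (!isAnd) {                // OR queries consider the weights in descending order
            for (int i = 0; i < d.length / 2; i++) {
                double tmp = d[i];
                d[i] = d[d.length - 1 - i];
                d[d.length - 1 - i] = tmp;
            }
        }
        double num = 0, den = 0, coeff = 1; // coeff runs through r^(i-1)
        for (double di : d) {
            num += coeff * di;
            den += coeff;
            coeff *= r;
        }
        return num / den;
    }

    public static void main(String[] args) {
        double[] weights = {2.17, 0.95};                // hypothetical tfidf weights
        System.out.println(paice(weights, 1.0, true));  // AND query, r = 1
        System.out.println(paice(weights, 0.7, false)); // OR query, r = 0.7
    }
}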

The P-norm model adds another angle to the Paice model by considering the weight of the query as well as the weights of the documents (Frakes & Baeza-Yates, 1992). Research has found that p = 2 gives good results, so p = 2 is the value used in this research. The P-norm similarity for an OR'd query is

SIM(Q_or, D) = ((a_A^p * d_A^p + a_B^p * d_B^p) / (a_A^p + a_B^p))^(1/p)

where Q is the query, D is the document, a is the term weight, d is the document weight, p is set to 2, and A is the term to which the document weight corresponds. For an AND'd query the similarity is:

SIM(Q_and, D) = 1 − ((a_A^p * (1 − d_A)^p + a_B^p * (1 − d_B)^p) / (a_A^p + a_B^p))^(1/p)

With a p value greater than 1 the computational toll is high, but to get a better result, computational expense is sometimes overlooked.
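A minimal sketch of the two-term P-norm similarity with both term weights a set to 1, as in this study (our illustration, not the study's program):

public class PNormSimilarity {

    static final double P = 2.0; // the p value used in this research

    // P-norm OR similarity with unit term weights.
    static double or(double dA, double dB) {
        return Math.pow((Math.pow(dA, P) + Math.pow(dB, P)) / 2.0, 1.0 / P);
    }

    // P-norm AND similarity with unit term weights.
    static double and(double dA, double dB) {
        return 1 - Math.pow((Math.pow(1 - dA, P) + Math.pow(1 - dB, P)) / 2.0, 1.0 / P);
    }

    public static void main(String[] args) {
        // Hypothetical document weights for terms A and B.
        System.out.println(or(0.9, 0.4));  // a document strong in one term still scores well
        System.out.println(and(0.9, 0.4)); // a weak term pulls the AND score down
    }
}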

Based on the literature, no previous studies have looked at how effective a fuzzy approach (MMM, Paice, or P-norm) is for information retrieval of software. In this research, the MMM, Paice, P-norm and Boolean models are used to search a document collection for software components, using the tfidf values as weights.

Measures of performance

A number of measures have been created and used in the industry to measure the quality of retrieval in an information retrieval system. The most well-known are recall and precision. Recall is defined by Croft, Metzler, and Strohman (2010) as "the proportion of relevant documents that are retrieved" and precision as the "proportion of a retrieved set of documents that are actually relevant" (p. 309). These measures are usually inversely related: as precision goes up, recall goes down, and vice versa (Binkley & Lawrie, 2008). Recall = |A ∩ B| / |A| and precision = |A ∩ B| / |B|, where A is the set of relevant documents and B is the set of retrieved documents.
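A short sketch of these two set-based measures:

import java.util.HashSet;
import java.util.Set;

public class RecallPrecision {

    // Recall = |A ∩ B| / |A|, where A is the set of relevant documents.
    static double recall(Set<String> relevant, Set<String> retrieved) {
        return intersection(relevant, retrieved).size() / (double) relevant.size();
    }

    // Precision = |A ∩ B| / |B|, where B is the set of retrieved documents.
    static double precision(Set<String> relevant, Set<String> retrieved) {
        return intersection(relevant, retrieved).size() / (double) retrieved.size();
    }

    private static Set<String> intersection(Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.retainAll(b);
        return both;
    }

    public static void main(String[] args) {
        // Hypothetical file names for a single query.
        Set<String> relevant = Set.of("f1", "f2", "f4");
        Set<String> retrieved = Set.of("f1", "f2", "f3", "f5");
        System.out.println(recall(relevant, retrieved));    // 2/3
        System.out.println(precision(relevant, retrieved)); // 2/4
    }
}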

The more precise the match, or the more specific the query gets, the lower recall will be. For example, a search for sum function may return plenty of matches, with low precision and high recall. If the query becomes integer sum function, the returned solutions will be more precise, but the number of items returned will decrease. It has been shown that, in general, the higher the recall, the lower the precision. Precision in this study was calculated similarly to the Maarek (1991) study, which calculated recall and precision and used a panel of experts to evaluate the precision of the calculations. Similar to the Maarek study, this study used the same data corpus, the UNIX man pages, and also an independent panel of UNIX experts to evaluate the results.

This study also used the mean average precision (MAP) to calculate the overall effectiveness of the system; this value was also used as a comparison measure (Manning, Raghavan, & Schutze, 2008). The MAP is a widely used measure that results in a single numerical figure representing the effectiveness of a system (Turpin & Scholer, 2006). With a single measure of quality, multiple systems can be compared to each other (Manning, Raghavan, & Schutze, 2008). MAP is defined as:

MAP = (Σ_{q=1..Q} AP(q)) / Q,  where  AP = (Σ_{r=1..N} P(r) * rel(r)) / NR

where AP is the Average Precision, NR is the number of relevant documents, Q is the total number of queries, N is the number of retrieved documents, r is the rank in the sequence of retrieved documents, P(r) is the precision at rank r, and rel(r) is 1 if the item at rank r is relevant and 0 if it is not. Average precision is calculated by "taking the mean of the precision scores obtained after each relevant document is retrieved, with relevant documents that are not retrieved receiving a precision score of zero. MAP is then the mean of average precision scores over a set of queries" (Turpin & Scholer, 2006, p. 12). MAP is a popular metric used in IR system comparisons and has been shown to be "stable across query set size and variations in relevance" (Turpin & Scholer, 2006, p. 12). Because of its stability and its usefulness as a comparison for IR systems, MAP was the calculation used to compare the IR systems in this study.


Once the matching documents had been returned, they were in ranked order based on precision, or how well they matched the query. Those results were given to the panel of experts to be evaluated. The experts then decided whether the returned files were relevant or not and whether they were in the correct order of relevance. Files that were not returned but should have been were listed to the side. Based on both experts' opinions, the precision was calculated. For example, if a query returns 5 files but the experts think only files 1, 2 and 4 are relevant, to get the AP we first calculate the precision at each relevant rank, which is the number of relevant files so far divided by the rank: 1/1 = 1; 2/2 = 1; and 3/4 = 0.75. We then add these, 1 + 1 + 0.75 = 2.75, and divide by the total number of relevant files, 3, giving 0.92, or 92% Average Precision for this query.
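The calculation above can be checked with a short sketch that computes average precision from a ranked relevance list and MAP over a set of queries:

public class AveragePrecision {

    // rel[r] is true if the file at rank r+1 was judged relevant by the experts.
    static double averagePrecision(boolean[] rel, int totalRelevant) {
        double sum = 0;
        int relevantSoFar = 0;
        for (int r = 0; r < rel.length; r++) {
            if (rel[r]) {
                relevantSoFar++;
                sum += relevantSoFar / (double) (r + 1); // precision at this rank
            }
        }
        return sum / totalRelevant;
    }

    // MAP is the mean of the average precision scores over all queries.
    static double map(double[] apPerQuery) {
        double sum = 0;
        for (double ap : apPerQuery) sum += ap;
        return sum / apPerQuery.length;
    }

    public static void main(String[] args) {
        // Five files returned; the experts judged files 1, 2 and 4 relevant.
        boolean[] rel = {true, true, false, true, false};
        System.out.printf("%.2f%n", averagePrecision(rel, 3)); // 0.92
    }
}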

Procedure

Finding a non-web based software library full of libraries or reusable software components was not easy. Most companies have proprietary licenses on their software libraries, and the rest are cluttered with a mixed bag of different platforms and languages. Most software searches these days are done on the internet, searching across platforms and multiple languages. Discerning information on the internet is difficult (Herrera-Viedma, Pasi, López-Herrera, & Porcel, 2006), and being able to discern code from non-code was out of scope for this project. The idea here was to find a simple software library that contained software code in one language, for the purpose of determining whether this search mechanism worked; future work would expand to cross-language platform capabilities and whether that affects the search criteria. Consider searching for a sum function in Java: results could come back for NetBeans, JavaScript, and so on, so we wanted to reduce the clutter from the search for the time being to ensure our search was producing the best outcome possible.

We decided to use the Maarek article (Maarek et al., 1991) as the foundation for our study. Maarek used the UNIX man pages as the corpus for the 1991 study; the man pages were ideal because they are software and they were downloadable to the researcher's computer. Using about 1100 components, Maarek selected 30 queries for her study and stated that this ratio corresponds to the "same number-of-queries per number-of-documents ratio as the one which has been used in standard test sets such as MED" (Maarek, 1991, p. 811). Using this ratio, we constructed 20 queries for the 681 files used, comprising 10 'OR' and 10 'AND' queries. Queries were simple two-word queries, since the objective was to test the search results, not the ability of the software to parse an elaborate query; being able to quantify one AND would be the same as two ANDs, and with a limited vocabulary, we limited the queries to two words. Although not an active software library, this corpus is representative of one, was used in the Maarek (1991) study, and, if our results were similar to those found in the Maarek study, would confirm that our data was verifiable. The man pages of the UNIX library also allowed us to maintain the same linguistic identifiers in all files. Using common terms gathered from the experts and the researcher, a list of queries was put together pairing common UNIX keywords that would be found in similar files, like move and file, or move and copy. The grammar was simple UNIX terms, and since the files were made up of these terms, no further research into linguistics was needed (Herrera-Viedma & López-Herrera, 2007).

Once the man pages were downloaded to the researcher's computer, the search was on for a software tool that could index the files and search them. There are plenty of tools on the internet, but none that do custom term weighting as we wanted, so a custom application was built. A program using the Lucene 4.6.1 plug-in was created in Java using the Eclipse development environment. An index was created using the Lucene IndexWriter and the FreeBSD UNIX category 1 data files. The UNIX category 1 files are just those files that users would normally access when opening a man help file; categories 2-8 are system library files that are not normally accessible to the user and were therefore not used in this study. The user is then prompted for a search query; once the query is entered, the program parses it and searches for each term from the query using the IndexReader (which is part of the IndexConfig class). The indexing also applies stemming, the shortening of words like stopping, stopped, and stops to the root word, stop, so that they are all stored in the index under their root word. This helps reduce the size of the index, and in a software library most words won't need to be shortened; plural or past-tense words would appear in the comments. Future work would include an indexer that distinguishes comments from code words.
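As a rough sketch of the kind of indexing code involved, the following is a minimal Lucene 4.6 example under our own assumptions about the file layout (a local man1 directory of category 1 pages); it is not the study's actual program, and it needs the Lucene core and analyzers jars on the classpath:

import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class ManPageIndexer {
    public static void main(String[] args) throws IOException {
        // EnglishAnalyzer applies stemming and stop-word removal during tokenization.
        EnglishAnalyzer analyzer = new EnglishAnalyzer(Version.LUCENE_46);
        IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, analyzer);

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(new File("index")), config)) {
            // Assumed layout: the downloaded category 1 man pages sit in "man1".
            for (File page : new File("man1").listFiles()) {
                Document doc = new Document();
                doc.add(new TextField("name", page.getName(), Field.Store.YES));
                doc.add(new TextField("body",
                        new String(Files.readAllBytes(page.toPath()),
                                   StandardCharsets.UTF_8),
                        Field.Store.NO));
                writer.addDocument(doc);
            }
        }
    }
}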

After the queries were run and the MMM, Paice, P-norm and Boolean similarities calculated (see the calculations below), the results were evaluated by a panel of two UNIX experts, who decided whether the returned files were relevant and, if so, in which order. We also compared the results to those in the Maarek et al. (1991) study, since the data corpus was the same.

Using the Lucene indexer, the files are indexed based on term frequency, or how many times each term appears in each data file. An array was created that stored the document IDs of the documents that matched the query. If the query was an "AND", only the files containing both terms were added to the result array; if the query was an "OR", all files were added to the array. This became the final list of documents that matched the query. To calculate the similarity score, the tfidf was calculated first. The tfidfA is the term frequency inverse document frequency for search term A, and tfidfB is the same for search term B. To calculate each of these, we first add one to the log of the term's frequency in the document, then multiply that number by the log of the total number of files in the corpus (681) divided by the total number of documents in which the term is located. For example, if there are 681 files in the corpus, 30 of them contain TermA, and TermA is found 4 times in document 1, then the tfidf of TermA for document 1 is (1 + log(4)) * log(681 / 30). Although there are many other variations of the tfidf calculation, Baeza-Yates and Ribeiro-Neto (2011) say this calculation is the most frequently used and the most effective. These scores were then used as term weights in the three extended Boolean algorithms.
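A minimal sketch of the tfidf calculation just described, using the figures from the example (assuming base-10 logarithms, which the text does not specify):

public class Tfidf {

    // tfidf = (1 + log10(tf)) * log10(N / n), where tf is the term's frequency in the
    // document, N is the number of files in the corpus, and n is the number of files
    // containing the term.
    static double tfidf(int tf, int corpusSize, int docsWithTerm) {
        return (1 + Math.log10(tf)) * Math.log10(corpusSize / (double) docsWithTerm);
    }

    public static void main(String[] args) {
        // 681 files in the corpus, TermA appears in 30 of them, 4 times in document 1.
        System.out.println(tfidf(4, 681, 30)); // about 2.17
    }
}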

If the search is an AND then
    MMM = 0.4 * Min(tfidfA, tfidfB) + 0.6 * Max(tfidfA, tfidfB)    (1)
Else the search is an OR then
    MMM = 0.7 * Min(tfidfA, tfidfB) + 0.3 * Max(tfidfA, tfidfB)
End if

This number was then stored in the MMM array. The P-norm calculation used p = 2 and 1 for the document weights, since we were using only term weights, and p = 2 is the recommended value for best results (Frakes & Baeza-Yates, 1992). The pseudo code for the P-norm calculation is:

If the search is an AND then
    PNorm = 1 − √(((1)^2 * (1 − tfidfA)^2 + (1)^2 * (1 − tfidfB)^2) / ((1)^2 + (1)^2))    (2)
Else it must be an OR search then
    PNorm = √(((1)^2 * (tfidfA)^2 + (1)^2 * (tfidfB)^2) / ((1)^2 + (1)^2))
End if

The Paice similarity was calculated using the recommended r values of 1 for an AND query and 0.7 for an OR query (Frakes & Baeza-Yates, 1992).

If the search is an AND then
    Paice = (1^0 * MIN(tfidfA, tfidfB) + 1^1 * MAX(tfidfA, tfidfB)) / (1^0 + 1^1)    (3)
Else the search must be an OR then
    Paice = (0.7^0 * MAX(tfidfA, tfidfB) + 0.7^1 * MIN(tfidfA, tfidfB)) / (0.7^0 + 0.7^1)
End if

There were many different situations that first needed to be taken into consideration in order to find all matched data. The conditions for AND queries were: 1) if one or more files contain both queried terms, calculate similarity as normal; 2) if files contain only one queried term, abort the similarity calculation. For an OR query: 1) if one or more files contain both queried terms, plus other files contain only one queried term, run the similarity calculations as normal; 2) if one or more files contain the first term and one or more files contain the second term, run the similarity calculations using 0 for the term not found in each file; and lastly 3) if one or more files contain only one searched term and the other search term is not found in any file, calculate similarity normally using 0 for the missing term's similarity.

Other conditions also needed to be accounted for. If no documents match term A while the list of documents matching term B is non-empty, the check for an empty result set cannot stop when one list is empty; both lists need to be checked, and both must be empty before the query fails. In this case, the termA similarity is set to 0 while calculating the similarity for term B, and vice versa; a sketch of this handling follows.
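The sketch below illustrates this handling (our illustration of the rules above, not the study's program):

import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

public class ResultListBuilder {

    // tfidfA and tfidfB map document IDs to the tfidf weight of each query term;
    // a document absent from a map simply does not contain that term.
    static Set<String> matches(Map<String, Double> tfidfA,
                               Map<String, Double> tfidfB,
                               boolean isAnd) {
        Set<String> result = new LinkedHashSet<>();
        if (isAnd) {
            // AND: keep only the documents that contain both queried terms.
            result.addAll(tfidfA.keySet());
            result.retainAll(tfidfB.keySet());
        } else {
            // OR: keep every document that contains either term; this also covers
            // the case where one term (or even both) matches no document at all.
            result.addAll(tfidfA.keySet());
            result.addAll(tfidfB.keySet());
        }
        return result;
    }

    // Weight lookup that substitutes 0 for a term not found in a file,
    // as the OR similarity calculations require.
    static double weight(Map<String, Double> tfidf, String doc) {
        return tfidf.getOrDefault(doc, 0.0);
    }

    public static void main(String[] args) {
        Map<String, Double> a = Map.of("f1", 2.2, "f2", 1.1); // hypothetical weights
        Map<String, Double> b = Map.of("f1", 0.9);
        System.out.println(matches(a, b, true));  // only f1
        System.out.println(matches(a, b, false)); // f1 and f2
    }
}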

After the searches were completed, the results were recorded and documented, and recall and precision were calculated. In order for precision to be calculated, the files returned had to be judged relevant to a user; this was where the UNIX experts came in. When contacted, Maarek recommended gathering a panel of UNIX experts, which is how she determined relevancy for her study. The two UNIX experts were dedicated UNIX operators: one works full time for the Department of Defense on UNIX machines, and one teaches UNIX courses at a local university. The two experts were given a list of the search results for each query and asked to rank the results according to which files they would want returned; if a file they expected was not there, they were to list it as well. This helped in calculating the recall and precision.

Data Results

The mixed min and max (MMM), Paice and P-norm models are considered fuzzy because they use term weights, and based on the results, the fuzzy models returned better results than the Boolean search. Based on the MAP results, the searches ranked from best to least score are the Paice model, followed by a tie between the MMM and P-norm models, and finally the Boolean search. Table 1 shows the search models with their resulting MAP scores and how they rank.

Search Method    Paice      MMM       P-Norm    Boolean
MAP              0.5575     0.541     0.541     0.40435
Ranking          1          2         2         3

Table 1. Ranking of results of search performance

Based on these results, there is a 27.6% increase for the Paice model over the Boolean model and a 25.25% increase for the MMM and P-norm models over the Boolean model. These results support the hypothesis that the fuzzy models deliver a better search result than a Boolean model. This means that, with the right interface, an application using this model could be made to search the UNIX man pages. Although that in itself would be of little use, as the man pages are already searchable from the command prompt, it does mean this search algorithm can be used to further study other fuzzy methods and how they perform on actual data. These methods can be compared to other IR models, and future studies can hopefully find which method provides the best return on a user's query. Eventually, finding the best IR algorithm available will be very useful not only for current search engines like Google or Yahoo!, but also for quickly searching the cloud and all the Big Data that corporations are storing.

Future work

There are many other areas for future research when it comes to information retrieval. This study did not consider time to execute or other measures of efficiency, but that would be an area that could be studied. Most of the literature discussed includes a measure of time when discussing the success or failure of a new search algorithm, but that was not researched in this study.

A measure of how many times a returned piece of software gets reused would also be a great addition to future research. The reusability of a code component can only be judged by the programmers who actually use it; therefore, a panel of software engineers would be needed to verify the reusability of a piece of software. It was shown in the literature that simply being able to find software components easily helps increase their reuse. So, based on those results, such future studies should further increase the reuse of software.

The weighting and ranking of an information retrieval system is another area that could be improved by future research. How a system ranks documents that match a query can be based on a similarity score, the document's term frequency, or the document name. The basic concept of a ranking system is to assign a number or score to a document that matches the user's query; the list of documents is then displayed to the user in order of these ranked scores. In this research, this was done using the fuzzy similarity measures, compared against the Boolean score. Usunier, Buffoni, and Gallinari (2009) say that only a few top documents matching the query are really relevant to the user, and those documents should have the highest precision and be at the top of the list, which they usually are not; they therefore devised a new ranking based on the Ordered Weighted Average (OWA). By looking at the number of relevant documents retrieved as a pairwise function with the number of irrelevant documents retrieved, they can determine which search's similarity returns a better fit to the user's query. This new ranking would be well suited to returning the most precise documents first. In this study, the most relevant documents were returned, but not necessarily in the best order, so testing this new OWA approach against this study may produce a better ranked list of returned documents, and would be a great area for future research.

One more area for future research that came up in doing this study is the idea of relevance. Relevance in an information retrieval system can come from three different areas (user-centered, systems-centered, or cognitive) and is examined as the compared score in a study by Gupta, Saini, and Saxena (2015). Deciding what is relevant to a user is difficult and has forced most IR systems to create a network or mapping of information to related terms. This matrix of information can differ per user and per topic, an area of study that was not the focus of this research. These data maps require extensive background information that is usually gathered over time and through experience; thinking about how the human brain gathers knowledge about a topic, it usually happens over the course of years. Gathering that kind of information would require systems able to grow without bound in storage space while connecting like terms. Although this is more of a data storage topic, it will change how users retrieve from and even query a system: if users understand how the data is stored, they are better equipped to write a more intuitive query. Data stores are another area of future research that is of interest, since how a system stores and connects data will greatly define how the data is searched, how fast it is searched, and the success of finding a match. Crestani and Lalmas (2001) look at logic in an IR system and treat a document as relevant if it meets a user's needs and irrelevant if it does not. This logical approach to relevance is the basis for their logical IR system, and, to avoid being a straight Boolean IR system, they incorporate other theories to devise a new IR system that they compare to traditional IR systems. Applying these new theories to the IR system in this study would be a great area of expansion for future research.

Another area that should be explored further is relevance feedback: returning to the user a list of matches based on the search results, letting the user decide which of those results are relevant, and then continuing to search based on the user's feedback. This idea of fine-tuning the results list is one that is currently used by Google and other popular search engines today.

Reusing a piece of software can be an area of research in and of itself as well. The idea of software plagiarism is no joking matter; numerous studies have looked at what characteristics a piece of software has and what can help detect whether it has in fact been plagiarized. Using a search mechanism that can quickly find a piece of software, like the one in this study, plagiarized-software detection algorithms can be run on the resulting code (Luo, Ming, Wu, Liu, & Zhu, 2014). This also raises the question of what constitutes a grammar when searching for software: does software need its own grammar? Should we be searching with a code-centric grammar rather than an English-based grammar? These ideas should be investigated further to help determine which grammar and which terms would better find a match in a software library (Herrera-Viedma & López-Herrera, 2010).

Conclusion

Using fuzzy methods in an information retrieval search, documents can be retrieved more accurately. With a strict Boolean method, documents either match or do not match, with no in-between. With a fuzzy logic system, a degree of match is allowed, and that match can be used to sort the results in descending order of match. The MMM, Paice and P-norm methods had never been executed with actual data, so comparing their results to other information retrieval systems is a new approach to information retrieval. Searching for software can improve the reuse of software components, and the ability to find those components is greatly increased with this method.

There are many applications where an IR system can be effective, and the most accurate and effective IR method is something that can benefit many. As seen in this study, the fuzzy logic methods were an improvement over the Boolean search results. Because of this, more research can be done to find the most efficient system. There are many future routes along which this research can be expanded, from query improvement to semantic vector space analysis to relevance feedback, but it is safe to say that fuzzy retrieval was a success over the Boolean model with actual data.

References

Agresti, W. (2011). Software reuse: Developers' experiences and perceptions. Journal of Software Engineering and Applications, 48-58.

Aziz, M., & North, S. (2007). Retrieving software component using clone detection and program slicing. The University of Sheffield.

Baeza-Yates, R., & Ribeiro-Neto, B. (2011). Modern Information Retrieval: The Concepts and Technology Behind Search (2nd ed.). Harlow, England: Addison Wesley.

Binkley, D., & Lawrie, D. (2008). Applications of information retrieval to software development. In P. Laplante (Ed.), Encyclopedia of Software Engineering (to appear).

Bookstein, A. (1980). Fuzzy requests: An approach to weighted Boolean searches. Journal of the American Society for Information Science, 31(4), 240-247.

Bordogna, G., Carrara, G., & Pasi, G. (1992). Extending Boolean information retrieval: A fuzzy model based on linguistic variables. In Fuzzy Systems, 1992 IEEE International Conference (pp. 69-76).

Bordogna, G., & Pasi, G. (1993). A fuzzy linguistic approach generalizing Boolean information retrieval: A model and its evaluation. Journal of the American Society for Information Science, 44(2), 70-82.

Buell, D. A., & Kraft, D. H. (1981). A model for a weighted retrieval system. Journal of the American Society for Information Science, 32(3), 211-216.

Crestani, F., & Lalmas, M. (2001). Logic and uncertainty in information retrieval. In Lectures on Information Retrieval (pp. 179-206). Springer Berlin Heidelberg.

Croft, W., Metzler, D., & Strohman, T. (2010). Search Engines: Information Retrieval in Practice. Boston, MA: Pearson Education.

Fox, E. A., & Sharan, S. (1986). A comparison of two methods for soft Boolean operator interpretation in information retrieval.

Frakes, W. B., & Baeza-Yates, R. (1992). Information Retrieval: Data Structures & Algorithms. Englewood Cliffs, NJ: Prentice Hall.

Gibb, F., McCartan, C., O'Donnell, R., Sweeney, N., & Leon, R. (2000). The integration of information retrieval techniques within a software reuse environment. Journal of Information Science, 26(4), 211-226.

Gupta, Y., Saini, A., & Saxena, A. K. (2015). A new fuzzy logic based ranking function for efficient information retrieval system. Expert Systems with Applications, 42(3), 1223-1234.

Haefliger, S., von Krogh, G., & Spaeth, S. (2008). Code reuse in open source software. Management Science, 54(1), 180-193.

Herrera-Viedma, E. (2001). Modeling the retrieval process of an information retrieval system using an ordinal fuzzy linguistic approach. Journal of the American Society for Information Science and Technology, 52(6), 460-475.

Herrera-Viedma, E., & López-Herrera, A. G. (2007). A model of an information retrieval system with unbalanced fuzzy linguistic information. International Journal of Intelligent Systems, 22(11), 1197-1214.

Herrera-Viedma, E., & López-Herrera, A. G. (2010). A review on information accessing systems based on fuzzy linguistic modelling. International Journal of Computational Intelligence Systems, 3(4), 420-437.

Herrera-Viedma, E., Pasi, G., López-Herrera, A. G., & Porcel, C. (2006). Evaluating the information quality of web sites: A methodology based on fuzzy computing with words. Journal of the American Society for Information Science and Technology, 57(4), 538-549.

Ingwersen, P. (2001). Users in context. In Lectures on Information Retrieval (pp. 157-178). Springer Berlin Heidelberg.

Keswani, R., Joshi, S., & Jatain, A. (2014). Software reuse in practice. In Advanced Computing & Communication Technologies (ACCT), 2014 Fourth International Conference on (pp. 159-162).

Klir, G. J., & Yuan, B. (1995). Fuzzy Sets and Fuzzy Logic (pp. 487-499). New Jersey: Prentice Hall.

Kraft, D., Bordogna, G., & Pasi, G. (1998). Information retrieval systems: Where is the fuzz? In Fuzzy Systems Proceedings, 1998 IEEE World Congress on Computational Intelligence (Vol. 2, pp. 1367-1372).

Krueger, C. (1992). Software reuse. ACM Computing Surveys, 24(2), 131-183.

Luo, L., Ming, J., Wu, D., Liu, P., & Zhu, S. (2014, November). Semantics-based obfuscation-resilient binary code similarity comparison with applications to software plagiarism detection. In Proceedings of the 22nd ACM SIGSOFT International Symposium on Foundations of Software Engineering (pp. 389-400).

Maarek, Y., Berry, D., & Kaiser, G. (1991). An information retrieval approach for automatically constructing software libraries. IEEE Transactions on Software Engineering, 17(8), 800-813.

Manning, C., Raghavan, P., & Schutze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Maron, M. E., & Kuhns, J. L. (1960). On relevance, probabilistic indexing and information retrieval. Journal of the ACM, 7(3), 216-244.

McIlroy, M. (1968). Mass produced software components. In Software Engineering: Report on a Conference by the NATO Science Committee (Garmisch, Germany), October, 138-150.

Mili, H., Mili, F., & Mili, A. (1995). Reusing software: Issues and research directions. IEEE Transactions on Software Engineering, 21(6), 528-562.

Miller, B. (1996). Fuzzy logic. Electronics Now, May 1996, 29-30, 56-60.

Miyamoto, S. (1990). Fuzzy Sets in Information Retrieval and Cluster Analysis. Dordrecht, Netherlands: Kluwer Academic Publishers.

Mockus, A. (2007). Large-scale code reuse in open source software. In International Workshop on Emerging Trends in FLOSS Research and Development.

Morisio, M., Ezran, M., & Tully, C. (2002). Success and failure factors in software reuse. IEEE Transactions on Software Engineering, 28(4), 340-357.

Prieto-Diaz, R. (1991). Implementing faceted classification for software reuse. Communications of the ACM, 34(5).

Robertson, S. (2004). Understanding inverse document frequency: On theoretical arguments for IDF. Journal of Documentation, 60(5), 503-520.

Salton, G., Fox, E., & Wu, H. (1983). Extended Boolean information retrieval. Communications of the ACM, 26(12), 1022-1036.

Sandhu, P. S., Kaur, H., & Singh, A. (2009). Modeling of reusability of object oriented software system. World Academy of Science, Engineering and Technology, 56(32).

Sim, S. E., Clarke, C. L., & Holt, R. C. (1998, June). Archetypal source code searches: A survey of software developers and maintainers. In Proceedings of the 6th International Workshop on Program Comprehension (IWPC '98) (pp. 180-187). IEEE.

Srinivasan, P., Ruiz, M., Kraft, D., & Chen, J. (2001). Vocabulary mining for information retrieval: Rough sets and fuzzy sets. Information Processing and Management, 37, 15-38.

Thummalapenta, S. (2011). Improving Software Productivity and Quality via Mining Source Code (Doctoral dissertation). UMI Dissertation Publishing: 3442531.

Turpin, A., & Scholer, F. (2006, August). User performance versus precision measures for simple search tasks. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11-18). ACM.

Usunier, N., Buffoni, D., & Gallinari, P. (2009, June). Ranking with ordered weighted pairwise classification. In Proceedings of the 26th Annual International Conference on Machine Learning (pp. 1057-1064). ACM.

Verma, R., & Sharma, B. (2013). Fuzzy generalized prioritized weighted average operator and its application to multiple attribute decision making. International Journal of Intelligent Systems, 1-24.

Vishal, Chander, S., & Kundu, J. (2012). An effective retrieval scheme for software component reuse. International Journal on Computer Science and Engineering, 4(7).

Yao, H., Etzkorn, L., & Virani, S. (2008). Automated classification and retrieval of reusable software components. Journal of the American Society for Information Science and Technology, 59(4), 613-627.

Zadeh, L. A. (1994). Soft computing and fuzzy logic. IEEE Software, November, 48-56.