IPv6 – ADDRESSING GRID-INTEROPERABILITY PERSPECTIVE FOR PATIENTS IN MEDICAL FIELD
HEMANT S. MAHALLE1*, A. B. MANWAR2, Dr. VINAY CHAVAN3
1 Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal MS, India
2 Department of Computer Science, SGB Amravati University, Amravati MS, India
3 Department of Computer Science, S. K. Porwal College, Kamptee, Nagpur MS, India
*Corresponding Author: [email protected]
Received: January 30, 2012; Accepted: February 15, 2012
Abstract - IPv6 is of considerable importance to businesses, consumers, and network access providers of all sizes. IPv6 is designed to improve upon IPv4's scalability, security, ease of configuration, and network management; these issues are central to the competitiveness and performance of all types of network-dependent businesses. IPv6 aims to preserve existing investment as much as possible. End users, industry executives, network administrators, protocol engineers, and many others will benefit from understanding the ways that IPv6 will affect future internetworking and distributed computing applications such as cloud computing and Grid computing. Identifying every individual uniquely on the network is a requirement of today's science and technology.
In this paper, we present a platform providing a problem-solving environment for patients or persons across multiple application domains of Grid computing, especially in medical and scientific study, using collaborative relationships and fostering interoperability in the Grid infrastructure. The proposed work is based on the IPv6 address scheme, providing seamless, secure, and intuitive access to distributed Grid resources.
Keywords - IPv4, IPv6, Grid computing, Patients, scalability, Internet, Address scheme

1. Introduction
In 1980, ARPA started converting systems to TCP/IP. In 1983, ARPANET was split into two networks: MILNET for military research and ARPANET for continuing the existing line of research. In the same year ARPA also mandated that all systems connected to ARPANET use TCP/IP. The Internet Protocol (IP) has its roots in the early research networks of the 1970s, but within the past decade it has become the leading network-layer protocol. This means that IP is a primary vehicle for a vast number of client/server and peer-to-peer communications, and the current scale of deployment is straining many aspects of its twenty-year-old design. The Internet Engineering Task Force (IETF) has produced specifications that define the next-generation IP protocol known as "IPng" or "IPv6". IPv6 is both a near-term and a long-range concern for network owners and service providers. IPv6 products have already come to market; on the other hand, IPv6 development work will likely continue well into the next decade. Though it is based on much-needed enhancements to the IPv4 standards, IPv6 should be viewed as a new protocol that will provide a firmer base for the continued growth of today's internetworks.
Because it is intended to replace IPv4, IPv6 is of considerable importance to businesses, consumers, and network access providers of all sizes. IPv6 is designed to improve upon IPv4's scalability, security, ease of configuration, and network management; these issues are central to the competitiveness and performance of all types of network-dependent businesses. IPv6 aims to preserve existing investment as much as possible. End users, industry executives, network administrators, protocol engineers, and many others will benefit from understanding the ways that IPv6 will affect future internetworking and distributed computing applications [1, 2, 3]. E-Science today, particularly scientific experiments and studies, involves distributed and heterogeneous resources. Scientific applications use different approaches with similar characteristics in a wide variety of scientific domains and typically require distributed, high-throughput, and data-intensive computing [5]. In recent years, Grid environments have been adopted in order to accomplish these necessities, motivated by several factors, including technology reuse, computational performance, and security aspects [6].
One of the main current Public Key Infrastructure (PKI) needs is interoperability, which makes possible the secure interconnection and cooperation between different PKI structures, thus enhancing their feasibility and applicability at regional, national, and international levels. This is even more important in the healthcare sector, due to the increased demand for mobility of both patients and doctors and the criticality of telemedicine data, which make secure and interoperable information exchange a basic element of high-level healthcare service provision [7].

2. Network Addressing in IPv6
The following points glance at familiar attributes of address allocation associated with IPv4 and compare them to similar attributes in IPv6.
2.1 Address Scheme
IPv4: addresses are 32 bits (4 bytes) long. An address is composed of a network and a host portion, which depend on the address class. Various address classes are defined: A, B, C, D, or E, depending on the initial few bits. The total number of IPv4 addresses is 4,294,967,296 (2^32). The text form of an IPv4 address is nnn.nnn.nnn.nnn, where 0 <= nnn <= 255 and each group is written in decimal. Leading zeros may be omitted. The maximum number of print characters is 15, not counting a mask.
IPv6: addresses are 128 bits (16 bytes) long. The basic architecture is 64 bits for the network number and 64 bits for the host number. Often, the host portion of an IPv6 address (or part of it) is derived from a MAC address or other interface identifier. Depending on the subnet prefix, IPv6 has a more complicated architecture than IPv4. The number of IPv6 addresses (2^128) is 2^96, roughly 7.9 x 10^28 (79,228,162,514,264,337,593,543,950,336), times larger than the number of IPv4 addresses. The text form of an IPv6 address is xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx, where each x is a hexadecimal digit representing 4 bits. Leading zeros may be omitted, and the double colon (::) may be used once in the text form of an address to designate any number of consecutive 0 bits. For example, ::ffff:10.120.78.40 is an IPv4-mapped IPv6 address.
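As a quick check of these text forms, the following minimal Python sketch uses the standard ipaddress module; the addresses are the examples from the text plus the documentation prefix 2001:db8::, and the /48 refers to the allocation discussed in Section 2.2 below:

```python
import ipaddress

# An IPv4 address: dotted decimal, at most 15 print characters.
v4 = ipaddress.IPv4Address("10.120.78.40")
print(v4, v4.max_prefixlen)           # 10.120.78.40 32

# An IPv6 address: eight groups of hex digits; "::" compresses a zero run.
v6 = ipaddress.IPv6Address("2001:db8::1")
print(v6.exploded)                    # 2001:0db8:0000:0000:0000:0000:0000:0001

# The IPv4-mapped IPv6 address cited in the text.
mapped = ipaddress.IPv6Address("::ffff:10.120.78.40")
print(mapped.ipv4_mapped)             # 10.120.78.40

# A /48 allocation leaves 2**80 addresses (16 bits of subnetting space
# before the 64-bit interface identifier), as Section 2.2 describes.
net = ipaddress.IPv6Network("2001:db8::/48")
print(net.num_addresses)              # 1208925819614629174706176 == 2**80
```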
2.2 Address Allocation
IPv4: originally, addresses were allocated by network class. As the address space is used up, smaller allocations using Classless Inter-Domain Routing (CIDR) are made. Allocation has not been balanced among institutions and nations.
IPv6: allocation is in its earliest stages. The Internet Engineering Task Force (IETF) and Internet Architecture Board (IAB) have recommended that essentially every organization, home, or entity be allocated a /48 subnet prefix. This leaves 16 bits for the organization to do subnetting. The address space is large enough to give every person in the world their own /48 prefix.
2.3 Address Mask
In IPv4, a network mask is used to separate the network portion from the host portion, whereas the mask does not exist in IPv6 (prefix lengths are used instead).
2.4 Address Resolution Protocol (ARP)
The Address Resolution Protocol is used by IPv4 to find a physical address, such as the MAC or link address, associated with an IPv4 address. IPv6 embeds these functions within IP itself as part of the algorithms for stateless autoconfiguration and neighbor discovery, using Internet Control Message Protocol version 6 (ICMPv6). Hence, there is no such thing as ARP6.
2.5 Address Types
IPv4 supports unicast, multicast, and broadcast, whereas IPv6 supports unicast, multicast, and anycast.
2.6 Network Address Configuration
In IPv4, configuration must be done on a newly installed system before it can communicate; that is, IP addresses and routes must be assigned. In IPv6, configuration is optional, depending on the functions required. An appropriate Ethernet or tunnel interface must be designated as an IPv6 interface, for example using iSeries Navigator. Once that is done, IPv6 interfaces are self-configuring, so the system will be able to communicate with other local and remote IPv6 systems, depending on the type of network and whether an IPv6 router exists.
2.7 Domain Name System (DNS)
Applications accept host names and then use DNS to get an IP address, using the socket API gethostbyname(). Applications also accept IP addresses and then use DNS to get host names using gethostbyaddr(). For IPv4, the domain for reverse lookups is in-addr.arpa; for IPv6 it is ip6.arpa. Support for IPv6 exists using the AAAA ("quad A") record type and reverse lookup (IP-to-name). An application may elect to accept IPv6 addresses from DNS (or not) and then use IPv6 to communicate (or not). The socket API gethostbyname() is unchanged for IPv6, and the getaddrinfo() API can be used to obtain (at the application's choice) IPv6-only, or both IPv4 and IPv6, addresses; a short resolver sketch follows Section 2.8.
2.8 DHCP (Dynamic Host Configuration Protocol)
In IPv4, the Dynamic Host Configuration Protocol is used to dynamically obtain an IP address and other configuration information. At the time of writing, the DHCP implementation discussed here did not support IPv6; DHCPv6 (RFC 3315) defines the equivalent service for IPv6. (Sections 2.1 to 2.8 are adapted from [1-4].)
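The getaddrinfo() behaviour described in Section 2.7 can be illustrated with a minimal Python sketch; the host name is only an example, and what is returned depends on the DNS records published for it:

```python
import socket

# Resolve a host name to both IPv4 (A) and IPv6 (AAAA) addresses;
# AF_UNSPEC asks for whatever address families DNS returns.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "www.example.com", 80, socket.AF_UNSPEC, socket.SOCK_STREAM):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])

# Restricting the family to AF_INET6 returns IPv6 addresses only
# (this raises socket.gaierror if the name has no AAAA record).
v6_only = socket.getaddrinfo("www.example.com", 80,
                             socket.AF_INET6, socket.SOCK_STREAM)
```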
3. Implementation and Transition to IPv6
IPv6 implementation is an ongoing process. Some analysts draw comparisons between the potential shortfall of IPv4 addresses and potential problems such as address mismatches and viruses that alter configurations. The task of transitioning to IPv6 is full of challenges. Large-scale transitions require a high level of flexibility in the protocol and in the software and hardware supporting and using it. A successful migration to IPv6 requires all IPv4 nodes to communicate with IPv6 nodes during the migration period. It is also important that technology is in place to allow IPv6 nodes to communicate over the IPv4 Internet with other networks. To meet the needs of the various network topologies spread throughout the Internet, there are multiple methods for making this change [3]. The major transition mechanisms are integral to the IPv6 design effort; these techniques include dual-stack IPv4/IPv6 hosts and routers, tunneling of IPv6 via IPv4, and a number of IPv6 services, including IPv6 DNS, DHCP, MIBs, and so on [2]. The worldwide IPv6 testing and pre-production deployment network, called the 6BONE, had already reached approximately 40 countries. There are many IPv6 implementations completed or underway worldwide, in test or production use on the 6BONE. The 6BONE has been built by an active population of protocol inventors, designers, and programmers [2, 3].

4. Problem: Resource Provision in a Grid
IPv6 can be used effectively in Grid computing for various problems faced by users on the Internet. It is a prime requirement of the future scenario of the world to identify every individual and every asset on the Internet, so as to provide services to each of them. With the rapid progress of various technologies, there is a growing demand for integrating large amounts of domain information, including medical information, bioinformatics information, and enterprise information. The medical information system is well known as one of the most complicated information systems; its main target is to achieve, through the network, medical resource sharing across every aspect of the medical domain, for instance hospitals, medical research organizations, doctors, and patients [8]. In this paper we have concentrated on providing every individual with medical facilities by enrolling them in a Grid of doctors. As the number of patients is very large, and every member of a family needs a unique identity on the Internet in order to get the proper and required aid, it is not possible to address every individual using the IPv4 address scheme. So, to address the millions of persons individually in the system, an addressing scheme that can accommodate all of them is required; the only solution is the IPv6 addressing scheme. The key factor for this project is generating and managing metadata. Metadata has been found to be the key to the ability to interoperate and correlate heterogeneous data systems and the available databases in distributed computing. The emphasis is on data engineering, to understand how to interpret and
interrelate the local data dictionaries among the distributed systems.

5. IPv6 as the Perspective Solution
The Internet is changing the face of medical research. The current world of isolated research and proprietary data encodings is evolving into a future of standardized medical databases and integrated medical applications, such as clinical decision support systems [9]. Every individual on the Internet is recognized by the IPv6 addressing scheme. A microchip is to be fitted to every individual for keeping contact with the Grid of doctors. The microchip records every change in the person's body and sends the information to a server in the Grid. Accordingly, the server distributes the information to the related nodes in the Grid. The corresponding doctor or related person, after seeing the information, instructs the patient or, if required, takes an expert decision in this regard. Some special laboratories will also be involved in the investigation of diseases, and special equipment will be shared in the Grid. After proper discussion and investigation by the team of experts, the expert decision or line of treatment is sent back to the server and then forwarded to the patient or needy person.

6. Model Data Architecture
The model data architecture focuses on the deployment of a middleware component with the objective of interconnecting distributed heterogeneous data systems, using microchips and satellites for patient data communication. The data architecture (Fig. 1) uses servers in the Grid environment, referred to as S1, S2, S3, ..., SN. The main focus is the distributed patients, referred to as P1, P2, P3, ..., PN. The computer systems from the Grid environment contributing to the project are referred to as C1, C2, C3, ..., CN. The satellites covering the globe, and thereby the globally registered patients, provide ready information about the patients to the servers; they are referred to as ST1, ST2, ST3, ..., STN.
Fig. 1. The model data architecture for patients' Grid interoperability developed by the researchers (hereinafter called the MMC graph).
7. The Algorithm
The algorithm for "IPv6 - Addressing Grid-Interoperability Perspective for Patients in Medical Field" is hereinafter referred to as the MMC algorithm.
The MMC Algorithm:
1. INITIALIZE the system with IPv6 addressing.
2. SELECT servers S1, S2, S3, ..., SN.
3. SELECT patients P1, P2, P3, ..., PN.
4. SELECT computers C1, C2, C3, ..., CN.
5. SELECT satellites ST1, ST2, ST3, ..., STN.
6. SELECT equipment laboratories EL-1, EL-2, EL-3, ..., EL-N.
7. On a REQUEST from P1, P2, P3, ..., PN:
   1. ACTIVATE satellite.
   2. ACTIVATE server.
8. APPLY process in SERVER:
   1. IF a related database is found, send the information to the related doctors' team; GOTO LOOP-1.
   2. ELSE IF no related database is found, send the information to the experts' team; GOTO LOOP-2.
   3. ELSE IF specialized equipment help is required, send the information to the equipment laboratory; GOTO LOOP-3.
   4. OTHERWISE send back to SERVER; EXECUTE LOOP-4.
9. LOOP-1: The doctors' team verifies the databases; their opinion, along with the line of treatment, is sent back to the SERVER. GOTO LOOP-5.
10. LOOP-2: The experts' team studies the new disease and cites a line of treatment or, if required, sends the case to the investigation laboratory for sophisticated equipment help. GOTO LOOP-3.
11. LOOP-3: The results of the specialized equipment laboratory are sent back to the experts' team. GOTO LOOP-6.
12. LOOP-4: For unknown diseases, or those for which no databases are available, the experts' team sends the information back to the SERVER and to the research laboratory for investigation. GOTO LOOP-7.
13. LOOP-5: The SERVER sends the refined information back to the user's computer. END OF LOOP.
14. LOOP-6: The experts' team observes the results of the specialized laboratory equipment, draws the line of treatment, and sends the information to the SERVER. EXECUTE LOOP-5.
15. LOOP-7: The report of the research laboratory is sent back to the experts' team, which, after discussion, cites a line of treatment and sends it to the SERVER. EXECUTE LOOP-5.
16. END.
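A minimal Python sketch of the server-side dispatch in steps 8-15 is given below. All names (handle_request, the team and laboratory objects, the attribute checks) are hypothetical stand-ins for the Grid services the algorithm describes, not an actual implementation:

```python
def handle_request(patient, db, doctors, experts, equip_lab, research_lab):
    """Hypothetical routing of one patient request through the MMC loops."""
    record = db.find_related(patient)                 # step 8
    if record is not None:                            # 8.1 -> LOOP-1
        treatment = doctors.review(record, patient)
    elif patient.needs_equipment:                     # 8.3 -> LOOP-3
        results = equip_lab.run(patient)              # LOOP-3
        treatment = experts.conclude(results)         # LOOP-6
    elif patient.known_disease:                       # 8.2 -> LOOP-2
        treatment = experts.study(patient)
    else:                                             # 8.4 -> LOOP-4 / LOOP-7
        report = research_lab.investigate(patient)
        treatment = experts.conclude(report)
    return treatment                                  # LOOP-5: refined info to user
```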
8. Discussion
The MMC algorithm should be very useful in the future scenario: because of the huge population and the explosion of technology, it is rather difficult to provide medical facilities, and as a solution to this future problem the MMC algorithm will play an important role. This model aims to settle the semantic relationships between heterogeneous medical information systems. It will support integration and interoperability by sharing the heterogeneous medical databases available in the Grid. It will also support a semantically consistent data-processing schema to support information exchange between heterogeneous systems in a Grid environment. Our future work includes the actual creation and verification of the MMC model to integrate medical information systems.

9. Conclusion
From the facts of the address allocation schemes in IPv6, it is indeed a great achievement that the quantity of IPv6 addresses is so vast that every individual can be recognized in the network. This will lead to a future with almost every human being linked to the Internet, and much better networking applications and services can be made available, along with adequate security.

References
[1] http://www.linuxsecurity.com/resource_files/documentation/tcpip-security.html
[2] "SecurityDocs Comment on IPv6 Security Considerations", http://www.SecurityDocs.com
[3] Chris Chambers, Justin Dolske, and Jayaraman Iyer, "TCP/IP Security", Department of Computer and Information Science, Ohio State University, Columbus, Ohio 43210.
[4] RFC Editor, http://www.rfc-editor.org/rfcsearch.html
[5] Hyejeong Kang, Kyoung-a Yoon, Seoyoung Kim, Yoonhee Kim, and Chongam Kim, "An e-Science Problem Solving Environment for Scientific Numerical Study", ICACT 2011, Feb. 13-16, 2011, ISBN 978-89-5519-155-4.
[6] Benedyczak, K., Users' UNICORE Security Guide, December 2008.
[7] B. Blobel, Analysis, Design and Implementation of Secure and Interoperable Distributed Health Information Systems. Amsterdam, The Netherlands: IOS Press, 2002.
[8] Yunmei Shi, Xuhong Liu, Yabin Xu, and Zhenyan Ji, "Semantic-Based Data Integration Model Applied to Heterogeneous Medical Information System", IEEE, 2010.
[9] Christina Catley and Monique Frize, "Design of a Health Care Architecture for Medical Data Interoperability and Application Integration", Proceedings of the Second Joint EMBS/BMES Conference, Houston, TX, USA, October 23-26, 2002.
Indian Journal of Computer Science and Engineering, Vol. X, No. X, pp. 1-5
A VECTOR SPACE MODEL FOR INFORMATION RETRIEVAL: A MATLAB APPROACH
A. B. Manwar*
Department of Computer Science, S.G.B. Amravati University, Amravati MS, India†
[email protected]

Hemant S. Mahalle
Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal MS, India

K. D. Chinchkhede
Department of Applied Electronics, S.G.B. Amravati University, Amravati MS, India

Dr. Vinay Chavan
Department of Computer Science, S. K. Porwal College, Kamptee, Nagpur MS, India
Abstract
By and large, three classic framework models have been used in the process of retrieving information: Boolean, Vector Space, and Probabilistic. The Boolean model is a lightweight model which matches the query with precise semantics. Because of its Boolean nature, its results miss partial matches; on the contrary, the vector space model, considering term-frequency and inverse-document-frequency measures, achieves the utmost relevancy in retrieving documents in information retrieval. This paper implements and discusses the issues of an information retrieval system with the vector space model, using MATLAB on the Cranfield data collection of the aerodynamics domain.

Keywords: tf; idf; vector space model; cosine similarities; term-document; term-query matrices; dot products.
1. Introduction
An enormous amount of text material is growing at an exponential rate, especially with the increasing use and applications of the Internet. Day by day it is becoming very difficult to retrieve the relevant information. Various approaches have been used by researchers to get over the relevancy factor in information retrieval.
An information retrieval model is a quadruple consisting of a document collection, a set of queries, a framework model, and a ranking function associated with query-document pairs. A framework model may be Boolean, vector space, or probabilistic. The Boolean model matches the query with precise semantics in the document collection by Boolean operations with the operators AND, OR, and NOT. It predicts either relevancy or non-relevancy of each document, leading to the disadvantage of retrieving either very few or very many documents. The Boolean model is the lightest model; its inability to perform partial matching leads to poor performance in the retrieval of information. The vector space model was introduced by G. Salton in the late 1960s; in it, partial matching is possible. Non-binary weights are used to weight the index terms in queries and in documents. These weights are used for calculating the degree of similarity between each document and the query. The ranked document set thus obtained, in decreasing order of degree of similarity, is more precise than the result of the Boolean model. Index term weights can be calculated in many different ways. The work by Salton and McGill [1] reviews various term-weighting techniques. Although there is contention as to whether the probabilistic model outperforms the vector model, Salton and Buckley [2] showed that the vector model is expected to outperform the probabilistic model with general collections [3].
This paper implements and discusses the issues of an information retrieval system with the vector space model, using MATLAB on the Cranfield data collection of the aerodynamics domain.
The next section deals with a brief review of related work on the vector space model in information retrieval.
* Associate Professor in Computer Science, S.G.B. Amravati University, Amravati MS, India.
† Sant Gadge Baba Amravati University, Amravati, Maharashtra, India.
2. Related work
Maron and Kuhns [4], in early 1960, described a probabilistic indexing technique in a mechanized library system yielding probable relevance. Afterward, in 1983, Salton and McGill wrote a book [1] which thoroughly discusses the three classic models in information retrieval, namely the Boolean, the vector, and the probabilistic models. The book by van Rijsbergen [5] covers the discussion of the three classic models and the majority of the associated technology of retrieval systems. Frakes and Baeza-Yates [6] edited a book on information retrieval which mainly deals with the data structures used in general information retrieval systems. It also includes the issue of relevance feedback as well as some query modification techniques [7], and Boolean operations and their implementations [8]. Verhoeff, Goffman, and Belzer [9] described the shortfall of Boolean queries for information retrieval. The concept of using the Boolean formalism in other frameworks has been of great interest to researchers. Lee et al. proposed a thesaurus-based Boolean retrieval system for ranking [10].
The vector space model has been the most popular model in information retrieval among the research community because of the research outcomes in indexing and term value specification in automatic indexing carried out by Salton and his associates [11, 12]. Most of this research deals with experiments in automatic document processing and different term-weighting approaches for automatic retrieval [2, 13]. In 1972, Karen Sparck Jones introduced the concept of inverse document frequency, a measure of specificity [14, 15], and Salton and Yang used it for automatic indexing to improve retrieval [12]. Raghavan and Wong [16] analysed the vector space model critically, concluding that the vector space model is useful and provides a formal framework for information retrieval systems.
The next section gives a description of the most influential vector space model in modern information retrieval research.
3. Vector Space Model
The drawback of binary weight assignments in the Boolean model is remedied in the vector space model, which projects a framework in which partial matching is possible [11, 13]. Non-binary weights for index terms in queries and documents are used in the calculation of the degree of similarity. Ordering the retrieved documents in decreasing order of this degree of similarity gives a ranked document list with partial matches.
For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the total number of index terms in the system. The vector for a document $d_j$ is represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query $q$ as the correlation between the vectors $\vec{d_j}$ and $\vec{q}$. This correlation can be measured by the cosine of the angle between these two vectors as

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

where $|\vec{d_j}|$ and $|\vec{q}|$ are the norms of the document and query vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor $|\vec{d_j}|$ provides normalization in the space of the documents.
Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $sim(d_j, q)$ varies from 0 to +1; the vector model thus ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially [3, pages 27-28].
In the vector space model, the frequency of a term $k_i$ inside a document $d_j$ is referred to as the tf factor and provides one measure of how well that term describes the document contents. Furthermore, the inverse of the frequency of the term $k_i$ among the documents in the collection is referred to as the inverse document frequency, or the idf factor.
The normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \qquad (1)$$

where $freq_{i,j}$ is the raw frequency of $k_i$ in $d_j$ and the maximum is computed over all terms which are mentioned in the text of the document $d_j$. The inverse document frequency $idf_i$ for $k_i$ is given by

$$idf_i = \log \frac{N}{n_i} \qquad (2)$$

where $N$ is the total number of documents and $n_i$ is the number of documents containing $k_i$. The best-known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \qquad (3)$$

Such term-weighting strategies are called tf-idf schemes [3, pages 29-30].
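As a compact illustration of equations (1)-(3) and the cosine measure above, the following self-contained Python/NumPy sketch builds tf-idf weights from a toy term-frequency matrix and ranks documents against a query; the matrix and the query weights are invented purely for illustration:

```python
import numpy as np

# Toy term-frequency matrix FM: rows = terms, columns = documents.
FM = np.array([[2, 0, 1],
               [0, 3, 1],
               [1, 1, 0]], dtype=float)
N = FM.shape[1]                                # number of documents

# Eq. (1): normalized frequency f_ij = freq_ij / max_l freq_lj
f = FM / FM.max(axis=0, keepdims=True)

# Eq. (2): idf_i = log(N / n_i), n_i = number of documents containing term i
n = np.count_nonzero(FM, axis=1)
idf = np.log(N / n)

# Eq. (3): w_ij = f_ij * idf_i
W = f * idf[:, np.newaxis]

# Cosine similarity of a query vector against every document column.
q = np.array([1.0, 0.0, 1.0])                  # toy query weights
sims = (W.T @ q) / (np.linalg.norm(W, axis=0) * np.linalg.norm(q) + 1e-12)
print(np.argsort(-sims))                       # document ranking, best first
```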
The success of the vector space model lies in its term-weighting scheme, its partial matching strategy, and its similarity measure. The assumed mutual independence of index terms is said to be a disadvantage of the vector space model, but in practice, taking term dependencies into account has not proved fruitful. From the research in the field, it seems that the vector model is either superior or almost as good as the known alternatives.
4. Experimental Evaluations
4.1. Dataset for the information retrieval system
We used the Cranfield collection, comprising 1398 abstracts from the aerodynamics domain, obtained from [17]. The collection contains a compressed version of the document text, a relation giving relevance judgements, the text of 225 queries, indexed documents, and indexed queries. We also used the stop-list obtained from the collection source.
4.2. Preprocessing
The compressed version of the document text was preprocessed to obtain a set of 1398 individual abstract files; a PHP script was written for this purpose. Stemming was not applied, for the following reasons: i) it loses the context of the search, ii) it may reduce precision, and iii) it cannot be applied to proper nouns. To have a clear view of relevancy, instead of using the given set of queries and relevance judgements, we created our own set of queries from the 'titles' of the 1398 documents, forming a query set of 1398 entities.
4.3. Implementation
MATLAB is used to implement the vector space model for information retrieval; the complete process is depicted in Fig. 1.
Fig. 1. Implementation of the vector space model for information retrieval.
As shown in the block diagram, the implementation consists of three stages:
1. Generation of the term-frequency matrix
The term-document matrix FM is a $t \times n$ term-frequency matrix, with $t$ unique terms in the dictionary and $n$ documents. The elements of FM are represented as $fm_{i,j}$, where each element indicates the frequency of term $k_i$ in document $d_j$.
The Cranfield data collection is preprocessed and converted into 1398 individual text files. A SMART stop-word text file, which is available with the dataset, is used for the removal of stop words from the data collection of 1398 files. Non-embedding special characters and numerals have also been removed from these files. 79,728 words were collected, which were then processed to find the frequency of unique words in each document. The dictionary of unique words contains 7805 words. Thus the term-frequency matrix is of size 7805 x 1398.
2. Generation of the query matrix
The title of each abstract, after removing stop-list words and non-embedding special characters, is used as a query, contributing to a set of 1398 unique queries. Here we have taken the queries to be the titles of the documents, instead of the dataset queries, so as to judge the relevancy more profoundly. The generated query matrix for the 1398 queries, over the same dictionary, is of size 7805 x 1398.
3. Term-weight calculations and result
The term-frequency matrix is processed to get the term weights, considering the tf-idf scheme. These term weights are calculated by the formula

$$w_{i,j} = \frac{f_{i,j} \times idf_i}{\|d_j\|}$$

where
$w_{i,j}$ is the term weight, i.e., the weight of term $k_i$ in document $d_j$;
$f_{i,j}$ is the frequency of term $k_i$ in document $d_j$;
$idf_i$ is the inverse document frequency, discounting terms appearing in many documents, calculated by the formula $idf_i = \log(N/n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing term $k_i$;
$\|d_j\|$ is the Euclidean length, obtained by taking the square root of the sum of squares of the individual term weights per document.
The query matrix, of size 7805 x 1398, is also divided by the corresponding Euclidean lengths to obtain normalized weights. For each query, the transpose of the query vector is multiplied by the term-weight matrix. The final result is obtained by ordering the weights in a result matrix in decreasing order.
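Stage 3 amounts to normalized matrix products followed by a sort. A minimal NumPy sketch with toy shapes (the real matrices are 7805 x 1398) is given below; the random data is a placeholder for the tf-idf-weighted matrices:

```python
import numpy as np

# W: term-weight matrix (terms x docs), Q: query-weight matrix (terms x queries),
# both assumed already tf-idf weighted; toy shapes stand in for 7805 x 1398.
rng = np.random.default_rng(0)
W = rng.random((5, 4))
Q = rng.random((5, 3))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # Euclidean length normalization
Q /= np.linalg.norm(Q, axis=0, keepdims=True)
scores = Q.T @ W                                # (queries x docs) cosine scores
ranking = np.argsort(-scores, axis=1)           # best-matching document first
print(ranking[0])                               # ranked documents for query 0
```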
4.4. Testing phase
The vector space model (VSM) implemented above was tested thoroughly in different ways; the details of the experimentation are explained below.
Fig. 2 shows the distribution of index terms in the dictionary for individual documents. The dictionary consists of 7805 unique terms.
Fig. 2. Index terms in the dictionary.
Fig. 3 shows the frequency count of each unique term in the dictionary, distributed over the complete dataset. Some of the unique terms with high frequency across the entire document set are shown, such as ('flow', 2059), ('pressure', 1245), ('boundary', 1076), and ('results', 897).
Fig. 3. Frequency count of each unique term in the data collection.
Looking further into the dataset, fig. 4 shows a subset of 380 documents out of the 1398 abstracts. In document 329 the total number of terms is 342, but only 162 of them are unique.
Fig. 4. Document-wise total and unique terms.
Table 1 shows a subset of 20 queries out of the 1398 queries tested against the vector space model for the retrieval of relevant documents. In the present study, the most relevant document is the one whose ID matches the query ID.
Table 1. Subset of 20 queries
Q.ID Details of Queries
1 experimental investigation of the aerodynamics of a wing in a slipstream .
2 simple shear flow past a flat plate in an incompressible fluid of small viscosity .
3 the boundary layer in simple shear flow past a flat plate .
4 approximate solutions of the incompressible laminar boundary layer equations for a plate in shear flow .
5 one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal .
6 one-dimensional transient heat flow in a multilayer slab .
7 the effect of controlled three-dimensional roughness on boundary layer transition at supersonic speeds .
8 measurements of the effect of two-dimensional and three-dimensional roughness elements on boundary layer transition .
9 transition studies and skin friction measurements on an insulated flat plate at a mach number of 5.8 .
10 the theory of the impact tube at low pressure .
11 similar solutions in compressible laminar free mixing problems .
12 some structural and aerelastic considerations of high speed flight .
13 similarity laws for stressing heated wings .
14 piston theory - a new aerodynamic tool for the aeroelastician .
15 on two-dimensional panel flutter .
16 transformation of the compressible turbulent boundary layer .
17 remarks on the eddy viscosity in compressible mixing flows .
18 the flow field in the diffuser of a radial compressor .
19 an investigation of the pressure distribution on conical bodies in hypersonic flows .
20 generalised-newtonian theory .
The results of the experimentation are tabulated in Table 2 below.
Table 2. Tabulation of results of first ten queries
Further query-wise details are depicted in Tables 3 to 5, respectively.
Table 3. Frequency of terms in retrieved document set for query 1
Table 4. Frequency of terms in retrieved document set for query 2
Table 5. Frequency of terms in retrieved document set for query 3
Table 3 depicts the results obtained for query 1, with the query terms and their occurrences in the documents; the first 18 documents are listed. Similar results are shown in Tables 4 and 5 for queries 2 and 3, respectively.
4.5. Results and discussion
Even after taking the 'titles' of the abstract documents as the query set, the final retrieval result is 89.41%. 148 queries did not return the expected document as the first retrieved document; however, within the range of the first two retrieved documents, the result obtained is 94.99%. 2.22% of the queries did not retrieve the correct result within the range of the first five documents.
Inter-document characterization and document frequency play a vital role in building the ranks of the documents in the vector space model. For instance, the term counts of the documents d3, d389, and d2 are 24, 42, and 110 respectively, and the collection frequencies of the terms 'flow', 'plate', and 'small' are 2059, 421, and 306 respectively, much higher than the average frequency; as a result, Table 4 shows the outcome of query q2 as document d3, whereas the expected relevant document is d2. A variety of the weight calculation formulas suggested by Salton and Buckley [2] have been tested on this collection, but we found that the standard tf-idf weighting scheme gives the best results.
References
[1] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York,
1983.
[2] G. Salton and C. Buckley. Term-weighting approaches in automatic retrieval. Information Processing and
Management, 24(5):513-523, 1988.
[3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Introduction to Information Retrieval,
Cambridge University Press, New York, USA, 2008.
[4] M. E. Maron and J. L. Kuhns. On relevance, probabilistic indexing and information retrieval. Association for
Computing Machinery, 7(3):216-244, 1960.
[5] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
[6] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall,
Englewood Cliffs, NJ, USA, 1992.
[7] D. Harman. Relevance feedback and other query modification techniques. In W. B. Frakes and R. Baeza-Yates,
editors, Information Retrieval: Data Structures and Algorithms, pages 241-263. Prentice Hall, Englewood Cliffs,
NJ, USA, 1992.
[8] S. Wartick. Boolean operations. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data
Structures and Algorithms, pages 264-292. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
[9] J. Verhoeff, W. Goffmann, and Jack Belzer. Inefficiency of the use of Boolean functions for information
retrieval systems. Communications of the ACM, 4(12):557-558, 594, December 1961.
[10] J. H. Lee, W. Y. Kim, and Y. H. Lee. Ranking documents in thesaurus-based Boolean retrieval systems.
Information Processing and Management, 30(1):79-91, 1993.
[11] G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. Journal of the ACM, 15(1):8-
36, January 1968.
[12] Gerard Salton and C. S. Yang. On the specification of term values in automatic indexing. Journal of
Documentation, 29:351-372, 1973.
[13] G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice Hall Inc.,
Englewood Cliffs, NJ, 1971.
[14] K. Sparck Jones. A statistical interpretation of term specificity and its application to retrieval. Journal of
Documentation, 28(1):11-20, 1972.
[15] K. Sparck Jones. A statistical interpretation of term specificity and its application to retrieval. Information
Storage and Retrieval, 9(11):619-633, 1973.
[16] V. V. Raghavan and S. K. M. Wong. A critical analysis of vector space model for information retrieval. Journal
of the American Society for Information Sciences, 37(5):279-287, 1986.
[17] Cranfield collection, ftp server of Cornell University: ftp://ftp.cs.cornell.edu/pub/smart/cran/
GRID DATA MINING USING K-MEANS ALGORITHM
Hemant S. Mahalle1, Assoc. Prof. & Head, Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal (M.S.)
A. B. Manwar2, Assoc. Prof., Department of Computer Science, S.G.B. Amravati University, Amravati (M.S.)
Dr. Vinay Chavan3, Assoc. Prof. and Head, Department of Computer Science, S.K. Porwal College, Kamptee, Nagpur
Abstract— The k-means algorithm is one of the best-known algorithms used for data mining. Cluster computing and Grid computing are emerging technologies in today's Internet scenario. Resources are geographically distributed, and data and knowledge are to be shared among users from those locations. For this, efficient algorithms for mining the web are required. Several algorithms have been developed for data mining, but few tend toward exact results. Almost all algorithms use keywords as the basis for searching. In Grid computing, the same algorithms have been used to mine data. In this paper we have tried to use the Grid, along with the cluster concept, for a specific application of data mining using the k-means algorithm, grouping items into k clusters.
Keywords— data mining, Grid computing, cluster, resource sharing
I. INTRODUCTION
Grid computing is a platform for coordinated resource sharing and problem solving among virtual organizations. It provides direct access to computers, software, data, and high-performance equipment. The Grid uses grid services to access and use grid resources. New grid services need to be invented for efficient and effective data and knowledge mining. To improve the effectiveness of data mining, the k-means algorithm can be used in many ways by adding modules. Grid computing uses the Internet to enable sharing of geographically distributed resources and equipment. Web services can be used effectively for communication in the grid environment.
A web service is a way to expose the functionality of an information system or application and to make it available via web technologies [1].
Through a middleware such as the Globus Toolkit, we can obtain a set of services and software libraries to support Grid resources and grid applications [2]. Grid services (GS) need to apply resource discovery to be functional. However, the conventional search of services sometimes does not provide good results; in these cases, the Semantic Web tries to solve the problem by adding meaning to GS, and thus the search is improved and better results are obtained [3].
II. K-MEANS ALGORITHM [4]
The k-means algorithm is a simple iterative method to partition a given dataset into a user-specified number of clusters, k. This algorithm has been discovered by several researchers across different disciplines. The algorithm operates on a set of d-dimensional vectors, D = {x_i | i = 1, ..., N}, where x_i ∈ R^d denotes the i-th data point. The algorithm is initialized by picking k points in R^d as the initial k cluster representatives or "centroids". Techniques for selecting these initial seeds include sampling at random from the dataset, setting them as the solution of clustering a small subset of the data, or perturbing the global mean of the data k times. The algorithm then iterates between two steps till convergence:
Step 1: Data assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data.
Step 2: Relocation of "means". Each cluster representative is relocated to the center (mean) of all data points assigned to it. If the data points come with a probability measure (weights), then the relocation is to the expectation (weighted mean) of the data partitions.
The algorithm converges when the assignments no longer change. In the k-means algorithm each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, the algorithm can be considered linear in the dataset size [4].
III. CLUSTERING IN GRID
In the Grid architecture, resources are geographically diverse, and different resources contain different attributes for a common set of entities. Every resource learns the cluster of each entity but does not have knowledge of the other resources' attributes. Here we have tried to find a solution by using k-means clustering, which is a simple technique to group items into k clusters.
The k-means clustering algorithm requires initial approximations for the values or positions of the k means. The choice of initial points is very important in determining the final solution of the k-means algorithm. The clusters of datasets from the participating computers in the Grid are selected as the initial approximations of the positions of the k means. This topic leads to a mechanism for producing an appropriate initial assignment. The classic k-means clustering approach uses multiple samples of the data sets to get the initial points. Here the k means are selected arbitrarily because the new operations involve k-means clustering.
IV. GRID DATA MINING ALGORITHM USING K-MEANS
The k-means algorithm takes k, the number of desired clusters, and the data points as input, and produces k clusters as output. For the general partitioning strategy the algorithm follows this iterative pseudocode [5]:

PartitioningCluster(Data D, int k)
1) Select k prototypes t1, t2, ..., tk from D
2) For i = 1, ..., k: initialize cluster Ci = {ti}
3) Repeat:
4)   For each point p in D:
5)     Let ti be the prototype that minimizes d(ti, p)
6)     Put p in cluster Ci
7)   quality = ClusteringQuality(C1, C2, ..., Ck)
8)   Re-compute prototypes
9) Until quality does not change
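A runnable Python sketch of this pseudocode, under the simplifying assumptions of plain Euclidean distance and arbitrary (first-k) prototype selection, might look as follows:

```python
import numpy as np

def partitioning_cluster(D, k, max_iter=100):
    """Minimal k-means over the rows of D (an N x d array),
    mirroring the pseudocode above."""
    prototypes = D[:k].astype(float).copy()   # step 1: arbitrary initial prototypes
    labels = None
    for _ in range(max_iter):                 # steps 3-9
        # Steps 4-6: assign each point to the prototype minimizing d(t, p).
        dists = np.linalg.norm(D[:, None, :] - prototypes[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                             # step 9: assignments unchanged
        labels = new_labels
        # Step 8: re-compute each prototype as the mean of its cluster.
        for i in range(k):
            members = D[labels == i]
            if len(members):
                prototypes[i] = members.mean(axis=0)
    return labels, prototypes
```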
The k-means algorithm is characterized by the following details [5]:
1) First, it selects the initial k prototypes arbitrarily.
2) The squared-error criterion is used to determine the clustering quality.
3) In each iteration, the prototype of each cluster is re-computed to be the cluster mean.
4) K-means does not include any sampling techniques to scale to huge databases.
Example: consider the task of clustering the jobs in the Grid environment based on two numerical attributes:
1) the number of computers actually taking part in the sharing of databases in the grid environment, and
2) the number of databases shared by these computers for different jobs of various organizations.

TABLE I. DATA FOR K-MEANS EXAMPLE
Company  Jobs  No. of Computers  No. of Databases
C1       J1    60                64
C2       J2    55                55
C3       J3    46                54
C4       J4    50                46
C5       J5    58                65
C6       J6    52                60

The graph plotted for these points is as follows.
Iteration 1. The initial prototypes are selected arbitrarily; let the points J1, J2, and J3 be the cluster means (k = 3). The algorithm decides to which cluster each remaining point should belong, based on how close it is to the prototypes. Calculating the Euclidean distances, we obtain

d(J4,J1) = sqrt((50-60)^2 + (46-64)^2) = 20.59
d(J4,J2) = sqrt((50-55)^2 + (46-55)^2) = 10.30
d(J4,J3) = sqrt((50-46)^2 + (46-54)^2) = 8.94
d(J5,J1) = sqrt((58-60)^2 + (65-64)^2) = 2.24
d(J5,J2) = sqrt((58-55)^2 + (65-55)^2) = 10.44
d(J5,J3) = sqrt((58-46)^2 + (65-54)^2) = 16.28

It is clear from the Euclidean distances that J5 is closer to J1 than to J2; hence it should belong to the cluster for which J1 is the prototype. Similarly, J4 should belong to the cluster for which J3 is the prototype. (The remaining point J6 is assigned analogously; it falls in the cluster of J2.)
The clustering quality for the reassigned points J4 and J5 is calculated using the squared-error criterion as

E = d(J5,J1)^2 + d(J4,J3)^2 = 5 + 80 = 85

Iteration 2. The algorithm re-computes the prototypes by calculating the cluster means. For the cluster containing J1 and J5 the mean is ((58+60)/2, (65+64)/2) = (59, 64.5); similarly the other cluster means are calculated, and the Euclidean distances are computed along with the clustering quality.
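The distances and the squared-error value above can be checked with a few lines of Python, using the (computers, databases) pairs from Table I:

```python
import math

points = {"J1": (60, 64), "J2": (55, 55), "J3": (46, 54),
          "J4": (50, 46), "J5": (58, 65), "J6": (52, 60)}

def d(a, b):
    """Euclidean distance between two named points."""
    return math.dist(points[a], points[b])

print(round(d("J4", "J3"), 2), round(d("J5", "J1"), 2))  # 8.94 2.24
print(round(d("J5", "J1")**2 + d("J4", "J3")**2))        # squared error E = 85
```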
V. ANALYSIS OF THE ALGORITHM
In each iteration, the k-means algorithm computes the distance of each point from each of the prototypes. If there are n points, k prototypes, and t iterations, then the complexity of the algorithm is O(nkt). Thus, the algorithm is linearly scalable with respect to the data size [5].
VI. DISCUSSION AND CONCLUSION
K-means is closely related to fitting a mixture of k isotropic Gaussians to the data; more generally, it is related to fitting the data with a mixture of k components from the exponential family of distributions. Another broad generalization is to view the "means" as probabilistic models instead of points in R^d [4]. Here, in the assignment step, each data point is assigned to the model most likely to have generated it; in the "relocation" step, the model parameters are updated to best fit the assigned datasets. This provides a solution to data mining in the Grid for more complex and scattered data. The k-means algorithm is widely used because of its simplicity and its partitional clustering. It can easily be modified to deal with streaming data: the approximations to the true means are iteratively refined, and at each iteration every point is assigned to the proper cluster.
VII. LIMITATIONS
1. K-means is a limiting case of fitting data by a mixture of k Gaussians with identical, isotropic covariance matrices, when the soft assignments of data points to mixture components are hardened to allocate each data point solely to the most likely component.
2. The cost of the optimal solution decreases with increasing k until it hits zero when the number of clusters equals the number of distinct data points. This makes it more difficult to (a) directly compare solutions with different numbers of clusters and (b) find the optimum value of k. If the desired k is not known in advance, one will typically run k-means with different values of k and then use a suitable criterion to select one of the results.
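To illustrate point 2, one can sweep k and inspect the squared-error cost; a hedged sketch, reusing the partitioning_cluster function from the earlier code and the data of Table I (the cost of the optimal solution decreases with k, and these plain runs approximate that curve):

```python
import numpy as np

D = np.array([[60, 64], [55, 55], [46, 54],
              [50, 46], [58, 65], [52, 60]], dtype=float)

for k in range(1, 6):
    labels, protos = partitioning_cluster(D, k)
    # Squared-error cost: sum of squared distances to assigned prototypes.
    cost = sum(np.linalg.norm(D[i] - protos[labels[i]])**2 for i in range(len(D)))
    print(k, round(cost, 2))   # the cost shrinks toward 0 as k grows
```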
REFERENCES
[1] G. Alonso, F. Casati, H. Kuno, and V. Machiraju, "Web Services: Concepts, Architectures, and Applications", Springer Verlag, 2004.
[2] I. Foster, J. Nick, C. Kesselman, and S. Tuecke, "The Physiology of the
Grid: An Open Grid Services Architecture for Distributed Systems
Integration", 2002.
[3] M. Muhammad, L. Yuan, and Z. Jianqiu, "Web 3.0: A Real Personal Web", 2009 Third International Conference on Next Generation Mobile Applications, Services and Technologies, 2009, pp. 125-128.
[4] Survey paper, IEEE International Conference on Data Mining (ICDM), published 4 December 2007.
[5] Vikram Pudi and P. Radha Krishna, Data Mining, Oxford University Press.
A Survey of Approaches for Ontology Design
A. B. Manwar1, H. S. Mahalle2, Dr. Vinay Chavan3
1 Department of Computer Science, SGB Amravati University, Amravati MS, India
2 Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal MS, India
3 Department of Computer Science, S. K. Porwal College, Kamptee, Nagpur MS, India
Abstract- The semantic web is the next-generation web with structured information, that is, with well-defined semantics. Ontologies are defined by Gruber as 'vocabularies of representational terms (classes, relations, functions, object constants) with agreed-upon definitions in the form of human-readable text and machine-enforceable, declarative constraints on their well-formed use'. Ontology plays a vital role in augmenting relevancy in information retrieval. In this paper, we survey the prominent approaches that have been proposed for the construction of ontologies. We also discuss some key issues in the design process of ontologies.
I. INTRODUCTION
"The Semantic Web is not a separate Web but an extension of the current one, in which information is given a well-defined meaning, better enabling computers and people to work in cooperation" [1].
The notion of the Semantic Web by Tim Berners-Lee has created multidimensional directions of research, with ontology at its core as a suitable representation of the knowledge hidden in the massive data of the web. Several issues need to be considered related to the methodologies, tools, and languages to be used while constructing a new ontology. Many approaches have been proposed for developing ontologies which allow users to define new concepts, relations, and instances. The interrelationships between entities, the domain vocabulary, and factual knowledge are key factors in designing an effective tool. However, the hassle lies in the choice of a proper tool for modeling, editing, and exploiting the various ontologies. Knowledge representation in domain ontologies developed with a standard formalism enables interoperability among ontologies of similar domains; moreover, this formal structure leads to ontology learning [2].
In this paper, we survey some of the known relevant approaches for building ontologies. The next section introduces the related work on approaches for ontology construction. Section III reviews different approaches for building ontologies. Finally, Section IV is a discussion which summarizes the paper. The survey draws on the ontology development community and should be valuable for ontology developers.
II. RELATED WORK
So far, many approaches have been reported for ontology building. In 1990, Lenat and Guha suggested the approach used for the Cyc development [3]. The method used to build the Cyc KB consists of three phases: the first phase was the manual extraction of implicit knowledge, while the second and third phases comprise acquiring and managing knowledge through natural language or machine learning. Uschold and King's method proposes capturing knowledge, coding it, and integrating other ontologies inside the current one [4]. Gruninger and Fox propose a methodology founded on knowledge-based systems using first-order logic, which takes advantage of the robustness of classic logic [5]. In the method proposed in the KACTUS project, the ontology is built on the basis of an application KB by means of a process of abstraction, that is, following a bottom-up strategy [6].
The only method that suggests collaborative construction is CO4 [7]. This method includes an approach for fitting new pieces of knowledge into a previously agreed underlying knowledge architecture. Natalya Fridman Noy considered the major strengths and contributions and outlined the weaknesses of different approaches in the ontology design process [8]. In 1997, a new methodology was suggested for building ontologies based on the SENSUS ontology [9]. The SENSUS-based method is a top-down approach for deriving domain-specific ontologies from huge ontologies. Consequently, this approach promotes shareability of knowledge, since the same base ontology is used to develop ontologies in particular domains. In 1999, Fernandez-Lopez analysed the methodologies and noted incompatibility with the IEEE standard in all of them except the SENSUS approach [10]. A comparative and detailed study of these methods and methodologies can be found in [11]. METHONTOLOGY is a methodology, created in the Artificial Intelligence Lab at the Technical University of Madrid (UPM), for building ontologies either from scratch, reusing other ontologies as they are, or by a process of re-engineering them. The METHONTOLOGY framework enables the construction of ontologies at the knowledge level. The main phase in the ontology development process using the METHONTOLOGY approach is the conceptualization phase [12].
Many other methods and methodologies have also been proposed for other tasks, such as ontology re-engineering [13], ontology learning [2, 14], ontology evaluation [15, 16, 17], ontology evolution [18, 19], and ontology merging [20].
To gain the maximum potential from ontologies, Dean Jones et al. [21] suggest considering an approach based on the variety of available experience rather than just one or two. In 2001, the On-To-Knowledge methodology came out under the name OnToKnowledge [22]. In most cases there is no correspondence between ontology building methodologies and tools; since there is no technological support for most of the existing methodologies, they cannot be easily applied in the ontology construction task. Oscar et al. [23] discussed the missing integrated environments for ontology development, with the exceptions of Protégé, OntoEdit, and WebODE. In addition, the WebODE [24] and ODE [25] tools provide support for METHONTOLOGY.
Man Li et al. [26] discussed the challenging issues in constructing large-scale ontologies and proposed a practical collaborative ontology development method. In The Hitchhiker's Guide to Ontology Editors, Loredana Laera and Valentina Tamma [27] briefly surveyed many ontology development tools, remarking that most of them only give support for designing and implementing ontologies, skipping support for the other activities of the ontology life cycle. Li Ding et al. [28] provided a pragmatic view of Semantic Web ontologies via a comparison of their chronological evolution, a survey, and a case study. The technical report of Md. Ahsan-ul Murshed and Ramanjit Singh [29] proposed ontology evaluation criteria and a ranking technique to evaluate ontology editors, commenting that Protege-2.1.2 can be used to construct small-sized ontologies, OntoEdit can be used to develop medium to large-sized ontologies, and Metis-3.4 is good for enterprise architecture modeling. Eva Blomqvist showed the use of patterns as templates for creating ontology design patterns for the automatic construction of enterprise ontologies [30]; however, this fits the requirements of a small-scale application context. Later, the general outline of a case-based-reasoning-inspired approach for pattern-based ontology construction, called OntoCase, was presented [31]. Sabin C. Buraga, Liliana Cojocaru, and Ovidiu C. Nichifor drew several remarks and results from a survey of Web ontology editing tools, which dealt with the practical deployment of a test ontology via a suite of tools and focused on some important technical details and features [32]. Mohammad Nazir Ahmad and Robert M. Colomb briefly reported on the technologies of ontology and the semantic web, particularly ontology server development [33]. Ling Li et al. presented a visual ontology modeling tool, VOEditor, for ontology engineers to create and edit their ontologies in a graphical mode, and a multi-agent ontology-aware system for querying Semantic Web resources [34]. Gianluca Correndo and Harith Alani presented a brief survey of tools submitted to the workshop on Social and Collaborative Construction of Structured Knowledge [35].
III. APPROACHES FOR ONTOLOGY DESIGN
Some of the relevant approaches for ontology development are reviewed briefly in this section.
IEEE defines a methodology as "a comprehensive integrated series of techniques or methods creating a general systems theory of how a class of thought-intensive work ought to be performed" [36]. Although most ontology methodologies conform to software and knowledge systems engineering methodologies, they are oriented to the more specific task at hand. Thus, a methodology is composed of methods, techniques, processes, and activities, and may follow one of several approaches to developing an ontology: bottom-up, top-down, or middle-out.
A. Approach 1
This methodology distinguishes manual extraction, machine-aided extraction and machine-managed extraction of common-sense knowledge. Lenat and Guha described the process followed to model Cyc, an ontology originally conceived to capture shared knowledge about the world [3].
B. Approach 2
The TOVE project established another approach: (1) specification of the environment motivating the ontology and its application, (2) a set of questions posed as requirements for the ontology, (3) specification of the terminology, i.e. objects, attributes and relations, (4) specification of constraints over the terminology, and (5) evaluation. Later, Gruninger and Fox established several steps towards the methodological construction and evaluation of ontologies based on their experience building the TOVE project ontologies [5].
C. Approach 3
Uschold et al. [4, 37, 38] proposed a set of guidelines for ontology construction and merging: (1) identifying the purpose, with recognition of the user domain, assessment of the purposes, and choice of prompting scenarios and competency questions; (2) the process of ontology building, which includes ontology capture to identify the key concepts and relationships in the domain of interest (called scoping, using a middle-out approach), setting precise, unambiguous text definitions for such concepts and relations, ontology coding, and integrating existing ontologies; (3) an evaluation and revision cycle, for which Uschold indicates the clarity, consistency and coherence criteria and checking against the purpose or the competency questions.
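To make the use of competency questions concrete, the sketch below (an illustration only, not part of Uschold’s guidelines) encodes one competency question as a SPARQL query over a toy ontology using the Python rdflib library; every name in it (ex:Disease, ex:treats, and so on) is hypothetical.

    # A competency question checked against a toy ontology (rdflib).
    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/onto#")
    g = Graph()

    # Ontology capture: a few key concepts and relations (middle-out scoping).
    g.add((EX.Disease, RDF.type, RDFS.Class))
    g.add((EX.Treatment, RDF.type, RDFS.Class))
    g.add((EX.Headache, RDF.type, EX.Disease))
    g.add((EX.Aspirin, RDF.type, EX.Treatment))
    g.add((EX.Aspirin, EX.treats, EX.Headache))

    # Competency question: "Which treatments treat which diseases?"
    CQ = "SELECT ?t ?d WHERE { ?t ex:treats ?d . ?d rdf:type ex:Disease }"

    # Evaluation: the ontology answers the question iff the query returns rows.
    rows = list(g.query(CQ, initNs={"ex": EX, "rdf": RDF}))
    print("competency question answered:", len(rows) > 0)
    for t, d in rows:
        print(t, "treats", d)

An ontology passes such a check when each of its competency questions returns a non-empty, correct answer set.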
D. Approach 4
The KACTUS methodology is mainly centered on the development of ontologies for specific applications using a bottom-up approach. In this design process, ontologies developed for other purposes in the same domain may be reused. In the first step, the specification of the application, with its relevant terms and tasks, is gathered. In the second step, a preliminary design is made considering the previous lists and specification, which includes a search for already developed ontologies. In the final step, the ontology is refined towards the final design of the application [39].
E. Approach 5
METHONTOLOGY is a methodology described in Fernandez-Lopez et al. [40], Fernandez-Lopez [41], Fernandez-Lopez and Gomez-Perez [42] and Gomez-Perez et al. [43]. It describes the different steps to be taken in the conceptualization process of an ontology, and considers development activities as well as management and support activities: integration, ontology verification, validation and assessment, documentation, and configuration management. The development activities are: specification of the purpose and scope of the ontology; conceptualization, i.e. forming a glossary of terms with definitions, synonyms and acronyms, adopting a middle-out strategy; classification of terms into one or more taxonomies of concepts; building the concept-relation dictionary; setting axioms and rules; formalization; implementation; and maintenance. Finally, several tools such as ODE and WebODE were built to give technological support to this methodology, although several other existing tools may also be used, such as Protégé, OntoEdit and KAON, among others [43, 44].
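Purely as an illustration, the sketch below walks through a fragment of METHONTOLOGY’s conceptualization activities for a hypothetical travel domain, using the Python rdflib library: a glossary of terms with definitions is classified into a small taxonomy and then implemented as RDFS.

    # Glossary -> taxonomy -> RDFS implementation (hypothetical travel domain).
    from rdflib import Graph, Namespace, RDF, RDFS, Literal

    EX = Namespace("http://example.org/travel#")

    # Specification/conceptualization: glossary of terms with text definitions.
    glossary = {
        "Lodging": "A place where travellers can stay overnight.",
        "Hotel": "A lodging that offers paid rooms and services.",
        "Hostel": "A low-cost lodging with shared rooms.",
    }

    # Classification of terms into a taxonomy of concepts (child -> parent).
    taxonomy = {"Hotel": "Lodging", "Hostel": "Lodging"}

    # Formalization/implementation: emit the conceptual model as RDFS.
    g = Graph()
    g.bind("ex", EX)
    for term, definition in glossary.items():
        g.add((EX[term], RDF.type, RDFS.Class))
        g.add((EX[term], RDFS.comment, Literal(definition)))
    for child, parent in taxonomy.items():
        g.add((EX[child], RDFS.subClassOf, EX[parent]))

    print(g.serialize(format="turtle"))

Tools such as ODE and WebODE were designed to support exactly this progression from intermediate representations (glossaries, taxonomies, dictionaries) to a formal encoding.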
F. Approach 6
Noy and McGuinness suggested a simple knowledge-engineering methodology in seven steps [45]. It is presented just as a set of ideas to aid ontology construction: “there is no single correct ontology-design methodology and we did not attempt to define one. The ideas that we present here are the ones that we found useful in our own ontology-development experience” [45]. They used Protégé-2000 as the ontology development environment to present their examples. The seven steps are: (1) determine the domain and the scope of the ontology; (2) consider reusing existing ontologies; (3) enumerate important terms in the ontology; (4) define the classes and the class hierarchy, following approaches such as top-down, bottom-up or a combination; (5) define the properties of classes (slots), such as intrinsic and extrinsic properties, parts (if structured) and relationships to other individuals; (6) define the facets of the slots, such as slot value type (string, number, boolean, enumerated or instance-type slots) and domain and range; (7) create individual instances of classes in the hierarchy. This guide is referenced on the Protégé ontology editor webpage.
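Under the assumption of the wine domain used in the guide, the following minimal sketch (Python, rdflib) maps the seven steps onto code; the class, slot and instance names are illustrative rather than taken verbatim from the authors’ ontology.

    # The seven steps of Ontology Development 101, sketched with rdflib/OWL.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS, OWL, XSD

    EX = Namespace("http://example.org/wine#")
    g = Graph()
    g.bind("ex", EX)

    # Steps 1-2: domain/scope fixed as "wine"; reuse would call g.parse(<IRI>).
    # Steps 3-4: enumerate important terms, define classes and the hierarchy.
    for cls in (EX.Wine, EX.RedWine, EX.Winery):
        g.add((cls, RDF.type, OWL.Class))
    g.add((EX.RedWine, RDFS.subClassOf, EX.Wine))

    # Steps 5-6: define slots and their facets (value type, domain, range).
    g.add((EX.producedBy, RDF.type, OWL.ObjectProperty))
    g.add((EX.producedBy, RDFS.domain, EX.Wine))
    g.add((EX.producedBy, RDFS.range, EX.Winery))
    g.add((EX.vintageYear, RDF.type, OWL.DatatypeProperty))
    g.add((EX.vintageYear, RDFS.range, XSD.integer))

    # Step 7: create individual instances of classes in the hierarchy.
    g.add((EX.ChateauX, RDF.type, EX.Winery))
    g.add((EX.ChateauXMerlot, RDF.type, EX.RedWine))
    g.add((EX.ChateauXMerlot, EX.producedBy, EX.ChateauX))
    g.add((EX.ChateauXMerlot, EX.vintageYear, Literal(1998, datatype=XSD.integer)))

    print(g.serialize(format="turtle"))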
G. Approach 7
The ON-TO-KNOWLEDGE (OTK) methodology is extensively described in Sure [46, 48] and Sure, Staab and Studer [47], and was influenced by the methodological proposals of Uschold and colleagues, CommonKADS and METHONTOLOGY. The methodological process is divided into five steps: (1) feasibility study, identifying stakeholders (users and supporters of the system) and use cases describing usage scenarios; (2) ontology kickoff, which initiates the development of the ontology; (3) refinement; (4) evaluation, mainly consistency and language-conformity checking; (5) application.
H. Approach 8
DILIGENT (DIstributed, Loosely-controlled, evolvInG Engineering of oNTologies) is a methodology focused on a specific modelling scenario. The DILIGENT process includes five activities: (1) build, (2) local adaptation, (3) analysis, (4) revision, and (5) local update. In each of these activities, domain experts, users, knowledge engineers and ontology engineers are assigned different roles [49].
I. Approach 9
The TERMINAE method for the development of ontologies is divided into four steps, and it is assumed that “(1) the ontology builder should have a comprehensive knowledge of the domain (2) the ontology builder knows well how the ontology will be used. The alignment process takes place during the construction” [52]. The TERMINAE modelling steps are: (1) corpus constitution, i.e. corpus selection and organization; (2) linguistic study, i.e. linguistic analysis using several natural language processing tools; (3) normalization according to some structuring principles and criteria; (4) formalization and validation with a variant language of description logics [50, 51, 53].
J. Approach 10
UPON (Unified Process for ONtology) builds on the experience of the methodological development of OntoLearn. The OntoLearn tool and method offered methodological support for ontology learning and the automatic population of domain ontologies [54, 55]. The OntoLearn methodology had four steps: discovery of new terms, extraction of definitions, extraction of taxonomic information, and ontology update, and included experts as per-concept evaluators. The UPON methodology is based on the Unified Process for software development and is use-case driven, iterative and incremental in nature [56]. The evaluation of the quality of the ontology includes four activities: (a) syntactic quality (formal), (b) semantic quality (consistency), (c) pragmatic quality (fidelity, relevance and completeness), and (d) social quality [57, 58].
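Two of these activities admit a simple automated sketch, shown below under the assumption that the ontology is serialized as RDF: syntactic quality reduces to whether the document parses, and one narrow aspect of semantic quality (consistency) can be probed by searching for cycles in the subclass taxonomy. The Turtle snippet and class names are hypothetical.

    # Syntactic check (parse) and one semantic check (subclass cycles), rdflib.
    from rdflib import Graph, RDFS

    TTL = """
    @prefix ex: <http://example.org/onto#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:Car rdfs:subClassOf ex:Vehicle .
    ex:Vehicle rdfs:subClassOf ex:Artifact .
    """

    # Syntactic quality: parsing fails loudly on malformed input.
    g = Graph().parse(data=TTL, format="turtle")

    # Semantic check (one simplified aspect): flag classes that are their own
    # ancestors in the rdfs:subClassOf hierarchy (a common modelling error).
    def ancestors(graph, cls):
        seen, frontier = set(), set(graph.objects(cls, RDFS.subClassOf))
        while frontier:
            node = frontier.pop()
            if node not in seen:
                seen.add(node)
                frontier |= set(graph.objects(node, RDFS.subClassOf))
        return seen

    classes = set(g.subjects(RDFS.subClassOf, None))
    cycle_free = all(c not in ancestors(g, c) for c in classes)
    print("parses: OK; subclass hierarchy cycle-free:", cycle_free)

Pragmatic and social quality, by contrast, require human judgement and usage data and are not mechanically checkable in this way.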
IV. DISCUSSION
Ontologies may be analysed according to their rationale, subject matter, level of simplification, and richness of internal structure. However, the methodologies adopted for their construction and the tools used are also significant items for analysis. Hoekstra stated that “ontology development remains an art, and no a priori guidelines exist that can ensure the quality of the result” [59]. An investigation of the methodological descriptions shows that ontology-building methodologies, especially METHONTOLOGY, On-To-Knowledge and UPON, follow a cyclic modelling approach composed of several stages, comprising management (a preparatory phase), development (a development phase) and support (an application phase) activities. The preparatory phase includes activities such as the specification of requirements, together with an initial feasibility study. The development stage covers knowledge acquisition activities, together with ontology conceptualization, formalization, evaluation and refinement. Development activities depend on the criteria established in the preparatory stage, and may need to be performed in iterative and incremental cycles, hence the need for ontology refinement. Finally, the application phase includes implementation and maintenance activities.
These methodologies may also be distinguished by the knowledge acquisition factor in ontology development. For example, TERMINAE is a bottom-up approach which focuses on the acquisition of low-level domain classes or instances through a list of terms extracted by linguistic analysis techniques such as natural language processing.
None of the described methodologies focuses entirely on the development of ontologies from a top-down approach, although the use of competency questions and the decision to reuse available top or upper ontologies can be seen as top-down elements in ontology design. However, most methodologies that include the formulation of competency questions (e.g. METHONTOLOGY, ON-TO-KNOWLEDGE), the analysis of available ontologies towards reuse, and ontology refinement steps also suggest the extraction of lists of domain terms from texts. Sure and Studer (2003) comment, in their description of ON-TO-KNOWLEDGE, that “the usage scenario/competency question method usually follows a top-down approach in modelling the domain. One starts by modelling concepts in a very generic level. Subsequently they are refined. This approach is typically done manually and leads to a high-quality engineered ontology. Available top-level ontologies may be reused here and serve as a starting point to develop new ontologies. In practice this seems to be more a middle-out approach, that is, to identify the most important concepts which will then be used to obtain the remainder of the hierarchy by generalization and specialization. However, with the support of automatic document analysis, a typical bottom-up approach may be applied. There, relevant lexical entries are extracted semi-automatically from available documents. Based on the assumption that most concepts and conceptual structures of the domain as well the company terminology are described in documents, applying knowledge acquisition from text for ontology design seems to be promising”.
Hence, most ontology methodologies propose an open approach with various concurrent possibilities such as top-down, middle-out and bottom-up.
REFERENCES
[1] Berners-Lee T., J. Hendler, and O. Lassila. “The Semantic Web” in
Scientific American 284(5):34–43, May 2001.
[2] N. Aussenac-Gilles, B. Biebow, S. Szulman, “Revisiting ontology design: a methodology based on corpus analysis,” in: 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW’00), 2000.
[3] D.B. Lenat, R.V. Guha, “Building Large Knowledge-Based
Systems: Representation and Inference in the Cyc Project,”
Addison-Wesley, Boston, 1990.
[4] M. Uschold, M. King, “Towards a Methodology for Building
Ontologies,” in: IJCAI95 Workshop on Basic Ontological Issues in
Knowledge Sharing, Montreal, 1995.
[5] M. Gruninger, M.S. Fox, “Methodology for the design and evaluation of ontologies,” in: Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, 1995.
[6] A. Bernaras, I. Laresgoiti, J. Corera, “Building and reusing
ontologies for electrical network applications,” in: Proc. European
Conference on Artificial Intelligence (ECAI’96), Budapest,
Hungary, 1996, pp. 298–302.
[7] J. Euzenat, “Corporative memory through cooperative creation of
knowledge bases and hyper-documents,” in: Proc. 10th Knowledge
Acquisition for Knowledge-Based Systems Workshop (KAW96),
Banff, 1996.
[8] Natalya F. Noy and Carole D. Hafner, "The State of the Art in
Ontology Design", American Association for Artificial
Intelligence, 1997.
[9] B. Swartout, R. Patil, K. Knight, T. Russ, “Toward Distributed
Use of Large-Scale Ontologies,” in AAAI Symposium on
Ontological Engineering, Stanford, 1997.
[10] M. Fernandez-Lopez, “Overview of Methodologies for Building
Ontologies,” in: IJCAI99 Workshop on Ontologies and Problem-
Solving Methods: Lessons Learned and Future Trends, Stockholm,
1999.
[11] M. Fernandez-Lopez, “Overview of Methodologies for Building
Ontologies,” in: IJCAI99 Workshop on Ontologies and Problem-
Solving Methods: Lessons Learned and Future Trends, Stockholm,
1999.
[12] M. Fernandez-Lopez, A. Gomez-Perez, A. Pazos-Sierra, J. Pazos-
Sierra, “Building a chemical ontology using METHONTOLOGY
and the ontology design environment,” in IEEE Intelligent Systems
& their applications 4 (1) (1999) 37–46.
[13] A. Gomez-Perez, M.D. Rojas, “Ontological reengineering and reuse,” in: D. Fensel, R. Studer (Eds.), 11th European Workshop on Knowledge Acquisition, Modeling and Management (EKAW’99), 1999.
[14] J.U. Kietz, A. Maedche, R. Volz, “A method for semi-automatic ontology acquisition from a corporate intranet,” in: EKAW’00 Workshop on Ontologies and Texts, CEUR Workshop Proceedings, Juan-les-Pins, vol. 51, 2000.
[15] A. Gomez-Perez, “Evaluation of ontologies,” in International Journal of Intelligent Systems 16(3), 2001, 10:1–10:36.
[16] N. Guarino, C. Welty, “Ontological analysis of taxonomic relationships,” in: 19th International Conference on Conceptual Modeling (ER 2000), 2000.
[17] Y. Kalfoglou, D. Robertson, “Managing Ontological Constraints,”
in: Proc. IJCAI99 Workshop on Ontologies and Problem-Solving
Methods: Lessons Learned and Future Trends, Stockholm, 1999.
[18] M. Klein, D. Fensel, “Ontology versioning on the Semantic Web,”
in: First International Semantic Web Workshop (SWWS01),
Stanford, 2001.
[19] L. Stojanovic, B. Motik, “Ontology evolution within ontology editors,” in:
EKAW02 Workshop on Evaluation of Ontologybased Tools
(EON2002), CEUR Workshop Proceedings, Siguenza, vol. 62,
2002, pp. 53–62.
[20] N.F. Noy, M.A. Musen, “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment,” in: 17th National
Conference on Artificial Intelligence (AAAI’00), Austin, 2000.
[21] Jones, D., Bench-Capon, T., and Visser, P. “Methodologies for ontology development,” in Proceedings of the 15th IFIP World Computer Congress, Budapest, 1998, pp. 62–75.
[22] S. Staab, H.P. Schnurr, R. Studer, Y. Sure, “Knowledge processes
and ontologies,” in IEEE Intelligent Systems 16 (1) (2001) 26–34.
[23] Oscar Corcho, Mariano Fernandez Lopez, Asuncion Gomez-Perez.
“Methodologies, tools and languages for building ontologies.
Where is their meeting point?” in Data & Knowledge Engineering
46 (2003) 41–64.
[24] O. Corcho, M. Fernandez-Lopez, A. Gomez-Perez, O. Vicente,
“WebODE: an integrated workbench for ontology representation,
reasoning and exchange,” in 13th International Conference on
Knowledge Engineering and Knowledge Management
(EKAW’02), 2002.
[25] M. Blazquez, M. Fernandez-Lopez, J.M. G-Pinar, A. Gomez-
Perez, “Building ontologies at the knowledge level using the
ontology design environment,” in 11th International Workshop on
Knowledge Acquisition, Modeling and Management (KAW’98),
Banff, 1998.
[26] Man Li, Dazhi Wang, Xiaoyong Du, Shan Wang. “Ontology
Construction for Semantic Web: A Role-Based Collaborative
Development Method”, APWeb 2005: 609-619
[27] Loredana Laera, Valentina Tamma, “The Hitchhiker’s Guide to
Ontology Editors,” in AgentLinks’ Making Use of the Semantic
Web. ISSUE 18, August 2005
[28] Li Ding, Pranam Kolari, Zhongli Ding, and Sasikanth Avancha.
“Using Ontologies in the Semantic Web: A Survey,” a book-
chapter in Ontologies: A Handbook of Principles, Concepts and
Applications in Information Systems, 2006.
[29] Md. Ahsan-ul Murshed and Ramanjit Singh. “Evaluation and
ranking of ontology construction tools”. Technical Report # DIT-
05-013, at eprints.biblio.unitn.it/archive/00000747/01/013.pdf,
March 2005.
[30] Eva Blomqvist. “Fully automatic construction of enterprise
ontologies using design patterns: Initial method and first
experiences,” in Proceedings of ODBASE’05, 2005.
[31] Eva Blomqvist. “Pattern ranking for semi-automatic ontology construction,” in Proceedings of the 2008 ACM Symposium on Applied Computing, ACM, New York, NY, USA, 2008.
[32] Buraga, S. C., Cojocaru, L., and Nichifor, O. C. “Survey on Web Ontology Editing Tools,” available at: http://www.aifb.uni-karlsruhe.de/WBS/ysu/publications/OntoWeb_Del_1-3.pdf, 2006.
[33] Ahmad, M. N., and Colomb, R. M. “Managing Ontologies: A Comparative Study of Ontology Servers,” Australian Computer Society, Inc., 2007. Retrieved from http://portal.acm.org/citation.cfm?id=1273733
[34] Ling Li, Shengqun Tang, Lina Fang, Ruliang Xiao, Xinguo Deng,
Youwei Xu, Yang Xu, "VOEditor: a Visual Environment for
Ontology Construction and Collaborative Querying of Semantic
Web Resources," in 31st Annual International Computer Software
and Applications Conference, 2007
[35] Gianluca Correndo, Harith Alani. “Survey of Tools for
Collaborative Knowledge Construction and Sharing,” in Web
Intelligence/IAT Workshops 2007: 7-10
[36] IEEE. “IEEE Standard Glossary of Software Engineering Terminology,” IEEE Standard 610.12-1990. New York: Institute of Electrical and Electronics Engineers, December 1990.
[37] Uschold, M. “Building ontologies: Towards a unified methodology,” in Proceedings of Expert Systems 1996, the 16th Annual Conference of the British Computer Society Specialist Group on Expert Systems, Cambridge, 1996.
[38] Uschold, M., and M. Grüninger. “Ontologies: Principles, methods and applications,” Technical Report AIAI-TR-191, University of Edinburgh, 1996.
[39] Bernaras, A., I. Laresgoiti, and J. Corera. “Building and reusing
ontologies for electrical network applications,” in Proceedings of
the 12th European Conference on Artificial Intelligence
(ECAI’96), Budapest, Hungary: Wiley, 1996
[40] Fernandez-Lopez, M., A. Gomez-Perez, and N. Juristo. “METHONTOLOGY: From ontological art towards ontological engineering,” in Ontological Engineering, American Association for Artificial Intelligence, Menlo Park, California, 33–40, 1997.
[41] Fernandez-Lopez, M. “Overview of methodologies for building ontologies,” in Proceedings of the Workshop on Ontologies and Problem-Solving Methods, in conjunction with the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI’99), 1999.
[42] Fernandez-Lopez, M., and A. Gomez-Perez. “Overview and
analysis of methodologies for building ontologies.” Knowledge
Engineering Review 17(2):129–156, 2002.
[43] Gomez-Perez, A., M. Fernandez-Lopez, and O. Corcho.
“Ontological engineering. With examples from the areas of
knowledge management, e-commerce and the Semantic Web.”
Advanced information and knowledge processing. London:
Springer, 2003.
[44] Corcho, O., M. Fernandez, A. Gomez-Perez, and A. Lopez-Cima. “Building legal ontologies with METHONTOLOGY and WebODE,” in Law and the Semantic Web: Legal Ontologies, Methodologies, Legal Information Retrieval, and Applications, 2005.
[45] Noy, N. F., and D. L. McGuinness. “Ontology development 101: A
guide to creating your first ontology.” Technical Report SMI-
2001-0880, Stanford University School of Medicine, 2001.
[46] Sure, Y. “A tool-supported methodology for ontology-based knowledge management,” in Ontology and Modeling of Real Estate Transactions in European Jurisdictions, 1993.
[47] Sure, Y., S. Staab, and R. Studer. “Methodology for development and employment of ontology based knowledge management applications,” ACM SIGMOD Record 31(4) (Special Issue), 18–23, 2002.
[48] Sure, Y. “Methodology, tools and case studies for ontology based
knowledge management.” Ph.D. thesis, 2003.
[49] Pinto, H. S., C. Tempich, and S. Staab. “DILIGENT: Towards a fine-grained methodology for distributed, loosely-controlled and evolving engineering of ontologies,” in Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), 2004.
[50] Biebow, B., and S. Szulman. “TERMINAE: A linguistics-based tool for the building of a domain ontology,” in Proceedings of the 11th European Workshop (EKAW’99), 1999.
[51] Aussenac-Gilles, N., B. Biebow, and S. Szulman. “Revisiting
ontology design: A method based on corpus analysis.” In
Proceedings of EKAW 2000.
[52] Despres, S., and S. Szulman. “Construction of a legal ontology from a European community legislative text,” in Legal Knowledge and Information Systems: The Seventeenth Annual Conference, ed. T. Gordon, 79–88. Amsterdam: IOS Press, 2004.
[53] Despres, S., and S. Szulman. “Merging of legal micro-ontologies
from European directives.” Artificial Intelligence and Law, 2007.
[54] Navigli, R., P. Velardi, A. Cucchiarelli, and F. Neri. “Automatic
ontology learning: Supporting a per-concept evaluation by domain
experts.” In Proceedings of the ECAI-2004 Workshop on Ontology
Learning and Population, Sevilla, 2004.
[55] Velardi, P., R. Navigli, A. Cucchiarelli, and F. Neri. “Evaluation of
OntoLearn, a methodology for automatic population of domain
ontologies.” In Ontology learning from text: Methods, evaluation
and applications, ed. P. Buitelaar, P. Cimiano, and B. Magnini.
Frontiers in artificial intelligence and applications series, Vol. 123. Amsterdam: IOS Press, 2005.
[56] de Nicola, A., M. Missikoff, and R. Navigli. “A proposal for a
unified process for ontology building: UPON.” In Database and
expert systems applications (DEXA), 2005.
[57] de Nicola, A., M. Missikoff, and R. Navigli. “A software
engineering approach to ontology building.” Information Systems
34:258–275, 2009.
[58] Barbagallo, A., A. D. Nicola, and M. Missikoff. “eGovernment
ontologies: Social participation in building and evolution.” Hawaii
International Conference on System Sciences 0:1–10, 2010.
[59] Hoekstra, R. “Ontology representation. Design patterns and
ontologies that make sense.” In Frontiers in artificial intelligence
and applications, Vol. 197. Amsterdam: IOS Press, 2009.