IPv6 – ADDRESSING GRID-INTEROPERABILITY PERSPECTIVE FOR PATIENTS IN MEDICAL FIELD
HEMANT S. MAHALLE1*, A. B. MANWAR2, Dr. VINAY CHAVAN3
1 Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal MS, India
2 Department of Computer Science, SGB Amravati University, Amravati MS, India
3 Department of Computer Science, S. K. Porwal College, Kamptee, Nagpur MS, India
*Corresponding Author: [email protected]
Received: January 30, 2012; Accepted: February 15, 2012
Abstract - IPv6 is of considerable importance to businesses, consumers, and network access providers of all sizes. IPv6 is designed to improve upon IPv4's scalability, security, ease of configuration, and network management; these issues are central to the competitiveness and performance of all types of network-dependent businesses. IPv6 aims to preserve existing investment as much as possible. End users, industry executives, network administrators, protocol engineers, and many others will benefit from understanding the ways that IPv6 will affect future internetworking and distributed computing applications such as cloud computing and Grid computing. Identifying every individual uniquely on the network is a requirement of today's science and technology.
In this paper, we present a platform providing a problem-solving environment for patients or persons across multiple application domains of Grid computing, especially in medical and scientific study, using collaborative relationships and fostering interoperability in the Grid infrastructure. The proposed work is based on the IPv6 address scheme, providing seamless, secure, and intuitive access to distributed Grid resources.
Keywords - IPv4, IPv6, Grid computing, Patients, scalability, Internet, Address scheme

1. Introduction
In 1980, ARPA started converting systems to TCP/IP. In 1983, ARPANET was split into two networks: MILNET for military research and ARPANET for continuing the existing line of research. In the same year ARPA also mandated that all systems connected to ARPANET use TCP/IP. The Internet Protocol (IP) has its roots in the early research networks of the 1970s, but within the past decade it has become the leading network-layer protocol. This means that IP is a primary vehicle for a vast number of client/server and peer-to-peer communications, and the current scale of deployment is straining many aspects of its twenty-year-old design. The Internet Engineering Task Force (IETF) has produced specifications that define the next-generation IP protocol known as "IPng" or "IPv6". IPv6 is both a near-term and a long-range concern for network owners and service providers. IPv6 products have already come to market; on the other hand, IPv6 development work will likely continue well into the next decade. Though it is based on much-needed enhancements to the IPv4 standards, IPv6 should be viewed as a new protocol that will provide a firmer base for the continued growth of today's internetworks.
Because it is intended to replace IPv4, IPv6 is of considerable importance to businesses, consumers, and network access providers of all sizes. IPv6 is designed to improve upon IPv4's scalability, security, ease of configuration, and network management; these issues are central to the competitiveness and performance of all types of network-dependent businesses. IPv6 aims to preserve existing investment as much as possible. End users, industry executives, network administrators, protocol engineers, and many others will benefit from understanding the ways that IPv6 will affect future internetworking and distributed computing applications [1, 2, 3]. E-Science today, particularly scientific experiments and studies, involves distributed and heterogeneous resources. Scientific applications use different approaches with similar characteristics in a wide variety of scientific domains and typically require distributed, high-throughput, and data-intensive computing [5]. In recent years, Grid environments have been adopted in order to accomplish these necessities, motivated by several factors, including technology reuse, computational performance, and security aspects [6].
One of the main current Public Key Infrastructure (PKI) needs is interoperability, which makes possible the secure interconnection and cooperation between different PKI structures, thus enhancing their feasibility and applicability at regional, national, and international levels. This is even more important in the healthcare sector, due to the increased demand for mobility of both patients and doctors and the criticality of telemedicine data, which make secure and interoperable information exchange a basic element of high-level healthcare service provision [7].

2. Network Addressing in IPv6
The following points glance at familiar attributes of address allocation associated with IPv4 and compare them to similar attributes in IPv6.
2.1 Address Scheme
IPv4: addresses are 32 bits (4 bytes) long. An address is composed of a network and a host portion, which depend on the address class. Various address classes are defined: A, B, C, D, or E, depending on the initial few bits. The total number of IPv4 addresses is 4,294,967,296 (2^32). The text form of an IPv4 address is nnn.nnn.nnn.nnn, where 0 <= nnn <= 255 and each group is written in decimal. Leading zeros may be omitted. The maximum number of print characters is 15, not counting a mask.
IPv6: addresses are 128 bits (16 bytes) long. The basic architecture is 64 bits for the network number and 64 bits for the host number. Often, the host portion of an IPv6 address (or part of it) is derived from a MAC address or other interface identifier. Depending on the subnet prefix, IPv6 has a more complicated architecture than IPv4. The number of IPv6 addresses (2^128) is 2^96, roughly 7.9 x 10^28 (79,228,162,514,264,337,593,543,950,336), times larger than the number of IPv4 addresses. The text form of an IPv6 address is xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx, where each x is a hexadecimal digit representing 4 bits. Leading zeros may be omitted, and the double colon (::) may be used once in the text form of an address to designate any number of consecutive 0 bits. For example, ::ffff:10.120.78.40 is an IPv4-mapped IPv6 address.
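As a quick check of these text forms, the following minimal Python sketch uses the standard ipaddress module; the addresses are the examples from the text plus the documentation prefix 2001:db8::, and the /48 refers to the allocation discussed in Section 2.2 below:

```python
import ipaddress

# An IPv4 address: dotted decimal, at most 15 print characters.
v4 = ipaddress.IPv4Address("10.120.78.40")
print(v4, v4.max_prefixlen)           # 10.120.78.40 32

# An IPv6 address: eight groups of hex digits; "::" compresses a zero run.
v6 = ipaddress.IPv6Address("2001:db8::1")
print(v6.exploded)                    # 2001:0db8:0000:0000:0000:0000:0000:0001

# The IPv4-mapped IPv6 address cited in the text.
mapped = ipaddress.IPv6Address("::ffff:10.120.78.40")
print(mapped.ipv4_mapped)             # 10.120.78.40

# A /48 allocation leaves 2**80 addresses (16 bits of subnetting space
# before the 64-bit interface identifier), as Section 2.2 describes.
net = ipaddress.IPv6Network("2001:db8::/48")
print(net.num_addresses)              # 1208925819614629174706176 == 2**80
```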
2.2 Address Allocation
IPv4: originally, addresses were allocated by network class. As the address space is used up, smaller allocations using Classless Inter-Domain Routing (CIDR) are made. Allocation has not been balanced among institutions and nations.
IPv6: allocation is in its earliest stages. The Internet Engineering Task Force (IETF) and Internet Architecture Board (IAB) have recommended that essentially every organization, home, or entity be allocated a /48 subnet prefix. This leaves 16 bits for the organization to do subnetting. The address space is large enough to give every person in the world their own /48 prefix.
2.3 Address Mask
In IPv4, a network mask is used to separate the network portion from the host portion, whereas the mask does not exist in IPv6 (prefix lengths are used instead).
2.4 Address Resolution Protocol (ARP)
The Address Resolution Protocol is used by IPv4 to find a physical address, such as the MAC or link address, associated with an IPv4 address. IPv6 embeds these functions within IP itself as part of the algorithms for stateless autoconfiguration and neighbor discovery, using Internet Control Message Protocol version 6 (ICMPv6). Hence, there is no such thing as ARP6.
2.5 Address Types
IPv4 supports unicast, multicast, and broadcast, whereas IPv6 supports unicast, multicast, and anycast.
2.6 Network Address Configuration
In IPv4, configuration must be done on a newly installed system before it can communicate; that is, IP addresses and routes must be assigned. In IPv6, configuration is optional, depending on the functions required. An appropriate Ethernet or tunnel interface must be designated as an IPv6 interface, for example using iSeries Navigator. Once that is done, IPv6 interfaces are self-configuring, so the system will be able to communicate with other local and remote IPv6 systems, depending on the type of network and whether an IPv6 router exists.
2.7 Domain Name System (DNS)
Applications accept host names and then use DNS to get an IP address, using the socket API gethostbyname(). Applications also accept IP addresses and then use DNS to get host names using gethostbyaddr(). For IPv4, the domain for reverse lookups is in-addr.arpa; for IPv6 it is ip6.arpa. Support for IPv6 exists using the AAAA ("quad A") record type and reverse lookup (IP-to-name). An application may elect to accept IPv6 addresses from DNS (or not) and then use IPv6 to communicate (or not). The socket API gethostbyname() is unchanged for IPv6, and the getaddrinfo() API can be used to obtain (at the application's choice) IPv6-only, or both IPv4 and IPv6, addresses; a short resolver sketch follows Section 2.8.
2.8 DHCP (Dynamic Host Configuration Protocol)
In IPv4, the Dynamic Host Configuration Protocol is used to dynamically obtain an IP address and other configuration information. At the time of writing, the DHCP implementation discussed here did not support IPv6; DHCPv6 (RFC 3315) defines the equivalent service for IPv6. (Sections 2.1 to 2.8 are adapted from [1-4].)
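The getaddrinfo() behaviour described in Section 2.7 can be illustrated with a minimal Python sketch; the host name is only an example, and what is returned depends on the DNS records published for it:

```python
import socket

# Resolve a host name to both IPv4 (A) and IPv6 (AAAA) addresses;
# AF_UNSPEC asks for whatever address families DNS returns.
for family, _, _, _, sockaddr in socket.getaddrinfo(
        "www.example.com", 80, socket.AF_UNSPEC, socket.SOCK_STREAM):
    label = "IPv6" if family == socket.AF_INET6 else "IPv4"
    print(label, sockaddr[0])

# Restricting the family to AF_INET6 returns IPv6 addresses only
# (this raises socket.gaierror if the name has no AAAA record).
v6_only = socket.getaddrinfo("www.example.com", 80,
                             socket.AF_INET6, socket.SOCK_STREAM)
```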
3. Implementation and Transition to IPv6
IPv6 implementation is an ongoing process. Some analysts draw comparisons between the potential shortfall of IPv4 addresses and potential problems such as address mismatches and viruses that alter configurations. The task of transitioning to IPv6 is full of challenges. Large-scale transitions require a high level of flexibility in the protocol and in the software and hardware supporting and using it. A successful migration to IPv6 requires all IPv4 nodes to communicate with IPv6 nodes during the migration period. It is also important that technology is in place to allow IPv6 nodes to communicate over the IPv4 Internet with other networks. To meet the needs of the various network topologies spread throughout the Internet, there are multiple methods for making this change [3]. The major transition mechanisms are integral to the IPv6 design effort; these techniques include dual-stack IPv4/IPv6 hosts and routers, tunneling of IPv6 via IPv4, and a number of IPv6 services, including IPv6 DNS, DHCP, MIBs, and so on [2]. The worldwide IPv6 testing and pre-production deployment network, called the 6BONE, had already reached approximately 40 countries. There are many IPv6 implementations completed or underway worldwide, in test or production use on the 6BONE. The 6BONE has been built by an active population of protocol inventors, designers, and programmers [2, 3].

4. Problem: Resource Provision in a Grid
IPv6 can be used effectively in Grid computing for various problems faced by users on the Internet. It is a prime requirement of the future scenario of the world to identify every individual and every asset on the Internet, so as to provide services to each of them. With the rapid progress of various technologies, there is a growing demand for integrating large amounts of domain information, including medical information, bioinformatics information, and enterprise information. The medical information system is well known as one of the most complicated information systems; its main target is to achieve, through the network, medical resource sharing across every aspect of the medical domain, for instance hospitals, medical research organizations, doctors, and patients [8]. In this paper we have concentrated on providing every individual with medical facilities by enrolling them in a Grid of doctors. As the number of patients is very large, and every member of a family needs a unique identity on the Internet in order to get the proper and required aid, it is not possible to address every individual using the IPv4 address scheme. So, to address the millions of persons individually in the system, an addressing scheme that can accommodate all of them is required; the only solution is the IPv6 addressing scheme. The key factor for this project is generating and managing metadata. Metadata has been found to be the key to the ability to interoperate and correlate heterogeneous data systems and the available databases in distributed computing. The emphasis is on data engineering, to understand how to interpret and
interrelate the local data dictionaries among the distributed systems.

5. IPv6 as the Perspective Solution
The Internet is changing the face of medical research. The current world of isolated research and proprietary data encodings is evolving into a future of standardized medical databases and integrated medical applications, such as clinical decision support systems [9]. Every individual on the Internet is recognized by the IPv6 addressing scheme. A microchip is to be fitted to every individual for keeping contact with the Grid of doctors. The microchip records every change in the person's body and sends the information to a server in the Grid. Accordingly, the server distributes the information to the related nodes in the Grid. The corresponding doctor or related person, after seeing the information, instructs the patient or, if required, takes an expert decision in this regard. Some special laboratories will also be involved in the investigation of diseases, and special equipment will be shared in the Grid. After proper discussion and investigation by the team of experts, the expert decision or line of treatment is sent back to the server and then forwarded to the patient or needy person.

6. Model Data Architecture
The model data architecture focuses on the deployment of a middleware component with the objective of interconnecting distributed heterogeneous data systems, using microchips and satellites for patient data communication. The data architecture (Fig. 1) uses servers in the Grid environment, referred to as S1, S2, S3, ..., SN. The main focus is the distributed patients, referred to as P1, P2, P3, ..., PN. The computer systems from the Grid environment contributing to the project are referred to as C1, C2, C3, ..., CN. The satellites covering the globe, and thereby the globally registered patients, provide ready information about the patients to the servers; they are referred to as ST1, ST2, ST3, ..., STN.
Fig. 1. The model data architecture for patients' Grid interoperability developed by the researchers (hereinafter called the MMC graph).
7. The Algorithm
The algorithm for "IPv6 - Addressing Grid-Interoperability Perspective for Patients in Medical Field" is hereinafter referred to as the MMC algorithm.
The MMC Algorithm:
1. INITIALIZE the system with IPv6 addressing.
2. SELECT servers S1, S2, S3, ..., SN.
3. SELECT patients P1, P2, P3, ..., PN.
4. SELECT computers C1, C2, C3, ..., CN.
5. SELECT satellites ST1, ST2, ST3, ..., STN.
6. SELECT equipment laboratories EL-1, EL-2, EL-3, ..., EL-N.
7. On a REQUEST from P1, P2, P3, ..., PN:
   1. ACTIVATE satellite.
   2. ACTIVATE server.
8. APPLY process in SERVER:
   1. IF a related database is found, send the information to the related doctors' team; GOTO LOOP-1.
   2. ELSE IF no related database is found, send the information to the experts' team; GOTO LOOP-2.
   3. ELSE IF specialized equipment help is required, send the information to the equipment laboratory; GOTO LOOP-3.
   4. OTHERWISE send back to SERVER; EXECUTE LOOP-4.
9. LOOP-1: The doctors' team verifies the databases; their opinion, along with the line of treatment, is sent back to the SERVER. GOTO LOOP-5.
10. LOOP-2: The experts' team studies the new disease and cites a line of treatment or, if required, sends the case to the investigation laboratory for sophisticated equipment help. GOTO LOOP-3.
11. LOOP-3: The results of the specialized equipment laboratory are sent back to the experts' team. GOTO LOOP-6.
12. LOOP-4: For unknown diseases, or those for which no databases are available, the experts' team sends the information back to the SERVER and to the research laboratory for investigation. GOTO LOOP-7.
13. LOOP-5: The SERVER sends the refined information back to the user's computer. END OF LOOP.
14. LOOP-6: The experts' team observes the results of the specialized laboratory equipment, draws the line of treatment, and sends the information to the SERVER. EXECUTE LOOP-5.
15. LOOP-7: The report of the research laboratory is sent back to the experts' team, which, after discussion, cites a line of treatment and sends it to the SERVER. EXECUTE LOOP-5.
16. END.
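A minimal Python sketch of the server-side dispatch in steps 8-15 is given below. All names (handle_request, the team and laboratory objects, the attribute checks) are hypothetical stand-ins for the Grid services the algorithm describes, not an actual implementation:

```python
def handle_request(patient, db, doctors, experts, equip_lab, research_lab):
    """Hypothetical routing of one patient request through the MMC loops."""
    record = db.find_related(patient)                 # step 8
    if record is not None:                            # 8.1 -> LOOP-1
        treatment = doctors.review(record, patient)
    elif patient.needs_equipment:                     # 8.3 -> LOOP-3
        results = equip_lab.run(patient)              # LOOP-3
        treatment = experts.conclude(results)         # LOOP-6
    elif patient.known_disease:                       # 8.2 -> LOOP-2
        treatment = experts.study(patient)
    else:                                             # 8.4 -> LOOP-4 / LOOP-7
        report = research_lab.investigate(patient)
        treatment = experts.conclude(report)
    return treatment                                  # LOOP-5: refined info to user
```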
8. Discussion
The MMC algorithm should be very useful in the future scenario: because of the huge population and the explosion of technology, it is rather difficult to provide medical facilities, and as a solution to this future problem the MMC algorithm will play an important role. This model aims to settle the semantic relationships between heterogeneous medical information systems. It will support integration and interoperability by sharing the heterogeneous medical databases available in the Grid. It will also support a semantically consistent data-processing schema to support information exchange between heterogeneous systems in a Grid environment. Our future work includes the actual creation and verification of the MMC model to integrate medical information systems.

9. Conclusion
From the facts of the address allocation schemes in IPv6, it is indeed a great achievement that the quantity of IPv6 addresses is so vast that every individual can be recognized in the network. This will lead to a future with almost every human being linked to the Internet, and much better networking applications and services can be made available, along with adequate security.

References
[1] http://www.linuxsecurity.com/resource_files/documentation/tcpip-security.html
[2] "SecurityDocs Comment on IPv6 Security Considerations", http://www.SecurityDocs.com
[3] Chris Chambers, Justin Dolske, and Jayaraman Iyer, "TCP/IP Security", Department of Computer and Information Science, Ohio State University, Columbus, Ohio 43210.
[4] RFC Editor, http://www.rfc-editor.org/rfcsearch.html
[5] Hyejeong Kang, Kyoung-a Yoon, Seoyoung Kim, Yoonhee Kim, and Chongam Kim, "An e-Science Problem Solving Environment for Scientific Numerical Study", ICACT 2011, Feb. 13-16, 2011, ISBN 978-89-5519-155-4.
[6] Benedyczak, K., Users' UNICORE Security Guide, December 2008.
[7] B. Blobel, Analysis, Design and Implementation of Secure and Interoperable Distributed Health Information Systems. Amsterdam, The Netherlands: IOS Press, 2002.
[8] Yunmei Shi, Xuhong Liu, Yabin Xu, and Zhenyan Ji, "Semantic-Based Data Integration Model Applied to Heterogeneous Medical Information System", IEEE, 2010.
[9] Christina Catley and Monique Frize, "Design of a Health Care Architecture for Medical Data Interoperability and Application Integration", Proceedings of the Second Joint EMBS/BMES Conference, Houston, TX, USA, October 23-26, 2002.
Indian Journal of Computer Science and Engineering, Vol. X, No. X, pp. 1-5
A VECTOR SPACE MODEL FOR INFORMATION RETRIEVAL: A MATLAB APPROACH
A. B. Manwar*
Department of Computer Science, S.G.B. Amravati University, Amravati MS, India†
[email protected]

Hemant S. Mahalle
Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal MS, India

K. D. Chinchkhede
Department of Applied Electronics, S.G.B. Amravati University, Amravati MS, India

Dr. Vinay Chavan
Department of Computer Science, S. K. Porwal College, Kamptee, Nagpur MS, India
Abstract
By and large, three classic framework models have been used in the process of retrieving information: Boolean, Vector Space, and Probabilistic. The Boolean model is a lightweight model which matches the query with precise semantics. Because of its Boolean nature, its results miss partial matches; on the contrary, the vector space model, considering term-frequency and inverse-document-frequency measures, achieves the utmost relevancy in retrieving documents in information retrieval. This paper implements and discusses the issues of an information retrieval system with the vector space model, using MATLAB on the Cranfield data collection of the aerodynamics domain.

Keywords: tf; idf; vector space model; cosine similarities; term-document; term-query matrices; dot products.
1. Introduction
An enormous amount of text material is growing at an exponential rate, especially with the increasing use and applications of the Internet. Day by day it is becoming very difficult to retrieve the relevant information. Various approaches have been used by researchers to get over the relevancy factor in information retrieval.
An information retrieval model is a quadruple consisting of a document collection, a set of queries, a framework model, and a ranking function associated with query-document pairs. A framework model may be Boolean, vector space, or probabilistic. The Boolean model matches the query with precise semantics in the document collection by Boolean operations with the operators AND, OR, and NOT. It predicts either relevancy or non-relevancy of each document, leading to the disadvantage of retrieving either very few or very many documents. The Boolean model is the lightest model; its inability to perform partial matching leads to poor performance in the retrieval of information. The vector space model was introduced by G. Salton in the late 1960s; in it, partial matching is possible. Non-binary weights are used to weight the index terms in queries and in documents. These weights are used for calculating the degree of similarity between each document and the query. The ranked document set thus obtained, in decreasing order of degree of similarity, is more precise than the result of the Boolean model. Index term weights can be calculated in many different ways. The work by Salton and McGill [1] reviews various term-weighting techniques. Although there is contention as to whether the probabilistic model outperforms the vector model, Salton and Buckley [2] showed that the vector model is expected to outperform the probabilistic model with general collections [3].
This paper implements and discusses the issues of an information retrieval system with the vector space model, using MATLAB on the Cranfield data collection of the aerodynamics domain.
The next section deals with a brief review of related work on the vector space model in information retrieval.
* Associate Professor in Computer Science, S.G.B. Amravati University, Amravati MS, India.
† Sant Gadge Baba Amravati University, Amravati, Maharashtra, India.
2. Related work
Maron and Kuhns [4], in early 1960, described a probabilistic indexing technique in a mechanized library system yielding probable relevance. Afterward, in 1983, Salton and McGill wrote a book [1] which thoroughly discusses the three classic models in information retrieval, namely the Boolean, the vector, and the probabilistic models. The book by van Rijsbergen [5] covers the discussion of the three classic models and the majority of the associated technology of retrieval systems. Frakes and Baeza-Yates [6] edited a book on information retrieval which mainly deals with the data structures used in general information retrieval systems. It also includes the issue of relevance feedback as well as some query modification techniques [7], and Boolean operations and their implementations [8]. Verhoeff, Goffman, and Belzer [9] described the shortfall of Boolean queries for information retrieval. The concept of using the Boolean formalism in other frameworks has been of great interest to researchers. Lee et al. proposed a thesaurus-based Boolean retrieval system for ranking [10].
The vector space model has been the most popular model in information retrieval among the research community because of the research outcomes in indexing and term value specification in automatic indexing carried out by Salton and his associates [11, 12]. Most of this research deals with experiments in automatic document processing and different term-weighting approaches for automatic retrieval [2, 13]. In 1972, Karen Sparck Jones introduced the concept of inverse document frequency, a measure of specificity [14, 15], and Salton and Yang used it for automatic indexing to improve retrieval [12]. Raghavan and Wong [16] analysed the vector space model critically, concluding that the vector space model is useful and provides a formal framework for information retrieval systems.
The next section gives a description of the most influential vector space model in modern information retrieval research.
3. Vector Space Model
The drawback of binary weight assignments in the Boolean model is remedied in the vector space model, which projects a framework in which partial matching is possible [11, 13]. Non-binary weights for index terms in queries and documents are used in the calculation of the degree of similarity. Ordering the retrieved documents in decreasing order of this degree of similarity gives a ranked document list with partial matches.
For the vector model, the weight $w_{i,j}$ associated with a pair $(k_i, d_j)$ is positive and non-binary. Further, the index terms in the query are also weighted. Let $w_{i,q}$ be the weight associated with the pair $[k_i, q]$, where $w_{i,q} \geq 0$. Then, the query vector $\vec{q}$ is defined as $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$, where $t$ is the total number of index terms in the system. The vector for a document $d_j$ is represented by $\vec{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$.
The vector model proposes to evaluate the degree of similarity of the document $d_j$ with regard to the query $q$ as the correlation between the vectors $\vec{d_j}$ and $\vec{q}$. This correlation can be measured by the cosine of the angle between these two vectors as

$$sim(d_j, q) = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \times |\vec{q}|} = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{t} w_{i,j}^2} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^2}}$$

where $|\vec{d_j}|$ and $|\vec{q}|$ are the norms of the document and query vectors. The factor $|\vec{q}|$ does not affect the ranking (i.e., the ordering of the documents) because it is the same for all documents. The factor $|\vec{d_j}|$ provides normalization in the space of the documents.
Since $w_{i,j} \geq 0$ and $w_{i,q} \geq 0$, $sim(d_j, q)$ varies from 0 to +1; the vector model thus ranks the documents according to their degree of similarity to the query. A document might be retrieved even if it matches the query only partially [3, pages 27-28].
In the vector space model, the frequency of a term $k_i$ inside a document $d_j$ is referred to as the tf factor and provides one measure of how well that term describes the document contents. Furthermore, the inverse of the frequency of the term $k_i$ among the documents in the collection is referred to as the inverse document frequency, or the idf factor.
The normalized frequency $f_{i,j}$ of term $k_i$ in document $d_j$ is given by

$$f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}} \qquad (1)$$

where $freq_{i,j}$ is the raw frequency of $k_i$ in $d_j$ and the maximum is computed over all terms which are mentioned in the text of the document $d_j$. The inverse document frequency $idf_i$ for $k_i$ is given by

$$idf_i = \log \frac{N}{n_i} \qquad (2)$$

where $N$ is the total number of documents and $n_i$ is the number of documents containing $k_i$. The best-known term-weighting schemes use weights which are given by

$$w_{i,j} = f_{i,j} \times \log \frac{N}{n_i} \qquad (3)$$

Such term-weighting strategies are called tf-idf schemes [3, pages 29-30].
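As a compact illustration of equations (1)-(3) and the cosine measure above, the following self-contained Python/NumPy sketch builds tf-idf weights from a toy term-frequency matrix and ranks documents against a query; the matrix and the query weights are invented purely for illustration:

```python
import numpy as np

# Toy term-frequency matrix FM: rows = terms, columns = documents.
FM = np.array([[2, 0, 1],
               [0, 3, 1],
               [1, 1, 0]], dtype=float)
N = FM.shape[1]                                # number of documents

# Eq. (1): normalized frequency f_ij = freq_ij / max_l freq_lj
f = FM / FM.max(axis=0, keepdims=True)

# Eq. (2): idf_i = log(N / n_i), n_i = number of documents containing term i
n = np.count_nonzero(FM, axis=1)
idf = np.log(N / n)

# Eq. (3): w_ij = f_ij * idf_i
W = f * idf[:, np.newaxis]

# Cosine similarity of a query vector against every document column.
q = np.array([1.0, 0.0, 1.0])                  # toy query weights
sims = (W.T @ q) / (np.linalg.norm(W, axis=0) * np.linalg.norm(q) + 1e-12)
print(np.argsort(-sims))                       # document ranking, best first
```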
The success of the vector space model lies in its term-weighting scheme, its partial matching strategy, and its similarity measure. The assumed mutual independence of index terms is said to be a disadvantage of the vector space model, but in practice, taking term dependencies into account has not proved fruitful. From the research in the field, it seems that the vector model is either superior or almost as good as the known alternatives.
4. Experimental Evaluations
4.1. Dataset for the information retrieval system
We used the Cranfield collection, comprising 1398 abstracts from the aerodynamics domain, obtained from [17]. The collection contains a compressed version of the document text, a relation giving relevance judgements, the text of 225 queries, indexed documents, and indexed queries. We also used the stop-list obtained from the collection source.
4.2. Preprocessing
The compressed version of the document text was preprocessed to obtain a set of 1398 individual abstract files; a PHP script was written for this purpose. Stemming was not applied, for the following reasons: i) it loses the context of the search, ii) it may reduce precision, and iii) it cannot be applied to proper nouns. To have a clear view of relevancy, instead of using the given set of queries and relevance judgements, we created our own set of queries from the 'titles' of the 1398 documents, forming a query set of 1398 entities.
4.3. Implementation
MATLAB is used to implement the vector space model for information retrieval; the complete process is depicted in Fig. 1.
Fig. 1. Implementation of the vector space model for information retrieval.
As shown in the block diagram, the implementation consists of three stages:
1. Generation of the term-frequency matrix
The term-document matrix FM is a $t \times n$ term-frequency matrix, with $t$ unique terms in the dictionary and $n$ documents. The elements of FM are represented as $fm_{i,j}$, where each element indicates the frequency of term $k_i$ in document $d_j$.
The Cranfield data collection is preprocessed and converted into 1398 individual text files. A SMART stop-word text file, which is available with the dataset, is used for the removal of stop words from the data collection of 1398 files. Non-embedding special characters and numerals have also been removed from these files. 79,728 words were collected, which were then processed to find the frequency of unique words in each document. The dictionary of unique words contains 7805 words. Thus the term-frequency matrix is of size 7805 x 1398.
2. Generation of the query matrix
The title of each abstract, after removing stop-list words and non-embedding special characters, is used as a query, contributing to a set of 1398 unique queries. Here we have taken the queries to be the titles of the documents, instead of the dataset queries, so as to judge the relevancy more profoundly. The generated query matrix for the 1398 queries, over the same dictionary, is of size 7805 x 1398.
3. Term-weight calculations and result
The term-frequency matrix is processed to get the term weights, considering the tf-idf scheme. These term weights are calculated by the formula

$$w_{i,j} = \frac{f_{i,j} \times idf_i}{\|d_j\|}$$

where
$w_{i,j}$ is the term weight, i.e., the weight of term $k_i$ in document $d_j$;
$f_{i,j}$ is the frequency of term $k_i$ in document $d_j$;
$idf_i$ is the inverse document frequency, discounting terms appearing in many documents, calculated by the formula $idf_i = \log(N/n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents containing term $k_i$;
$\|d_j\|$ is the Euclidean length, obtained by taking the square root of the sum of squares of the individual term weights per document.
The query matrix, of size 7805 x 1398, is also divided by the corresponding Euclidean lengths to obtain normalized weights. For each query, the transpose of the query vector is multiplied by the term-weight matrix. The final result is obtained by ordering the weights in a result matrix in decreasing order.
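Stage 3 amounts to normalized matrix products followed by a sort. A minimal NumPy sketch with toy shapes (the real matrices are 7805 x 1398) is given below; the random data is a placeholder for the tf-idf-weighted matrices:

```python
import numpy as np

# W: term-weight matrix (terms x docs), Q: query-weight matrix (terms x queries),
# both assumed already tf-idf weighted; toy shapes stand in for 7805 x 1398.
rng = np.random.default_rng(0)
W = rng.random((5, 4))
Q = rng.random((5, 3))
W /= np.linalg.norm(W, axis=0, keepdims=True)   # Euclidean length normalization
Q /= np.linalg.norm(Q, axis=0, keepdims=True)
scores = Q.T @ W                                # (queries x docs) cosine scores
ranking = np.argsort(-scores, axis=1)           # best-matching document first
print(ranking[0])                               # ranked documents for query 0
```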
4.4. Testing phase
The vector space model (VSM) implemented above was tested thoroughly in different ways; the details of the experimentation are explained below.
Fig. 2 shows the distribution of index terms in the dictionary for individual documents. The dictionary consists of 7805 unique terms.
Fig. 2. Index terms in the dictionary.
Fig. 3 shows the frequency count of each unique term in the dictionary, distributed over the complete dataset. Some of the unique terms with high frequency across the entire document set are shown, such as ('flow', 2059), ('pressure', 1245), ('boundary', 1076), and ('results', 897).
Fig. 3. Frequency count of each unique term in the data collection.
Looking further into the dataset, fig. 4 shows a subset of 380 documents out of the 1398 abstracts. In document 329 the total number of terms is 342, but only 162 of them are unique.
Fig. 4. Document-wise total and unique terms.
Table 1 shows a subset of 20 queries out of the 1398 queries tested against the vector space model for the retrieval of relevant documents. In the present study, the most relevant document is the one whose ID matches the query ID.
Table 1. Subset of 20 queries
Q.ID Details of Queries
1 experimental investigation of the aerodynamics of a wing in a slipstream .
2 simple shear flow past a flat plate in an incompressible fluid of small viscosity .
3 the boundary layer in simple shear flow past a flat plate .
4 approximate solutions of the incompressible laminar boundary layer equations for a plate in shear flow .
5 one-dimensional transient heat conduction into a double-layer slab subjected to a linear heat input for a small time internal .
6 one-dimensional transient heat flow in a multilayer slab .
7 the effect of controlled three-dimensional roughness on boundary layer transition at supersonic speeds .
8 measurements of the effect of two-dimensional and three-dimensional roughness elements on boundary layer transition .
9 transition studies and skin friction measurements on an insulated flat plate at a mach number of 5.8 .
10 the theory of the impact tube at low pressure .
11 similar solutions in compressible laminar free mixing problems .
12 some structural and aerelastic considerations of high speed flight .
13 similarity laws for stressing heated wings .
14 piston theory - a new aerodynamic tool for the aeroelastician .
15 on two-dimensional panel flutter .
16 transformation of the compressible turbulent boundary layer .
17 remarks on the eddy viscosity in compressible mixing flows .
18 the flow field in the diffuser of a radial compressor .
19 an investigation of the pressure distribution on conical bodies in hypersonic flows .
20 generalised-newtonian theory .
The results of the experimentation are tabulated in Table 2 below.
Table 2. Tabulation of results of first ten queries
Further query-wise details are depicted in Tables 3 to 5, respectively.
Table 3. Frequency of terms in retrieved document set for query 1
Table 4. Frequency of terms in retrieved document set for query 2
Table 5. Frequency of terms in retrieved document set for query 3
Table 3 depicts the results obtained for query 1, with the query terms and their occurrences in the documents; the first 18 documents are listed. Similar results are shown in Tables 4 and 5 for queries 2 and 3, respectively.
4.5. Results and discussion
Even after taking the 'titles' of the abstract documents as the query set, the final retrieval result is 89.41%. 148 queries did not return the expected document as the first retrieved document; however, within the range of the first two retrieved documents, the result obtained is 94.99%. 2.22% of the queries did not retrieve the correct result within the range of the first five documents.
Inter-document characterization and document frequency play a vital role in building the ranks of the documents in the vector space model. For instance, the term counts of the documents d3, d389, and d2 are 24, 42, and 110 respectively, and the collection frequencies of the terms 'flow', 'plate', and 'small' are 2059, 421, and 306 respectively, much higher than the average frequency; as a result, Table 4 shows the outcome of query q2 as document d3, whereas the expected relevant document is d2. A variety of the weight calculation formulas suggested by Salton and Buckley [2] have been tested on this collection, but we found that the standard tf-idf weighting scheme gives the best results.
References
[1] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Co., New York,
1983.
[2] G. Salton and C. Buckley. Term-weighting approaches in automatic retrieval. Information Processing and
Management, 24(5):513-523, 1988.
[3] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schutze, Introduction to Information Retrieval,
Cambridge University Press, New York, USA, 2008.
[4] M. E. Maron and J. L. Kuhns. On relevance, probabilistic indexing and information retrieval. Association for
Computing Machinery, 7(3):216-244, 1960.
[5] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
[6] W. B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall,
Englewood Cliffs, NJ, USA, 1992.
[7] D. Harman. Relevance feedback and other query modification techniques. In W. B. Frakes and R. Baeza-Yates,
editors, Information Retrieval: Data Structures and Algorithms, pages 241-263. Prentice Hall, Englewood Cliffs,
NJ, USA, 1992.
[8] S. Wartick. Boolean operations. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data
Structures and Algorithms, pages 264-292. Prentice Hall, Englewood Cliffs, NJ, USA, 1992.
[9] J. Verhoeff, W. Goffmann, and Jack Belzer. Inefficiency of the use of Boolean functions for information
retrieval systems. Communications of the ACM, 4(12):557-558, 594, December 1961.
[10] J. H. Lee, W. Y. Kim, and Y. H. Lee. Ranking documents in thesaurus-based Boolean retrieval systems.
Information Processing and Management, 30(1):79-91, 1993.
[11] G. Salton and M. E. Lesk. Computer evaluation of indexing and text processing. Journal of the ACM, 15(1):8-
36, January 1968.
[12] Gerard Salton and C. S. Yang. On the specification of term values in automatic indexing. Journal of
Documentation, 29:351-372, 1973.
[13] G. Salton. The SMART Retrieval System – Experiments in Automatic Document Processing. Prentice Hall Inc.,
Englewood Cliffs, NJ, 1971.
[14] K. Sparck Jones. A statistical interpretation of term specificity and its application to retrieval. Journal of
Documentation, 28(1):11-20, 1972.
[15] K. Sparck Jones. A statistical interpretation of term specificity and its application to retrieval. Information
Storage and Retrieval, 9(11):619-633, 1973.
[16] V. V. Raghavan and S. K. M. Wong. A critical analysis of vector space model for information retrieval. Journal
of the American Society for Information Sciences, 37(5):279-287, 1986.
[17] Cranfield collection, ftp server of Cornell University: ftp://ftp.cs.cornell.edu/pub/smart/cran/
GRID DATA MINING USING K-MEANS ALGORITHM
Hemant S. Mahalle1, Assoc. Prof. & Head, Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal (M.S.)
A. B. Manwar2, Assoc. Prof., Department of Computer Science, S.G.B. Amravati University, Amravati (M.S.)
Dr. Vinay Chavan3, Assoc. Prof. and Head, Department of Computer Science, S.K. Porwal College, Kamptee, Nagpur
Abstract— The k-means algorithm is one of the best-known algorithms used for data mining. Cluster computing and Grid computing are emerging technologies in today's Internet scenario. Resources are geographically distributed, and data and knowledge are to be shared among users from those locations. For this, efficient algorithms for mining the web are required. Several algorithms have been developed for data mining, but few tend toward exact results. Almost all algorithms use keywords as the basis for searching. In Grid computing, the same algorithms have been used to mine data. In this paper we have tried to use the Grid, along with the cluster concept, for a specific application of data mining using the k-means algorithm, grouping items into k clusters.
Keywords— data mining, Grid computing, cluster, resource sharing
I. INTRODUCTION
Grid computing is a platform for coordinated resource sharing and problem solving among virtual organizations. It provides direct access to computers, software, data, and high-performance equipment. The Grid uses grid services to access and use grid resources. New grid services need to be invented for efficient and effective data and knowledge mining. To improve the effectiveness of data mining, the k-means algorithm can be used in many ways by adding modules. Grid computing uses the Internet to enable sharing of geographically distributed resources and equipment. Web services can be used effectively for communication in the grid environment.
A web service is a way to expose the functionality of an information system or application and to make it available via web technologies [1].
Through a middleware such as the Globus Toolkit, we can obtain a set of services and software libraries to support Grid resources and grid applications [2]. Grid services (GS) need to apply resource discovery to be functional. However, the conventional search of services sometimes does not provide good results; in these cases, the Semantic Web tries to solve the problem by adding meaning to GS, and thus the search is improved and better results are obtained [3].
II. K-MEANS ALGORITHM [4]
The k-means algorithm is a simple iterative method to partition a given dataset into a user-specified number of clusters, k. This algorithm has been discovered by several researchers across different disciplines. The algorithm operates on a set of d-dimensional vectors, D = {x_i | i = 1, ..., N}, where x_i ∈ R^d denotes the i-th data point. The algorithm is initialized by picking k points in R^d as the initial k cluster representatives or "centroids". Techniques for selecting these initial seeds include sampling at random from the dataset, setting them as the solution of clustering a small subset of the data, or perturbing the global mean of the data k times. The algorithm then iterates between two steps till convergence:
Step 1: Data assignment. Each data point is assigned to its closest centroid, with ties broken arbitrarily. This results in a partitioning of the data.
Step 2: Relocation of "means". Each cluster representative is relocated to the center (mean) of all data points assigned to it. If the data points come with a probability measure (weights), then the relocation is to the expectation (weighted mean) of the data partitions.
The algorithm converges when the assignments no longer change. In the k-means algorithm each iteration needs N × k comparisons, which determines the time complexity of one iteration. The number of iterations required for convergence varies and may depend on N, but as a first cut, the algorithm can be considered linear in the dataset size [4].
III. CLUSTERING IN GRID
In the Grid architecture, resources are geographically diverse, and different resources contain different attributes for a common set of entities. Every resource learns the cluster of each entity but does not have knowledge of the other resources' attributes. Here we have tried to find a solution by using k-means clustering, which is a simple technique to group items into k clusters.
The k-means clustering algorithm requires initial approximations for the values or positions of the k means. The choice of initial points is very important in determining the final solution of the k-means algorithm. The clusters of datasets from the participating computers in the Grid are selected as the initial approximations of the positions of the k means. This topic leads to a mechanism for producing an appropriate initial assignment. The classic k-means clustering approach uses multiple samples of the data sets to get the initial points. Here the k means are selected arbitrarily because the new operations involve k-means clustering.
IV. GRID DATA MINING ALGORITHM USING K-MEANS
The k-means algorithm takes k, the number of desired clusters, and the data points as input, and produces k clusters as output. For the general partitioning strategy the algorithm follows this iterative pseudocode [5]:

PartitioningCluster(Data D, int k)
1) Select k prototypes t1, t2, ..., tk from D
2) For i = 1, ..., k: initialize cluster Ci = {ti}
3) Repeat:
4)   For each point p in D:
5)     Let ti be the prototype that minimizes d(ti, p)
6)     Put p in cluster Ci
7)   quality = ClusteringQuality(C1, C2, ..., Ck)
8)   Re-compute prototypes
9) Until quality does not change
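A runnable Python sketch of this pseudocode, under the simplifying assumptions of plain Euclidean distance and arbitrary (first-k) prototype selection, might look as follows:

```python
import numpy as np

def partitioning_cluster(D, k, max_iter=100):
    """Minimal k-means over the rows of D (an N x d array),
    mirroring the pseudocode above."""
    prototypes = D[:k].astype(float).copy()   # step 1: arbitrary initial prototypes
    labels = None
    for _ in range(max_iter):                 # steps 3-9
        # Steps 4-6: assign each point to the prototype minimizing d(t, p).
        dists = np.linalg.norm(D[:, None, :] - prototypes[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                             # step 9: assignments unchanged
        labels = new_labels
        # Step 8: re-compute each prototype as the mean of its cluster.
        for i in range(k):
            members = D[labels == i]
            if len(members):
                prototypes[i] = members.mean(axis=0)
    return labels, prototypes
```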
The k-means algorithm is characterized by the following details [5]:
1) First, it selects the initial k prototypes arbitrarily.
2) The squared-error criterion is used to determine the clustering quality.
3) In each iteration, the prototype of each cluster is re-computed to be the cluster mean.
4) K-means does not include any sampling techniques to scale to huge databases.
Example: consider the task of clustering the jobs in the Grid environment based on two numerical attributes:
1) the number of computers actually taking part in the sharing of databases in the grid environment, and
2) the number of databases shared by these computers for different jobs of various organizations.

TABLE I. DATA FOR K-MEANS EXAMPLE
Company  Jobs  No. of Computers  No. of Databases
C1       J1    60                64
C2       J2    55                55
C3       J3    46                54
C4       J4    50                46
C5       J5    58                65
C6       J6    52                60

The graph plotted for these points is as follows.
Iteration 1. The initial prototypes are selected arbitrarily; let the points J1, J2, and J3 be the cluster means (k = 3). The algorithm decides to which cluster each remaining point should belong, based on how close it is to the prototypes. Calculating the Euclidean distances, we obtain

d(J4,J1) = sqrt((50-60)^2 + (46-64)^2) = 20.59
d(J4,J2) = sqrt((50-55)^2 + (46-55)^2) = 10.30
d(J4,J3) = sqrt((50-46)^2 + (46-54)^2) = 8.94
d(J5,J1) = sqrt((58-60)^2 + (65-64)^2) = 2.24
d(J5,J2) = sqrt((58-55)^2 + (65-55)^2) = 10.44
d(J5,J3) = sqrt((58-46)^2 + (65-54)^2) = 16.28

It is clear from the Euclidean distances that J5 is closer to J1 than to J2; hence it should belong to the cluster for which J1 is the prototype. Similarly, J4 should belong to the cluster for which J3 is the prototype. (The remaining point J6 is assigned analogously; it falls in the cluster of J2.)
The clustering quality for the reassigned points J4 and J5 is calculated using the squared-error criterion as

E = d(J5,J1)^2 + d(J4,J3)^2 = 5 + 80 = 85

Iteration 2. The algorithm re-computes the prototypes by calculating the cluster means. For the cluster containing J1 and J5 the mean is ((58+60)/2, (65+64)/2) = (59, 64.5); similarly the other cluster means are calculated, and the Euclidean distances are computed along with the clustering quality.
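The distances and the squared-error value above can be checked with a few lines of Python, using the (computers, databases) pairs from Table I:

```python
import math

points = {"J1": (60, 64), "J2": (55, 55), "J3": (46, 54),
          "J4": (50, 46), "J5": (58, 65), "J6": (52, 60)}

def d(a, b):
    """Euclidean distance between two named points."""
    return math.dist(points[a], points[b])

print(round(d("J4", "J3"), 2), round(d("J5", "J1"), 2))  # 8.94 2.24
print(round(d("J5", "J1")**2 + d("J4", "J3")**2))        # squared error E = 85
```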
V. ANALYSIS OF THE ALGORITHM
In each iteration, the k-means algorithm computes the distance of each point from each of the prototypes. If there are n points, k prototypes, and t iterations, then the complexity of the algorithm is O(nkt). Thus, the algorithm is linearly scalable with respect to the data size [5].
VI. DISCUSSION AND CONCLUSION
K-means is closely related to fitting a mixture of k isotropic Gaussians to the data; more generally, it is related to fitting the data with a mixture of k components from the exponential family of distributions. Another broad generalization is to view the "means" as probabilistic models instead of points in R^d [4]. Here, in the assignment step, each data point is assigned to the model most likely to have generated it; in the "relocation" step, the model parameters are updated to best fit the assigned datasets. This provides a solution to data mining in the Grid for more complex and scattered data. The k-means algorithm is widely used because of its simplicity and its partitional clustering. It can easily be modified to deal with streaming data: the approximations to the true means are iteratively refined, and at each iteration every point is assigned to the proper cluster.
VII. LIMITATIONS
1. K-means is a limiting case of fitting data by a mixture of k Gaussians with identical, isotropic covariance matrices, when the soft assignments of data points to mixture components are hardened to allocate each data point solely to the most likely component.
2. The cost of the optimal solution decreases with increasing k until it hits zero when the number of clusters equals the number of distinct data points. This makes it more difficult to (a) directly compare solutions with different numbers of clusters and (b) find the optimum value of k. If the desired k is not known in advance, one will typically run k-means with different values of k and then use a suitable criterion to select one of the results.
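To illustrate point 2, one can sweep k and inspect the squared-error cost; a hedged sketch, reusing the partitioning_cluster function from the earlier code and the data of Table I (the cost of the optimal solution decreases with k, and these plain runs approximate that curve):

```python
import numpy as np

D = np.array([[60, 64], [55, 55], [46, 54],
              [50, 46], [58, 65], [52, 60]], dtype=float)

for k in range(1, 6):
    labels, protos = partitioning_cluster(D, k)
    # Squared-error cost: sum of squared distances to assigned prototypes.
    cost = sum(np.linalg.norm(D[i] - protos[labels[i]])**2 for i in range(len(D)))
    print(k, round(cost, 2))   # the cost shrinks toward 0 as k grows
```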
REFERENCES
[1] G. Alonso, F. Casati, H. Kuno, and V. Machiraju, "Web Services: Concepts, Architectures, and Applications", Springer Verlag, 2004.
[2] I. Foster, J. Nick, C. Kesselman, and S. Tuecke, "The Physiology of the
Grid: An Open Grid Services Architecture for Distributed Systems
Integration", 2002.
[3] M. Muhammad, L. Yuan, and Z. Jianqiu, "Web 3.0: A Real Personal Web", 2009 Third International Conference on Next Generation Mobile Applications, Services and Technologies, 2009, pp. 125-128.
[4] Survey paper, IEEE International Conference on Data Mining (ICDM), published 4 December 2007.
[5] Vikram Pudi and P. Radha Krishna, Data Mining, Oxford University Press.
A Survey of Approaches for Ontology Design
A. B. Manwar1, H. S. Mahalle2, Dr. Vinay Chavan3
1 Department of Computer Science, SGB Amravati University, Amravati MS, India
2 Department of Computer Science, P. N. College, Pusad, Dist. Yavatmal MS, India
3 Department of Computer Science, S. K. Porwal College, Kamptee, Nagpur MS, India
Abstract- The semantic web is the next-generation web with structured information, that is, with well-defined semantics. Ontologies are defined by Gruber as 'vocabularies of representational terms (classes, relations, functions, object constants) with agreed-upon definitions in the form of human-readable text and machine-enforceable, declarative constraints on their well-formed use'. Ontology plays a vital role in augmenting relevancy in information retrieval. In this paper, we survey the prominent approaches that have been proposed for the construction of ontologies. We also discuss some key issues in the design process of ontologies.
I. INTRODUCTION
"The Semantic Web is not a separate Web but an extension of the current one, in which information is given a well-defined meaning, better enabling computers and people to work in cooperation" [1].
The notion of the Semantic Web by Tim Berners-Lee has created multidimensional directions of research, with ontology at its core as a suitable representation of the knowledge hidden in the massive data of the web. Several issues need to be considered related to the methodologies, tools, and languages to be used while constructing a new ontology. Many approaches have been proposed for developing ontologies which allow users to define new concepts, relations, and instances. The interrelationships between entities, the domain vocabulary, and factual knowledge are key factors in designing an effective tool. However, the hassle lies in the choice of a proper tool for modeling, editing, and exploiting the various ontologies. Knowledge representation in domain ontologies developed with a standard formalism enables interoperability among ontologies of similar domains; moreover, this formal structure leads to ontology learning [2].
In this paper, we survey some of the known relevant approaches for building ontologies. The next section introduces the related work on approaches for ontology construction. Section III reviews different approaches for building ontologies. Finally, Section IV is a discussion which summarizes the paper. The survey draws on the ontology development community and should be valuable for ontology developers.
II. RELATED WORK
So far, many approaches have been reported for ontology building. In 1990, Lenat and Guha suggested the approach used for the Cyc development [3]. The method used to build the Cyc KB consists of three phases: the first phase was the manual extraction of implicit knowledge, while the second and third phases comprise acquiring and managing knowledge through natural language or machine learning. Uschold and King's method proposes capturing knowledge, coding it, and integrating other ontologies inside the current one [4]. Gruninger and Fox propose a methodology founded on knowledge-based systems using first-order logic, which takes advantage of the robustness of classic logic [5]. In the method proposed in the KACTUS project, the ontology is built on the basis of an application KB by means of a process of abstraction, that is, following a bottom-up strategy [6].
The only method that suggests collaborative construction is CO4 [7]. This method includes an approach for fitting new pieces of knowledge into a previously agreed underlying knowledge architecture. Natalya Fridman Noy considered the major strengths and contributions and outlined the weaknesses of different approaches in the ontology design process [8]. In 1997, a new methodology was suggested for building ontologies based on the SENSUS ontology [9]. The SENSUS-based method is a top-down approach for deriving domain-specific ontologies from huge ontologies. Consequently, this approach promotes shareability of knowledge, since the same base ontology is used to develop ontologies in particular domains. In 1999, Fernandez-Lopez analysed the methodologies and noted incompatibility with the IEEE standard in all of them except the SENSUS approach [10]. A comparative and detailed study of these methods and methodologies can be found in [11]. METHONTOLOGY is a methodology, created in the Artificial Intelligence Lab at the Technical University of Madrid (UPM), for building ontologies either from scratch, reusing other ontologies as they are, or by a process of re-engineering them. The METHONTOLOGY framework enables the construction of ontologies at the knowledge level. The main phase in the ontology development process using the METHONTOLOGY approach is the conceptualization phase [12].
Many other methods and methodologies have also been proposed for other tasks, such as ontology re-engineering [13], ontology learning [2, 14], ontology evaluation [15, 16, 17], ontology evolution [18, 19], and ontology merging [20].
To gain the maximum potential from ontologies, Dean Jones et al. [21] suggest considering an approach based on the variety of available experience rather than just one or two. In 2001, the On-To-Knowledge methodology came out under the name OnToKnowledge [22]. In most cases there is no correspondence between ontology building methodologies and tools; since there is no technological support for most of the existing methodologies, they cannot be easily applied in the ontology construction task. Oscar et al. [23] discussed the missing integrated environments for ontology development, with the exceptions of Protégé, OntoEdit, and WebODE. In addition, the WebODE [24] and ODE [25] tools provide support for METHONTOLOGY.
Man Li et al. [26] discussed the challenging issues in constructing large-scale ontologies and proposed a practical collaborative ontology development method. In The Hitchhiker's Guide to Ontology Editors, Loredana Laera and Valentina Tamma [27] briefly surveyed many ontology development tools, remarking that most of them only give support for designing and implementing ontologies, skipping support for the other activities of the ontology life cycle. Li Ding et al. [28] provided a pragmatic view of Semantic Web ontologies via a comparison of their chronological evolution, a survey, and a case study. The technical report of Md. Ahsan-ul Murshed and Ramanjit Singh [29] proposed ontology evaluation criteria and a ranking technique to evaluate ontology editors, commenting that Protege-2.1.2 can be used to construct small-sized ontologies, OntoEdit can be used to develop medium to large-sized ontologies, and Metis-3.4 is good for enterprise architecture modeling. Eva Blomqvist showed the use of patterns as templates for creating ontology design patterns for the automatic construction of enterprise ontologies [30]; however, this fits the requirements of a small-scale application context. Later, the general outline of a case-based-reasoning-inspired approach for pattern-based ontology construction, called OntoCase, was presented [31]. Sabin C. Buraga, Liliana Cojocaru, and Ovidiu C. Nichifor drew several remarks and results from a survey of Web ontology editing tools, which dealt with the practical deployment of a test ontology via a suite of tools and focused on some important technical details and features [32]. Mohammad Nazir Ahmad and Robert M. Colomb briefly reported on the technologies of ontology and the semantic web, particularly ontology server development [33]. Ling Li et al. presented a visual ontology modeling tool, VOEditor, for ontology engineers to create and edit their ontologies in a graphical mode, and a multi-agent ontology-aware system for querying Semantic Web resources [34]. Gianluca Correndo and Harith Alani presented a brief survey of tools submitted to the workshop on Social and Collaborative Construction of Structured Knowledge [35].
III. APPROACHES FOR ONTOLOGY DESIGN
Some of the relevant approaches for ontology development are reviewed briefly in this section.
IEEE defines a methodology as "a comprehensive integrated series of techniques or methods creating a general systems theory of how a class of thought-intensive work ought to be performed" [36]. Although most ontology methodologies conform to software and knowledge systems engineering methodologies, they are oriented to the more specific task at hand. Thus, a methodology is composed of methods, techniques, processes, and activities, and may follow one of several approaches to developing an ontology: bottom-up, top-down, or middle-out.
A. Approach 1
This methodology distinguishes manual extraction, machine-aided extraction and machine-managed extraction of common-sense knowledge. Lenat and Guha described the process followed to model Cyc, an ontology originally conceived to capture shared knowledge about the world [3].
B. Approach 2
The TOVE project established another approach: (1) specification of the environment motivating the ontology and its application, (2) a set of questions posed as requirements for the ontology, (3) specification of the terminology, i.e. objects, attributes and relations, (4) specification of constraints over the terminology, and (5) evaluation. Later, Gruninger and Fox established several steps towards the methodological construction and evaluation of ontologies based on their experience building the TOVE project ontologies [5].
C. Approach 3
Uschold et al. [4, 37, 38] proposed a set of guidelines for ontology construction and merging: (1) identifying the purpose, with recognition of the user domain, assessment of the purposes, and choice of prompting scenarios and competency questions; (2) the process of ontology building, which includes ontology capture to identify the key concepts and relationships in the domain of interest (called scoping, using a middle-out approach), setting precise, unambiguous text definitions for such concepts and relations, ontology coding, and integrating existing ontologies; (3) an evaluation and revision cycle, for which Uschold indicates the clarity, consistency and coherence criteria and checking against the purpose or the competency questions.
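To make the use of competency questions concrete, the sketch below (an illustration only, not part of Uschold’s guidelines) encodes one competency question as a SPARQL query over a toy ontology using the Python rdflib library; every name in it (ex:Disease, ex:treats, and so on) is hypothetical.

    # A competency question checked against a toy ontology (rdflib).
    from rdflib import Graph, Namespace, RDF, RDFS

    EX = Namespace("http://example.org/onto#")
    g = Graph()

    # Ontology capture: a few key concepts and relations (middle-out scoping).
    g.add((EX.Disease, RDF.type, RDFS.Class))
    g.add((EX.Treatment, RDF.type, RDFS.Class))
    g.add((EX.Headache, RDF.type, EX.Disease))
    g.add((EX.Aspirin, RDF.type, EX.Treatment))
    g.add((EX.Aspirin, EX.treats, EX.Headache))

    # Competency question: "Which treatments treat which diseases?"
    CQ = "SELECT ?t ?d WHERE { ?t ex:treats ?d . ?d rdf:type ex:Disease }"

    # Evaluation: the ontology answers the question iff the query returns rows.
    rows = list(g.query(CQ, initNs={"ex": EX, "rdf": RDF}))
    print("competency question answered:", len(rows) > 0)
    for t, d in rows:
        print(t, "treats", d)

An ontology passes such a check when each of its competency questions returns a non-empty, correct answer set.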
D. Approach 4
The KACTUS methodology is mainly centered on the development of ontologies for specific applications using a bottom-up approach. In this design process, ontologies developed for other purposes in the same domain may be reused. In the first step, the specification of the application, with its relevant terms and tasks, is gathered. In the second step, a preliminary design is made considering the previous lists and specification, which includes a search for already developed ontologies. In the final step, the ontology is refined towards the final design of the application [39].
E. Approach 5
METHONTOLOGY is a methodology described in Fernandez-Lopez et al. [40], Fernandez-Lopez [41], Fernandez-Lopez and Gomez-Perez [42] and Gomez-Perez et al. [43]. It describes the different steps to be taken in the conceptualization process of an ontology, and considers development activities as well as management and support activities: integration, ontology verification, validation and assessment, documentation, and configuration management. The development activities are: specification of the purpose and scope of the ontology; conceptualization, i.e. forming a glossary of terms with definitions, synonyms and acronyms, adopting a middle-out strategy; classification of terms into one or more taxonomies of concepts; building the concept-relation dictionary; setting axioms and rules; formalization; implementation; and maintenance. Finally, several tools such as ODE and WebODE were built to give technological support to this methodology, although several other existing tools may also be used, such as Protégé, OntoEdit and KAON, among others [43, 44].
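Purely as an illustration, the sketch below walks through a fragment of METHONTOLOGY’s conceptualization activities for a hypothetical travel domain, using the Python rdflib library: a glossary of terms with definitions is classified into a small taxonomy and then implemented as RDFS.

    # Glossary -> taxonomy -> RDFS implementation (hypothetical travel domain).
    from rdflib import Graph, Namespace, RDF, RDFS, Literal

    EX = Namespace("http://example.org/travel#")

    # Specification/conceptualization: glossary of terms with text definitions.
    glossary = {
        "Lodging": "A place where travellers can stay overnight.",
        "Hotel": "A lodging that offers paid rooms and services.",
        "Hostel": "A low-cost lodging with shared rooms.",
    }

    # Classification of terms into a taxonomy of concepts (child -> parent).
    taxonomy = {"Hotel": "Lodging", "Hostel": "Lodging"}

    # Formalization/implementation: emit the conceptual model as RDFS.
    g = Graph()
    g.bind("ex", EX)
    for term, definition in glossary.items():
        g.add((EX[term], RDF.type, RDFS.Class))
        g.add((EX[term], RDFS.comment, Literal(definition)))
    for child, parent in taxonomy.items():
        g.add((EX[child], RDFS.subClassOf, EX[parent]))

    print(g.serialize(format="turtle"))

Tools such as ODE and WebODE were designed to support exactly this progression from intermediate representations (glossaries, taxonomies, dictionaries) to a formal encoding.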
F. Approach 6
Noy and McGuinness suggested a simple knowledge-engineering methodology in seven steps [45]. It is presented just as a set of ideas to aid ontology construction: “there is no single correct ontology-design methodology and we did not attempt to define one. The ideas that we present here are the ones that we found useful in our own ontology-development experience” [45]. They used Protégé-2000 as the ontology development environment to present their examples. The seven steps are: (1) determine the domain and the scope of the ontology; (2) consider reusing existing ontologies; (3) enumerate important terms in the ontology; (4) define the classes and the class hierarchy, following approaches such as top-down, bottom-up or a combination; (5) define the properties of classes (slots), such as intrinsic and extrinsic properties, parts (if structured) and relationships to other individuals; (6) define the facets of the slots, such as slot value type (string, number, boolean, enumerated or instance-type slots) and domain and range; (7) create individual instances of classes in the hierarchy. This guide is referenced on the Protégé ontology editor webpage.
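Under the assumption of the wine domain used in the guide, the following minimal sketch (Python, rdflib) maps the seven steps onto code; the class, slot and instance names are illustrative rather than taken verbatim from the authors’ ontology.

    # The seven steps of Ontology Development 101, sketched with rdflib/OWL.
    from rdflib import Graph, Namespace, Literal, RDF, RDFS, OWL, XSD

    EX = Namespace("http://example.org/wine#")
    g = Graph()
    g.bind("ex", EX)

    # Steps 1-2: domain/scope fixed as "wine"; reuse would call g.parse(<IRI>).
    # Steps 3-4: enumerate important terms, define classes and the hierarchy.
    for cls in (EX.Wine, EX.RedWine, EX.Winery):
        g.add((cls, RDF.type, OWL.Class))
    g.add((EX.RedWine, RDFS.subClassOf, EX.Wine))

    # Steps 5-6: define slots and their facets (value type, domain, range).
    g.add((EX.producedBy, RDF.type, OWL.ObjectProperty))
    g.add((EX.producedBy, RDFS.domain, EX.Wine))
    g.add((EX.producedBy, RDFS.range, EX.Winery))
    g.add((EX.vintageYear, RDF.type, OWL.DatatypeProperty))
    g.add((EX.vintageYear, RDFS.range, XSD.integer))

    # Step 7: create individual instances of classes in the hierarchy.
    g.add((EX.ChateauX, RDF.type, EX.Winery))
    g.add((EX.ChateauXMerlot, RDF.type, EX.RedWine))
    g.add((EX.ChateauXMerlot, EX.producedBy, EX.ChateauX))
    g.add((EX.ChateauXMerlot, EX.vintageYear, Literal(1998, datatype=XSD.integer)))

    print(g.serialize(format="turtle"))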
G. Approach 7
The ON-TO-KNOWLEDGE (OTK) methodology is extensively described in Sure [46, 48] and Sure, Staab and Studer [47], and was influenced by the methodological proposals of Uschold and colleagues, CommonKADS and METHONTOLOGY. The methodological process is divided into five steps: (1) feasibility study, identifying stakeholders (users and supporters of the system) and use cases describing usage scenarios; (2) ontology kickoff, which initiates the development of the ontology; (3) refinement; (4) evaluation, mainly consistency and language-conformity checking; (5) application.
H. Approach 8
DILIGENT (DIstributed, Loosely-controlled, evolvInG Engineering of oNTologies) is a methodology focused on a specific modelling scenario. The DILIGENT process includes five activities: (1) build, (2) local adaptation, (3) analysis, (4) revision, and (5) local update. In each of these activities, domain experts, users, knowledge engineers and ontology engineers are assigned different roles [49].
I. Approach 9
The TERMINAE method for the development of ontologies is divided into four steps, and it is assumed that “(1) the ontology builder should have a comprehensive knowledge of the domain (2) the ontology builder knows well how the ontology will be used. The alignment process takes place during the construction” [52]. The TERMINAE modelling steps are: (1) corpus constitution, i.e. corpus selection and organization; (2) linguistic study, i.e. linguistic analysis using several natural language processing tools; (3) normalization according to some structuring principles and criteria; (4) formalization and validation with a variant language of description logics [50, 51, 53].
J. Approach 10
UPON (Unified Process for ONtology) builds on the experience of the methodological development of OntoLearn. The OntoLearn tool and method offered methodological support for ontology learning and the automatic population of domain ontologies [54, 55]. The OntoLearn methodology had four steps: discovery of new terms, extraction of definitions, extraction of taxonomic information, and ontology update, and included experts as per-concept evaluators. The UPON methodology is based on the Unified Process for software development and is use-case driven, iterative and incremental in nature [56]. The evaluation of the quality of the ontology includes four activities: (a) syntactic quality (formal), (b) semantic quality (consistency), (c) pragmatic quality (fidelity, relevance and completeness), and (d) social quality [57, 58].
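Two of these activities admit a simple automated sketch, shown below under the assumption that the ontology is serialized as RDF: syntactic quality reduces to whether the document parses, and one narrow aspect of semantic quality (consistency) can be probed by searching for cycles in the subclass taxonomy. The Turtle snippet and class names are hypothetical.

    # Syntactic check (parse) and one semantic check (subclass cycles), rdflib.
    from rdflib import Graph, RDFS

    TTL = """
    @prefix ex: <http://example.org/onto#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:Car rdfs:subClassOf ex:Vehicle .
    ex:Vehicle rdfs:subClassOf ex:Artifact .
    """

    # Syntactic quality: parsing fails loudly on malformed input.
    g = Graph().parse(data=TTL, format="turtle")

    # Semantic check (one simplified aspect): flag classes that are their own
    # ancestors in the rdfs:subClassOf hierarchy (a common modelling error).
    def ancestors(graph, cls):
        seen, frontier = set(), set(graph.objects(cls, RDFS.subClassOf))
        while frontier:
            node = frontier.pop()
            if node not in seen:
                seen.add(node)
                frontier |= set(graph.objects(node, RDFS.subClassOf))
        return seen

    classes = set(g.subjects(RDFS.subClassOf, None))
    cycle_free = all(c not in ancestors(g, c) for c in classes)
    print("parses: OK; subclass hierarchy cycle-free:", cycle_free)

Pragmatic and social quality, by contrast, require human judgement and usage data and are not mechanically checkable in this way.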
IV. DISCUSSION
Ontologies may be analysed according to their rationale, subject matter, level of simplification, and richness of internal structure. However, the methodologies adopted for their construction and the tools used are also significant items for analysis. Hoekstra stated that “ontology development remains an art, and no a priori guidelines exist that can ensure the quality of the result” [59]. An investigation of the methodological descriptions shows that ontology-building methodologies, especially METHONTOLOGY, On-To-Knowledge and UPON, follow a cyclic modelling approach composed of several stages, comprising management (a preparatory phase), development (a development phase) and support (an application phase) activities. The preparatory phase includes activities such as the specification of requirements, together with an initial feasibility study. The development stage covers knowledge acquisition activities, together with ontology conceptualization, formalization, evaluation and refinement. Development activities depend on the criteria established in the preparatory stage, and may need to be performed in iterative and incremental cycles, hence the need for ontology refinement. Finally, the application phase includes implementation and maintenance activities.
These methodologies may also be distinguished by the knowledge acquisition factor in ontology development. For example, TERMINAE is a bottom-up approach which focuses on the acquisition of low-level domain classes or instances through a list of terms extracted by linguistic analysis techniques such as natural language processing.
None of the described methodologies focuses entirely on the development of ontologies from a top-down approach, although the use of competency questions and the decision to reuse available top or upper ontologies can be seen as top-down elements in ontology design. However, most methodologies that include the formulation of competency questions (e.g. METHONTOLOGY, ON-TO-KNOWLEDGE), the analysis of available ontologies towards reuse, and ontology refinement steps also suggest the extraction of lists of domain terms from texts. Sure and Studer (2003) comment, in their description of ON-TO-KNOWLEDGE, that “the usage scenario/competency question method usually follows a top-down approach in modelling the domain. One starts by modelling concepts in a very generic level. Subsequently they are refined. This approach is typically done manually and leads to a high-quality engineered ontology. Available top-level ontologies may be reused here and serve as a starting point to develop new ontologies. In practice this seems to be more a middle-out approach, that is, to identify the most important concepts which will then be used to obtain the remainder of the hierarchy by generalization and specialization. However, with the support of automatic document analysis, a typical bottom-up approach may be applied. There, relevant lexical entries are extracted semi-automatically from available documents. Based on the assumption that most concepts and conceptual structures of the domain as well the company terminology are described in documents, applying knowledge acquisition from text for ontology design seems to be promising”.
Hence, most ontology methodologies propose an open approach with various concurrent possibilities such as top-down, middle-out and bottom-up.
REFERENCES
[1] Berners-Lee T., J. Hendler, and O. Lassila. “The Semantic Web” in
Scientific American 284(5):34–43, May 2001.
[2] N. Aussenac-Gilles, B. Biebow, S. Szulman, “Revisiting ontology design: a methodology based on corpus analysis,” in: 12th International Conference on Knowledge Engineering and Knowledge Management (EKAW’00), 2000.
[3] D.B. Lenat, R.V. Guha, “Building Large Knowledge-Based
Systems: Representation and Inference in the Cyc Project,”
Addison-Wesley, Boston, 1990.
[4] M. Uschold, M. King, “Towards a Methodology for Building
Ontologies,” in: IJCAI95 Workshop on Basic Ontological Issues in
Knowledge Sharing, Montreal, 1995.
[5] M. Gruninger, M.S. Fox, “Methodology for the design and evaluation of ontologies,” in: Workshop on Basic Ontological Issues in Knowledge Sharing, Montreal, 1995.
[6] A. Bernaras, I. Laresgoiti, J. Corera, “Building and reusing
ontologies for electrical network applications,” in: Proc. European
Conference on Artificial Intelligence (ECAI’96), Budapest,
Hungary, 1996, pp. 298–302.
[7] J. Euzenat, “Corporative memory through cooperative creation of
knowledge bases and hyper-documents,” in: Proc. 10th Knowledge
Acquisition for Knowledge-Based Systems Workshop (KAW96),
Banff, 1996.
[8] Natalya F. Noy and Carole D. Hafner, "The State of the Art in
Ontology Design", American Association for Artificial
Intelligence, 1997.
[9] B. Swartout, R. Patil, K. Knight, T. Russ, “Toward Distributed
Use of Large-Scale Ontologies,” in AAAI Symposium on
Ontological Engineering, Stanford, 1997.
[10] M. Fernandez-Lopez, “Overview of Methodologies for Building
Ontologies,” in: IJCAI99 Workshop on Ontologies and Problem-
Solving Methods: Lessons Learned and Future Trends, Stockholm,
1999.
[11] M. Fernandez-Lopez, “Overview of Methodologies for Building
Ontologies,” in: IJCAI99 Workshop on Ontologies and Problem-
Solving Methods: Lessons Learned and Future Trends, Stockholm,
1999.
[12] M. Fernandez-Lopez, A. Gomez-Perez, A. Pazos-Sierra, J. Pazos-
Sierra, “Building a chemical ontology using METHONTOLOGY
and the ontology design environment,” in IEEE Intelligent Systems
& their applications 4 (1) (1999) 37–46.
[13] A. Gomez-Perez, M.D. Rojas, “Ontological reengineering and reuse,” in: D. Fensel, R. Studer (Eds.), 11th European Workshop on Knowledge Acquisition, Modeling and Management (EKAW’99), 1999.
[14] J.U. Kietz, A. Maedche, R. Volz, “A method for semi-automatic ontology acquisition from a corporate intranet,” in: EKAW’00 Workshop on Ontologies and Texts, CEUR Workshop Proceedings, Juan-les-Pins, vol. 51, 2000.
[15] A. Gomez-Perez, “Evaluation of ontologies,” in International Journal of Intelligent Systems 16(3), 2001, 10:1–10:36.
[16] N. Guarino, C. Welty, “Ontological analysis of taxonomic relationships,” in: 19th International Conference on Conceptual Modeling (ER 2000), 2000.
[17] Y. Kalfoglou, D. Robertson, “Managing Ontological Constraints,”
in: Proc. IJCAI99 Workshop on Ontologies and Problem-Solving
Methods: Lessons Learned and Future Trends, Stockholm, 1999.
[18] M. Klein, D. Fensel, “Ontology versioning on the Semantic Web,”
in: First International Semantic Web Workshop (SWWS01),
Stanford, 2001.
[19] L. Stojanovic, B. Motik, “Ontology evolution within ontology editors,” in:
EKAW02 Workshop on Evaluation of Ontologybased Tools
(EON2002), CEUR Workshop Proceedings, Siguenza, vol. 62,
2002, pp. 53–62.
[20] N.F. Noy, M.A. Musen, “PROMPT: Algorithm and Tool for Automated Ontology Merging and Alignment,” in: 17th National
Conference on Artificial Intelligence (AAAI’00), Austin, 2000.
[21] Jones, D., Bench-Capon, T., and Visser, P. “Methodologies for ontology development,” in Proceedings of the 15th IFIP World Computer Congress, Budapest, 1998, pp. 62–75.
[22] S. Staab, H.P. Schnurr, R. Studer, Y. Sure, “Knowledge processes
and ontologies,” in IEEE Intelligent Systems 16 (1) (2001) 26–34.
[23] Oscar Corcho, Mariano Fernandez Lopez, Asuncion Gomez-Perez.
“Methodologies, tools and languages for building ontologies.
Where is their meeting point?” in Data & Knowledge Engineering
46 (2003) 41–64.
[24] O. Corcho, M. Fernandez-Lopez, A. Gomez-Perez, O. Vicente,
“WebODE: an integrated workbench for ontology representation,
reasoning and exchange,” in 13th International Conference on
Knowledge Engineering and Knowledge Management
(EKAW’02), 2002.
[25] M. Blazquez, M. Fernandez-Lopez, J.M. G-Pinar, A. Gomez-
Perez, “Building ontologies at the knowledge level using the
ontology design environment,” in 11th International Workshop on
Knowledge Acquisition, Modeling and Management (KAW’98),
Banff, 1998.
[26] Man Li, Dazhi Wang, Xiaoyong Du, Shan Wang. “Ontology
Construction for Semantic Web: A Role-Based Collaborative
Development Method”, APWeb 2005: 609-619
[27] Loredana Laera, Valentina Tamma, “The Hitchhiker’s Guide to
Ontology Editors,” in AgentLinks’ Making Use of the Semantic
Web. ISSUE 18, August 2005
[28] Li Ding, Pranam Kolari, Zhongli Ding, and Sasikanth Avancha.
“Using Ontologies in the Semantic Web: A Survey,” a book-
chapter in Ontologies: A Handbook of Principles, Concepts and
Applications in Information Systems, 2006.
[29] Md. Ahsan-ul Murshed and Ramanjit Singh. “Evaluation and
ranking of ontology construction tools”. Technical Report # DIT-
05-013, at eprints.biblio.unitn.it/archive/00000747/01/013.pdf,
March 2005.
[30] Eva Blomqvist. “Fully automatic construction of enterprise
ontologies using design patterns: Initial method and first
experiences,” in Proceedings of ODBASE’05, 2005.
[31] Eva Blomqvist. “Pattern ranking for semi-automatic ontology construction,” in Proceedings of the 2008 ACM Symposium on Applied Computing, ACM, New York, NY, USA, 2008.
[32] Buraga, S. C., Cojocaru, L., and Nichifor, O. C. “Survey on Web Ontology Editing Tools,” available at: http://www.aifb.uni-karlsruhe.de/WBS/ysu/publications/OntoWeb_Del_1-3.pdf, 2006.
[33] Ahmad, M. N., and Colomb, R. M. “Managing Ontologies: A Comparative Study of Ontology Servers,” Australian Computer Society, Inc., 2007. Retrieved from http://portal.acm.org/citation.cfm?id=1273733
[34] Ling Li, Shengqun Tang, Lina Fang, Ruliang Xiao, Xinguo Deng,
Youwei Xu, Yang Xu, "VOEditor: a Visual Environment for
Ontology Construction and Collaborative Querying of Semantic
Web Resources," in 31st Annual International Computer Software
and Applications Conference, 2007
[35] Gianluca Correndo, Harith Alani. “Survey of Tools for
Collaborative Knowledge Construction and Sharing,” in Web
Intelligence/IAT Workshops 2007: 7-10
[36] IEEE. “IEEE Standard Glossary of Software Engineering Terminology,” IEEE Standard 610.12-1990. New York: Institute of Electrical and Electronics Engineers, December 1990.
[37] Uschold, M. “Building ontologies: Towards a unified methodology,” in Proceedings of Expert Systems 1996, the 16th Annual Conference of the British Computer Society Specialist Group on Expert Systems, Cambridge, 1996.
[38] Uschold, M., and M. Grüninger. “Ontologies: Principles, methods and applications,” Technical Report AIAI-TR-191, University of Edinburgh, 1996.
[39] Bernaras, A., I. Laresgoiti, and J. Corera. “Building and reusing
ontologies for electrical network applications,” in Proceedings of
the 12th European Conference on Artificial Intelligence
(ECAI’96), Budapest, Hungary: Wiley, 1996
[40] Fernandez-Lopez, M., A. Gomez-Perez, and N. Juristo. “METHONTOLOGY: From ontological art towards ontological engineering,” in Ontological Engineering, American Association for Artificial Intelligence, Menlo Park, California, 33–40, 1997.
[41] Fernandez-Lopez, M. “Overview of methodologies for building ontologies,” in Proceedings of the Workshop on Ontologies and Problem-Solving Methods, in conjunction with the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI’99), 1999.
[42] Fernandez-Lopez, M., and A. Gomez-Perez. “Overview and
analysis of methodologies for building ontologies.” Knowledge
Engineering Review 17(2):129–156, 2002.
[43] Gomez-Perez, A., M. Fernandez-Lopez, and O. Corcho.
“Ontological engineering. With examples from the areas of
knowledge management, e-commerce and the Semantic Web.”
Advanced information and knowledge processing. London:
Springer, 2003.
[44] Corcho, O., M. Fernandez, A. Gomez-Perez, and A. Lopez-Cima. “Building legal ontologies with METHONTOLOGY and WebODE,” in Law and the Semantic Web: Legal Ontologies, Methodologies, Legal Information Retrieval, and Applications, 2005.
[45] Noy, N. F., and D. L. McGuinness. “Ontology development 101: A
guide to creating your first ontology.” Technical Report SMI-
2001-0880, Stanford University School of Medicine, 2001.
[46] Sure, Y. “A tool-supported methodology for ontology-based knowledge management,” in Ontology and Modeling of Real Estate Transactions in European Jurisdictions, 1993.
[47] Sure, Y., S. Staab, and R. Studer. “Methodology for development and employment of ontology based knowledge management applications,” ACM SIGMOD Record 31(4) (Special Issue), 18–23, 2002.
[48] Sure, Y. “Methodology, tools and case studies for ontology based
knowledge management.” Ph.D. thesis, 2003.
[49] Pinto, H. S., C. Tempich, and S. Staab. “DILIGENT: Towards a fine-grained methodology for distributed, loosely-controlled and evolving engineering of ontologies,” in Proceedings of the 16th European Conference on Artificial Intelligence (ECAI 2004), 2004.
[50] Biebow, B., and S. Szulman. “TERMINAE: A linguistics-based tool for the building of a domain ontology,” in Proceedings of the 11th European Workshop (EKAW’99), 1999.
[51] Aussenac-Gilles, N., B. Biebow, and S. Szulman. “Revisiting
ontology design: A method based on corpus analysis.” In
Proceedings of EKAW 2000.
[52] Despres, S., and S. Szulman. “Construction of a legal ontology from a European community legislative text,” in Legal Knowledge and Information Systems: The Seventeenth Annual Conference, ed. T. Gordon, 79–88. Amsterdam: IOS Press, 2004.
[53] Despres, S., and S. Szulman. “Merging of legal micro-ontologies
from European directives.” Artificial Intelligence and Law, 2007.
[54] Navigli, R., P. Velardi, A. Cucchiarelli, and F. Neri. “Automatic
ontology learning: Supporting a per-concept evaluation by domain
experts.” In Proceedings of the ECAI-2004 Workshop on Ontology
Learning and Population, Sevilla, 2004.
[55] Velardi, P., R. Navigli, A. Cucchiarelli, and F. Neri. “Evaluation of
OntoLearn, a methodology for automatic population of domain
ontologies.” In Ontology learning from text: Methods, evaluation
and applications, ed. P. Buitelaar, P. Cimiano, and B. Magnini.
Frontiers in artificial intelligence and applications series, Vol. 123. Amsterdam: IOS Press, 2005.
[56] de Nicola, A., M. Missikoff, and R. Navigli. “A proposal for a
unified process for ontology building: UPON.” In Database and
expert systems applications (DEXA), 2005.
[57] de Nicola, A., M. Missikoff, and R. Navigli. “A software
engineering approach to ontology building.” Information Systems
34:258–275, 2009.
[58] Barbagallo, A., A. D. Nicola, and M. Missikoff. “eGovernment
ontologies: Social participation in building and evolution.” Hawaii
International Conference on System Sciences 0:1–10, 2010.
[59] Hoekstra, R. “Ontology representation. Design patterns and
ontologies that make sense.” In Frontiers in artificial intelligence
and applications, Vol. 197. Amsterdam: IOS Press, 2009.