CLUSTERING AND TEXT MINING Name : Emad Zargoun Id number :135042 EASTERN MEDITERRANEAN UNIVERSITY DEPARTMENT OF Computing and technology “ITEC547- text mining“ Prof.Dr. Nazife Dimiriler


CLUSTERING AND TEXT MINING

Name: Emad Zargoun, Id number: 135042

EASTERN MEDITERRANEAN UNIVERSITY

DEPARTMENT OF Computing and technology

“ITEC547- text mining“

Prof.Dr. Nazife Dimiriler


My outline

1. Overview about clustering
2. Web service clustering
3. Ontologies improve text document clustering
4. Heterarchy and Core Ontology
5. Compiling Background Knowledge into the Text Document Representation
6. Conclusion


Overview about clustering

Definition of clustering

Clustering in text mining can be defined in two ways:

The procedure of dividing texts into several clusters, where each cluster contains related texts and differs from the other clusters.

A grouping of data objects such that the objects within a group are similar to one another and different from the objects in other groups.
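The second definition can be checked numerically. The following is a minimal sketch, assuming a bag-of-words representation and cosine similarity; the four toy documents and the grouping are hypothetical and serve only to show that similarity within a group is higher than similarity across groups.

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words vector: term -> frequency."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(freq * b[term] for term, freq in a.items() if term in b)
    norm_a = math.sqrt(sum(f * f for f in a.values()))
    norm_b = math.sqrt(sum(f * f for f in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical toy documents and a hand-made grouping into two clusters.
docs = {
    "d1": "cheap hotel room booking",
    "d2": "hotel room with breakfast",
    "d3": "weather forecast for tomorrow",
    "d4": "rain in the weather forecast",
}
grouping = [["d1", "d2"], ["d3", "d4"]]
vectors = {name: vectorize(text) for name, text in docs.items()}

def cluster_of(name):
    """Index of the group that contains the given document."""
    return next(i for i, group in enumerate(grouping) if name in group)

names = sorted(docs)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
within = [cosine(vectors[a], vectors[b])
          for a, b in pairs if cluster_of(a) == cluster_of(b)]
between = [cosine(vectors[a], vectors[b])
           for a, b in pairs if cluster_of(a) != cluster_of(b)]

# A good grouping has high similarity inside groups and low similarity across them.
print("lowest within-group similarity:", min(within))
print("highest between-group similarity:", max(between))
```

A clustering algorithm searches for such a grouping automatically instead of taking it as given.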


Web service clustering

Web services are distributed, autonomous software components that are self-describing and designed by different vendors to provide business functions to other applications through an internet connection.

Some major providers have even decided to advertise their services through their human-readable websites, for example Google’s and Amazon’s web services.


Web service clustering

Clustering web services is a mechanism for bootstrapping a service search engine.

The input to the clustering is the set of web service description (WSDL) files.


Web service clustering

The clustering of web service files is different from the traditional web service discovery problem because there are no queries to match against.

The idea of representing a web service using document vectors is still relevant.

However, gathering the features for a WSDL file is not as simple as collecting description documents when no central UDDI registries are assumed.


Web service clustering

The approach is a system that can automatically cluster a group of WSDL files obtained by querying a search engine (e.g., Google).

The process mines four types of features from a WSDL file:

1. the content of the web service is characterized by the application-specific terms located in the WSDL file


Web service clustering

2. the context of the web service is represented by the application-specific terms appearing in all index web pages of publicly accessible parent directories of the current directory containing the WSDL file.

3. the service host is the second- and top-level portion of the domain name (i.e., a segment of the authority part of the URI) of the host containing the WSDL file.

4. the service name is the name of the WSDL file.


Web service clustering

As the previous figure shows, the word analyzer begins by tokenizing the WSDL or HTML files to construct the initial sets C and X. Non-words are removed from these sets. The words in the two sets are then conflated and analyzed for their content-bearing property in order to remove function words. The remaining content words in the two sets are then clustered to identify application-specific terms and general computing terms.
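The first steps of the word analyzer can be sketched as follows. This is a minimal sketch under stated assumptions, not the system's actual implementation: the function names, the tiny function-word list and the crude suffix stripping (standing in for a real conflation step) are all hypothetical, and the final step of clustering the content words is not shown.

```python
import re

# Tiny illustrative list of function words (assumption); a real system
# would use a much larger list or a statistical test.
FUNCTION_WORDS = {"the", "a", "an", "of", "for", "and", "to", "in", "by"}

def split_identifier(token):
    """Split identifiers such as 'GetWeatherByCity' into their word parts.

    Tokens without letters (e.g. '3') yield an empty list, which is how
    non-words are dropped in this sketch.
    """
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+", token)

def conflate(word):
    """Very crude suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Tokenize, drop non-words, conflate, then remove function words."""
    content_words = []
    for raw in re.findall(r"[A-Za-z0-9]+", text):
        for part in split_identifier(raw):
            word = conflate(part.lower())
            if word not in FUNCTION_WORDS:
                content_words.append(word)
    return content_words

print(analyze("GetWeatherForecasts returns the forecasts for 3 towns"))
```

Splitting identifiers matters for WSDL files because operation and message names are usually written in camel case rather than as separate words.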

Regular expressions are used to extract the service name and the service host address. These steps are implemented as the modules identifyServiceHost and identifyServiceName. For example, the service name can be extracted from a WSDL file hosted under http://weather.terrapin.com/
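A minimal sketch of the two extraction steps is given below. The function names only mirror the module names mentioned above; the regular expressions, the example URL and the rule of keeping the last two host labels are assumptions made for illustration, not the patterns used in the original system.

```python
import re

# Hypothetical WSDL location, used only for illustration.
WSDL_URL = "http://services.example.com/weather/GlobalWeather.wsdl"

def identify_service_host(url):
    """Return the second- and top-level portion of the host name.

    Keeping the last two labels is a simplification: it is wrong for
    suffixes such as 'co.uk', which need a public-suffix list.
    """
    match = re.match(r"^[a-z][a-z0-9+.-]*://([^/:?#]+)", url, re.IGNORECASE)
    if match is None:
        return None
    labels = match.group(1).lower().split(".")
    return ".".join(labels[-2:])

def identify_service_name(url):
    """Return the WSDL file name without the '.wsdl' extension."""
    match = re.search(r"/([^/?#]+?)(?:\.wsdl)?(?:[?#].*)?$", url, re.IGNORECASE)
    return match.group(1) if match else None

print(identify_service_host(WSDL_URL))
print(identify_service_name(WSDL_URL))
```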


Web service clustering

The web service clusters are produced based on the four types of features.


Ontologies improve text document clustering

Beneficial effects can be achieved for text document clustering by integrating an explicit conceptual account of terms found in ontologies like WordNet.

The clustering is then performed with Bi-Section-KMeans, which has been shown to perform as well as other text clustering algorithms.
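The general idea of bisecting k-means is to start with one cluster and repeatedly split a cluster in two with ordinary 2-means until k clusters exist. The sketch below illustrates that idea under several assumptions that are not taken from the slides: it uses Euclidean distance on small numeric points instead of document vectors, always splits the largest cluster, and seeds each split deterministically.

```python
def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(points, iterations=10):
    """Split points into two groups with plain 2-means.

    Seeds are the first point and the point farthest from it (assumption).
    """
    first = points[0]
    second = max(points, key=lambda p: sq_dist(p, first))
    centers = [first, second]
    groups = [list(points), []]
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            nearest = 0 if sq_dist(p, centers[0]) <= sq_dist(p, centers[1]) else 1
            groups[nearest].append(p)
        if not groups[0] or not groups[1]:
            break  # degenerate split, e.g. all points identical
        centers = [mean(g) for g in groups]
    return groups

def bisecting_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        target = clusters.pop()  # largest cluster
        left, right = two_means(target)
        if not left or not right:
            clusters.append(target)  # cannot be split further
            break
        clusters.extend([left, right])
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
for cluster in bisecting_kmeans(points, 3):
    print(cluster)
```

For documents, the same loop would typically run on weighted term or concept vectors with a cosine-based similarity.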


Heterarchy and Core Ontology

Definition 1 (Core Ontology) A core ontology is a sign system which consists of:

• A lexicon: the lexicon L contains a set of natural language terms.

• A set of concepts C*.

• The reference function F, which links terms in the lexicon to the concepts they refer to.

• A heterarchy H: concepts are taxonomically related by the directed, acyclic, transitive, reflexive relation H.


Example

• lexicon L = {Hotel, Grand Hotel, Hotel Schwarzer Adler, Accommodation, …}

• concepts C* = {ROOT, HOTEL, ACCOMMODATION, …}

• reference function F = {(Hotel, HOTEL), (Grand Hotel, HOTEL), (Hotel Schwarzer Adler, HOTEL), …}, i.e. "Hotel", "Grand Hotel" and "Hotel Schwarzer Adler" refer to the concept HOTEL.

• heterarchy H = {(HOTEL, ACCOMMODATION), (ACCOMMODATION, ROOT), …}
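The example can be written down as plain data structures. This is a minimal sketch, assuming that only the pairs listed above exist (the elided entries are left out) and that the heterarchy is stored as direct edges from which the reflexive, transitive relation is computed; the variable and function names are hypothetical.

```python
# Lexicon, concepts and reference function, restricted to the listed entries.
LEXICON = {"Hotel", "Grand Hotel", "Hotel Schwarzer Adler", "Accommodation"}
CONCEPTS = {"ROOT", "HOTEL", "ACCOMMODATION"}
REF = {
    "Hotel": {"HOTEL"},
    "Grand Hotel": {"HOTEL"},
    "Hotel Schwarzer Adler": {"HOTEL"},
}

# Direct edges of the heterarchy: concept -> its direct superconcepts.
DIRECT_SUPER = {
    "HOTEL": {"ACCOMMODATION"},
    "ACCOMMODATION": {"ROOT"},
    "ROOT": set(),
}

def ref(term):
    """Concepts a term refers to (empty set if the term is unknown)."""
    return REF.get(term, set())

def superconcepts(concept):
    """Reflexive, transitive closure of the direct edges."""
    seen = {concept}
    stack = [concept]
    while stack:
        current = stack.pop()
        for parent in DIRECT_SUPER.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def is_a(sub, sup):
    """True if 'sub' is taxonomically below (or equal to) 'sup'."""
    return sup in superconcepts(sub)

print(ref("Grand Hotel"))
print(superconcepts("HOTEL"))
```

Because a heterarchy allows a concept to have several direct superconcepts, the edges are stored as sets rather than as a single parent.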


Compiling Background Knowledge into the Text Document Representation

There are three sets of strategies for compiling background knowledge into the text document representation.

Term vs. Concept Vector Strategies

Enriching the term vectors with concepts from the core ontology has two benefits. First, it resolves synonyms; second, it introduces more general concepts, which help to identify related topics. For instance, a document about beef may not be recognized as related to a document about another kind of meat until the more general concept 'meat' is added to both.
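The synonym benefit can be illustrated with a small sketch. It assumes one possible enrichment strategy, namely adding concept frequencies to the existing term vector; the term-to-concept mapping and the two short documents are hypothetical.

```python
from collections import Counter

# Hypothetical reference function: term -> concepts (upper case marks concepts).
TERM_TO_CONCEPTS = {
    "car": ["AUTOMOBILE"],
    "automobile": ["AUTOMOBILE"],
}

def term_vector(text):
    """Plain term-frequency vector."""
    return Counter(text.lower().split())

def add_concepts(vector):
    """Return a copy of the term vector enriched with concept frequencies."""
    enriched = Counter(vector)
    for term, freq in vector.items():
        for concept in TERM_TO_CONCEPTS.get(term, []):
            enriched[concept] += freq
    return enriched

doc_a = term_vector("car rental")
doc_b = term_vector("automobile hire")

# Without concepts the two documents share no feature at all.
print(set(doc_a) & set(doc_b))
# After enrichment both contain the concept AUTOMOBILE.
print(set(add_concepts(doc_a)) & set(add_concepts(doc_b)))
```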


Compiling Background Knowledge into the Text Document Representation

Strategies for Disambiguation

The assignment of terms to concepts in WordNet is ambiguous. Adding or replacing terms by concepts may therefore add noise to the representation and may induce a loss of information.

There are three disambiguation strategies.


Compiling Background Knowledge into the Text Document Representation

All Concepts (“all”). The baseline strategy is to do nothing about disambiguation and to consider all concepts when augmenting the text document representation.

First Concept (“first”). WordNet returns an ordered list of concepts when Ref_C is applied to a set of terms. The ordering is supposed to reflect how commonly a term refers to a concept in “standard” English: more common term meanings are listed before less common ones. This strategy therefore keeps only the first, most common concept.
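The difference between the two strategies can be sketched as follows. The ordered sense lists are hypothetical stand-ins for what WordNet would return, and the function name and the strategy labels as string arguments are assumptions made for this illustration.

```python
from collections import Counter

# Hypothetical ordered sense lists, most common meaning first.
SENSES = {
    "bank": ["BANK_INSTITUTION", "BANK_RIVERSIDE"],
    "loan": ["LOAN"],
}

def concept_vector(terms, strategy="all"):
    """Count concept frequencies using the 'all' or the 'first' strategy."""
    if strategy not in ("all", "first"):
        raise ValueError("unknown strategy: " + strategy)
    vector = Counter()
    for term in terms:
        senses = SENSES.get(term, [])
        if strategy == "first":
            senses = senses[:1]  # keep only the most common meaning
        for concept in senses:
            vector[concept] += 1
    return vector

terms = ["bank", "loan", "bank"]
print(concept_vector(terms, "all"))
print(concept_vector(terms, "first"))
```

With "all", the unrelated riverside sense is counted as often as the intended one, which is the noise that disambiguation tries to avoid.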


Compiling Background Knowledge into the Text Document Representation

Disambiguation by Context (“context”). The sense of a term t that refers to several different concepts Ref_C(t) := {b, c, ...} may be disambiguated by a simplified version of the strategy of Agirre and Rigau (see References).
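One generic way to use context is to pick the candidate concept whose related terms occur most often in the same document. The sketch below illustrates only this general idea; it is not claimed to be the cited strategy or its simplified version, and the sense lists and related-term sets are hypothetical.

```python
from collections import Counter

# Hypothetical candidate senses, most common meaning first.
SENSES = {"bank": ["BANK_INSTITUTION", "BANK_RIVERSIDE"]}

# Hypothetical terms associated with the neighbourhood of each concept.
RELATED_TERMS = {
    "BANK_INSTITUTION": {"money", "loan", "account"},
    "BANK_RIVERSIDE": {"river", "water", "shore"},
}

def disambiguate(term, document_terms):
    """Choose the sense whose related terms are most frequent in the document."""
    senses = SENSES.get(term, [])
    if not senses:
        return None
    frequencies = Counter(document_terms)

    def support(concept):
        return sum(frequencies[w] for w in RELATED_TERMS.get(concept, ()))

    # max() returns the first maximal element, so ties fall back to sense order.
    return max(senses, key=support)

document = "the bank of the river had muddy water".split()
print(disambiguate("bank", document))
```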


Compiling Background Knowledge into the Text Document Representation

Strategies for considering the concept hierarchy

The third set of strategies varies the amount of background knowledge.

The principal idea is that if a term like ‘beef’ appears, one represents the document not only by the concept corresponding to ‘beef’, but also by the concepts corresponding to ‘meat’, ‘food’ and so on, up to a certain level of generality.
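The idea can be sketched as adding the frequency of each concept to its superconcepts, up to a chosen number of levels. The parent table below is a hypothetical fragment with a single parent per concept (a real heterarchy may have several), and the concept PORK is added only to show the effect on a second document.

```python
from collections import Counter

# Hypothetical fragment of a concept hierarchy: concept -> direct superconcept.
PARENT = {"BEEF": "MEAT", "PORK": "MEAT", "MEAT": "FOOD"}

def with_superconcepts(concept_vector, levels):
    """Add each concept's frequency to its superconcepts, 'levels' steps up."""
    expanded = Counter(concept_vector)
    for concept, freq in concept_vector.items():
        current = concept
        for _ in range(levels):
            current = PARENT.get(current)
            if current is None:
                break  # reached the top of the known hierarchy
            expanded[current] += freq
    return expanded

beef_doc = Counter({"BEEF": 2})
pork_doc = Counter({"PORK": 1})

print(with_superconcepts(beef_doc, 0))  # no background knowledge added
print(with_superconcepts(beef_doc, 2))  # BEEF plus MEAT and FOOD
# With one level, both documents share the concept MEAT.
print(set(with_superconcepts(beef_doc, 1)) & set(with_superconcepts(pork_doc, 1)))
```

Choosing the number of levels is a trade-off: too few levels leave related documents apart, too many make all documents look alike because they all reach the same general concepts.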


Conclusion

Clustering web services into functionally similar groups can greatly reduce the search space of a service discovery task. Therefore, it can be seen as a predecessor of web service discovery or as an important functionality provided by future service search engines.

Clustering based on all three document vectors (word vector, concept vector, category vector) also gets significantly better results than the baseline, but does not outperform clustering based only on word vector and category vector.


References

Wei Liu and Wilson Wong. Web service clustering using text mining techniques. Int. J. Agent-Oriented Software Engineering, Vol. X, No. Y.

Andreas Hotho, Steffen Staab and Gerd Stumme. Ontologies Improve Text Document Clustering. Institute AIFB, University of Karlsruhe, Germany.

E. Agirre and G. Rigau. Word sense disambiguation using conceptual density. In Proc. of COLING’96, 1996.


Thanks for your attention