Semantic Web mining using nature inspired optimization methods


Diana Andreea Gorea, Lucian Bentea

Faculty of Computer Science, “A.I. Cuza” University, Iasi, Romania

Abstract. In this paper, nature inspired methods are proposed for solving problems in the field of Semantic Web mining, namely the clustering of Web resources based on their metadata, as well as the automatic classification of Web pages.

1 Introduction

This paper proposes the use of nature inspired methods for solving the problem of RDF clustering, as well as that of the automatic classification of Web pages. The most promising methods that the authors found are those belonging to the Ant Colony Optimization (ACO) framework. While this paper does not aim to give an introduction to ACO, the interested reader can refer to [3] for further information.

The paper is organized as follows. Section 2 describes efficient heuristics in two different cases: when the number of clusters is predetermined, or when it is unknown and is part of the solution. By clustering Semantic Web resources, it is possible to find representatives for a set of similar resources and thus be able to reduce the size of large ontologies. This would also bring insight into the main concepts that an ontology contains. Section 3 summarizes the paper [6] and also brings further insight into how ACO heuristics can be used to find classification rules for Web pages. Section 4 draws the conclusions and suggests subjects for further research.

2 Clustering of Semantic Web data

The data clustering problem refers to grouping a set of data into several nonempty subsets whose members are considered similar with respect to some similarity measure. In the context of Semantic Web data, which can be represented through RDF graphs, the clustering problem becomes that of grouping individuals in the graph. An individual, also called an instance in [5], is a single resource node together with some of its neighbouring nodes, forming a subgraph that is relevant to that resource node. Several instance extraction methods are proposed in [5]: Immediate Properties, Concise Bounded Description (CBD)¹, or Depth Limited Crawling. The optimal method to use depends on the type of data to be processed, e.g. RDF data converted from a relational database, FOAF documents, etc., and on the structure of its associated RDF graph. The same criterion holds when choosing the optimal similarity measure; the authors of [5] also propose three such measures: one based on feature vectors (denoted simFV), one based on conceptual graphs, inspired by the similarity measure of conceptual graphs introduced in [10], and an ontology based measure (denoted simOnt).

¹ Concise Bounded Description: http://www.w3.org/Submission/CBD/

2.1 Predetermined number of clusters (the ACOC algorithm)

Assuming a set $\Omega := \{X_1, X_2, \ldots, X_m\}$ of individuals is extracted from an RDF graph $G$, and without giving an explicit formula for the above similarity measures, the RDF data clustering problem can be formally described as the following discrete optimization problem. Let $\mathrm{sim}$ be a similarity measure, e.g. simFV or simOnt above. Also let $n \geq 1$ be the predetermined number of clusters into which the data is to be grouped and denote by $C_1, C_2, \ldots, C_n \in \Omega$ the variables to be determined as the centers of each cluster. By defining the variable $w_{ij}$ through

$$w_{ij} := \begin{cases} 1, & \text{if the individual } X_i \text{ belongs to cluster } j, \\ 0, & \text{otherwise,} \end{cases} \tag{1}$$

for $i = 1, \ldots, m$ and $j = 1, \ldots, n$, the aim is to

$$\text{Maximize } \sum_{i=1}^{m} \sum_{j=1}^{n} w_{ij}\, \mathrm{sim}(X_i, C_j), \tag{2}$$

such that each individual belongs to only one cluster,

$$\sum_{j=1}^{n} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{3}$$

and there are no empty clusters,

$$\sum_{i=1}^{m} w_{ij} \geq 1, \quad j = 1, \ldots, n. \tag{4}$$
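To make the formulation concrete, the following minimal Python sketch evaluates objective (2) for a candidate solution and checks constraints (3) and (4). It is an illustration only: the function names, the toy similarity on numbers, and the encoding of a solution as a list `labels` (where `labels[i] = j` stands for $w_{ij} = 1$) are assumptions made here and are not taken from [5] or [7].

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def is_feasible(labels: Sequence[int], n: int) -> bool:
    """Check constraints (3) and (4) for a solution encoded as cluster labels."""
    # (3) holds by construction once every label is a single valid cluster index.
    if any(not 0 <= j < n for j in labels):
        return False
    # (4): every one of the n clusters must contain at least one individual.
    return set(labels) == set(range(n))


def objective(items: Sequence[T], centers: Sequence[T], labels: Sequence[int],
              sim: Callable[[T, T], float]) -> float:
    """Value of objective (2): total similarity of individuals to their centers."""
    return sum(sim(x, centers[j]) for x, j in zip(items, labels))


def toy_sim(a: float, b: float) -> float:
    """Hypothetical similarity on numbers; a stand-in for simFV or simOnt."""
    return 1.0 / (1.0 + abs(a - b))


if __name__ == "__main__":
    items = [1.0, 2.0, 9.0, 10.0]      # toy individuals X_1..X_4
    centers = [1.0, 10.0]              # toy centers C_1, C_2 taken from the items
    labels = [0, 0, 1, 1]
    print(is_feasible(labels, n=2), objective(items, centers, labels, toy_sim))
```

Encoding $w_{ij}$ as one label per individual keeps constraint (3) satisfied automatically, so only the non-emptiness constraint (4) needs an explicit check.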

To the best of the authors’ knowledge, there is no proof that this general clustering problem is NP-hard. The most recent result on this subject is the article [8], which proves that the clustering problem, also known as the k-means problem, is NP-hard in the restricted case of points in the plane. Nevertheless, as is the case with most discrete optimization problems, clustering of RDF data is computationally expensive, and approximate solution methods are preferred.

One of the most promising algorithms for solving the previous optimization problem is Ant Colony Optimization for Clustering (ACOC), introduced in [7], which is an alternative to the classic k-means algorithm, known to have several drawbacks. The numerical results in [7] show that ACOC obtains the best results, on several test cases, among various approximation methods, including the k-means algorithm. It also achieves this with the highest convergence rate, therefore requiring only a few iteration steps to detect the optimum. Since ACOC is part of the Ant Colony Optimization framework, the idea is to have several ants “foraging” for the optimum, thus avoiding premature convergence due to local optima. Apart from using the idea of pheromone trails, each node to be explored also carries a heuristic value, representing the estimated global gain from picking that node; this is used to accelerate the convergence of the algorithm. Eventually, ants are grouped into clusters and a solution to the original RDF clustering problem can be obtained through a decoding algorithm.
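The general idea (pheromone trails, a heuristic value per node, and a decoding step) can be illustrated with a much simplified ACO sketch. This is not the ACOC algorithm of [7]: here each ant merely picks n cluster centers from Ω, guided by a pheromone value and a heuristic value per individual, and the remaining individuals are decoded by assigning them to their most similar center. All names, parameter values, the deposit rule and the toy similarity are assumptions; the similarity is assumed to be strictly positive.

```python
import random
from typing import Callable, List, Sequence, Tuple


def toy_sim(a: float, b: float) -> float:
    """Hypothetical, strictly positive similarity; a stand-in for simFV or simOnt."""
    return 1.0 / (1.0 + abs(a - b))


def decode(items: Sequence[float], centers: List[int],
           sim: Callable[[float, float], float]) -> Tuple[List[int], float]:
    """Turn a choice of centers into cluster labels and the value of objective (2)."""
    labels = []
    for i, x in enumerate(items):
        if i in centers:                      # a center stays in its own cluster
            labels.append(centers.index(i))
        else:                                 # otherwise use the most similar center
            labels.append(max(range(len(centers)),
                              key=lambda j: sim(x, items[centers[j]])))
    score = sum(sim(x, items[centers[j]]) for x, j in zip(items, labels))
    return labels, score


def aco_cluster(items: Sequence[float], n: int, sim: Callable[[float, float], float],
                ants: int = 10, iterations: int = 30, alpha: float = 1.0,
                beta: float = 1.0, rho: float = 0.1,
                seed: int = 0) -> Tuple[List[int], List[int], float]:
    m = len(items)
    if not 1 <= n <= m:
        raise ValueError("need 1 <= n <= number of items")
    rng = random.Random(seed)
    tau = [1.0] * m                                            # pheromone per item
    eta = [sum(sim(x, y) for y in items) / m for x in items]   # heuristic: mean similarity
    best_score, best_centers, best_labels = float("-inf"), [], []
    for _ in range(iterations):
        for _ in range(ants):
            candidates, centers = list(range(m)), []
            for _ in range(n):                # pick n distinct centers
                weights = [tau[i] ** alpha * eta[i] ** beta for i in candidates]
                pick = rng.choices(candidates, weights=weights, k=1)[0]
                centers.append(pick)
                candidates.remove(pick)
            labels, score = decode(items, centers, sim)
            if score > best_score:
                best_score, best_centers, best_labels = score, centers, labels
        tau = [(1.0 - rho) * t for t in tau]  # evaporation
        for i in best_centers:                # reinforce the best centers so far
            tau[i] += rho * best_score / m
    return best_centers, best_labels, best_score


if __name__ == "__main__":
    print(aco_cluster([1.0, 2.0, 9.0, 10.0], n=2, sim=toy_sim))
```

Because every chosen center is forced into its own cluster, each decoded solution automatically has no empty clusters, so no repair step is needed in this sketch.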

2.2 Variable number of clusters (from SSCFL to RDF clustering)

In the case when the number of clusters is not predetermined, but only a fixed number of individuals are allowed to live in each cluster, the previous problem can be formulated as a Single Source Capacitated Facility Location (SSCFL) problem, which can be described as follows. Consider several facilities (e.g. medical or telecommunications facilities) that are installed at different locations in a city. These facilities provide goods to a number of customers, whose demands are known beforehand. Each facility comes with the necessary logistics to create a physical network that would allow customers to connect to the facility. However, each facility only provides a fixed amount of resources to the customers who connect to it. The available amount of resources corresponding to a facility is also called its capacity; hence the adjective capacitated in the name of this optimization problem. The question is which of the facilities to open and which customers should be assigned to each open facility, so that the total costs of opening the facilities and of creating the physical networks are minimized, while making sure that each customer’s demand is satisfied by exactly one facility.

In Figure 1, a solution to a particular SSCFL problem is represented. The customers are the light green rounded rectangles, while the facilities are the light red circles. The arrows denote assignment relations: the tip of the arrow points to the facility to which the customer is assigned. The number on each facility node designates its capacity, while the number on each customer node represents its demand. Notice that the given solution is feasible, i.e. the total demand of the customers assigned to a facility does not exceed its maximum capacity and no customers are left unassigned. Also, in this case, it was decided that three facilities (having capacities 1, 6, 10) remain closed.

Fig. 1. Solution to a particular SSCFL problem

In order to adapt the SSCFL problem to RDF clustering, the customers are identified with the individuals that need to be grouped, and the facilities represent the centers of the clusters, which may or may not be activated. Thus, consider the variable $w_{ij}$ defined as in the previous subsection and let $y_i \in \{0, 1\}$ be the Boolean variable specifying whether the i-th facility is to be opened or not, for all i. Also, denote by $\alpha_i$ the cost of opening the i-th facility, which is the same as the cost of taking the individual $X_i$ to be a cluster center, and by $\alpha_{ij}$ the cost of assigning the i-th customer to the j-th facility, for all i, j with $1 \leq i \leq m$ and $1 \leq j \leq n$. In the case of RDF data clustering, the costs $\alpha_{ij}$ represent the opposite of the similarity measure between the individual $X_i$ and the cluster center $C_j$ and they are given by:

$$\alpha_{ij} = -\mathrm{sim}(X_i, C_j), \quad i = 1, \ldots, m, \; j = 1, \ldots, n. \tag{5}$$

Provided that the facilities (the potential cluster centers) have corresponding capacities $u_1, u_2, \ldots, u_m \in \mathbb{R}_+$, the aim of this adapted SSCFL problem is then to

$$\text{Minimize } \sum_{i=1}^{m} \alpha_i y_i + \sum_{i=1}^{m} \sum_{j=1}^{n} \alpha_{ij} w_{ij}, \tag{6}$$

subject to the following constraints:

- each customer is assigned to exactly one facility (each individual $X_i$ is assigned to exactly one cluster),

$$\sum_{j=1}^{n} w_{ij} = 1, \quad i = 1, \ldots, m, \tag{7}$$

- provided that a facility is open (a cluster center is activated), the total demand of the customers assigned to it (the demand of a group of individuals to belong to the corresponding cluster) cannot exceed its capacity; also, a customer cannot be assigned to a facility that is closed (an individual cannot be represented by a cluster center that is not activated). Here $d_i$ denotes the demand of the i-th customer,

$$\sum_{i=1}^{m} d_i w_{ij} \leq u_j y_j, \quad j = 1, \ldots, n, \tag{8}$$

- a customer can either be assigned or not to a facility (an individual can either be included or not in a group),

$$w_{ij} \in \{0, 1\}, \quad i = 1, \ldots, m, \; j = 1, \ldots, n, \tag{9}$$

- facilities can either be open or closed (cluster centers can either be activated or not),

$$y_i \in \{0, 1\}, \quad i = 1, \ldots, m. \tag{10}$$

Note: Before carrying on, notice that in a solution to this problem there may be individuals that remain ungrouped, which is not necessarily a drawback. On the contrary, this may provide more realistic solutions to the clustering problem.
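As an illustration of the SSCFL objective and its single-source and capacity constraints, the sketch below solves a tiny instance by exhaustive enumeration. It is only meant to make the model tangible and is not one of the heuristics from the literature; the function name, the argument layout (`assign_cost[i][j]` is the cost of serving customer i from facility j) and the instance data are hypothetical. It assumes non-negative opening costs, so that opening exactly the facilities in use is optimal.

```python
from itertools import product
from typing import List, Optional, Sequence, Tuple


def solve_sscfl(open_cost: Sequence[float],
                assign_cost: Sequence[Sequence[float]],
                demand: Sequence[float],
                capacity: Sequence[float]) -> Optional[Tuple[float, List[int]]]:
    """Exhaustively solve a tiny SSCFL instance.

    Returns (total cost, assignment), where assignment[i] is the facility
    serving customer i, or None when no feasible assignment exists.
    """
    customers, facilities = len(demand), len(capacity)
    best: Optional[Tuple[float, List[int]]] = None
    # Each tuple gives every customer exactly one facility (single source).
    for assignment in product(range(facilities), repeat=customers):
        load = [0.0] * facilities
        for i, j in enumerate(assignment):
            load[j] += demand[i]
        # Capacity constraint: the load of a facility may not exceed its capacity.
        if any(load[j] > capacity[j] for j in range(facilities)):
            continue
        opened = set(assignment)            # open exactly the facilities in use
        cost = (sum(open_cost[j] for j in opened)
                + sum(assign_cost[i][j] for i, j in enumerate(assignment)))
        if best is None or cost < best[0]:
            best = (cost, list(assignment))
    return best


if __name__ == "__main__":
    # Hypothetical instance: 3 customers, 2 facilities.
    print(solve_sscfl(open_cost=[10, 12],
                      assign_cost=[[1, 4], [2, 3], [5, 1]],
                      demand=[2, 3, 4],
                      capacity=[5, 6]))
```

The enumeration visits every one of the (number of facilities)^(number of customers) assignments, which is exactly why heuristics are needed for instances of realistic size.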

The previous integer programming problem is proven in [9] to be NP-hard and therefore heuristic solution techniques are needed to handle its complexity. A survey of the more recent heuristics is given in [1], where the methods of Tabu Search, Simulated Annealing and Genetic Algorithms are compared in terms of their efficiency with respect to different parameters. An alternative solution based on Genetic Algorithms is also the subject of [2], in which two special crossover operators are defined, guaranteeing the feasibility of the approximations. Also, the Particle Swarm Optimization algorithm described in [11] and the Ant Colony Optimization algorithm in [13] have the potential to be adapted to the RDF clustering problem.

3 Web page classification using Ant Colony Optimization

The Semantic Web is a combination of data from different sources integrated in a common format, as opposed to the original Web, which concentrates mainly on the exchange of documents. It also has a format that connects data to objects from the real world. As a result, an information seeker may jump from one database to another simply because the two are linked by sharing knowledge about the same thing [12].

However, all of this is produced by humans, so subjectivity and errors in the placement, content or classification of knowledge must also be taken into account. In the case of pages without user contributions (such as portfolio sites or advertising pages), the quality of the content lies only in the hands of the site owner, who may or may not be aware of the mistakes. Once other users appear (with rights to upload, tag or write content), the task of keeping the information as accurate as possible becomes considerably harder.

The study [6] shows how general Web content can be classified using an Ant Colony algorithm, and reports the results obtained. We present the study and try to connect its findings with what may apply to the Semantic Web as well.

3.1 Preprocessing

The challenge when dealing with Web pages is that developers do not always follow a standardized way of creating them. There are several reasons for this: design implementation issues that may require certain tricks (fully Flash-based sites have no <h1> tag), lack of interest in or knowledge of the standards, missing or badly chosen <meta> tags (too many, or unrelated to the page content), and generic <title> tags (all pages have the same title). At least regarding meta tags, things started to improve once site owners realised the advantages of being well ranked by search engines. This led to more attention being paid to the content of those tags and to a very high interest in SEO (search engine optimisation). In general, this would not be an issue for Semantic Web documents, because their formats are standardized and not yet very popular so that, in theory at least, exceptions from the rules are few.

The contents of Web pages can be filtered using text preprocessing methods, to obtain fewer and more relevant words to search for and a more human-like understanding of the given text. The most difficult requirement for such methods is the ability to handle homographs well. A homograph is one of a group of words that share the same spelling but have different meanings [14], e.g. stalk (part of a plant) and stalk (follow/harass a person), or left (opposite of right) and left (past tense of leave) [15].

For the study [6], the authors used WordNet (an electronic lexical database that records relationships between words [4]) to filter the information. From it, they used:

- the morphological preprocessor (which maps words such as make, made, making onto the single word make), to reduce the number of words to search in;

- the identification of all nouns in the text, as they may offer relevant search information. Note, however, that nouns may have the same spelling as verbs (a large number of examples can be found in [16]);

- the lexical family of a word. If the text has words like roof, window and door, they may all relate to house. This is a questionable technique: for some associated words the result may not be a real link between them (this is especially the case for homographs), while for other cases (such as the one described above) it may bring a significant increase in efficiency.

As far as the Semantic Web is concerned, all three methods may offer interesting alternatives for the end results:

- The morphological processor is an interesting option, as a word written in natural language may be linked to another, and only the latter is relevant. However, a word like left may not stay the same when processed in this way, but become leave. With this in mind, it is probably a good idea to keep both when dealing with the Semantic Web.

- The distinction between nouns and verbs is not so relevant when searching for a word in the Semantic Web, but it becomes significant in terms of SPARQL queries. Such queries have the advantage, however, that their syntax indicates which term is the noun and which is the verb.

- The connections between different types of words are relevant only if multiple words are searched for at the same time; common denominators may then be used to provide results that match as many of the supplied items as possible.

For both search types, the end result should be a list of search words, with the note that for Web mining it should contain only the most relevant words, while for the Semantic Web it should contain first the words obtained by joining the semantics, then the morphologically obtained values (if any), and finally the words themselves. This may seem an unnecessary overload, but it may help the end user to better understand the results, and the first entries would be the most relevant.
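The ordering proposed for the Semantic Web case can be sketched as follows. The two dictionaries are hypothetical toy stand-ins for WordNet look-ups (they contain only the examples used in this section), and the function name is an assumption; a real implementation would query WordNet instead.

```python
from typing import Dict, List

# Hypothetical toy tables standing in for WordNet look-ups.
LEMMAS: Dict[str, str] = {"made": "make", "making": "make", "left": "leave"}
GENERALISATIONS: Dict[str, str] = {"roof": "house", "window": "house", "door": "house"}


def search_terms(words: List[str]) -> List[str]:
    """Build the search list: semantic terms first, then lemmas, then the words."""
    words = [w.lower() for w in words]
    semantic = [GENERALISATIONS[w] for w in words if w in GENERALISATIONS]
    lemmas = [LEMMAS[w] for w in words if w in LEMMAS]
    ordered: List[str] = []
    for term in semantic + lemmas + words:
        if term not in ordered:            # drop duplicates, keep first position
            ordered.append(term)
    return ordered


if __name__ == "__main__":
    print(search_terms(["roof", "window", "left"]))
```

Note that both left and leave are kept in the result, in line with the suggestion to retain the original word alongside its morphological base form.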

3.2 Algorithm

The Ant-Miner algorithm is a variation of the Ant Colony paradigm used in data mining. It begins by initialising the training set with all available training cases (Web pages) and creating an empty rule list. In a Repeat-Until loop, one classification rule at a time is discovered. First, all trails are initialised with the same quantity of pheromone (giving them the same chance of being selected), and an inner loop lets the ants select the best option: each ant selects the path to follow based on the paths followed by the previous ants, as indicated by the pheromone traces. The higher the amount of pheromone, the better the path. In the second step, the irrelevant terms are removed from the rule, and in the third step the pheromone values are updated. The inner loop continues until a stopping condition is fulfilled (e.g. the maximum number of paths has been generated).

After the inner loop has finished, the highest-quality rule is chosen and added to the list of discovered rules. All training cases that satisfy the rule are removed from the training set. This ensures that the next inner loop will run with fewer training cases than the previous one. The outer loop continues its execution until a stopping criterion is satisfied (e.g. the number of uncovered training cases falls below a given maximum). The algorithm returns the rule list found.
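The two nested loops can be sketched in Python as below. This is a simplified illustration and not the implementation used in [6]: the rule-pruning step and the information-theoretic heuristic of Ant-Miner are omitted, rule quality is taken as sensitivity times specificity, and all names, parameters and default values are assumptions. A training case is a pair of attribute values (e.g. words found in <title> or <meta>) and a class label.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Case = Tuple[Dict[str, str], str]   # (attribute values of a page, class label)
Rule = Tuple[Dict[str, str], str]   # (antecedent terms, predicted class)


def covers(antecedent: Dict[str, str], case: Case) -> bool:
    """A rule covers a case when every term of its antecedent matches."""
    return all(case[0].get(a) == v for a, v in antecedent.items())


def quality(antecedent: Dict[str, str], label: str, cases: List[Case]) -> float:
    """Rule quality as sensitivity times specificity on the given cases."""
    tp = fp = fn = tn = 0
    for case in cases:
        hit, positive = covers(antecedent, case), case[1] == label
        if hit and positive:
            tp += 1
        elif hit:
            fp += 1
        elif positive:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity * specificity


def ant_miner(cases: List[Case], ants: int = 20, max_uncovered: int = 1,
              max_terms: int = 2, seed: int = 0) -> List[Rule]:
    if max_uncovered < 0 or ants < 1:
        raise ValueError("need max_uncovered >= 0 and ants >= 1")
    rng = random.Random(seed)
    training, rules = list(cases), []
    while len(training) > max_uncovered:          # outer loop: one rule per pass
        terms = sorted({(a, v) for attrs, _ in training for a, v in attrs.items()})
        tau = {t: 1.0 for t in terms}             # equal initial pheromone
        best_rule, best_q = None, -1.0
        for _ in range(ants):                     # inner loop: one rule per ant
            antecedent: Dict[str, str] = {}
            while len(antecedent) < max_terms:
                # Usable terms: unused attribute, and the rule still covers a case.
                options = [t for t in terms if t[0] not in antecedent and any(
                    covers({**antecedent, t[0]: t[1]}, c) for c in training)]
                if not options:
                    break
                a, v = rng.choices(options, weights=[tau[t] for t in options], k=1)[0]
                antecedent[a] = v
            covered = [c for c in training if covers(antecedent, c)]
            label = Counter(c[1] for c in covered).most_common(1)[0][0]
            q = quality(antecedent, label, training)
            for a, v in antecedent.items():       # reinforce the terms used
                tau[(a, v)] += tau[(a, v)] * q
            total = sum(tau.values())
            tau = {t: p / total for t, p in tau.items()}  # normalising acts as evaporation
            if q > best_q:
                best_rule, best_q = (dict(antecedent), label), q
        rules.append(best_rule)
        training = [c for c in training if not covers(best_rule[0], c)]
    return rules
```

Since every term is taken from the current training set, each discovered rule covers at least one remaining case, so the training set shrinks on every pass and the outer loop terminates.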

3.3 Experiment

The study took into account the <meta> and <title> contents of pages from the BBC site. The authors chose this site because of its high standard of code writing and its very well structured information, which improved the chance of finding very good connections between <meta> tags and content.

4 Conclusions and further research

This paper shows how nature inspired optimization methods can be more efficient than classical, exact methods when implementing Semantic Web mining algorithms. Among them, the Ant Colony Optimization metaheuristic proves to be one of the best solution techniques. As future work, the ideas described in the previous sections need to be implemented and thoroughly tested, as nature inspired methods have rarely been used in the context of mining the Semantic Web. Such an implementation would then allow the clustering of resources based on their associated metadata, e.g. their FOAF description, the microformat information they contain, etc.

References

1. Arostegui, M.A., Jr., Kadipasaoglu, S.N., Khumawala, B.M., An empirical comparison of Tabu Search, Simulated Annealing, and Genetic Algorithms for facilities location problems, International Journal of Production Economics, Vol. 103, No. 2, pp. 742-754, 2006.

2. Cortinhal, M.J., Captivo, M.E., Genetic Algorithms for the Single Source Capacitated Location Problem: a Computational Study, in the Proceedings of the 4th Metaheuristics International Conference, pp. 355-359, Porto, Portugal, 2001.

3. Dorigo, M., Stutzle, T., Ant Colony Optimization, MIT Press, 2004.

4. Fellbaum, C. (Ed.), WordNet - an electronic lexical database, MIT, 1998.

5. Grimnes, G.A., Edwards, P., Preece, A., Instance based Clustering of Semantic Web Resources, in the Proceedings of the 5th European Semantic Web Conference, LNCS 5021, Springer-Verlag, pp. 303-317, 2008.

6. Holden, N., Freitas, A.A., Web Page Classification with an Ant Colony algorithm, in the Proceedings of the 8th International Conference on Parallel Problem Solving from Nature, LNCS 3242, Springer-Verlag, pp. 1092-1102, 2004.

7. Kao, Y., Cheng, K., An ACO-Based Clustering Algorithm, in the Proceedings of the Ant Colony Optimization and Swarm Intelligence Conference, LNCS 4150, pp. 340-347, 2006.

8. Mahajan, M., Nimbhorkar, P., Varadarajan, K., The Planar k-means Problem is NP-hard, in the Proceedings of the 3rd International Workshop on Algorithms and Computation, LNCS 5431, pp. 274-285, 2009.

9. Mirchandani, P.B., Francis, R.L., Discrete location theory, New York: Wiley, 1990.

10. Montes-y-Gomez, M., Gelbukh, A., Lopez-Lopez, A., Comparison of Conceptual Graphs, in Lecture Notes in Artificial Intelligence, Volume 1793, Springer-Verlag, pp. 548-556, 2000.

11. Sevkli, M., Guner, A.R., A Continuous Particle Swarm Optimization Algorithm for the Uncapacitated Facility Location Problem, in the Proceedings of the 5th International Workshop on Ant Colony Optimization and Swarm Intelligence, ANTS 2006, pp. 316-323, Brussels, Belgium, 2006.

12. The official W3C Semantic Web Activity page at http://www.w3.org/2001/sw/.

13. Venables, H., Moscardini, A., An Adaptive Search Heuristic for the Capacitated Fixed Charge Location Problem, in the Proceedings of the 5th International Workshop on Ant Colony Optimization and Swarm Intelligence, ANTS 2006, pp. 348-355, Brussels, Belgium, 2006.

14. The wapedia page on homographs at http://wapedia.mobi/en/Homograph.

15. The wapedia page on homonyms at http://wapedia.mobi/en/Homonyms.

16. Words that can be used both as nouns and verbs, http://www.dailywritingtips.com/careful-with-words-used-as-noun-and-verb/
