[IEEE 2009 Latin American Web Congress (LA-WEB) - Merida, Yucatan, Mexico (2009.11.9-2009.11.11)]...



Graph Local Clustering for Topic Detection in Web Collections

Sara E. Garza and Ramón Brena
Center for Intelligent Computing and Robotics
Tec de Monterrey, Campus Monterrey
Monterrey, Mexico

{A00592719,ramon.brena}@itesm.mx

Abstract—In the midst of a developing Web that increases its size at a constant rhythm, automatic document organization becomes important. One way to arrange documents is to categorize them into topics. Even though there are different ways to conceive of topics and their extraction, a practical option is to view them as document groups and apply clustering algorithms. An attractive alternative that naturally copes with the Web's size and complexity is the one proposed by graph local clustering (GLC) methods. In this paper, we define a formal framework for working with topics in hyperlinked environments and analyze the feasibility of GLC for this task. We performed tests over an important Web collection, namely Wikipedia, and our results, which were validated using various kinds of methods (some of them specific to the information domain), indicate that this approach is suitable for topic discovery.

Keywords-graph clustering; topic detection; Wikipedia; Web hyperlink structure mining;

I. INTRODUCTION

In the midst of a developing Web that already surpasses 30 billion indexed pages¹ and keeps increasing its size at an active rhythm, automatic document organization becomes important. In that sense, one way to arrange documents is to categorize them into topics, that is, according to a common theme. Even though the literature offers various means for treating topics and their respective documents, a practical and straightforward approach is to view them as document clusters. This perspective is not only suitable for a purely link-based approach (which is our intention), but also enables the use of clustering alternatives. One of these is graph local clustering; unlike techniques that group by taking all data into account (a global view) or that require external operations to overcome space and time complexity, this kind of clustering inherently creates groups "on demand" by searching in neighborhoods (partial views). It thus becomes an attractive choice for coping with the size of the Web and its collections in a more "natural" way.

Taking the above into account, we aim to test this approach for discovering topics in Web collections; to accomplish this, it is first necessary to provide a description of documents, corpora, and topics in a hyperlinked environment

¹ http://www.worldwidewebsize.com/ (Retrieved June 2009.)

and establish the core elements required for achieving the topic detection task under these circumstances. Furthermore, it is also important to search for forms of topic validation. Consequently, our main contributions are the following:

(1) A formal framework for topic detection using graph clustering

(2) A small group of topic clusters obtained from a Web collection (Wikipedia) that serve as a proof of concept

(3) A set of (established) evaluation metrics that can be utilized for assessing topic quality

This document is organized as follows: Section II relates topic detection to graph clustering and provides the necessary background; Section III describes our topic detection framework; Section IV covers the experiments performed, their validation, and a discussion of results; Section V enumerates related work; finally, Section VI states conclusions and future work.

II. TOPIC DETECTION AS GRAPH CLUSTERING

A topic, in an abstract sense, can refer to a subject matter, a category, or, simply, a theme. There are different ways to represent topics, which can involve the corpus text and/or hyperlinks; for instance, a traditional semantic embodiment of topics is that of probability distributions over vocabulary words (a.k.a. topic models [1]). However, if we are working with a purely link-based approach, we might, instead of viewing the corpus as a bag of words, consider the document collection as a directed graph (as has been done in [2] and [3] for the whole Web corpus) and regard topics as document clusters [4]. In that way, it is possible to recast the core problem of topic detection as a graph clustering problem.

There is a comprehensive list of clustering methods available, ranging from conventional techniques (such as k-means) adapted for the hyperlinked-document domain to more specialized alternatives, such as spectral clustering. Even though all these methods exhibit considerable advantages, they are inherently global, meaning that they produce a clustering for all data at once. So, unless an additional strategy is used to deal with the computational demands, these methods can become expensive at the scale of the Web. Local clustering methods [5], on the other hand, naturally group data based on partial views, and

2009 Latin American Web Congress

978-0-7695-3856-3/09 $26.00 © 2009 IEEE

DOI 10.1109/LA-WEB.2009.21



clusters are computed one at a time by iteratively adding elements in the vicinity of a given starting point.

A graph local clustering approach (which we will call GLC for short) generically consists of a local search method that maximizes a cohesion-related fitness function; the algorithm is applied over an initial subset of vertices (known as the "seed"), and with each iteration neighbors of this subset are added to (or, in some instances, removed from) the resulting cluster. As expected, the algorithm finishes when it cannot find a vertex capable of increasing the cohesion value, although thresholds on the number of iterations, elements, or edges, or on the maximum obtained cohesion, can also be imposed.

This GLC approach has been applied in works like [6] and [7], and was initially proposed in [8], where it was used for clustering one of the connected components of the Chilean Web (in fact, following this method for topic discovery was left as a possible future direction). The conventional approach has used simulated annealing along with a fitness function combining the cluster's local and relative densities; the latter (relative density, denoted by ρ(C)), which is the metric of our interest, is the ratio of the cluster's internal degree (the number of edges that connect two cluster members) to its total number of (outgoing) edges (see Eq. 1). In the next section, we describe the specific GLC algorithm applied in the current work.

deg_int(C) = |{(u, v) : (u, v) ∈ E, u, v ∈ C}|
deg_ext(C) = |{(u, v) : (u, v) ∈ E, u ∈ C, v ∉ C}|
deg_out(C) = deg_int(C) + deg_ext(C)
ρ(C) = deg_int(C) / deg_out(C)          (1)
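The quantities in Eq. 1 are straightforward to compute from an edge list. The following is a minimal sketch (function and variable names are ours, not the paper's):

```python
def relative_density(edges, cluster):
    """ρ(C) from Eq. 1 for a cluster in a directed graph given as an edge list."""
    cluster = set(cluster)
    # deg_int: edges with both endpoints inside the cluster.
    deg_int = sum(1 for u, v in edges if u in cluster and v in cluster)
    # deg_ext: edges leaving the cluster.
    deg_ext = sum(1 for u, v in edges if u in cluster and v not in cluster)
    deg_out = deg_int + deg_ext
    return deg_int / deg_out if deg_out else 0.0

# A 3-cycle that points once to an outside vertex: ρ = 3 / (3 + 1).
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "x")]
print(relative_density(edges, {"a", "b", "c"}))  # → 0.75
```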

III. TOPIC DETECTION FORMAL FRAMEWORK

Let us consider a corpus (document collection) as a directed graph C = (V, E), where the vertex set V represents documents and the edge set E represents the hyperlinks among them. Now, let us view an individual document dj (dj ∈ V) as a pair dj = (tj, Lj), where:

• tj represents the document title
• Lj is the set of pages the document links to:
  – Lj = {n | n ∈ V, (dj, n) ∈ E}, or equivalently,
  – Lj = Γ(dj), where Γ(v) represents the neighborhood of a vertex v

As a consequence, a topic Ti can be defined as a tuple Ti = (t, Si, Di, χi), where:

• t is the topic tag
• Si is the seed from which the topic is formed: Si ⊂ V
• Di is the set of documents that the topic contains:
  – Di ⊂ V
  – Di ⊇ Si
  – Di = F(Si), where F(Si) is a clustering function
• χi is a cohesion value for the topic (obtained by applying a certain cohesion function: χi = χ(Di))

Regarding F(Si), the "clustering function": it is a function that takes a seed of documents and produces a cluster. It therefore constitutes a central element of our topic detection framework, since, when embedded into a clustering process, it allows us to discover the document groups in a corpus. Note that if we are to use local clustering, then F(Si) = GLC(Si), and χ(Di) equals the fitness function to be maximized.
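The tuples above can be mirrored directly as lightweight data structures. The following Python sketch (class and field names are ours) instantiates the toy example used later in Section III-B:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    title: str                                 # t_j, the document title
    links: set = field(default_factory=set)    # L_j, pages the document links to

@dataclass
class Topic:
    tag: str           # t, the topic tag
    seed: set          # S_i, the seed the topic grows from
    documents: set     # D_i, with D_i ⊇ S_i
    cohesion: float    # χ_i = χ(D_i)

d = Document("a", {"b", "c", "1"})
t = Topic("Topic α", {"a"}, {"a", "b", "c", "d"}, 0.857)
assert t.seed <= t.documents  # the invariant D_i ⊇ S_i holds
```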

A. Our GLC approach

The GLC method can be implemented in a variety of ways, with several options for the local search strategy and the corresponding fitness measure. In our case, we decided to combine hill climbing with relative density; the motivation behind this selection lies in various aspects. First, since the basic aim is to assess the feasibility of using GLC in general as our clustering function for topic detection, we need an elementary implementation capable of producing "rough" clusters from nearly trivial input seeds (since looking for good seeds can be an elaborate task in itself, as shown in [9]). In that sense, hill climbing seems preferable because it needs fewer parameters (ultimately, none) than simulated annealing, and because of its overall simplicity compared to methods such as tabu search or genetic algorithms. Furthermore, relative density was chosen because, based on some preliminary tests, it seems to yield larger clusters than the conventional metric used in previous works; this not only allows us to analyze the data more thoroughly but is also convenient because there is a higher probability of obtaining "fair-sized" results for trivially chosen seeds, regardless of whether they are good or not.

The described clustering-function algorithm is shown in Table I. As stated earlier, a strong attempt was made to keep the number of parameters to a minimum; however, in order to obtain rough clusters in less time, a cohesion threshold was also introduced. That way, the algorithm can stop either when it does not find a better result or after reaching a certain relative density (note that a threshold equal to 1.0 is equivalent to not using this parameter at all). Another important consideration is that, for simplicity, the provided pseudocode always depicts selection of the best element; in practice, however, the algorithm performs an exhaustive evaluation of the possibilities only for a limited number of iterations (or until a certain external degree is reached). For the remaining iterations, neighbors are sorted in descending order according to the number of cluster elements that link to them, and the first one that increases relative density is taken as the "best".

206208

Page 3: [IEEE 2009 Latin American Web Congress (LA-WEB) - Merida, Yucatan, Mexico (2009.11.9-2009.11.11)] 2009 Latin American Web Congress - Graph Local Clustering for Topic Detection in Web

Table I: GLC algorithm.

function FIND-GLC-TOPIC(S, ξ) returns a cluster C containing the topic's documents

  inputs:
    S, an initial vertex set (seed)
    ξ, cohesion threshold

  local variables:
    ρp, relative density before an addition
    ρc, relative density after an addition
    N, set of neighbors
    n, a neighbor that increases cohesion

  C ← S                              // consider the seed as the initial cluster
  do
    ρp ← ρ(C)
    N ← (L1 ∪ … ∪ L|C|) − C          // pages linked from cluster members, minus the cluster
    nbest ← arg max_{n ∈ N} ρ(C ∪ {n})
    C ← C ∪ {nbest}
    ρc ← ρ(C)
  while ρc < ξ AND ρp < ρc
  return C
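A minimal executable sketch of this procedure, combining hill climbing with the relative density of Eq. 1, might look as follows in Python. The toy graph reproduces the example of Section III-B; node 1's two outgoing links are our assumption, chosen to be consistent with Eq. 4. This is an illustration, not the authors' implementation (in particular, it omits the limited exhaustive-evaluation heuristic described above):

```python
def relative_density(edges, cluster):
    """Relative density (Eq. 1): internal edges over internal + outgoing edges."""
    deg_int = sum(1 for u, v in edges if u in cluster and v in cluster)
    deg_ext = sum(1 for u, v in edges if u in cluster and v not in cluster)
    total = deg_int + deg_ext
    return deg_int / total if total else 0.0

def find_glc_topic(edges, seed, threshold=1.0):
    """Grow `seed` by hill climbing while relative density keeps improving."""
    cluster = set(seed)
    out_links = {}
    for u, v in edges:
        out_links.setdefault(u, set()).add(v)
    while True:
        current = relative_density(edges, cluster)
        if current >= threshold:   # cohesion threshold ξ reached
            break
        # Candidate neighbors: pages linked from cluster members, minus the cluster.
        neighbors = set().union(*(out_links.get(d, set()) for d in cluster)) - cluster
        if not neighbors:
            break
        best = max(neighbors, key=lambda n: relative_density(edges, cluster | {n}))
        if relative_density(edges, cluster | {best}) <= current:
            break                  # no neighbor increases cohesion
        cluster.add(best)
    return cluster

# Toy graph of Section III-B; edges (1, "2") and (1, "3") are an assumption.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "a"), ("b", "c"), ("d", "c"),
         ("a", "1"), ("1", "2"), ("1", "3")]
print(sorted(find_glc_topic(edges, {"a"})))  # → ['a', 'b', 'c', 'd']
```

Starting from seed {a}, the search adds c, b, and d in turn (density rising to 6/7 ≈ 0.857) and then stops, since adding node 1 would lower the density, matching the worked example below.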

B. Topic Discovery by Density: an Example

To exemplify the framework and some of its elements, consider the toy graph illustrated in Figure 1. For instance, document a would be represented as shown in Eq. 2, and the rest of the documents would have a similar representation.

Figure 1: A document graph.

a = (ta, La) = ("a", {b, c, 1})          (2)

Now, suppose the GLC algorithm has been given node a as the seed, and vertices b, c, and d have been added to the topic cluster so far, C = {a, b, c, d}; following Eq. 3, we can see that the current density is 0.857, and we also know that the only neighbor left to explore (up to this point) is node 1. When examining this node, it becomes clear that the density would decrease with its addition (see Eq. 4), and the algorithm stops. The resulting topic (which will be tagged "Topic α") would then be represented as in Eq. 5 (note here that χα = ρ(Dα), where Dα is the algorithm's resulting cluster C).

ρ(C) = deg_int(C) / (deg_int(C) + deg_ext(C))
     = |{(a,b), (a,c), (b,d), (c,a), (b,c), (d,c)}| / (|{(a,b), (a,c), (b,d), (c,a), (b,c), (d,c)}| + |{(a,1)}|)
     = 6 / (6 + 1)
     = 0.857          (3)

ρ(C ∪ {1}) = (deg_int(C) + |{(a,1)}|) / (deg_out(C) + |{(1,2), (2,3)}|)
           = (6 + 1) / (7 + 2)
           = 0.778          (4)

Tα = (t, Sα, Dα, χα) = ("Topic α", {a}, {a, b, c, d}, 0.857)          (5)

The two topics of the illustrated document graph are shown in Figure 2 ("Topic β" being detected by running the algorithm with a seed chosen from the node subset 1-5).

Figure 2: Topic discovery by density.

IV. EXPERIMENTS

This section describes the experiments carried out as a proof of concept for Web topic detection with the GLC method; we offer an overview of the main results and then show how these were validated. A very relevant aspect of our experiments is that we focused on a specific collection, namely the English Wikipedia, which is one of the top ten most visited websites on a global scale². This corpus constitutes an interesting testbed because it remains representative of the Web's complexity: even though it may seem homogeneous on the surface, it is actually very diverse. What is more, it is in encyclopedic knowledge collections that topic detection seems most like a regular task.

A. Setup

Corpus.- We used the 2005 XML version of Wikipedia, which contains about 900,000 articles and approximately 19 million links. The XML file was pre-processed using Wikiprep³ (a Perl script that generates some convenient

² http://www.alexa.com/topsites/global (Retrieved June 2009.)
³ http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/


structures, such as link tables); link information was extracted from this file and placed into a relational database. Moreover, since manual lists and category pages were considered "noise", they were suppressed from the collection.

GLC parameters.- With the intent of quickly gathering preliminary clusters, the algorithm's cohesion threshold was set to 0.5.

Seeds.- Five seeds were manually selected for the experiments: (1) Artificial Intelligence (AI), (2) Architecture, (3) Linguistics, (4) Film, and (5) Science. Each seed document set consisted of at least one element, namely the corresponding article for each topic (the article for AI, the article for architecture, etc.). However, for the themes concerning AI, Architecture, and Film, more elements were introduced into the seed; these additional articles were chosen by looking at the hierarchy provided by ODP (the Open Directory Project⁴).

B. Results

Results are summarized in Table II, where the values of t, Si, Di, and χi for each topic Ti are displayed. The topic tags were manually assigned according to each cluster's final contents. For space reasons, not all of the cluster members are listed; also, for readability, article titles (instead of the corresponding numerical indices) are used to refer to documents.

C. Validation

In order to evaluate the results, we propose using two kinds of metrics: (a) intra-cluster cohesion vs. inter-cluster separation metrics and (b) group quality metrics, the first showing that the formed groups indeed have a cluster structure and the second providing a quality measurement on a global scale. While these metrics can be used with any type of clusters in general, for the first kind (a) we propose two measurements that can be considered specific to topics: (1) cosine similarity and (2) Wikipedia article semantic relatedness distance; for group quality (b), we chose the (3) "relative strength" ratio. We now describe each metric and provide the obtained validation results. As for cluster cohesion and separation, clusters were compared not only against each other but also against a random document set, in order to give a wider perspective.

Cosine similarity is a text-based measure that calculates the correlation between a pair of documents (represented as term vectors) by obtaining the cosine of the angle between them [10]. A similarity of 1 indicates that the documents are identical, while a value of 0 indicates no relation. This metric was chosen because, despite its shortcomings, it constitutes a pure text approach, which can be seen as orthogonal to our method. Furthermore, since it has been

⁴ http://dmoz.org

proved that a document is similar (in content) to the documents that link to it [11], this type of measurement is valid. However, probably the best reason for selecting cosine similarity is that it yields a semantic metric, which is especially appropriate for topics.
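As a concrete illustration, cosine similarity over term-frequency vectors can be computed as below; the text does not specify the term weighting used (tf-idf is common), so this sketch simply uses raw counts:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * \
           math.sqrt(sum(c * c for c in vb.values()))
    return dot / norm if norm else 0.0

# Identical documents → 1; documents with no shared terms → 0.
print(round(cosine_similarity("graph local clustering", "graph local clustering"), 3))  # → 1.0
print(cosine_similarity("graph clustering", "film production"))  # → 0.0
```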

With regard to semantic relatedness [12], it is actually a distance metric for Wikipedia articles inspired by the Normalized Google Distance [13]; it takes into account the number of pages that link to each of the two articles separately and to both of them (their intersection). A distance of 0 means that the pair of documents is linked by the same sources; if the documents share no linking pages, the distance becomes infinite. This measure was selected because it represents a different link-analysis approach and is specific to the Wikipedia corpus.
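The text does not reproduce the relatedness formula; the sketch below is our assumption of an NGD-style distance in which search-engine hit counts are replaced by the sets of articles linking to each page, matching the behavior described (0 for identical in-link sets, infinite for disjoint ones):

```python
import math

def relatedness_distance(inlinks_a, inlinks_b, n_articles):
    """NGD-style distance from the sets of pages linking to two articles (a sketch)."""
    fa, fb, fab = len(inlinks_a), len(inlinks_b), len(inlinks_a & inlinks_b)
    if fab == 0:
        return math.inf  # no shared linking pages: infinite distance
    la, lb, lab = math.log(fa), math.log(fb), math.log(fab)
    return (max(la, lb) - lab) / (math.log(n_articles) - min(la, lb))

# Two articles linked by exactly the same sources are at distance 0.
print(relatedness_distance({1, 2, 3}, {1, 2, 3}, 900_000))  # → 0.0
```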

The "relative strength" ratio is a Social Network Analysis metric that reflects the relative cohesion of internal edges with respect to external ones; a value greater than 1 indicates that links within the group are tighter than links to outer elements [14]. It is useful for validating our results because it constitutes an alternative way of showing that they comply with what we are looking for.
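Reference [14] gives the exact formula; as a hedged sketch, one common reading compares the density of within-group directed ties to the density of ties crossing the group boundary (the names and normalization here are our assumption):

```python
def relative_strength(edges, group, n_total):
    """Within-group tie density over boundary-crossing tie density (a sketch)."""
    group = set(group)
    k = len(group)
    internal = sum(1 for u, v in edges if u in group and v in group)
    crossing = sum(1 for u, v in edges if (u in group) != (v in group))
    # Normalize by the number of possible directed ties of each kind.
    d_int = internal / (k * (k - 1)) if k > 1 else 0.0
    d_ext = crossing / (2 * k * (n_total - k)) if n_total > k else 0.0
    return d_int / d_ext if d_ext else float("inf")

# Toy graph of Section III-B (7 vertices): the {a, b, c, d} group is far denser inside.
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "a"), ("b", "c"), ("d", "c"),
         ("a", "1"), ("1", "2"), ("2", "3")]
print(round(relative_strength(edges, {"a", "b", "c", "d"}, 7), 1))  # → 12.0
```

A ratio well above 1, as here, is exactly the signature the validation below looks for.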

It is important to note that only internal validity measures are applied, because Wikipedia is a "natural" and, consequently, untagged corpus. For this reason, calculating precision, recall, and similar metrics is not trivial. Furthermore, internal validation should suffice at this point, considering that it provides a general idea of the method's effectiveness.

1) Validation Results: Cosine similarity. An average cosine similarity within clusters of 0.026 was obtained, while the average similarity among different clusters was 0.004, and the average similarity against the random document set was 0.003. These results illustrate that, even though the average intra-cluster cosine similarity is low (which is logical to some extent, since the groups do not contain nearly identical documents), it surpasses the inter-cluster and random-set similarities by a factor ranging from 5 to 10. This is shown in Figure 3; a detailed analysis for each group is given in Table III.

Table III: Cosine Similarity

         AI     Arch.  Film   Ling.  Sc. R.  Rand.
AI       0.025  0.005  0.003  0.003  0.004   0.002
Arch.    -      0.034  0.004  0.004  0.008   0.005
Film     -      -      0.016  0.002  0.002   0.003
Ling.    -      -      -      0.023  0.005   0.002
Sc. R.   -      -      -      -      0.031   0.003

Semantic relatedness. The average intra-cluster distance was 0.36, while the mean inter-cluster distance was 0.65, and 0.56 was the result of comparing against the random article set (Figure 4); the first distance is therefore smaller almost by a factor of 2. Once again, we can confirm that members of the same cluster are closer, or more related,


Table II: Results summary.

AI (117 members)
  Seed: Artificial Intelligence, Data mining, Neural Network, Natural Language Processing, Fuzzy system, Support Vector Machine
  Members: Blocks World, LISP, Machine Learning, Soft computing, Computational Intelligence, Inductive Logic Programming, Hybrid intelligent system, . . .
  R. Density: 0.365

Architecture (142 members)
  Seed: Architecture, Landscape architecture, Urban planning, Interior design, Building, Construction, Civil engineering
  Members: Structural engineering, Sustainable architecture, Urban planner, Civil engineer, Building design, Intelligent building, Construction Management, . . .
  R. Density: 0.269

Linguistics (389 members)
  Seed: Linguistics
  Members: Semantics, Morphology, Word, Sentence, Part of speech, Verb, William Croft, Parse tree, Phonology, Stylistics, Noun, Phonetics, . . .
  R. Density: 0.529

Filmmaking (664 members)
  Seed: Film, Academy Award
  Members: Pre-production, Film production, Post-production, Scenic design, Casting, Actor, Sound mixer, Costume design, Playwright, . . .
  R. Density: 0.41

Scientific Research (658 members)
  Seed: Science
  Members: Field work, Experiment, Hypothesis, Statistical hypothesis testing, Methodology, Scientific method, Scientific community, Research, Empirical method, Qualitative research, . . .
  R. Density: 0.502

Figure 3: Average cosine similarity.

to each other than to members of other groups. Another interesting finding is that, considering that relatedness relies on linking pages, the articles inside the clusters are relatively close. On the other hand, it is important to mention that the resulting averages were calculated by taking into account only non-infinite values; it is also relevant to note that the number of infinite distances was larger among members of different groups than among members of the same one. Table IV displays the semantic relatedness for each cluster.

Figure 4: Average semantic relatedness.

Relative strength ratio. The calculated relative strength ratios were the following: AI, 2.78; Architecture, 2.2; Linguistics, 4.1; Filmmaking, 3.99; Scientific Research, 12.97. As we can see, all the obtained groups have a ratio higher than 1.

Table IV: Semantic Relatedness (distance)

         AI     Arch.  Film   Ling.  Sc. R.  Rand.
AI       0.24   0.65   0.66   0.52   0.54    0.33
Arch.    -      0.34   0.73   0.7    0.69    0.67
Film     -      -      0.38   0.71   0.73    0.57
Ling.    -      -      -      0.35   0.54    0.58
Sc. R.   -      -      -      -      0.44    0.64

D. Result discussion

Roughly speaking, the obtained clusters do represent topics. An interesting issue is that, in a couple of cases, the algorithm tended to add more documents from a certain sub-topic than from the topic in general (it acted more like depth-first than breadth-first search); for instance, for the AI topic, the last inserted articles were more specific to "problem solving and optimization with AI" than to "Artificial Intelligence" in a broad sense.

V. RELATED WORK

Related work has been gathered into several major areas (listed in descending order of similarity): (1) topic detection with hyperlinks, (2) community identification and link-based Web document clustering in general, and (3) text-based topic detection in Wikipedia. The main difference with respect to works in the first area is that their topics are usually defined intuitively or loosely (a formal definition is not explicit); moreover, the applied methods differ. Concerning community identification and link-based clustering, the former is distinct from our approach in the sense that communities are generally more related to people, and the topic detection task is considered more a by-product of applying the techniques; another difference is that some of the discussed works were developed not for the Web but for other types of networks. Regarding the latter (link-based clustering), we will list some approaches that cluster webpages with a purpose other than producing topics. Finally, as can be inferred, text-based approaches use a different resource: content.

Regarding topic detection and clustering, [15] identifies topics by clustering similar webpages; topic clusters are found by recursively applying a spectral graph-partitioning method with a hybrid-source (text, hyperlinks, co-citation) similarity metric. These clusters are then post-processed with Kleinberg's HITS algorithm [16] to distill authorities and hubs. [17] also relies on spectral methods for discovering topics and retrieving their most representative pages when supplied with a user query.

Hyperlink-based Web document clustering in general is also closely related to community identification, which is devoted to finding groups of persons who share common interests, along with their respective pages, this last point being considered a form of topic detection (in fact, in [18] the community identification task is given the name of topic enumeration). One of the most relevant works in this area is that of Flake et al. [19], [20], where communities are detected by means of the maximum-flow/minimum-cut theorem with the aid of focused crawling; this approach has been stated to be purely link-based and is capable of identifying topically related pages. Another important work that also claims to find topics in the midst of communities is the one proposed by Gibson, which uses HITS to find authorities and hubs; this approach intends to find community "cores" (which consist, not surprisingly, of authoritative pages), given a user query.

Continuing with the area of community identification, there are several works whose aim does not include topic detection (either explicitly or implicitly), but which concern important developments that could be extrapolated to finding topics. One of these, by Clauset et al. [21], is devoted to community discovery in large networks; a greedy hierarchical clustering algorithm that optimizes modularity (a community quality measure proposed by Newman and Girvan [22]) and makes use of advanced data structures is proposed and tested on a purchasing network. [23] identifies communities using random walks and is primarily focused on preserving the directed nature of the Web graph. On the other hand, [24] describes a method for incrementally constructing clusters by adding members that show a high affinity with the developing groups; the approach was tested on several graphs with a modest number of vertices and is primarily focused on applications for graph drawing and visualization.

Regarding text-based approaches, there are plenty of works that cluster related documents, and it is difficult to supply a comprehensive list. However, two methods that focus on the Wikipedia corpus are [25] and [26]. The former uses Wikipedia's category network and document titles for clustering and classification in several corpora. The latter, by Wartena and Brussee, experiments with various similarity metrics using bisecting k-means as the clustering approach and advocates their proposed metric, based on term co-occurrence, for obtaining higher-quality results.

It is relevant to restate that probably the major particularity of our work in comparison to these consists of a complete focus on the topic detection task (and in a pure-link fashion); for this reason, special emphasis has been given to its formal definition and to the consideration of clusters as document groups with a common thematic (thus tailoring methods and validation techniques for this kind of clusters).

VI. CONCLUSIONS AND FUTURE WORK

Throughout the present work, we have provided a definition for topic detection in terms of hyperlinked corpora (such as Wikipedia and the Web in general) and briefly discussed some properties of graph local clustering (GLC) that make it a suitable and attractive alternative. Furthermore, a GLC approach was used to find several document groups with the aim of establishing its feasibility for accomplishing the topic discovery task, at least in a rough manner. Results were evaluated according to several established criteria (some of these specific for validating document clusters) and proved favorable, hence enabling us to conclude that GLC can indeed be utilized for finding topics inside Web collections.

Future work includes embedding the GLC function into a process capable of producing a high-coverage clustering (which also implies seed selection and expansion), as well as automatic topic tagging.

An important additional validation for our method will be to compare it against other approaches (e.g., text-based ones).

REFERENCES

[1] T. L. Griffiths, M. Steyvers, and J. B. Tenenbaum, “Topics in semantic representation,” Psychological Review, vol. 114, pp. 211–244, 2007.

[2] J. Kleinberg, S. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, “The Web as a graph: Measurements, models and methods,” Lecture Notes in Computer Science, pp. 1–17, 1999.

[3] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener, “Graph structure in the Web,” Computer Networks, vol. 33, no. 1-6, pp. 309–320, 2000.

[4] N. Andrews and E. Fox, “Recent developments in document clustering,” Virginia Tech, Tech. Rep., 2007. http://eprints.cs.vt.edu/archive/00001000/01/docclust.pdf

[5] S. Schaeffer, “Graph clustering,” Computer Science Review, vol. 1, no. 1, pp. 27–64, 2007.

[6] ——, “Stochastic local clustering for massive graphs,” Complexity, vol. 100, p. 5.

[7] E. Meneses, “Vectors and graphs: Two representations to cluster web sites using hyperstructure,” in Proceedings of the Fourth Latin American Web Congress. IEEE Computer Society, Washington, DC, USA, 2006, pp. 172–178.

[8] S. Virtanen, “Clustering the Chilean Web,” in Web Congress, 2003. Proceedings. First Latin American, 2003, pp. 229–231.

[9] R. Andersen and K. Lang, “Communities from seed sets,” in Proceedings of the 15th International Conference on World Wide Web. ACM, New York, NY, USA, 2006, pp. 223–232.

[10] R. Baeza-Yates, Modern Information Retrieval. New York: Addison-Wesley, 1999.

[11] F. Menczer, “Links tell us about lexical and semantic Web content,” University of Iowa, Tech. Rep., 2001.

[12] D. Witten, “An effective, low-cost measure of semantic relatedness obtained from Wikipedia links,” in Proc. AAAI 2008 Workshop on Wikipedia and Artificial Intelligence, 2008.

[13] R. Cilibrasi and P. Vitanyi, “The Google similarity distance,” IEEE Transactions on Knowledge and Data Engineering, pp. 370–383, 2007.

[14] S. Wasserman and K. Faust, Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.

[15] X. He, C. Ding, H. Zha, and H. Simon, “Automatic topic identification using webpage clustering,” in Proc. IEEE Intl. Conf. Data Mining, San Jose, CA, 2001, pp. 195–202.

[16] J. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.

[17] K.-J. Wu, M.-C. Chen, and Y. Sun, “Automatic topics discovery from hyperlinked documents,” Information Processing & Management, vol. 40, no. 2, pp. 239–255, March 2004.

[18] R. Kumar, P. Raghavan, S. Rajagopalan, D. Sivakumar, A. Tomkins, and E. Upfal, “The Web as a graph,” in Proceedings of the Nineteenth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems. ACM, New York, NY, USA, 2000, pp. 1–10.

[19] G. Flake, S. Lawrence, and C. Giles, “Efficient identification of Web communities,” in Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, NY, USA, 2000, pp. 150–160.

[20] G. Flake, S. Lawrence, C. Giles, and F. Coetzee, “Self-organization and identification of Web communities,” Computer, vol. 35, no. 3, pp. 66–70, 2002.

[21] A. Clauset, M. Newman, and C. Moore, “Finding community structure in very large networks,” Physical Review E, vol. 70, no. 6, p. 066111, 2004.

[22] M. Newman and M. Girvan, “Finding and evaluating community structure in networks,” Physical Review E, vol. 69, p. 026113, 2004.

[23] J. Huang, T. Zhu, and D. Schuurmans, “Web communities identification from random walks,” Lecture Notes in Computer Science, vol. 4213, p. 187, 2006.

[24] X. Huang and W. Lai, “Identification of clusters in the Web graph based on link topology,” in Database Engineering and Applications Symposium, 2003. Proceedings. Seventh International, 2003, pp. 123–128.

[25] P. Schonhofen, “Identifying document topics using the Wikipedia category network,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence. IEEE Computer Society, Washington, DC, USA, 2006, pp. 456–462.

[26] C. Wartena and R. Brussee, “Topic detection by clustering keywords,” in DEXA 2008: 19th International Conference on Database and Expert Systems Applications, Proceedings, 2008.
