Personalization in Folksonomies Based on Tag …...Personalization in Folksonomies Based on Tag...

Personalization in Folksonomies Based on Tag Clustering

Jonathan Gemmell, Andriy Shepitsen, Bamshad Mobasher, Robin Burke

Center for Web IntelligenceSchool of Computing, DePaul University

Chicago, Illinois, USA

{jgemmell,ashepits,mobasher,rburke}@cti.depaul.edu

Abstract

Collaborative tagging systems, sometimes referred to as“folksonomies,” enable Internet users to annotate or searchfor resources using custom labels instead of being restrictedby pre-defined navigational or conceptual hierarchies. How-ever, the flexibility of tagging brings with it certain costs. Be-cause users are free to apply any tag to any resource, tag-ging systems contain large numbers of redundant, ambigu-ous, and idiosyncratic tags which can render resource discov-ery difficult. Data mining techniques such as clustering canbe used to ameliorate this problem by reducing noise in thedata and identifying trends. In particular, discovered patternscan be used to tailor the system’s output to a user based on theuser’s tagging behavior. In this paper, we propose a methodto personalize a user’s experience within a folksonomy us-ing clustering. A personalized view can overcome ambigu-ity and idiosyncratic tag assignment, presenting users withtags and resources that correspond more closely to their in-tent. Specifically, we examine unsupervised clustering meth-ods for extracting commonalities between tags, and use thediscovered clusters as intermediaries between a user’s pro-file and resources in order to tailor the results of search tothe user’s interests. We validate this approach through ex-tensive evaluation of proposed personalization algorithm andthe underlying clustering techniques using data from a realcollaborative tagging Web site.

IntroductionCollaborative tagging is an emerging trend allowing Internetusers to manage and share online resources through user-defined annotations. There has been a recent proliferationof collaborative tagging systems. Two of the most popularexamples are del.icio.us1 and Flickr2. In del.icio.us usersbookmark URLs. Flickr, on the other hand, allows usersto upload, share and manage pictures. Other applicationsspecialize in music, blogs, or journal publications.

At the foundation of collaborative tagging is the anno-tation; a user describes a resource with a tag. A collec-tion of annotations results in a complex network of inter-related users, resources and tags, commonly referred to asa folksonomy (Mathes 2004). Users are free to navigate

Copyright c© 2008, Association for the Advancement of ArtificialIntelligence (www.aaai.org). All rights reserved.

1del.icio.us2www.flickr.com

through the folksonomy without being tied to a pre-definednavigational or conceptual hierarchy. The freedom to ex-plore this large information space of resources, tags, or evenother users is central to the utility and popularity of collab-orative tagging. Tags make it easy and intuitive to retrievepreviously viewed resources (Hammond et al. 2005). Fur-ther, tagging allows users to categorized resources by sev-eral terms, rather than one directory or a single branch ofan ontology (Millen, Feinberg, and Kerr 2006). Collabora-tive tagging systems have a low entry cost when comparedto systems that require users to conform to a rigid hierarchy.Furthermore, users may enjoy the social aspects of collab-orative tagging (Choy and Lui 2006). Collaborative tag-ging offers a sense of community, not provided by eitherontologies or search engines. Users may share or discoverresources through the collaborative network and connect topeople with similar interests.

Because collaborative tagging applications reap the in-sights of many users rather than a few “experts”, they aremore dynamic and able to incorporate a changing vocab-ulary or absorb new trends quickly (Wu, Zhang, and Yu2006). These applications can identify groups of like-minded users, catering not only to mainstream but also tonon-conventional users that are often under-served by tradi-tional Web tools. Search engines, the most widely used toolfor searching large information spaces, attempts to index re-sources. In effect, this is a pull-model, where the applica-tion pulls resources from the information space (Yan, Nat-sev, and Campbell 2007). In contrast, collaborative taggingapplications support a push-model: users identify which re-sources are relevant and through the annotation process pro-mote the resource. Consequently, the collaborative taggingsystem may be populated with resources a pull-model maynot be able to locate.

A collaborative tagging application may be applied to avariety of resource: Web pages, news stories, pictures, videoclips, etc. The content of some of these resources can bedetermined by computers, but some resources are particu-larly difficult to automatically categorize except by relyingon meta-data. For example, it can be particularly difficultfor standard search engines to describe or organize videoand music resources. Collaborative Tagging applications,however, rely on the user’s tags to determine the content ofa resource. As such, it may be easier for users to locate

37

resources with collaborative tagging applications than withstandard search engines.

Another advantage of collaborative tagging applicationsis the richness of the user profiles. As users annotate re-sources, the system is able to track their interests. Data min-ing tools such as clustering can identify important trendsand characteristics of the users. These profiles are a pow-erful tool for personalization algorithms (Yan, Natsev, andCampbell 2007). The relatively low cost of generating a userprofile (typically based on the resources and tags associatedwith a user), can be dramatically overcome by the improveduser experience personalization allows. The benefit of per-sonalization in search to the user is described in (Teevan,Dumais, and Horvitz 2007).

Even though collaborative tagging applications havemany benefits, they also present unique challenges forsearch and navigation. Most collaborative tagging appli-cations permit unsupervised tagging; users are free to useany tag they wish to describe a resource. This is often doneto reduce the entry cost of using the application and makecollaborative tagging more user friendly. As a result, folk-sonomies contain a wide variety of tags: from the factual(e.g., “Mt Rushmore”) to the subjective (e.g., “boring”), andfrom the semantically-obvious (e.g., “Chicago”) to the ut-terly opaque (e.g., “jfgwh”). Moreover, tag redundancy inwhich several tags have the same meaning or tag ambiguityin which a single tag has many different meanings can con-found users searching for resources. The task of combatingnoise is made even more difficult by capitalization, punctu-ation, misspelling, and other discrepancies.

Data mining techniques such as clustering provide ameans to overcome these problems. Through clustering, re-dundant tags can be aggregated; the combined trend of acluster can be more easily detected than the effect of a sin-gle tag. The effect of ambiguity can also be diminished,since the uncertainty of a single tag in a cluster can be over-whelmed by the additive effects of the rest of the tags. Tagclusters may represent coherent topic areas. By associatinga user’s interest to a particular cluster, we may surmise theuser’s interest in the topic.

Personalization can also be used to overcome noise infolksonomies. Given a particular user profile, the user’s in-terests can be clarified and navigation within the folksonomycan be tailored to suit the user’s preferences. By using bothclustering and personalization together we seek to combatnoise in folksonomies and improve the user experience.

In this paper, we propose an algorithm to personalizesearch and navigation based on tags in folksonomies. Thecore of our algorithm is a set of tag clusters, discoveredbased on their associations with resources by various users.The personalization algorithm models users as vectors overthe set of tags. By measuring the importance of a tag clusterto a user, the user’s interests can be better understood and thecontext of user’s interaction can be better delineated. Like-wise, each resource is also modeled as a vector over the setof tags. By associating resources with tag clusters, resourcesrelevant to the topics captured by those clusters can be iden-tified. By using the tag clusters as intermediaries between auser and a resource, we infer the relevance of the resource to

the user. We then use the inferred relevance of the resourceto re-rank the results of a basic search, thereby personalizingthe user experience.

Furthermore, we evaluate the effectiveness of severalclustering techniques that can be used as part of the per-sonalized search and navigation framework: hierarchical ag-glomerative clustering, maximal complete link clustering,andk-means clustering. Hierarchical clusters, in particular,offers more flexibility than other clustering methods. By se-lecting clusters in the hierarchy directly related to the user’saction, the algorithm may focus more clearly on the user’sintent. Alternatively, by including more clusters, the resultcan be generalized promoting serendipitous discovery.

The rest of this paper is organized as follows. We beginwith presenting some related work involving the use of clus-tering and personalization in folksonomies. We then out-line basic approaches used for search in folksonomies andmotivate the need for personalization. The clustering meth-ods and our personalization algorithm are then presented. Inthe experimental evaluation section we evaluate our person-alization algorithm and compare the effectiveness of clus-tering algorithms in that context. Finally, we conclude thepaper and offer some directions for future work.

Related WorkA fundamental assumption in this paper is the ability of clus-tering algorithms to form coherent clusters of related tags.Support for that assumption is given in (Begelman, Keller,and Smadja 2006) where tag clustering is suggested to im-prove search in folksonomies and (Heymann and Garcia-Molina April 2006) where hierarchical agglomerative clus-tering is proposed to generate a taxonomy from a folkson-omy.

Integral to our algorithm for personalization using clus-ters is the measurement of relevance between a user anda resource. A similar notion was previously described in(Niwa, Doi, and Honiden 2006) in which an affinity levelwas calculated between a user and a set of tag clusters. Acollection of resources was then identified for each clusterbased on tag usage. Resources were recommended to theuser based on the user’s affinity to the clusters and the asso-ciated resources.

Our algorithm relies heavily on tag clusters and the utilitythey offer, but clusters have many other potential functionsworth noting. Tag clusters could serve as intermediaries be-tween two users in order to identify like-minded individualsallowing the construction of a social network. Tag cluster-ing can support tag recommendation, reducing annotation toa mouse click rather than a text entry. Well chosen tags makethe recovery process simple and offer some control over thetag-space. By exerting some control over the tag space, theeffect of tag redundancy or ambiguity can be mitigated tosome degree. In (Xu et al. 2006) a group of tags are offeredto the user based on several criteria (coverage, popularity,effort, uniformity) resulting in a cluster of a relevant tags.

Clustering is an important step in many attempts to im-prove search and navigation. In (Wu, Zhang, and Yu 2006),tag clusters are presumed to be representative of the resourcecontent. Thus, a folksonomy of Web resources is used to

38

move the Internet closer to the Semantic Web. In (Choyand Lui 2006) a two-dimensional tag map is constructed. Inthis manner, tag clusters can be used as waypoints in thetag space and facilitate navigation through the folksonomy.In (Hayes and Avesani 2007), topic relevant partitions aregenerated by clustering resources rather than tags. Then themost characteristic resources of the clusters are identified.Users interested in the topic represented by a cluster may beparticularly interested in the characteristic resources.

By using clusters of resources, Flickr, a popular collabo-rative tagging application for pictures, improves search andnavigation by discriminating between different meanings ofa user query. For example, a user selecting the tag “apple”will receive several groups of pictures. One group represents“fruit”; while another contains iPods, iMacs, and iPhones. Athird cluster contains pictures of New York City. In (Chenand Dumais 2000) clusters of resources are shown to benefitnavigation by categorizing the resources into topic areas. Anadvantage of this approach is that the user may interactivelydisambiguate his query.

Search and Navigation in FolksonomiesIn traditional Internet applications the search and navigationprocess serves two vital functions: retrieval and discovery.Retrieval incorporates the notion of navigating to a particu-lar resource or a resource containing particular content. Dis-covery, on the other hand, incorporates the notion of findingresources or content interesting but theretofore unknown tothe user. The success of collaborative tagging is due in partto its ability to facilitate both these functions within a singleuser-centric environment.

Reclaiming previously annotated resources is both simpleand intuitive, as most collaborative tagging applications of-ten present the user’s tag in the interface. Selecting a tagdisplays all resources annotated by the user with that tag.Users searching for particular resources they have yet to an-notate may select a relevant tag and browse resources anno-tated by other users. However, the discovery process can bemuch more complex. A user may browse the folksonomy,navigating through tags, resources, or even other users. Fur-thermore, the user may select one of the results of a query(i.e. tag, resource, or user) as the next query itself. This abil-ity to navigate through the folksonomy is one reason for thepopularity of collaborative tagging. While the user can nav-igating through many different dimensions of a folksonomy,this work focuses on searching for resources using a tag asquery.

A folksonomy can be described as a four-tupleD:

D = 〈U, R, T, A〉 , (1)

where,U is a set of users;R is a set of resources;T is a setof tags; andA is a set of annotations, represented as user-tag-resource triples:

A ⊆ {〈u, r, t〉 : u ∈ U, r ∈ R, t ∈ T } (2)

A folksonomy can, therefore, be viewed as a tripartitehyper-graph (Mika 2007) with users, tags, and resources

represented as nodes and the annotations represented ashyper-edges connecting a user, a tag and a resource.

Standard Search in FolksonomiesContrary to traditional Internet applications, a search in acollaborative tagging application is performed with a tagrather than a keyword. Most often the tag is selected throughthe user interface.

Applications vary in the way they handle navigation. Pos-sible methods include recency, authority, linkage, or vectorspace models (Salton, Wong, and Yang 1975). In this workwe focus on the vector space model adapted from the infor-mation retrieval discipline to work with folksonomies. Eachuser,u, is modeled as a vector over the set of tags, whereeach weight,w(ti), in each dimension corresponds to theimportance of a particular tag,ti.

~u = 〈w(t1), w(t2)...w(t|T |)〉 (3)Resources can also be modeled as a vector over the set

of tags. In calculating the vector weights, a variety of mea-sures can be used. Thetag frequency, tf, for a tag,t, anda resource,r is the number of times the resource has beenannotated with the query tag. We definetf as:

tf(t,r) = |{a = 〈u, r, t〉 ∈ A : u ∈ U}| (4)Likewise, the well knownterm frequency * inverse docu-

ment frequency(Salton and Buckley 1988) can be modifiedfor folksonomies. Thetf*idf multiplies the aforementionedfrequency by the relative distinctiveness of the tag. The dis-tinctiveness is measured by the log of the total number ofresources,N, divided by the number of resources to whichthe query tag was applied,nt. We definetf*idf as:

tf*idf(t,r) = tf(t,r) ∗ log(N/nt) (5)With either term weighting approach, a similarity measure

between a query,q, represented as a vector over the set tags,and a resource,r, also modeled as a vector over the set tags,can be calculated. However, in this work, since search ornavigation is often initiated by selecting a single tag fromthe user interface, we assume the query is a vector with onlyone tag.

Several techniques exist to calculate the similarity be-tween vectors such as Jaccard similarity coefficient or Co-sine similarity (Van Rijsbergen 1979). Cosine similarity isa popular measure defined as:

cos(q,r)=

∑

t∈T tf(t,q)∗ tf(t,r)√

∑

t∈T tf(t,q)2 ∗√

∑

t∈T tf(t,r)2(6)

A basic search may begin by calculating the cosine sim-ilarity of the query to each resource. Once the similarity iscalculated, an ordered list can be returned to the user. Sincethe query is modeled as a vector containing only one tag, thisequation may be further simplified. However, we have pro-vided the full equation so that it can be applicable in a moregeneral setting. Other characteristics of the resource such asrecency or popularity can be used to augment the query. Inthis work we focus on cosine similarity.

39

Figure 1: Clusters represent coherent topic areas and serve as intermediaries between a user and the resources.

Need for Personalization

A standard search does not take into account the user profileand returns identical results regardless of the user. Whilepersonalization has been shown to increase the utility of Webapplications, the need for personalization in folksonomies iseven more critical. Noise in the folksonomy, such as tagredundancy and tag ambiguity, obfuscate patterns and re-duce the effectiveness of data mining techniques. Redun-dancy occurs when two users apply different tags with iden-tical meaning (e.g. “java” and “Java”). Redundant tagscan hinder algorithms that depend on calculating similar-ity between resources. A user searching with a redundanttag may not find resources annotated with another similartag. Ambiguity occurs when two users apply an identicaltag, but mean something different (e.g. “java” applied to awww.starbucks.com and “java” applied to www.sun.com).Ambiguous tags can result in the overestimation of the simi-larity of resources that are in fact unrelated. A user searchingwith an ambiguous tag may receive results unrelated to hisintended meaning.

Tag clustering provides a means to combat noise in thedata and facilitate personalization. By aggregating tags intoclusters with similar meaning, tag redundancy can be as-suaged since the trend for a cluster can be more easily iden-tified than the effect of a single tag. Ambiguity will alsobe remedied to some degree, since a cluster of tags will as-sume the aggregate meaning and overshadow any ambigu-ous meaning a single tag may have. Furthermore, usingclusters to represent a topic area, the user’s interest in thattopic can be more easily quantified. The connection of a re-source to a tag cluster can also be quantified. If tag clustersare used as a nexus between users and resources, the usersinterest in resources can be calculated. Consequently resultsfrom a basic search can be re-ranked to reflect the user pro-file.

Personalization, therefore, is also critical in our attemptsto combat noise in the data. The user’s intended meaningfor ambiguous tags can be inferred through the analysis ofother tags and resources in the user profile. Therefore, eventhough a user may annotate resources with redundant tags,

personalization techniques may reduce the noise generatedby these tags.

Personalized Search Based on Tag ClusteringIn this section, our algorithm for search personalizationbased on tag clustering is described in detail. We first of-fer a brief overview of the proposed approach. Then we de-scribe several methods for clustering and their parameters.We next describe the personalization algorithm and how thediscovered clusters are used to connect a user to a resource,providing a means to measure the relevance of a resource toa user. Finally, we show how the results of a basic searchstrategy can be personalized by incorporating the relevanceof the resources to the user.

Overview of the Proposed ApproachIn our approach, tag clusters serve as intermediaries betweena user and the resources. Once a set of tag clusters are gener-ated, the user’s interest in each cluster is calculated. A stronginterest indicates the user has frequently used the tags in thecluster. Likewise, a measure is calculated from each clusterto all resources. A strong relationship between a tag clusterand a resource means many of the tags were used to describethe resource.

By using clusters to connect the user to the resources, therelevance of the resource to the user can be inferred. Oncea relevance measure is calculated for all resources, a list ofresources provided by a traditional search can be reorderedand presented to the user. Each user, therefore, receives apersonalized view of the information space.

For example, in Figure 1, a hypothetical user searchingbased on the tag “Java” has a strong connection to the clus-ter of coffee related tags, and weaker connections to theclusters dealing with traveling or computer programming.The strength of the connection is based on the similarity ofthe user profile to the tag clusters. Likewise, the resourcesdealing with coffee has a strong relation to the coffee clus-ter. The user’s interest in that resource can, therefore, beinferred. However, the coffee cluster has a weak relation toresources dealing with travel to Java and Sumatra. The rel-

40

Figure 2: An example of hierarchical tag clustering.

evance of the traveling resources to the user is consequentlyminimal. Because of the crucial role tag clusters play inthe personalization algorithm, we evaluate several cluster-ing techniques and their parameters in order to achieve themaximum results.

Tag ClusteringA critical element of our algorithm is a set of tag clustersthat connects a user with the resources. For many clusteringtechniques, the similarity between tags must first be calcu-lated. The cosine similarity between two tags,t ands, maybe calculated by treating each tag as a vector over the set ofresources and usingtag frequencyor tag frequency * inversedocument frequencyas the weights in the vectors.

cos(t,s)=

∑

r∈R tf(t,r) ∗ tf(s,r)√

∑

r∈R tf(t,r)2 ∗√

∑

r∈R tf(s,r)2(7)

Once the cosine similarities are calculated, it is possible toconstruct clusters. Our personalization method is indepen-dent of the clustering technique. In this paper we evaluatethree clustering techniques: a specific version of hierarchicalagglomerative clustering, maximal complete link clusteringandk-means clustering.

Hierarchical Agglomerative Clustering One of the algo-rithms we consider is a version of hierarchical agglomerativeclustering (Gower and Ross 1969) modified to suit out needsfor tag clustering. As the hierarchical clustering algorithmbegins each tag forms a singleton cluster. Then, during eachstage of the procedure, clusters of tags are joined togetherdepending on the level of similarity between the clusters.This is done for many iterations until all tags have been ag-gregated into one cluster. The result is a hierarchical cluster-ing of the tags such as the one depicted in Figure 2.

Several techniques exist to calculate the similarity be-tween tag clusters and to merge smaller clusters. The min-imum distance between any tag from one cluster to any tagfrom another cluster could be used (often called single link).

Likewise, the maximum distance could be used (often calledcomplete link). It is also possible to calculate the similar-ity between every tag in one cluster and every tag in theother cluster and then take the average of these similarities.For this work, we focus on using the latter centroid-basedapproach which is not as computationally expensive as theother techniques.

To compute the similarity between clusters, a centroid foreach cluster is calculated. Each tag is treated as a vector overthe set resources. Vector weights are calculated using eithertf or tf*idf . The centroid for a cluster is calculated as theaverage vector of the tag’s vectors. The similarity betweentwo clusters is then calculated using the centroids as thoughthey were single tags.

Hierarchical agglomerative clustering has several parame-ters that require tuning in order to achieve optimum results inthe personalization routine. The parameterstepis the decre-ment by which the similarity threshold is lowered. At eachiteration, clusters of tags are aggregated if the similarity be-tween them meets a minimum threshold. This threshold islowered bystepat each iteration until it reaches0. By mod-ifying this parameter the granularity of the hierarchy can becontrolled.

In order to break the hierarchy into distinct clusters, adi-vision coefficientis chosen as a cutoff point. Any clusterbelow this similarity threshold is considered an independentcluster in the personalization routine. Selecting a value nearone will result in many small clusters with high internal sim-ilarity or possibly even singletons. Alternatively, selecting asmall value will result in fewer larger clusters, with lowerinternal similarity.

An important modification to the traditional hierarchicalclustering method is thegeneralization level. Normally, allclusters below thedivision coefficientwould be used, but inthis modification only those clusters descendent of the se-lected tag are used. Thegeneralization levelallows the al-gorithm to return more general tag clusters for the hierarchy.Instead of using only the descendants of the selected tag, alarger branch of the hierarchy is used by first traveling up thetree the specified number of levels. Notice that if thegener-alization levelis set very high it will include all clusters inthe hierarchy and behave as a traditional agglomerative clus-tering algorithm.

For example, in the hypothetical hierarchy of clusters de-picted in Figure 2, if the user selects the tag “Design,” thealgorithm will first identify the level at which this tag wasadded to the hierarchy. In this case, it was added when thesimilarity threshold was lowered to.7. With a generaliza-tion coefficientof 2, the algorithm proceeds up two levelsin the hierarchy. Finally, thedivision coefficientis used tobreak the branch of the hierarchy into distinct clusters.

In order to ascertain the relative value of the modified hi-erarchical clustering technique, two other clustering meth-ods were evaluated: maximal complete link clustering andk-means clustering.

Maximal Complete Link Clustering Maximal completelink clustering identifies every maximal clique in a graph(Augutson and Minker 1970). A maximal clique is a clique

41

that it not contained in a larger clique. Maximal completelink clustering permits clusters to overlap. This may be par-ticularly advantageous when dealing with ambiguous tags.The tag “java” for example could be a member of a coffeecluster as well as programming cluster.

In this work maximal complete link clusters were con-structed using a branch and bound method. Maximal com-plete link clustering is a well known NP-hard problem. For-tunately, the extreme sparsity of the data permits the appli-cation of this method, since the number of potential solutionis dramatically reduced. Nevertheless, this method was themost time intensive of the clustering methods we evaluated.Fortunately, clustering in the proposed personalization algo-rithm clustering is done off-line. But, it will not scale wellto larger datasets. Approximation techniques could be usedto save computational time at the expense of missing someclusters (Johnson 1973).

Maximal complete link clustering has one parameter totune, theminimum similarity threshold. If the similarity be-tween two tags meets this threshold, they are considered tobe connected. Otherwise, they are considered to be discon-nected. Treating each tag as a node, a sparse graph is gen-erated. From this graph the maximal complete clusters arediscovered.

k-means Clustering The last clustering approach used toevaluate the usefulness of clustering to the personalizationalgorithms wask-means clustering (MacQueen 1967). Apredetermined number of clusters,k, are randomly popu-lated with tags. Centroids are calculated for each cluster.Then, each tag is reassigned to a cluster based on a similar-ity measure between itself and the cluster centroid. Severaliterations are completed until tags are no longer reassigned.This clustering method has only one parameter to tune,k.The relatively efficient computational time of the algorithmmakesk-means an attractive alternative.

In contrast to hierarchical or maximal complete link clus-tering,k-means clustering cannot effectively isolate irrele-vant tags. In maximal complete link clustering, tags with avery weak connection (or no connection) can be isolated ina singleton cluster. Such a cluster has little effect on the per-sonalization algorithm. Similarly, in hierarchical agglomer-ative clustering, tags with a strong connection are identifiedfirst. Weakly connected tags are not aggregated until thelast stages of the algorithm. Further, by modifying the di-vision coefficient, these weak connections and the irrelevanttags can be entirely ignored. Thek-means algorithm, how-ever, includes all tags in one of thek clusters and each tagin a cluster is given equal weight. If a cluster contains tagscovering several topic areas the aggregate meaning of thecluster can become muddled. Or, if a cluster contains tagsirrelevant to the consensus meaning, the aggregate meaningin the cluster can become diminished. This drawback mayovershadow the benefit of faster computational time.

The clustering method is independent from the personal-ization algorithm; any clustering approach could be used.Clusters are, however, integral to the algorithm. We nextshow how the personalization algorithm uses clusters tobridge the gap between users and resources.

Personalization Algorithm Based on TagClusteringThere are three inputs to a personalized search: the selectedtag, the user profile and the discovered clusters. The outputof the algorithm is an ordered set of resources.

For each cluster,c, the user’s interest is calculated as theratio of times the user,u, annotated a resource with a tagfrom that cluster over the total number of annotations bythat user. We denote this weight asuc w(u,c)and defined itas:

uc w(u,c)=|{a = 〈u, r, t〉 ∈ A : r ∈ R, t ∈ c}|

|{a = 〈u, r, t〉 ∈ A : r ∈ R, t ∈ T }|(8)

Also, the relation of a resource,r, to a cluster is calculatedas the ratio of times the resource was annotated with a tagfrom the cluster over the total number of times the resourcewas annotated. We call this weightrc w(r,c) and defined itas:

rc w(r,c) =|{a = 〈u, r, t〉 ∈ A : u ∈ U, t ∈ c}|

|{a = 〈u, r, t〉 ∈ A : u ∈ U, t ∈ T }|(9)

Bothuc w(u,c)and rc w(r,c) will always be a number be-tween zero and one. A higher value will represent a strongrelation to the cluster.

The relevance of the resource to the user,relevance(u,r),is calculated from the sum of the product of these weightsover the set of all clusters,C. This measure is defined as:

relevance(u,r)=∑

c∈C

uc w(u,c)∗ rc w(r,c) (10)

Intuitively, each cluster can be viewed as the representa-tion of a topic area. If a user’s interests parallels closelythe subject matter of a resource, the value forrelevance(u,r)will be correspondingly high as can be seen in Figure 1.

To improve performance at the expense of memory therelevance of each resource to every user may be pre-calculated. In fact, at this point the relevance matrix may beuseful in its own right, perhaps for personalizing recommen-dations or comparing the interests of two users. In order topersonalize search and navigation, the selected tag is takeninto account.

A basic search is performed on the query,q, usingthe vector space model andtag frequency. A similarity,rankscore(q,r), is calculated for every resource in the datasetusing the cosine similarity measure. A personalized similar-ity is calculated for each resource by multiplying the cosinesimilarity by the relevance of the resource to the user. Wedenote this similarity asp rankscore(u,q,r)and define it as:

p rankscore(u,q,r)= rankscore(q,r)∗ relevance(u,r) (11)

Once thep rankscore(u,q,r), has been calculated for eachresource, the resources are returned to the user in descend-ing order of the score. While the weights from clusters toresources will be constant regardless of the user, the weightsconnecting the users to the clusters will differ based on theuser profile. Consequently, the resultingp rankscore(u,q,r)will depend on the user and the results will be personalized.

42

Experimental EvaluationWe validate our approach through extensive evaluation ofthe proposed algorithm using data from a real collaborativetagging Web site. A Web crawler was used to extract datafrom del.icio.us from 5/26/2007 to 06/15/2007. In this col-laborative tagging application, the resources are Web pages.The dataset contains 29,918 users, 6,403,442 resources and1,035,177 tags. There are 47,184,492 annotations with oneuser, resource and tag. A subset of the users in the datasetwas used as test cases. The annotations of the remainingusers, were used to generate the tag clusters.

Each test case consisted of a user, a tag and a resource.After performing a basic search using only the tag, the rankof the resource in the returned results was recorded. Next, apersonalized search using the same tag was performed, tak-ing into account the user profile and the discovered clusters.The rank of the resource in the personalized search resultswas also recorded.

Since the resource was annotated by the user, we assumeit is indeed of interest to the user. By comparing the rank ofthe resource in the basic search to the rank in the personal-ized search, we measured the effectiveness of the personal-ization algorithm. This section describes, in more detail, theprocess for judging the proposed personalization algorithm,and supplies an evaluation of the results.

Examples of Tag ClustersThe clustering algorithm is independent of the personaliza-tion algorithm. Still, the quality of the clusters is crucialto the success of the personalization algorithm. We assumethat tags can be clustered into coherent clusters representingdistinct topic areas. Support for that assumption is given inTable 1 where representative tags (those closest to the clus-ter centroids) were selected from six of our discovered tagclusters.

Cluster 1 represents the notion of literature and citations,while cluster 4 represents the notion of testing an Inter-net connection. Other clusters show clearly recognizablecategories. Clusters can capture misspellings, alternativespellings or multilingualism such as in cluster 2: “paleta”versus “palette.” They can also capture other redundant tagsthat are not variations of a particular word such as in cluster5: “concerts” and “band.”

Cluster 4 shows how users may have annotated resourceswith possibly either “speed testing” or “speedtest.” Of-ten collaborative tagging applications will treat the formeras two individual tags. Still, the clustering algorithm willcapture the similarity. Users interested in “speedtest” arelikely to be also interested in resources annotated with both“speed” and “test.” Most ontologies prohibit such ambiguityfrom entering the system by enforcing a set of rules. Folk-sonomies, however, because of their open nature are proneto this type of noise. Clustering appears to be a viable meansto alleviate some of this redundancy.

Cluster 3, on the other hand, shows how clustering cangenerate odd relationships. The tags “peace” and “war” havebeen clustered together since the two words are often usedtogether in the same context. Yet, they have entirely dif-

Cluster 1 Cluster 2 Cluster 3linguistics graphical peaceliteracy linea warpapers paleta 1984anthropology palette terrorismbibliographies rgb fascismthesis hexadecimal orwellCluster 4 Cluster 5 Cluster 6speed live financialtesting concerts retirementtest band savingsbroadband group financesbandwidth event wealthspeedtest party plan

Table 1: Examples of tag clusters

ferent meanings. Further, it is interesting to note the inclu-sion of “orwell” and “1984” with “fascism.” The formerhas a strong literature context, while the latter has a social-political context. Clustering has nevertheless identified thesimilarities across these two contexts. Serendipity can bepromoted by identifying such relations the user might other-wise be unaware of.

Another assumption of the proposed approach is the abil-ity to correctly identify resources with tag clusters. Sup-port for that assumption is given in Table 2. Six stronglyrelated Web pages were selected for each clusters from Ta-ble 1 (based on their relative weights in the centroid vectorsfor each cluster).

Web pages for Cluster 2 all share the notion of color anddesign and relate well to the tag cluster. For example, usershaving annotated “colorblender.com” are also likely to beinterested in “kuler.adobe.com.” Both retrieval and discov-ery are well served by these clusters. It is obvious that auser selecting “terrorism” from the user interface might wishto view “globalincidentmap.com.” Less obvious is the rele-vance of “pickthebrain.com/blog.” However, the strong re-lation of this resource to cluster 3 can support serendipitousdiscovery and improve the user’s navigational experience.

Experimental MethodologyWith our assumptions for the utility of tag clustering veri-fied we turn our attention to the personalization algorithm.Figure 3 shows the experimental procedure. Two randomsamples of 5,000 users were taken from the dataset. Five-fold cross validation was performed on each sample. Foreach fold, 20% of the users were partitioned from the rest astest users. Clustering was completed using the data from theremaining 80% of the users. Clusters were generated usinghierarchical, maximal complete link andk-means clustering.Optimum values for the relevant parameters of each methodwere derived empirically.

From each user in the test set, 10% of the user’s annota-tions were randomly selected as test cases. Each case con-sisted of a user, tag and resource.

The basic search requires only a tag as an input. Re-

43

Figure 3: The steps used to test the personalization algorithm.

Cluster 1 Cluster 2citeulike.org colorblender.comaccent.gmu.edu kuler.adobe.comswoogle.umbc.edu picnik.comciteseer.ist.psu.edu visuwords.comwww.eee.bham.ac.uk pingmag.jpoedb.org touchgraph.comCluster 3 Cluster 4studentsfororwell.org speakeasy.net/speedtestkirjasto.sci.fi/gorwell.htm speedtest.netadbusters.org/media/flash speedtest.net/index.phppickthebrain.com/blog bandwidthplace.comglobalincidentmap.com gigaom.comsecularhumanism.org bt.comCluster 5 Cluster 6eventful.com thesimpledollar.compageflakes.com money.aol.com/savingsgethuman.com/us aaronsw.comalamiracom.multiply.com annualcreditreport.comwamimusic.com/events globalrichlist.comtorrentreactor.net finance.yahoo.com

Table 2: Examples of resources strongly related to the tagclusters

sources were modeled as vectors over the set of tags. Simi-larly, the test tag was treated as a vector containing only onetag. The cosine similarity was calculated for all resources tothe test tag, and the resources were then ordered. The rankof the resource in the basic search,rb, was recorded.

The personalized search requires the test tag, the test userprofile, and a set of discovered clusters. The relevance ofthe resources to the test user was calculated and was usedto re-rank the resources. The new rank of the test resourcein the personalized search,rp, was recorded. Since the userhas annotated this resource, we assume that it is relevant tothe user in the context of the test tag. A personalized searchshould improve the ranking of the resource.

In order to judge the improvement provided by the person-

alized search, the difference in the inverse of the two rankscan be used,imp (Voorhees 1999). It is defined as:

imp =1

rp

−1

rb

(12)

If the personalized approach re-ranks the resource nearerthe top of the list, the improvement will be positive. Simi-larly, if the personalized approach re-ranks the resource fur-ther down the list, the improvement will be negative. In oneextreme, if the basic search ranks the resource very low andthe personalization algorithm improves its rank to the firstposition, thenimp will approach one. The value forimpcan never be greater than one. In practice, however, it is dif-ficult for a personalization algorithm to achieve this level ofsuccess, since the potential improvement is bounded by therank of the basic search. If a basic search ranks the resourcein the forth position, the best improvement a personalizedsearch can achieve is.75.

For each clustering technique and parameter for the tech-nique, the improvement across all folds and samples wasaveraged and reported in the results below.

Experimental ResultsIn general, the proposed personalization technique resultsin improved performance, ranking resources known to berelevant to the user nearer the top of the search results. Whileall clustering techniques showed improvements, they did sowith varying degrees. In addition, the input parameters wereshown to have a marked effect on the final results.

The choice oftf or tf ∗ idf also played an important role.In all casestf ∗ idf is superior. This is likely the result ofthe normalization that occurs. For example, the effect of atag with little descriptive power such as “cool” or “toBuy”is diminished when compared to other tags with strong de-scriptive power such as “concerts” or “retirement.” The twoweighting techniques appear to have nearly identical trendsin the tuning of the parameters.

Hierarchical agglomerative clustering produced superiorperformance when compared to the other clustering meth-ods, perhaps due to its inherent flexibility. The input param-eters,step, division coefficientandgeneralization level, offer

44

Figure 4: The effect ofstep in hierarchical clustering on thepersonalization algorithm.

a level of tuning that the other two clustering methods couldnot provide.

The parameter,step, controls the granularity of the de-rived agglomerative clusters. The similarity threshold, ini-tially set to one, is reduced by the valuestepat each iterationuntil it reaches zero. Clusters of tags are aggregated togetherif their similarity measure meets the current threshold. Anideal value forstepwould aggregate tags slowly enough tocapture the conceptual hierarchy between individual clus-ters. For example, “Java” and “J2EE” should be aggregatedtogether before they are in turn aggregated with “program-ming.”

In Figure 4, increasing the value ofstepresults in dimin-ished performance, as tags are aggregated too quickly. Yet,if the value for step is too low, the granularity of the derivedclusters can become too fine grained and overspecializationcan occur. In these experiments, best results were achievedwith a value of0.004 as is shown in Figure 4. The samevalue forstepwas used when testing the other parameters.

Thedivision coefficientplays a crucial role in the agglom-erative clustering routine. It defines the level where the hi-erarchy is dissected into individual clusters. In the exampleprovided in figure 2, the division coefficient is0.7. Clustersbelow this level in the hierarchy are considered independent.

If the division coefficientis set too low, the result is a fewlarge clusters. These clusters may include many relativelyunrelated tags and span several topic areas. Likewise, if thedivision coefficientis set too high, the result will be manysmall clusters. While the tags in these clusters may be verysimilar, they might not aggregate together all the tags neces-sary to describe a coherent topic in the way a larger clustercould.

Since, the personalization algorithm relies on these clus-ters to serve as the intermediary between users and resourcesand presupposes the clusters represent distinct well definedtopics, the selection of thedivision coefficientis integral tothe success of the personalization algorithm. Intuitively, thegoal of tuning thedivision coefficientis to discover the opti-

Figure 5: The effect ofdivision levelin hierarchical cluster-ing on the personalization algorithm.

Figure 6: The effect ofgeneralization levelin hierarchicalclustering on the personalization algorithm.

mum level of specificity. As shown in Figure 5, the optimumvalue for this dataset is approximately 0.1. This is also thevalue used when testing other parameters.

The generalization levelis key to our modification ofthe hierarchical agglomerative clustering algorithm. Nor-mally, every cluster below thedivision coefficientwould bereturned by the clustering algorithm. In our modification,however, we select clusters directly related to the user’s ac-tion. First, the position in which the test tag was aggregatedinto the hierarchy is noted. Then the algorithm returns onlythose cluster descendent of the test tag. However, we mayinclude a broader swath of clusters by first traveling up thehierarchy by thegeneralization leveland cutting off a largerbranch as is shown in Figure 2.

The importance of thegeneralization levelis demon-strated in Figure 6. If thegeneralization levelis set too low,it is possible to overlook a cluster representing relevant re-sources. However, if the algorithm includes too many clus-ters not related to the user’s selected tag, irrelevant factors

45

Figure 7: The effect ofsimilarity threshold in maximalComplete Link clustering on the personalization algorithm.

can be introduced and the personalization routine may notbe able to improve performance. For these experiments theoptimum value for thegeneralization levelis 8 as shown inFigure 6. This is also the default value when testing otherparameters.

Thegeneralization levelalso has important ramificationsrelated to the user’s motivation. If the user is searching forsomething specific, a lowgeneralization levelmay be cho-sen, focusing on clusters more directly related to the selectedtag. However, if the user is browsing through the folkson-omy, a highergeneralization levelmay be appropriate. Itmay increase serendipity and introduce topics unknown butnevertheless interesting to the user. For example, users se-lecting the tag “chess” are likely to be interested in chessrelated resource. However, by increasing thegeneralizationlevel, serendipitous discovery of other topics such as brainteasers or backgammon may be included.

If the generalization levelis set very high, the algorithmwill behave like a standard hierarchical clustering algorithm,returning all clusters below thedivision coefficient. In thiscase the performance of the personalization drops precipi-tously, underscoring the importance of the modifications tothe algorithm.

The proposed personalization algorithm is independentfrom the method used to generate the tag clusters. In or-der to judge the relative benefit of the modified hierarchicalclustering approach, clusters generated with maximal Com-plete Link andk-means clustering was also tested.

The similarity thresholdfor complete link has a strongimpact on the effectiveness of the personalization routine asshown in Figure 7. If the similarity between two tags meetsthesimilarity thresholdthey are considered connected; oth-erwise they are considered unconnected. If the threshold isset too low, links are generated between tags based upon avery weak relationship resulting in large clusters. On theother hand, setting the threshold too high, can remove linksbetween tags that are in fact quite similar, resulting in a lossof valuable information, perhaps even resulting in numerous

Figure 8: The effect ofk in k-means Clustering on the per-sonalization algorithm.

Figure 9: Comparison of the three techniques.

singleton clusters that have little added utility to the person-alization method. Through empirical evaluation we foundthe ideal value for the similarity threshold to be 0.2, result-ing in an improvement of .112.

We also investigatedk-means clustering. It is an effi-cient clustering algorithm; the computational time it requiresis much less than either hierarchical clustering or completelink clustering. It has only one parameter to tune,k, the pre-determined number of clusters to be generated. If too manyclusters are generated, a topic may be separated into manyclusters. Alternatively, too few clusters can result in clusterscovering multiple topics. In these experiments the optimumvalue fork was found to be approximately 350 as shown inFigure 8.

In sum, the evidence demonstrates that tag clusters canserve as effective intermediaries between users and re-sources thereby facilitating personalization. The person-alization technique using maximal complete link clustersdemonstrated promising results. It is particularly usefulwhen dealing with ambiguous tags. By focusing on tags

46

specifically related to the user’s selected tag, hierarchicalclustering improved personalization even further. The worstof the three methods wask-means clustering, with a max-imum improvement of only about .042 when compared to.112 and .137 as seen in Figure 9. Moreover, the standarddeviation of the K-means clustering was .37. For maximalcomplete link the standard deviation was .30. It was .26for hierarchical agglomerative clustering. Not only did hi-erarchical clustering prove to offer the most benefit to thepersonalization algorithm it was also the most reliable.

The poor results ofk-means clustering can be attributed toits inability to identify innocuous tags. Both maximal Com-plete Link clustering and hierarchical agglomerative cluster-ing use a similarity threshold to determine when to combinea tag in a cluster. As a result, many tags are clustered indi-vidually since they have little or no similarity to other tags.These singleton clusters of innocuous tags have little effecton the personalization algorithm. However,k-means clus-tering requires an input for the parameterk, and will put ev-ery tag into one of the clusters. These innocuous tags muddythe clusters and hinder the identification of the topic repre-sented by the cluster. Without the ability to prune innocuoustags, clusters generated byk-means are ill suited for the pro-posed personalization algorithm.

Another drawback fromk-means clustering is that am-biguous tags can pull unrelated tags together. For example,the tag “eve” can pull religious tags into the cluster as wellas tags concerning online role-playing games such as EVEOnline. Consequently, such a cluster can do little to alleviatethe problem of tag ambiguity.

The strength of maximal Complete Link clustering liesin its ability to generate overlapping clusters, a trait wellsuited to the personalization algorithm. Ambiguous tags canbe members of multiple clusters representing different in-terests. The tag “apple” for example can be a member of acluster concerning the company, the fruit or New York City.A user that annotated a resource with an ambiguous tag willhave some measure of similarity to all clusters containingthe tag. The user profile can be used to disambiguate theintended meaning of the user, by measuring the relative in-terest of the user to each cluster.

The modified hierarchical clustering offers a level of cus-tomization not offered by the other two techniques. The pa-rameterstepeffects the granularity of the hierarchy, whilethe division coefficientoffers a means to select the clusterspecificity. Moreover, the modifications to the algorithmcoupled with thegeneralization levelallow the algorithm toincorporate clusters strongly related to the user’s selectedtag or take a broader view of the hierarchy thereby promot-ing serendipitous discovery.

Conclusions and Future WorkIn this work, we have proposed a method for personalizingsearch and navigation in folksonomies based on three clus-tering techniques. Tag clusters are used to bridge the gapbetween users and resources, offering a means to infer theuser’s interest in the resource. Standard search results basedon cosine similarity are reordered using the relevance of the

resource to the user. The clustering techniques are indepen-dent of the personalization algorithm, but the quality of thetag clusters greatly affects the performance of the personal-ization technique.

Several clustering approaches were investigated alongwith their corresponding parameters. By using clusters di-rectly related to the selected tag, the modified hierarchicalagglomerative clustering proved superior when compared tomaximal complete link andk-means clustering. It offersnot only better improvement in our experimental results, butprovides more flexibility. The success of maximal completeLink clustering can be attributed in part to its ability gen-erate overlapping clusters. Hence, ambiguous tags can bemember of several clusters, and a user’s intended meaningcan be derived through the user profile. However, it is a wellknown NP-complete problem, and though the folksonomyis extremely sparse, this method may not scale well to largersamples. Thek-means clustering algorithm, on the otherhand, while very efficient, was not as successful in improv-ing the personalized search results.

In the future, we plan to conduct work in several direc-tions. We will continue to investigate different clusteringapproaches, including fuzzy versions ofk-means algorithm.We plan to investigate potential improvements to the person-alization algorithm itself. Alternative measures can be usedto judge the relation of the users to the tag clusters, as wellas the relation of the resources to the tag clusters. Also, thealgorithm can be modified to support multi-tag queries.

Tag clusters can be used for other purposes, such as rec-ommending tags or even users. Clusters of resource or canbe used improving navigation in collaborative tagging sys-tems, or possibly clusters of users. Other data mining andmachine learning techniques can be used to overcome im-prove the user experience in folksonomies.

AcknowledgmentsThis work was supported in part by the National ScienceFoundation Cyber Trust program under Grant IIS-0430303and a grant from the Department of Education, Graduate As-sistance in the Area of National Need, P200A070536.

ReferencesAugutson, J., and Minker, J. 1970. An Analysis of SomeGraph Theoretical Cluster Techniques.Journal of the As-sociation for Computing Machinery17(4):571–588.

Begelman, G.; Keller, P.; and Smadja, F. 2006. AutomatedTag Clustering: Improving search and exploration in thetag space.Proceedings of the Collaborative Web TaggingWorkshop at WWW6.

Chen, H., and Dumais, S. 2000. Bringing order to theWeb: automatically categorizing search results.Proceed-ings of the SIGCHI conference on Human factors in com-puting systems145–152.

Choy, S., and Lui, A. 2006. Web Information Retrieval inCollaborative Tagging Systems.Proceedings of the 2006IEEE/WIC/ACM International Conference on Web Intelli-gence352–355.

47

Gower, J., and Ross, G. 1969. Minimum Spanning Treesand Single Linkage Cluster Analysis.Applied Statistics18(1):54–64.Hammond, T.; Hannay, T.; Lund, B.; and Scott, J.2005. Social Bookmarking Tools (I).D-Lib Magazine11(4):1082–9873.Hayes, C., and Avesani, P. 2007. Using tags and clusteringto identify topic-relevant blogs.International Conferenceon Weblogs and Social Media.Heymann, P., and Garcia-Molina, H. April 2006. Collab-orative Creation of Communal Hierarchical Taxonomies inSocial Tagging Systems. Technical report, Technical Re-port 2006-10, Computer Science Department.Johnson, D. 1973. Approximation algorithms for com-binatorial problems.Proceedings of the fifth annual ACMsymposium on Theory of computing38–49.MacQueen, J. 1967. Some methods for classification andanalysis of multivariate observations.Proceedings of theFifth Berkeley Symposium on Mathematical Statistics andProbability1(281-297):14.Mathes, A. 2004. Folksonomies-Cooperative Classifica-tion and Communication Through Shared Metadata.Com-puter Mediated Communication, LIS590CMC (DoctoralSeminar), Graduate School of Library and Information Sci-ence, University of Illinois Urbana-Champaign, Decem-ber.Mika, P. 2007. Ontologies are us: A unified model of socialnetworks and semantics.Web Semantics: Science, Servicesand Agents on the World Wide Web5(1):5–15.Millen, D.; Feinberg, J.; and Kerr, B. 2006. Dogear: So-cial bookmarking in the enterprise.Proceedings of the Spe-cial Interest Group on Computer-Human Interaction con-ference on Human Factors in computing systems111–120.Niwa, S.; Doi, T.; and Honiden, S. 2006. Web PageRecommender System based on Folksonomy Mining forITNG06 Submissions.Proceedings of the Third Interna-tional Conference on Information Technology: New Gen-erations (ITNG’06)-Volume 00388–393.Salton, G., and Buckley, C. 1988. Term-weighting ap-proaches in automatic text retrieval.Information Process-ing and Management: an International Journal24(5):513–523.Salton, G.; Wong, A.; and Yang, C. 1975. A vectorspace model for automatic indexing.Communications ofthe ACM18(11):613–620.Teevan, J.; Dumais, S.; and Horvitz, E. 2007. Character-izing the value of personalizing search.Proceedings of the30th annual international ACM SIGIR conference on Re-search and development in information retrieval757–758.Van Rijsbergen, C. 1979. Information Retrieval.Butterworth-Heinemann Newton, MA, USA.Voorhees, E. 1999. The TREC-8 Question AnsweringTrack Report.Proceedings of TREC8:77–82.Wu, X.; Zhang, L.; and Yu, Y. 2006. Exploring socialannotations for the semantic web.Proceedings of the 15thinternational conference on World Wide Web417–426.

Xu, Z.; Fu, Y.; Mao, J.; and Su, D. 2006. Towards thesemantic web: Collaborative tag suggestions.Collabo-rative Web Tagging Workshop at WWW2006, Edinburgh,Scotland, May.Yan, R.; Natsev, A.; and Campbell, M. 2007. An efficientmanual image annotation approach based on tagging andbrowsing. Workshop on multimedia information retrievalon The many faces of multimedia semantics13–20.

48

Personalization in Folksonomies Based on Tag …...Personalization in Folksonomies Based on Tag...

Documents

Transcript of Personalization in Folksonomies Based on Tag …...Personalization in Folksonomies Based on Tag...