
Page 1: This file has been cleaned of potential threats. If you confirm that …cse.iitkgp.ac.in/~psvijay/docs/MTP_1.pdf · 2014-12-02 · This helps in personalizing the content for the

Deep(Hierarchical) Classification ofTweets

A report submitted in partial fulfilment of therequirements for the degree of

Master of Technology

in

Computer Science and Engineering

by

Srihari Pratapa (10CS30032)Sachin Kumar (11CS30043)

advised by

Dr. Pabitra Mitra

Department of Computer Science and EngineeringIndian Institute of Technology, Kharagpur

November 2014


Contents

1 Introduction

2 Related Works

3 Methodology

4 Data Setup
  4.1 Data Collection
  4.2 Data Pre-processing

5 Clustering
  5.1 Results of Clustering

6 Expanding the Tweet

7 Preprocessing Wikipedia Category Graph
  7.1 Level Identification
  7.2 Non-hierarchical links removal

8 Mapping the tweets to categories in the hierarchy
  8.1 Selecting the categories
  8.2 Assigning score to the matched nodes

9 Generating the category subgraph
  9.1 Spreading activation
  9.2 Use of spreading activation to generate category subgraph

10 Work In Progress
  10.1 Evaluating the category subgraphs


Abstract

In microblogging services such as Twitter there is a lot of raw data, and it can be overwhelming to analyze and make sense of it as such. Twitter, owing to its massive growth as a social networking platform, has been a focus for the analysis of its user-generated content for personalization and recommendation tasks. One possible solution to the problem is classifying the raw short texts (tweets) into categories. The problem of tweet categorization has been studied extensively and many solutions have been proposed. Although classification into single topics such as Technology or Entertainment helps, it confines tweets to very broad categories that bear no relation to one another. We propose that instead of putting tweets into one category, classifying them into multiple categories related through a hierarchy, a deep categorization of short texts, is a better approach to making sense of them. This gives both broader and more particular categories for a tweet, which helps in better understanding of the microblogs and in personalization for users at a finer level. We propose an approach for deep categorization of tweets; the categories and the relations between them are taken from Wikipedia's Category Graph.


1 Introduction

With the increasing popularity of microblogging sites, we are in an era of information explosion. As of June 2014, about 210 million tweets are generated every day. Although Twitter provides a real-time list of the most popular topics people tweet about, known as Trending Topics, it is often hard to understand what these trending topics are about. Therefore, it is important and necessary to classify these topics into general categories with high accuracy for a better understanding of the raw data [1].

Twitter is an extremely popular microblogging site, where users search for timely and social information such as breaking news, posts about celebrities, and trending topics. Users post short text messages called tweets, which are limited to 140 characters in length and can be viewed by a user's followers. As of June 2011, about 200 million tweets were generated every day. When a new topic becomes popular on Twitter, it is listed as a trending topic, which may take the form of a short phrase or hashtag (e.g., #election). What the Trend provides a regularly updated list of trending topics from Twitter. It is very interesting to know what topics are trending and what people in other parts of the world are interested in. However, a very high percentage of trending topics are hashtags, names of individuals, or words in other languages, and it is often difficult to understand what the trending topics are about. It is therefore important to classify these topics into general categories for an easier and better understanding of topics [2].

Many solutions have been proposed for the problem of tweet topic classification using different approaches. But all of them either put tweets into a single broad category, which is not very helpful when it comes to personalization, or classify them into multiple unrelated categories. These approaches give meaning to the raw data only at a superficial level. To truly understand the raw data of tweets, we have to perform a deeper categorization into multiple categories that convey both the broader sense and the particulars of a tweet, and the categories have to be related through a hierarchy, which helps in making more sense of them. In this work we propose a computationally effective method for deep (hierarchical) categorization of tweets. We use Wikipedia categories as our classification categories and Wikipedia's category graph for the hierarchical relations between categories.

Advantages of deep categorization are numerous. Deep categorization helps in understanding and profiling users much better, as more information about the interests and activities of the user can be extracted. This helps in personalizing the content for the user according to his interests. Another advantage is that sometimes a tweet may not contain any words that would trigger a match in the normal categorization of tweets. For example, in the tweet The new MotoX from motorola is very good. #MOTOX there is a chance that it will not be identified with any broad category, as it has no generic word like phone; but properly considered it belongs to TECHNOLOGY → SMARTPHONES → MOTOROLA → MOTOX. This problem does not arise in the method we propose, as we identify MOTOROLA and MOTOX among our Wikipedia categories and spread activation around the graph to identify SMARTPHONES and TECHNOLOGY. One more important advantage is that the click-through rate of advertisements can be improved by serving better ads about a user's specific interests rather than his broad interests. As in the previous example, if we identify that the user is talking about smartphones, then we can place advertisements for smartphones or smartphone accessories, which he might buy as he was talking about a phone he has newly bought, thus improving the click-through rate.

Wikipedia is an online encyclopedia with information about almost anything, including the latest things anyone might talk about. So Wikipedia is the ideal choice for a predefined set of categories; moreover, Wikipedia categories have a sub-category → category relation between them, making the choice all the more advantageous.

2 Related Works

Topic categorization of tweets has been well studied, and different works classify tweets into a variety of topic schemes. In [1] tweets are categorized into classes such as News (N), Events (E), Opinions (O), Deals (D), and Private Messages (PM) based on author information and features within the tweets. Other works try to classify tweets to find trending topics; the work explained in [2] attempts the same. In [3] topic categorization is explored for event detection. Latent Dirichlet Allocation is also used in many works for the classification of tweets, and we plan to use it as well.

Hierarchical categorization of tweets has also been studied, but not as extensively. In some works the hierarchy is self-generated, i.e. tweets are classified into a hierarchy generated from the tweets themselves, and the hierarchy is not very deep. In [?] hierarchical categorization is explored, but the categories are self-generated and very shallow.

In this work we consider Wikipedia categories and the Wikipedia hierarchy, with 800,000 nodes and a depth of nearly 20. Wikipedia categories are universally accepted, and the hierarchy is very deep. Handling this scale and depth is one of the challenges we address in this work.


3 Methodology

We identified specific problems with topic categorization of tweets and tried to develop methods to overcome them. There is a pre-processing step, somewhat like the training step of a machine learning algorithm, in which a variety of tweets collected from various categories and topics are clustered. There is also a step in which the Wikipedia categories and the relationships between them, a hierarchy, are constructed. Then there are three steps in matching a given tweet to the Wikipedia hierarchy:

(1) Expand the tweet with the help of the clustering done in the pre-processing step

(2) Tag the tweet to the relevant primary nodes in the category hierarchy

(3) Expand from the primary nodes in the hierarchy using spreading activation

The core problem with short texts is that they contain very little information and are difficult to comprehend, especially tweets, with only 140 characters and plenty of slang and filler words. We try to solve this problem using clusters of tweets, adapting two accurate clustering methods. How do we expand a tweet based on a cluster? We argue that while a single tweet alone does not carry much information, a cluster of tweets as a whole represents some meaningful information. That is the key idea. Once clustering is done, we summarize a set of data from all the tweets present in each cluster. Hence, if a large number of tweets from different categories is collected in this clustering step, it helps in getting better results. We also assume that all the tweets belonging to one category, in some sense or the other, will end up in one cluster, thus representing some information. Once clustering is done, when a new tweet arrives it is put into one of the clusters, and both the tweet's own words and the cluster's representative words are used together, thus giving the tweet information even if it carries little on its own.

4 Data Setup

4.1 Data Collection

A toy dataset of a hundred thousand tweets about different sports was collected for experimental purposes using Twitter4J and Twitter's publicly available streaming API. Twitter4J is a Twitter API binding library for the Java language, licensed under the Apache License 2.0 [4].


The collected data is in JSON format: each line of the output stream is a tweet encoded as a JSON object. We further processed each JSON object to extract, for each tweet, only the date, tweet id, text, user mentions, hashtags, urls and media urls, into a text file for faster processing (120 MByte). For re-tweets, we replace the text of the re-tweet with the original text of the tweet that was re-tweeted (although we only do this for the tweets in JSON format, since the original tweet text is included in the JSON object). We use this text file, with one tweet per line, for all our experiments.

4.2 Data Pre-processing

An important part of our method is data pre-processing and filtering. For each tweet, we pre-process the text as follows. We normalize the text to remove urls, user mentions and hashtags, as well as digits and other punctuation. Next, we tokenize the remaining clean text by white space and remove stop words. To prepare the tweet corpus, in each time window, for each tweet, we append the user mentions, the hashtags and the resulting clean text tokens.
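As an illustration, this normalisation step can be sketched in Python as follows. The regular expressions and the stop-word list here are illustrative placeholders, not the exact ones used in our pipeline:

```python
import re

# illustrative stop-word subset, not the full list used in our experiments
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "in", "and", "from"}

def preprocess_tweet(text):
    """Normalise a tweet: strip urls, mentions, hashtags, digits and punctuation,
    tokenize on whitespace, drop stop words, and append mentions and hashtags."""
    mentions = re.findall(r"@\w+", text)
    hashtags = re.findall(r"#\w+", text)
    clean = re.sub(r"https?://\S+", " ", text)   # remove urls
    clean = re.sub(r"[@#]\w+", " ", clean)       # remove mentions and hashtags
    clean = re.sub(r"\d+", " ", clean)           # remove digits
    clean = re.sub(r"[^\w\s]", " ", clean)       # remove remaining punctuation
    tokens = [t for t in clean.lower().split() if t not in STOP_WORDS]
    # the corpus entry is mentions + hashtags + clean tokens, as described above
    return mentions + hashtags + tokens

print(preprocess_tweet("The new MotoX from motorola is very good! #MOTOX http://t.co/x"))
```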

The next step is concerned with vocabulary filtering. For each time window, from the window tweet corpus, we create a (binary) tweet-term matrix, where we remove user mentions (but keep hashtags), and the vocabulary terms are only bi-grams and tri-grams that occur in at least a minimum number of tweets, set to 10. The idea behind this filtering step is that clusters should gather enough tweets to be considered a topic at all. In the next filtering step, we reduce this matrix to only the subset of rows containing at least 5 terms (tweets with at least 5 tokens from the vocabulary). This step is meant to remove out-of-vocabulary tweets, as well as tweets that are too short to be meaningfully clustered [3].
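The two filtering steps above can be sketched as follows (mention removal is omitted here; the function and parameter names are our own, chosen for illustration):

```python
from itertools import chain

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, joined by spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def build_filtered_matrix(tweets, min_df=10, min_terms=5):
    """tweets: list of token lists. Returns (vocab, rows), where vocab holds the
    bi-/tri-grams occurring in at least min_df tweets, and rows keeps only the
    tweets that contain at least min_terms vocabulary terms."""
    grams_per_tweet = [set(chain(ngrams(t, 2), ngrams(t, 3))) for t in tweets]
    df = {}
    for grams in grams_per_tweet:          # document frequency of each n-gram
        for g in grams:
            df[g] = df.get(g, 0) + 1
    vocab = {g for g, c in df.items() if c >= min_df}
    rows = []
    for i, grams in enumerate(grams_per_tweet):
        terms = grams & vocab
        if len(terms) >= min_terms:        # drop short / out-of-vocabulary tweets
            rows.append((i, terms))
    return vocab, rows
```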

5 Clustering

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals or particular statistical distributions [?]. We used Hierarchical Agglomerative Clustering (HAC), one of the clustering models, for clustering the tweets with the feature set described in the previous section. Hierarchical clustering algorithms are either top-down or bottom-up. Bottom-up algorithms treat each document as a singleton cluster at the outset and then successively merge (or agglomerate) pairs of clusters until all clusters have been merged into a single cluster that contains all documents. Bottom-up hierarchical clustering is therefore called hierarchical agglomerative clustering, or HAC.

In bottom-up HAC, at the initial step all the data points to be clustered are considered as individual clusters; the similarity between all pairs of clusters in the current step is calculated, and the two clusters most similar to each other are joined into one. So at each step there is one cluster fewer, and this goes on until all the data points merge into one cluster. If we visualize this construction process, it looks like a tree-like structure called a dendrogram, which can be thresholded at different levels to get different sets of clusters. Pseudocode [9] for a naive way of solving the problem is given in Algorithm 1. This naive procedure takes O(N^3) time; there are many more efficient ways to solve it, such as the fastcluster method or using heaps and k-d trees. We used the fastcluster method; the pseudocode for the naive procedure is given to get a grip on the algorithm.

Algorithm 1 Bottom-Up HAC Pseudo Code

1:  procedure SimpleHAC(d_1, ..., d_N)
2:    for n ← 1 to N
3:      for i ← 1 to N
4:        C[n][i] ← SIM(d_n, d_i)
5:      I[n] ← 1  (keeps track of active clusters)
6:    A ← []  (assembles clustering as a sequence of merges)
7:    for k ← 1 to N-1
8:      ⟨i, m⟩ ← argmax_{⟨i,m⟩ : i ≠ m ∧ I[i]=1 ∧ I[m]=1} C[i][m]
9:      A.Append(⟨i, m⟩)  (store merge)
10:     for j ← 1 to N
11:       C[i][j] ← SIM(i, m, j)
12:       C[j][i] ← SIM(i, m, j)
13:     I[m] ← 0  (deactivate cluster)
14:   return A
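For concreteness, the pseudocode above can be turned into a small runnable Python sketch. The merged-cluster similarity SIM(i, m, j) is not specified in the pseudocode; here we assume single-link (maximum) similarity, but other linkage criteria plug in the same way:

```python
def naive_hac(sim):
    """Naive O(N^3) bottom-up HAC. sim: NxN symmetric similarity matrix
    (list of lists). Returns the merge sequence [(i, m), ...], where cluster m
    is absorbed into cluster i at each step."""
    n = len(sim)
    C = [row[:] for row in sim]   # working similarity matrix
    I = [True] * n                # active-cluster flags
    A = []                        # merge sequence
    for _ in range(n - 1):
        # pick the most similar pair of distinct active clusters
        best, pair = float("-inf"), None
        for i in range(n):
            if not I[i]:
                continue
            for m in range(n):
                if i != m and I[m] and C[i][m] > best:
                    best, pair = C[i][m], (i, m)
        i, m = pair
        A.append((i, m))
        for j in range(n):        # single-link similarity of the merged cluster
            C[i][j] = C[j][i] = max(C[i][j], C[m][j])
        I[m] = False              # deactivate the absorbed cluster
    return A

# three points where 0 and 1 are most similar, so they merge first
merges = naive_hac([[1.0, 0.9, 0.1],
                    [0.9, 1.0, 0.2],
                    [0.1, 0.2, 1.0]])
```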

We experimented with another clustering approach, Latent Dirichlet Allocation. In natural language processing, latent Dirichlet allocation (LDA) is a generative model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar [8]. The MALLET (MAchine Learning for LanguagE Toolkit) API was used for LDA clustering. In LDA each topic is a distribution over a vocabulary and each tweet is a distribution over the set of topics. As explained earlier, each tweet can thus be expressed as a mixture of the multiple topics currently under consideration [5].


5.1 Results of Clustering

For our experiments we collected 100,000 sports-related tweets; after pre-processing only 17,000 tweets were left. After applying the clustering algorithm described above, 66 clusters were formed, with on average 250 tweets in each cluster. A higher threshold yields fewer clusters but merges different topics. We experimented with the threshold and found that 0.5 yields proper results, such as putting all the tweets for a certain sport into one cluster.

Once clustering is done, each cluster is represented by the ten most frequent words from all the tweets in that particular cluster. This is the main purpose of the clustering part in solving the short-text problem of tweets: as a cluster, tweets carry proper information that can be represented as a whole in a much better way. Each tweet is expressed as a weighted vector of the words from the cluster it belongs to together with the nouns in the tweet itself. How the weights are calculated is explained in the next section.
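Extracting the representative words of a cluster is straightforward; a minimal sketch (the function name is ours):

```python
from collections import Counter

def representative_words(cluster_tweets, k=10):
    """cluster_tweets: list of token lists belonging to one cluster.
    Returns the k most frequent words, which jointly represent the cluster."""
    counts = Counter()
    for tokens in cluster_tweets:
        counts.update(tokens)
    return [w for w, _ in counts.most_common(k)]
```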

6 Expanding the Tweet

To expand the tweet we use distributional semantic techniques. The distributional hypothesis in linguistics derives from the semantic theory of language usage: words that are used and occur in the same contexts tend to purport similar meanings. The underlying idea that "a word is characterized by the company it keeps" was popularized by Firth. The distributional hypothesis is the basis for statistical semantics. Although it originated in linguistics, it is now receiving attention in cognitive science, especially regarding the context of word use [10].

Co-occurrence of words is a strong measure of similarity between otherwise unrelated words. To express a tweet in terms of the representative words of its cluster, we build a term × representative-word matrix. The terms are all the words of all tweets in the cluster, and the representative words are the most frequent words chosen to represent the cluster. An entry in this matrix is the frequency of co-occurrence of a term with the corresponding most frequent cluster word.

Once this matrix is constructed, a tweet is picked up and, for each of its terms, co-occurrence is read off the matrix, giving a weighted vector in terms of the representative words. This vector is normalized and appended with a weighted vector of the tweet's own words with weight 1. The weights are important because the more representative words have to be given more importance when mapping to the hierarchy in the next stage. The matrix is smoothed to remove zeros. This is done so that even if a tweet has little information, it can at least be expressed in terms of its cluster's words, as it belongs to the same topic. Nevertheless, the tweet's own words are always given the higher score of 1, giving priority to its words.
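The expansion step above can be sketched as follows. This is an illustrative implementation under our own naming, using add-one smoothing to remove zeros, not necessarily the exact smoothing used:

```python
from collections import defaultdict

def cooccurrence_matrix(cluster_tweets, rep_words):
    """Count, for every term in the cluster, how often it co-occurs (within the
    same tweet) with each of the cluster's representative words."""
    matrix = defaultdict(lambda: {r: 0 for r in rep_words})
    for tokens in cluster_tweets:
        present = set(tokens)
        for term in present:
            for r in present & set(rep_words):
                if term != r:
                    matrix[term][r] += 1
    return matrix

def expand_tweet(tokens, matrix, rep_words):
    """Normalised, smoothed weight vector over the representative words,
    appended with the tweet's own words at weight 1."""
    scores = {r: 1.0 for r in rep_words}   # add-one smoothing removes zeros
    for t in tokens:
        if t in matrix:
            for r, c in matrix[t].items():
                scores[r] += c
    total = sum(scores.values())
    vector = {r: s / total for r, s in scores.items()}
    for t in tokens:                       # own words always get weight 1
        vector[t] = 1.0
    return vector
```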

7 Preprocessing Wikipedia Category Graph

We utilize Wikipedia as the knowledge base for inferring hierarchical interests. Although there are other free ontologies, such as OpenCyc 1 and the ODP taxonomy 2, we opted for Wikipedia because of its vast domain coverage. However, a major challenge in utilizing Wikipedia as a hierarchy is that it is not actually a hierarchy; it is a graph. The Wikipedia category graph (WCG) contains cycles and hence is neither a taxonomy nor a hierarchy. These cycles make it non-trivial to determine the hierarchical relationships between categories. For example, determining that Category:Baseball is conceptually more abstract than Category:Major League Baseball is difficult if cycles exist in the graph. Therefore the Wikipedia category graph is transformed into a hierarchy by assigning a level of abstraction to each category. First, we remove categories that are irrelevant to topics of practical use in the case of Twitter. Specifically, we remove the Wikipedia admin categories 3, which the Wikipedia administration includes only to maintain and manage Wikipedia. A sub-string match is employed for the categories against the set of labels used in [11]. Consequently, around 64K categories with 150K links are removed from the category graph. The categories are filtered based on the following strings in their labels: wikipedia, wikiprojects, lists, mediawiki, template, user, portal, categories, articles, pages.

7.1 Level Identification

The root category (node) of the Wikipedia category graph is Category:Main Topic Classifications, which subsumes 98 percent of the categories. Selecting this root node as the most abstract category, we determine the relative hierarchical levels of the other categories: we assign each category the length of its shortest path from the root (computed by a breadth-first search on the category graph) as its hierarchical level (level of abstractness).
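The level assignment is a plain breadth-first search; a minimal sketch, with the graph represented as an adjacency map from category to sub-categories:

```python
from collections import deque

def assign_levels(children, root):
    """children: dict mapping a category to its list of sub-categories.
    Returns each reachable category's hierarchical level, i.e. the shortest
    distance from the root (BFS visits each node first along a shortest path)."""
    level = {root: 0}
    queue = deque([root])
    while queue:
        cat = queue.popleft()
        for sub in children.get(cat, []):
            if sub not in level:          # first visit = shortest path
                level[sub] = level[cat] + 1
                queue.append(sub)
    return level
```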

7.2 Non-hierarchical links removal

Once the hierarchical levels are assigned, we remove the edges that do not conform to a hierarchical structure, i.e. all directed edges from a category of larger hierarchical level (specific) to one of smaller hierarchical level (conceptually abstract) are removed. Performing this task reduced the WCG from 1.9M links to 1.2M links, also leading to the removal of cycles in the category graph, converting it into a directed acyclic graph (DAG).

1 http://www.opencyc.org
2 http://www.dmoz.org
3 http://en.wikipedia.org/wiki/Category:Wikipedia administration

Figure 1: Preprocessing the Wikipedia category graph. (a) Assigning levels to the nodes in the hierarchy. (b) Removal of edges pointing back to more abstract categories.

Figure 2: A screenshot of the edge list of categories.
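Given the levels computed in the previous step, the edge filter is a one-liner; a sketch under our own naming:

```python
def remove_non_hierarchical_links(edges, level):
    """Keep only edges pointing from a more abstract category (smaller level)
    to a more specific one (larger level); dropping the rest removes all
    back-edges and hence all cycles, yielding a DAG."""
    return [(parent, child) for parent, child in edges
            if level[parent] < level[child]]
```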

The output of this process is a hierarchy of height 15, rooted at the node Category:Main Topic Classifications. The nodes in the hierarchy have many-to-many relationships, and hence it is still not a taxonomy. This refined graph, with directed edges that conform to a hierarchy, is referred to as the Wikipedia Hierarchy (WH).

We used the scripts and instructions provided in 4 [12] [13] to accomplish this task. The Wikipedia dump (snapshot of 2014) was downloaded by following the instructions in 5. The final output is a file containing an edge list, with each edge denoted by tab-separated nodes on each line as follows: <sub-category><tab><category><tab><priority>, where priority denotes the importance of the link, as described in section 8.

8 Mapping the tweets to categories in the hierarchy

This module maps the content-bearing words in the expanded tweet to various categories in the hierarchy. The approach described in [10] uses entity recognition to identify Primitive Interests from a user's tweets, and scores them based on their frequency. Entity recognition in tweets is non-trivial due to their informal nature and ungrammatical language. They used an existing entity-resolution framework called Zemanta for the following reasons: (1) Zemanta links the entities spotted in tweets to their corresponding Wikipedia articles (Primitive Interests); (2) Zemanta has relatively superior performance to other available services.

4 http://www.cs.technion.ac.il/ gabr/resources/code/wikiprep
5 http://en.wikipedia.org/wiki/Wikipedia:Database download

In this work we focus on increasing the content of tweets (as they may contain little information) by expanding the tweet words with a technique similar to distributional semantics, via a co-occurrence matrix. Entity resolution will therefore not work in our case, since we finally have a bag of words. Thus we use a different approach to map each word of the tweet to the categories in the hierarchy, as described below.

8.1 Selecting the categories

For each word of the tweet, every category that contains the word as a substring is selected and assigned a score. Since many categories may get selected this way, we limit the selection to categories in which the word is present as a space-separated string (a complete word). For example, Fanta is a substring of Fantasy, but the category Fantasy does not in any way relate to Fanta, so it is omitted. Moreover, we consider only those categories whose score is greater than a threshold that we have decided empirically.
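The complete-word constraint can be sketched with a word-boundary regular expression (a close approximation of the space-separated-string check; names are illustrative):

```python
import re

def match_categories(word, categories):
    """Select categories that contain `word` as a complete word, so that
    'Fanta' does not match 'Fantasy' but does match 'Fanta products'."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [c for c in categories if pattern.search(c)]
```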

8.2 Assigning score to the matched nodes

Once the matching categories are identified, the next step is to score them according to how much relevance they actually bear to the tweet. This scoring is essential, as it will be used to score the other categories in the graph that are not directly matched, through the spreading activation technique described in the next section for generating the subgraph of all related topics. We use a count-based method of scoring the categories as follows. If a category C is matched with words of a tweet T, its score is given as:

Score(C, T) = Σ_{w ∈ T ∧ Match(w,C)=1} (1/|C|) · D_w    (1)

where Match(w, C) is a boolean function which is true if the word w exists in the category name C according to the method described above. D_w denotes the score assigned to word w in the tweet T, which is 1 if the word exists in the original tweet, and otherwise equals the normalised co-occurrence count of the word with another word from the tweet, calculated from the co-occurrence matrix. |C| denotes the number of words in the category name. This division penalises categories that contain some words from the tweet but are longer in length (the importance of the word in question being lower in such a category). For example, the word 'football' matches Category:List of state club football players in the United States of America as well as Category:Football, but has less relevance in the first case.
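Equation (1) translates directly into code; a sketch, with the match predicate passed in so any of the matching schemes above can be used:

```python
def score_category(category, tweet_weights, match):
    """Equation (1): sum D_w / |C| over tweet words w that match category C.
    tweet_weights maps each word of the expanded tweet to its weight D_w
    (1 for original tweet words, a normalised co-occurrence weight otherwise);
    |C| is the number of words in the category name."""
    size = len(category.split())
    return sum(d / size for w, d in tweet_weights.items() if match(w, category))
```

Note how the division by |C| penalises long category names: 'football' scores 1.0 against Category:Football but only a fraction of that against a five-word category name.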

To summarise, this module gives weighted categories for each tweet, where the score denotes the relevance of the category to the tweet.

Algorithm 2 Match Tweet with Wikipedia Categories

1: procedure MatchCategory(tweet_bag)
2:   // N is the number of categories in the Wikipedia hierarchy
3:   Act[1..N] ← 0.0
4:   PrimCat ← {}
5:   for all category in category hierarchy do
6:     for all word in tweet_bag do
7:       if word is in category then
8:         Act[category] ← Act[category] + D_word / |category|
9:         PrimCat ← {category} ∪ PrimCat

9 Generating the category subgraph

9.1 Spreading activation

Spreading activation [14] is a method for searching associative networks, neural networks, or semantic networks. The search process is initiated by labeling a set of source nodes (e.g. concepts in a semantic network) with weights or "activation" and then iteratively propagating, or "spreading", that activation out to other nodes linked to the source nodes. Most often these weights are real values that decay as activation propagates through the network. When the weights are discrete this process is often referred to as marker passing. Activation may originate from alternate paths, identified by distinct markers, and terminate when two alternate paths reach the same node. Brain studies, moreover, show that several different brain areas play an important role in semantic processing. In a generalised framework, a spreading activation algorithm works as follows.

A directed graph is populated by nodes [1...N], each having an associated activation value A[i], a real number in the range [0.0, 1.0]. A link[i, j] connects source node[i] with target node[j]. Each link has an associated weight W[i, j], usually a real number in the range [0.0, 1.0].

Parameters:

• Firing threshold F, a real number in the range [0.0 ... 1.0]

• Decay factor D, a real number in the range [0.0 ... 1.0]


1. Initialize the graph, setting all activation values A[i] to zero. Set one or more origin nodes to an initial activation value greater than the firing threshold F. A typical initial value is 1.0.

2. For each unfired node[i] in the graph having an activation value A[i] greater than the node firing threshold F:

3. For each link[i, j] connecting the source node[i] with target node[j], adjust A[j] = A[j] + (A[i] * W[i, j] * D), where D is the decay factor.

4. If a target node receives an adjustment to its activation value so that it would exceed 1.0, then set its new activation value to 1.0. Likewise, maintain 0.0 as a lower bound on the target node's activation value should it receive an adjustment below 0.0.

5. Once a node has fired it may not fire again, although variations of the basic algorithm permit repeated firings and loops through the graph.

6. Nodes receiving a new activation value that exceeds the firing threshold F are marked for firing on the next spreading activation cycle.

7. If activation originates from more than one node, a variation of the algorithm permits marker passing to distinguish the paths by which activation is spread over the graph.

8. The procedure terminates either when there are no more nodes to fire or, in the case of marker passing from multiple origins, when a node is reached from more than one path. Variations of the algorithm that permit repeated node firings and activation loops in the graph terminate after a steady activation state, with respect to some delta, is reached, or when a maximum number of iterations is exceeded.
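The basic single-firing variant of these steps can be sketched as follows. This is an illustrative implementation (marker passing and repeated firings are omitted), recovering the MotoX → MOTOROLA → SMARTPHONES → TECHNOLOGY chain from the introduction as its example:

```python
def spread_activation(links, seeds, F=0.1, D=0.5):
    """links: dict node -> list of (neighbour, weight) edges.
    seeds: dict node -> initial activation in [0.0, 1.0].
    Iteratively fires nodes whose activation exceeds threshold F, propagating
    A[j] += A[i] * W[i, j] * D (clamped to [0, 1]); each node fires once."""
    A = dict(seeds)
    fired = set()
    frontier = [n for n, a in seeds.items() if a > F]
    while frontier:
        nxt = []
        for i in frontier:
            if i in fired:
                continue
            fired.add(i)
            for j, w in links.get(i, []):
                A[j] = min(1.0, max(0.0, A.get(j, 0.0) + A[i] * w * D))
                if A[j] > F and j not in fired:
                    nxt.append(j)   # marked for the next spreading cycle
        frontier = nxt
    return A

links = {"MotoX": [("Motorola", 1.0)],
         "Motorola": [("Smartphones", 1.0)],
         "Smartphones": [("Technology", 1.0)]}
activations = spread_activation(links, {"MotoX": 1.0}, F=0.1, D=0.5)
```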

9.2 Use of spreading activation to generate category subgraph

For each tweet, various categories are selected and weighted according to the method in the previous module. We call these primary categories. Their weights are the activation values of these nodes (values between 0 and 1). All other categories have their activation values initialised to 0. These activation values are propagated up and down the hierarchy using spreading activation, as described above, to determine related relevant categories and their weights (activation values). This procedure is performed for each initially activated node. We stop spreading when the activation value of a node (category) falls below a certain threshold F. The activated nodes obtained by this procedure are called secondary categories.


The propagation of the weights is performed using the following activation function:

A_i = A_i + A_j × W_ij × D    (2)

where A_k denotes the activation value of node k, i is the node to be activated, and j is the already activated node. W_ij denotes the weight of the edge between categories i and j, which encodes the priority with which category i is the child of j (or vice versa). This function is the most primitive form of the activation. [] uses various modifications of this activation function, as described below:

1. No weight, no decay, i.e. setting W_ij = 1, D = 1 for all i, j in Equation 2. This way every node in the graph gets activated, with the nodes high up the hierarchy (conceptually abstract) receiving very high activation values. This is intuitive, as many nodes lower in the hierarchy spread up the hierarchy without any constraints. Other decay values (0.4, 0.6, 0.8) were also experimented with. This setting does not take into account the distribution of nodes across the hierarchy.

2. It was observed that nodes with more children get higher scores than those with fewer children, because the distribution of categories in the hierarchy follows a bell curve. To account for this behaviour, another term was proposed that penalises nodes with more children by normalising the activation value of each category by the number of subcategories at its child level. Two normalising functions were used: one the raw count of the children of a node, the other the logarithm of that count.

N_i = 1 / nodes_{h_i + 1}    (3)

NL_i = 1 / log(nodes_{h_i + 1})    (4)

where h_i denotes the hierarchical level of node i, and nodes_h denotes the number of nodes at hierarchical level h.

3. A subcategory in the hierarchy has many categories (parents) associated with it. For example, Category:Soccer cup competitions in the United States is a subcategory of Category:Association football cup competitions by country as well as of Category:Soccer competitions in the United States. These parents are given equal priorities, hence equal weights are propagated during activation. Therefore, a preferential path constraint was introduced and priorities were assigned to each parent of a subcategory. This priority is motivated by the Wikipedia category graph itself, where categories are listed from left to right in increasing order of their priority as the parent of the current subcategory.


Figure 3: Generating the category subgraph. (a) Category distribution; (b) boost at intersection.

This heuristic is utilized in the activation function as follows:

P_ij = 1 / priority_ij    (5)

where priority_ij denotes the priority of the edge between categories i and j. Priorities are assigned to the parent categories of a particular category in a linear fashion (1, 2, ...).

4. Intersection booster. Another variation of the activation function assigns larger activation values to nodes where multiple primary nodes intersect. This is intuitive: if many primary nodes spread to a node, the category corresponding to that node is expected to be highly relevant to the tweet. To formalise this aspect and boost the intersection nodes, the following variation was introduced:

B_i = Ne_i / Ne_cmax    (6)

where Ne_i is the total number of primary categories activating node i, and cmax is the subcategory of i that has been activated by the maximum number of primary categories.

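The correction terms in variations 2–4 can be sketched as small helper functions. This is a sketch under assumed data structures: `level_of`, `nodes_per_level`, `parents`, `primary_hits`, and `children` are names of ours, not code from the report.

```python
import math

# Sketches of the normalising, priority, and booster terms (Equations 3-6).
# All input mappings are illustrative assumptions.

def log_count_norm(i, level_of, nodes_per_level):
    """NL_i = 1 / log(nodes_{h_i + 1}): penalise nodes over a crowded child level."""
    return 1.0 / math.log(nodes_per_level[level_of[i] + 1])

def priority_weight(i, j, parents):
    """P_ij = 1 / priority_ij, with priorities 1, 2, ... assigned to the
    parents of subcategory j in their listed order (our assumption)."""
    return 1.0 / (parents[j].index(i) + 1)

def intersection_boost(i, primary_hits, children):
    """B_i = Ne_i / Ne_cmax: boost nodes reached by many primary categories,
    relative to the most-activated subcategory cmax of i."""
    ne_cmax = max((primary_hits.get(c, 0) for c in children.get(i, [])), default=0)
    return primary_hits.get(i, 0) / ne_cmax if ne_cmax else 0.0
```

For instance, a second-listed parent receives priority 2 and hence edge weight P_ij = 0.5, and a node reached by four primary categories whose busiest child was reached by two gets a boost of 2.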

Using the above four variations, we have experimented on the dataset described in the previous section with the following activation functions. Since the dataset is small and covers only a small distribution of topics, evaluation of these methods remains to be done.

Using only empirical counts as the normalising function, with no edge weights:

A_i = A_i + A_j × N_i    (7)


Using log normalisation:

A_i = A_i + A_j × NL_i    (8)

Using edge priorities and the intersection booster:

A_i = A_i + A_j × NL_i × P_ij × B_i    (9)
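A single step of the richest variant, Equation 9, might look like the sketch below. The helper values (log-normaliser, edge priority, booster) are assumed to be precomputed or supplied; the function name and parameters are ours.

```python
import math

# Illustrative single update of Equation 9:
# A_i = A_i + A_j * NL_i * P_ij * B_i, clamped to 1.0 as in step 4.
def combined_update(a_i, a_j, child_level_count, priority_ij, boost_i):
    nl_i = 1.0 / math.log(child_level_count)   # NL_i, Equation 4
    p_ij = 1.0 / priority_ij                   # P_ij, Equation 5
    return min(1.0, a_i + a_j * nl_i * p_ij * boost_i)
```

The clamp keeps the activation within [0, 1] so that a heavily boosted intersection node cannot exceed the maximum activation value.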

10 Work In Progress

The input to our system is a set of tweets and the Wikipedia category graph, and the output is a subgraph of categories for each tweet. Currently, we have tested the code on a small dataset of tweets (around 100,000 originally, which reduced to 17,000 after cleaning). We plan to collect a larger collection of tweets on varied topics and test the code on them. Moreover, we plan to evaluate two aspects of the system: (1) the category subgraphs generated using the spreading activation methods described in section 8; (2) the authenticity of the Wikipedia category graph created using the algorithm described, by comparing it with a manually constructed taxonomy.

10.1 Evaluating the category subgraphs

We plan to have tweets manually annotated with topics in the Wikipedia hierarchy by different users. Given this annotation, and considering only the categories marked (ignoring the hierarchy), we will measure the accuracy, precision, and recall of the results (the mapped categories) produced by our system under different activation functions, on the same tweets marked by the users. Using these results we will further fine-tune parameters such as the firing threshold and the initial scoring function.
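The planned set-based comparison could be computed as below; this is only an illustrative sketch of the metric, with names of ours, treating the system output and the user annotation as flat category sets.

```python
# Sketch of the planned evaluation: compare the categories mapped to a
# tweet by the system against a user's manual annotation, hierarchy ignored.
def precision_recall(predicted, annotated):
    predicted, annotated = set(predicted), set(annotated)
    tp = len(predicted & annotated)          # categories both agree on
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    return precision, recall
```

For example, if the system maps a tweet to {Sports, Football, Politics} and the annotator marked {Football, Sports, Cricket}, both precision and recall are 2/3.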

References

[1] Short Text Classification in Twitter to Improve Information Filtering. Bharath Sriram, David Fuhry, Engin Demir, Hakan Ferhatosmanoglu, Murat Demirbas.

[2] Twitter Trending Topic Classification. Kathy Lee, Diana Palsetia, Ramanathan Narayanan, Md. Mostofa Ali Patwary, Ankit Agrawal, and Alok Choudhary. 2011 11th IEEE International Conference on Data Mining Workshops.

[3] Event Detection in Twitter using Aggressive Filtering and Hierarchical Tweet Clustering. Georgiana Ifrim, Bichen Shi, Igor Brigadir.

[4] Twitter API Used for Data (http://twitter4j.org/en/)

[5] McCallum, Andrew Kachites. "MALLET: A Machine Learning for Language Toolkit." http://mallet.cs.umass.edu. 2002.


[6] Clustering analysis and Techniques http://en.wikipedia.org/wiki/Cluster_analysis?oldformat=true

[7] Hierarchical Clustering http://en.wikipedia.org/wiki/Hierarchical_clustering?oldformat=true

[8] Latent Dirichlet Allocation. David M. Blei, Andrew Y. Ng, Michael I. Jordan.

[9] Clustering Pseudocode and Example Dendrogram, http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html

[10] Distributional Semantic Techniques (co-occurrence represents correlation), http://en.wikipedia.org/wiki/Distributional_semantics?oldformat=true

[11] User Interests Identification on Twitter Using a Hierarchical Knowledge Base. Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani and Amit Sheth.

[12] Deriving a Large Scale Taxonomy from Wikipedia. Simone Paolo Ponzetto and Michael Strube. AAAI 2007.

[13] Evgeniy Gabrilovich and Shaul Markovitch. Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), Hyderabad, India, January 2007.

[14] Evgeniy Gabrilovich and Shaul Markovitch. Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge. Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), pp. 1301-1306, Boston, July 2006.

[15] Allan M. Collins, Elizabeth F. Loftus. A Spreading-Activation Theory of Semantic Processing. Psychological Review, 1975.
