Tutorial on Topic Modelling from messages on the Internet

Tutorial on Topic modelling from messages on the Internet by Xiang Kong Prepared as an assignment for CS410: Text Information Systems in Spring 2016


Page 1: Tutorial on Topic Modelling from messages on the Internet


Tutorial on Topic modelling from messages on the Internet

by Xiang Kong

Prepared as an assignment for CS410: Text Information Systems in Spring 2016

Page 2: Tutorial on Topic Modelling from messages on the Internet

Plan

• Motivation

• Methods

• Conclusion

• Relevant work

Page 3: Tutorial on Topic Modelling from messages on the Internet

Why we are interested

• Millions of messages are generated on the Internet (Twitter, Facebook, LinkedIn, etc.)

• Understanding the interests of a large number of people

• Recognizing topics in the real world

Page 4: Tutorial on Topic Modelling from messages on the Internet

More about why

According to the Cowen & Co predictions and report:

• Twitter had 241 million monthly active users at the end of 2013

• Twitter was predicted to reach 270 million monthly active users by the end of 2014

Page 5: Tutorial on Topic Modelling from messages on the Internet

Can traditional methods work?

• Traditional methods (PLSA, LDA) do not work very well on Internet messages.

Their results may not represent what is actually happening on the Internet (from Hong L, 2010)

Page 6: Tutorial on Topic Modelling from messages on the Internet

Differences

• Messages are usually very short (e.g. tweets)

• Several languages can appear in one short message

• Unstructured

• Abbreviations

Page 7: Tutorial on Topic Modelling from messages on the Internet

We need some changes

• Combining with extra sources (hashtags in Twitter)

• Clustering-based topic extraction

• Vision information

Page 8: Tutorial on Topic Modelling from messages on the Internet

Hashtag

• Providing context or metadata for online messages (tweets, for example)

• Organizing the information in online messages for retrieval

• Helping to find the latest trends

• Helping to reach a larger audience

Fig 1 Hashtag example
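
The organizing role of hashtags listed above can be illustrated with a minimal sketch that pools short messages under the hashtags they carry, so that each pool can later be treated as one longer document. The function name `pool_by_hashtag`, the regular expression, and the sample tweets are assumptions made for illustration, not part of the cited work.

```python
import re
from collections import defaultdict

# Assumed pattern: a hashtag is '#' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")

def pool_by_hashtag(tweets):
    """Group tweet texts under each (lowercased) hashtag they contain."""
    pools = defaultdict(list)
    for text in tweets:
        # A set avoids adding the same tweet twice for a repeated tag.
        tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
        for tag in tags:
            pools[tag].append(text)
    return dict(pools)

# Hypothetical sample messages.
tweets = [
    "Strong #earthquake felt downtown",
    "Donate to #Earthquake relief #help",
    "New phone launch today #tech",
]
pools = pool_by_hashtag(tweets)
```

Tweets without any hashtag fall out of every pool, which is the gap the "Missing hashtags" slide addresses.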

Page 9: Tutorial on Topic Modelling from messages on the Internet

Performance improvement (Xinfan Meng, Furu Wei, Xiaohua Liu et al.)

Topic extraction from tweets using graph-based methods with the help of hashtags (Labeled LDA).

• The co-occurrence and cosine methods produce more, and smaller, clusters than the Labeled LDA method.

• The distributional similarity approach (Labeled LDA) based on hashtags can greatly improve performance

Fig 2. Accuracy of topic extraction for different methods
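
To make the cosine method mentioned above concrete, here is a small sketch of cosine similarity between the word-count contexts of two hashtags. It is only an illustration of the similarity measure, not the procedure of Meng et al.; the `contexts` data and the hashtag names are made up.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical word counts of the tweets that carry each hashtag.
contexts = {
    "earthquake": Counter({"haiti": 3, "relief": 2, "shake": 1}),
    "quake": Counter({"haiti": 2, "relief": 1}),
    "tech": Counter({"phone": 4, "launch": 2}),
}
sim_close = cosine(contexts["earthquake"], contexts["quake"])
sim_far = cosine(contexts["earthquake"], contexts["tech"])
```

Hashtags whose contexts score above some chosen threshold could then be linked in a graph and clustered; the threshold itself would have to be tuned on real data.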

Page 10: Tutorial on Topic Modelling from messages on the Internet

Missing hashtags

• User's tweet history

• Social graph

• Influential friends

• Temporal Information

Page 11: Tutorial on Topic Modelling from messages on the Internet

Different clustering (Ming Xie and Yunlu Zhang)

Optimizing the density-based clustering algorithm OPTICS

• It uses WordNet for word sense disambiguation of words in the learning resource documents

• It maps the data space of the original method to a sentence vector space, improving the original OPTICS algorithm.

Page 12: Tutorial on Topic Modelling from messages on the Internet

Vision-based (1) (Qingshui Li and Kai Wu 2010)

• Using the vision information of a Web page (navigation bars, banner bars, etc.) can avoid the need for sophisticated natural language processing technology

• Analyze the visual characteristics of each page block to accurately determine the topic data region.

Page 13: Tutorial on Topic Modelling from messages on the Internet

Vision-based (2) (Qingshui Li and Kai Wu 2010)

• First detect whether the Web page contains a specific tag; if it does, analyze the page according to that tag.

• Using the information in these vision tags, topics can be extracted more efficiently.

• Topics are usually displayed in a large font, in a prominent position, or in a different color.

• Otherwise, some topics are not displayed in a special format, or the page has no explicit topic at all. In this case, they use a frequency algorithm to carry out the topic extraction.

NB: VB in the sample means vision block.
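
The tag-first, frequency-fallback flow described on this slide can be sketched roughly as follows. The choice of `title` and `h1` as the "specific tags", the tiny stopword list, and the function name `extract_topic` are all assumptions for illustration; the paper's actual vision-block analysis is not reproduced here.

```python
import re
from collections import Counter
from html.parser import HTMLParser

TOPIC_TAGS = ("title", "h1")  # assumption: tags treated as topic carriers
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "to", "is"}

class TopicParser(HTMLParser):
    """Collects text inside topic tags separately from other text."""

    def __init__(self):
        super().__init__()
        self.current = None   # topic tag we are currently inside, if any
        self.tagged = []      # text found inside topic tags
        self.body_text = []   # all other text

    def handle_starttag(self, tag, attrs):
        if tag in TOPIC_TAGS:
            self.current = tag

    def handle_endtag(self, tag):
        if tag == self.current:
            self.current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current:
            self.tagged.append(text)
        else:
            self.body_text.append(text)

def extract_topic(html):
    parser = TopicParser()
    parser.feed(html)
    if parser.tagged:
        # A specific tag was found: use its text as the topic.
        return parser.tagged[0]
    # Fallback: most frequent non-stopword in the remaining text.
    words = re.findall(r"[a-z]+", " ".join(parser.body_text).lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(1)[0][0] if counts else None
```

For example, a page with an `h1` heading returns that heading, while a page with only paragraph text returns its most frequent content word.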

Page 14: Tutorial on Topic Modelling from messages on the Internet

Topic emerging time (Adrien Bougouin and Florian Boudin)

(Mario Cataldi, Luigi Di Caro and Claudio Schifanella)

We want to know how the topics on the Internet change

• Extracting the contents according to a novel aging theory

• Analyzing the social relationships in the network with the well-known PageRank algorithm in order to determine the authority of the users.

• Finally, we leverage a navigable topic graph, allowing the detection of the emerging topics under user-specified time constraints.
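
The authority step above relies on PageRank. A minimal power-iteration sketch is shown below, assuming a graph in which each user points to the users they follow; the damping factor 0.85, the iteration count, and the sample `follows` graph are illustrative assumptions rather than values taken from the cited paper.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph maps each user to the list of users they follow."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = graph.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling user (follows nobody): spread rank evenly.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank

# Hypothetical follower graph.
follows = {"ann": ["cat"], "bob": ["cat"], "cat": ["ann"], "dan": ["cat", "ann"]}
ranks = pagerank(follows)
```

Users followed by many (or by highly ranked) users end up with higher scores, which can then weight the content they post.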

Page 15: Tutorial on Topic Modelling from messages on the Internet

Novel aging theory (1)

• Many conventional clustering and classification strategies cannot be applied to this problem because they tend to ignore the temporal relationships among documents (tweets in our case) related to a news event.

• We can evaluate the usage of a keyword by its energy, which indicates the vitality status of the keyword and can qualify the keyword’s usage. In fact, a high energy value implies that the term is becoming important in the considered community, while a low energy value implies that it is currently falling out of favor.
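
The slides do not give a formula for energy, so the sketch below uses an assumed stand-in: it compares the squared current usage of a keyword with its squared usage in a few earlier intervals, giving nearer intervals more weight. The function name `keyword_energy`, the `window` parameter, and the sample series are hypothetical.

```python
def keyword_energy(usage, window=3):
    """Assumed energy score for one keyword.

    usage: per-interval usage scores, oldest first.
    Returns a positive value when recent usage exceeds past usage
    and a negative value when the keyword is fading.
    """
    if len(usage) < 2:
        return 0.0
    current = usage[-1]
    past = usage[-(window + 1):-1]  # up to `window` earlier intervals
    energy = 0.0
    # distance 1 is the interval just before the current one.
    for distance, value in enumerate(reversed(past), start=1):
        energy += (current ** 2 - value ** 2) / distance
    return energy

rising = keyword_energy([1, 1, 2, 10])   # sudden burst of usage
fading = keyword_energy([10, 8, 5, 1])   # usage dying out
```

Under this assumption a bursting term such as "earthquake" right after an event gets a large positive score, while a term losing attention gets a negative one.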

Page 16: Tutorial on Topic Modelling from messages on the Internet

Novel aging theory (2)

Fig 4. Statistical usage of the term “earthquake” in Twitter from October 2009 to January 2010; the peak corresponds to the earthquake that occurred in Haiti on 12 January 2010.

Fig 5. A topic graph with two strongly connected components (in red and blue) representing two different emerging topics: labels in bold represent emerging keywords, while the thickness of an edge represents the semantic relationship between the considered keywords.

Page 17: Tutorial on Topic Modelling from messages on the Internet

After extracting topics

• Recommendation system (Brendan O'Connor, Michel Krieger and David Ahn)

• Topic evolution (Yookyung Jo, John E. Hopcroft, Carl Lagoze)

Page 18: Tutorial on Topic Modelling from messages on the Internet

An example (1) (Mathieu Bastian, Matthew Hayes, William Vaughan et al.)

“Skills and Expertise” is a data-driven feature on LinkedIn, the world’s largest professional online social network, which allows members to tag themselves with topics representing their areas of expertise.

Page 19: Tutorial on Topic Modelling from messages on the Internet

An example (2)

• Folksonomy creation

  - Entity extraction

  - Clustering to provide context

• Skills inference and recommendation

  - Naive Bayes classifier to estimate the likelihood that a user has a skill
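
As a rough illustration of the Naive Bayes step above, the following sketch trains a small Bernoulli Naive Bayes model on profile terms and returns the estimated probability that a user has a given skill. The feature choice (a set of profile terms), add-one smoothing, and the toy `profiles` data are assumptions; LinkedIn's actual features and model details are not described in the slides.

```python
import math
from collections import Counter

def train(profiles):
    """profiles: list of (set_of_profile_terms, has_skill) pairs."""
    class_counts = Counter()
    term_counts = {True: Counter(), False: Counter()}
    vocab = set()
    for terms, has_skill in profiles:
        class_counts[has_skill] += 1
        term_counts[has_skill].update(terms)
        vocab |= terms
    return class_counts, term_counts, vocab

def skill_probability(model, terms):
    """Estimated P(has skill | profile terms), Bernoulli Naive Bayes."""
    class_counts, term_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label in (True, False):
        # Log prior with add-one smoothing.
        score = math.log((class_counts[label] + 1) / (total + 2))
        for term in vocab:
            p = (term_counts[label][term] + 1) / (class_counts[label] + 2)
            score += math.log(p if term in terms else 1.0 - p)
        log_scores[label] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    exp = {k: math.exp(v - top) for k, v in log_scores.items()}
    return exp[True] / (exp[True] + exp[False])

# Hypothetical training data for the skill "data analysis".
profiles = [
    ({"python", "pandas", "statistics"}, True),
    ({"python", "sklearn"}, True),
    ({"sales", "marketing"}, False),
    ({"marketing", "excel"}, False),
]
model = train(profiles)
likely = skill_probability(model, {"python", "statistics"})
unlikely = skill_probability(model, {"marketing"})
```

Skills whose probability exceeds a chosen cutoff could then be recommended to the member.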

Page 20: Tutorial on Topic Modelling from messages on the Internet

Practical design

• Large data scale

  - Hadoop (MapReduce) framework

  - Dimensionality reduction (PCA, ICA)

• Topic extraction

  - Unsupervised training (EM, DNN, etc.)
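
The MapReduce idea mentioned above can be shown in miniature with plain Python: a map phase emits (word, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group. This is a single-process sketch of the programming model only, not Hadoop code; the function names and sample messages are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_phase(message):
    """Emit a (word, 1) pair for every word in one message."""
    return [(word, 1) for word in message.lower().split()]

def shuffle(pairs):
    """Group values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single count."""
    return key, sum(values)

# Hypothetical messages.
messages = ["earthquake in haiti", "haiti relief", "earthquake relief fund"]
mapped = chain.from_iterable(map_phase(m) for m in messages)
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
```

On a real cluster the map and reduce functions run in parallel on different machines, which is what makes term counting over very large message collections feasible.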

Page 21: Tutorial on Topic Modelling from messages on the Internet

Conclusion

• It is necessary to extract summaries of messages on the Internet

• Online messages pose some difficulties but also give us some extra clues (vision)

• Some applications exist (LinkedIn skills inference system)

Page 22: Tutorial on Topic Modelling from messages on the Internet

Further work (1)

• It is a popular field right now due to the popularity of the Internet

• How to handle online messages is still an open field

  - Highly unstructured data

  - Short but meaningful messages

• How to make implementations that cluster online messages faster

Page 23: Tutorial on Topic Modelling from messages on the Internet

Further work (2)

• Mining the relationships between topics, topic evolution thread discovery, and text mining on evolution threads.

• Building a navigational application from the graph's concrete information, i.e. through “edges” between topics in the graph model.

Page 24: Tutorial on Topic Modelling from messages on the Internet

References

• Papers used in this tutorial are listed in this file: https://subversion.ews.illinois.edu/svn/sp16-cs410/xkong12/progress.pdf