On Frequent Chatters Mining Claudio Lucchese 1 st HPC Lab Workshop 6/15/12 1st HPC Workshp - Claudio...


Page 1: On Frequent Chatters Mining Claudio Lucchese 1 st HPC Lab Workshop 6/15/12 1st HPC Workshp - Claudio Lucchese.

On Frequent Chatters Mining

Claudio Lucchese

1st HPC Lab Workshop

6/15/12 1st HPC Workshp - Claudio Lucchese

Page 2: Frequent Patterns Mining

• How many patterns do you see in the following dataset?

[Figure: a binary dataset with columns A to M and rows 1 to 13; the cell contents are not preserved in this transcript.]

Claudio Lucchese, Salvatore Orlando, Raffaele Perego: Mining Top-K Patterns from Binary Datasets in Presence of Noise. SDM 2010

Page 3: Frequent Patterns Mining

[Figure: the same dataset with columns A to M and rows 1 to 13; the cell contents are not preserved in this transcript.]

Page 4: Frequent Patterns Mining

• Usually rows and columns are not in “good-looking” order
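
To illustrate the point above, here is a minimal sketch (not taken from the talk) that builds a small binary dataset with two clean blocks and then shuffles its rows and columns. The toy matrix, the `shuffled` and `show` helpers, and the fixed seed are all illustrative assumptions.

```python
import random

# Hypothetical 6x6 binary dataset with two obvious rectangular patterns.
data = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]

def shuffled(matrix, seed=0):
    """Return a copy of `matrix` with rows and columns randomly permuted."""
    rng = random.Random(seed)
    row_order = list(range(len(matrix)))
    col_order = list(range(len(matrix[0])))
    rng.shuffle(row_order)
    rng.shuffle(col_order)
    return [[matrix[r][c] for c in col_order] for r in row_order]

def show(matrix):
    """Print the matrix with '#' for ones and '.' for zeros."""
    for row in matrix:
        print("".join("#" if cell else "." for cell in row))

show(data)
print()
show(shuffled(data))
```

The shuffled matrix contains exactly the same patterns, since only the order of rows and columns changed, but the blocks are no longer contiguous and are much harder to spot by eye.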

Page 5: State of the art

• Most recent approaches try to discover the top-k patterns that optimize different cost functions:
  • Minimize noise (“holes”), or
  • Minimize MDL:
    • encoding(Patterns) + encoding(Data|Patterns)
  • Maximize Information Ratio:
    • Number of bits of information w.r.t. the Maximum Entropy Model built on the basis of the row and column marginal distributions
  • Minimize the length of the patterns and the amount of noise (our approach =)
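
The last cost function can be sketched as follows. This is one plausible reading of "length of patterns plus amount of noise", not necessarily the exact formulation of the SDM 2010 paper: each pattern is a pair (set of rows, set of columns), its length is the number of rows plus the number of columns, and noise counts the cells where the data and the union of the patterns disagree. The function name `cost` and the toy data are assumptions.

```python
def cost(data, patterns):
    """Pattern length plus noise for a binary dataset.

    data     -- set of (row, col) cells that are 1
    patterns -- list of (rows, cols) pairs, each a set of indices
    """
    covered = set()
    length = 0
    for rows, cols in patterns:
        length += len(rows) + len(cols)
        covered |= {(r, c) for r in rows for c in cols}
    # Noise: ones left uncovered plus zeros wrongly covered ("holes").
    noise = len(data ^ covered)
    return length + noise

# Toy dataset: a 3x3 block of ones with one hole, plus one stray one.
data = {(r, c) for r in range(3) for c in range(3)} - {(1, 1)}
data.add((5, 5))

no_patterns = cost(data, [])
one_pattern = cost(data, [({0, 1, 2}, {0, 1, 2})])
print(no_patterns, one_pattern)  # prints "9 8"
```

With no patterns, every one of the 9 cells counts as noise. A single 3x3 pattern costs 6 for its description plus 2 for the hole and the stray cell, so describing the block as a pattern is cheaper than listing its cells.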

Page 6: Evaluation

• Unsupervised:
  • Measure how well the proposed algorithm optimizes the proposed cost function
  • What is the best cost function?
• We are investigating supervised measures:
  • Unsupervised extraction: extract patterns from a classification/clustering dataset without class/cluster label information
  • Supervised evaluation: measure how well the patterns can predict/match classes/clusters
• Preliminary result:
  • Fancy cost functions might not be the best ones
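
One simple way to make "how well the patterns match the classes" concrete is purity: for each pattern, count how many of its rows belong to the pattern's majority class. This is an illustrative choice only, since the slides do not say which supervised measure is used; the function name `purity` and the toy labels are assumptions.

```python
from collections import Counter

def purity(pattern_rows, labels):
    """Fraction of (pattern, row) memberships that agree with the
    majority class of the pattern.

    pattern_rows -- list of sets of row ids, one set per extracted pattern
    labels       -- dict mapping row id -> class label (hidden during extraction)
    """
    matched = 0
    covered = 0
    for rows in pattern_rows:
        if not rows:
            continue
        counts = Counter(labels[r] for r in rows)
        matched += counts.most_common(1)[0][1]
        covered += len(rows)
    return matched / covered if covered else 0.0

# Patterns are extracted without looking at the labels, then scored against them.
labels = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
patterns = [{0, 1, 2}, {2, 3, 4, 5}]
print(purity(patterns, labels))
```

Here the first pattern is pure and the second contains one row of the wrong class, so 6 of the 7 memberships agree with the majority class.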

Page 7: Information Overload in News

Gianmarco De Francisci Morales, Aristides Gionis, Claudio Lucchese: From chatter to headlines: harnessing the real-time web for personalized news recommendation. WSDM 2012.

Page 8: Can we exploit Twitter?

✓ Timeliness
✓ Personalization

[Figure: number of mentions of “Osama Bin Laden”]

Page 9: News Get Old Soon

• 90% of the clicks happen within 2 days from publication
• Only a few occur early!

Page 10: T.Rex (Twitter-based news recommendation system)

• Builds a user model from Twitter
• Signals from user-generated content, social neighbors, and popularity across Twitter and news
• Entity-based representation (overcomes vocabulary mismatch)
• Learns a personalized news ranking function:
  • Pick up candidates from a pool of related or popular fresh news, rank them, and present the top-k to the user

Page 11: Recommendation Model

• Ranking function is user and time dependent
• Social model + Content model + Popularity model
• Popularity model tracks entity popularity by the number of mentions in Twitter and news (with exponential forgetting)
• Content model measures relatedness of a bag-of-entities representation of a user's tweet stream and of a news article
• Social model weights the content model of every social neighbor by a truncated PageRank on the Twitter network
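
A minimal sketch of how the three components could be combined. The slides state only that the ranking is the sum of a social, a content, and a popularity model, that popularity uses exponential forgetting, and that content relatedness works on bags of entities. The cosine similarity, the half-life parameter, the weights, and all names below are assumptions made for illustration, not the T.Rex implementation.

```python
import math
from collections import Counter

class DecayedCounter:
    """Mention counter with exponential forgetting (assumed half-life, in hours)."""

    def __init__(self, half_life=24.0):
        self.decay = math.log(2) / half_life
        self.value = Counter()   # entity -> decayed count at time of last update
        self.last = {}           # entity -> time of last update

    def add(self, entity, t, amount=1.0):
        # Decay the old count up to time t, then add the new mention.
        self.value[entity] = self.get(entity, t) + amount
        self.last[entity] = t

    def get(self, entity, t):
        if entity not in self.last:
            return 0.0
        return self.value[entity] * math.exp(-self.decay * (t - self.last[entity]))

def cosine(a, b):
    """Cosine similarity between two bags of entities (Counters)."""
    dot = sum(a[e] * b[e] for e in a if e in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def score(user_bag, neighbor_bags, article_bag, popularity, t, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of content, social, and popularity scores for one article.

    neighbor_bags -- list of (neighbor_weight, bag) pairs; on the slide the weight
                     comes from a truncated PageRank, here it is simply given.
    """
    content = cosine(user_bag, article_bag)
    social = sum(w * cosine(bag, article_bag) for w, bag in neighbor_bags)
    pop = sum(popularity.get(e, t) for e in article_bag)
    w_content, w_social, w_pop = weights
    return w_content * content + w_social * social + w_pop * pop

popularity = DecayedCounter(half_life=24.0)
popularity.add("Greece", t=0.0)
user = Counter({"Greece": 2})
article = Counter({"Greece": 1})
print(score(user, [], article, popularity, t=24.0))
```

Every update is a counter increment plus a decay factor, which is in keeping with a model that has to be refreshed continuously from a stream.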

Page 12: System Overview

✓ Designed to be streaming and lightweight (just counting)
✓ User model is updated continuously

Page 13: Learning the Weights

• Learning-to-rank approach with SVM
• Each time the user clicks on a news article, we learn a set of preferences (clicked_news > non_clicked_news)
• Prune the number of constraints for scalability:
  • only news published in the last 2 days
  • only take the top-k news for each ranking component
• Can optionally include additional features for news articles:
  • click count, age, etc. (T.Rex+)
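
The constraint generation described above can be sketched as follows. The data layout (dicts with `id`, `published`, and per-component `scores`), the function name `preference_pairs`, and the default `k` are assumptions; the resulting pairs would then be fed to a pairwise ranking learner such as a ranking SVM, which is not shown here.

```python
from datetime import datetime, timedelta

def preference_pairs(clicked, candidates, now, k=2, max_age=timedelta(days=2)):
    """Return (clicked_id, other_id) pairs meaning "clicked is preferred to other".

    Pruning, as on the slide: keep only news published in the last two days,
    and only the top-k news of each ranking component.
    """
    fresh = [n for n in candidates if now - n["published"] <= max_age]
    keep = set()
    components = {c for n in fresh for c in n["scores"]}
    for comp in components:
        ranked = sorted(fresh, key=lambda n: n["scores"].get(comp, 0.0), reverse=True)
        keep.update(n["id"] for n in ranked[:k])
    return [(clicked["id"], n["id"]) for n in fresh
            if n["id"] in keep and n["id"] != clicked["id"]]

now = datetime(2012, 6, 15, 12, 0)
news = [
    {"id": "n1", "published": now - timedelta(hours=3),
     "scores": {"social": 0.9, "content": 0.1, "popularity": 0.2}},
    {"id": "n2", "published": now - timedelta(hours=30),
     "scores": {"social": 0.1, "content": 0.8, "popularity": 0.3}},
    {"id": "n3", "published": now - timedelta(days=5),
     "scores": {"social": 0.7, "content": 0.9, "popularity": 0.9}},
    {"id": "n4", "published": now - timedelta(hours=1),
     "scores": {"social": 0.2, "content": 0.2, "popularity": 0.1}},
]
pairs = preference_pairs(news[0], news, now, k=1)
print(pairs)  # prints [('n1', 'n2')]
```

In this toy run the user clicked `n1`. Article `n3` is dropped because it is older than two days, and `n4` is dropped because it is not in the top-1 of any component, so a single constraint remains.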

Page 14: Predicting Clicked News

✓ User-generated content is a very good predictor, albeit very sparse
✓ Click count is a strong baseline but does not help T.Rex+

Page 15: Predicting Clicked Entities

Page 16: Future work (?)

• Explain a set of news showing how the main topics interacted with each other over time.

Page 17: Future work (?)

• Explain a set of news showing how the main topics interacted with each other over time.
• Example: European sovereign-debt crisis

[Figure: topics interacting along a time axis. Labels: Merkel, Monti, France, Berlusconi, Greece, EU, New Italian government, Fiscal Compact, EuroBond, Obama, Loan]

Page 18: Future work (?)

• Explain a set of news showing how the main topics interacted with each other over time.
• Applications:
  • Given the news the user is currently reading, provide an explanation of the related facts that precede that news
  • Given a query, provide an explanation of the documents related to that query
  • Given a set of topics, explain their relations over time
  • Browse a collection of news by changing the topics of interest, the time window, and the granularity

Page 19: Future work (?)

• Explain a set of news showing how the main topics interacted with each other over time.
• A topic is a named entity relevant over time
• An interaction is a cluster of news related to some event and relevant in a small time window
• It might be important to cover the given time window, but recent events might be more interesting

Page 20: Future work (?)

• Explain a set of news showing how the main topics interacted with each other over time.
• Given a maximum number of main topics and interactions, maximize:
  • Topic coverage and diversity
  • Events time coverage
  • Cluster similarity
  • Main topics connectivity

Page 21: Future work (?)

• Explain a set of news showing how the main topics interacted with each other over time.
• It is different from news clustering:
  • Even if you had a good clustering, it might not be trivial to select which events and which topics to show in order to maximize the amount of information delivered to the user
• There is some interesting related work
  • aimed at finding chains of news; we are more interested in topic evolution

Page 22: Thank you!