Mining Interest Topics from Plurk

Mining Interest Topics from Plurk Ken Yi-Chien Lee 2012/11/27


Page 1: Mining Interest Topics from Plurk

Mining Interest Topics from Plurk

Ken Yi-Chien Lee 2012/11/27

Page 2: Mining Interest Topics from Plurk

Outline
• Introduction
  – Why and what we do in this thesis?
• The SNSD system
  – Community detection
  – Interest hierarchy
• Implementation
  – Preprocessing
  – Celery task queue
• Experiments
• Conclusions and future works

Page 3: Mining Interest Topics from Plurk

INTRODUCTION I want to make friends with you.

Page 4: Mining Interest Topics from Plurk

Scenario

Page 5: Mining Interest Topics from Plurk

Scenario (cont.)

Page 6: Mining Interest Topics from Plurk

Plurk Timeline

Page 7: Mining Interest Topics from Plurk

Private Status

Page 8: Mining Interest Topics from Plurk

Traffic Statistics of Plurk in Taiwan

Page 9: Mining Interest Topics from Plurk

The Go!Plurk Project

Issues:
1. Unable to analyze private users
2. The pie chart is too simple and gives no detailed interest information

Page 10: Mining Interest Topics from Plurk

THE SNSD SYSTEM Find out what the plurker is interested in.

Page 11: Mining Interest Topics from Plurk

Social Networking Service Discovery

• Discover users’ interest topics via
  1. Posted contents (plurks) from users
  2. Aggregated interest information from communities, for the private users
• Have to prepare
  – Relationships
  – Plurks

Page 12: Mining Interest Topics from Plurk

Work-flow of SNSD System

Page 13: Mining Interest Topics from Plurk

Aggregation and Derivation

Page 14: Mining Interest Topics from Plurk

Aggregation and Derivation

Page 15: Mining Interest Topics from Plurk

Aggregation and Derivation

Page 16: Mining Interest Topics from Plurk

Aggregation and Derivation

Page 17: Mining Interest Topics from Plurk

Aggregation and Derivation

Page 18: Mining Interest Topics from Plurk

Aggregation and Derivation

Page 19: Mining Interest Topics from Plurk

Community Detection

• Snowball sampling
• Louvain algorithm
• Filtering
  – Karma
  – Gender
  – Privacy

Page 20: Mining Interest Topics from Plurk

Snowball Sampling
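Snowball sampling can be sketched as a breadth-first expansion from a seed user. This is a minimal illustration, not the thesis implementation: `get_friends`, `max_waves` and `max_users` are hypothetical names, and the friend lists in the usage note are made up.

```python
from collections import deque


def snowball_sample(seed, get_friends, max_waves=2, max_users=1000):
    """Breadth-first snowball sample starting from one seed user.

    get_friends is a hypothetical callable returning the friend ids of a user.
    Returns the sampled users and the friendship edges among them.
    """
    sampled = {seed}
    edges = set()
    frontier = deque([(seed, 0)])  # (user, wave in which the user was reached)
    while frontier:
        user, wave = frontier.popleft()
        if wave >= max_waves:
            continue  # users in the last wave are kept but not expanded
        for friend in get_friends(user):
            if friend not in sampled:
                if len(sampled) >= max_users:
                    continue  # sample is full: skip this user and this edge
                sampled.add(friend)
                frontier.append((friend, wave + 1))
            edges.add(frozenset((user, friend)))
    return sampled, edges
```

For example, with `friends = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}`, calling `snowball_sample("a", friends.__getitem__, max_waves=2)` reaches a, b, c and d but never expands d, so e stays outside the sample.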

Page 21: Mining Interest Topics from Plurk

Modularity
• $Q$ = (number of edges within communities) − (expected number of edges within communities)
• Idea:
  – dense internal connections between the nodes within modules
  – sparse connections between different modules
• Serves as a measure of the quality of a partition and as an objective function to optimize.

Page 22: Mining Interest Topics from Plurk

Definition of Modularity

$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta(C(i), C(j))$$

– $A_{ij}$ = the weight of the edge between $i$ and $j$
– $d_i$ = degree of vertex $i$
– $m = \frac{1}{2} \sum_{ij} A_{ij}$, the number of edges of the graph
– $\delta(C(i), C(j)) = \begin{cases} 1, & \text{if } C(i) = C(j) \\ 0, & \text{otherwise} \end{cases}$
– $C(i)$ is the community of vertex $i$
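The definition can be checked numerically. The following is a minimal sketch, not the thesis code: the nine-vertex edge list is an assumed toy graph with m = 14 edges, and `modularity` and `community_of` are names introduced here for illustration.

```python
from collections import defaultdict

# Assumed toy graph (unweighted, undirected, 14 edges).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def modularity(edges, community_of):
    """Q = 1/(2m) * sum_ij [A_ij - d_i d_j / (2m)] * delta(C(i), C(j))."""
    m = len(edges)
    degree = defaultdict(int)
    adjacency = defaultdict(int)  # A_ij stored for both (i, j) and (j, i)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacency[(u, v)] += 1
        adjacency[(v, u)] += 1
    vertices = list(degree)
    total = 0.0
    for i in vertices:
        for j in vertices:
            if community_of[i] == community_of[j]:  # the delta term
                a_ij = adjacency.get((i, j), 0)
                total += a_ij - degree[i] * degree[j] / (2 * m)
    return total / (2 * m)


# Two communities: {1,2,3,4} and {5,6,7,8,9}.
partition = {v: "left" if v <= 4 else "right" for v in range(1, 10)}
q = modularity(EDGES, partition)
```

Under these assumptions `q` is about 0.347, while putting every vertex into a single community gives a modularity of 0, since the observed and expected edge counts then cancel.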

Page 23: Mining Interest Topics from Plurk

Expected Number of Edges Between Two Nodes

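• $E(i \to j) = d_i \times P(\to j) = d_i \times \frac{d_j}{2m}$

[Figure: vertices $i$ and $j$; $i$ has $d_i$ edge ends, each attaching to $j$ with probability $\frac{d_j}{2m}$]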

Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010

Page 24: Mining Interest Topics from Plurk

Louvain Algorithm
• The Louvain algorithm is a heuristic greedy method based on modularity optimization
• Each pass of the Louvain algorithm consists of two phases
  1. Look for small communities by optimizing modularity locally
  2. Aggregate vertices in the same community and build a new network whose vertices are the communities
• Repeat the passes until a maximum of modularity is attained
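The aggregation phase can be sketched as follows. This is a minimal illustration under stated assumptions: the edge list is an assumed nine-vertex toy graph, the labels "A", "B", "C" and the function name `aggregate` are hypothetical, and each internal edge is counted twice in the self-loop weight (one common convention, which keeps vertex degrees unchanged).

```python
from collections import defaultdict

# Assumed toy graph (unweighted, undirected, 14 edges).
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def aggregate(edges, community_of):
    """Collapse every community into one vertex of a weighted network."""
    weights = defaultdict(int)
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        if cu == cv:
            # Internal edge: becomes a self-loop, counted once per endpoint.
            weights[(cu, cu)] += 2
        else:
            # Edge between communities: accumulate on an unordered pair.
            weights[tuple(sorted((cu, cv)))] += 1
    return dict(weights)


# Communities {1,2,3,4}, {5,6,8} and {7,9}, labelled A, B and C.
community_of = {1: "A", 2: "A", 3: "A", 4: "A",
                5: "B", 6: "B", 8: "B", 7: "C", 9: "C"}
aggregated = aggregate(EDGES, community_of)
```

With this assumed graph the aggregated network has self-loop weights 10, 6 and 2 and inter-community weights 2 and 3.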

Page 25: Mining Interest Topics from Plurk

Example

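• $B_{ij} = A_{ij} - \frac{d_i d_j}{2m}$
• $\Delta_i(j) = B_{ij} - B_{ii}$ (modularity gain)
• $j^*(i) = \arg\max \{ \Delta_i(j) \mid j \in g \}$

[Figure: example graph with vertices 1 to 9]

• $B_{11} = A_{11} - \frac{d_1 d_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{12} = A_{12} - \frac{d_1 d_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$
• $B_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$
• $B_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$
• $B_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$
• $j^*(1) = 2$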
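The values for vertex 1 can be reproduced with a short sketch. The edge list is an assumption: the figure did not survive extraction, so the edges below were chosen to match the degrees used in the calculation (3, 2, 3, 4, 4, 4, 4, 3, 1 for vertices 1 to 9) and m = 14. The function names are introduced here, and ties are broken toward the smallest vertex index, which is also an assumption.

```python
# Assumed edge list, consistent with the degrees used in the worked example.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def build_graph(edges):
    """Return (sorted vertices, degree per vertex, set of adjacent pairs)."""
    vertices = sorted({v for edge in edges for v in edge})
    degree = {v: 0 for v in vertices}
    adjacent = set()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
        adjacent.add((u, v))
        adjacent.add((v, u))
    return vertices, degree, adjacent


def b_entry(i, j, degree, adjacent, m):
    """B_ij = A_ij - d_i * d_j / (2m)."""
    a_ij = 1 if (i, j) in adjacent else 0
    return a_ij - degree[i] * degree[j] / (2 * m)


def best_partners(edges):
    """For every vertex i, pick j*(i) maximizing B_ij - B_ii over j != i."""
    m = len(edges)
    vertices, degree, adjacent = build_graph(edges)
    result = {}
    for i in vertices:
        b_ii = b_entry(i, i, degree, adjacent, m)
        candidates = [j for j in vertices if j != i]
        # max() keeps the first maximum, so ties go to the smallest index.
        result[i] = max(
            candidates,
            key=lambda j: b_entry(i, j, degree, adjacent, m) - b_ii,
        )
    return result


partners = best_partners(EDGES)
```

Because $B_{ii}$ does not depend on $j$, maximizing the gain is the same as maximizing $B_{ij}$, which is why vertex 1 selects vertex 2.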

Page 26: Mining Interest Topics from Plurk

[Figure: the example graph with vertices 1 to 9, shown twice, with the merge order of the vertices]

Communities formed as vertices merge: {1,2}, {1,2,3}, {1,2,3,4}, {5,8}, {5,6,8}, {7,9}

$i$        1  2  3  4  5  6  7  8  9
$j^*(i)$   2  1  2  1  8  8  9  5  7

Page 27: Mining Interest Topics from Plurk

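[Figure: aggregated network with vertices {1,2,3,4}, {5,6,8} and {7,9}, shown next to the original graph; weights shown: 10, 6, 2, 2, 3]

Communities: {1,2,3,4}, {5,6,8}, {7,9}; {5,6,8} and {7,9} merge into {{5,6,8}, {7,9}}

$i$        {1,2,3,4}  {5,6,8}  {7,9}
$j^*(i)$   {1,2,3,4}  {5,6,8}  {5,6,8}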

Page 28: Mining Interest Topics from Plurk

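[Figure: the aggregated network with vertices {1,2,3,4}, {5,6,8}, {7,9} (weights shown: 10, 6, 2, 2, 3) and the next aggregated network with vertices {1,2,3,4} and {5,6,7,8,9} (weights shown: 10, 14, 2)]

$i$        {1,2,3,4}  {5,6,7,8,9}
$j^*(i)$   {1,2,3,4}  {5,6,7,8,9}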

Page 29: Mining Interest Topics from Plurk

Example (cont.)

[Figure: four panels labelled "original", "1st pass, phase 1", "1st pass, phase 2" and "2nd pass, terminate"; the resulting communities are {1,2,3,4} and {5,6,7,8,9}]

Page 30: Mining Interest Topics from Plurk

INTEREST KEYWORDS HIERARCHY

Page 31: Mining Interest Topics from Plurk

Closure Table

Page 32: Mining Interest Topics from Plurk

Interest Keywords Hierarchy

[Figure: keyword hierarchy containing SNSD, Girls Generation, Taeyeon, YoonA, Gee, Twinkle, Bo Peep Bo Peep, PSY, Gangnam Style]
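A closure table stores one row for every ancestor/descendant pair of a hierarchy, so a whole subtree can be read with a single query. The sketch below uses SQLite from the Python standard library; the table layout, the function names and the parent/child links between keywords are assumptions made purely for illustration, since the slides do not show the actual schema or tree.

```python
import sqlite3


def create_closure_table(conn):
    conn.execute(
        "CREATE TABLE closure ("
        " ancestor TEXT, descendant TEXT, depth INTEGER,"
        " PRIMARY KEY (ancestor, descendant))"
    )


def add_keyword(conn, keyword, parent=None):
    """Insert a keyword below an existing parent (or as a root)."""
    # Every node is its own ancestor at depth 0.
    conn.execute("INSERT OR IGNORE INTO closure VALUES (?, ?, 0)", (keyword, keyword))
    if parent is not None:
        # Copy every ancestor row of the parent, one level deeper.
        conn.execute(
            "INSERT OR IGNORE INTO closure (ancestor, descendant, depth) "
            "SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?",
            (keyword, parent),
        )


def descendants(conn, keyword):
    rows = conn.execute(
        "SELECT descendant FROM closure WHERE ancestor = ? AND depth > 0 "
        "ORDER BY depth, descendant",
        (keyword,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_closure_table(conn)
# Hypothetical links, for illustration only; parents are inserted first.
for child, parent in [("SNSD", None), ("Taeyeon", "SNSD"), ("YoonA", "SNSD"),
                      ("Gee", "SNSD"), ("Twinkle", "Taeyeon"),
                      ("PSY", None), ("Gangnam Style", "PSY")]:
    add_keyword(conn, child, parent)
```

With these assumed links, `descendants(conn, "SNSD")` returns the direct children first and then the deeper keyword, without any recursive query.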

Page 33: Mining Interest Topics from Plurk

CRAWLING SYSTEM How to dump Plurk.com?

Page 34: Mining Interest Topics from Plurk
Page 35: Mining Interest Topics from Plurk

Overview of Crawling System

Page 36: Mining Interest Topics from Plurk

ZeroMQ: The Intelligent Transport Layer

Page 37: Mining Interest Topics from Plurk

Work-flow of Crawling Task Queue

Page 38: Mining Interest Topics from Plurk

Plurk API

• Plurk API 2.0 is based on the OAuth Core 1.0a standard
• Requests should be signed using HMAC-SHA1
• API returns JSON-encoded data
• No request rate limit
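HMAC-SHA1 signing under OAuth 1.0a can be sketched with the Python standard library alone. This is a generic illustration of the signature scheme, not the plurk-oauth code: the function names, the example URL and all keys and secrets below are hypothetical, and the URL is assumed to be already normalized (no query string).

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def percent_encode(value):
    """Percent-encode everything except the OAuth unreserved characters."""
    return quote(str(value), safe="-._~")


def sign_hmac_sha1(method, url, params, consumer_secret, token_secret=""):
    """Return the base64 HMAC-SHA1 signature of an OAuth 1.0a request."""
    # 1. Encode and sort the parameters, then join them as k=v pairs.
    encoded = sorted((percent_encode(k), percent_encode(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in encoded)
    # 2. Signature base string: METHOD & encoded URL & encoded parameters.
    base_string = "&".join([method.upper(), percent_encode(url), percent_encode(normalized)])
    # 3. Signing key: consumer secret & token secret (possibly empty).
    key = f"{percent_encode(consumer_secret)}&{percent_encode(token_secret)}"
    digest = hmac.new(key.encode("ascii"), base_string.encode("ascii"), hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical request and credentials, for illustration only.
signature = sign_hmac_sha1(
    "GET",
    "https://api.example.com/resource",
    {"oauth_consumer_key": "key", "oauth_nonce": "abc", "oauth_timestamp": "1353974400",
     "oauth_signature_method": "HMAC-SHA1", "oauth_version": "1.0", "user_id": "42"},
    consumer_secret="consumer-secret",
    token_secret="token-secret",
)
```

Every request repeats the encoding, sorting and hashing steps, which is one plausible reason signing shows up as a cost when many requests are issued concurrently.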

Page 39: Mining Interest Topics from Plurk

Plurk API Library

• Original provider
  – plurk-oauth by clsung
• Performance bottlenecks
  – HTTP persistent connection
  – JSON decode
  – HMAC-SHA1
• Enhancements
  – HTTP connection pool
  – C extension for JSON and HMAC-SHA1

Page 40: Mining Interest Topics from Plurk

Performance Comparison

[Figure: chart of time in seconds against concurrency for the original and the enhanced library]

concurrency   8      16     32     64     128    256    512
Original      53.71  27.49  15.44  15.50  14.97  13.21  13.13
Enhanced      52.77  26.74  14.10  11.17  9.45   7.94   7.08

Page 41: Mining Interest Topics from Plurk

An Example of a Plurk

Page 42: Mining Interest Topics from Plurk

Plurk Attributes
• _id
  – The unique plurk id, used to identify the plurk
• owner
  – The owner/poster of this plurk
• content
  – The formatted and filtered content, e.g. URLs are turned into text tags and emoticons are filtered out
• content_raw
  – The raw content as the user entered it
• posted
  – The date this plurk was posted, in ISODate format

Page 43: Mining Interest Topics from Plurk

Plurks Preprocessing

Page 44: Mining Interest Topics from Plurk

URL Filtering

Page 45: Mining Interest Topics from Plurk

URL Filtering (cont.)

Page 46: Mining Interest Topics from Plurk

URL Filtering (cont.)

Page 47: Mining Interest Topics from Plurk

Normalization

Page 48: Mining Interest Topics from Plurk

Tokenization

Page 49: Mining Interest Topics from Plurk

Celery Task Queue

Page 50: Mining Interest Topics from Plurk

Celery Task Queue

Page 51: Mining Interest Topics from Plurk

Datastore Architecture

• Why MongoDB?
  – Auto-sharding
  – Replica sets
• MongoDB cluster
  – mongos
  – Config servers
  – Shard servers
• Deploy to Delta cloud cluster

Page 52: Mining Interest Topics from Plurk

MongoDB Server Layout

Page 53: Mining Interest Topics from Plurk

Cluster Configuration

Page 54: Mining Interest Topics from Plurk

Delta Cloud Server

Page 55: Mining Interest Topics from Plurk

Delta Cloud Server (cont.)

Page 56: Mining Interest Topics from Plurk

EXPERIMENTS

Page 57: Mining Interest Topics from Plurk

Environment

Page 58: Mining Interest Topics from Plurk

Experiment

• Sample 40 public plurkers
• public: get the top-64 most frequent interest keywords
• private: regard the plurker as private, derive their interest keywords from communities, and get the top-64 most frequent interest keywords
• len(intersect(public, private))
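The matching measure can be sketched as the size of the intersection of two top-k keyword sets. This is a minimal illustration: the function names and the keyword lists in the usage note are hypothetical, and how keywords are extracted and how ties in frequency are ordered are not specified in the slides.

```python
from collections import Counter


def top_keywords(keywords, k=64):
    """Return the k most frequent keywords as a set."""
    return {word for word, _ in Counter(keywords).most_common(k)}


def matching_count(public_keywords, derived_keywords, k=64):
    """len(intersect(public, private)) for the two top-k keyword sets."""
    return len(top_keywords(public_keywords, k) & top_keywords(derived_keywords, k))
```

For example, with `public = ["kpop"] * 3 + ["snsd"] * 2 + ["food"]` and `derived = ["snsd"] * 3 + ["kpop"] * 2 + ["travel"]`, `matching_count(public, derived, k=2)` counts the two shared keywords, while `k=1` finds no overlap because the most frequent keyword differs.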

Page 59: Mining Interest Topics from Plurk

Result

[Figure: histogram of # matching]

# matching   21 ~ 25  26 ~ 30  31 ~ 35  36 ~ 40  41 ~ 45  46 ~ 50  51 ~ 55
count        3        6        7        16       4        3        1

Page 60: Mining Interest Topics from Plurk

LIVE DEMO Never

Page 61: Mining Interest Topics from Plurk
Page 62: Mining Interest Topics from Plurk
Page 63: Mining Interest Topics from Plurk
Page 64: Mining Interest Topics from Plurk
Page 65: Mining Interest Topics from Plurk
Page 66: Mining Interest Topics from Plurk
Page 67: Mining Interest Topics from Plurk
Page 68: Mining Interest Topics from Plurk

CONCLUSIONS AND FUTURE WORKS

Page 69: Mining Interest Topics from Plurk

Conclusions

• Construct an online SNSD system for Plurk users to find interesting topics and relationships
• Develop a new scalable crawling framework based on ZeroMQ
• Patch the plurk-oauth library
• Build a website for visualizing interests and relationships with D3.js

Page 70: Mining Interest Topics from Plurk

Future Works

• Interest hierarchy:
  – Manageable UI
  – Recommend by users
• Apply the SNSD system to Twitter for Western languages and Sina Weibo for mainland China
• Employ other community detection algorithms and optimize NetworkX

Page 71: Mining Interest Topics from Plurk

Future Works (cont.)

• Consider responses to a plurk and fan relationships in interest derivation

• Serve as a Plurk full-text search engine

Page 72: Mining Interest Topics from Plurk

Q & A

Thank you for listening.

Page 73: Mining Interest Topics from Plurk

CS Workstation Architecture

Page 74: Mining Interest Topics from Plurk

Delta Cluster Architecture