Mining Interest Topics from Plurk
Ken Yi-Chien Lee 2012/11/27
Outline
• Introduction
– What we do in this thesis, and why
• The SNSD system
– Community detection
– Interest hierarchy
• Implementation
– Preprocessing
– Celery task queue
• Experiments
• Conclusions and future works
INTRODUCTION I want to make friends with you.
Scenario
Scenario (cont.)
Plurk Timeline
Private Status
Traffic Statistics of Plurk in Taiwan
The Go!Plurk Project
Issues:
1. Unable to analyze private users
2. The pie chart is too simple and gives no detailed interest information
THE SNSD SYSTEM Find out what the plurker is interested in.
Social Networking Service Discovery
• Discover users’ interest topics via
1. Posted contents (plurks) from users
2. Aggregated interest information from communities for the private users
• Have to prepare
– Relationships
– Plurks
Work-flow of SNSD System
Aggregation and Derivation
Community Detection
• Snowball sampling
• Louvain algorithm
• Filtering
– Karma
– Gender
– Privacy
Snowball Sampling
Modularity
• $Q$ = (number of edges within communities) − (expected number of edges within communities)
• Idea:
– dense internal connections between the nodes within modules
– sparse connections between different modules
• Works both as a measure of the quality of a partition and as an objective function to optimize.
Definition of Modularity
$$Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{d_i d_j}{2m} \right] \delta\big(C(i), C(j)\big)$$
– $A_{ij}$ = the weight of the edge between $i$ and $j$
– $d_i$ = degree of vertex $i$
– $m = \frac{1}{2} \sum_{ij} A_{ij}$, the number of edges of the graph
– $\delta\big(C(i), C(j)\big) = 1$ if $C(i) = C(j)$, and $0$ otherwise
– $C(i)$ is the community of vertex $i$
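The definition can be evaluated directly on a small graph. The following is a minimal sketch, assuming a hypothetical nine-vertex, 14-edge unweighted graph and a hand-picked partition; the edge list, the labels "a" and "b", and the function name `modularity` are illustrative assumptions, not taken from the slides.

```python
from itertools import product

# Assumed edge list for illustration: 9 vertices, 14 undirected edges.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]


def modularity(edges, community_of):
    """Q = 1/(2m) * sum_ij [A_ij - d_i d_j / (2m)] * delta(C(i), C(j))."""
    nodes = sorted(community_of)
    adjacency = {pair: 0 for pair in product(nodes, repeat=2)}
    for u, v in edges:
        adjacency[(u, v)] += 1
        adjacency[(v, u)] += 1
    degree = {i: sum(adjacency[(i, j)] for j in nodes) for i in nodes}
    two_m = sum(degree.values())  # sum of degrees equals 2m
    total = 0.0
    for i, j in product(nodes, repeat=2):
        if community_of[i] == community_of[j]:  # the delta term
            total += adjacency[(i, j)] - degree[i] * degree[j] / two_m
    return total / two_m


# Hypothetical partition into two communities.
partition = {1: "a", 2: "a", 3: "a", 4: "a",
             5: "b", 6: "b", 7: "b", 8: "b", 9: "b"}
q_two = modularity(EDGES, partition)
# Putting every vertex in one community gives Q = 0 (up to rounding).
q_one = modularity(EDGES, {node: "all" for node in partition})
print(f"two communities: Q = {q_two:.4f}")
print(f"one community:   Q = {q_one:.4f}")
```

The double loop over all vertex pairs mirrors the formula term by term; it is quadratic in the number of vertices and is meant for checking small examples only.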
Expected Number of Edges Between Two Nodes
• $E(i \to j) = d_i \times \Pr(\to j) = d_i \times \frac{d_j}{2m}$
[Figure: vertices $i$ and $j$, labeled with $\frac{d_i d_j}{2m}$]
Lei Tang, Huan Liu, Community Detection and Mining in Social Media, 2010
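As a quick consistency check on this expectation (a derivation added here, not shown on the slide): summing it over every possible endpoint $j$ recovers the degree of $i$, because the degrees of all vertices sum to $2m$.

$$\sum_j E(i \to j) = \sum_j \frac{d_i d_j}{2m} = \frac{d_i}{2m} \sum_j d_j = \frac{d_i}{2m} \cdot 2m = d_i$$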
Louvain Algorithm
• The Louvain algorithm is a greedy heuristic method based on modularity optimization
• Each pass of the algorithm consists of two phases
1. Look for small communities by optimizing modularity locally
2. Aggregate vertices in the same community and build a new network whose vertices are the communities
• Passes are repeated until a maximum of modularity is attained
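Example
• $B_{ij} = A_{ij} - \frac{d_i d_j}{2m}$
• $\Delta_i(j) = B_{ij} - B_{ii}$ (modularity gain)
• $j^*(i) = \arg\max \{ \Delta_i(j) \mid j \in g \}$
[Figure: example network with vertices 1 to 9]
• $B_{11} = A_{11} - \frac{d_1 d_1}{2m} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{12} = A_{12} - \frac{d_1 d_2}{2m} = 1 - \frac{3 \times 2}{2 \times 14} = 0.79$
• $B_{13} = 1 - \frac{3 \times 3}{2 \times 14} = 0.68$
• $B_{14} = 1 - \frac{3 \times 4}{2 \times 14} = 0.57$
• $B_{15} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{16} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{17} = 0 - \frac{3 \times 4}{2 \times 14} = -0.43$
• $B_{18} = 0 - \frac{3 \times 3}{2 \times 14} = -0.32$
• $B_{19} = 0 - \frac{3 \times 1}{2 \times 14} = -0.11$
• $j^*(1) = 2$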
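The worked values for vertex 1 can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the edge list is not given in the transcript and is assumed here (it is chosen to be consistent with the degrees and adjacencies used in the $B_{1j}$ values, with $m = 14$), candidates range over all other vertices, and ties are broken in favour of the smallest vertex number.

```python
# Assumed edge list (not in the transcript); 9 vertices, m = 14 edges.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]
NODES = range(1, 10)
ADJACENT = {frozenset(edge) for edge in EDGES}
DEGREE = {i: sum(i in edge for edge in EDGES) for i in NODES}
TWO_M = 2 * len(EDGES)


def b(i, j):
    """B_ij = A_ij - d_i * d_j / (2m)."""
    a_ij = 1 if frozenset((i, j)) in ADJACENT else 0
    return a_ij - DEGREE[i] * DEGREE[j] / TWO_M


def gain(i, j):
    """Delta_i(j) = B_ij - B_ii, the modularity gain."""
    return b(i, j) - b(i, i)


def best_partner(i):
    """j*(i): the candidate with the largest gain (first one wins on ties)."""
    candidates = [j for j in NODES if j != i]
    return max(candidates, key=lambda j: gain(i, j))


row_1 = [round(b(1, j), 2) for j in NODES]
j_star = {i: best_partner(i) for i in NODES}
print("B_1j:", row_1)
print("j*(i):", j_star)
```

Because every non-adjacent pair has a negative $B_{ij}$, restricting the candidates to neighbours of $i$ would select the same partners on this graph.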
[Figure: first pass on the example network, vertices merged step by step]
• Merges: {1,2} → {1,2,3} → {1,2,3,4}; {5,8} → {5,6,8}; {7,9}
$i$:       1 2 3 4 5 6 7 8 9
$j^*(i)$:  2 1 2 1 8 8 9 5 7
[Figure: aggregated network with vertices {1,2,3,4}, {5,6,8}, {7,9}; self-loop weights 10, 6, 2; weights 2 and 3 on the edges between communities]
$i$:       {1,2,3,4}  {5,6,8}  {7,9}
$j^*(i)$:  {1,2,3,4}  {5,6,8}  {5,6,8}
• {5,6,8} and {7,9} merge into {5,6,7,8,9}
[Figure: aggregated network with vertices {1,2,3,4}, {5,6,7,8,9}; self-loop weights 10, 14; weight 2 on the edge between them]
$i$:       {1,2,3,4}  {5,6,7,8,9}
$j^*(i)$:  {1,2,3,4}  {5,6,7,8,9}
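The aggregation phase (building a network whose vertices are the communities) can be sketched as follows. The edge list is again an assumption, chosen so that the resulting weights match the ones shown in the aggregated networks (10, 6, 2 for self-loops and 2, 3 between communities); `aggregate` is a hypothetical helper name.

```python
from collections import Counter

# Assumed edge list (not in the transcript); 9 vertices, 14 edges.
EDGES = [(1, 2), (1, 3), (1, 4), (2, 3), (3, 4), (4, 5), (4, 6),
         (5, 6), (5, 7), (5, 8), (6, 7), (6, 8), (7, 8), (7, 9)]

A = frozenset({1, 2, 3, 4})
B = frozenset({5, 6, 8})
C = frozenset({7, 9})


def aggregate(edges, communities):
    """Collapse each community into one vertex and sum the edge weights.

    Keys are frozensets of communities: one element for a self-loop,
    two elements for an edge between different communities.
    """
    label = {vertex: comm for comm in communities for vertex in comm}
    weights = Counter()
    for u, v in edges:
        cu, cv = label[u], label[v]
        # An internal edge contributes 2 to the self-loop (once per endpoint).
        weights[frozenset({cu, cv})] += 2 if cu == cv else 1
    return weights


first = aggregate(EDGES, [A, B, C])
second = aggregate(EDGES, [A, B | C])  # after {5,6,8} and {7,9} merge

print("self-loops:", first[frozenset({A})], first[frozenset({B})], first[frozenset({C})])
print("between:", first[frozenset({A, B})], first[frozenset({B, C})])
print("after merge:", second[frozenset({A})], second[frozenset({B | C})],
      second[frozenset({A, B | C})])
```

Counting each internal edge twice keeps the total degree of the aggregated network equal to $2m$, so the modularity of a partition is unchanged by aggregation.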
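Example (cont.)
[Figure: the network at each stage. Original: vertices 1 to 9. 1st pass, phase 1: communities {1,2,3,4}, {5,6,8}, {7,9}. 1st pass, phase 2: aggregated network with self-loop weights 10, 6, 2 and edge weights 2, 3. 2nd pass, terminate: communities {1,2,3,4} and {5,6,7,8,9} with self-loop weights 10, 14 and edge weight 2.]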
INTEREST KEYWORDS HIERARCHY
Closure Table
Interest Keywords Hierarchy
[Figure: interest keyword hierarchy; keywords shown: SNSD, Taeyeon, Bo Peep Bo Peep, Twinkle, YoonA, Gee, Girls Generation, PSY, Gangnam Style]
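A closure table stores every ancestor/descendant pair of a tree together with its depth, so a whole subtree can be read with one query. The sketch below uses SQLite from the Python standard library; the table and column names, the helper functions, and the parent/child links between keywords are all assumptions made for illustration, since the transcript does not show the actual schema or hierarchy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE keyword (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE closure (
    ancestor INTEGER, descendant INTEGER, depth INTEGER,
    PRIMARY KEY (ancestor, descendant)
);
""")


def add_keyword(name, parent=None):
    """Insert a keyword and link it to every ancestor of its parent."""
    node = conn.execute("INSERT INTO keyword (name) VALUES (?)", (name,)).lastrowid
    conn.execute("INSERT INTO closure VALUES (?, ?, 0)", (node, node))  # self row
    if parent is not None:
        conn.execute(
            "INSERT INTO closure (ancestor, descendant, depth) "
            "SELECT ancestor, ?, depth + 1 FROM closure WHERE descendant = ?",
            (node, parent),
        )
    return node


def descendants(name):
    rows = conn.execute(
        "SELECT k.name FROM closure c "
        "JOIN keyword a ON a.id = c.ancestor "
        "JOIN keyword k ON k.id = c.descendant "
        "WHERE a.name = ? AND c.depth > 0 ORDER BY c.depth, k.name", (name,))
    return [row[0] for row in rows]


def ancestors(name):
    rows = conn.execute(
        "SELECT k.name FROM closure c "
        "JOIN keyword d ON d.id = c.descendant "
        "JOIN keyword k ON k.id = c.ancestor "
        "WHERE d.name = ? AND c.depth > 0 ORDER BY c.depth", (name,))
    return [row[0] for row in rows]


# Hypothetical parent/child links, for illustration only.
group = add_keyword("Girls Generation")
taeyeon = add_keyword("Taeyeon", parent=group)
add_keyword("YoonA", parent=group)
add_keyword("Gee", parent=group)
add_keyword("Twinkle", parent=taeyeon)
psy = add_keyword("PSY")
add_keyword("Gangnam Style", parent=psy)

print(descendants("Girls Generation"))
print(ancestors("Twinkle"))
```

The trade-off is extra rows on insert (one per ancestor) in exchange for subtree and ancestor queries that need no recursion.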
CRAWLING SYSTEM How to dump Plurk.com?
Overview of Crawling System
ZeroMQ: The Intelligent Transport Layer
Work-flow of Crawling Task Queue
Plurk API
• Plurk API 2.0 is based on the OAuth Core 1.0a standard
• Requests should be signed using HMAC-SHA1
• The API returns JSON-encoded data
• No request rate limit
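How an OAuth 1.0a HMAC-SHA1 signature is computed can be sketched with the Python standard library alone. This is a generic illustration of the signing step, not the plurk-oauth implementation; the function names, the URL, the keys, and the secrets below are placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def pct(value):
    """Percent-encode everything except RFC 3986 unreserved characters."""
    return quote(str(value), safe="~")


def signature_base_string(method, url, params):
    """METHOD & encoded URL & encoded, sorted 'key=value' pairs."""
    pairs = sorted((pct(k), pct(v)) for k, v in params.items())
    normalized = "&".join(f"{k}={v}" for k, v in pairs)
    return "&".join([method.upper(), pct(url), pct(normalized)])


def sign_hmac_sha1(method, url, params, consumer_secret, token_secret=""):
    """Return the base64-encoded HMAC-SHA1 signature for a request."""
    key = f"{pct(consumer_secret)}&{pct(token_secret)}".encode("ascii")
    text = signature_base_string(method, url, params).encode("ascii")
    digest = hmac.new(key, text, hashlib.sha1).digest()
    return base64.b64encode(digest).decode("ascii")


# Placeholder request; every value here is hypothetical.
example_params = {
    "oauth_consumer_key": "my-app-key",
    "oauth_token": "my-user-token",
    "oauth_signature_method": "HMAC-SHA1",
    "oauth_timestamp": "1354000000",
    "oauth_nonce": "abc123",
    "oauth_version": "1.0",
    "plurk_id": "42",
}
example_signature = sign_hmac_sha1(
    "GET", "https://api.example.com/resource", example_params,
    consumer_secret="my-app-secret", token_secret="my-token-secret")
print(example_signature)
```

Every request repeats the percent-encoding, sorting, and HMAC steps, so the signing cost grows with the request volume of a crawler.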
Plurk API Library
• Original provider
– plurk-oauth by clsung
• Performance bottlenecks
– HTTP persistent connection
– JSON decoding
– HMAC-SHA1
• Enhancements
– HTTP connection pool
– C extensions for JSON and HMAC-SHA1
Performance Comparison
[Chart: time in seconds by concurrency level]
concurrency   8      16     32     64     128    256    512
Original      53.71  27.49  15.44  15.50  14.97  13.21  13.13
Enhanced      52.77  26.74  14.10  11.17  9.45   7.94   7.08
An Example of a Plurk
Plurk Attributes
• _id
– The unique plurk id, used to identify the plurk
• owner
– The owner/poster of this plurk
• content
– The formatted and filtered content, e.g. URLs are turned into text tags and emoticons are filtered out
• content_raw
– The raw content as the user entered it
• posted
– The date this plurk was posted, in ISODate format
Plurks Preprocessing
URL Filtering
URL Filtering (cont.)
URL Filtering (cont.)
Normalization
Tokenization
Celery Task Queue
Celery Task Queue
Datastore Architecture
• Why MongoDB?
– Auto-sharding
– Replica sets
• MongoDB cluster
– mongos
– Config servers
– Shard servers
• Deploy to Delta cloud cluster
MongoDB Server Layout
Cluster Configuration
Delta Cloud Server
Delta Cloud Server (cont.)
EXPERIMENTS
Environment
Experiment
• Sample 40 public plurkers
• public: get the top-64 most frequent interest keywords from the plurker's own plurks
• private: regard the plurker as private, derive his interest keywords from communities, and get the top-64 most frequent interest keywords
• len(intersect(public, private))
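The matching score can be sketched in a few lines of Python. The function names and the toy keyword lists are assumptions for illustration; only the top-k selection and the size of the intersection come from the slide.

```python
from collections import Counter


def top_keywords(keywords, k=64):
    """Return the set of the k most frequent keywords."""
    return {word for word, _ in Counter(keywords).most_common(k)}


def matching(public_keywords, derived_keywords, k=64):
    """len(intersect(public, private)) over the two top-k keyword sets."""
    return len(top_keywords(public_keywords, k) & top_keywords(derived_keywords, k))


# Toy data: keywords from the plurker's own plurks versus keywords
# derived from the plurker's communities (both lists are made up).
public = ["kpop"] * 3 + ["music"] * 2 + ["travel"]
derived = ["kpop"] * 4 + ["food"] * 2 + ["music"]

print(matching(public, derived, k=2))
print(matching(public, derived))
```

With k = 64 the score ranges from 0 (no shared keywords) to 64 (identical top-64 sets).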
Result
[Chart: number of plurkers by number of matching keywords]
# matching   21 ~ 25  26 ~ 30  31 ~ 35  36 ~ 40  41 ~ 45  46 ~ 50  51 ~ 55
# plurkers   3        6        7        16       4        3        1
LIVE DEMO Never
CONCLUSIONS AND FUTURE WORKS
Conclusions
• Construct an online SNSD system for Plurk users to find interesting topics and relationships
• Develop a new scalable crawling framework based on ZeroMQ
• Patch the plurk-oauth library
• Build a website for visualizing interests and relationships with D3.js
Future Works
• Interest hierarchy:
– Manageable UI
– Recommendations by users
• Apply the SNSD system to Twitter for Western languages and to Sina Weibo for mainland China
• Employ other community detection algorithms and optimize NetworkX
Future Works (cont.)
• Consider responses to a plurk and fan relationships in interest derivation
• Serve as a Plurk full-text search engine
Q & A
Thank you for listening.
CS Workstation Architecture
Delta Cluster Architecture