PTT2 Life Analysis of A Person in PTT2


CSIE senior, B94902003 黃竣瑋
CSIE senior, B94902032 陳縕儂
CSIE senior, B94902095 陳晉暉
CSIE senior, B94902097 呂哲安


System Structure

[Diagram: PTT2, Crawler, files, Parser Controller (parsers), Database, Programs, User Interface, User, PTT2 manager (wens)]


Parallel Skills
• MapReduce (Hadoop)
  – Computing high-dimensional K-means
• OpenMP
  – Inserting data into the DB
  – Testing model weights (Bayesian principle)
  – Looking up IP locations
  – Sorting comment timestamps


System Components
• BBS Crawler
• Parser Controller
• Program Components
• User Interface


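BBS Crawler (1)
• Given a board name, automatically downloads every article on that board
  – Telnet Protocol
    • Separates the commands from the content in the packets returned over telnet, and sends the required telnet responses.
  – Terminal
    • Emulates the BBS screen buffer, including interpreting the cursor commands embedded in the content and the resulting screen changes.
  – Robot
    • Determines the current state
  – Crawler
    • Performs each detailed action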
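
The Telnet Protocol layer above separates commands from content and answers the server. A minimal sketch of that idea in Python, assuming the standard telnet command bytes (IAC = 255 and related codes from RFC 854) and a policy of refusing every negotiated option. The function name `split_telnet` and the refuse-everything policy are illustrative assumptions, not the authors' implementation:

```python
# Standard telnet command bytes (RFC 854).
IAC, DONT, DO, WONT, WILL, SB, SE = 255, 254, 253, 252, 251, 250, 240

def split_telnet(data: bytes):
    """Split raw telnet bytes into (content, replies).

    content: bytes meant for the screen
    replies: bytes to send back, refusing every option (assumed policy)
    """
    content, replies = bytearray(), bytearray()
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b != IAC:                      # ordinary content byte
            content.append(b)
            i += 1
            continue
        if i + 1 >= n:
            break                         # incomplete command; real code would buffer it
        cmd = data[i + 1]
        if cmd == IAC:                    # escaped literal 255
            content.append(IAC)
            i += 2
        elif cmd in (DO, DONT, WILL, WONT):
            if i + 2 >= n:
                break
            opt = data[i + 2]
            if cmd == DO:                 # server asks us to enable: refuse
                replies += bytes([IAC, WONT, opt])
            elif cmd == WILL:             # server offers to enable: refuse
                replies += bytes([IAC, DONT, opt])
            i += 3
        elif cmd == SB:                   # skip a subnegotiation up to IAC SE (simplified)
            end = data.find(bytes([IAC, SE]), i + 2)
            if end == -1:
                break
            i = end + 2
        else:                             # other two-byte commands
            i += 2
    return bytes(content), bytes(replies)
```

For example, feeding it `IAC DO 24`, the text `hi`, `IAC WILL 1`, and `!` yields the content `hi!` plus two refusal replies. The Terminal layer would then consume only the content bytes.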


BBS Crawler (2)

[Diagram: PTT2, Telnet Protocol, Terminal, Robot, Crawler, data]


Parser Controller
• Function
  – Parsing all files to extract
    • title, author, post time, body, poster IP, comment time, commenter, comment text, and similar information
  – Inserting into database
• Parallel Programming
  – Each article is given its own thread to parse its information
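
The per-article parallel parsing described above can be sketched as follows. This is an illustration in Python rather than the authors' code: the `Author:`/`Title:`/`Time:`/`IP:` header lines and the `> user: text MM/DD HH:MM` comment layout are a hypothetical file format, and a thread pool stands in for "one thread per article":

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical header and comment layout, for illustration only.
HEADER = re.compile(r"^(Author|Title|Time|IP): (.*)$")
COMMENT = re.compile(r"^> (\S+): (.*) (\d\d/\d\d \d\d:\d\d)$")

def parse_article(text: str) -> dict:
    """Extract header fields, body, and comments from one article."""
    record = {"comments": []}
    body = []
    for line in text.splitlines():
        header = HEADER.match(line)
        comment = COMMENT.match(line)
        if header:
            record[header.group(1).lower()] = header.group(2)
        elif comment:
            record["comments"].append(
                {"user": comment.group(1), "text": comment.group(2), "time": comment.group(3)}
            )
        elif line.strip():
            body.append(line)
    record["body"] = "\n".join(body)
    return record

def parse_all(articles, workers=4):
    """Parse every article; each one is handed to a worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(parse_article, articles))
```

The resulting records are plain dictionaries, so a later step could insert them into the database row by row.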


Program Components
• Personal Analysis
• Top Ten Quotes
• Whoever Visits Leaves a Trace
• Whoever Posts Leaves an IP


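Personal Analysis (1)
• Analyzes the board owner's articles to determine six indices, and uses them to find similar people
  – Pervert, Glutton, Refined, Hater, Otaku, Sunny
• Using the PTT corpus to generate LMs
  – Text Normalization
  – Text Segmentation
  – Stopword Removal
  – Training LM
• Classification (Bayesian theory)
  – Uses the probabilities in the LMs to compute the weight of every article under each of the six models, in order to estimate the person's six indices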
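
One way to picture the classification step is a unigram language model per index, scored in Bayesian fashion. The sketch below is an assumption-laden illustration, not the authors' models: it uses unigram LMs with add-one smoothing and a uniform prior, and the function names are hypothetical:

```python
import math
from collections import Counter

def train_unigram_lm(tokens):
    """Return (token counts, total token count) for one category's corpus."""
    counts = Counter(tokens)
    return counts, sum(counts.values())

def log_likelihood(tokens, lm, vocab_size):
    """Log-probability of the tokens under one LM, with add-one smoothing."""
    counts, total = lm
    return sum(math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens)

def index_weights(tokens, lms):
    """Weight of an article under each model, normalized to sum to 1.

    Assumes a uniform prior over the models.
    """
    vocab = set(tokens)
    for counts, _ in lms.values():
        vocab.update(counts)
    scores = {name: log_likelihood(tokens, lm, len(vocab)) for name, lm in lms.items()}
    top = max(scores.values())                      # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return {name: math.exp(s - top) / norm for name, s in scores.items()}
```

With six trained models in `lms`, averaging `index_weights` over all of a person's articles would give one plausible estimate of the six indices.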


Personal Analysis (2)
• Parallel Programming
  – Uses OpenMP with 6 threads to test an article against the language models in parallel
• Performance comparison (generating six weights for one article)
  – Sequential
    • 0.384u 0.016s 0:00.40 97.5%
  – Parallel with p = 6
    • 0.320u 0.032s 0:00.21 166.6%

speedup ≈ 2
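
The speedup figure follows from the elapsed times in the two measurements. A small sketch of the arithmetic, assuming the lines are csh-style `time` output whose third field is elapsed wall-clock time (the helper names are hypothetical):

```python
def elapsed_seconds(time_line: str) -> float:
    """Return the elapsed wall-clock seconds from a csh-style `time` line."""
    elapsed = time_line.split()[2]          # e.g. "0:00.40"
    seconds = 0.0
    for part in elapsed.split(":"):         # handles m:ss and h:mm:ss
        seconds = seconds * 60 + float(part)
    return seconds

def speedup(sequential: str, parallel: str) -> float:
    """Ratio of sequential to parallel elapsed time."""
    return elapsed_seconds(sequential) / elapsed_seconds(parallel)

s = speedup("0.384u 0.016s 0:00.40 97.5%", "0.320u 0.032s 0:00.21 166.6%")
efficiency = s / 6                          # speedup divided by the thread count p = 6
```

Here `s` is 0.40 / 0.21, roughly 1.9, and the parallel efficiency with 6 threads is about 0.32.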


Personal Analysis (3)

[Diagram: Program with LM1, LM2, LM3, LM4, LM5, LM6]


Personal Analysis (4)


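Personal Analysis (5)
• Zodiac sign analysis
  – Searches the board for birthday posts and extracts the date from them to infer the board owner's zodiac sign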
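
A minimal sketch of the date-to-sign step, in Python. The `MM/DD` pattern for finding a date in a birthday post is a hypothetical simplification, and the sign boundaries are one commonly used set (sources differ by a day on some of them):

```python
import re

# (month, day) on which each sign starts, in calendar order (assumed boundaries).
SIGN_STARTS = [
    ((1, 20), "Aquarius"), ((2, 19), "Pisces"), ((3, 21), "Aries"),
    ((4, 20), "Taurus"), ((5, 21), "Gemini"), ((6, 22), "Cancer"),
    ((7, 23), "Leo"), ((8, 23), "Virgo"), ((9, 23), "Libra"),
    ((10, 24), "Scorpio"), ((11, 23), "Sagittarius"), ((12, 22), "Capricorn"),
]

def zodiac_sign(month: int, day: int) -> str:
    """Return the sign whose start date is the latest one not after the given date."""
    sign = "Capricorn"                      # covers Jan 1 to Jan 19
    for start, name in SIGN_STARTS:
        if (month, day) >= start:
            sign = name
    return sign

def birthday_sign(post_text: str):
    """Find the first MM/DD date in a post and map it to a sign, or return None."""
    match = re.search(r"(\d{1,2})/(\d{1,2})", post_text)
    if not match:
        return None
    month, day = int(match.group(1)), int(match.group(2))
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    return zodiac_sign(month, day)
```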


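Top Ten Quotes (1)
• Finds the sentences the board owner uses most often and lists the top few sentences similar to each
• K-means preprocessing
  – Computes the character unigrams and bigrams of each sentence
  – Removes those with very low probability, reducing the dimensionality to about 10,000
  – Treats each sentence as a vector in this space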
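
The preprocessing above can be sketched as follows. The `min_prob` threshold and the function names are assumptions for illustration; the slides give only the resulting dimensionality, not the cutoff that was used:

```python
from collections import Counter

def char_ngrams(sentence: str):
    """Character unigrams followed by character bigrams, ignoring whitespace."""
    chars = [c for c in sentence if not c.isspace()]
    bigrams = [a + b for a, b in zip(chars, chars[1:])]
    return chars + bigrams

def build_vectors(sentences, min_prob=1e-4):
    """Turn sentences into count vectors over the n-grams that are frequent enough.

    Returns (features, vectors): the kept n-grams and one count vector per sentence.
    """
    per_sentence = [Counter(char_ngrams(s)) for s in sentences]
    totals = Counter()
    for counts in per_sentence:
        totals.update(counts)
    n = sum(totals.values())
    # Drop n-grams whose relative frequency falls below the threshold.
    features = sorted(f for f, k in totals.items() if k / n >= min_prob)
    vectors = [[counts[f] for f in features] for counts in per_sentence]
    return features, vectors
```

Raising `min_prob` shrinks the feature list, which is the dimensionality reduction the slide refers to.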


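Top Ten Quotes (2) - K-means
• Step 1. Randomly choose K centers
• Step 2. For each point, find the nearest center and assign the point to that center's cluster
  – Distance is measured by cosine similarity
• Step 3. Take the mean of each cluster as its new center
• Step 4. Repeat Steps 2 and 3 until the final K clusters are obtained
• Step 5. Rank the clusters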
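
The five steps above can be sketched in plain Python. This is a sequential illustration under stated assumptions: "nearest" means highest cosine similarity, the loop stops when the centers no longer change or after a fixed number of iterations, and clusters are ranked by size (the slides do not say which ranking criterion was used):

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)                         # Step 1
    clusters = []
    for _ in range(iterations):                             # Step 4
        clusters = [[] for _ in range(k)]
        for p in points:                                    # Step 2
            best = max(range(k), key=lambda j: cosine(p, centers[j]))
            clusters[best].append(p)
        new_centers = []
        for j, members in enumerate(clusters):              # Step 3
            if members:
                dim = len(members[0])
                new_centers.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)]
                )
            else:
                new_centers.append(centers[j])              # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(clusters, key=len, reverse=True)          # Step 5 (by size, an assumption)
```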


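Top Ten Quotes (3)
• Parallel Programming
  – Uses MapReduce (Hadoop) to distribute the points so that each point's distance to the centers is computed in parallel
• Performance comparison (about 10,000 sentences, 10,000 dimensions each)
  – Sequential (Perl)
    • Did not finish even after more than 3 hours
  – Parallel with Mapper = 4 & Reducer = 2
    • About 15 to 20 minutes; the drop in running time is dramatic (large speedup)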
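
How one K-means iteration splits into a map phase and a reduce phase can be shown with a local simulation. This sketch does not use Hadoop's API; `mapper`, `reducer`, and `one_iteration` are hypothetical names, and the dictionary of groups merely stands in for the shuffle between the two phases:

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def mapper(point, centers):
    """Emit (index of the most similar center, point)."""
    best = max(range(len(centers)), key=lambda j: cosine(point, centers[j]))
    return best, point

def reducer(center_id, points):
    """Average all points that were assigned to one center."""
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    return center_id, mean

def one_iteration(points, centers):
    """Run one map/shuffle/reduce round and return the updated centers."""
    groups = defaultdict(list)              # stands in for the shuffle phase
    for point in points:                    # mapper calls are independent of each other
        key, value = mapper(point, centers)
        groups[key].append(value)
    new_centers = list(centers)             # centers with no points stay unchanged
    for key, values in groups.items():
        _, new_centers[key] = reducer(key, values)
    return new_centers
```

Because each `mapper` call depends only on its own point and the shared centers, the calls can be spread over several workers, which is the property the slide relies on.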


Top Ten Quotes (4)


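Whoever Visits Leaves a Trace (1)
• Shows when, and how often, every commenter has appeared on a personal board since the board opened
  – Interval
    • If two consecutive comments by the same person are further apart than this value, a break point is shown
  – Density period
    • The unit of time over which the number of comments is counted
• Parallel Programming
  – Parallel techniques are used when sorting all the times at which each user appeared on the board.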
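
The two parameters above, Interval and Density period, can be illustrated with a short sketch. It assumes timestamps are integer seconds for a single user; the function names are hypothetical:

```python
from collections import Counter

def split_at_breaks(timestamps, interval):
    """Group one user's comment times; a gap larger than `interval` starts a new group."""
    groups = []
    for t in sorted(timestamps):
        if groups and t - groups[-1][-1] <= interval:
            groups[-1].append(t)
        else:
            groups.append([t])              # break point: start a new group
    return groups

def density(timestamps, period):
    """Count comments per `period`-sized bucket (bucket index -> count)."""
    return Counter(t // period for t in timestamps)
```

The sort inside `split_at_breaks` corresponds to the step that the slide says was parallelized.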


Whoever Visits Leaves a Trace (2)


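Whoever Posts Leaves an IP (1)
• Maps the IP addresses of the board owner's posts against their times, and presents this in the UI to convey both time and place
  – Looks up the board owner's location on a map from the IP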


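Whoever Posts Leaves an IP (2)
• Parallel Programming
  – Looking up the location for each IP takes a long time, so OpenMP is used to resolve the location of each IP in parallel.
• Performance comparison
  – Sequential
    • .496u 2.060s 0:52.53 14.3%
  – Parallel with p = 16
    • 4.680u 1.820s 0:09.12 71.2%; elapsed time drops from over 50 seconds to under 10 seconds (speedup ≈ 5.8)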
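
The same pattern, many slow and independent lookups run concurrently, can be sketched in Python with a thread pool standing in for the OpenMP threads. The lookup itself is a hypothetical stand-in: `FAKE_LOCATIONS` and the simulated delay replace whatever IP-to-location source the authors actually queried:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a slow IP-to-location service.
FAKE_LOCATIONS = {
    "192.0.2.1": "Taipei",
    "198.51.100.7": "Hsinchu",
    "203.0.113.9": "Tainan",
}

def lookup_location(ip: str, delay: float = 0.05) -> str:
    """Simulate one slow lookup; most of the time is spent waiting."""
    time.sleep(delay)
    return FAKE_LOCATIONS.get(ip, "unknown")

def lookup_all(ips, workers=16):
    """Resolve every IP concurrently and return an ip -> location mapping."""
    ips = list(ips)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(ips, pool.map(lookup_location, ips)))
```

The sequential measurement above shows only 14.3% CPU use, which would be consistent with lookups that mostly wait; that is the kind of workload where running many of them at once helps most.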


Whoever Posts Leaves an IP (3)


Demo ^.<


Thank you for listening!
