The Role of Data Science in the Whoscall Product Ecosystem

Gogolook Confidential
Date posted: 22-Nov-2014
Category: Technology

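Description

郭建甫 (Jeff Kuo), founder and CEO of Gogolook. Dr. Kuo co-founded Gogolook with 鄭勝丰 and 宋政桓 and currently serves as CEO of the WhosCall development team. He studied in the Department of Industrial Design at National Cheng Kung University and graduated from the Graduate Institute of Industrial Engineering at National Tsing Hua University. He specializes in product design and user experience research. His professional experience includes researcher at the Heinz Nixdorf Institute in Germany, co-founder of 先構技研(股)公司, and director of new business development at 安通國際(股)公司.

高義銘 (Yimin Kao), data scientist at Gogolook. He is currently a data analysis scientist at Gogolook and graduated from the Department of Statistics at North Carolina State University. His research areas include statistical classification and clustering methods, Bayesian models, and spatial statistics, applied to computer virus packet detection, genetic association testing, and hurricane track prediction. His ambition in life is to spread music and joy.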

Transcript of The Role of Data Science in the Whoscall Product Ecosystem

Page 1

The Role of Data Science in the Whoscall Product Ecosystem

Pages 2-4

How Started?

Page 5

The Best App for identifying and blocking calls

The Best App – LINE whoscall

Page 7

Key Features

Page 8

★ Instant Caller Identification

LINE whoscall identifies background information on incoming unknown calls in seconds through tags reported by other users, Internet search results, and our comprehensive global database.

Page 9

★ Database with over 600 Million Phone Numbers (Database & Number Details)

LINE whoscall boasts an online database of over 600 million phone numbers. The database covers yellow pages, spammers, telemarketers, customer services, etc., with numerous community tags contributed by users and comments based on real users' experiences.

Page 10

Incoming Call Dialogue

Example labels shown: Fraud Call, Business Corporation, Restaurant

Page 11

Tag & Block

★ Community Tag

Contributions from the global user community have always been the pillar of LINE whoscall's service. A LINE whoscall user can tag a phone number and share it with others, which creates an integrated phone number database and a reliable communication network for everyone.

★ Block unwanted calls & SMSs

Block calls and SMSs intelligently to ensure a harassment-free calling experience.

Page 12

Database Usage

★ World's Largest Yellow Page Database

LINE whoscall owns one of the world's largest online phone number databases, which covers most numbers of businesses and service providers essential to your daily life.

★ Offline Database Available for Free

The database is available not only online but also offline, and both are completely free. Unlimited use of a database with over 600 million phone numbers is available only on LINE whoscall.

Page 13

Number Identification (2014.07)

3 of every 5 strangers' calls can be identified by LINE whoscall.

Over 400 million phone calls are identified by LINE whoscall every month.

3000 spammer numbers are reported by LINE whoscall users every day.

Page 14

Market

Page 15

Honors

Page 16

What we will be…

Page 17

Vision

Page 18

Applications of Data Science at whoscall
高義銘, Data Scientist, GOGOLOOK

Page 19

★ Problems we often run into in daily life

Page 20

★ When people face the unknown, they get a certain feeling…

"I have a bad feeling about this!"

Pages 21-22

★ Many apps that claim to solve this problem are in circulation

小熊來電通知

Presentation Notes (Page 21): The boss said this picture looked a bit off… though he couldn't say why…
Presentation Notes (Page 22): After I changed it to this… he said, "Hmm… that looks right."
Page 23

★ Why whoscall?

Because… it is software that even Google's CEO gave a thumbs-up!

"Ooh, nice!"

Presentation Notes: You can doubt anything, but you can't doubt Google.

Page 24

So how does whoscall solve the unknown-caller problem?

Page 25

★ Technologies adopted

1. Yellow pages: HiPage, Yelp, Zenrin…

2. Google search

3. Other sources

Pages 26-27

★ Technologies adopted

4. User reports and tags

Presentation Notes: Crowdsourcing; you can also think of it as a kind of Wikipedia for phone numbers…
Pages 28-29

★ whoscall, I have a problem…

If we cannot get any information about an unknown number from these sources, is it GG (game over)?

Yes: GG, then wash up and go to bed…

Page 30

Of course we can't just go to bed. Otherwise, what am I standing here for?

Pages 31-36

★ Problem we want to solve

For an unknown phone number:
• No Google result
• No user tag / report
• Not a whoscall user

Can we determine if it's a spam number?

A telemarketing call? A fraud call? A harassing call?

A wrong number?

(I'm not a god!!)

Page 37

★ Scenario

Diagram labels: OO telemarketing (OO推銷), 小明, 小明's younger sister (小明妹), 小明's older brother (小明哥), ?

Page 38

★ We think it should work because…

whoscall userbase (= potential sensors):
• > 10 million installations
• > 10 thousand tags (daily)
• > 30 million phone calls (daily)

Page 39

Analysis procedures

1. Collect call logs
2. Compare with user tags
3. Explore call behaviors
4. Extract features
5. Classify unknown numbers using machine learning techniques

Page 40

★ Collect call logs

• Recruit a group of voluntary whoscall users as our sensors.

• Collect phone call logs from these sensors for a month.

Page 41

★ User privacy

User privacy is given the highest priority. Phone numbers are stored as one-way hash codes, which therefore cannot be reversed.
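
As a rough illustration of the one-way hashing described on the privacy slide, the sketch below hashes a phone number with a keyed SHA-256 digest. The function name, the key, and the normalization step are assumptions made for this example; the slides do not say which hash function or scheme whoscall uses.

```python
import hashlib
import hmac

# Hypothetical secret key; a real deployment would load this from secure storage.
SECRET_KEY = b"example-key-not-from-the-slides"


def hash_phone_number(number: str, key: bytes = SECRET_KEY) -> str:
    """Return a one-way, keyed SHA-256 digest of a phone number."""
    # Keep only ASCII digits so "0987-991-123" and "0987991123" hash identically.
    digits = "".join(ch for ch in number if ch in "0123456789")
    return hmac.new(key, digits.encode("ascii"), hashlib.sha256).hexdigest()


if __name__ == "__main__":
    print(hash_phone_number("0987-991-123"))
```

A keyed hash is used here because the space of phone numbers is small enough that a plain, unkeyed digest could be reversed by enumeration. That is a design assumption of this sketch, not something stated in the talk.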

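Page 42

Analysis procedures (list repeated from Page 39)

Page 43

★ List of user tags

Hangs up as soon as I answer; hangs up as soon as it calls; the caller hangs up immediately when I answer; hangs up the phone upon answering; hangs up the moment I pick up; says "wrong number" as soon as I pick up; keeps sending advertising SMS; keeps calling the wrong number; keeps receiving APPs that are not displayed; keeps frantically dialing the wrong number; one ring; hangs up without a sound, suspicious; hangs up after one ring; one ring then hang-up; hangs up on hearing me

Severe harassment; unexplained call from abroad; international call disguised as a Taipei number???; underground lender; underground lender solicitation; illegal underground futures company; real estate; junk; junk SMS; junk messages; Keelung hair salon; life insurance; migrant workers; calls strangers in the middle of the night to make trouble

Pornographic dating; pornographic dating call; pornographic "human flesh market"; pornographic junk SMS; sex delivery service; pornographic "girls" call; pornographic harassment; pornographic advertising SMS; pornographic soliciting girl; erotic massage; pornographic solicitation; pornographic solicitation call; compensated-dating delivery; pornographic scum

Mormon; dials and hangs up immediately; nuisance call; ratings survey (tagged as 收數率調查); TV ratings survey; loan SMS; government announcement; just one ring; prank call; 新光保全; 星展 loans; 星展 telemarketing; 星展銀行; pimp brokerage

Presentation Notes: A picture is worth a thousand words.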
Page 44

★ Compare with user tags

• Compare these phone numbers with user reports from the whoscall database (block records).

Normal numbers: 0987-991-XXX, 0986-225-XXX, 02-2675-XXXX, 03-862-XXXX, ...

Telemarketing calls: 02-2543-XXXX, 03-556-XXXX, 886-XXXX

Malicious calls: 02-2783-XXXX, 886-903-XXXX, 0800-000-XXX

Presentation Notes: All the phone numbers shown have been altered…
Page 45

★ Data summary

[Pie chart of spam categories: telemarketing calls, polling centers, harassing calls, fraud calls; slice labels: 70%, 1%, 5%, 24%]

# Samples: 7854
Normal: 4000
Spam: 3854

Page 46

Analysis procedures (list repeated from Page 39)

Page 47

★ Normal numbers

[Chart: number of calls for a normal number; y-axis 0 to 20]

Calls = 195 (in 66, out 129)
Opponents = 72 (in 21, out 58)

Presentation Notes: There is a peak around the weekend.

Page 48

★ Spam numbers

[Chart: number of calls for a spam number; y-axis 0 to 30]

Calls = 471 (in 15, out 456)
Opponents = 186 (in 11, out 183)

XX credit card marketing (7); OOO, XXXX marketing (6); telemarketing (3)

Page 49

Analysis procedures (list repeated from Page 39)

Page 50

★ What is a feature?

A "feature" is a measurable property of a phenomenon being observed.

Pages 51-55

★ Example

Or, if we want to analyze a company, we might look at features such as:

• Number of employees
• Number of engineers
• Proportion of Python engineers in the company
• Company cohesion
• How handsome the CEO is

Presentation Notes (Page 53): But some features are actually hard to quantify, such as…
Presentation Notes (Page 55): Extracting features needs experience, creativity, and insight.
Page 56

Features for call patterns: Ratio of out calls

[Bar chart by class: Fraud, Marketing, Normal; y-axis 0.0 to 0.8]

Presentation Notes: The proportion of calls that these numbers place to others.

Page 57

Features for call patterns: Ratio of recurring opponents

[Bar chart by class: Fraud, Marketing, Normal; y-axis 0 to 0.7]

Presentation Notes: The proportion of opponents that had two or more exchanges with this number.

Page 58

Features for call patterns: Ratio of missed out calls

[Bar chart by class: Fraud, Marketing, Normal; y-axis 0 to 0.6]

Presentation Notes: The proportion of calls placed by this number that nobody answered.

Page 59

Features for call patterns: Ratio of working time calls

[Bar chart by class: Fraud, Marketing, Normal; y-axis 0 to 0.7]

Presentation Notes: The proportion of calls made during working time.

Page 60

Features for call patterns: Median of call durations

[Bar chart by class: Fraud, Marketing, Normal; y-axis 0 to 60 seconds]

Presentation Notes: The median call length.

Page 61

Features for call patterns: Ratio of out calls in contact book

[Bar chart by class: Fraud, Marketing, Normal; y-axis 0 to 0.35]

Presentation Notes: In total we have approximately 64 features.
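
To make a few of the call-pattern features above concrete, here is a minimal sketch that computes them from a call log. The `Call` record, its field names, and the exact feature definitions are assumptions for illustration; the slides name the features but do not give formulas.

```python
from collections import Counter, namedtuple
from statistics import median

# Hypothetical call record, seen from the number under study:
#   opponent  - hashed identifier of the other party
#   direction - "out" if the number placed the call, "in" if it received it
#   answered  - False for a missed call
#   duration  - talk time in seconds (0 if missed)
Call = namedtuple("Call", "opponent direction answered duration")


def extract_features(calls):
    """Compute a handful of call-pattern features for one phone number."""
    if not calls:
        return {
            "ratio_out_calls": 0.0,
            "ratio_recurring_opponents": 0.0,
            "ratio_missed_out_calls": 0.0,
            "median_call_duration": 0.0,
        }
    out_calls = [c for c in calls if c.direction == "out"]
    calls_per_opponent = Counter(c.opponent for c in calls)
    # An opponent is "recurring" if it appears in two or more calls.
    recurring = sum(1 for n in calls_per_opponent.values() if n >= 2)
    missed_out = sum(1 for c in out_calls if not c.answered)
    return {
        "ratio_out_calls": len(out_calls) / len(calls),
        "ratio_recurring_opponents": recurring / len(calls_per_opponent),
        "ratio_missed_out_calls": missed_out / len(out_calls) if out_calls else 0.0,
        "median_call_duration": float(median(c.duration for c in calls)),
    }


if __name__ == "__main__":
    log = [
        Call("a", "out", True, 30),
        Call("a", "out", False, 0),
        Call("b", "out", True, 10),
        Call("c", "in", True, 60),
    ]
    print(extract_features(log))
```
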
Page 62

Analysis procedures (list repeated from Page 39)

Page 63

★ Naïve method

Intuitively, we can classify an unknown number by rules such as: if

• Ratio of recurring opponents is less than 40%
• Ratio of out calls is more than 60%
• Ratio of in calls is less than 20%

then we claim the number is a spam number.
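
The hand-written rule on this slide can be expressed as a small predicate. This is only a sketch of the naïve method: the in/out call counts in the example come from Pages 47 and 48, while the recurring-opponent ratios passed in are made-up values, since the slides do not report them for those two numbers.

```python
def is_spam_by_rules(ratio_recurring_opponents, ratio_out_calls, ratio_in_calls):
    """Naive hand-written rule: all three thresholds must hold."""
    return (
        ratio_recurring_opponents < 0.40
        and ratio_out_calls > 0.60
        and ratio_in_calls < 0.20
    )


if __name__ == "__main__":
    # Spam example from Page 48: 471 calls (in 15, out 456).
    # The recurring-opponent ratio 0.05 is a hypothetical value.
    print(is_spam_by_rules(0.05, 456 / 471, 15 / 471))

    # Normal example from Page 47: 195 calls (in 66, out 129).
    # The recurring-opponent ratio 0.50 is a hypothetical value.
    print(is_spam_by_rules(0.50, 129 / 195, 66 / 195))
```
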
Page 64

★ Problem 1

Too many features…

Page 65

★ Problem 2

How to determine the rule?

Pages 66-67

★ Solution: Machine learning

Let the machine learn from the data.

Page 68

★ What is machine learning?

Machine learning builds a model from past data or experience. "Learning" means running that model as a program; once it has learned enough, it can make predictions (guesses), and these guesses are well-founded and have a high hit rate.

Page 69

★ Machine learning techniques for classification

Support vector machine

Logistic regression

Decision tree

Neural networks

Naïve Bayes

Nonparametric Bayesian method

Pages 70-77

★ Support vector machine for binary classification

Presentation Notes: We first do binary classification, which means we only classify unknown numbers as spam or non-spam. Why SVM? It is simple, easy, fast, and efficient.
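
For readers who want to see what a binary SVM does mechanically, here is a small, dependency-free sketch of a linear SVM trained by batch sub-gradient descent on the regularized hinge loss. It is not the talk's implementation: the slides do not say which SVM library, kernel, or parameters were used, and the two features and all data points below are invented for illustration.

```python
def train_linear_svm(X, y, lam=0.001, lr=0.05, n_iter=5000):
    """Minimize lam/2 * ||w||^2 + mean(hinge loss); labels y must be +1 or -1."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_iter):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1.0:  # sample violates the margin: hinge term is active
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (spam) or -1 (non-spam) for one feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    # Hypothetical features: (ratio of out calls, ratio of recurring opponents).
    X = [(0.95, 0.05), (0.90, 0.10), (0.97, 0.02), (0.88, 0.08),   # spam
         (0.40, 0.50), (0.35, 0.60), (0.45, 0.45), (0.30, 0.55)]   # normal
    y = [1, 1, 1, 1, -1, -1, -1, -1]
    w, b = train_linear_svm(X, y)
    print([predict(w, b, x) for x in X])
```

In practice one would normally reach for an established SVM library rather than a hand-rolled trainer; the loop above is only meant to show the margin-violation update that the method is built on.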
Page 78

Is that enough?

Presentation Notes: The data has been introduced and the machine learning technique has been decided. (? needs revision)

Page 79

★ Real-life scenario

When will we require a spam number prediction?
Ans: The moment a phone call reaches a whoscall user.

We want to predict whether a number is spam as EARLY as possible in order to prevent further victims…

Page 80

★ Real-life scenario

[Diagram: a telemarketing number (XX推銷) calls Victim 1, Victim 2, and Victim 3 along a time axis labeled "# recent calls"]

Presentation Notes: Steamed-bun figures (饅頭人)

Page 81

Let's look at the performance of SVM under different numbers of recent calls.
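
One way to picture "different numbers of recent calls" is to recompute a feature using only the first k calls observed for a number. The sketch below assumes that reading of the observation window, which is what the presenter notes on the feature selection slides suggest; the record layout and the sample log are hypothetical.

```python
def first_k_calls(calls, k):
    """Return the k earliest calls of a number, ordered by timestamp."""
    return sorted(calls, key=lambda c: c["timestamp"])[:k]


def ratio_out_calls(calls):
    """Share of calls that the number under study placed itself."""
    if not calls:
        return 0.0
    return sum(1 for c in calls if c["direction"] == "out") / len(calls)


if __name__ == "__main__":
    # Hypothetical call log for one unknown number (deliberately unsorted).
    log = [
        {"timestamp": 3, "direction": "in"},
        {"timestamp": 1, "direction": "out"},
        {"timestamp": 2, "direction": "out"},
        {"timestamp": 5, "direction": "out"},
        {"timestamp": 4, "direction": "out"},
        {"timestamp": 6, "direction": "out"},
    ]
    for k in (3, 5):
        print(k, ratio_out_calls(first_k_calls(log, k)))
```
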

Page 82

★ SVM for binary classification

[Chart: accuracy (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Page 83

Hmm… it performs well, but… can it be a bit faster?

Pages 84-86

★ Reduce the number of features

Feature computation is time-consuming, so we want to reduce the number of features before we do classification.

Presentation Notes (Page 84): Makes visualization possible, gives stable parameter estimation, and reduces collinearity.

Of course, we don't pick them by hand…

Feature selection methods:

Regularization methods

Backward, forward, and stepwise methods

Bayesian feature selection

Random forest method
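
As an illustration of one family in this list, forward selection, the sketch below greedily adds whichever feature most improves a score. The scoring function (training accuracy of a nearest-centroid classifier) and the toy data are assumptions chosen to keep the example short; the slides do not say which selection method or scorer was actually used.

```python
def nearest_centroid_accuracy(X, y, columns):
    """Training accuracy of a nearest-centroid classifier using only `columns`."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in columns]
    correct = 0
    for x, lab in zip(X, y):
        def dist(label):
            return sum((x[c] - m) ** 2 for c, m in zip(columns, centroids[label]))
        # sorted() makes tie-breaking between labels deterministic.
        predicted = min(sorted(centroids), key=dist)
        correct += predicted == lab
    return correct / len(y)


def forward_select(X, y, k, score=nearest_centroid_accuracy):
    """Greedily pick k feature indices, adding the best-scoring one each round."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda c: score(X, y, selected + [c]))
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Hypothetical data: column 0 is informative, columns 1 and 2 much less so.
    X = [(0.90, 0.5, 0.2), (0.80, 0.1, 0.4), (0.95, 0.9, 0.1),
         (0.50, 0.6, 0.5), (0.40, 0.2, 0.3), (0.60, 0.8, 0.6)]
    y = ["spam", "spam", "spam", "normal", "normal", "normal"]
    print(forward_select(X, y, k=2))
```

Scoring on training accuracy keeps the sketch compact; a held-out set or cross-validation would be the more usual choice when the goal is to compare feature subsets fairly.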

Pages 87-89

★ Feature selection results

[Chart: accuracy (y-axis, 0.8 to 1.0) vs. # features (x-axis, 10 to 30); one line each for 3, 5, and 10 recent calls]

Presentation Notes: This chart shows the feature selection results. The x-axis is the number of selected features and the y-axis is the accuracy. First, the more features, the higher the accuracy. The three lines represent different observation windows: for example, the yellow line means I look at only the first two calls of the unknown number, the orange line the first four, and so on. The longer the observation window, the fewer features you need to reach a given accuracy. For example, the red line means you have seen the first nine calls in the call log; in principle you need only about 10-15 features before the accuracy levels off. Here I decided to take 20 features, because beyond 20 features the accuracy levels off for all the observation windows.
Page 90

★ Selected features

Ratio of out calls

Rate of out calls

Ratio of out calls in contact book

Ratio of reciprocal opponents

Ratio of recurring opponents

Median call duration of in calls

Ring duration of answered calls

Ratio of missed calls

Rate of new opponents

Ratio of in calls in contact book

and more…

Presentation Notes: 54 20

Page 91

★ Comparison of w/ and w/o feature selection

[Chart: accuracy (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Page 92

Done?

Well, aren't we just great?

Presentation Notes: One day I got curious about how each type of spam was doing, so I looked at the so-called power.
Pages 93-95

★ What is power?

Power of class A: the probability of correctly classifying a class A sample as class A.

Gender classifier example: "this is a male" (97.5%)
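
Under the definition above, the power of a class is what is usually called its recall. A minimal sketch of computing it per class follows; the labels and predictions are invented for illustration and are not results from the talk.

```python
from collections import Counter


def power_by_class(y_true, y_pred):
    """For each class, the fraction of its samples that were predicted as that class."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    # Hypothetical ground truth and predictions.
    y_true = ["fraud", "fraud",
              "marketing", "marketing", "marketing",
              "normal", "normal", "normal", "normal"]
    y_pred = ["fraud", "normal",
              "marketing", "marketing", "marketing",
              "normal", "normal", "normal", "marketing"]
    print(power_by_class(y_true, y_pred))
```
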

Page 96

★ Power of our classifier

[Chart: power (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Presentation Notes: Whoa… the power for fraud numbers doesn't look as good as expected… uh-oh!!!! Remove accuracy!!!

Page 97

Yimin, step it up, will you?

Presentation Notes: So I went back and looked at the distribution of the data again.
Pages 98-99

★ Data summary (repeated from Page 45)

Presentation Notes: Imbalanced samples problem
Page 100

★ Marketing numbers vs. normal numbers

[Chart: accuracy (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Page 101

★ Fraud numbers vs. normal numbers

[Chart: accuracy (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Presentation Notes: Fraud is a bit harder than marketing numbers, but the classifier still does reasonably well.
Page 102

It's the idea of blending everything together, like making "pissing beef balls" (撒尿牛丸)…

Page 103

★ Power of SVM for multi-classification

[Chart: power (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Page 104

★ Power of SVM for binary classification

[Chart: power (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Pages 105-106

★ What is type I error rate?

Type I error: the probability of misclassifying a class B sample as class A.

Gender classifier example: "this is a male" (5%)
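
Following the slide's definition, the type I error rate for a target class is the fraction of samples from other classes that get labeled as the target. The sketch below computes it; the function name and the sample data are made up for illustration.

```python
def type_i_error(y_true, y_pred, target):
    """Fraction of non-target samples that were wrongly predicted as `target`."""
    others = [(t, p) for t, p in zip(y_true, y_pred) if t != target]
    if not others:
        return 0.0
    return sum(1 for _, p in others if p == target) / len(others)


if __name__ == "__main__":
    # Hypothetical data: 10 normal numbers and 5 spam numbers.
    y_true = ["normal"] * 10 + ["spam"] * 5
    # One normal number is wrongly flagged as spam; one spam number is missed.
    y_pred = ["normal"] * 9 + ["spam"] + ["spam"] * 4 + ["normal"]
    print(type_i_error(y_true, y_pred, target="spam"))
```
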

Page 107

★ Type I error comparison

[Chart: type I error (y-axis, 0 to 0.3) vs. # recent calls (x-axis, 3 to 10)]

Page 108

This small success let me relax a little and go shopping. Suddenly my phone rang once, and I happily picked up…

Page 109

…and the caller hung up.

Page 110

★ Malicious one-ring calls

A "one-ring call" (響一聲掛斷) is a malicious call that lures the recipient into calling back, usually to a high-cost premium-rate number.

So we first look at the call patterns of this type of phone number.

Page 111

★ Call patterns of one-ring calls

Numbers          Mean duration of ringing (seconds)   Mean duration of out calls (seconds)
0982-415-XXX     1.6                                  0
0982-420-XXX     3.6                                  0
0982-495-XXX     5.2                                  1.25
04-3-704-XXXX    0.9                                  0
0923-931-XXX     6.7                                  2.6

Presentation Notes: Likewise… we can extract features from call patterns to see how one-ring, fraud, marketing, and normal numbers differ.
Page 112

Feature comparison: Ratio of new opponents

[Bar chart by class: Fraud, Marketing, Normal, One-ring; y-axis 0 to 0.8]

Page 113

Feature comparison: Ratio of in calls

[Bar chart by class: Fraud, Marketing, Normal, One-ring; y-axis 0 to 0.5]

Page 114

Feature comparison: Ratio of missed calls

[Bar chart by class: Fraud, Marketing, Normal, One-ring; y-axis 0 to 0.8]

Presentation Notes: Let's look at some that are different.
Pages 115-116

★ Naïve method

Similarly, without machine learning we can design rules such as:

Rule 1: The mean ringing duration is less than 7 seconds, and

Rule 2: The mean out-call duration is less than 3 seconds.

Then we claim that it is a one-ring spam call.

Presentation Notes: But it's the same problem: you set this rule yourself… and the following can happen!!
Page 117

★ Problems

1. Too many features…
2. How to determine the rule?
3. New observations.

Pages 118-120

★ Problem 3

Numbers          Mean duration of ringing (seconds)   Mean duration of out calls (seconds)
0982-415-XXX     1.6                                  0
0982-420-XXX     3.6                                  0
0982-495-XXX     5.2                                  1.25
04-3-704-XXXX    0.9                                  0
0923-931-XXX     6.7                                  2.6
04-2-676-XXXX    15.7 (S.D.=10.7)                     1.4      (new observation)

Presentation Notes: Under the rule you set above… this number would be judged not to be one-ring… but is it really not??
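
The two hand-written one-ring rules can be applied directly to the numbers in the table to show Problem 3. The thresholds and observations below are taken from the slides; the function and variable names are chosen for this sketch.

```python
# (number, mean ringing duration in s, mean out-call duration in s), from the slides.
OBSERVATIONS = [
    ("0982-415-XXX", 1.6, 0.0),
    ("0982-420-XXX", 3.6, 0.0),
    ("0982-495-XXX", 5.2, 1.25),
    ("04-3-704-XXXX", 0.9, 0.0),
    ("0923-931-XXX", 6.7, 2.6),
    ("04-2-676-XXXX", 15.7, 1.4),  # the new observation
]


def is_one_ring(mean_ring_s, mean_out_call_s):
    """Rule 1 and Rule 2 from the naive method: both must hold."""
    return mean_ring_s < 7.0 and mean_out_call_s < 3.0


if __name__ == "__main__":
    for number, ring, out_call in OBSERVATIONS:
        print(number, is_one_ring(ring, out_call))
```

The first five numbers satisfy both rules, while the new observation fails Rule 1 because its mean ringing duration is 15.7 seconds. Its standard deviation of 10.7 seconds means the mean alone is a poor summary of how that number rings, which a fixed threshold on the mean cannot take into account.
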
Page 121

Machine learning can efficiently "learn" from new data and create rules for us.

Page 122

★ Power of SVM for multi-classification

[Chart: power (y-axis, 0.8 to 1.0) vs. # recent calls (x-axis, 3 to 10)]

Page 123

★ Accuracy comparison

[Chart: y-axis labeled "Type I error" (0 to 0.3) vs. # recent calls (x-axis, 3 to 10)]

Page 124

Deployment

All the algorithms have been implemented in the whoscall app, so how does it work?

Page 125

[Diagram labels: OO telemarketing (OO推銷), 小明, Data center, "Classifier calculating…", 0984-003-XXX]

Returned: This number may be a telemarketing call

Time required: 50-100 milliseconds

Page 126

What's next?

Page 127

★ Improvements of the classification model

1. Fraud numbers analysis
2. Fuzzy classification algorithm
3. Spam-category scores
4. Cooperate with more solid outside sources
5. Generalize to other countries

Much more…

Page 128

★ Future perspectives

1. User's tag correction mechanisms
2. Personalized penalty setting
3. Anti-countermeasures
4. Extend to SMS spam detection
5. Clustering vs. user tags
6. Spam detection → scam detection

Page 129

Creating a contact network of trust

Thank you all for your valuable time