The Role of Data Science in the Whoscall Product Ecosystem
Date posted: 22-Nov-2014
Category: Technology
Gogolook Confidential
How Started?
The Best App – LINE whoscall
The best app for identifying and blocking calls
Key Features
★ Instant Caller Identification
LINE whoscall identifies background information of incoming unknown calls in seconds through tags reported by other users, Internet search results, and our comprehensive global database.
★ Database with over 600 Million Phone Numbers
LINE whoscall boasts an online database with over 600 million phone numbers. The database covers yellow pages, spammers, telemarketers, customer services, etc., with numerous community tags contributed by users and comments based on real users' experiences.
Database & Number Details
[Figure: Incoming Call Dialogue examples for a Fraud Call, a Business Corporation, and a Restaurant]
Tag & Block
★ Community Tag
Contributions from the global user community have always been the pillar of LINE whoscall's service. LINE whoscall users can tag a phone number and share it with others, which creates an integrated phone number database and a reliable communication network for everyone.
★ Block unwanted calls & SMSs
Block calls and SMSs intelligently to ensure a harassment-free calling experience.
Database Usage
★ World's Largest Yellow Page Database
LINE whoscall owns one of the world's largest online phone number databases, which covers most of the numbers of businesses and service providers essential to your daily life.
★ Offline Database Available for Free
The database is available not only online but also offline, and it is completely free! Unlimited use of a database with over 600 million phone numbers is available only on LINE whoscall.
Number Identification (2014.07)
• 3 of every 5 calls from strangers can be identified by LINE whoscall.
• Over 400 million phone calls are identified by LINE whoscall every month.
• 3000 spammer numbers are reported by LINE whoscall users every day.
Market
Honors
What we will be…
Vision
Applications of Data Science in whoscall
GOGOLOOK Data Scientist 高義銘
★ A problem we often run into in daily life
★ Facing the unknown, people tend to get a certain feeling…
I have a bad feeling about this!
★ Many apps out there claim to solve this problem
小熊來電通知
★ Why whoscall?
Because… it is the app that even Google's CEO praised!
"Hey, nice!"
So how does whoscall solve the unknown-caller problem?
★ Technologies adopted
1. Yellow pages: HiPage, Yelp, Zenrin…
2. Google search
3. Other sources
4. User reports and tags
★ whoscall, I have a problem…
If we cannot get any information about an unknown number from these sources, is it game over (GG)?
Yes, GG, then go home and sleep…
Of course we can't just go home and sleep, otherwise why would I be standing here?
★ Problem we want to solve
For an unknown phone number:
• No Google result
• No user tag / report
• Not a whoscall user
Can we determine if it's a spam number?
A telemarketing call? A fraud call? A harassment call? A wrong number?
(I'm not a god!!)
★ Scenario
[Figure: a telemarketing number ("OO Marketing") and three people, 小明, 小明's younger sister, and 小明's older brother, with a question mark]
★ We think it should work because…
whoscall userbase (= potential sensors):
• > 10 million installations
• > 10 thousand tags (daily)
• > 30 million phone calls (daily)
★ Analysis procedures
1. Collect call logs
2. Compare with user tags
3. Explore call behaviors
4. Extract features
5. Classify unknown numbers using machine learning techniques
★ Collect call logs
• Recruit a group of voluntary whoscall users as our sensors.
• Collect phone call logs from these sensors for a month.
★ User privacy
User privacy is given the highest priority. Phone numbers are stored as one-way hash codes and therefore cannot be reversed.
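A minimal sketch of storing phone numbers as one-way hash codes. The slides do not say which hash function or scheme whoscall uses, so HMAC-SHA-256, the `normalize` helper, and the `SECRET_KEY` value below are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret key (an assumption); a real system would load it from secure storage.
SECRET_KEY = b"example-key-not-from-the-talk"

def normalize(number: str) -> str:
    """Keep only ASCII digits so differently formatted copies of a number match."""
    return "".join(ch for ch in number if ch in "0123456789")

def hash_number(number: str) -> str:
    """Return a keyed one-way hash (hex string) of a phone number."""
    digits = normalize(number).encode("ascii")
    return hmac.new(SECRET_KEY, digits, hashlib.sha256).hexdigest()

# Two spellings of the same (made-up) number map to the same code.
code_a = hash_number("0987-991-123")
code_b = hash_number("0987 991 123")
```

A keyed hash is used in this sketch because the space of valid phone numbers is small enough that an unkeyed hash could be reversed by enumerating every number.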
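★ List of user tags
Sample tags submitted by users (translated from Chinese, in slide order):
• hangs up as soon as I answer; hangs up right after calling; the other side hangs up the moment I answer; hangs up the phone once answered; hangs up as soon as I pick up; says "wrong number" as soon as I pick up; keeps sending ad SMS; keeps dialing the wrong number; keeps getting calls that show nothing in the app; keeps frantically dialing the wrong number; one ring; hangs up without a word, suspicious; one ring then hangs up; hangs up after one ring; hangs up once I listen
• serious harassment; unexplained call from abroad; international call disguised as a Taipei number???; loan shark; loan shark telemarketing; illegal underground futures company; real estate; junk; junk SMS; junk messages; Keelung hair salon; life insurance; migrant workers; calls strangers in the middle of the night to make trouble
• pornographic dating; pornographic dating call; pornographic flesh market; pornographic junk SMS; pornographic outcall service; pornographic girl call; pornographic harassment; pornographic ad SMS; pornographic soliciting girl; erotic massage; pornographic telemarketing; pornographic telemarketing call; pornographic compensated-dating outcall; pornographic scum
• Mormon; hangs up right after dialing; nuisance call; TV ratings survey (misspelled in the original tag); TV ratings survey; loan SMS; government announcement; rings just once; prank call; 新光保全 (Shin Kong Security); 星展 (DBS) loans; 星展 (DBS) telemarketing (misspelled in the original tag); 星展銀行 (DBS Bank); procuring agent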
★ Compare with user tags
• Compare these phone numbers with user reports from the whoscall database (block records).
Normal numbers:         0987-991-XXX, 0986-225-XXX, 02-2675-XXXX, 03-862-XXXX, ...
Telemarketing numbers:  02-2543-XXXX, 03-556-XXXX, 886-XXXX, …
Malicious numbers:      02-2783-XXXX, 886-903-XXXX, 0800-000-XXX, …
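★ Data summary
# Samples: 7854 (Normal: 4000, Spam: 3854)
[Pie chart of spam categories. Labels in slide order: telemarketing calls, polling centers, harassment calls, fraud calls. Shares in slide order: 70%, 1%, 5%, 24%]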
★ Normal numbers
[Figure: call activity of a normal number]
Calls = 195 (in 66, out 129)
Opponents = 72 (in 21, out 58)
★ Spam numbers
[Figure: call activity of a spam number]
Calls = 471 (in 15, out 456)
Opponents = 186 (in 11, out 183)
User tags: XX credit card marketing (7); OOO, XXXX marketing (6); telemarketing (3)
★ What is a feature?
A "feature" is a measurable property of a phenomenon being observed.
★ Example
If we want to analyze a company, we might look at features such as:
• Number of employees
• Number of engineers
• Proportion of Python engineers in the company
• Team cohesion
• How handsome the CEO is
★ Features for call patterns
[Bar charts comparing Fraud, Marketing, and Normal numbers, one chart per feature:]
• Ratio of out calls
• Ratio of recurring opponents
• Ratio of missed out calls
• Ratio of working time calls
• Median of call durations (seconds)
• Ratio of out calls in contact book
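A few of the call-pattern features named in these charts can be computed from a call log roughly as follows. This is only a sketch: the `Call` record layout and the exact definitions (for example, a "recurring opponent" as one that appears in more than one call) are assumptions, since the slides name the features without defining them.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import median

@dataclass
class Call:
    direction: str   # "out" = placed by the analyzed number, "in" = received by it
    opponent: str    # hashed identifier of the other party
    duration: float  # talk time in seconds (0 if not answered)
    answered: bool

def extract_features(calls):
    """Compute a few call-pattern features for one phone number."""
    if not calls:
        raise ValueError("need at least one call")
    out_calls = [c for c in calls if c.direction == "out"]
    counts = Counter(c.opponent for c in calls)
    recurring = sum(1 for n in counts.values() if n > 1)  # assumed definition
    missed_out = sum(1 for c in out_calls if not c.answered)
    return {
        "ratio_out_calls": len(out_calls) / len(calls),
        "ratio_recurring_opponents": recurring / len(counts),
        "ratio_missed_out_calls": missed_out / len(out_calls) if out_calls else 0.0,
        "median_call_duration": median(c.duration for c in calls),
    }

# Made-up call log for one number.
log = [
    Call("out", "a", 12.0, True),
    Call("out", "b", 0.0, False),
    Call("out", "c", 8.0, True),
    Call("in", "a", 30.0, True),
]
features = extract_features(log)
```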
★ Naïve method
Intuitively, we can classify an unknown number by rules such as: if
• the ratio of recurring opponents is less than 40%,
• the ratio of out calls is more than 60%, and
• the ratio of in calls is less than 20%,
then we claim the number is a spam number.
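Written as code, the naïve rule is a single boolean expression. The thresholds come from the slide, and the in/out call counts come from the normal and spam examples shown earlier. The recurring-opponent ratios passed in below are made-up values, since the slides do not give them for those two numbers.

```python
def is_spam_by_rule(ratio_recurring_opponents, ratio_out_calls, ratio_in_calls):
    """Hand-written rule: all three conditions must hold."""
    return (
        ratio_recurring_opponents < 0.40
        and ratio_out_calls > 0.60
        and ratio_in_calls < 0.20
    )

# Spam example: 471 calls (15 in, 456 out); recurring ratio 0.05 is hypothetical.
spam_like = is_spam_by_rule(0.05, 456 / 471, 15 / 471)

# Normal example: 195 calls (66 in, 129 out); recurring ratio 0.30 is hypothetical.
normal_like = is_spam_by_rule(0.30, 129 / 195, 66 / 195)
```

Every threshold here has to be picked by hand, and each new feature adds another one.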
★ Problem 1
Too many features…
★ Problem 2
How to determine the rule?
★ Solution
Machine learning: let the machine learn from the data.
★ What is machine learning?
Machine learning builds a model from past data or experience. "Learning" means running that model as a program; once it has learned enough, it can make predictions (guesses). These guesses are well-founded and have a high hit rate.
★ Machine learning techniques for classification
• Support vector machine
• Logistic regression
• Decision tree
• Neural networks
• Naïve Bayes
• Nonparametric Bayesian method
★ Support vector machine for binary classification
[Sequence of eight illustration slides; the figures are not preserved in this transcript]
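To make the idea of a support vector machine for binary classification concrete, here is a from-scratch linear SVM trained by batch subgradient descent on the regularized hinge loss. It is a teaching sketch under stated assumptions, not the model used in whoscall: the talk does not state the kernel, library, or hyperparameters, and the two features and six data points below are made up.

```python
def fit_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Minimize lam/2 * ||w||^2 + mean hinge loss; labels must be +1 / -1."""
    n, d = len(X), len(X[0])
    # Center each feature so the bias term stays small.
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(Xc, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1:  # sample is inside the margin or misclassified
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, means

def predict(model, x):
    """Return +1 (spam) or -1 (normal) for one feature vector."""
    w, b, means = model
    score = sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, means)) + b
    return 1 if score >= 0 else -1

# Hypothetical features: [ratio of out calls, ratio of recurring opponents].
X = [[0.95, 0.05], [0.90, 0.10], [0.97, 0.07],   # spam-like
     [0.55, 0.45], [0.60, 0.40], [0.53, 0.43]]   # normal-like
y = [1, 1, 1, -1, -1, -1]
model = fit_linear_svm(X, y)
new_spam = predict(model, [0.92, 0.08])
new_normal = predict(model, [0.58, 0.42])
```

In practice one would more likely use an established SVM library, with a held-out test set to measure accuracy.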
Is that enough?
★ Real-life scenario
When do we need a spam number prediction? Answer: at the moment a phone call reaches a whoscall user.
We want to predict whether a number is spam as EARLY as possible in order to prevent further victims…
[Figure: a telemarketing number ("XX Marketing") reaching Victim 1, Victim 2, and Victim 3 over time; axis labels: Time, # recent calls]
Let's look at the performance of SVM under different numbers of recent calls.
★ SVM for binary classification
[Figure: Accuracy (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
Hmm… it performs well, but… can it be a bit faster?
★ Reduce the number of features
Feature computation is time-consuming, so we want to reduce the number of features before we do classification.
Of course, we don't pick them by hand…
Feature selection methods:
• Regularization methods
• Backward, forward, and stepwise methods
• Bayesian feature selection
• Random forest method
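As an illustration of one method from this list, the sketch below performs greedy forward selection: it starts with no features and keeps adding the one that improves a score the most, stopping when nothing helps. The slides do not say which method or scoring function was actually used, so the nearest-centroid scorer and the toy data here are assumptions.

```python
def centroid_accuracy(X, y, cols):
    """Training accuracy of a nearest-centroid classifier using only `cols`."""
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(r[c] for r in rows) / len(rows) for c in cols]
    labels = sorted(centroids)  # fixed order so ties break deterministically
    correct = 0
    for x, t in zip(X, y):
        guess = min(
            labels,
            key=lambda lab: sum((x[c] - m) ** 2 for c, m in zip(cols, centroids[lab])),
        )
        correct += guess == t
    return correct / len(X)

def forward_select(X, y, score=centroid_accuracy):
    """Greedily add the feature that raises the score most; stop when none does."""
    remaining = list(range(len(X[0])))
    selected, best = [], 0.0
    while remaining:
        candidates = [(score(X, y, selected + [c]), c) for c in remaining]
        top_score, top_col = max(candidates, key=lambda g: (g[0], -g[1]))
        if top_score <= best:
            break
        selected.append(top_col)
        remaining.remove(top_col)
        best = top_score
    return selected, best

# Toy data: column 0 separates the classes, columns 1 and 2 carry no signal.
X = [[0.90, 0.5, 0.1], [0.80, 0.4, 0.9], [0.85, 0.6, 0.5],
     [0.20, 0.5, 0.2], [0.10, 0.4, 0.8], [0.15, 0.6, 0.5]]
y = [1, 1, 1, 0, 0, 0]
selected, best = forward_select(X, y)
```

A realistic version would score candidate feature sets with cross-validated accuracy of the actual classifier rather than training accuracy.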
★ Feature selection results
[Figure: Accuracy (axis 0.8 to 1.0) vs. # features (10 to 30), with one curve each for 3, 5, and 10 recent calls]
★ Selected features
• Ratio of out calls
• Rate of out calls
• Ratio of out calls in contact book
• Ratio of in calls in contact book
• Ratio of reciprocal opponents
• Ratio of recurring opponents
• Rate of new opponents
• Ratio of missed calls
• Median call duration of in calls
• Ring duration of answered calls
• and more…
★ Comparison of w/ and w/o feature selection
[Figure: Accuracy (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
Done?
Well, aren't we just great?
★ What is power?
Power of class A: the probability of correctly classifying a class A sample as class A.
[Figure: a gender classifier outputs "this is a male" with 97.5%]
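In confusion-matrix terms, the definition of power can be written as below. The symbols $TP_A$ (class A samples classified as A) and $FN_A$ (class A samples classified as something else) are notation introduced here, not taken from the slides.

```latex
\[
\text{Power}_A \;=\; P(\hat{y} = A \mid y = A) \;=\; \frac{TP_A}{TP_A + FN_A}
\]
```

The same quantity is also known as recall, sensitivity, or the true positive rate.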
★ Power of our classifier
[Figure: Power (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
義銘, step it up, will you?
★ Data summary (revisited)
# Samples: 7854 (Normal: 4000, Spam: 3854)
[Pie chart of spam categories. Labels in slide order: telemarketing calls, polling centers, harassment calls, fraud calls. Shares in slide order: 70%, 1%, 5%, 24%]
★ Marketing numbers vs. normal numbers
[Figure: Accuracy (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
★ Fraud numbers vs. normal numbers
[Figure: Accuracy (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
Kind of like mixing everything together to make "pissing beef balls" (撒尿牛丸)…
★ Power of SVM for multi-classification
[Figure: Power (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
★ Power of SVM for binary classification
[Figure: Power (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
★ What is type I error rate?
Type I error: the probability of misclassifying a class B sample as class A.
[Figure: a gender classifier outputs "this is a male" with 5%]
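A short sketch of how the type I error rate can be computed next to power from true labels and predictions. The function name and the eight labelled samples are made up for illustration, with "spam" playing the role of class A and everything else the role of class B.

```python
def power_and_type1(y_true, y_pred, positive):
    """Return (power, type I error rate) for the class `positive`."""
    preds_for_a = [p for t, p in zip(y_true, y_pred) if t == positive]
    preds_for_b = [p for t, p in zip(y_true, y_pred) if t != positive]
    # Power: share of class A samples that were classified as A.
    power = sum(p == positive for p in preds_for_a) / len(preds_for_a)
    # Type I error: share of class B samples that were wrongly classified as A.
    type1 = sum(p == positive for p in preds_for_b) / len(preds_for_b)
    return power, type1

y_true = ["spam"] * 4 + ["normal"] * 4
y_pred = ["spam", "spam", "spam", "normal",      # one spam number missed
          "normal", "normal", "normal", "spam"]  # one normal number flagged
power, type1 = power_and_type1(y_true, y_pred, "spam")
```

For a spam blocker the two rates pull in opposite directions: flagging more aggressively raises power but also flags more normal numbers.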
★ Type I error comparison
[Figure: Type I error (axis 0 to 0.3) vs. # recent calls (3 to 10)]
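This small success let me relax a little and go shopping. Suddenly my phone rang once, and I happily picked it up…
But the caller had already hung up.
★ One-ring malicious calls
A "one-ring call" is a malicious call that lures the recipient into calling back, usually to a premium-rate number.
So we first looked at the call patterns of this type of phone number.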
★ Call patterns of one-ring calls
Number          Mean duration of ringing (seconds)   Mean duration of out calls (seconds)
0982-415-XXX    1.6                                  0
0982-420-XXX    3.6                                  0
0982-495-XXX    5.2                                  1.25
04-3-704-XXXX   0.9                                  0
0923-931-XXX    6.7                                  2.6
★ Feature comparison
[Bar charts comparing Fraud, Marketing, Normal, and One-ring numbers, one chart per feature:]
• Ratio of new opponents
• Ratio of in calls
• Ratio of missed calls
★ Naïve method
Similarly, without machine learning we can design rules such as:
Rule 1: The mean ringing duration is less than 7 seconds, and
Rule 2: The mean out-call duration is less than 3 seconds.
Then we claim that it is a one-ring spam call.
★ Problems
1. Too many features…
2. How to determine the rule?
3. New observations.
★ Problem 3
Number          Mean duration of ringing (seconds)   Mean duration of out calls (seconds)
0982-415-XXX    1.6                                  0
0982-420-XXX    3.6                                  0
0982-495-XXX    5.2                                  1.25
04-3-704-XXXX   0.9                                  0
0923-931-XXX    6.7                                  2.6
04-2-676-XXXX   15.7 (S.D. = 10.7)                   1.4      (new observation)
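Applying the two hand-written one-ring rules (mean ringing under 7 seconds, mean out-call duration under 3 seconds) to the six numbers in the Problem 3 table shows the issue directly. The values are taken from the slides; the function and variable names are illustrative.

```python
# (number, mean ringing duration in s, mean out-call duration in s)
observations = [
    ("0982-415-XXX", 1.6, 0.0),
    ("0982-420-XXX", 3.6, 0.0),
    ("0982-495-XXX", 5.2, 1.25),
    ("04-3-704-XXXX", 0.9, 0.0),
    ("0923-931-XXX", 6.7, 2.6),
    ("04-2-676-XXXX", 15.7, 1.4),  # the new observation
]

def is_one_ring(mean_ring, mean_out):
    """Rule 1 and Rule 2 from the naive method."""
    return mean_ring < 7 and mean_out < 3

flags = {number: is_one_ring(ring, out) for number, ring, out in observations}
missed = [number for number, flagged in flags.items() if not flagged]
```

The first five numbers satisfy both rules, but the new observation fails Rule 1 because its mean ringing duration is 15.7 seconds, so the fixed thresholds would have to be re-tuned by hand.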
Machine learning can efficiently “learn” from new data and create rules for us.
★ Power of SVM for multi-classification
[Figure: Power (axis 0.8 to 1.0) vs. # recent calls (3 to 10)]
★ Accuracy comparison
[Figure: Type I error (axis 0 to 0.3) vs. # recent calls (3 to 10)]
Deployment
All the algorithms have been implemented in the whoscall app, so how does it work?
[Figure: a call from 0984-003-XXX ("OO Marketing") reaches 小明; the data center runs the classifier ("Classifier calculating…")]
Returns: this number may be a telemarketing call
Time required: 50-100 milliseconds
What's next?
★ Improvements of the classification model
1. Fraud number analysis
2. Fuzzy classification algorithm
3. Spam-category scores
4. Cooperate with more solid outside sources
5. Generalize to other countries
Much more…
★ Future perspectives
1. Users' tag correction mechanisms
2. Personalized penalty setting
3. Anti-countermeasures
4. Extend to SMS spam detection
5. Clustering vs. user tags
6. Spam detection → Scam detection
Creating a contact network of trust
Thank you all for your valuable time.