Advanced Lecture on
Statistical Data Analysis
(Lecture 01)
Ichiro Takeuchi
Nagoya Institute of Technology
Ichiro Takeuchi, Nagoya Institute of Technology 1/24
About this course
▶ Instructor: Ichiro Takeuchi
▶ Language: English (Japanese questions are allowed)
▶ Room: 0231 (for lectures), 101A (for computer exercises, unless somebody does not have a laptop)
▶ Time: Fri 08:50 - 10:20 (hopefully shorter)
▶ Handwritten exercises are often assigned
▶ You must attend every class unless you have a good reason
▶ The grade will be determined based on the final report on data analysis
▶ Up-to-date course info will be available from the course website (find a link on the Moodle site)
What we learn in this course
▶ We learn machine learning and statistics
▶ We also learn how to use the statistical software R (only once or twice this year)
▶ In addition, I hope you enjoy the experience of attending a lecture in English
For Today
▶ Introduction to machine learning and statistics
▶ Introduction to R software
▶ How to set up your account in CSE (if applicable)
▶ Plan to finish within 45 minutes
What is Machine Learning?
▶ The goal of ML is to provide general data analysis tools for prediction or knowledge discovery.
Examples: spam mail filter, financial data analysis, disease diagnosis
ML and AI (Artificial Intelligence)
http://anime.goo.ne.jp/special/tezuka/inf_atom-new.html
Human vs. Computer
http://www.toptens.net/
1997 2011 2013
Can a computer recognize a cat?
▶ Q. Can we write a computer program that can recognize cats?
▶ There are infinitely many cats in the world!
▶ How can we define a cat explicitly?
[Figures: a series of cat photographs, captioned "This is a cat", "This is also a cat", and "This is also a kind of cat"]
Computer Programs
▶ A computer program for recognizing cats would have to look something like this:
if (x is cute)
    x is a cat;
if (x has triangle-shaped ears)
    x is a cat;
if (x has fur)
    x is a cat;
if (x has a tail)
    x is a cat;
...
▶ We cannot explicitly specify the properties of all the cats in the world
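To make the difficulty concrete, here is a minimal Python sketch of the rule-based program above. The `Animal` record, its four boolean attributes, and the two example animals are hypothetical and invented for illustration only. Loose rules (any property is enough, as in the pseudocode) accept a dog, while strict rules (all properties required) reject a hairless cat.

```python
from dataclasses import dataclass


@dataclass
class Animal:
    # Hypothetical hand-picked properties, mirroring the rules on the slide.
    is_cute: bool
    has_triangle_ears: bool
    has_fur: bool
    has_tail: bool


def properties(x: Animal) -> list:
    return [x.is_cute, x.has_triangle_ears, x.has_fur, x.has_tail]


def is_cat_loose(x: Animal) -> bool:
    # As in the pseudocode: any single matching property is enough.
    return any(properties(x))


def is_cat_strict(x: Animal) -> bool:
    # A stricter alternative: every property must match.
    return all(properties(x))


# Invented examples (not from the slides).
dog = Animal(is_cute=True, has_triangle_ears=False, has_fur=True, has_tail=True)
hairless_cat = Animal(is_cute=True, has_triangle_ears=True, has_fur=False, has_tail=True)

print("loose rules call the dog a cat:", is_cat_loose(dog))
print("strict rules call the hairless cat a cat:", is_cat_strict(hairless_cat))
```

Whichever way the rules are tightened or loosened, some animal ends up on the wrong side, which is the motivation for learning from examples instead.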
Machine Learning Approach
▶ Instead of explicitly describing the properties of a cat, we provide examples to a computer and let the computer learn what a cat is
Google + Stanford Project (2012)
▶ Data compression by auto-encoder
Types of Problems in ML
▶ Supervised learning
  ▶ Regression
  ▶ Classification
    ▶ Binary classification
    ▶ Multiclass classification
▶ Semi-supervised learning
▶ Unsupervised learning
  ▶ Clustering
  ▶ Density estimation
Regression Learning
▶ Data
$(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$
▶ Input
$\boldsymbol{x}_i \in \mathbb{R}^d, \quad i \in \{1, \ldots, n\}$
▶ Output
$y_i \in \mathbb{R}, \quad i \in \{1, \ldots, n\}$
An Example
area (x)    price (y)
3.472       25.9
3.531       29.5
2.275       27.9
4.050       25.9
4.455       29.9
...         ...
[Figure: scatter plot of price (y) against area (x)]
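Least-Squares Linear Regression
▶ Training Data:
$\{(\boldsymbol{x}_1, y_1), \ldots, (\boldsymbol{x}_n, y_n)\}, \quad \boldsymbol{x}_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}$
▶ Linear Model:
$f(\boldsymbol{x}_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_d x_{id} = \beta_0 + \boldsymbol{\beta}^\top \boldsymbol{x}_i$
▶ Quadratic Loss:
$\min_{\beta_0, \boldsymbol{\beta}} E(\beta_0, \boldsymbol{\beta}) = \sum_{i=1}^{n} \left( y_i - f(\boldsymbol{x}_i) \right)^2$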
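As an illustration of the quadratic loss above, the following Python sketch fits the linear model for a single input feature (d = 1), where the minimizer has a simple closed form. It uses only the five (area, price) rows visible in the example table, since the remaining rows are elided there; the function names `fit_line` and `squared_error` are chosen here for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of f(x) = beta0 + beta1 * x for one feature (d = 1)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form minimizer of E(beta0, beta1) = sum_i (y_i - f(x_i))^2.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


def squared_error(beta0, beta1, xs, ys):
    """The quadratic loss E(beta0, beta1) on the given data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Only the five rows visible in the example table.
area = [3.472, 3.531, 2.275, 4.050, 4.455]
price = [25.9, 29.5, 27.9, 25.9, 29.9]

beta0, beta1 = fit_line(area, price)
print(f"f(x) = {beta0:.3f} + {beta1:.3f} * x")
print(f"E = {squared_error(beta0, beta1, area, price):.3f}")
```

At the minimizer the residuals sum to zero and are uncorrelated with the input, so moving either coefficient away from the fitted value can only increase E.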
Classification Learning
▶ Data
$(\boldsymbol{x}_1, y_1), (\boldsymbol{x}_2, y_2), \ldots, (\boldsymbol{x}_n, y_n)$
▶ Input
$\boldsymbol{x}_i \in \mathbb{R}^d, \quad i \in \{1, \ldots, n\}$
▶ Output
$y_i \in \{-1, +1\}, \quad i \in \{1, \ldots, n\}$
A Binary Classification Example
▶ Consider a medical diagnosis problem (classify cancer or not) based on the activities of two genes A and B:
ID    gene A    gene B    cancer (+1) or not (−1)
1     310       150       +1
2     190       160       +1
3     280       120       +1
4     310       170       +1
5     290       120       +1
6     200       100       −1
7     180       130       −1
8     240       110       −1
9     150       150       −1
10    150       110       −1
[Figure: scatter plot of the 10 patients; horizontal axis: activity of gene A, vertical axis: activity of gene B]
A Binary Classification Example
▶ We want to determine the classification boundary so that we can diagnose or classify a new patient.
[Figure: scatter plot of the 10 patients with a classification boundary; horizontal axis: activity of gene A, vertical axis: activity of gene B]
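One simple way to find such a boundary is the perceptron algorithm, sketched below in Python. This is an illustrative choice made here, not necessarily the method used in the course. The 10 training instances come from the table of gene activities; the rescaling constant `SCALE` and the `new_patient` values are assumptions introduced for the example.

```python
# The 10 training instances: (gene A, gene B, label).
patients = [
    (310, 150, +1), (190, 160, +1), (280, 120, +1), (310, 170, +1), (290, 120, +1),
    (200, 100, -1), (180, 130, -1), (240, 110, -1), (150, 150, -1), (150, 110, -1),
]

SCALE = 100.0  # assumption: rescale activities so both features are of order 1


def f(w0, w, a, b):
    """Linear score; the boundary is the set of points where f = 0."""
    return w0 + w[0] * a / SCALE + w[1] * b / SCALE


def train_perceptron(data, max_epochs=100_000):
    w0, w = 0.0, [0.0, 0.0]
    for _ in range(max_epochs):
        mistakes = 0
        for a, b, y in data:
            if y * f(w0, w, a, b) <= 0:  # misclassified, or exactly on the boundary
                w0 += y
                w[0] += y * a / SCALE
                w[1] += y * b / SCALE
                mistakes += 1
        if mistakes == 0:  # every training instance is on the correct side
            break
    return w0, w


def classify(w0, w, a, b):
    return +1 if f(w0, w, a, b) > 0 else -1


w0, w = train_perceptron(patients)
new_patient = (300, 160)  # hypothetical gene activities, not from the slides
print("boundary parameters:", w0, w)
print("diagnosis for the new patient:", classify(w0, w, *new_patient))
```

Because these 10 instances are linearly separable, the perceptron stops after a finite number of updates. It returns some separating boundary, not a specially chosen one.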
Formulation of Binary Classifier Training
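▶ Training Set (with n instances and d features):
$\{(\boldsymbol{x}_i, y_i)\}_{i=1}^{n}, \quad \boldsymbol{x}_i = [x_{i1} \ldots x_{id}]^\top \in \mathbb{R}^d, \quad y_i \in \{-1, +1\}$
▶ (Linear) Classifier:
$f(\boldsymbol{x}) = w_0 + \boldsymbol{w}^\top \boldsymbol{x}, \quad y = \begin{cases} -1, & \text{if } f(\boldsymbol{x}) < 0 \\ +1, & \text{if } f(\boldsymbol{x}) > 0 \end{cases}$
▶ Optimization Problem:
$\min_{\boldsymbol{w}, w_0} \underbrace{\sum_{i=1}^{n} L(y_i, f(\boldsymbol{x}_i))}_{\text{loss term}} + \underbrace{\lambda R(\boldsymbol{w})}_{\text{regularization term}}$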
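The optimization problem leaves the loss L and the regularizer R unspecified. The Python sketch below fills them in with one common choice, the hinge loss and the squared Euclidean norm, purely as an assumption for illustration. The value of λ, the step size, the iteration count, and the rescaling of the gene activities by 1/100 are likewise arbitrary choices made here.

```python
# Gene-activity training set, rescaled by 1/100 (an assumption made here so
# that plain subgradient descent behaves reasonably).
X = [(3.1, 1.5), (1.9, 1.6), (2.8, 1.2), (3.1, 1.7), (2.9, 1.2),
     (2.0, 1.0), (1.8, 1.3), (2.4, 1.1), (1.5, 1.5), (1.5, 1.1)]
Y = [+1, +1, +1, +1, +1, -1, -1, -1, -1, -1]


def f(w0, w, x):
    return w0 + sum(wj * xj for wj, xj in zip(w, x))


def hinge(y, fx):
    # Assumed loss: L(y, f(x)) = max(0, 1 - y f(x)).
    return max(0.0, 1.0 - y * fx)


def objective(w0, w, lam):
    loss_term = sum(hinge(y, f(w0, w, x)) for x, y in zip(X, Y))
    reg_term = lam * 0.5 * sum(wj * wj for wj in w)  # assumed R(w) = ||w||^2 / 2
    return loss_term + reg_term


def subgradient_descent(lam=0.01, step=0.01, iters=2000):
    w0, w = 0.0, [0.0, 0.0]
    best = (objective(w0, w, lam), w0, list(w))
    for _ in range(iters):
        g0, g = 0.0, [lam * wj for wj in w]
        for x, y in zip(X, Y):
            if y * f(w0, w, x) < 1:  # instance contributes to the hinge loss
                g0 -= y
                g = [gj - y * xj for gj, xj in zip(g, x)]
        w0 -= step * g0
        w = [wj - step * gj for wj, gj in zip(w, g)]
        value = objective(w0, w, lam)
        if value < best[0]:  # keep the best iterate seen so far
            best = (value, w0, list(w))
    return best


start = objective(0.0, [0.0, 0.0], lam=0.01)
best_value, w0, w = subgradient_descent()
print(f"objective at w = 0: {start:.3f}")
print(f"best objective found: {best_value:.3f}")
```

The regularization term penalizes large weights, so λ trades off fitting the training instances against keeping the classifier simple.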
Support Vector Machine (SVM)
▶ SVM has been one of the most popular binary classifiers in the last decade.
[Figure: three panels (a), (b), (c), each showing the 10 training instances with a different classification boundary; horizontal axis: activity of gene A, vertical axis: activity of gene B]
Which boundary is the best (although all three boundaries perfectly classify the 10 training instances)?
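The SVM's answer to this question is to prefer the boundary with the largest margin, that is, the largest distance to the closest training instance. The Python sketch below compares three linear boundaries by this criterion. The three candidates are hypothetical lines chosen here for illustration; they are not the boundaries drawn in panels (a) to (c), which cannot be recovered from the slides.

```python
import math

# The 10 training instances: (gene A, gene B, label).
patients = [
    (310, 150, +1), (190, 160, +1), (280, 120, +1), (310, 170, +1), (290, 120, +1),
    (200, 100, -1), (180, 130, -1), (240, 110, -1), (150, 150, -1), (150, 110, -1),
]

# Hypothetical candidate boundaries f(x) = w0 + wA * A + wB * B = 0.
candidates = {
    "candidate 1": (-417.5, 1.0, 1.5),
    "candidate 2": (-485.0, 1.0, 2.0),
    "candidate 3": (-620.0, 1.0, 3.0),
}


def geometric_margin(w0, wa, wb, data):
    """Smallest signed distance from a training instance to the boundary.

    The value is positive only if every instance lies on the correct side.
    """
    norm = math.hypot(wa, wb)
    return min(y * (w0 + wa * a + wb * b) / norm for a, b, y in data)


for name, (w0, wa, wb) in candidates.items():
    print(name, "margin:", round(geometric_margin(w0, wa, wb, patients), 2))

best = max(candidates, key=lambda k: geometric_margin(*candidates[k], patients))
print("largest margin:", best)
```

All three candidates classify the training instances perfectly, yet their margins differ; an SVM searches over all boundaries for the one that maximizes this quantity.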
R
▶ R is a freely available language and environment for statistical computing
▶ Many computer scientists and statisticians develop packages
▶ Many practitioners use these packages
▶ Useful for your own research (data analysis, visualization)
▶ Some useful packages will be introduced in this course
▶ Your final presentation should be prepared using R
▶ To install, search for "CRAN"
How to set up your account in CSE
▶ Make decisions by 5pm today
▶ For each laboratory, only one of the students should send the list of
  ▶ Name
  ▶ Student ID number
  ▶ Kiban ID (基盤 ID)
  ▶ Department
  ▶ Laboratory
  ▶ Year (M1, M2)
  ▶ E-mail address
to [email protected] by 5 pm on April 9.
▶ To have your account issued, you need to enter all the required information into the Unified Database (統一データベース)
▶ If you already have an account, you do not need to do anything
Hope to see you next week