Linear Models for Regression
Agenda
• Introduction
• Gradient Descent
  • Cost function and minimization
  • Implementation
• Evaluating Regression Models
• Regularization
Copyright © 2018. Data Science Dojo
INTRODUCTION
Regression
• Predict the length of a patient's stay in hospital
• Forecast the cost of treatment
• Predict the number of staff needed on a given day
Notation: Breast Cancer Dataset
$x_1^5$
• 5: The patient is in the 5th row
• 1: The patient's diagnosis is the 1st column
Notation: Breast Cancer Dataset
So how do we describe all the rows?
Row 1: $x^1 = [17.99, 10.38, 122.80]$
Row 2: $x^2 = [20.57, 17.77, 132.90]$
Row 3: $x^3 = [19.69, 21.25, 130.00]$
Notation: Breast Cancer Dataset
The breast cancer dataset records the physical properties of each tumor and its diagnosis.
Using this notation, we can describe all the columns of the dataset: the feature columns $x_1$, $x_2$, $x_3$ together form $X$, and the target column is $Y$.
Notation Summary
Features
• $x^i$: Each row of features
• $x_j$: Each column of features
• $X$: Set of all the feature columns
Target
• $y^i$: Each row of the target
• $Y$: The target column
Size
• $n$: Number of rows in the dataset
• $m$: Number of columns in the dataset
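The notation can be tried out on the three rows listed earlier. This is only a sketch: the helper names `row`, `column`, and `entry` are hypothetical, the target column is left out, and indices are 1-based to match the slides.

```python
# Three rows of feature values, copied from the "Row 1..3" slide.
X = [
    [17.99, 10.38, 122.80],  # x^1
    [20.57, 17.77, 132.90],  # x^2
    [19.69, 21.25, 130.00],  # x^3
]

n = len(X)     # number of rows
m = len(X[0])  # number of feature columns in this small extract

def row(X, i):
    """Return x^i, the i-th row (1-based)."""
    return X[i - 1]

def column(X, j):
    """Return x_j, the j-th feature column (1-based)."""
    return [r[j - 1] for r in X]

def entry(X, i, j):
    """Return x_j^i, the value in row i and column j."""
    return X[i - 1][j - 1]

print(row(X, 2))       # second row
print(column(X, 1))    # first feature column
print(entry(X, 3, 2))  # row 3, column 2
```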
COST FUNCTION AND GRADIENT DESCENT
What is a good regression line?
• Wind speed = 15 mph
• Ozone = ?
• Use the line that is somewhere in the middle
• How do we define "middle"?

$$h_\theta(x) = \theta_0 + \theta_1 x$$
Defining a line
How do we define a line in slope-intercept notation?
• $y = mx + b$
In $\theta$ notation:
• $h_\theta(x) = \theta_1 x + \theta_0$
• $m$ = slope = $\theta_1$
• $b$ = intercept = $\theta_0$
More Features
With a target $y$ and features $x_1$, $x_2$, $x_3$:

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3$$
Residuals (or "Errors")
The difference between the hypothesis $h_\theta(x)$ (predicted value) and the true value (known target).
[Figure: regression line with two residuals labeled Error 1 and Error 2]
Cost Function
Minimize the 'cost' or 'loss' function $J(\theta)$:
• Smaller for lower error
• Larger for higher error

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2$$

[Figure: regression line with two residuals labeled Error 1 and Error 2]
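The cost function can be evaluated directly for a few candidate lines. The sketch below assumes a hypothetical toy dataset lying exactly on $y = x$; the function names `h` and `cost` are illustrative, not taken from the slides.

```python
def h(theta, x):
    """Hypothesis: theta_0 + theta_1*x_1 + ... + theta_m*x_m."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def cost(theta, X, y):
    """J(theta) = 1/(2n) * sum of squared residuals."""
    n = len(X)
    return sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * n)

# Hypothetical toy data: one feature, points on the line y = x.
X = [[1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0]

# Fix theta_0 = 0 and vary the slope theta_1.
for theta1 in (0.5, 1.0, 2.0):
    print(theta1, cost([0.0, theta1], X, y))
```

For this data the cost is zero at a slope of 1.0 and grows as the slope moves away from it in either direction.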
Cost Function
$$h_\theta(x) = \theta_0 + \theta_1 x$$

$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2$$

[Figure: left, lines with $\theta_0 = 0$ and $\theta_1 = 0.5$, $1.0$, $2$; right, the parabola of $J$ against $\theta_1$]
Each point on the parabola corresponds to a line on the graph on the left.
Cost function in three dimensions
$$J(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2$$

[Figure: surface plot of $J(\theta_0, \theta_1)$ over the axes $\theta_0$ and $\theta_1$]
How do we find the minimum of the cost function?
Maximum/Minimum Problem
Find two non-negative numbers whose sum is 9 such that the product of one number and the square of the other number is a maximum.
Solution (1/2)
The sum of the two numbers is 9:

$$9 = x + y$$

The product of one number and the square of the other is

$$P = x y^2 = x (9 - x)^2$$
Solution (2/2)
Using the product rule and chain rule from Calculus 101:

$$\begin{aligned}
P' &= x (2)(9 - x)(-1) + (1)(9 - x)^2 \\
   &= (9 - x)\,[-2x + (9 - x)] \\
   &= (9 - x)\,[9 - 3x] \\
   &= (9 - x)(3)[3 - x] \\
   &= 0
\end{aligned}$$

so $x = 9$ or $x = 3$. Since $P(9) = 0$ and $P(3) = 108$, the maximum is at $x = 3$, $y = 6$.
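A quick numerical check of this result, as a sketch: it scans $x$ over $[0, 9]$ on a grid of step 0.01 (the grid size is an arbitrary choice) and keeps the value that maximizes $P$.

```python
def P(x):
    """Product of one number and the square of the other, with y = 9 - x."""
    return x * (9 - x) ** 2

# Candidate values 0.00, 0.01, ..., 9.00.
candidates = [i / 100 for i in range(0, 901)]
best_x = max(candidates, key=P)

print(best_x, 9 - best_x, P(best_x))
```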
Maximum Problem
There are 50 apple trees in an orchard.
Each tree produces 800 apples. For each additional tree planted in the orchard, the apple output per tree drops by 10 apples.
Question: How many additional trees should be planted in the existing orchard in order to maximize the apple output of the orchard?
Solution

$$A = (50 + t)(800 - 10t) = 40{,}000 + 300t - 10t^2$$

Solve for $A'$ and set it to 0 to find the maximum:

$$A' = -20t + 300 = 0 \quad\Rightarrow\quad t = 15$$

Adding 15 trees will maximize apple production.
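The orchard answer can also be confirmed by brute force. The sketch below tries every whole number of additional trees from 0 to 80 (at 80 the output per tree reaches zero); the function name `apples` is illustrative.

```python
def apples(t):
    """Total output with t additional trees: (50 + t) trees at (800 - 10t) apples each."""
    return (50 + t) * (800 - 10 * t)

best_t = max(range(0, 81), key=apples)
print(best_t, apples(best_t))
```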
Gradient Descent
• Goal: minimize $J(\theta)$
• Start with some initial $\theta$ and then perform an update on each $\theta_j$ in turn:

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^k)$$

• Repeat until $\theta$ converges
Gradient Descent
• $\alpha$ is known as the learning rate; it is set by the user
• Each iteration takes a step in the direction of steepest descent; when $\alpha$ is small enough, $J(\theta)$ decreases
• $\alpha$ determines how quickly or slowly the algorithm will converge to a solution

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^k)$$
Intuition
$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^k)$$

[Figure: $J$ plotted against $\theta_j$, with regions of positive and negative gradient on either side of the minimum and successive iterates $\theta_j^{k}$, $\theta_j^{k+1}$, ..., $\theta_j^{k+3}$ marked]
Where the gradient is positive the update decreases $\theta_j$; where it is negative the update increases $\theta_j$.
Effect of High Learning Rate: Large 𝛼
$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^k)$$

[Figure: $J$ plotted against $\theta_j$, with regions of positive and negative gradient, showing the steps taken with a large $\alpha$]
Effect of Low Learning Rate: Small 𝛼

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^k)$$

[Figure: $J$ plotted against $\theta_j$, with regions of positive and negative gradient, showing the steps taken with a small $\alpha$]
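The effect of the learning rate can be seen on a one-parameter toy problem. This sketch assumes the cost $J(\theta) = \theta^2$ (not a cost from the slides), whose gradient is $2\theta$ and whose minimum is at $\theta = 0$; the three values of `alpha` are arbitrary illustrations.

```python
def descend(alpha, theta=1.0, steps=20):
    """Run `steps` gradient descent updates on the toy cost J(theta) = theta**2."""
    for _ in range(steps):
        gradient = 2 * theta              # dJ/dtheta
        theta = theta - alpha * gradient  # the update rule from the slides
    return theta

for alpha in (0.01, 0.3, 1.1):
    print(alpha, descend(alpha))
```

With the small rate the parameter has barely moved after 20 steps, with the moderate rate it is essentially at the minimum, and with the large rate each step overshoots and the parameter moves further away.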
Gradient Descent Implementation
When do we stop updating?
• When $\theta_j^{k+1}$ is close to $\theta_j^{k}$
• When $J(\theta^{k+1})$ is close to $J(\theta^k)$ (the error does not change)
[Figure: $J$ plotted against $\theta_j$, with regions of positive and negative gradient and two candidate stopping points marked "Here?"]
Batch Gradient Descent
• How do we incorporate all our data?
• Loop!

For j from 0 to m:

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right) x_j^i$$

This is the general update

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{\partial}{\partial \theta_j} J(\theta^k)$$

with the partial derivative of $J$ written out. Each $\theta_j$ represents one feature.

• $h_\theta$ is updated only once the loop has completed
• Weaknesses?
Batch Gradient Descent
• Loop!

For j from 0 to m:

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \frac{1}{n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right) x_j^i$$
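A minimal sketch of batch gradient descent in plain Python follows. The parameter names (`alpha`, `tol`, `max_iters`), the toy data generated from $y = 1 + 2x$, and the convention $x_0 = 1$ for the intercept are assumptions made for illustration. The loop stops when $J$ changes by less than `tol`, one of the stopping rules mentioned earlier.

```python
def h(theta, x):
    """Hypothesis: theta_0 + theta_1*x_1 + ... + theta_m*x_m."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def cost(theta, X, y):
    """J(theta) = 1/(2n) * sum of squared residuals."""
    return sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * len(X))

def batch_gradient_descent(X, y, alpha=0.1, tol=1e-12, max_iters=100_000):
    n, m = len(X), len(X[0])
    theta = [0.0] * (m + 1)
    previous = cost(theta, X, y)
    for _ in range(max_iters):
        # Residuals are computed once per pass, over the whole dataset.
        errors = [h(theta, xi) - yi for xi, yi in zip(X, y)]
        new_theta = []
        for j in range(m + 1):
            # x_0 is taken to be 1 so that theta_0 acts as the intercept.
            grad_j = sum(
                e * (1.0 if j == 0 else xi[j - 1]) for e, xi in zip(errors, X)
            ) / n
            new_theta.append(theta[j] - alpha * grad_j)
        theta = new_theta  # the hypothesis changes only after the loop over j
        current = cost(theta, X, y)
        if abs(previous - current) < tol:  # J has stopped changing
            break
        previous = current
    return theta

# Hypothetical data generated from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = batch_gradient_descent(X, y)
print(theta)  # close to [1, 2]
```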
Stochastic Gradient Descent
• Consider an alternative approach:

for i from 1 to n:
    for j from 0 to m:

$$\theta_j^{k+1} := \theta_j^{k} - \alpha \left( h_\theta(x^i) - y^i \right) x_j^i$$

• $h_\theta$ is updated when the inner loop is complete
• If the training set is big, converges quicker than batch
• May oscillate around a minimum of $J(\theta)$ and never converge

* We're now only taking one random observation at a time as a sample, instead of averaging across observations
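The stochastic variant can be sketched the same way. Here the observations are visited in a shuffled order on each pass, which is one assumed reading of "one random observation at a time"; the names `alpha`, `epochs`, and `seed` and the toy data are likewise illustrative.

```python
import random

def h(theta, x):
    """Hypothesis: theta_0 + theta_1*x_1 + ... + theta_m*x_m."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def stochastic_gradient_descent(X, y, alpha=0.05, epochs=2000, seed=0):
    rng = random.Random(seed)
    m = len(X[0])
    theta = [0.0] * (m + 1)
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the observations in random order
        for i in indices:
            error = h(theta, X[i]) - y[i]
            # Update every theta_j from this single observation (x_0 = 1).
            theta = [theta[0] - alpha * error] + [
                theta[j] - alpha * error * X[i][j - 1] for j in range(1, m + 1)
            ]
    return theta

# Hypothetical noise-free data generated from y = 1 + 2x.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]
theta = stochastic_gradient_descent(X, y)
print(theta)  # close to [1, 2]
```

Because this toy data has no noise, the updates settle on the exact line; with noisy data the parameters would typically keep fluctuating around the minimum, as the slide warns.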
Batch vs. Stochastic
Which is the better to use? It depends.
| | Batch Gradient Descent | Stochastic Gradient Descent |
|---|---|---|
| Function | Updates hypothesis by scanning whole dataset | Updates hypothesis by scanning one training sample at a time |
| Rate of convergence | Slowly | Quickly (but may oscillate at minimum) |
| Appropriate dataset size | Small | Large |
EVALUATING REGRESSION MODELS
Evaluation Metrics for Regression
• Mean Absolute Error (MAE)
• Root-Mean-Square Error (RMSE)
  • Also called Root-Mean-Square Deviation
• Coefficient of Determination ($R^2$)
Mean Absolute Error
• Mean of the absolute residual values
• "Pure" measure of error

$$MAE(\theta) = \frac{\sum_{i=1}^{n} \left| h_\theta(x^i) - y^i \right|}{n}$$
Mean Absolute Error - Example
$y = [36, 19, 34, 6, 1, 45]$
$h_\theta(x) = [27, -2.6, 13, -7.3, -2.6, 48]$
$|h_\theta(x) - y| = [9, 21.6, 21, 13.3, 3.6, 3]$

$$MAE(\theta) = \frac{71.5}{6} \approx 11.9$$
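The same calculation in Python, using the values from the example; the function name `mean_absolute_error` is illustrative.

```python
y = [36, 19, 34, 6, 1, 45]
predictions = [27, -2.6, 13, -7.3, -2.6, 48]

def mean_absolute_error(predictions, y):
    """Mean of the absolute residuals."""
    return sum(abs(p - t) for p, t in zip(predictions, y)) / len(y)

mae = mean_absolute_error(predictions, y)
print(round(mae, 1))
```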
Root-Mean-Square Error
• Square root of the mean of squared residuals
• Penalizes large errors more than small ones
• Good measure to use to accentuate outliers

$$RMSE(\theta) = \sqrt{\frac{\sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2}{n}}$$
RMSE - Example
$y = [36, 19, 34, 6, 1, 45]$
$h_\theta(x) = [27, -2.6, 13, -7.3, -2.6, 48]$
$(h_\theta(x) - y)^2 \approx [81, 467, 441, 177, 13, 9]$

$$RMSE(\theta) = \sqrt{\frac{1187}{6}} \approx 14.1$$
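A short sketch reproducing the RMSE example; the function name `root_mean_square_error` is illustrative.

```python
import math

y = [36, 19, 34, 6, 1, 45]
predictions = [27, -2.6, 13, -7.3, -2.6, 48]

def root_mean_square_error(predictions, y):
    """Square root of the mean of the squared residuals."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, y)]
    return math.sqrt(sum(squared) / len(y))

rmse = root_mean_square_error(predictions, y)
print(round(rmse, 1))
```

The RMSE (about 14.1) is larger than the MAE for the same predictions (about 11.9) because squaring gives the larger residuals more weight.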
Coefficient of Determination (R2)
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

where
• $SS_{res}$: Sum of squared residuals (i.e. total squared error)

$$SS_{res} = \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2$$

• $SS_{tot}$: Sum of squared differences from the mean (i.e. total variation in the dataset)

$$SS_{tot} = \sum_{i=1}^{n} \left( y^i - \bar{y} \right)^2$$

Result: Measure of how well the model explains the data
• "Fraction of variation in data explained by model"
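As an illustration, $R^2$ can be computed for the same six values used in the MAE and RMSE examples. The slides do not work this case, so the figure printed here is derived rather than quoted; the function name `r_squared` is illustrative.

```python
y = [36, 19, 34, 6, 1, 45]
predictions = [27, -2.6, 13, -7.3, -2.6, 48]

def r_squared(predictions, y):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((p - t) ** 2 for p, t in zip(predictions, y))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

r2 = r_squared(predictions, y)
print(round(r2, 2))
```

Here the mean of $y$ is 23.5, $SS_{res} \approx 1187.4$ and $SS_{tot} = 1561.5$, so the model explains roughly a quarter of the variation.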
REGULARIZATION
Overfitting
[Figure: three plots of Price against Size, fitted with models of increasing degree]
• $\theta_0 + \theta_1 x$
• $\theta_0 + \theta_1 x + \theta_2 x^2$
• $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$
Intuition
$$J'(\theta) = J(\theta) + \text{Penalty}$$

[Figure: two plots of Price against Size, each fitted with $\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$; the second is annotated "Ensure small"]
Definitions
• Two most common methods
  • L1 regularization
    • lasso regression
  • L2 regularization
    • ridge regression
    • weight decay

$$J_{L2}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

$$J_{L1}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$$
Regularized Regression
• Find the best fit
• Keep the $\theta_j$ terms as small as possible
• $\lambda$ is a user-set parameter which controls the trade-off

$$J_{L2}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

$$J_{L1}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$$
Regularized Regression
• Size of $\lambda$ is important
  • $\lambda$ too high => no fitting
  • $\lambda$ too low => no regularization

$$J_{L2}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2 + \lambda \sum_{j=1}^{m} \theta_j^2$$

$$J_{L1}(\theta) = \frac{1}{2n} \sum_{i=1}^{n} \left( h_\theta(x^i) - y^i \right)^2 + \lambda \sum_{j=1}^{m} |\theta_j|$$
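The two regularized costs can be written out directly. This is a sketch on hypothetical toy data; the function names and the choice `lam = 0.5` are illustrative. As in the formulas, the penalty sums run from $j = 1$, so the intercept $\theta_0$ is not penalized.

```python
def h(theta, x):
    """Hypothesis: theta_0 + theta_1*x_1 + ... + theta_m*x_m."""
    return theta[0] + sum(t * xj for t, xj in zip(theta[1:], x))

def squared_error_term(theta, X, y):
    """The unregularized cost: 1/(2n) * sum of squared residuals."""
    return sum((h(theta, xi) - yi) ** 2 for xi, yi in zip(X, y)) / (2 * len(X))

def cost_l2(theta, X, y, lam):
    """Ridge cost: squared error plus lam * sum of theta_j**2 for j >= 1."""
    return squared_error_term(theta, X, y) + lam * sum(t ** 2 for t in theta[1:])

def cost_l1(theta, X, y, lam):
    """Lasso cost: squared error plus lam * sum of |theta_j| for j >= 1."""
    return squared_error_term(theta, X, y) + lam * sum(abs(t) for t in theta[1:])

# Hypothetical toy data lying on y = x.
X = [[1.0], [2.0], [3.0]]
y = [1.0, 2.0, 3.0]
lam = 0.5

for theta in ([0.0, 1.0], [0.0, 2.0]):
    print(theta, cost_l2(theta, X, y, lam), cost_l1(theta, X, y, lam))
```

Even the perfect fit $\theta = [0, 1]$ now has a non-zero cost, which is how the penalty pulls the coefficients toward smaller values.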
QUESTIONS
Unsupervised Learning and K-Means Clustering
Unsupervised Learning
• Trying to find hidden structure in unlabeled data
• No error or reward signal to evaluate a potential solution. No need to pick a response class.
• Common techniques: K-Means clustering, hierarchical clustering, hidden Markov models, etc.
Choose Number of Clusters
Example 1 (domain knowledge / practicalities): Clothing sizes
• Tailor-made for each person is too expensive
• One-size-fits-all does not work!
• Group people of similar sizes together to make "small", "medium", and "large" t-shirts
Choose Number of Clusters
Example 2 (via evaluation): Patient Segmentation
• Subdivide patients into distinct subsets based on disease characteristics
• Any subset may conceivably be selected as a segment and then be targeted with care models and intervention programs tailored to its needs
K-Means Clustering
• Partitions data points into similarity clusters
• Unsupervised technique: there is no partitioning into a learning or a test set in unsupervised learning
• Useful in grouping observations
• Only works for numeric data
Data Preparation
• Transform categorical variables into numeric
• Standardize
• Reduce dimensionality

| Age | Pclass.1 | Pclass.2 | Pclass.3 | Sex.female | Sex.male |
|---|---|---|---|---|---|
| 19 | 0 | 1 | 0 | 0 | 1 |
| 28 | 1 | 0 | 0 | 1 | 0 |
| 64 | 0 | 0 | 1 | 0 | 1 |

Often called "dummy variables" or "one-hot encoding".
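The encoding in the table can be reproduced with a few lines of plain Python. This is a sketch: `one_hot` is a hypothetical helper, and the raw `pclass` and `sex` values are inferred from the encoded rows above.

```python
def one_hot(values):
    """Return (categories, rows); each row has a 1 in the position of its category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Raw values inferred from the three encoded rows in the table.
pclass = [2, 1, 3]
sex = ["male", "female", "male"]

pclass_categories, pclass_rows = one_hot(pclass)
sex_categories, sex_rows = one_hot(sex)

print(pclass_categories, pclass_rows)
print(sex_categories, sex_rows)
```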
Euclidean Distance
Determine intra- and inter-cluster similarity
[Figure: two points $(x_1, y_1)$ and $(x_2, y_2)$ and the distance between them]
• Intra-cluster distances are minimized
• Inter-cluster distances are maximized
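For two points $(x_1, y_1)$ and $(x_2, y_2)$ the Euclidean distance is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$, and the same pattern extends to any number of dimensions. A small sketch, with an illustrative function name:

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))
```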
K-Means Clustering (1/2)
[Figure: steps 1 and 2 of a K-means run]
K-Means Clustering (2/2)
[Figure: steps 3, 4, and 5 of a K-means run]
The positions of the cluster centers are determined by the mean of all the points in the cluster.
K-Means Clustering
[Figure: K-means clustering result with K = 3]
K-Means Clustering Algorithm
Suppose a set of data points: {x1, x2, x3, ..., xn}
• Step 0: Decide the number of clusters, k
• Step 1: Place centroids c1, c2, ..., ck at random locations
• Step 2: Repeat until convergence:
    {
    for each point xi:
        find the nearest centroid cj (e.g. by Euclidean distance)
        assign the point xi to cluster j
    for each cluster j = 1..k:
        calculate the new centroid cj = mean of all points xi assigned to cluster j in the previous step
    }
• Step 3: Stop when none of the cluster assignments change
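The steps above can be sketched in plain Python. Several details are assumptions rather than part of the slides: the initial centroids are drawn from the data points themselves, an empty cluster keeps its old centroid, and `max_iters` guards against a run that does not settle. The six sample points are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def k_means(points, k, seed=0, max_iters=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: k distinct data points as centroids
    assignments = None
    for _ in range(max_iters):         # Step 2: repeat until convergence
        new_assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_assignments == assignments:  # Step 3: no assignment changed
            break
        assignments = new_assignments
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return centroids, assignments

# Hypothetical data: two well-separated groups of three points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, assignments = k_means(points, k=2)
print(centroids)
print(assignments)
```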
K-Means Clustering
• Minimizes aggregate intra-cluster distance
• Measures squared distance from each point to the center of its cluster:

$$\sum_{j=1}^{K} \sum_{x \in g_j} D(c_j, x)^2$$

• Could converge to a local minimum
• Different starting points can give very different results
• Run many times with random starting points
• Nearby points may not be assigned to the same cluster
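The objective can be evaluated for any proposed grouping. The sketch below compares two hypothetical ways of splitting the same four points into two clusters, taking each cluster's center as the mean of its points; the function names are illustrative.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def within_cluster_sum_of_squares(clusters):
    """Sum over clusters of squared distances from each point to its cluster mean."""
    total = 0.0
    for members in clusters:
        center = mean(members)
        total += sum(squared_distance(center, x) for x in members)
    return total

# Two hypothetical groupings of the same four points.
tight = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
loose = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]

print(within_cluster_sum_of_squares(tight))
print(within_cluster_sum_of_squares(loose))
```

The grouping that keeps nearby points together scores far lower, which is the sense in which K-means prefers it.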
K-Means Clustering
• Strengths
  • Simple: easy to understand and to implement
  • Efficient: linear time, minimal storage
• Weaknesses
  • Mean must be well defined
  • The user needs to specify k
  • Algorithm is sensitive to outliers
Finding K with Elbow Method
Goal: Choose a number of clusters so that adding another cluster doesn't give much better modelling of the data.
• Option 1: Percentage of variance explained as a function of the number of clusters
• Option 2: Total of the squared distances from each cluster point to its center
QUESTIONS
Big Data Engineering
Agenda
• Introduction
• A key problem – machine learning at scale
• Distributed computing with Apache Hadoop & Hive
• Machine learning at scale with Apache Mahout
• Distributed computing v2.0 – Apache Spark
5 Vs of Big Data
• Volume: Data at rest. Terabytes to exabytes of existing data to process.
• Velocity: Data in motion. Streaming data, milliseconds to seconds to respond.
• Variety: Data in many forms. Structured, unstructured, text, and multimedia.
• Veracity: Data in doubt. Uncertainty due to data inconsistency and incompleteness, ambiguities, latency, deception, and model approximations.
• Value: Data can have different value. Not all bytes are created equal.
▪ Goal: As data scientists we want cost-effective access to the
raw materials for our data products!
![Page 65: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/65.jpg)
MACHINE LEARNING AT SCALE
![Page 66: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/66.jpg)
OSS R Limits
▪ Single core
▪ Single threaded
[Figure: Model A, Model B, and Model C on a quad-core laptop]
![Page 67: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/67.jpg)
OSS R Limits
• Single core
• Single threaded
• All in memory (RAM)
• Vectors and matrices capped at 2,147,483,647 (2^31 - 1) elements (rows) in the 32-bit version
![Page 68: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/68.jpg)
OSS R Limits: RAM
• All in memory (RAM)
Laptop Example:
Max Data Limit = Total RAM Access × 80% − Normal RAM Usage
Max Data Limit = 5.9 GB × 80% − 3.2 GB = ~1.52 GB
*R data frames actually bloat data files by ~3x
R Data Limit = ~1.52 GB ÷ 3 = ~506.7 MB
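The laptop arithmetic above can be reproduced with a short sketch. The function names are made up for this example. The 80% usable fraction and the ~3x data frame bloat factor are the rules of thumb stated on the slide, not measured constants.

```python
def max_data_limit_gb(total_ram_gb, normal_usage_gb, usable_fraction=0.80):
    """RAM left for data after reserving headroom and normal usage."""
    return total_ram_gb * usable_fraction - normal_usage_gb

def r_data_limit_gb(max_limit_gb, bloat_factor=3.0):
    """Largest raw file size, assuming data frames inflate files by bloat_factor."""
    return max_limit_gb / bloat_factor

laptop_limit = max_data_limit_gb(total_ram_gb=5.9, normal_usage_gb=3.2)
laptop_r_limit = r_data_limit_gb(laptop_limit)

print(f"Max data limit: ~{laptop_limit:.2f} GB")
print(f"R data limit:   ~{laptop_r_limit * 1000:.1f} MB")  # 1 GB taken as 1000 MB, as on the slide
```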
![Page 69: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/69.jpg)
OSS R Limits: RAM
Azure’s VM with largest RAM*:
Max Data Limit = 2000 GB × 80% − 1 GB = ~1600 GB
R Data Limit = ~1600 GB ÷ 3 = ~533.33 GB
24x7x52 Annual Cost: $116,938.44!
*Data collected 06/07/2017
![Page 70: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/70.jpg)
Machine Learning Scaling
Programs
• Excel
Programming
• R
• Python
• SAS
Cloud
• Azure ML
• AWS ML
• Big ML
• Cloud Virtual Machines
Distributed
• Hadoop
• Spark
• H2O
• Microsoft R Server
This only gets us so far! Big data scale!
![Page 71: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/71.jpg)
DISTRIBUTED COMPUTING WITH APACHE HADOOP
![Page 72: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/72.jpg)
Turn Back The Clock, The Mainframe
• “Big Iron”
• Backbone of computing for decades.
• Still widely used.
• “Scale-up” model of shared computing.
• Core platform is cost effective, ecosystem is not (e.g., software licensing).
• The original VM host!
![Page 73: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/73.jpg)
Distributed Computing
![Page 74: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/74.jpg)
Cloud Computing
• Conceptually – a combination of mainframe and distributed computing.
• VM hosts are now the “Big Iron”.
• Many VMs work together to distribute workloads.
• Some workloads on dedicated HW (e.g., SAP HANA).
![Page 75: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/75.jpg)
Scaling Computational Power
Old Scaling:
▪ Vertical Scaling, Scaling UP
▪ High performance computers
New Scaling:
▪ Horizontal Scaling, Scaling OUT
▪ Commodity hardware, distributed
![Page 76: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/76.jpg)
What is Hadoop?
• OSS Platform for distributed computing over Internet-scale data.
• Originally built at Yahoo!
• Implementation of ideas (e.g., MapReduce) published by Google.
• The de facto standard big data platform.
• Named after a stuffed animal belonging to Doug Cutting’s son.
![Page 77: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/77.jpg)
Hadoop at Base
Distributed batch processing engine for big data.
• Storage
• Compute
![Page 78: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/78.jpg)
HDFS & MapReduce
60 GB of Tweets
1 Computer (60 GB)
Processing: 30 hours
![Page 79: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/79.jpg)
HDFS & MapReduce
60 GB of Tweets
2 Computers (30 GB each)
Processing: 15 hours
![Page 80: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/80.jpg)
HDFS & MapReduce
60 GB of Tweets
3 Computers (20 GB each)
Processing: 10 hours
![Page 81: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/81.jpg)
Most Cases, Linear Scaling Of Processing Power

| Number of Computers | Processing Time (hours) |
|---|---|
| 1 | 30 |
| 2 | 15 |
| 3 | 10 |
| 4 | 7.5 |
| 5 | 6 |
| 6 | 5 |
| 7 | 4.29 |
| 8 | 3.75 |
| 9 | 3.33 |
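The table follows from dividing the single-machine time by the number of computers. A minimal sketch, assuming ideal scaling with no coordination overhead (real clusters usually fall somewhat short of this); the function name is made up for the example:

```python
def processing_time_hours(single_machine_hours, num_computers):
    """Ideal processing time when the work splits evenly across machines."""
    return single_machine_hours / num_computers

# 30 hours on one computer, as in the tweet-processing example.
times = {n: round(processing_time_hours(30, n), 2) for n in range(1, 10)}

for n, hours in times.items():
    print(n, hours)
```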
![Page 82: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/82.jpg)
If dogs were servers…
Head Node (NameNode)
Data Nodes
![Page 83: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/83.jpg)
HDFS Partitioning
[Figure: a file in HDFS split into Partition 1, Partition 2, and Partition 3, stored across DataNode 1, DataNode 2, and DataNode 3]
![Page 84: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/84.jpg)
HDFS Redundancy
[Figure: blocks B1, B2, O1, O2, G1, and G2 replicated across DataNode 1, DataNode 2, and DataNode 3]
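To make the redundancy idea concrete, here is a toy sketch that places each block on two different nodes and checks that losing any single node leaves every block readable. The round-robin placement and the replication factor of 2 are assumptions for illustration only. Real HDFS defaults to a replication factor of 3 and uses its own placement policy.

```python
def place_blocks(blocks, nodes, replication=2):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    placement = {node: [] for node in nodes}
    for i, block in enumerate(blocks):
        for r in range(replication):
            node = nodes[(i + r) % len(nodes)]
            placement[node].append(block)
    return placement

def survives_node_loss(placement, blocks, lost_node):
    """True if every block is still held by at least one remaining node."""
    remaining = set()
    for node, held in placement.items():
        if node != lost_node:
            remaining.update(held)
    return remaining == set(blocks)

blocks = ["B1", "B2", "O1", "O2", "G1", "G2"]
nodes = ["DataNode 1", "DataNode 2", "DataNode 3"]
placement = place_blocks(blocks, nodes)

for node in nodes:
    print(node, placement[node], "| cluster survives its loss:",
          survives_node_loss(placement, blocks, node))
```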
![Page 85: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/85.jpg)
MapReduce – Sandwich Analogy
![Page 86: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/86.jpg)
Limitations with MapReduce
• Lot of code to perform the simplest task
• Slow
• Troubleshooting multiple computers
• Good devs are scarce
• Expensive certifications
![Page 87: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/87.jpg)
Hive
• Abstraction built on top of MapReduce & HDFS.
• Makes Hadoop look like an RDBMS (e.g., coding in SQL).
• Developed by Facebook to democratize Hadoop.
• Applies structure to data at runtime (“schema on read”).
![Page 88: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/88.jpg)
Hive Jobs
HiveQL Statement → Translation & Conversion → MapReduce Job
![Page 89: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/89.jpg)
Word Count Revisited
MapReduce vs. HiveQL:
SELECT word, COUNT(*) AS word_count
FROM words
GROUP BY word;
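For comparison with the HiveQL query, the sketch below walks through the same word count as explicit map, shuffle, and reduce phases. It runs on a single machine in plain Python and is only an illustration of the pattern, not Hadoop's actual API. The function names and sample lines are assumptions for the example.

```python
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Hypothetical input; on a cluster each line could live on a different data node.
lines = ["big data big models", "data at scale"]
pairs = [pair for line in lines for pair in map_phase(line)]
word_count = reduce_phase(shuffle_phase(pairs))

print(word_count)
```

Even in this simplified form, the MapReduce version needs three functions where HiveQL needs one statement.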
![Page 90: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/90.jpg)
Caution
SELECT * FROM ANYTHING: This brings back everything. Everything
doesn’t fit on a single computer.
JOIN: Join will take hours or days to perform and eat up all cluster
bandwidth for everyone else trying to use it in the queue.
ORDER BY: Sorting is very computationally expensive.
Sub Queries: A subquery essentially creates a secondary table, which will be huge in Hive.
Interactivity: SQL in a DBMS is interactive because it is almost instantaneous. Hive queries run as batch MapReduce jobs and are not.
![Page 91: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/91.jpg)
Hadoop Implementations
• HDInsight
![Page 92: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/92.jpg)
Hadoop in Azure
Compute
• HDInsight
Storage (HDFS)
• Blob Storage
• Azure Data Lake Store
![Page 93: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/93.jpg)
Mahout
• Distributed Machine Learning platform.
• Built on top of MapReduce and HDFS.
• Script-based and command line interfaces.
• R-like language implementation.
![Page 94: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/94.jpg)
Distributed Random Forest
[Figure: HDFS partitioning. Partition 1, Partition 2, and Partition 3 stored across Data Node 1, Data Node 2, and Data Node 3]
![Page 95: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/95.jpg)
Distributed Random Forest
[Figure: data shuffle across Data Node 1, Data Node 2, and Data Node 3]
![Page 96: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/96.jpg)
Distributed Random Forest

| Node | Bag | Model |
|---|---|---|
| Node A | Bag 1 | Decision Tree A |
| Node B | Bag 2 | Decision Tree B |
| Node C | Bag 3 | Decision Tree C |
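The bag-per-node idea can be sketched on one machine as follows. Each "node" draws a bootstrap bag, trains its own model, and the forest predicts by majority vote. To keep the example short, a one-feature threshold stump stands in for a full decision tree. The toy rows, labels, and all function names are assumptions for illustration, not Mahout's implementation.

```python
import random
from collections import Counter

def bootstrap_bag(rows, rng):
    """Sample len(rows) rows with replacement (one bag)."""
    return [rng.choice(rows) for _ in rows]

def train_stump(bag):
    """Stand-in for a decision tree: split one feature at the bag's mean value."""
    threshold = sum(x for x, _ in bag) / len(bag)
    above = [label for x, label in bag if x > threshold]
    below = [label for x, label in bag if x <= threshold]
    # Fall back to the bag's overall majority label if one side is empty.
    overall = Counter(label for _, label in bag).most_common(1)[0][0]
    label_above = Counter(above).most_common(1)[0][0] if above else overall
    label_below = Counter(below).most_common(1)[0][0] if below else overall
    return threshold, label_below, label_above

def predict(stump, x):
    threshold, label_below, label_above = stump
    return label_above if x > threshold else label_below

def forest_predict(stumps, x):
    """Majority vote across the models trained on each node."""
    votes = Counter(predict(stump, x) for stump in stumps)
    return votes.most_common(1)[0][0]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(1.0, "low"), (2.0, "low"), (3.0, "low"),
        (8.0, "high"), (9.0, "high"), (10.0, "high")]

# One bag and one model per node (A, B, C).
stumps = [train_stump(bootstrap_bag(rows, rng)) for _ in ("A", "B", "C")]
print(forest_predict(stumps, 9.5))
```

Because each model only needs its own bag, the training step parallelizes naturally across nodes. Only the small trained models have to be brought together for voting.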
![Page 97: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/97.jpg)
Processing Times - Machine Learning
Data Cleaning: Hours
Training: Days (bottleneck)
Prediction: Milliseconds
• Large scale systems are only needed for training
• Phones can use models produced by Mahout to predict new data
• After a model is trained, save the model to any IO file type and reload it where you want
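A minimal sketch of the save-and-reload step, using Python's standard `pickle` module as one possible file format. The "model" here is a hypothetical pair of linear coefficients standing in for whatever the training cluster produces. The file name and `predict` helper are also assumptions for the example.

```python
import os
import pickle
import tempfile

# Hypothetical trained parameters (stand-in for a model produced by a training cluster).
model = {"intercept": 1.5, "slope": 0.25}

def predict(m, x):
    """Prediction is cheap: one multiply and one add."""
    return m["intercept"] + m["slope"] * x

# Save the trained model to a file...
path = os.path.join(tempfile.mkdtemp(), "model.pkl")
with open(path, "wb") as f:
    pickle.dump(model, f)

# ...and reload it wherever predictions are needed.
with open(path, "rb") as f:
    reloaded = pickle.load(f)

print(predict(reloaded, 2.0))  # 2.0
```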
![Page 98: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/98.jpg)
Distributed computing v2.0 – Apache Spark
![Page 99: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/99.jpg)
What is Spark?
• “A fast and general engine for large-scale data processing.”
• Designed to incorporate the goodness of Hadoop and address Hadoop’s shortcomings.
• Can complement Hadoop via integration with both HDFS and Hive.
![Page 100: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/100.jpg)
Why Spark? Improved Perf!
▪ Up to 10x faster than Hadoop working with data from disk.*
▪ Up to 100x faster working with data stored in memory!*
* Benchmark is without Apache YARN
![Page 101: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/101.jpg)
Big Data, Faster!
Daytona GraySort Contest: Sort 100 TB of data!
Previous World Record:
• Method: Hadoop
• Yahoo!
• 72 Minutes
• 2100 Nodes
2014:
• Method: Spark
• Databricks
• 23 Minutes
• 206 Nodes
3x faster on 10x fewer machines!
Source: https://databricks.com/blog/2014/10/10/spark-petabyte-sort.html
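The "3x faster on 10x fewer machines" summary can be checked directly from the figures on the slide. The node-minutes comparison at the end is an extra derived number, added here only as a rough measure of total machine time.

```python
hadoop = {"minutes": 72, "nodes": 2100}
spark = {"minutes": 23, "nodes": 206}

speedup = hadoop["minutes"] / spark["minutes"]   # how many times faster
node_ratio = hadoop["nodes"] / spark["nodes"]    # how many times fewer machines

# Total machine time (nodes x minutes) as a rough resource measure.
node_minutes_ratio = (hadoop["minutes"] * hadoop["nodes"]) / (spark["minutes"] * spark["nodes"])

print(round(speedup, 2), round(node_ratio, 2), round(node_minutes_ratio, 1))
# 3.13 10.19 31.9
```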
![Page 102: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/102.jpg)
Conceptual Architecture
![Page 103: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/103.jpg)
Spark and Hadoop
[Figure: stack of YARN, HDFS, MapReduce, Hive, Java API, Spark SQL, Spark Streaming, and MLlib]
▪ Spark can be deployed on a Hadoop cluster and share cluster resources via YARN.
▪ Spark, however, does not require Hadoop!
![Page 104: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/104.jpg)
QUESTIONS
![Page 105: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/105.jpg)
Interpreting Findings from Machine Learning
![Page 106: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/106.jpg)
Outline
• Metrics for a machine learning problem
• Examples in health (Case Study)
• Interpretation of a metric
• Conclusion
![Page 107: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/107.jpg)
Classification
• When the number of possible predictions is finite, it is a classification problem, e.g.:
  • Benign vs. malignant tumor (2 possible predictions)
![Page 108: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/108.jpg)
Right Metrics for Classification Problems
The following are the go-to metrics for evaluating any classification problem, including health care applications.
• Precision, recall, specificity, F1 score
• ROC, AUC
• Accuracy
• Log-Loss
• Root Mean Squared Error (useful when classification is done for predictions)
• Sometimes one metric does not show the whole picture.
• Therefore, multiple metrics should be considered when evaluating classifiers
• The evaluation metrics here are for binary classification (two possible classes). However, the concepts can easily be extended to M-ary classification (M possible classes)
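As a quick reference, several of these metrics can be computed from the four confusion-matrix counts. The sketch below uses the standard definitions. The counts are hypothetical and the function name is made up for the example. It assumes every denominator is nonzero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard binary-classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)            # of predicted +ve, how many are truly +ve
    recall = tp / (tp + fn)               # of actual +ve, how many were found
    specificity = tn / (tn + fp)          # of actual -ve, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }

# Hypothetical imbalanced dataset: 10 actual +ve examples, 90 actual -ve examples.
metrics = classification_metrics(tp=8, fp=2, fn=2, tn=88)

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that accuracy (0.96) looks much better than precision, recall, and F1 (all 0.80) on this imbalanced example, which is one reason to consider several metrics together.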
![Page 109: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/109.jpg)
Interpreting Metrics for Imbalanced Datasets
• So, which model is the better one?
![Page 110: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/110.jpg)
Cost Difference Between FP (Type I) and FN (Type II) Errors
• Imagine the datasets for two applications where positive examples represent:
  • Chronic schizophrenia patients having suicidal tendencies (dataset 1)
  • Fitzpatrick scale for human skin color (dataset 2)
• Which model is better for dataset 1 and which model is better for dataset 2?
• For both models and both datasets, Log-Loss is not a useful metric because it is the same for both

| Model | Accuracy | Precision | Recall | F1 Score | AUC | Log-Loss |
|---|---|---|---|---|---|---|
| Model 1 | 0.97 | 1 | 0.83 | 0.91 | 0.85 | 0.2 |
| Model 2 | 0.94 | 0.75 | 1 | 0.86 | 0.8 | 0.2 |
![Page 111: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/111.jpg)
Cost of FP Vs. FN for Dataset 1
• An FP is a patient who is not actually chronically schizophrenic but is wrongly classified as schizophrenic
• An FN is a patient who actually is chronic but is wrongly classified as -ve
• The costs associated with FP and FN are very skewed.
  • An FN costs far more than an FP. An FP costs a few more tests, while an FN can cost a human life
• For dataset 1, the goal should be to minimize FN. That is, the model with the higher Recall should be preferred for dataset 1
• Since Model 2 has the higher Recall, it is preferred over Model 1
![Page 112: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/112.jpg)
FP Vs. FN for Dataset 2
• Dataset 2 is assumed to be for the Fitzpatrick scale for human skin color. There are 6 possible predictions in {Type I, Type II, …, Type VI}
• For simplicity, let’s assume we are interested in distinguishing Type I from the rest
• The Fitzpatrick scale is useful for skincare (and the cosmetics industry)
![Page 113: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/113.jpg)
FP Vs. FN for Dataset 2
• An FP for this dataset is a patient who does not actually have a Type I skin tone but is wrongly classified as Type I
• An FN is a patient who actually has a Type I skin tone but is wrongly classified as another type
• The costs associated with FP and FN are similar
• For such datasets the goal should be to minimize both FN and FP
• Hence, the model with the higher Accuracy together with the higher AUC should be given more weight
• Therefore, Model 1 is the better classifier when the cost of error for FP and FN is symmetric
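One way to see why the preferred model flips between the two datasets is to attach explicit costs to each error type. The sketch below derives FP and FN counts from the Precision and Recall values in the table, then compares total cost. The count of 100 positive examples and the cost weights (1 and 50) are assumptions for illustration, not values from the slides.

```python
def error_counts(precision, recall, positives):
    """Derive FP and FN counts from precision, recall, and the number of actual +ve examples."""
    tp = recall * positives
    fn = positives - tp
    fp = tp * (1.0 / precision - 1.0)
    return fp, fn

def total_cost(precision, recall, positives, cost_fp, cost_fn):
    fp, fn = error_counts(precision, recall, positives)
    return fp * cost_fp + fn * cost_fn

# (Precision, Recall) as reported in the table.
models = {"Model 1": (1.0, 0.83), "Model 2": (0.75, 1.0)}

def cheapest(cost_fp, cost_fn, positives=100):
    """Name of the model with the lowest total error cost."""
    return min(models, key=lambda name: total_cost(*models[name], positives, cost_fp, cost_fn))

# Dataset 1 style: an FN is assumed to be 50 times as costly as an FP.
print(cheapest(cost_fp=1, cost_fn=50))  # Model 2
# Dataset 2 style: FP and FN cost the same.
print(cheapest(cost_fp=1, cost_fn=1))   # Model 1
```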
![Page 114: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/114.jpg)
Interpreting Metrics for More +ve Examples
![Page 115: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/115.jpg)
Same Cost of Type I and Type II Errors

| Model | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|
| Model 1 | 0.94 | 1 | 0.67 | 0.8 | 0.8 |
| Model 2 | 0.94 | 0.75 | 1 | 0.86 | 0.9 |
| Model 3 | 0.94 | 0.93 | 1 | 0.964 | 0.8 |
| Model 4 | 0.94 | 1 | 0.92 | 0.958 | 0.9 |

• AUC treats datasets with more +ve examples and datasets with more -ve examples the same way
• The F1 scores for Models 3 and 4 are very similar. The reason is the large number of +ve examples and the fact that the F1 score depends on (read: deteriorates with) the misclassification of +ve examples only
• For imbalanced data, F1 score and AUC are important. However, F1 score becomes more important when there are fewer +ve examples, which is quite common in healthcare
![Page 116: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/116.jpg)
Which Classifier is Better?

| Classifier | Accuracy | Precision | Recall | F1 Score | AUC |
|---|---|---|---|---|---|
| Model 1 | 0.90 | 0.87 | 0.88 | 0.875 | 0.97 |
| Model 2 | 0.91 | 0.92 | 0.83 | 0.873 | 0.96 |

• Classifiers
  • Support Vector Machines (Model 1)
  • KNN (Model 2)
• By accuracy, Model 2 is slightly better than Model 1
• As discussed, the accuracy calculation does not take into account the difference in the costs associated with FP (Type I) and FN (Type II) errors
![Page 117: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/117.jpg)
AUC
• The AUC for both classifiers is close to 1, which is desirable. That is, the TPR is high and the FPR is low for both classifiers
• By AUC, Model 1 is slightly better
• Accuracy is slightly higher for one model and AUC is slightly higher for the other, so accuracy and AUC are not helpful in this case
• So the question remains: which model is the better one?
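For intuition about what an AUC value measures, the sketch below computes it as the fraction of (+ve, -ve) pairs in which the +ve example receives the higher score, with ties counted as half. This pairwise formulation is equivalent to the area under the ROC curve. The labels and scores are hypothetical and the function name is made up for the example.

```python
def auc_score(labels, scores):
    """AUC as the probability that a random +ve example outranks a random -ve example."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier scores: one +ve example (0.4) is ranked below one -ve example (0.5).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auc_score(labels, scores)

print(round(auc, 3))  # 0.889
```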
![Page 118: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/118.jpg)
Recall, Precision, F1 Score
• An FP (an example predicted as Malignant that is actually Benign) costs the patient a few more tests and, eventually, money
• An FN (an example predicted as Benign that is actually Malignant) can cost a human life
• In this case the health care provider’s choice of model is obvious: choose the model with fewer FN. Since Recall = TP / (TP + FN), fewer FN means higher Recall
• The higher the Recall the better; that is, a bigger proportion of Malignant patients is identified correctly.
![Page 119: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/119.jpg)
Answer: The Higher Recall
• Usually, both higher Recall and higher Precision are preferred. However, in this case it is acceptable to flag some patients as FP, conduct a few more tests, and be more certain. This results in lower Precision
• Both models have similar F1 scores, and Model 2’s Precision is higher than that of Model 1
• Nevertheless, the higher Recall of Model 1 makes it the superior classifier in this example.
![Page 120: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/120.jpg)
Balanced Dataset Example: Early stroke detection and diagnosis
• Stroke is a frequent disease that has affected 500 million people worldwide
• It is the leading cause of death in China and the 5th largest in the US *
• Let’s consider the F1 score, AUC and Log-Loss as the possible evaluation metrics

| Models | F1 score | AUC | Log-Loss |
|---|---|---|---|
| Model 1 | 0.88 | 0.94 | 0.28 |
| Model 2 | 0.97 | 0.98 | 0.6 |

* Artificial intelligence in healthcare: past, present and future (https://svn.bmj.com/content/2/4/230)
![Page 121: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/121.jpg)
Interpreting findings for the case of balanced dataset
• Model 2 has the better• F1 score
• AUC
• However, as per the Log-Loss, Model 1 is the better one
• Though the data is balanced, the cost of Type I is different from that of Type II error. This gives more credence to F1 score and AUC than the Log-Loss.
• AUC can be reasonable even for inferior models. As seen in this example where it’s 0.94 for the worse and 0.98 for the better model
• Therefore, more importantly, it is the higher F1 score for Model 2 that makes more sense in this scenario.
![Page 122: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/122.jpg)
Case Study: The Henry Ford ExercIse Testing (FIT) Project
• Data and evaluation results are obtained from the journal article [1] published on April 18, 2018
• The clinical research dataset covers 23,095 patients. It was collected by the FIT project to investigate the relative performance of different classification techniques for predicting which individuals are at risk of developing hypertension, using medical records of cardiorespiratory fitness.
• The study compares the performance of six different ML models on this prediction task.
• Across different validation methods, the RTF (Random Tree Forest) model showed the best performance on the dataset (AUC = 0.93), outperforming the models of previous studies.
[1] Sakr S, Elshawi R, Ahmed A, Qureshi WT, Brawner C, et al. (2018) Using machine learning on cardiorespiratory fitness data for
predicting hypertension: The Henry Ford ExercIse Testing (FIT) Project. PLOS ONE 13(4): e0195344.
https://doi.org/10.1371/journal.pone.0195344
![Page 124: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/124.jpg)
AUC Curves for the Different Machine Learning Models using SMOTE evaluated using 10-fold cross-validation
• RTF has the best ROC curve.
• AUC appears to be the maximum
![Page 125: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/125.jpg)
The Metrics
• The RTF (Random Tree Forest) model achieves the highest AUC (0.93), F1 score (86.70%), sensitivity (69.96%) and specificity (91.71%).
• What does the higher specificity for RTF mean?
  • Remember that specificity is the true negative recognition rate: TN/(TN+FP)
  • The higher the specificity, the lower the FP
  • Hence fewer patients are tested further
• What does the higher sensitivity/recall for RTF mean?
  • Fewer FN, and hence fewer +ve patients sneak through the ML paradigm
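The two rates discussed above can be computed directly from confusion-matrix counts. The following is a small Python sketch; the counts are hypothetical and are not taken from the FIT study.

```python
def sensitivity(tp, fn):
    """Recall / true positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)

# Hypothetical counts for illustration only
tp, fn, tn, fp = 70, 30, 920, 80

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # sensitivity = 0.70
print(f"specificity = {specificity(tn, fp):.2f}")  # specificity = 0.92

# With the same number of -ve patients, fewer FP means higher specificity
print(specificity(tn=960, fp=40) > specificity(tn, fp))  # True
```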
![Page 126: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/126.jpg)
A Note for ML Enthusiasts
• The results show that a more complex machine learning model does not necessarily achieve better prediction accuracy. Simpler models can perform better in some cases.
• The results also show that it is critical to carefully explore and evaluate the performance of machine learning models using several model evaluation methods, as the prediction accuracy can differ significantly between them.
![Page 127: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/127.jpg)
When to Use a Particular Metric?
• For balanced classes, accuracy is a good metric
• AUC is a good metric for balanced data; however, it is more effective for imbalanced datasets
• If there is a dominant class (imbalanced dataset), give more importance to AUC and F1 score
• If the goal is to classify the smaller class better, irrespective of it being a +ve or a -ve class, then AUC is a good measure
• F1 score is important when the +ve class is small.
• If the application must keep FN to a minimum, favor higher recall
• If the application must keep FP to a minimum, favor higher precision
• Higher recall/sensitivity is better for identifying the +ve examples
• Higher specificity is better for identifying the -ve examples
• Though rarely used, Log-Loss is important when the absolute probabilistic difference matters, as it does in some applications
• RMSE is useful when classification algorithms are evaluated for predictions
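The advice on imbalanced data can be illustrated with a short Python sketch that compares accuracy with precision, recall, and F1 score. The data is invented: 5 +ve examples out of 100, one model that always predicts -ve, and one model that finds most +ve examples at the cost of some false alarms.

```python
def summarize(y_true, y_pred):
    """Return accuracy, precision, recall and F1 for the +ve class."""
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    tn = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical imbalanced dataset: 5 +ve and 95 -ve examples
y_true = [1] * 5 + [0] * 95
# Model that ignores the +ve class entirely
always_negative = [0] * 100
# Model that finds 4 of the 5 +ve examples but raises 6 false alarms
useful = [1, 1, 1, 1, 0] + [1] * 6 + [0] * 89

print(summarize(y_true, always_negative))
print(summarize(y_true, useful))
```

Here the always-negative model scores the higher accuracy (0.95 against 0.93) even though its recall and F1 score are zero, which is why accuracy alone is a poor guide when one class dominates.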
![Page 128: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/128.jpg)
Conclusion
• A metric that works well in one scenario may not work in another
• Metrics should be chosen based on the balance of the data
• If +ve examples are fewer, or are more important to classify, this gives credence to certain metrics over others
• In general, it is better to calculate several metrics and then decide in favor of a particular model
• Generally, a good ML algorithm strikes a good balance between Precision and Recall.
• If the difference in costs of Type I and Type II errors is large, then Precision and Recall are the preferred metrics
• For health care applications, the best scenario is when both recall and specificity are at their maximum
![Page 129: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/129.jpg)
Implementation of Policy Recommendations: Performance-Based Financing in Health
![Page 130: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/130.jpg)
Using Supervised Learning to Select Audit Targets in Performance-Based Financing in Health: An Example from Zambia
By Dhruv Grover, Sebastian Bauhoff, and Jed Friedman
![Page 131: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/131.jpg)
Setting the context
• From 2012 to 2014, Zambia operated a pilot project of performance-based financing of public health centers
• Public health centers are paid for the quantity and quality of the services they deliver
• Public health centers (covering 11% of Zambia’s population) in 10 rural districts participated
![Page 132: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/132.jpg)
The Program Improved Certain Indicators But…
• 42% of facilities over-reported at least once in the four quarters measured
• Financing per service incentivized both provision of the service and over-reporting
• Payment for over-reported services undermines the incentive for delivery of the service and is a waste of public resources => we need to minimize over-reporting
![Page 133: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/133.jpg)
What Measures Were in Place to Reduce Over-reporting?
• Dedicated district steering committees conducted continuous internal verification by reconciling the facility-reported information with the paper-based evidence
• An independent third party conducted a one-off external verification process after two years of program operation (costing $22.5k)
![Page 134: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/134.jpg)
They Need to Target the Verification So it Identifies Over-reporting Facilities But Isn’t Cost Prohibitive
• The aim of the external verification is to minimize over-reporting whilst minimizing verification costs
• You could independently verify every single facility => this would completely eliminate over-reporting BUT this is probably cost prohibitive
• You could choose not to verify any facilities BUT there is likely to be a substantial amount of over-reporting, which may worsen over time as practitioners realise they can take advantage of the lack of verification
![Page 135: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/135.jpg)
How to Identify Facilities That are Likely to Have Over-reported?
• It is possible to identify over-reporting using random sampling or machine learning
• They predict over-reporting, defined as:
  • 1 if the difference between the reported and verified data is > 10% of the reported value
  • 0 otherwise
• Using the following input features:
  • Reported and verified values for the 9 quantity measures rewarded in the PBF program;
  • Control variables
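The target definition above can be written as a small labeling function. This is a sketch in Python, not the authors' code: the function name, the facility names, and the numbers are hypothetical, and it assumes that "difference" means reported minus verified, so that only inflation counts as over-reporting.

```python
def over_reported(reported, verified, tolerance=0.10):
    """Return 1 if reported exceeds verified by more than
    `tolerance` (10% by default) of the reported value, else 0."""
    if reported <= 0:
        # Assumption: a facility that reported nothing cannot over-report
        return 0
    return 1 if (reported - verified) > tolerance * reported else 0

# Hypothetical (facility, reported, verified) service counts
records = [
    ("facility_a", 120, 100),  # gap of 20 is more than 10% of 120
    ("facility_b", 105, 100),  # gap of 5 is within 10% of 105
    ("facility_c", 90, 95),    # under-reported, so not flagged
]

for name, reported, verified in records:
    print(name, over_reported(reported, verified))
# facility_a 1
# facility_b 0
# facility_c 0
```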
![Page 136: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/136.jpg)
Factors to Determine Choice of ML Technique
1. What is the size of the training dataset?
2. Can features be treated as independent variables?
3. Will additional training data become available in the future and need to be incorporated into the model?
4. Is the data linearly separable?
5. Is overfitting expected to be a problem?
6. Are there any speed, performance, or memory usage requirements?
![Page 137: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/137.jpg)
The Algorithms Learn the Patterns From the Inputs That Indicate That a Facility is at Risk of Over-reporting
• The patterns are learnt only on data from the first quarter
• The models (algorithm + data + parameters) are measured on how well they identify facilities that over-report in the first quarter
• The models can then apply this learning to unseen data from subsequent quarters to predict the risk that other facilities over-report
• On this occasion, random forest outperformed all the other algorithms on all 5 metrics
![Page 138: Linear Models for Regression - World Bankpubdocs.worldbank.org/en/624301541088519856/Linear...K-Means Clustering • Measure squared distance from point to center of its cluster. =1](https://reader036.fdocuments.net/reader036/viewer/2022071010/5fc7839c78f3031de44c4fca/html5/thumbnails/138.jpg)
How Does This Change the Targeting of Verification?
• The verifiers can send audit teams to verify data at those facilities that have the highest risk of over-reporting
• They save about $800 for each clinic they do not need to inspect
• The verifiers need to periodically collect another random sample to re-train the model, so that it can catch facilities identified as low-risk that take advantage of the lack of monitoring and start over-reporting.
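A minimal Python sketch of this targeting step follows: audit the facilities with the highest predicted risk, add a small random sample from the rest for re-training, and estimate the savings from the clinics left uninspected. The function name, the risk scores, and the sample sizes are hypothetical; the $800 figure is the approximate per-clinic saving quoted above.

```python
import random

def plan_audits(risk_by_facility, n_targeted, n_random, seed=0):
    """Pick the highest-risk facilities plus a random sample of the rest."""
    ranked = sorted(risk_by_facility, key=risk_by_facility.get, reverse=True)
    targeted = ranked[:n_targeted]
    remaining = ranked[n_targeted:]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    sampled = rng.sample(remaining, min(n_random, len(remaining)))
    return targeted, sampled

# Hypothetical predicted risk of over-reporting per facility
risk = {"A": 0.91, "B": 0.15, "C": 0.67, "D": 0.08, "E": 0.42, "F": 0.05}

targeted, sampled = plan_audits(risk, n_targeted=2, n_random=1)
not_inspected = len(risk) - len(targeted) - len(sampled)

SAVING_PER_CLINIC = 800  # approximate saving per uninspected clinic, in USD

print(targeted)                           # ['A', 'C']
print(not_inspected * SAVING_PER_CLINIC)  # 2400
```

The random sample is what keeps the model honest over time: its audit results provide fresh labels from facilities the model would otherwise never check.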