Dean Abbott
Abbott Analytics, SmarterHQ
KNIME Fall Summit 2018
Email: [email protected]
Twitter: @deanabb
© Abbott Analytics 2001-2018
Doing the Data Science Dance
Data Science vs. Other Labels
Google Trends
What do Predictive Modelers do? The CRISP-DM Process Model
• CRoss-Industry Standard Process Model for Data Mining
• Describes Components of Complete Data Mining Cycle from the Project Manager’s Perspective
• Shows Iterative Nature of Data Mining
[Figure: CRISP-DM cycle of Business Understanding, Data Understanding, Data Preparation, Modeling, Evaluation, and Deployment, centered on Data]
What we Want to Do!
How The Citizen Data Scientist Will Democratize Big Data
Published on April 6, 2016
Retailer Sears, for example, recently empowered 400 staff from its business intelligence (BI) operations to carry out advanced, Big Data driven customer segmentation – work which would previously have been carried out by specialist Big Data analysts, probably with PhDs.
Is it a Recipe?
Can we apply a recipe to machine learning and data science modeling processes?
Good Set of Data Prep Steps!
https://www.knime.org/files/knime_seventechniquesdatadimreduction.pdf
Data Preparation Dependencies
• Fill missing values
• Explode categorical variables
• *Outliers and scale very influential
• Sometimes automatic in software; beware of how!
• Categoricals are fine
• Numeric data must be binned (except some decision trees)
• Outliers don’t matter
• Missing values a category
Why Are Outliers a Problem?
Squares…
Linear Regression: Mean Squared Error
https://en.wikipedia.org/wiki/Mean_squared_error
K-Means Clustering: Euclidean Distance
https://en.wikipedia.org/wiki/Euclidean_distance
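To make the "Squares…" point concrete, here is a minimal sketch using made-up residuals (hypothetical values, not data from the talk). It shows how a single large error dominates a mean squared error; squared Euclidean distance in K-Means behaves the same way.

```python
# Illustrative residuals (hypothetical values, not from the talk):
# nine small errors and one outlier.
residuals = [1.0] * 9 + [10.0]

squared = [r ** 2 for r in residuals]
mse = sum(squared) / len(squared)

# Share of the total squared error contributed by the single outlier.
outlier_share = squared[-1] / sum(squared)

print(f"MSE = {mse:.1f}")
print(f"Outlier share of squared error = {outlier_share:.0%}")
```

In this toy case one record out of ten accounts for more than 90% of the squared error, so the fitted line or the cluster center is pulled toward it.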
Effect of Outliers on Correlations (and Regression)
• 4,843 records
Corresponds to R^2 increase from 0.42 to 0.53
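For a one-variable regression, R^2 is the square of the Pearson correlation, so R^2 values of 0.42 and 0.53 correspond to |r| of roughly 0.65 and 0.73. The sketch below illustrates the sensitivity on synthetic data; every number in it is invented for illustration and is unrelated to the 4,843-record data set.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation computed from first principles."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

random.seed(0)
# Synthetic, linearly related data (hypothetical).
xs = [random.uniform(0, 10) for _ in range(200)]
ys = [2 * x + random.gauss(0, 2) for x in xs]
r_clean = pearson(xs, ys)

# Add two extreme records that do not follow the trend.
xs_out = xs + [100.0, 120.0]
ys_out = ys + [0.0, 5.0]
r_outliers = pearson(xs_out, ys_out)

print(f"r without outliers: {r_clean:.2f}, R^2 = {r_clean ** 2:.2f}")
print(f"r with outliers:    {r_outliers:.2f}, R^2 = {r_outliers ** 2:.2f}")
```

Two records out of 202 are enough to change the correlation substantially, because each contributes a very large squared deviation.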
Decision Trees Can Handle it
Effect of Distance on Clusters
Log transform the heavily skewed fields
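A minimal sketch of the effect, assuming a right-skewed, non-negative field. The data are synthetic. Base 10 is used because field names later in the deck end in `_log10`. The `+ 1` offset for zero values is an assumption rather than something the slides state.

```python
import math
import random

def skewness(values):
    """Population skewness: third standardized moment."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

random.seed(1)
# Hypothetical right-skewed field (log-normal), e.g. a gift amount.
raw = [random.lognormvariate(3.0, 1.0) for _ in range(5000)]

# log10(x + 1): the +1 keeps zero values defined (an assumed convention).
logged = [math.log10(v + 1.0) for v in raw]

print(f"skew before: {skewness(raw):.2f}")
print(f"skew after:  {skewness(logged):.2f}")
```

After the transform the long right tail is compressed, so extreme records no longer dominate squared-error or distance calculations.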
Dummy Vars
Note: stdevs are typically near 0.5 (the maximum for a 0/1 variable)
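The 0.5 figure follows from the formula for a 0/1 variable: if a fraction p of the records are 1, the standard deviation is sqrt(p(1 - p)), which peaks at 0.5 when p = 0.5. A small sketch (the function name is illustrative):

```python
import math

def dummy_stdev(p):
    """Population standard deviation of a 0/1 variable with P(1) = p."""
    return math.sqrt(p * (1.0 - p))

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    print(f"p = {p:.2f} -> stdev = {dummy_stdev(p):.3f}")
```

A z-scored continuous field has a standard deviation of 1, so an unscaled dummy variable sits on a different footing in any distance-based algorithm.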
Try K-Means with Different Normalization Approaches
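The transcript does not preserve which normalizations were compared, so the sketch below assumes two common ones, z-score and min-max scaling, applied to a hypothetical field with one large value.

```python
import math

def z_score(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def min_max(values):
    """Rescale linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical field with one large value.
amounts = [5.0, 10.0, 15.0, 20.0, 200.0]

print([round(v, 2) for v in z_score(amounts)])
print([round(v, 2) for v in min_max(amounts)])
```

Both methods remove the raw magnitude difference between fields, but neither removes the outlier: with min-max scaling the four ordinary values are squeezed near 0, which is one reason to de-skew before scaling.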
K-Means Clustering: Magnitude and Dummy Bias
Measurements are F Statistic
PCA: Natural Units
PCA: Scaled Units
PCA: Scaled and Dummy Scaling
Missing Value Imputation
• Delete the record (row), or delete the field (column)
• Replace with a constant
• Replace missing value with mean, median, or distribution
• Replace missing with random self-substitution
• Surrogate Splits (CART)
• Make missing a category
• Simple for “rule-based” algorithms; Turn continuous into categorical for numeric algorithms
• Replace the missing value with an estimate
• Select value from another field having high correlation with variable containing missing values
• Build a model with variable containing missing values as output, and other variables without missing values as inputs
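Two of the simpler options above, sketched on a hypothetical column (the field name and values are invented): filling with the median of the observed values, and keeping a 0/1 flag so that "missing" can still act as its own category.

```python
import statistics

# Hypothetical column with missing entries represented as None.
ages = [34.0, None, 51.0, 29.0, None, 42.0]

observed = [v for v in ages if v is not None]
fill_mean = statistics.mean(observed)
fill_median = statistics.median(observed)

# Option: replace missing with the median of the observed values.
imputed = [fill_median if v is None else v for v in ages]

# Option: keep a 0/1 flag so "missing" can still act as its own category.
was_missing = [1 if v is None else 0 for v in ages]

print(imputed)
print(was_missing)
```

Keeping the flag alongside the imputed value lets a numeric algorithm use both the filled-in number and the fact that it was missing.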
CHAID Trees: Missing Values are Just Another Category
Summary
Data Preparation Step, by algorithm (columns: Linear Regression, K-NN, K-Means Clustering, PCA, Neural Networks, Decision Trees)
• Fill Missing Values: Y Y Y Y Y *
• Correlation Filtering: Y Y Y
• De-Skew (log, box-cox): Y Y Y Y
• Mitigate Outliers: Y Y Y Y * *
• Remove Magnitude Bias (Scale): Y Y Y Y *
• Remove Categorical "Dummy" Bias: Y Y Y Y
• Mitigate Categorical Cardinality Bias: -- -- -- -- -- Y
Stratify or Not to Stratify… That is the Question!?
5.1% TARGET_B = 1: unbalanced data
Comparing Logistic Regression with and without Equal Size Sampling
[Figure panels: Equal Sampling; No Stratified Sampling]
https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf
Don’t Need to Stratify With Many Algorithms
https://www.predictiveanalyticsworld.com/sanfrancisco/2013/pdf/Day2_1050_Abbott.pdf
Know the Algorithm when Developing Sampling Strategy
| Variable | Coeff. (stratified) | Std. Err. (stratified) | P>\|z\| (stratified) | Coeff. (natural, orig) | Std. Err. (natural, orig) | P>\|z\| (natural, orig) | Coeff. diff | Coeff. compare |
|---|---|---|---|---|---|---|---|---|
| RFA_2F | -0.133532984 | 0.0338 | 0.000 | -0.1563345 | 0.024 | 0.000 | 0.023 | within SE |
| D_RFA_2A | -0.163727182 | 0.1210 | 0.176 | -0.0934212 | 0.079 | 0.237 | 0.070 | within SE |
| F_RFA_2A | 0.038231571 | 0.0884 | 0.665 | 0.0357819 | 0.062 | 0.565 | 0.002 | within SE |
| G_RFA_2A | 0.316663027 | 0.1267 | 0.012 | 0.2779701 | 0.091 | 0.002 | 0.039 | within SE |
| DOMAIN2 | -0.068966948 | 0.0767 | 0.369 | -0.1169964 | 0.056 | 0.036 | 0.048 | within SE |
| DOMAIN1 | -0.266408264 | 0.0837 | 0.001 | -0.2845323 | 0.060 | 0.000 | 0.018 | within SE |
| NGIFTALL_log10 | -0.46212497 | 0.0998 | 0.000 | -0.4444304 | 0.072 | 0.000 | 0.018 | within SE |
| LASTGIFT_log10 | 0.062766545 | 0.2044 | 0.759 | 0.1813683 | 0.141 | 0.199 | 0.119 | within SE |
| Constant | 0.695770991 | 0.2785 | 0.012 | 3.5393926 | 0.194 | 0.000 | 2.844 | outside SE |
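One known reason the slopes can agree while the constant does not: for logistic regression, resampling the classes to equal size mainly shifts the intercept, by roughly the log of the original class odds. The sketch below checks that against the reported constants. It uses the 5.1% TARGET_B rate quoted earlier and assumes the stratified sample is exactly 50/50. The sign of the shift depends on which class the software treats as the event, so only its magnitude is compared.

```python
import math

target_rate = 0.051   # share of TARGET_B = 1 quoted earlier in the deck
sample_rate = 0.5     # assumed: equal-size (50/50) stratified sample

# Expected intercept shift: log of the natural odds relative to the sample odds.
natural_odds = (1 - target_rate) / target_rate
sample_odds = (1 - sample_rate) / sample_rate
expected_shift = math.log(natural_odds / sample_odds)

# Constants reported for the stratified and natural models.
const_stratified = 0.695770991
const_natural = 3.5393926
observed_shift = const_natural - const_stratified

print(f"expected shift: {expected_shift:.3f}")
print(f"observed shift: {observed_shift:.3f}")
```

The two numbers are close but not identical, which is plausible for estimates fitted on different samples. Under these assumptions, the "outside SE" constant reflects the sampling design rather than a different model.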
Input Variable Interactions
• Algorithms are mixed on interactions in theory
• Linear Regression, Logistic Regression, kNN, kMeans clustering, PCA… are main effect models
• Decision trees are greedy searchers
• Built to find interactions
• But, only if they can be found in sequence (one at a time, stepwise)
• Neural Networks find interactions well (XOR)
• Naïve Bayes finds intersections, not interactions
• Algorithms don’t always identify interactions well or well-enough in practice
Simple Interaction Function
• Two uniform variables: x and y
• 2,564 records
• if ( x*y > 0 ) return ("1");
• else return("0");
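The rule above can be sketched in Python as follows. The range [-1, 1] and the seed are assumptions, since the slide only says the two variables are uniform. The x-only baseline is a hypothetical stand-in for a main-effect model.

```python
import random

def label(x, y):
    """The slide's rule: class "1" when x and y share a sign, else "0"."""
    return "1" if x * y > 0 else "0"

random.seed(42)
# Assumed range [-1, 1]; the slide only says the variables are uniform.
records = [(random.uniform(-1, 1), random.uniform(-1, 1)) for _ in range(2564)]
labels = [label(x, y) for x, y in records]

# A main-effect rule that looks at x alone (hypothetical baseline).
acc_x_only = sum((("1" if x > 0 else "0") == t)
                 for (x, _), t in zip(records, labels)) / len(records)

# The same rule applied to the product x*y, i.e. the interaction term.
acc_product = sum((("1" if x * y > 0 else "0") == t)
                  for (x, y), t in zip(records, labels)) / len(records)

print(f"accuracy using x alone: {acc_x_only:.2f}")
print(f"accuracy using x*y:     {acc_product:.2f}")
```

Looking at either input on its own is no better than a coin flip here, while the product term separates the classes exactly. This is the XOR-style pattern that main-effect models miss.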
Four Classifiers
[Figure panels: Naïve Bayes; Decision Tree, min leaf node 50 records; Logistic Regression; Rprop Neural Net, 300 epochs]
Errors
[Figure panels: Naïve Bayes; Decision Tree, min leaf node 50 records; Logistic Regression; Rprop Neural Net, 300 epochs]
[Legend: True correct; False incorrect; False correct; True incorrect]
Don’t Build Interactions Manually*
• Too many…too many
• So what do you do?
* Except for those you know about
Automatic Interaction Detection
• Trees: build 2-level trees
• Pros: works with continuous and categoricals
• Cons: greedy, only finds one solution at a time (Battery)
• Association rules: build 2-antecedent rules
• Pros: exhaustive
• Cons: only works with categoricals
• Use the linear/logistic regression algorithm itself, loop over all 2-way interactions
• Pros: context is the model you may want to use, easy to do in R, Matlab, Python, SAS (coding)
• Cons: slow, have to code, what to do with dummies
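A minimal sketch of the "loop over all 2-way interactions" option. It is a simplification: each candidate product term is scored on its own by the R^2 of a one-variable least-squares fit, whereas the slide suggests refitting the model you intend to use with each interaction added. The inputs `a`, `b`, `c` and the target are synthetic.

```python
import itertools
import random

def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit (squared Pearson correlation)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

random.seed(7)
n = 1000
# Hypothetical inputs a, b, c; the target depends on the a*b interaction.
data = {name: [random.uniform(-1, 1) for _ in range(n)] for name in "abc"}
target = [a * b + random.gauss(0, 0.1) for a, b in zip(data["a"], data["b"])]

# Loop over every 2-way interaction and score its product term.
scores = {}
for u, v in itertools.combinations(data, 2):
    product = [p * q for p, q in zip(data[u], data[v])]
    scores[(u, v)] = r_squared(product, target)

for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

With k inputs there are k(k-1)/2 pairs to score, which is why the slide lists "slow" among the cons. Dummy variables would need a decision about which pairs are meaningful.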
Summing up what we’ve covered
Is this a Recipe?
Is it a Recipe? … YES!
Can we apply a recipe to machine learning and data science modeling processes?
Conclusions
• Know what the algorithms can do (and not do!) before deciding on data preparation
• When are data shapes and data ranges important?
• It’s not hard… it just requires some thought
• Once you know what to do, you have your recipe!