Musings of a Kaggler

By Kai Xin

Description

In this presentation at an R user group meetup, I share the various advanced techniques I used for Kaggle competitions, including: interactive visualization via Leaflet, geospatial clustering via local Moran's I, feature creation, text categorization via splitTag techniques, and ensemble modeling.

Full code can be downloaded here: https://github.com/thiakx/RUGS-Meetup
Train/test data from Kaggle: http://www.kaggle.com/c/see-click-predict-fix/data
Interactive map demo: http://www.thiakx.com/misc/playground/scfMap/scfMap.html

Transcript of Musings of a Kaggler

Page 1

Musings of a Kaggler, by Kai Xin

Page 2

I was not a good student: I skipped school, played games all day, and almost got kicked out of school.

Page 3

I play a different game now. But at the core it is the same: understand the game, devise a strategy, keep playing.

Page 4

My Overall Strategy

Page 5

Every piece of data is unique, but some data is more important than others.

Page 6

It is not about the tools or the model or the stats. It is about the steps to put everything together.

Page 7

The Kaggle Competition

Page 8

https://github.com/thiakx/RUGS-Meetup

Remember to download the data from the Kaggle competition and put it here.

Page 9

First look at the data

223,129 rows

Page 10

First look at the data

Annotations on the data screenshot:
❖ Plot on map?
❖ Not really free text? Some repeats
❖ Need to predict these
❖ Related to summary / description?

Page 11

Understand the data via visualization

Graph by Ryn Locar

Page 12

Visualize the data: interactive maps (LeafletR demo)

Map panels: Oakland, Chicago, New Haven, Richmond

http://www.thiakx.com/misc/playground/scfMap/scfMap.html

Page 13

Step 1: Draw Boundary Polygon

Step 2: Create Base (each hex 1 km wide)

Step 3: Point-in-Polygon Analysis

Step 4: Local Moran's I

Page 14

Obtain Boundary Polygon Lat/Long

App can be found at: leafletMaps/latlong.html
Boundary points: leafletMaps/regionPoints.csv

Page 15

Generating Hex

Code can be found at: baseFunctions_map.R
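As a rough illustration of this step (not the author's exact code, which is in baseFunctions_map.R), a minimal sketch with the sp package, assuming "boundary" is a SpatialPolygons object for one city built from the boundary points above:

library(sp)

# cellsize is in degrees; ~0.01 degrees is roughly 1 km wide at these
# latitudes. n is ignored once cellsize is supplied.
hexCenters <- spsample(boundary, n = 500, type = "hexagonal", cellsize = 0.01)
hexGrid <- HexPoints2SpatialPolygons(hexCenters)
plot(hexGrid)  # quick visual check of the hex base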

Page 16

Point-in-Polygon Analysis

Code can be found at: 1. dataExplore_map.R
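A hedged sketch of the point-in-polygon count, assuming trainData has longitude/latitude columns and hexGrid is the grid from the previous step:

library(sp)

# over() returns, for each point, the index of the hex polygon it falls
# inside (NA if it lies outside the grid).
pts <- SpatialPoints(trainData[, c("longitude", "latitude")])
hexIdx <- over(pts, hexGrid)
hexCounts <- tabulate(hexIdx, nbins = length(hexGrid))  # issue reports per hex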

Page 17

Local Moran’s I

Code can be found at: 1. dataExplore_map.R
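A minimal sketch of the Local Moran's I computation with the spdep package, assuming hexGrid and hexCounts from the previous steps (column 5 of the localmoran() result holds the p-value; its exact name varies by spdep version):

library(spdep)

# Neighbours are hexes sharing a border; localmoran() returns one row per
# hex, with the local Moran's I statistic in column "Ii".
nb <- poly2nb(hexGrid)
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
lmi <- localmoran(hexCounts, lw, zero.policy = TRUE)
clusters <- which(lmi[, "Ii"] > 0 & lmi[, 5] < 0.05)  # hexes in significant clusters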

Page 18

LeafletR

Code can be found at: 1. dataExplore_map.R

Kx's layered demo map: leafletMaps/scfMap_kxDemoVer
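The talk uses the leafletR package; as a hedged sketch of the same idea with the newer leaflet package (assuming trainData keeps the Kaggle longitude, latitude, and summary columns):

library(leaflet)

# Plot each reported issue as a clickable dot on an OpenStreetMap base layer.
leaflet(trainData) %>%
  addTiles() %>%
  addCircleMarkers(lng = ~longitude, lat = ~latitude,
                   radius = 3, stroke = FALSE, fillOpacity = 0.5,
                   popup = ~summary)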

Page 19

In Search of the 20% Data

Page 20

[Diagram: the training data timeline split into segments, each labelled Ignore, Model, or MAD]

Page 21

In Search of the 20% Data

Detection of “Anomalies”

Can we justify this using statistics?

Page 22

Two-sample Kolmogorov–Smirnov test

ksTest <- ks.test(trainData$num_views[trainData$month == 4 & trainData$year == 2013],
                  trainData$num_views[trainData$month == 9 & trainData$year == 2012])
# D is like the distance of difference; a smaller D means the two data sets
# are probably from the same distribution

Jan '12 to Oct '12 and Mar '13 training data ignored

Page 23

What happened here?

No need to model? Just assume all Chicago data to be 0?

Chicago data generated by remote_API is mostly 0s: no need to model.

Page 24

Separate Outliers using Median Absolute Deviation (MAD)

MAD is robust and can handle skewed data, which makes it useful for identifying outliers. We separated out the data points that lie more than 3 median absolute deviations from the median.

Code can be found at: baseFunctions_cleanData.R
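A minimal sketch of the rule (the author's version is in baseFunctions_cleanData.R; madOutlier is a hypothetical helper name):

# Flags values more than cutoff MADs from the median. Base R's mad()
# rescales by 1.4826 so it is comparable to a standard deviation.
madOutlier <- function(x, cutoff = 3) {
  abs(x - median(x)) > cutoff * mad(x)
}

table(madOutlier(trainData$num_views))  # TRUEs are the separated outliers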

Page 25

[Diagram repeated: the training data split into segments labelled Ignore, Model, or MAD]

Page 26

[Diagram: the training data split into segments labelled Ignore, Model, or MAD, annotated with:]

❖ 10% of the training data is used for modeling
❖ 59% of the data is Chicago data generated by remote_API: mostly 0s, no need to model, just estimate using the median
❖ 4% of the data is identified as outliers by MAD
❖ KS test: 27% of the training data is from a different distribution

Key advantage: rapid prototyping!
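The "just estimate using the median" segment needs no model at all; a hedged sketch (the source column and its remote_api_created value follow the Kaggle data dictionary, and predViews is a hypothetical prediction vector):

# Chicago rows created via the remote API are mostly 0s, so predict a
# constant: the median of the matching training rows.
isRemote <- testData$source == "remote_api_created" & testData$city == "chicargo"
predViews[isRemote] <- median(trainData$num_views[
  trainData$source == "remote_api_created" & trainData$city == "chicargo"])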

Page 27

When you can focus on a small but representative subset of the data, you can run many, many experiments very quickly (I ran several hundred).

Page 28

Now that we have the raw ingredients prepared, it is time to make the dishes.

Page 29

Experiment with Different Models

❖ Random Forest
❖ Generalized Boosted Regression Models (GBM)
❖ Support Vector Machines (SVM)
❖ Bootstrap aggregated (bagged) linear models

How to use? Ask Google & RTFM
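As a hedged sketch of fitting the four base learners (package choices are mine: randomForest, gbm, e1071; trainSet stands for the modelling subset with num_views as the target):

library(randomForest)
library(gbm)
library(e1071)

rfMod  <- randomForest(num_views ~ ., data = trainSet, ntree = 500)
gbmMod <- gbm(num_views ~ ., data = trainSet,
              distribution = "gaussian", n.trees = 500)
svmMod <- svm(num_views ~ ., data = trainSet)

# Bagged linear models: fit lm() on bootstrap resamples, average predictions.
bagLms <- lapply(1:25, function(i) {
  idx <- sample(nrow(trainSet), replace = TRUE)
  lm(num_views ~ ., data = trainSet[idx, ])
})
bagPred <- rowMeans(sapply(bagLms, predict, newdata = testSet))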

Page 30

Or just download my code

Page 31

I don't spend time on optimizing/tuning model settings (learning rate, etc.) with cross validation. I find it really boring and really slow.

Page 32

Obsessing with tuning model variables is like being obsessed with tuning the oven

Page 33

Instead, the magic happens when we combine data and when we create new data - aka feature creation

Page 34

Creating Simple Features: City

# Each city sits at a distinct (rounded) longitude
trainData$city[trainData$longitude == "-77"]  <- "richmond"
trainData$city[trainData$longitude == "-72"]  <- "new_haven"
trainData$city[trainData$longitude == "-87"]  <- "chicargo"
trainData$city[trainData$longitude == "-122"] <- "oakland"

Code can be found at: 1. dataExplore_map.R

Page 35

Creating Complex Features: Local Moran’s I

Code can be found at: 1. dataExplore_map.R

Page 36

Creating Complex Features: Predicted View

The task is to predict views, votes, and comments; but logically, won't the number of votes and comments be correlated with the number of views?

Code can be found at: baseFunctions_model.R

Page 37

Creating Complex Features: Predicted View

Storing the predicted value of views as a new column and using it as a new feature to predict votes & comments is very risky business, but powerful if you know what you are doing.
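A hedged sketch of the idea (names are mine, not the author's). The risk is target leakage: in-sample predictions of views carry information about the training targets, so out-of-fold (or, for random forests, out-of-bag) predictions are the safer choice for the training column:

library(randomForest)

# Step 1: model num_views from the base features.
viewMod <- randomForest(num_views ~ ., data = trainSet)

# Step 2: store predicted views as a new feature column. predict() with no
# newdata returns out-of-bag predictions, which limits leakage on trainSet.
trainSet$predictedViews <- predict(viewMod)
testSet$predictedViews <- predict(viewMod, newdata = testSet)

# Step 3: reuse the new column when predicting votes (and comments).
voteMod <- randomForest(num_votes ~ . - num_views, data = trainSet)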

Page 38

Creating Complex Features: SplitTag, wordMine

Page 39

Creating Complex Features: SplitTag, wordMine

Code can be found at: baseFunctions_cleanData.R

Page 40

Adjusting Features: Simplify Tags

Code can be found at: baseFunctions_cleanData.R

Page 41

Adjusting Features: Recode Unknown Tags

Code can be found at: baseFunctions_cleanData.R

Page 42

Adjusting Features: Combine Low Count Tags

Code can be found at: baseFunctions_cleanData.R

Page 43

Full List of Features Used

Code can be found at: baseFunctions_model.R

+ Num Views as the Y variable
+ Num Comments as the Y variable
+ Num Votes as the Y variable

Fed into the models to predict views, votes, and comments respectively

Page 44

Only used 1 original feature; I created the other 13 features

Code can be found at: baseFunctions_model.R

Fed into the models to predict views, votes, and comments respectively

Original Feature (1) / Created Features (13)

Page 45

An ensemble of good enough models can be surprisingly strong

Page 46

An ensemble of good enough models can be surprisingly strong

Page 47

An ensemble of the 4 base models has less error

Page 48

Each model is good for a different scenario:

❖ GBM is rock solid, good for all scenarios
❖ SVM is a counterweight: don't trust anything it says
❖ GLM is amazing for predicting comments, not so much for the others
❖ Random Forest is moderate, provides a balanced view

Page 49

Ensemble (Stacking using regression)

testDataAns   rfAns   gbmAns   svmAns   glmBagAns
        2.3     2.0      2.5      2.4         1.8
        2.0     1.8      2.2      1.7         1.6
        1.3     1.3      1.7      1.2         1.0
        1.5     1.4      1.9      1.6         1.2
        ...     ...      ...      ...         ...

glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns)
# We are interested in the coefficients

Page 50

Ensemble (Stacking using regression)

Sort and column bind the predictions from the 4 models

Run regression (logistic or linear) and obtain coefficients

Scale ensemble ratio back to 1 (100%)
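A hedged sketch of those three steps (variable names follow the slide; the author's version lives in getEnsembleRatio.r):

# Column-bind the true answers with the four models' predictions.
stackDf <- data.frame(testDataAns, rfAns, gbmAns, svmAns, glmBagAns)

# Regress the truth on the predictions; each coefficient says how much
# weight that model deserves.
stackMod <- glm(testDataAns ~ rfAns + gbmAns + svmAns + glmBagAns,
                data = stackDf)
w <- coef(stackMod)[-1]  # drop the intercept

# Scale the weights back to 1 (100%).
ensembleRatio <- w / sum(w)

# Final ensemble: weighted average of the four base model predictions.
finalPred <- as.matrix(stackDf[, -1]) %*% ensembleRatio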

Page 51

Obtaining the ensemble ratio for each model

Inside 3. testMod_generateEnsembleRatio folder

- getEnsembleRatio.r

Page 52

Ensemble is not perfect…

❖ Simple to implement? Kind of, but very tedious to update: you will need to rerun every single model every time you make any changes to the data (as the ensemble ratio may change).
❖ Easy to overfit the test data (will require another set of validation data or cross validation).
❖ Very hard to explain to business users what is going on.

Page 53

All this should get you to rank 49/532

Page 54

[Diagram recap: the training data split into segments labelled Ignore, Model, or MAD, annotated with:]

❖ 10% of the training data is used for modeling
❖ 4% of the data is identified as outliers by MAD
❖ KS test: too different from the rest of the data
❖ 59% of the data is Chicago data generated by remote_API: mostly 0s, no need to model, just estimate using the median

Key advantage: rapid prototyping!