Big Data Competition: maximizing your potential, exemplified with the 2014 Higgs Boson Machine Learning...

Post on 09-Jul-2015

Description

The Higgs Boson Machine Learning Challenge is, by far, one of the biggest big data competitions focusing on data analysis in the world. To be successful in such a competition, Cheng applied his knowledge of computer science, mathematics, statistics, and physics, together with the problem-solving discipline developed during his training in civil engineering. In this presentation, Cheng uses his experience in this competition to illustrate some important elements of big data analytics and why they matter. The presentation draws on several disciplines, such as physics, statistics, and mathematics, but no background in these areas is required to understand its essence. In brief, the presentation covers: an effective framework for general data mining projects; an introduction to the competition and its physics background; various techniques for data exploration and some traps to avoid; various ways of feature enhancement; model building and selection; and optimization of model performance.

Transcript of Big Data Competition: maximizing your potential, exemplified with the 2014 Higgs Boson Machine Learning...

BIG DATA COMPETITION: MAXIMIZING YOUR POTENTIAL

EXEMPLIFIED WITH THE 2014 HIGGS BOSON MACHINE LEARNING CHALLENGE

Dr. Cheng CHEN  email: cchen@goDCI.com

twitter: @cheng_chen_us

Development Consulting International LLC

goDCI.com

1

This presentation is copyright protected ©

Ohio State University, Tongji University

Ph.D. Civil Engineering

M.S. Applied Statistics

Minor Computer Science

Advanced training:

City and Regional Planning

Industrial and Systems Engineering

Mathematics

Passion: (this) machine learning

PRESENTER

2

• Goal: improve the procedure that produces the selection region for the Higgs boson

• 4-month duration

• 1,785 teams

• Many machine learning experts, statisticians, and physicists

• The top 5 came from 5 different countries

HIGGS BOSON MACHINE LEARNING CHALLENGE

3

Netherlands

Hungary

France

Russia

U.S.A./China

http://www.kaggle.com/c/higgs-boson/leaderboard

Background

4

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate] ©


READ  AND  DISCUSS

6

• a.k.a. the God Particle (explains some mass)

• A fundamental particle theorized in 1964 in the Standard Model of particle physics

• "Considered" discovered in 2011–2013 at the LHC by CERN

• A number of prestigious awards in 2013, including a Nobel Prize

HIGGS BOSON

7

http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg

A "definitive" answer might require "another few years" after the collider's 2015 restart.

(deputy chair of physics at Brookhaven National Laboratory)

http://en.wikipedia.org/wiki/Higgs_boson

• Established in 1954

• Birthplace of the World Wide Web (1989)

CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH

8

maps.google.com

• 27 km (17 mi) in circumference

• 175 meters (574 ft) beneath the ground

• Built from 1998 to 2008

• Over 10,000 scientists and engineers

• Over 100 countries

• Seven particle detectors

LARGE HADRON COLLIDER (LHC)

9

https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference

http://en.wikipedia.org/wiki/Large_Hadron_Collider

• 46 meters long

• 25 meters in diameter

• Weighs about 7,000 tonnes

• Contains some 3,000 km of cable

• Involves roughly 3,000 physicists from over 175 institutions in 38 countries

ATLAS

10

http://en.wikipedia.org/wiki/Large_Hadron_Collider

http://higgsml.lal.in2p3.fr/documentation/

• The Higgs boson cannot be measured directly (it decays immediately into lighter particles)

• Other particles can decay into the same set of lighter particles

• The PRODUCTION and DECAY of the Higgs boson depend on its mass, which was not predicted by theory (we now know it is close to 125 GeV)

CHALLENGES IN DETECTION OF THE HIGGS BOSON

13

https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf

Seeing a circular shadow does not mean the real object is a sphere

• Raw data collected from the LHC

• Hundreds of millions of proton-proton collisions (events) per second

• 400 events of interest are selected per second

– Signal events (i.e., Higgs boson)

– Background events (i.e., other particles)

• Events in an ad hoc selection region (in certain channels) exceeding background noise

CURRENT DETECTION MECHANISM

14

The selection criteria need improvement in significance and robustness

• Simulated data

• Fixed mass (125 GeV)

• Simplified decay channel

– Next slide

• Simplified background events (three representative types only)

– Decay of the Z boson (91.2 GeV) into tau-tau

– Decay of a pair of top quarks into a lepton and a hadronic tau

– "Decay" of the W boson into a lepton and a hadronic tau, due to imperfections in the particle identification procedure

• Simplified objective function (significance score)

SIMPLIFICATIONS FOR COMPETITION

15

• Decay through the tau-tau channel only

• One tau decays into a lepton and two neutrinos

• The other tau decays into a hadronic tau and a neutrino

• (Note: neutrinos cannot be detected)

SIMPLIFIED DECAY CHANNEL

16

hadronic tau: a bunch of hadrons


• Decay through the tau-tau channel only

• One tau decays into a lepton and two neutrinos

• The other tau decays into a hadronic tau and a neutrino

• (Note: neutrinos cannot be detected)

SIMPLIFIED DECAY CHANNEL

18

Jets, MET: vectorized momenta are given

hadronic tau: a bunch of hadrons

Background

19

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate] ©

• 250,000 training events

• 550,000 test events

• 30 variables

– 17 primitive (momenta, direction)

– 13 derived

DATA DIMENSION

20

First 4 rows of the training data

EventId  DER_mass_MMC  DER_mass_transverse_met_lep  DER_mass_vis  DER_pt_h  DER_deltaeta_jet_jet  DER_mass_jet_jet  DER_prodeta_jet_jet  DER_deltar_tau_lep  DER_pt_tot  DER_sum_pt
100000   138.47        51.655                       97.827        27.98     0.91                  124.711           2.666                3.064               41.928      197.76
100001   160.937       68.768                       103.235       48.146    NA                    NA                NA                   3.473               2.078       125.157
100002   NA            162.172                      125.953       35.635    NA                    NA                NA                   3.148               9.336       197.814
100003   143.905       81.417                       80.943        0.414     NA                    NA                NA                   3.31                0.414       75.968

EventId  DER_pt_ratio_lep_tau  DER_met_phi_centrality  DER_lep_eta_centrality  PRI_tau_pt  PRI_tau_eta  PRI_tau_phi  PRI_lep_pt  PRI_lep_eta  PRI_lep_phi  PRI_met
100000   1.582                 1.396                   0.2                     32.638      1.017        0.381        51.626      2.273        -2.414       16.824
100001   0.879                 1.414                   NA                      42.014      2.039        -3.011       36.918      0.501        0.103        44.704
100002   3.776                 1.414                   NA                      32.154      -0.705       -2.093       121.409     -0.953       1.052        54.283
100003   2.354                 -1.285                  NA                      22.647      -1.655       0.01         53.321      -0.522       -3.1         31.082

EventId  PRI_met_phi  PRI_met_sumet  PRI_jet_num  PRI_jet_leading_pt  PRI_jet_leading_eta  PRI_jet_leading_phi  PRI_jet_subleading_pt  PRI_jet_subleading_eta  PRI_jet_subleading_phi  PRI_jet_all_pt
100000   -0.277       258.733        2            67.435              2.15                 0.444                46.062                 1.24                    -2.475                  113.497
100001   -1.916       164.546        1            46.226              0.725                1.158                NA                     NA                      NA                      46.226
100002   -2.186       260.414        1            44.251              2.053                -2.028               NA                     NA                      NA                      44.251
100003   0.06         86.062         0            NA                  NA                   NA                   NA                     NA                      NA                      0

EventId  Weight            Label
100000   0.00265331133733  s
100001   2.23358448717     b
100002   2.34738894364     b
100003   5.44637821192     b

Data loaded correctly. Notice the NA values.
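A sketch of this read step in R (assuming the Kaggle file training.csv, in which missing values are coded as -999.0 before appearing as NA here):

library(data.table)
train <- fread("training.csv")          # 250,000 rows, 33 columns
# convert the -999.0 sentinel (Kaggle's convention, not shown on the slide) to NA
for (col in names(train)) {
  if (is.numeric(train[[col]]))
    set(train, i = which(train[[col]] == -999.0), j = col, value = NA_real_)
}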

MISSING VALUES

21

   col_name                      NA_count   NA_pct
1  EventId
2  DER_mass_MMC                    38,114      15%
3  DER_mass_transverse_met_lep
4  DER_mass_vis
5  DER_pt_h
6  DER_deltaeta_jet_jet           177,457      71%
7  DER_mass_jet_jet               177,457      71%
8  DER_prodeta_jet_jet            177,457      71%
9  DER_deltar_tau_lep
10 DER_pt_tot
11 DER_sum_pt
12 DER_pt_ratio_lep_tau
13 DER_met_phi_centrality
14 DER_lep_eta_centrality         177,457      71%
15 PRI_tau_pt
16 PRI_tau_eta
17 PRI_tau_phi
18 PRI_lep_pt
19 PRI_lep_eta
20 PRI_lep_phi
21 PRI_met
22 PRI_met_phi
23 PRI_met_sumet
24 PRI_jet_num
25 PRI_jet_leading_pt              99,913      40%
26 PRI_jet_leading_eta             99,913      40%
27 PRI_jet_leading_phi             99,913      40%
28 PRI_jet_subleading_pt          177,457      71%
29 PRI_jet_subleading_eta         177,457      71%
30 PRI_jet_subleading_phi         177,457      71%
31 PRI_jet_all_pt
32 Weight
33 Label

MISSING VALUES

22

Notice the consistency in missing values (the same counts, 177,457 and 99,913, repeat across groups of columns)

• Assign a value

– Generate a random value

– Fit a value (mean, median, nearest neighbor, etc.)

– Fix a value (domain knowledge)

• Remove the record

• Leave as is

HOW TO HANDLE MISSING VALUES

23

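A hedged R sketch of the "assign a value" and "remove the record" options, using one affected column from the table above (the derived column names are my own):

med <- median(train$DER_mass_MMC, na.rm = TRUE)
# fit a value: fill with the median
train$DER_mass_MMC_med <- ifelse(is.na(train$DER_mass_MMC), med, train$DER_mass_MMC)
# fix a value: a domain-motivated sentinel far outside the physical range
train$DER_mass_MMC_fix <- ifelse(is.na(train$DER_mass_MMC), -999, train$DER_mass_MMC)
# remove the record
train_complete <- train[!is.na(train$DER_mass_MMC), ]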

HISTOGRAM

25

Density is more meaningful in the range of x; no fuzzy jump at the edge

[Three histograms of PRI_jet_leading_pt (y-axis: count): raw, log transformation, inverse transformation]

HISTOGRAM (CONT'D)

26

Bi-modality is revealed

[Three histograms of DER_pt_h (y-axis: count): raw, log transformation, inverse transformation]
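A sketch of this transformation check in R (assuming train from the read step above; NAs and zeros are dropped so the log and inverse are defined):

x <- train$DER_pt_h
x <- x[!is.na(x) & x > 0]
par(mfrow = c(1, 3))
hist(x,      breaks = 100, main = "raw",     xlab = "DER_pt_h")
hist(log(x), breaks = 100, main = "log",     xlab = "log(DER_pt_h)")
hist(1 / x,  breaks = 100, main = "inverse", xlab = "1/DER_pt_h")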

INTERACTIVE VISUALIZATION: R SHINY

27

http://chencheng.shinyapps.io/demo_higgsDEMO


INTERACTIVE VISUALIZATION: R SHINY

29

Use  a  reasonable  number  of  bins  to  display  the  underlying  distribution

http://chencheng.shinyapps.io/demo_higgsDEMO

INTERACTIVE VISUALIZATION: R SHINY

30

Use  a  reasonable  transformation  to  display  the  underlying  distribution

http://chencheng.shinyapps.io/demo_higgsDEMO
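A minimal Shiny sketch of the idea on these slides (a bin slider plus a transformation selector); this is my own illustration, not the author's app at the URL above:

library(shiny)
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 10, max = 500, value = 100),
  selectInput("trans", "Transformation:", c("raw", "log", "inverse")),
  plotOutput("hist")
)
server <- function(input, output) {
  output$hist <- renderPlot({
    x <- train$PRI_jet_leading_pt          # assumes train is loaded as above
    x <- x[!is.na(x) & x > 0]
    x <- switch(input$trans, raw = x, log = log(x), inverse = 1 / x)
    hist(x, breaks = input$bins, main = input$trans, xlab = "PRI_jet_leading_pt")
  })
}
shinyApp(ui, server)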

HISTOGRAM (CONT'D)

31

[Histogram of PRI_tau_eta (y-axis: count)]

Transformations are sometimes not necessary

32

Do  that  for  all  30  variables

PAIRWISE CORRELATIONS

33

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_lep_phi & PRI_met_phi

PAIRWISE CORRELATIONS

34

Set the transparency parameter appropriately to reveal important patterns

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_lep_phi & PRI_met_phi

PAIRWISE CORRELATIONS

35

Correlation coefficient == 0 does not mean no correlation

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_lep_phi & PRI_met_phi


FEATURE ENHANCEMENT: ROTATION

37

Validate visual "evidence" from various perspectives

[Scatter plot of rotated PRI_lep_phi & PRI_met_phi; legend: BKG, SGN]


PAIRWISE VARIABLES — LOW RES.

39

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

PAIRWISE VARIABLES — HIGH RES.

40

Try high resolution

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

PAIRWISE VARIABLES — HIGH RES.

41

Curve fitting

[Scatter plot with fitted curve; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

FEATURE ENHANCEMENT: CURVE FITTING

42

Enhance a variable based on its correlation with another variable

[Scatter plot with fitted curve; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI

43

Domain knowledge

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & PRI_lep_phi

FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI

44

Feature enhancement by applying domain knowledge

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & PRI_lep_phi

FEATURE ENHANCEMENT: ROTATION

45

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_jet_leading_eta & PRI_jet_subleading_eta

• Select variable(s): one variable for a histogram, two for a scatter plot

DATA  DRILL  DOWN

46

http://chencheng.shinyapps.io/demo_higgsDEMO

• Dynamically  select  a  subset  of  data  —  PRI_jet_num  =  2

DATA  DRILL  DOWN

47

http://chencheng.shinyapps.io/demo_higgsDEMO

• Patterns  in  the  subset  data  —  PRI_jet_leading_eta  &  PRI_jet_subleading_eta

DATA  DRILL  DOWN

48

http://chencheng.shinyapps.io/demo_higgsDEMO

• Dynamically  select  a  subset  of  data  —  PRI_jet_num  =  3

DATA  DRILL  DOWN

49

http://chencheng.shinyapps.io/demo_higgsDEMO

• Patterns  in  the  subset  data  —  PRI_jet_leading_eta  &  PRI_jet_subleading_eta

DATA  DRILL  DOWN

50

http://chencheng.shinyapps.io/demo_higgsDEMO

• Patterns  in  the  subset  data  —  PRI_jet_leading_eta  &  PRI_jet_subleading_eta

DATA  DRILL  DOWN

51

PRI_jet_num = 2 (left) vs. PRI_jet_num = 3 (right)

Interactive  data  visualization  techniques  are  helpful

http://chencheng.shinyapps.io/demo_higgsDEMO

52

Do that for all 30 × 29 ≈ 900 pairs

PARTICLE  LOCATION  —  (0,  S)

53

Convert numerical data back into actual objects with meaning

Animation

PARTICLE  LOCATION  —  (0,  B)

54

Animation

• Distance ratio between MET-Lep and Tau-Lep

d(MET, Lep) / d(Tau, Lep)

INSPIRATION FROM ANIMATION

55

Inspiration from meaningful visualization can be helpful

[Histogram of dist_ratio_met_lep_tau (y-axis: count); legend: BKG, SGN]

• Distance ratio between MET-Lep and Tau-Lep

d(MET, Lep) / d(Tau, Lep)

INSPIRATION FROM ANIMATION

56

Adjust the visualization for better efficiency

[Two histograms of dist_ratio_met_lep_tau (y-axis: count), before and after adjustment; legend: BKG, SGN]
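One plausible construction of this feature in R. The deck does not spell out the distance, so as an assumption each object is placed at its transverse-momentum vector pt * (cos(phi), sin(phi)):

px <- function(pt, phi) pt * cos(phi)
py <- function(pt, phi) pt * sin(phi)
d  <- function(x1, y1, x2, y2) sqrt((x1 - x2)^2 + (y1 - y2)^2)

train$dist_ratio_met_lep_tau <- with(train,
  d(px(PRI_met, PRI_met_phi),    py(PRI_met, PRI_met_phi),
    px(PRI_lep_pt, PRI_lep_phi), py(PRI_lep_pt, PRI_lep_phi)) /
  d(px(PRI_tau_pt, PRI_tau_phi), py(PRI_tau_pt, PRI_tau_phi),
    px(PRI_lep_pt, PRI_lep_phi), py(PRI_lep_pt, PRI_lep_phi)))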

• Variable reduction

– Simple rotation

– Transformation

– Domain knowledge

– …

• Feature generation

– Domain knowledge

– Inspiration from various visualizations

– Statistical approaches

– …

FEATURE ENHANCEMENT

57

Examples (sketched below): principal component analysis; distance_ratio; rotation by phi; curve fitting; 45-degree rotation
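An R sketch of two of these enhancements (the wrap-to-[-pi, pi] helper and the new column names are my own illustration):

# rotation: express phi variables relative to a reference angle
rotate_phi <- function(phi, ref) {
  x <- phi - ref
  ((x + pi) %% (2 * pi)) - pi          # wrap back into [-pi, pi]
}
train$lep_phi_rot <- rotate_phi(train$PRI_lep_phi, train$PRI_tau_phi)
train$met_phi_rot <- rotate_phi(train$PRI_met_phi, train$PRI_tau_phi)

# 45-degree rotation of a correlated pair (a, b) into ((a + b), (a - b)) / sqrt(2)
train$jets_eta_sum  <- (train$PRI_jet_leading_eta + train$PRI_jet_subleading_eta) / sqrt(2)
train$jets_eta_diff <- (train$PRI_jet_leading_eta - train$PRI_jet_subleading_eta) / sqrt(2)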

Background

58

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate] ©

• Gradient boosting tree

• Neural network

• Bayesian network

• Support vector machine

• Generalized additive model

MODELS

59


• Decision tree

– Build many shallow trees

• Boosting

– Build trees based on residuals

• Bagging

– Each tree uses a subset of the data

• Ensembling

– Combine the trees

GRADIENT BOOSTING TREE

61


• Regression tree

DECISION TREE

63

[Scatter plot of y against x: noisy periodic data, x in [0, 10], y in [-1, 1]]

• Regression tree

DECISION TREE

64

[Regression tree with node depth = 1: root (mean 0.19, n = 100) splits at x < 6.614 into leaves -0.08 (n = 64) and 0.66 (n = 36); fitted step function overlaid on the scatter plot]

Depth = 1

• Regression tree

DECISION TREE

65

[Regression tree with node depth = 2: splits at x = 6.614, 3.049, 8.953; fitted step function overlaid on the scatter plot]

Depth = 2

• Regression tree

DECISION TREE

66

[Regression tree with node depth = 3: splits at x = 6.614, 3.049, 5.862, 8.953, 7.207; fitted step function overlaid on the scatter plot]

Depth = 3

• Regression tree

DECISION TREE

67

[Regression tree with node depth = 4: splits at x = 6.614, 3.049, 5.862, 3.594, 8.953, 7.207; fitted step function overlaid on the scatter plot]

Depth = 4
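A sketch reproducing this toy example with rpart, assuming the data is a noisy sine curve (which matches the plots; the exact split points such as 6.614 will differ by seed):

library(rpart)
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
for (depth in 1:4) {
  fit <- rpart(y ~ x, control = rpart.control(maxdepth = depth, minsplit = 2, cp = 0))
  plot(x, y, main = paste("Depth =", depth))
  lines(sort(x), predict(fit)[order(x)], type = "s")   # fitted step function
}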

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add

DECISION TREE

68

base model

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add

GRADIENT BOOSTING TREE (V. 1)

69

get the residuals

fit a tree for the residuals

additive model

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add

(STOCHASTIC) GRADIENT BOOSTING TREE

70

get sampled index

sampled records as input

store input

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - wts * latest_model(X);
    tree_add = train_tree(X, v_resid, wts);
    latest_model += LEARNING_RATE * tree_add

(STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT

71

X0 = X; Y0 = Y;
latest_model = train_base_model(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
    model_add_base = train_base_model(X, v_pseudo_resid, wts);
    alpha = linear_search(cost_function, model_add_base, X, Y, wts);
    latest_model += LEARNING_RATE * (alpha * model_add_base)

(GENERAL) GRADIENT BOOSTING

72

[Stochastic Gradient Boosting] Jerome H. Friedman, 1999
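A runnable R translation of the stochastic, squared-error version above, with rpart trees as the base learner (names mirror the pseudocode; an illustrative sketch, not the author's code):

library(rpart)
gbt <- function(X, Y, NUM_ITER = 100, FRAC_TRAIN = 0.7, LEARNING_RATE = 0.1, depth = 4) {
  pred  <- rep(mean(Y), nrow(X))       # constant base model, for simplicity
  trees <- vector("list", NUM_ITER)
  for (ii in seq_len(NUM_ITER)) {
    idx <- sample(nrow(X), floor(FRAC_TRAIN * nrow(X)))        # bagging
    d   <- data.frame(X[idx, , drop = FALSE],
                      v_resid = Y[idx] - pred[idx])            # residuals
    trees[[ii]] <- rpart(v_resid ~ ., data = d,
                         control = rpart.control(maxdepth = depth, cp = 0))
    pred <- pred + LEARNING_RATE * predict(trees[[ii]], as.data.frame(X))
  }
  list(base = mean(Y), trees = trees, rate = LEARNING_RATE)
}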

Background

73

Data

Model

Understand

Explore Enhance

Train Select Optimize

read

visualize

reduce

generate

innovateapply

fine-­‐tune

read

discuss

Validate

find

cross  validate

©

gbm_model = gbm.fit(
    x = train[, x_vars, with = FALSE],
    y = train$Label,
    distribution = char_distr,
    w = w,
    n.trees = n_trees,
    interaction.depth = num_inter,
    n.minobsinnode = min_obs_node,
    shrinkage = shrinkage_rate,
    bag.fraction = frac_bag)

APPLYING GBM IN R

74
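A scoring sketch to go with it; test, threshold, and the submission layout are assumptions following the next slides and the Kaggle format (predict.gbm is the standard gbm interface):

scores <- predict(gbm_model, newdata = test[, x_vars, with = FALSE],
                  n.trees = n_trees, type = "response")
submission <- data.frame(EventId   = test$EventId,
                         RankOrder = rank(scores, ties.method = "first"),
                         Class     = ifelse(scores > threshold, "s", "b"))
write.csv(submission, "submission.csv", row.names = FALSE)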

VARIABLE IMPORTANCE

75

[Bar chart of relative variable importance]

APPLY MODEL ON TEST DATA

76

EventId   Score   RankOrder   Class
1         0.98    501         s
2         0.42    259,579     b
3         0.46    264,125     b
...
449,998   0.86    31,154      s
449,999   0.12    489,251     b
550,000   0.79    110,154     b

Background

77

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

• Number of iterations

• Minimum observations per node

• Bagging fraction (0.5 ~ 0.8)

• Learning rate (< 0.1)

• Tree depth (4 ~ 8)

GRADIENT BOOSTING PARAMETERS

78
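A sketch of a small tuning grid over the ranges above; each row would be passed to gbm.fit and scored on the 30% cross-validation split described on the next slides:

grid <- expand.grid(shrinkage    = c(0.02, 0.05, 0.1),
                    depth        = c(4, 6, 8),
                    bag_fraction = c(0.5, 0.65, 0.8))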

Background

79

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

• Split the training data

– 70% for training

– 30% for cross validation

• Train the model (70%)

• Measure performance (30%)

CROSS VALIDATION

80
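In R, a sketch of this split (the seed is arbitrary):

set.seed(2014)
idx      <- sample(nrow(train), floor(0.7 * nrow(train)))
train_70 <- train[idx, ]
cv_30    <- train[-idx, ]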

PERFORMANCE BASED ON AMS

81

Trade-off between the signal/background ratio in the selection region and the number of records in the selection region

EventId   Score   RankOrder   Class   Truth
1         0.98    501         S       S
2         0.42    259,579     B
3         0.46    264,125     B
...
449,998   0.86    31,154      S       B
449,999   0.12    489,251     B
550,000   0.79    110,154     B

Selection region: s = sum(S), b = sum(B)
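The metric itself is the approximate median significance (AMS) used by the competition, with regularization term b_reg = 10 from the challenge definition; as an R function:

ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}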

PERFORMANCE BASED ON AMS

82

[AMS versus percentile of the score ranking; equivalently, AMS versus percentage of signal]

COMPARE TWO MODEL RESULTS

83

[AMS versus percentile for the training and cross-validation sets]


AMS BY NUM. ITERATION

85

[Animation: AMS versus percentile as the number of iterations increases]

Background

86

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

HEAT MAP OF AMS ON THE B-S PLANE

87

[Heat map of AMS over the b-s plane; axes: s and b]

OPTIMIZATION BASED ON THE OBJECTIVE FUNCTION

88

[AMS versus percentile with candidate operating points A, B, C]

HEAT MAP OF AMS ON THE B-S PLANE

89

[Heat map of AMS over the b-s plane with points A, B, C]

HEAT MAP OF AMS ON THE B-S PLANE

90

[Heat map of AMS over the b-s plane with points A, B, C]

Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function

AMS CURVE ON THE B-S PLANE

91

[AMS level curve on the b-s plane with points A, B, C]

Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function

partial derivative of AMS with respect to s; partial derivative of AMS with respect to b

Ratio of the derivatives ==> relative weight
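Written out from the AMS definition above (my own differentiation of that formula):

d_ams_ds <- function(s, b, b_reg = 10)    # partial derivative of AMS w.r.t. s
  log(1 + s / (b + b_reg)) / ams(s, b, b_reg)
d_ams_db <- function(s, b, b_reg = 10)    # partial derivative of AMS w.r.t. b (negative)
  (log(1 + s / (b + b_reg)) - s / (b + b_reg)) / ams(s, b, b_reg)
rel_weight <- function(s, b, b_reg = 10)  # ratio of the derivatives ==> relative weight
  abs(d_ams_ds(s, b, b_reg) / d_ams_db(s, b, b_reg))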

IMPROVEMENT DUE TO WEIGHTING

92

[AMS versus number of iterations: weighted model (AMS*) versus baseline (AMS)]

IMPROVEMENT DUE TO WEIGHTING (CONT'D)

93

[AMS versus number of iterations: weighted model (AMS*) versus baseline (AMS)]

AUGMENTED GRADIENT BOOSTING

94

[Cycle: Apply GBM <==> Weight Adjustment] ©

AUGMENTED GRADIENT BOOSTING

95

[Cycle: Apply GBM <==> Weight Adjustment]

Remove very-high and very-low score records from train and test ©

IMPROVEMENT DUE TO ELIMINATION

96

[AMS versus number of iterations: with elimination (AMS*) versus baseline (AMS)]

IMPROVEMENT DUE TO ELIMINATION (CONT'D)

97

[AMS versus number of iterations: with elimination (AMS*) versus baseline (AMS)]

AUGMENTED GRADIENT BOOSTING

98

[Cycle: Apply ML Model <==> Weight Adjustment]

Remove very-high and very-low score records from train and test ©

Background

99

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

• Version control (Git, SourceTree)

– Effectively implement many different ideas

• File organization

– Efficiently pull out the file needed

• Efficient code (R, Python)

– It matters greatly when dealing with big data

OTHER TOPICS

100

Thank you for your participation!

Any questions?

goDCI.com