Big Data Competition: maximizing your potential, exemplified with the 2014 Higgs Boson Machine Learning...

Post on 09-Jul-2015

Description

The Higgs Boson Machine Learning Challenge is, by far, one of the biggest big data competitions focusing on data analysis in the world. To be successful in such a competition, Cheng applied his knowledge of computer science, mathematics, statistics, and physics, together with the problem-solving discipline developed during his training in civil engineering. In this presentation, Cheng uses his experience in this competition to illustrate some important elements of big data analytics and why they matter. The presentation draws on several disciplines, such as physics, statistics, and mathematics, but no background in these areas is required to understand its essence. In brief, the presentation covers: an effective framework for general data mining projects; an introduction to the competition and its physics background; various techniques for data exploration and some traps to avoid; various ways of feature enhancement; model building and selection; and optimization of model performance.

Transcript of Big Data Competition: maximizing your potential, exemplified with the 2014 Higgs Boson Machine Learning...

BIG DATA COMPETITION: MAXIMIZING YOUR POTENTIAL

EXEMPLIFIED WITH THE 2014 HIGGS BOSON MACHINE LEARNING CHALLENGE

Dr. Cheng CHEN  email: cchen@goDCI.com

twitter: @cheng_chen_us

Development Consulting International LLC

goDCI.com

1

This presentation is copyright protected ©

Ohio State University, Tongji University

Ph.D. Civil Engineering

M.S. Applied Statistics

Minor Computer Science

Advanced training:

City and Regional Planning

Industrial and Systems Engineering

Mathematics

Passion: (this) machine learning

PRESENTER

2

• Goal: improve the procedure that produces the selection region for the Higgs boson

• 4-month duration

• 1,785 teams

• Many machine learning experts, statisticians, and physicists

• The top 5 came from 5 different countries

HIGGS BOSON MACHINE LEARNING CHALLENGE

3

Netherlands

Hungary

France

Russia

U.S.A./China

http://www.kaggle.com/c/higgs-boson/leaderboard

Background

4

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate] ©


READ  AND  DISCUSS

6

• a.k.a. the God Particle (explains some mass)

• A fundamental particle theorized in 1964 in the Standard Model of particle physics

• "Considered" discovered in 2011–2013 at the LHC by CERN

• A number of prestigious awards in 2013, including a Nobel Prize

HIGGS BOSON

7

http://upload.wikimedia.org/wikipedia/commons/0/00/Standard_Model_of_Elementary_Particles.svg

A "definitive" answer might require "another few years" after the collider's 2015 restart.

(deputy chair of physics at Brookhaven National Laboratory)

http://en.wikipedia.org/wiki/Higgs_boson

• Established in 1954

• Birthplace of the World Wide Web (1989)

CERN: THE EUROPEAN ORGANIZATION FOR NUCLEAR RESEARCH

8

maps.google.com

• 27 km (17 mi) in circumference

• 175 meters (574 ft) beneath the ground

• Built from 1998 to 2008

• Over 10,000 scientists and engineers

• Over 100 countries

• Seven particle detectors

LARGE HADRON COLLIDER (LHC)

9

https://www.llnl.gov/news/llnl-set-host-international-lattice-physics-conference

http://en.wikipedia.org/wiki/Large_Hadron_Collider

• 46 meters long

• 25 meters in diameter

• Weighs about 7,000 tonnes

• Contains some 3,000 km of cable

• Involves roughly 3,000 physicists from over 175 institutions in 38 countries

ATLAS

10

http://en.wikipedia.org/wiki/Large_Hadron_Collider

http://higgsml.lal.in2p3.fr/documentation/

• The Higgs boson cannot be measured directly (it decays immediately into lighter particles)

• Other particles can decay into the same set of lighter particles

• The PRODUCTION and DECAY of the Higgs boson depend on its mass, which was not predicted by theory (we now know it is close to 125 GeV)

CHALLENGES IN DETECTION OF THE HIGGS BOSON

13

https://www2.physics.ox.ac.uk/sites/default/files/2012-03-27/sinead_farrington_pdf_17376.pdf

Seeing a circular shadow does not mean the real object is a sphere

• Raw data collected from the LHC

• Hundreds of millions of proton-proton collisions (events) per second

• 400 events of interest are selected per second

– Signal events (i.e., Higgs boson)

– Background events (i.e., other particles)

• Events in an ad hoc selection region (in certain channels) exceeding background noise

CURRENT DETECTION MECHANISM

14

The selection criteria need improvement in significance and robustness

• Simulated data

• Fixed mass (125 GeV)

• Simplified decay channel

– Next slide

• Simplified background events (three representative types only)

– Decay of the Z boson (91.2 GeV) into tau-tau

– Decay of a pair of top quarks into a lepton and a hadronic tau

– "Decay" of the W boson into a lepton and a hadronic tau, due to imperfections in the particle identification procedure

• Simplified objective function (significance score)

SIMPLIFICATIONS FOR COMPETITION

15

• Decay through the tau-tau channel only

• One tau decays into a lepton and two neutrinos

• The other tau decays into a hadronic tau and a neutrino

• (Note: neutrinos cannot be detected)

SIMPLIFIED DECAY CHANNEL

16

hadronic tau: a bunch of hadrons


• Decay through the tau-tau channel only

• One tau decays into a lepton and two neutrinos

• The other tau decays into a hadronic tau and a neutrino

• (Note: neutrinos cannot be detected)

SIMPLIFIED DECAY CHANNEL

18

Jets, MET: vectorized momenta are given

hadronic tau: a bunch of hadrons

Background

19

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate] ©

• 250,000 training events

• 550,000 test events

• 30 variables

– 17 primitive (momenta, direction)

– 13 derived

DATA DIMENSION

20

First 4 rows of the training data

EventId  DER_mass_MMC  DER_mass_transverse_met_lep  DER_mass_vis  DER_pt_h  DER_deltaeta_jet_jet  DER_mass_jet_jet  DER_prodeta_jet_jet  DER_deltar_tau_lep  DER_pt_tot  DER_sum_pt
100000   138.47        51.655                       97.827        27.98     0.91                  124.711           2.666                3.064               41.928      197.76
100001   160.937       68.768                       103.235       48.146    NA                    NA                NA                   3.473               2.078       125.157
100002   NA            162.172                      125.953       35.635    NA                    NA                NA                   3.148               9.336       197.814
100003   143.905       81.417                       80.943        0.414     NA                    NA                NA                   3.31                0.414       75.968

EventId  DER_pt_ratio_lep_tau  DER_met_phi_centrality  DER_lep_eta_centrality  PRI_tau_pt  PRI_tau_eta  PRI_tau_phi  PRI_lep_pt  PRI_lep_eta  PRI_lep_phi  PRI_met
100000   1.582                 1.396                   0.2                     32.638      1.017        0.381        51.626      2.273        -2.414       16.824
100001   0.879                 1.414                   NA                      42.014      2.039        -3.011       36.918      0.501        0.103        44.704
100002   3.776                 1.414                   NA                      32.154      -0.705       -2.093       121.409     -0.953       1.052        54.283
100003   2.354                 -1.285                  NA                      22.647      -1.655       0.01         53.321      -0.522       -3.1         31.082

EventId  PRI_met_phi  PRI_met_sumet  PRI_jet_num  PRI_jet_leading_pt  PRI_jet_leading_eta  PRI_jet_leading_phi  PRI_jet_subleading_pt  PRI_jet_subleading_eta  PRI_jet_subleading_phi  PRI_jet_all_pt
100000   -0.277       258.733        2            67.435              2.15                 0.444                46.062                 1.24                    -2.475                  113.497
100001   -1.916       164.546        1            46.226              0.725                1.158                NA                     NA                      NA                      46.226
100002   -2.186       260.414        1            44.251              2.053                -2.028               NA                     NA                      NA                      44.251
100003   0.06         86.062         0            NA                  NA                   NA                   NA                     NA                      NA                      0

EventId  Weight            Label
100000   0.00265331133733  s
100001   2.23358448717     b
100002   2.34738894364     b
100003   5.44637821192     b

Data loaded correctly. Notice the NA values.
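A sketch of this read step in R (assuming the Kaggle file training.csv, in which missing values are coded as -999.0 before appearing as NA here):

library(data.table)
train <- fread("training.csv")          # 250,000 rows, 33 columns
# convert the -999.0 sentinel (Kaggle's convention, not shown on the slide) to NA
for (col in names(train)) {
  if (is.numeric(train[[col]]))
    set(train, i = which(train[[col]] == -999.0), j = col, value = NA_real_)
}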

MISSING VALUES

21

   col_name                      NA_count   NA_pct
1  EventId
2  DER_mass_MMC                    38,114      15%
3  DER_mass_transverse_met_lep
4  DER_mass_vis
5  DER_pt_h
6  DER_deltaeta_jet_jet           177,457      71%
7  DER_mass_jet_jet               177,457      71%
8  DER_prodeta_jet_jet            177,457      71%
9  DER_deltar_tau_lep
10 DER_pt_tot
11 DER_sum_pt
12 DER_pt_ratio_lep_tau
13 DER_met_phi_centrality
14 DER_lep_eta_centrality         177,457      71%
15 PRI_tau_pt
16 PRI_tau_eta
17 PRI_tau_phi
18 PRI_lep_pt
19 PRI_lep_eta
20 PRI_lep_phi
21 PRI_met
22 PRI_met_phi
23 PRI_met_sumet
24 PRI_jet_num
25 PRI_jet_leading_pt              99,913      40%
26 PRI_jet_leading_eta             99,913      40%
27 PRI_jet_leading_phi             99,913      40%
28 PRI_jet_subleading_pt          177,457      71%
29 PRI_jet_subleading_eta         177,457      71%
30 PRI_jet_subleading_phi         177,457      71%
31 PRI_jet_all_pt
32 Weight
33 Label

MISSING VALUES

22

Notice the consistency in missing values (the same counts, 177,457 and 99,913, repeat across groups of columns)

• Assign a value

– Generate a random value

– Fit a value (mean, median, nearest neighbor, etc.)

– Fix a value (domain knowledge)

• Remove the record

• Leave as is

HOW TO HANDLE MISSING VALUES

23

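A hedged R sketch of the "assign a value" and "remove the record" options, using one affected column from the table above (the derived column names are my own):

med <- median(train$DER_mass_MMC, na.rm = TRUE)
# fit a value: fill with the median
train$DER_mass_MMC_med <- ifelse(is.na(train$DER_mass_MMC), med, train$DER_mass_MMC)
# fix a value: a domain-motivated sentinel far outside the physical range
train$DER_mass_MMC_fix <- ifelse(is.na(train$DER_mass_MMC), -999, train$DER_mass_MMC)
# remove the record
train_complete <- train[!is.na(train$DER_mass_MMC), ]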

HISTOGRAM

25

Density is more meaningful in the range of x; no fuzzy jump at the edge

[Three histograms of PRI_jet_leading_pt (y-axis: count): raw, log transformation, inverse transformation]

HISTOGRAM (CONT'D)

26

Bi-modality is revealed

[Three histograms of DER_pt_h (y-axis: count): raw, log transformation, inverse transformation]
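A sketch of this transformation check in R (assuming train from the read step above; NAs and zeros are dropped so the log and inverse are defined):

x <- train$DER_pt_h
x <- x[!is.na(x) & x > 0]
par(mfrow = c(1, 3))
hist(x,      breaks = 100, main = "raw",     xlab = "DER_pt_h")
hist(log(x), breaks = 100, main = "log",     xlab = "log(DER_pt_h)")
hist(1 / x,  breaks = 100, main = "inverse", xlab = "1/DER_pt_h")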

INTERACTIVE VISUALIZATION: R SHINY

27

http://chencheng.shinyapps.io/demo_higgsDEMO


INTERACTIVE VISUALIZATION: R SHINY

29

Use  a  reasonable  number  of  bins  to  display  the  underlying  distribution

http://chencheng.shinyapps.io/demo_higgsDEMO

INTERACTIVE VISUALIZATION: R SHINY

30

Use  a  reasonable  transformation  to  display  the  underlying  distribution

http://chencheng.shinyapps.io/demo_higgsDEMO
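A minimal Shiny sketch of the idea on these slides (a bin slider plus a transformation selector); this is my own illustration, not the author's app at the URL above:

library(shiny)
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 10, max = 500, value = 100),
  selectInput("trans", "Transformation:", c("raw", "log", "inverse")),
  plotOutput("hist")
)
server <- function(input, output) {
  output$hist <- renderPlot({
    x <- train$PRI_jet_leading_pt          # assumes train is loaded as above
    x <- x[!is.na(x) & x > 0]
    x <- switch(input$trans, raw = x, log = log(x), inverse = 1 / x)
    hist(x, breaks = input$bins, main = input$trans, xlab = "PRI_jet_leading_pt")
  })
}
shinyApp(ui, server)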

HISTOGRAM (CONT'D)

31

[Histogram of PRI_tau_eta (y-axis: count)]

Transformations are sometimes not necessary

32

Do  that  for  all  30  variables

PAIRWISE CORRELATIONS

33

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_lep_phi & PRI_met_phi

PAIRWISE CORRELATIONS

34

Set the transparency parameter appropriately to reveal important patterns

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_lep_phi & PRI_met_phi

PAIRWISE CORRELATIONS

35

Correlation coefficient == 0 does not mean no correlation

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_lep_phi & PRI_met_phi


FEATURE ENHANCEMENT: ROTATION

37

Validate visual "evidence" from various perspectives

[Scatter plot of rotated PRI_lep_phi & PRI_met_phi; legend: BKG, SGN]


PAIRWISE VARIABLES — LOW RES.

39

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

PAIRWISE VARIABLES — HIGH RES.

40

Try high resolution

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

PAIRWISE VARIABLES — HIGH RES.

41

Curve fitting

[Scatter plot with fitted curve; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

FEATURE ENHANCEMENT: CURVE FITTING

42

Enhance a variable based on its correlation with another variable

[Scatter plot with fitted curve; legend: BKG, SGN]

DER_pt_h & DER_deltar_tau_lep

FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI

43

Domain knowledge

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & PRI_lep_phi

FEATURE ENHANCEMENT: ROTATION BY PRI_TAU_PHI

44

Feature enhancement by applying domain knowledge

[Scatter plot with marginal count histograms; legend: BKG, SGN]

DER_pt_h & PRI_lep_phi

FEATURE ENHANCEMENT: ROTATION

45

[Scatter plot with marginal count histograms; legend: BKG, SGN]

PRI_jet_leading_eta & PRI_jet_subleading_eta

• Select variable(s): one variable for a histogram, two for a scatter plot

DATA  DRILL  DOWN

46

http://chencheng.shinyapps.io/demo_higgsDEMO

• Dynamically  select  a  subset  of  data  —  PRI_jet_num  =  2

DATA  DRILL  DOWN

47

http://chencheng.shinyapps.io/demo_higgsDEMO

• Patterns  in  the  subset  data  —  PRI_jet_leading_eta  &  PRI_jet_subleading_eta

DATA  DRILL  DOWN

48

http://chencheng.shinyapps.io/demo_higgsDEMO

• Dynamically  select  a  subset  of  data  —  PRI_jet_num  =  3

DATA  DRILL  DOWN

49

http://chencheng.shinyapps.io/demo_higgsDEMO

• Patterns  in  the  subset  data  —  PRI_jet_leading_eta  &  PRI_jet_subleading_eta

DATA  DRILL  DOWN

50

http://chencheng.shinyapps.io/demo_higgsDEMO

• Patterns  in  the  subset  data  —  PRI_jet_leading_eta  &  PRI_jet_subleading_eta

DATA  DRILL  DOWN

51

PRI_jet_num = 2 (left) vs. PRI_jet_num = 3 (right)

Interactive  data  visualization  techniques  are  helpful

http://chencheng.shinyapps.io/demo_higgsDEMO

52

Do that for all 30 × 29 ≈ 900 pairs

PARTICLE  LOCATION  —  (0,  S)

53

Convert numerical data back into actual objects with meaning

Animation

PARTICLE  LOCATION  —  (0,  B)

54

Animation

• Distance ratio between MET-Lep and Tau-Lep

d(MET, Lep) / d(Tau, Lep)

INSPIRATION FROM ANIMATION

55

Inspiration from meaningful visualization can be helpful

[Histogram of dist_ratio_met_lep_tau (y-axis: count); legend: BKG, SGN]

• Distance ratio between MET-Lep and Tau-Lep

d(MET, Lep) / d(Tau, Lep)

INSPIRATION FROM ANIMATION

56

Adjust the visualization for better efficiency

[Two histograms of dist_ratio_met_lep_tau (y-axis: count), before and after adjustment; legend: BKG, SGN]
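One plausible construction of this feature in R. The deck does not spell out the distance, so as an assumption each object is placed at its transverse-momentum vector pt * (cos(phi), sin(phi)):

px <- function(pt, phi) pt * cos(phi)
py <- function(pt, phi) pt * sin(phi)
d  <- function(x1, y1, x2, y2) sqrt((x1 - x2)^2 + (y1 - y2)^2)

train$dist_ratio_met_lep_tau <- with(train,
  d(px(PRI_met, PRI_met_phi),    py(PRI_met, PRI_met_phi),
    px(PRI_lep_pt, PRI_lep_phi), py(PRI_lep_pt, PRI_lep_phi)) /
  d(px(PRI_tau_pt, PRI_tau_phi), py(PRI_tau_pt, PRI_tau_phi),
    px(PRI_lep_pt, PRI_lep_phi), py(PRI_lep_pt, PRI_lep_phi)))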

• Variable reduction

– Simple rotation

– Transformation

– Domain knowledge

– …

• Feature generation

– Domain knowledge

– Inspiration from various visualizations

– Statistical approaches

– …

FEATURE ENHANCEMENT

57

Examples (sketched below): principal component analysis; distance_ratio; rotation by phi; curve fitting; 45-degree rotation
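An R sketch of two of these enhancements (the wrap-to-[-pi, pi] helper and the new column names are my own illustration):

# rotation: express phi variables relative to a reference angle
rotate_phi <- function(phi, ref) {
  x <- phi - ref
  ((x + pi) %% (2 * pi)) - pi          # wrap back into [-pi, pi]
}
train$lep_phi_rot <- rotate_phi(train$PRI_lep_phi, train$PRI_tau_phi)
train$met_phi_rot <- rotate_phi(train$PRI_met_phi, train$PRI_tau_phi)

# 45-degree rotation of a correlated pair (a, b) into ((a + b), (a - b)) / sqrt(2)
train$jets_eta_sum  <- (train$PRI_jet_leading_eta + train$PRI_jet_subleading_eta) / sqrt(2)
train$jets_eta_diff <- (train$PRI_jet_leading_eta - train$PRI_jet_subleading_eta) / sqrt(2)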

Background

58

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate] ©

• Gradient boosting tree

• Neural network

• Bayesian network

• Support vector machine

• Generalized additive model

MODELS

59


• Decision tree

– Build many shallow trees

• Boosting

– Build trees based on residuals

• Bagging

– Each tree uses a subset of the data

• Ensembling

– Combine the trees

GRADIENT BOOSTING TREE

61


• Regression tree

DECISION TREE

63

[Scatter plot of y against x: noisy periodic data, x in [0, 10], y in [-1, 1]]

• Regression tree

DECISION TREE

64

[Regression tree with node depth = 1: root (mean 0.19, n = 100) splits at x < 6.614 into leaves -0.08 (n = 64) and 0.66 (n = 36); fitted step function overlaid on the scatter plot]

Depth = 1

• Regression tree

DECISION TREE

65

[Regression tree with node depth = 2: splits at x = 6.614, 3.049, 8.953; fitted step function overlaid on the scatter plot]

Depth = 2

• Regression tree

DECISION TREE

66

[Regression tree with node depth = 3: splits at x = 6.614, 3.049, 5.862, 8.953, 7.207; fitted step function overlaid on the scatter plot]

Depth = 3

• Regression tree

DECISION TREE

67

[Regression tree with node depth = 4: splits at x = 6.614, 3.049, 5.862, 3.594, 8.953, 7.207; fitted step function overlaid on the scatter plot]

Depth = 4
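A sketch reproducing this toy example with rpart, assuming the data is a noisy sine curve (which matches the plots; the exact split points such as 6.614 will differ by seed):

library(rpart)
set.seed(1)
x <- runif(100, 0, 10)
y <- sin(x) + rnorm(100, sd = 0.3)
for (depth in 1:4) {
  fit <- rpart(y ~ x, control = rpart.control(maxdepth = depth, minsplit = 2, cp = 0))
  plot(x, y, main = paste("Depth =", depth))
  lines(sort(x), predict(fit)[order(x)], type = "s")   # fitted step function
}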

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add

DECISION TREE

68

base model

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add

GRADIENT BOOSTING TREE (V. 1)

69

get the residuals

fit a tree for the residuals

additive model

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - latest_model(X);
    tree_add = train_tree(X, v_resid);
    latest_model += LEARNING_RATE * tree_add

(STOCHASTIC) GRADIENT BOOSTING TREE

70

get sampled index

sampled records as input

store input

X0 = X; Y0 = Y;
latest_model = train_tree(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_resid = Y - wts * latest_model(X);
    tree_add = train_tree(X, v_resid, wts);
    latest_model += LEARNING_RATE * tree_add

(STOCHASTIC) GRADIENT BOOSTING TREE WITH WEIGHT

71

X0 = X; Y0 = Y;
latest_model = train_base_model(X, Y, wts);
for ii = 1:NUM_ITER
    Index_train = random(1:NUM_REC, FRAC_TRAIN * NUM_REC)
    X = X0[Index_train]; Y = Y0[Index_train];
    v_pseudo_resid = get_pseudo_residual(X, Y, wts, latest_model, LOSS_FUNCTION_TYPE);
    model_add_base = train_base_model(X, v_pseudo_resid, wts);
    alpha = linear_search(cost_function, model_add_base, X, Y, wts);
    latest_model += LEARNING_RATE * (alpha * model_add_base)

(GENERAL) GRADIENT BOOSTING

72

[Stochastic Gradient Boosting] Jerome H. Friedman, 1999
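A runnable R translation of the stochastic, squared-error version above, with rpart trees as the base learner (names mirror the pseudocode; an illustrative sketch, not the author's code):

library(rpart)
gbt <- function(X, Y, NUM_ITER = 100, FRAC_TRAIN = 0.7, LEARNING_RATE = 0.1, depth = 4) {
  pred  <- rep(mean(Y), nrow(X))       # constant base model, for simplicity
  trees <- vector("list", NUM_ITER)
  for (ii in seq_len(NUM_ITER)) {
    idx <- sample(nrow(X), floor(FRAC_TRAIN * nrow(X)))        # bagging
    d   <- data.frame(X[idx, , drop = FALSE],
                      v_resid = Y[idx] - pred[idx])            # residuals
    trees[[ii]] <- rpart(v_resid ~ ., data = d,
                         control = rpart.control(maxdepth = depth, cp = 0))
    pred <- pred + LEARNING_RATE * predict(trees[[ii]], as.data.frame(X))
  }
  list(base = mean(Y), trees = trees, rate = LEARNING_RATE)
}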

Background

73

Data

Model

Understand

Explore Enhance

Train Select Optimize

read

visualize

reduce

generate

innovateapply

fine-­‐tune

read

discuss

Validate

find

cross  validate

©

gbm_model = gbm.fit(
    x = train[, x_vars, with = FALSE],
    y = train$Label,
    distribution = char_distr,
    w = w,
    n.trees = n_trees,
    interaction.depth = num_inter,
    n.minobsinnode = min_obs_node,
    shrinkage = shrinkage_rate,
    bag.fraction = frac_bag)

APPLYING GBM IN R

74
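A scoring sketch to go with it; test, threshold, and the submission layout are assumptions following the next slides and the Kaggle format (predict.gbm is the standard gbm interface):

scores <- predict(gbm_model, newdata = test[, x_vars, with = FALSE],
                  n.trees = n_trees, type = "response")
submission <- data.frame(EventId   = test$EventId,
                         RankOrder = rank(scores, ties.method = "first"),
                         Class     = ifelse(scores > threshold, "s", "b"))
write.csv(submission, "submission.csv", row.names = FALSE)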

VARIABLE IMPORTANCE

75

[Bar chart of relative variable importance]

APPLY MODEL ON TEST DATA

76

EventId   Score   RankOrder   Class
1         0.98    501         s
2         0.42    259,579     b
3         0.46    264,125     b
...
449,998   0.86    31,154      s
449,999   0.12    489,251     b
550,000   0.79    110,154     b

Background

77

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

• Number of iterations

• Minimum observations per node

• Bagging fraction (0.5 ~ 0.8)

• Learning rate (< 0.1)

• Tree depth (4 ~ 8)

GRADIENT BOOSTING PARAMETERS

78
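A sketch of a small tuning grid over the ranges above; each row would be passed to gbm.fit and scored on the 30% cross-validation split described on the next slides:

grid <- expand.grid(shrinkage    = c(0.02, 0.05, 0.1),
                    depth        = c(4, 6, 8),
                    bag_fraction = c(0.5, 0.65, 0.8))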

Background

79

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

• Split the training data

– 70% for training

– 30% for cross validation

• Train the model (70%)

• Measure performance (30%)

CROSS VALIDATION

80
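In R, a sketch of this split (the seed is arbitrary):

set.seed(2014)
idx      <- sample(nrow(train), floor(0.7 * nrow(train)))
train_70 <- train[idx, ]
cv_30    <- train[-idx, ]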

PERFORMANCE BASED ON AMS

81

Trade-off between the signal/background ratio in the selection region and the number of records in the selection region

EventId   Score   RankOrder   Class   Truth
1         0.98    501         S       S
2         0.42    259,579     B
3         0.46    264,125     B
...
449,998   0.86    31,154      S       B
449,999   0.12    489,251     B
550,000   0.79    110,154     B

Selection region: s = sum(S), b = sum(B)
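The metric itself is the approximate median significance (AMS) used by the competition, with regularization term b_reg = 10 from the challenge definition; as an R function:

ams <- function(s, b, b_reg = 10) {
  sqrt(2 * ((s + b + b_reg) * log(1 + s / (b + b_reg)) - s))
}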

PERFORMANCE BASED ON AMS

82

[AMS versus percentile of the score ranking; equivalently, AMS versus percentage of signal]

COMPARE TWO MODEL RESULTS

83

[AMS versus percentile for the training and cross-validation sets]


AMS BY NUM. ITERATION

85

[Animation: AMS versus percentile as the number of iterations increases]

Background

86

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

HEAT MAP OF AMS ON THE B-S PLANE

87

[Heat map of AMS over the b-s plane; axes: s and b]

OPTIMIZATION BASED ON THE OBJECTIVE FUNCTION

88

[AMS versus percentile with candidate operating points A, B, C]

HEAT MAP OF AMS ON THE B-S PLANE

89

[Heat map of AMS over the b-s plane with points A, B, C]

HEAT MAP OF AMS ON THE B-S PLANE

90

[Heat map of AMS over the b-s plane with points A, B, C]

Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function

AMS CURVE ON THE B-S PLANE

91

[AMS level curve on the b-s plane with points A, B, C]

Inspiration from the Lagrangian method: weight signal and background events by the partial derivatives of the AMS function

partial derivative of AMS with respect to s; partial derivative of AMS with respect to b

Ratio of the derivatives ==> relative weight
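Written out from the AMS definition above (my own differentiation of that formula):

d_ams_ds <- function(s, b, b_reg = 10)    # partial derivative of AMS w.r.t. s
  log(1 + s / (b + b_reg)) / ams(s, b, b_reg)
d_ams_db <- function(s, b, b_reg = 10)    # partial derivative of AMS w.r.t. b (negative)
  (log(1 + s / (b + b_reg)) - s / (b + b_reg)) / ams(s, b, b_reg)
rel_weight <- function(s, b, b_reg = 10)  # ratio of the derivatives ==> relative weight
  abs(d_ams_ds(s, b, b_reg) / d_ams_db(s, b, b_reg))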

IMPROVEMENT DUE TO WEIGHTING

92

[AMS versus number of iterations: weighted model (AMS*) versus baseline (AMS)]

IMPROVEMENT DUE TO WEIGHTING (CONT'D)

93

[AMS versus number of iterations: weighted model (AMS*) versus baseline (AMS)]

AUGMENTED GRADIENT BOOSTING

94

[Cycle: Apply GBM <==> Weight Adjustment] ©

AUGMENTED GRADIENT BOOSTING

95

[Cycle: Apply GBM <==> Weight Adjustment]

Remove very-high and very-low score records from train and test ©

IMPROVEMENT DUE TO ELIMINATION

96

[AMS versus number of iterations: with elimination (AMS*) versus baseline (AMS)]

IMPROVEMENT DUE TO ELIMINATION (CONT'D)

97

[AMS versus number of iterations: with elimination (AMS*) versus baseline (AMS)]

AUGMENTED GRADIENT BOOSTING

98

[Cycle: Apply ML Model <==> Weight Adjustment]

Remove very-high and very-low score records from train and test ©

Background

99

[Framework diagram: Background -> Data (Understand, Explore, Enhance) -> Model (Train, Select, Optimize) -> Validate; activities: read, discuss, find, visualize, reduce, generate, apply, cross validate, fine-tune, innovate]

• Version control (Git, SourceTree)

– Effectively implement many different ideas

• File organization

– Efficiently pull out the file needed

• Efficient code (R, Python)

– It matters greatly when dealing with big data

OTHER TOPICS

100

Thank you for your participation!

Any questions?

goDCI.com