RAMP: data challenges with modularization and code submission


Transcript of RAMP: data challenges with modularization and code submission

Page 1: RAMP: data challenges with modularization and code submission

RAMP: DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION

LESSONS LEARNED

Balázs Kégl

Université Paris-Saclay / CNRS

Center for Data Science Paris-Saclay

Page 2: RAMP: data challenges with modularization and code submission

OUTLINE

• A short history of RAMPs
  • motivations, design principles, and the current tool
• Three data challenges
  • anomaly detection in the LHC ATLAS detector
  • classifying and regressing on molecular spectra
  • time series forecasting of El Niño
• What have we learned?
  • number of participants, incentives?
  • open vs closed?
  • blending vs human ingenuity

Page 3: RAMP: data challenges with modularization and code submission

The Paris-Saclay Center for Data Science: Data Science for Scientific Data

A multi-disciplinary initiative, building interfaces, matching people, helping them launch projects

345 affiliated researchers, 50 laboratories

datascience-paris-saclay.fr

@SaclayCDS

[Diagram: 250 researchers in 35 laboratories]

Laboratories by domain science:

• Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
• Chemistry: EA4041/UPSud
• Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
• Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
• Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
• Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud

Laboratories by data science discipline:

• Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
• Visualization: INRIA, LIMSI
• Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
• Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech

Also listed: LIST/CEA

Areas:

• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Domain science: human society, life, brain, earth, universe
• Tool building: software engineering, clouds/grids, high-performance computing, optimization

Roles: data scientist, applied scientist, domain scientist, data engineer, software engineer

Page 4: RAMP: data challenges with modularization and code submission

CDS: A SET OF INNOVATIVE TOOLS AND PROCESSES TO CONNECT DATA SCIENCE AND DOMAIN SCIENCE COMMUNITIES

Roles: data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer

Areas:

• Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
• Tool building: software engineering, clouds/grids, high-performance computing, optimization
• Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain

Tools and processes:

• interdisciplinary projects
• data challenges
• ultrawalls and interactive visualization
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and TSs
• IT platform for linked data

Page 5: RAMP: data challenges with modularization and code submission

RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)

http://www.ramp.studio

Page 6: RAMP: data challenges with modularization and code submission

LIMITATIONS OF DATA CHALLENGES

• Organizers have no direct access to solutions
• Emphasize competition: participants cannot build on each other’s solutions
• No modularization: ideas go unnoticed unless packaged into a top submission

Page 7: RAMP: data challenges with modularization and code submission

RAMP

• Challenge with code submission
• Following Nielsen’s three crowdsourcing principles:
  • modularity: pipelines are sliced into workflow element modules that can be tackled independently
  • encourage small contributions: e.g., copy another submission, add features, change the hyperparameters, resubmit
  • rich and well structured information commons: open and download each other’s code, discuss on Slack
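To make the modularity principle concrete, here is a minimal sketch of a pipeline sliced into two independently replaceable modules. The class names, method names, and toy data are illustrative assumptions, not the RAMP tool's actual interface; the classifier is a simple nearest-centroid rule chosen only to keep the example self-contained.

```python
# Hypothetical sketch: a two-module workflow whose pieces can be swapped
# independently. Class and function names are illustrative, not RAMP's API.
from statistics import mean


class FeatureExtractor:
    """Module 1: turn a raw record into a fixed-length feature vector."""

    def transform(self, rows):
        return [[mean(r), max(r) - min(r)] for r in rows]


class Classifier:
    """Module 2: nearest-centroid classifier on the extracted features."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in sorted(set(y)):
            members = [x for x, lab in zip(X, y) if lab == label]
            # Column-wise mean of the members of this class.
            self.centroids_[label] = [mean(col) for col in zip(*members)]
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: sq_dist(x, self.centroids_[c]))
            for x in X
        ]


def run_workflow(feature_extractor, classifier, train_rows, train_y, test_rows):
    """Glue code: any object with the same methods can replace either module."""
    X_train = feature_extractor.transform(train_rows)
    X_test = feature_extractor.transform(test_rows)
    return classifier.fit(X_train, train_y).predict(X_test)


class RangeOnlyExtractor(FeatureExtractor):
    """A 'small contribution': copy the module and change one feature."""

    def transform(self, rows):
        return [[max(r) - min(r)] for r in rows]


train_rows = [[1, 1, 1], [2, 2, 2], [0, 5, 10], [0, 6, 12]]
train_y = ["flat", "flat", "wide", "wide"]
test_rows = [[3, 3, 3], [1, 7, 13]]

baseline = run_workflow(FeatureExtractor(), Classifier(), train_rows, train_y, test_rows)
variant = run_workflow(RangeOnlyExtractor(), Classifier(), train_rows, train_y, test_rows)
```

The variant touches only the feature extractor and reuses the classifier unchanged, which is the kind of small, resubmittable contribution the second principle describes.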

Page 8: RAMP: data challenges with modularization and code submission

RAMP: RAPID ANALYTICS AND MODEL PROTOTYPING

• Roughly two formats
  • single-day hackathons with 20-50 participants, open leaderboard, 15-minute timeout
  • 1-3 week course challenges with up to 150 students (but no limit really): closed phase with 1-3 submissions per day followed by an open phase with 15-minute timeout
• 500+ users, 3000+ models
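The two submission regimes can be sketched as a small rule. This is only an illustration under stated assumptions: the slide does not define "timeout", so the sketch assumes it means a minimum wait between two submissions by the same participant, and it takes 3 as the closed-phase daily quota. The function name, phase labels, and constants are hypothetical.

```python
from datetime import datetime, timedelta

# Assumed parameters taken from the slide; the interpretation is a guess.
CLOSED_MAX_PER_DAY = 3
OPEN_MIN_WAIT = timedelta(minutes=15)


def can_submit(phase, previous, now):
    """Return True if a participant may submit at time `now`.

    phase: "closed" or "open" (hypothetical labels)
    previous: datetimes of this participant's earlier submissions
    """
    if phase == "closed":
        # Closed phase: a fixed number of submissions per calendar day.
        today = [t for t in previous if t.date() == now.date()]
        return len(today) < CLOSED_MAX_PER_DAY
    if phase == "open":
        # Open phase: a minimum wait since the most recent submission.
        return not previous or now - max(previous) >= OPEN_MIN_WAIT
    raise ValueError(f"unknown phase: {phase!r}")


t0 = datetime(2017, 1, 9, 10, 0)
history = [t0, t0 + timedelta(minutes=5), t0 + timedelta(minutes=10)]

closed_ok = can_submit("closed", history, t0 + timedelta(hours=1))       # daily quota used up
open_too_soon = can_submit("open", history, t0 + timedelta(minutes=20))  # 10 min after last
open_ok = can_submit("open", history, t0 + timedelta(minutes=25))        # 15 min after last
```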

Page 9: RAMP: data challenges with modularization and code submission

RAMP: RAPID ANALYTICS AND MODEL PROTOTYPING

[Figure: architecture diagram. Components: frontend, DB, backend. Labels: users, submissions, score, problems, workflow, starting kit, crossval, data pipeline, train+test+blend]

Page 10: RAMP: data challenges with modularization and code submission

RAMP

Page 11: RAMP: data challenges with modularization and code submission

RAMP

Page 12: RAMP: data challenges with modularization and code submission

RAMP

Page 13: RAMP: data challenges with modularization and code submission

RAMP

Page 14: RAMP: data challenges with modularization and code submission

RAMP

Page 15: RAMP: data challenges with modularization and code submission

RAMP

Page 16: RAMP: data challenges with modularization and code submission

Three recent RAMPs

Page 17: RAMP: data challenges with modularization and code submission

ANOMALY DETECTION IN THE LHC ATLAS DETECTOR

[Workflow diagram: reconstruction + simulated anomalies → classifier → anomaly (isSkewed = 1) or correct (isSkewed = 0)?]

Page 18: RAMP: data challenges with modularization and code submission

CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA

[Workflow diagram: chemotherapy drug in elastic pocket → laser spectrometer → molecular spectra. Workflow elements: feature extractor 1, feature extractor 2, regressor (output: concentration), classifier (output: drug type)]

Page 19: RAMP: data challenges with modularization and code submission

FORECASTING EL NIÑO SIX MONTHS AHEAD

Page 20: RAMP: data challenges with modularization and code submission

FORECASTING EL NIÑO SIX MONTHS AHEAD

[Workflow diagram: … 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 … → feature extractor → x (a fixed-length feature vector) → regressor]

• We give the full series to the feature extractor
• It could look ahead into the future (even inadvertently)
• We check for lookahead with a randomized test
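One way such a randomized lookahead check could work is sketched below: perturb every value after time t and verify that the features computed at t do not change. The slides do not describe the actual test, so the function names, the perturbation scheme, and the two toy extractors are assumptions made for illustration; the series reuses the values shown on the slide.

```python
import random


def causal_features(series, t):
    """Features at time t that use only series[:t + 1] (hypothetical example)."""
    window = series[max(0, t - 2):t + 1]
    return [series[t], sum(window) / len(window)]


def leaky_features(series, t):
    """A buggy extractor: a centered average peeks at series[t + 1]."""
    window = series[max(0, t - 1):t + 2]
    return [series[t], sum(window) / len(window)]


def looks_ahead(extract, series, n_trials=20, seed=0):
    """Randomized check: perturb the future and see if features at t change."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        t = rng.randrange(1, len(series) - 1)
        perturbed = list(series)
        for i in range(t + 1, len(series)):
            perturbed[i] += rng.uniform(1.0, 2.0)  # change every future value
        if extract(series, t) != extract(perturbed, t):
            return True  # features at t depend on values after t
    return False


temperatures = [300.14, 299.83, 298.76, 299.87, 299.82, 300.15, 300.10, 299.50]
causal_flag = looks_ahead(causal_features, temperatures)
leaky_flag = looks_ahead(leaky_features, temperatures)
```

A check of this kind only detects dependence on the perturbed future values; it does not prove the extractor is correct in other respects.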

Page 21: RAMP: data challenges with modularization and code submission

Analyzing the analysis

Page 22: RAMP: data challenges with modularization and code submission

OPEN PHASE LETS PARTICIPANTS CATCH UP

Page 23: RAMP: data challenges with modularization and code submission

T-SNE ON TEST PREDICTIONS

[Figure: t-SNE embedding of test predictions. Annotations: starting kit, the crowd, early influencers, inventors]

Page 24: RAMP: data challenges with modularization and code submission

[Figure annotations: the single-day hackathon ceiling; what you achieved with a well tuned deep net; the diversity gap; the human blender gap]

Page 25: RAMP: data challenges with modularization and code submission

[Figure annotations: blending is immune to overfitting; the single-day hackathon floor]

Page 26: RAMP: data challenges with modularization and code submission

[Figure annotation: the single-day hackathon floor]

Page 27: RAMP: data challenges with modularization and code submission

Page 28: RAMP: data challenges with modularization and code submission

Page 29: RAMP: data challenges with modularization and code submission

Page 30: RAMP: data challenges with modularization and code submission

Page 31: RAMP: data challenges with modularization and code submission

WHAT WE LEARNED

• Course RAMPs beat single-day hackathons significantly
  • larger number of students?
  • longer RAMPs?
  • master-level students are better than data science researchers?
  • stronger incentives?
  • closed phase preceding an open phase (vs pure open RAMP) helps to create diversity?
• Open phase helps novice participants to catch up: the goal of teaching!
• The open phase sometimes also improves the best and blended scores
• Human blending often beats machine blending
• Human feature engineering easily beats deep learning on some data
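For readers unfamiliar with machine blending, the sketch below shows one common approach: greedy forward selection with replacement, where models are added to an averaged ensemble as long as the validation error improves. The slides do not say which blending method the RAMP backend uses, so this technique, the function names, and the toy predictions are all assumptions for illustration.

```python
from math import sqrt


def rmse(pred, truth):
    """Root mean squared error between two equally long vectors."""
    return sqrt(sum((p - y) ** 2 for p, y in zip(pred, truth)) / len(truth))


def average(preds):
    """Element-wise mean of several prediction vectors."""
    return [sum(col) / len(col) for col in zip(*preds)]


def greedy_blend(model_preds, truth, max_rounds=10):
    """Greedy forward selection with replacement.

    model_preds: dict mapping a model name to its validation predictions.
    Returns (selected names, blended RMSE); repeated names act as weights.
    """
    selected, best = [], float("inf")
    for _ in range(max_rounds):
        # Score the blend obtained by adding each candidate once more.
        scores = {
            name: rmse(average([model_preds[n] for n in selected + [name]]), truth)
            for name in model_preds
        }
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break  # no candidate improves the blend any further
        selected.append(name)
        best = scores[name]
    return selected, best


truth = [1.0, 2.0, 3.0, 4.0]
model_preds = {
    "too_high": [1.5, 2.5, 3.5, 4.5],
    "too_low": [0.5, 1.5, 2.5, 3.5],
    "noisy": [2.0, 1.0, 4.0, 3.0],
}
members, blended_rmse = greedy_blend(model_preds, truth)
best_single = min(rmse(p, truth) for p in model_preds.values())
```

In this toy case two biased models with opposite errors cancel out, so the blend beats every single model; this is the kind of gain from diversity that the deck's "diversity gap" annotation points to.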

Page 32: RAMP: data challenges with modularization and code submission

THE RAMP TOOL

A prototyping tool for collaborative development of data science workflows

• Fast development of analytics solutions
• Teaching support
• Networking
• Support for collaborative team work

Page 33: RAMP: data challenges with modularization and code submission

WHAT’S NEXT

• Open sourcing and packaging for easy deployment
• More RAMPs: stay tuned, and sign up at http://www.ramp.studio if interested