Post on 21-Feb-2017
Center for Data Science Paris-Saclay, B. Kégl (CNRS)
RAMP DATA CHALLENGES WITH MODULARIZATION AND CODE SUBMISSION: LESSONS LEARNED
BALÁZS KÉGL
Université Paris-Saclay / CNRS
OUTLINE
• A short history of RAMPs
  • motivations, design principles, and the current tool
• Three data challenges
  • anomaly detection in the LHC ATLAS detector
  • classifying and regressing on molecular spectra
  • time series forecasting of El Niño
• What have we learned?
  • number of participants, incentives?
  • open vs closed?
  • blending vs human ingenuity
The Paris-Saclay Center for Data Science: Data Science for Scientific Data
250 researchers in 35 laboratories
Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
Chemistry: EA4041/UPSud
Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
Visualization: INRIA, LIMSI
Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Domain science: human society, life, brain, earth, universe
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data scientist, applied scientist, domain scientist, data engineer, software engineer
datascience-paris-saclay.fr
@SaclayCDS
LIST/CEA
Center for Data Science Paris-Saclay
A multi-disciplinary initiative: building interfaces, matching people, and helping them launch projects
345 affiliated researchers, 50 laboratories
CDS: A SET OF INNOVATIVE TOOLS AND PROCESSES TO CONNECT DATA SCIENCE AND DOMAIN SCIENCE COMMUNITIES
Data scientist, data trainer, applied scientist, domain expert, software engineer, data engineer
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
• interdisciplinary projects
• data challenges
• ultrawalls and interactive visualization
• coding sprints
• Open Software Initiative
• code consolidator and engineering projects
• data science RAMPs and TSs
• IT platform for linked data
LIMITATIONS OF DATA CHALLENGES
• Organizers have no direct access to solutions
• Emphasis on competition: participants cannot build on each other’s solutions
• No modularization: ideas go unnoticed unless packaged into a top submission
RAMP
• Challenge with code submission
• Following Nielsen’s three crowdsourcing principles:
  • modularity: pipelines are sliced into workflow element modules that can be tackled independently
  • encourage small contributions: e.g., copy another submission, add features, change the hyperparameters, resubmit
  • rich and well-structured information commons: open and download each other’s code, discuss on Slack
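As a rough illustration of the modularity principle, the sketch below splits a pipeline into two independently replaceable workflow elements. The class names, the `run_workflow` glue function, and the toy features are assumptions made for this example; they are not the RAMP tool's actual API.

```python
from statistics import mean


class FeatureExtractor:
    """Workflow element 1 (hypothetical): raw record -> feature vector."""

    def transform(self, records):
        # Toy features: mean and range of each raw record.
        return [[mean(r), max(r) - min(r)] for r in records]


class Regressor:
    """Workflow element 2 (hypothetical): 1-nearest-neighbour regressor."""

    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        def nearest(x):
            # Squared Euclidean distance to every training point.
            dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X_]
            return self.y_[dists.index(min(dists))]

        return [nearest(x) for x in X]


def run_workflow(feature_extractor, regressor, train_records, y_train, test_records):
    """Fixed glue code: a participant swaps one module without touching the other."""
    X_train = feature_extractor.transform(train_records)
    X_test = feature_extractor.transform(test_records)
    return regressor.fit(X_train, y_train).predict(X_test)


train_records = [[1, 2, 3], [10, 20, 30]]
y_train = [0.5, 5.0]
test_records = [[2, 2, 2], [11, 19, 33]]
preds = run_workflow(FeatureExtractor(), Regressor(), train_records, y_train, test_records)
```

Because the glue code is fixed, a small contribution can consist of copying someone's `FeatureExtractor`, adding one feature, and resubmitting with the same regressor.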
RAMP: RAPID ANALYTICS AND MODEL PROTOTYPING
• Roughly two formats
  • single-day hackathons with 20-50 participants, open leaderboard, 15-minute timeout
  • 1-3 week course challenges with up to 150 students (but no real limit): a closed phase with 1-3 submissions per day, followed by an open phase with a 15-minute timeout
• 500+ users, 3000+ models
RAMP: RAPID ANALYTICS AND MODEL PROTOTYPING
[Architecture diagram: frontend, DB, backend; users, submissions, score, problems, workflow, starting kit, crossval; data pipeline; train+test+blend]
Three recent RAMPs
ANOMALY DETECTION IN THE LHC ATLAS DETECTOR
reconstruction + simulated anomalies → classifier → anomaly (isSkewed = 1) or correct (isSkewed = 0)?
CLASSIFYING AND REGRESSING ON MOLECULAR SPECTRA
chemotherapy drug in elastic pocket → laser spectrometer → molecular spectra
feature extractor 1, feature extractor 2
regressor → concentration
classifier → drug type
FORECASTING EL NIÑO SIX MONTHS AHEAD
… 300.14 299.83 298.76 299.87 299.82 300.15 300.10 299.50 … → feature extractor → x (a fixed-length feature vector) → regressor
• We give the full series to the feature extractor
• It could look ahead into the future (even inadvertently)
• We check for lookahead with a randomized test
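The slides do not spell out the randomized test, so the following is only one plausible sketch: perturb every value after time `t` and check that the features computed for time `t` stay the same. The extractor signature `extractor(series, t)` and both toy extractors are assumptions made for this example; the series values are the ones shown on the slide.

```python
import random


def causal_features(series, t):
    """Hypothetical extractor: mean of the last 3 values up to and including t."""
    window = series[max(0, t - 2): t + 1]
    return [sum(window) / len(window)]


def leaky_features(series, t):
    """Hypothetical buggy extractor: a centred window that peeks at t + 1."""
    window = series[max(0, t - 1): t + 2]
    return [sum(window) / len(window)]


def has_lookahead(extractor, series, n_trials=20, seed=0):
    """Randomized check: scramble the future, see whether the features change."""
    rng = random.Random(seed)
    for _ in range(n_trials):
        t = rng.randrange(len(series) - 1)  # keep at least one future value
        perturbed = list(series)
        for i in range(t + 1, len(perturbed)):
            perturbed[i] += rng.uniform(1.0, 2.0)  # shift every future value
        if extractor(series, t) != extractor(perturbed, t):
            return True  # features depend on values after t
    return False


series = [300.14, 299.83, 298.76, 299.87, 299.82, 300.15, 300.10, 299.50]
causal_ok = not has_lookahead(causal_features, series)
leak_found = has_lookahead(leaky_features, series)
```

A test of this kind can only flag leaks on the time steps it happens to sample, so more trials give more confidence but never a proof.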
Analyzing the analysis
OPEN PHASE LETS PARTICIPANTS CATCH UP
T-SNE ON TEST PREDICTIONS
starting kit
the crowd
early influencers
inventors
the single-day hackathon ceiling
what you achieved with a well-tuned deep net
the diversity gap
the human blender gap
blending is immune to overfitting
the single-day hackathon floor
WHAT WE LEARNED
• Course RAMPs beat single-day hackathons significantly
  • larger number of students?
  • longer RAMPs?
  • master's-level students are better than data science researchers?
  • stronger incentives?
  • a closed phase preceding an open phase (vs. a pure open RAMP) helps to create diversity?
• The open phase helps novice participants catch up: the goal of teaching!
• It sometimes also improves the best and the blended scores
• Human blending often beats machine blending
• Human feature engineering easily beats deep learning on some data
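The slides do not say how the machine blend is computed. As one possible reading, the sketch below uses greedy forward selection with replacement on validation predictions, a common ensemble-selection heuristic. The function names and the toy submissions `a`, `b`, `c` are hypothetical.

```python
def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length lists."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5


def average(pred_list):
    """Element-wise mean of several prediction vectors."""
    return [sum(col) / len(col) for col in zip(*pred_list)]


def greedy_blend(predictions, y_valid, n_rounds=10):
    """Greedy forward selection with replacement on validation predictions.

    predictions: dict mapping a submission name to its validation predictions.
    Returns the selected names; a repeated name acts as a larger weight.
    """
    selected = []
    best_score = float("inf")
    for _ in range(n_rounds):
        # Score the blend obtained by adding each candidate to the current selection.
        candidates = {
            name: rmse(y_valid, average([predictions[n] for n in selected + [name]]))
            for name in predictions
        }
        name = min(candidates, key=candidates.get)
        if candidates[name] >= best_score:
            break  # no candidate improves the blend
        selected.append(name)
        best_score = candidates[name]
    return selected


y_valid = [1.0, 2.0, 3.0]
predictions = {
    "a": [1.5, 2.5, 3.5],  # biased upward
    "b": [0.5, 1.5, 2.5],  # biased downward
    "c": [3.0, 3.0, 3.0],  # constant guess
}
selected = greedy_blend(predictions, y_valid)
blend_score = rmse(y_valid, average([predictions[n] for n in selected]))
```

In this toy case the two oppositely biased submissions cancel out, which is the kind of diversity a blend can exploit and a single best model cannot.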
THE RAMP TOOL
A prototyping tool for collaborative development of data science workflows
• Fast development of analytics solutions
• Teaching support
• Networking
• Support for collaborative teamwork
WHAT’S NEXT
• Open sourcing and packaging for easy deployment
• More RAMPs: stay tuned, and sign up at http://www.ramp.studio if interested