RAMP: Collaborative challenge with code submission
Center for Data Science Paris-Saclay
CNRS & University Paris Saclay, TektosData
BALÁZS KÉGL
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
COLLABORATIVE CHALLENGE WITH CODE SUBMISSION
A bit of context
UNIVERSITÉ PARIS-SACLAY
19 founding partners
UNIVERSITÉ PARIS-SACLAY
+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion
Center for Data Science Paris-Saclay
A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay
http://www.datascience-paris-saclay.fr/
The Paris-Saclay Center for Data Science
Data Science for Scientific Data
250 researchers in 35 laboratories
Biology & bioinformatics: IBISC/UEvry, LRI/UPSud, Hepatinov, CESP/UPSud-UVSQ-Inserm, IGM-I2BC/UPSud, MIA/Agro, MIAj-MIG/INRA, LMAS/Centrale
Chemistry: EA4041/UPSud
Earth sciences: LATMOS/UVSQ, GEOPS/UPSud, IPSL/UVSQ, LSCE/UVSQ, LMD/Polytechnique
Economy: LM/ENSAE, RITM/UPSud, LFA/ENSAE
Neuroscience: UNICOG/Inserm, U1000/Inserm, NeuroSpin/CEA
Particle physics, astrophysics & cosmology: LPP/Polytechnique, DMPH/ONERA, CosmoStat/CEA, IAS/UPSud, AIM/CEA, LAL/UPSud
Machine learning: LRI/UPSud, LTCI/Telecom, CMLA/Cachan, LS/ENSAE, LIX/Polytechnique, MIA/Agro, CMA/Polytechnique, LSS/Supélec, CVN/Centrale, LMAS/Centrale, DTIM/ONERA, IBISC/UEvry
Visualization: INRIA, LIMSI
Signal processing: LTCI/Telecom, CMA/Polytechnique, CVN/Centrale, LSS/Supélec, CMLA/Cachan, LIMSI, DTIM/ONERA
Statistics: LMO/UPSud, LS/ENSAE, LSS/Supélec, CMA/Polytechnique, LMAS/Centrale, MIA/AgroParisTech
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Domain science: human society, life, brain, earth, universe
Tool building: software engineering, clouds/grids, high-performance computing, optimization
Roles: data scientist, applied scientist, domain scientist, data engineer, software engineer
datascience-paris-saclay.fr
@SaclayCDS
LIST/CEA
Data domains: energy and physical sciences, health and life sciences, Earth and environment, economy and society, brain
Roles: data scientist, data trainer, applied scientist, domain scientist, software engineer, data engineer
Data science: statistics, machine learning, information retrieval, signal processing, data visualization, databases
Tool building: software engineering, clouds/grids, high-performance computing, optimization
CHALLENGES
• (The lack of) manpower
  • especially at the interfaces
  • industrial brain-drain
• Incentives
  • data scientists are not incentivized to work on domain science
  • scientists are not incentivized to work on tools
• Access
  • no well-developed channels to identify the right experts for a given problem
• Tools
  • few tools that can help domain scientists and data scientists to collaborate efficiently
https://medium.com/@balazskegl
TWO ANALYTICS TOOLS FOR INITIATING DOMAIN-DATA SCIENCE INTERACTIONS
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
DATA CHALLENGES
DATA CHALLENGES
• The HiggsML challenge on Kaggle: https://www.kaggle.com/c/higgs-boson
HUGE PUBLICITY
(Slide from B. Kégl / AppStat@LAL, "Learning to discover": CLASSIFICATION FOR DISCOVERY)
SIGNIFICANT IMPROVEMENT OVER THE BASELINE
(Slide from B. Kégl / AppStat@LAL, "Learning to discover": CLASSIFICATION FOR DISCOVERY)
HUGE PUBLICITY
SIGNIFICANT IMPROVEMENT OVER THE BASELINE
yet partially missing the objectives
DATA CHALLENGES
• Challenges are useful for
  • generating visibility in the data science community about novel application domains
  • benchmarking state-of-the-art techniques fairly on well-defined problems
  • finding talented data scientists
• Limitations
  • not necessarily suited to solving complex and open-ended data science problems in realistic environments
  • no direct access to solutions and data scientists
  • emphasis on competition
HUGE PUBLICITY
We decided to design something better
RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)
• Prototyping
• Training
• Human resources
• Collaboration building, networking
• Social science observatory
RAMPS
• Single-day coding sessions
  • 20-40 participants
  • preparation is similar to challenges
• Goals
  • focusing and motivating top talent
  • promoting collaboration, speed, and efficiency
  • solving (prototyping) real problems
ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE
RAMPS
www.ramp.studio
software + management
backend is open source: https://github.com/camillemarini/datarun
RAPID ANALYTICS AND MODEL PROTOTYPING
• 2015 Jan 15: The HiggsML challenge
• 2015 Apr 10: Classifying variable stars
  • accuracy improvement: 89% to 96%
• 2015 June 16 and Sept 26: Predicting El Niño
  • RMSE improvement: 0.9 °C to 0.4 °C
• 2015 October 8: Insect classification
  • accuracy improvement: 30% to 70%
• 2016 February 10: Macroeconomic agent-based models
  • f1-score improvement: 0.57 to 0.63
• 2016 February 13: Epidemium cancer survival rate
  • RMSE improvement: 3000 to 300
• 2016 May 11: Drug identification from spectra
  • Drug identification error improvement: 9% to 3%
  • Drug concentration accuracy improvement: 20% to 12%
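The improvements listed for the individual RAMPs are reported as accuracy, RMSE, and f1-score. As a reminder of what these numbers measure, here is a minimal sketch in plain Python. The function names and the toy data are illustrative assumptions, not taken from the RAMP code base, and the slides do not say how the f1-score was averaged, so only the binary case is shown.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels (higher is better)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error, in the units of the target (lower is better)."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def f1_score(y_true, y_pred, positive=1):
    """Binary F1: harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented for illustration only.
labels_true = [1, 0, 1, 1, 0, 1]
labels_pred = [1, 0, 0, 1, 1, 1]
values_true = [1.0, 2.0, 3.0, 4.0]
values_pred = [1.0, 2.0, 3.0, 8.0]

print(accuracy(labels_true, labels_pred))
print(f1_score(labels_true, labels_pred))
print(rmse(values_true, values_pred))
```

Accuracy and f1-score improve by going up, while RMSE and error rates improve by going down, which is why the reported pairs move in different directions.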
THE RAMP TOOL
A prototyping tool for collaborative development of data science workflows
• Fast development of analytics solutions
• Teaching support
• Networking and HR support
• Support for collaborative team work
• Commercialized through TektosData
TAKE HOME MESSAGES
• We have a cool tool for collaborative data analytics
  • designing workflows beyond scikit-learn predictors
• Data management/munging is a big part of the data analytics workflow; we need tools for it
  • preparing a RAMP takes two weeks to six months
• Big data is rare: our problems are more about flexible organization of heterogeneous data
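One way to read "workflows beyond scikit-learn predictors" is a submission made of several code components, for example a feature extractor plus a classifier, that the organizers train and score on held-out folds. The sketch below illustrates that idea in plain Python. It is not the RAMP or datarun API: the class names `FeatureExtractor` and `Classifier`, the function `cross_validated_accuracy`, and the toy light-curve data are all hypothetical.

```python
class FeatureExtractor:
    """Hypothetical submitted component: raw records -> feature vectors."""

    def fit(self, X_raw, y):
        return self  # nothing to learn in this toy extractor

    def transform(self, X_raw):
        # Summarize each raw record (a list of measurements) by mean and range.
        return [[sum(r) / len(r), max(r) - min(r)] for r in X_raw]


class Classifier:
    """Hypothetical submitted component: a nearest-centroid classifier."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids_[label] = [sum(col) / len(rows) for col in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: dist2(x, self.centroids_[c]))
            for x in X
        ]


def cross_validated_accuracy(X_raw, y, n_folds=3):
    """Train the two components on each training fold, score on the held-out fold."""
    scores = []
    for k in range(n_folds):
        test = list(range(k, len(y), n_folds))
        train = [i for i in range(len(y)) if i not in test]
        X_train, y_train = [X_raw[i] for i in train], [y[i] for i in train]
        extractor = FeatureExtractor().fit(X_train, y_train)
        clf = Classifier().fit(extractor.transform(X_train), y_train)
        y_pred = clf.predict(extractor.transform([X_raw[i] for i in test]))
        scores.append(sum(p == y[i] for p, i in zip(y_pred, test)) / len(test))
    return sum(scores) / len(scores)


# Toy "light curves", invented for illustration: constant vs. variable sources.
X_raw = [
    [1.0, 1.1, 0.9], [0.0, 2.0, 1.0], [1.2, 1.0, 1.1],
    [0.1, 1.9, 1.0], [0.9, 1.0, 1.0], [0.2, 2.1, 0.8],
]
y = ["constant", "variable", "constant", "variable", "constant", "variable"]

score = cross_validated_accuracy(X_raw, y)
print(score)
```

Keeping the feature extractor and the classifier as separate components means participants can reuse or swap one part of someone else's submission without touching the other.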
THANK YOU!