Deep learning and the systemic challenges of data science initiatives

65
Center for Data Science Paris-Saclay 1 CNRS & University Paris Saclay BALÁZS KÉGL DEEP LEARNING AND THE SYSTEMIC CHALLENGES OF DATA SCIENCE INITIATIVES

Transcript of Deep learning and the systemic challenges of data science initiatives

Page 1: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay1

CNRS & University Paris SaclayBALÁZS KÉGL

DEEP LEARNING AND THE SYSTEMIC CHALLENGES OF DATA SCIENCE

INITIATIVES

Page 2: Deep learning and the systemic challenges of data science initiatives

2

I’m not going to explain deep learning in detail

Page 3: Deep learning and the systemic challenges of data science initiatives

3

Rather: give an overview of what you can do with it

Page 4: Deep learning and the systemic challenges of data science initiatives

• Vincent Vanhoucke (Google)

• Hugo Larochelle (Twitter)

• Andrew Ng (Baidu)

• Nando de Freitas (Oxford, Google DeepMind)

4

DEEP LEARNING COURSES

Page 5: Deep learning and the systemic challenges of data science initiatives

5

Your challenges are not technological but organizational

Page 6: Deep learning and the systemic challenges of data science initiatives

• Technology is disruptive

• The current organization of research is half broken and changing

• Misplaced incentives, interdisciplinarity, peer-reviewed publications, code vs papers, funding, reproducibility, questions around data-driven scientific method

• We are using few of the tools developed mainly in industry to manage disruptive innovation

6

WHY CHALLENGES ARE ORGANIZATIONAL?

Page 7: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

• Intro to deep learning

• The PS-CDS

• The data science ecosystem: challenges

• Some tools

7

OUTLINE

Page 8: Deep learning and the systemic challenges of data science initiatives

• You have a prediction or inference problem y = f(x)

• X: photo, spectrum, y: galaxy/star and redshift

• X: calorimetric image, y: particle parameters

• X: particle parameters, y: calorimetric image

8

DATA-DRIVEN INFERENCE

Page 9: Deep learning and the systemic challenges of data science initiatives

• You have a prediction or inference problem y = f(x)

• You have no model to fit, but a large set of (x, y) pairs

• The source is (typically) either

• observation + human labeling

• simulation

• And a loss function L(y, ypred)9

DATA-DRIVEN INFERENCE

Page 10: Deep learning and the systemic challenges of data science initiatives

• The solution

• Design/define a lot of application/domain-dependent cues/features hj(x), or use “simple” nonlinear functions (like trees, basis functions, filters).

• Learn a linear function f(x) = ∑j wj hj(x)

• shallow neural nets, ensemble methods, kernel methods

• Works well for most of the practical problems (but not all)

10

THE SHALLOW LEARNING PARADIGM

Page 11: Deep learning and the systemic challenges of data science initiatives

11

Your most important question is:

are you in the “not all” part?

Page 12: Deep learning and the systemic challenges of data science initiatives

• The solution

• Parametrize f(x) = f(x, w)

• w is very high-dimensional, f has a lot of capacity

• make everything quasi-differentiable (L and f(.,w))

• regularize (L1, L2, dropout, etc.)

• learn w using stochastic gradient descent

12

THE DEEP LEARNING PARADIGM

Page 13: Deep learning and the systemic challenges of data science initiatives

• From a design (user) point of view

• Instead of hand-crafting (families of) informative features, you will design a system of reusable blocks of differentiable functions

• Close to the data, domain knowledge is important

• Deeper layers are rather general

• A lot of partly reusable trial-end-error tricks

• Pre-trained and saved networks/blocks,

• “dark knowledge”

13

SHALLOW TO DEEP LEARNING

Page 14: Deep learning and the systemic challenges of data science initiatives

14

STATE OF THE ART

• Computer vision

• close to the data: convolutional layers, max pooling

• Sequential data (speech, language)

• recurrent nets, networks with memory (LSTM)

• Multi-modal embeddings (eg: caption generation)

• (Half) future: robotics, Turing machine, reasoning, neural simulators

Page 15: Deep learning and the systemic challenges of data science initiatives

• Tools, techniques

• deep learning libraries (Theano, TensorFlow, Caffe, Torch)

• automatic differentiation

• stochastic gradient descent

• hyperparameter optimization

• lots of data and machines (GPUs)

15

THE DEEP LEARNING PARADIGM

Page 16: Deep learning and the systemic challenges of data science initiatives

16

I will stop talking about science

Well, not really

Page 17: Deep learning and the systemic challenges of data science initiatives

17

I will talk about management

(of) (data) science

Page 18: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

• My eight-year of experience interfacing between high-energy physics and data science

• Our two-year experience of running PS-CDS

• Extensive collaboration with management scientist

18

WHERE DOES IT COME FROM?

Page 19: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

DATA SCIENCE IN THE WORLD

19

Page 20: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay20

UNIVERSITÉ PARIS-SACLAY

19 founding partners

Page 21: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

UNIVERSITÉ PARIS-SACLAY

21

+ horizontal multi-disciplinary and multi-partner initiatives to create cohesion

Page 22: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay22

Center for Data ScienceParis-Saclay

A multi-disciplinary initiative to define, structure, and manage the data science ecosystem at the Université Paris-Saclay

http://www.datascience-paris-saclay.fr/

Biology & bioinformaticsIBISC/UEvry LRI/UPSudHepatinovCESP/UPSud-UVSQ-Inserm IGM-I2BC/UPSud MIA/AgroMIAj-MIG/INRALMAS/Centrale

ChemistryEA4041/UPSud

Earth sciencesLATMOS/UVSQ GEOPS/UPSudIPSL/UVSQLSCE/UVSQLMD/Polytechnique

EconomyLM/ENSAE RITM/UPSudLFA/ENSAE

NeuroscienceUNICOG/InsermU1000/InsermNeuroSpin/CEA

Particle physics astrophysics & cosmologyLPP/Polytechnique DMPH/ONERACosmoStat/CEAIAS/UPSudAIM/CEALAL/UPSud

The Paris-Saclay Center for Data ScienceData Science for scientific Data

250 researchers in 35 laboratories

Machine learningLRI/UPSud LTCI/TelecomCMLA/Cachan LS/ENSAELIX/PolytechniqueMIA/AgroCMA/PolytechniqueLSS/SupélecCVN/Centrale LMAS/CentraleDTIM/ONERAIBISC/UEvry

VisualizationINRIALIMSI

Signal processingLTCI/TelecomCMA/PolytechniqueCVN/CentraleLSS/SupélecCMLA/CachanLIMSIDTIM/ONERA

StatisticsLMO/UPSud LS/ENSAELSS/SupélecCMA/PolytechniqueLMAS/CentraleMIA/AgroParisTech

Data sciencestatistics

machine learninginformation retrieval

signal processingdata visualization

databases

Domain sciencehuman society

life brain earth

universe

Tool buildingsoftware engineering

clouds/gridshigh-performance

computingoptimization

Data scientist

Applied scientist

Domain scientist

Data engineer

Software engineer

Center for Data ScienceParis-Saclay

datascience-paris-saclay.fr

@SaclayCDS

LIST/CEA

Page 23: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

DATA SCIENCE

23

Design of automated methods

to analyze massive and complex data

to extract useful information

Page 24: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay24

CENTER FOR DATA SCIENCE=

DATA CENTER

We are focusing on inference:

data knowledge

Interfacing with HPC, cloud, storage, production, privacy, security

Page 25: Deep learning and the systemic challenges of data science initiatives

25

WHAT IS NEW?

“As the flow of data increases, it is increasingly processed, analyzed, and acted upon by machines, not

humans.”

NYU-CDS manifesto

Page 26: Deep learning and the systemic challenges of data science initiatives

• We have the data

• statistical / physical modeling is less important

• data-driven prediction

• We have the computational power

• We have the algorithms

• deep learning breakthrough: image, speech, language

• closing on AI, step by step

26

WHAT IS NEW?

Page 27: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay27

https://medium.com/@balazskegl

Page 28: Deep learning and the systemic challenges of data science initiatives

Domain scienceenergy and physical sciences

health and life sciences Earth and environment

economy and society brain

28

THE DATA SCIENCE LANDSCAPEData scientist

Data trainer

Applied scientist

Domain scientistSoftware engineer

Data engineer

Data sciencestatistics

machine learning information retrieval

signal processing data visualization

databases

Tool building software engineering

clouds/grids high-performance

computing optimization

Page 29: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

• (The lack of) manpower

• especially at the interfaces

• industrial brain-drain

• Incentives

• data scientists are not incentivized to work on domain science

• scientists are not incentivized to work on tools

• Access

• no well-developed channels to identify the right experts for a given problem

• Tools

• few tools that can help domain scientists and data scientists to collaborate efficiently

29

CHALLENGES

Page 30: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

TOOLS

30

We are designing and learning to manage tools to

accompany data science projects with different needs

Page 31: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

TOOLS: LANDSCAPE TO ECOSYSTEM

31

Data scientist

Data trainer

Applied scientist

Domain expertSoftware engineer

Data engineer

Tool building Data domains

Data sciencestatistics

machine learning information retrieval

signal processing data visualization

databases

• interdisciplinary projects • matchmaking tool • design and innovation strategy workshops • data challenges

• coding sprints • Open Software Initiative • code consolidator and engineering projects

software engineeringclouds/grids

high-performancecomputing

optimization

energy and physical sciences health and life sciences Earth and environment

economy and society brain

• data science RAMPs and TSs • IT platform for linked data • annotation tools • SaaS data science platform

Page 32: Deep learning and the systemic challenges of data science initiatives

32

• Efficient exploration of the space of innovative ideas

• Communication, knowledge sharing

• Project building

DESIGNING DATA SCIENCE PROJECTS

Page 33: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

DESIGNING DATA SCIENCE PROJECTS

33

Data value

Data analytics

Exploration of value !

• design theory • data-based prospection • innovation workshops

Problem formulation Problem solving

!• specialized teams • RAMPs / training sprints • data challenges

Page 34: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

• Putting domain scientists, data scientists, and management scientist in the same room

• Getting them understand each other

• Keeping them collectively creative

• The goal: identifying and defining projects

• low-hanging fruits

• breakthrough projects

• long-term vision

34

DESIGN AND INNOVATION STRATEGY WORKSHOPS

Page 35: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

innovative design

=

interaction and joint expansion of concepts

and knowledge

35

DESIGN AND INNOVATION STRATEGY WORKSHOPS

C/K design theory

Page 36: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay36

DESIGN AND INNOVATION STRATEGY WORKSHOPS

[P]$Project$building!Ini3alisa3on$

[K]$Knowledge$sharing$

Workshops$

[C]$IFM?Design$Workshops$

[RUN]!

DKCP process: linearizing C-K dynamics

Page 37: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

THREE ANALYTICS TOOLS FOR INITIATING DOMAIN-DATA SCIENCE INTERACTIONS

37

RAPID ANALYTICS AND MODEL PROTOTYPING

(RAMP)

DATA CHALLENGES

TRAINING SPRINTS (TS)

Page 38: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

DATA CHALLENGES

38

Page 39: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

DATA CHALLENGES

39

• A data challenge is a recently developed unconventional dissemination and communication tool

• a scientific or industrial data producer arrives with a well-defined problem and a corresponding annotated data set

• defines a quantitative goal

• makes the problem and part of the data set (the training set) public on a dedicated site

• data science experts then take the public training data and submit solutions (predictions) for a test set with hidden annotations

• submissions are evaluated numerically using the quantitative measure

• contestants are listed on a leaderboard

• after a predefined time, typically a couple of months, the final results are revealed and the winners are awarded

Page 40: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

DATA CHALLENGES

40

• The HiggsML challenge on Kaggle

• https://www.kaggle.com/c/higgs-boson

Page 41: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

HUGE PUBLICITY

41

B. Kégl / AppStat@LAL Learning to discover

CLASSIFICATION FOR DISCOVERY

14

Page 42: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

SIGNIFICANT IMPROVEMENT OVER THE BASELINE

42

B. Kégl / AppStat@LAL Learning to discover

CLASSIFICATION FOR DISCOVERY

15

Page 43: Deep learning and the systemic challenges of data science initiatives

43

HUGE PUBLICITY

SIGNIFICANT IMPROVEMENT OVER THE BASELINE

yet partially missing the objectives

Page 44: Deep learning and the systemic challenges of data science initiatives

• Challenges are useful for

• generating visibility in the data science community about novel application domains

• benchmarking in a fair way state-of-the-art techniques on well-defined problems

• finding talented data scientists

• Limitations

• not necessary adapted to solving complex and open-ended data science problems in realistic environments

• no direct access to solutions and data scientist

• emphasizes competition

44

DATA CHALLENGES

Page 45: Deep learning and the systemic challenges of data science initiatives

45

We decided to design something better

Page 46: Deep learning and the systemic challenges of data science initiatives

46

RAPID ANALYTICS AND MODEL PROTOTYPING (RAMP)

• Prototyping

• Training

• Collaboration building

Page 47: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay47

RAMPS

• Single-day coding sessions

• 20-40 participants

• preparation is similar to challenges

• Goals

• focusing and motivating top talents

• promoting collaboration, speed, and efficiency

• solving (prototyping) real problems

Page 48: Deep learning and the systemic challenges of data science initiatives

48

TRAINING SPRINTS

• Single-day training sessions

• 20-40 participants

• focusing on a single subject (deep learning, model tuning, functional data, etc.)

• preparing RAMPs

Page 49: Deep learning and the systemic challenges of data science initiatives

49

ANALYTICS TOOLS TO PROMOTE COLLABORATION AND CODE REUSE

Page 50: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

SIGNIFICANT IMPROVEMENT OVER THE BASELINE

50

B. Kégl / AppStat@LAL Learning to discover

CLASSIFICATION FOR DISCOVERY

15

Page 51: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay51

ANALYTICS TOOL TO PROMOTE COLLABORATION AND CODE REUSE

Page 52: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

ANALYTICS TOOLS TO MONITOR PROGRESS

52

Page 53: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Jan 15 The HiggsML challenge

53

Page 54: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

2015 Apr 10 Classifying variable stars

54

Page 55: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

VARIABLE STARS

55

Page 56: Deep learning and the systemic challenges of data science initiatives

Learning to discoverB. Kégl / CNRS - Saclay

VARIABLE STARS

56

accuracy improvement: 89% to 96%

Page 57: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

RAPID ANALYTICS AND MODEL PROTOTYPING

2015 June 16 and Sept 26 Predicting El Nino

57

Page 58: Deep learning and the systemic challenges of data science initiatives

58

RAPID ANALYTICS AND MODEL PROTOTYPING

RMSE improvement: 0.9˚C to 0.4˚C

Page 59: Deep learning and the systemic challenges of data science initiatives

59

2015 October 8 Insect classification

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 60: Deep learning and the systemic challenges of data science initiatives

60

RAPID ANALYTICS AND MODEL PROTOTYPING

accuracy improvement: 30% to 70%

Page 61: Deep learning and the systemic challenges of data science initiatives

61

2015 Fall Drug identification from spectra

RAPID ANALYTICS AND MODEL PROTOTYPING

Page 62: Deep learning and the systemic challenges of data science initiatives

• Teaching support

• Networking and HR support

• Support for collaborative team work

62

THE RAMP TOOL

A prototyping tool for collaborative development of data science workflows

Page 63: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay63

THANK YOU!

Page 64: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

• A window to open data at Paris-Saclay

• We are not storing or handling existing large data sets

• Rather indexing, linking, and mapping, embedding in the worldwide linked data (RDF) ecosystem

• Storing small data sets of small teams is possible

• Subsets of large sets for prototyping

• Or simply store metadata plus pointer

64

IT PLATFORM FOR LINKED DATA

http://io.datascience-paris-saclay.fr/

Page 65: Deep learning and the systemic challenges of data science initiatives

Center for Data ScienceParis-Saclay

IT PLATFORM FOR LINKED DATA

65