Out of Memory? No Problem. - MathWorks · Out of Memory? No Problem. Developing Machine Learning...


Page 1: Out of Memory? No Problem. - MathWorks · Out of Memory? No Problem. Developing Machine Learning Models on Big Data Heather Gorr, PhD MATLAB Product Marketing Manager. 4 Big data

1

Page 3

Out of Memory? No Problem. Developing Machine Learning Models on Big Data

Heather Gorr, PhD

MATLAB Product Marketing Manager

Page 4

Big data without big changes

One file

One hundred files

Page 5

The big data landscape can seem overwhelming

Page 6

Building machine learning models with big data

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Page 7

Case study: Predict Air Quality in North America

Page 8

Building machine learning models with big data – step by step

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Page 9

Historical files are on HDFS and real time data are available through an API

• Temperature
• Pressure
• Relative Humidity
• Dew Point
• Wind speed
• Wind direction
• Ozone
• CO
• NO2
• SO2

Page 10

You have 1TB of data you’ve never seen before. Where do you start?

Page 11

Use a Spark-enabled Hadoop cluster and MATLAB. Both are well known for machine learning.

HDFS

YARN

Spark

MATLAB

Page 12

Access and preview the data with datastore

Page 13

There are numerous datastores to access data in many forms

• Databases
• Images
• MDF Files
• Custom
• Simulink

Page 14

Access air quality data using datastore

Page 16

Access air quality data using datastore

Page 17

Preview the data and adjust properties to best represent the data of interest
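
The access-and-preview step might look like the following minimal sketch. The file name, column names, values, and the 'NA' missing-value marker are all illustrative assumptions; the slides do not show the actual files or settings.

```matlab
% Write a tiny sample file so the sketch is self-contained.
% File name, columns, and values are hypothetical.
folder = tempname;
mkdir(folder);
fid = fopen(fullfile(folder, 'ozone_sample.csv'), 'w');
fprintf(fid, 'Date,Ozone,Temperature\n');
fprintf(fid, '2017-01-01,31,12.5\n');
fprintf(fid, '2017-01-02,NA,13.1\n');
fprintf(fid, '2017-01-03,40,11.8\n');
fclose(fid);

% Point a datastore at the folder; no data is loaded into memory yet.
% 'NA' is an assumed marker for missing readings.
ds = datastore(fullfile(folder, '*.csv'), 'TreatAsMissing', 'NA');

% Look at the first rows without reading the whole collection.
preview(ds)

% Adjust properties so that only the columns of interest are imported.
ds.SelectedVariableNames = {'Date', 'Ozone'};
p = preview(ds);
```

The same calls apply when the location is a folder of many files, which is the point of the "one file, one hundred files" slide.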

Page 18

Use tall arrays to work with the data like any MATLAB array

Page 19

Create a tall array for each datastore

ozone
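
A minimal sketch of creating a tall array from a datastore, assuming a small hypothetical CSV file with Day and Ozone columns. The call to mapreducer(0) keeps the computation in the local MATLAB session rather than a parallel pool.

```matlab
% Run tall computations in the local session (no parallel pool).
mapreducer(0);

% Hypothetical sample data written to a temporary folder.
folder = tempname;
mkdir(folder);
T = table((1:6)', [31; 28; 45; 40; 35; 50], ...
    'VariableNames', {'Day', 'Ozone'});
writetable(T, fullfile(folder, 'ozone.csv'));

% The tall table is backed by the datastore, not held in memory.
ds = datastore(fullfile(folder, '*.csv'));
ozone = tall(ds);

% Index and compute as with an ordinary table; nothing is read yet.
avgOzone = mean(ozone.Ozone);

% gather triggers the evaluation and returns an in-memory value.
avgOzone = gather(avgOzone);
```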

Page 20

Use familiar MATLAB functions on tall arrays

Page 21

Clean messy data using common preprocessing functions
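
As a rough illustration of cleaning, the sketch below assumes that missing readings appear either as NaN or as a sentinel value of -999. The sentinel and the column names are assumptions; the slides do not say which cleaning functions were actually used.

```matlab
% Run tall computations in the local session.
mapreducer(0);

% Hypothetical messy data: NaN and -999 both mean "no reading".
raw = tall(table([31; NaN; -999; 40; 35], [12.5; 13.1; 11.8; NaN; 12.2], ...
    'VariableNames', {'Ozone', 'Temperature'}));

% Convert the sentinel to the standard missing value (NaN).
cleaned = standardizeMissing(raw, -999);

% Drop every row that still contains a missing value.
cleaned = rmmissing(cleaned);

% Bring the small cleaned result into memory to inspect it.
cleaned = gather(cleaned);
```

Removing rows is only one option; filling gaps may suit a time series better, depending on how the data will be used.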

Page 22

Execution model makes operations more efficient on big data

▪ Deferred evaluation
  – Commands are not executed right away
  – Operations are added to a queue

▪ Execution triggers include:
  – gather function
  – summary function
  – Machine learning models
  – Plotting

Page 23

Execution model makes operations more efficient on big data

Unnecessary results are not computed
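
Deferred evaluation can be seen in a small sketch like this one, which uses an in-memory vector converted to a tall array purely for illustration. The operations are queued first, and a single gather call evaluates both requested results together.

```matlab
% Run tall computations in the local session.
mapreducer(0);

x = tall((1:1000)');

% These lines only queue operations; nothing is computed yet.
y = x .^ 2;
total = sum(y);
largest = max(y);

% One gather call evaluates both queued results together.
[total, largest] = gather(total, largest);
```

Requesting several results in one gather call lets MATLAB share the passes through the data instead of repeating them for each result.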

Page 24

Explore the data with tall visualizations

plot

scatter

binscatter

histogram

histogram2

ksdensity
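
Two of the listed plots, sketched on synthetic data. The temperature and ozone values are randomly generated stand-ins, not the air quality data from the talk.

```matlab
% Run tall computations in the local session.
mapreducer(0);

% Synthetic stand-in data; the relationship is invented for illustration.
rng(0);
temp = 20 + 5*randn(5000, 1);
oz = 30 + 0.8*(temp - 20) + 3*randn(5000, 1);
tt = tall(table(temp, oz, 'VariableNames', {'Temperature', 'Ozone'}));

% Plotting a tall array triggers evaluation of the queued operations.
figure
h = histogram(tt.Ozone);

% binscatter bins the points, so it stays readable for very large data.
figure
binscatter(tt.Temperature, tt.Ozone)
```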

Page 25

Get a summary of the data
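
A short sketch of the summary step on a hypothetical four-row tall table. Calling summary is one of the execution triggers, so it runs the queued operations and prints per-variable statistics.

```matlab
% Run tall computations in the local session.
mapreducer(0);

% Hypothetical data with one missing ozone reading.
tt = tall(table([31; 28; NaN; 40], [12.5; 13.1; 11.8; 14.0], ...
    'VariableNames', {'Ozone', 'Temperature'}));

% Prints type, range, and missing-value count for each variable.
summary(tt)
```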

Page 26

Gather a subset of the data

datasample: from 1980 - 2017

head: first 10000

tail: last 10000
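
The head and tail subsets could be gathered as in this sketch. It uses a small synthetic table and five rows instead of 10000; the Year and Ozone columns are assumptions. Random sampling with datasample is left out here because its options for tall arrays are not shown in the slides.

```matlab
% Run tall computations in the local session.
mapreducer(0);

% Synthetic yearly data standing in for the full record.
years = (1980:2017)';
tt = tall(table(years, 30 + 0.1*(years - 1980), ...
    'VariableNames', {'Year', 'Ozone'}));

% head and tail stay deferred; gather brings the rows into memory.
firstRows = gather(head(tt, 5));
lastRows = gather(tail(tt, 5));
```

Once gathered, firstRows and lastRows are ordinary tables that fit in memory.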

Page 27

Explore the subset of data in MATLAB as you always do

Page 28

Use the results of explorations to help make decisions

Page 29

Use the results of explorations to help make decisions

Page 30

Synchronize all data to daily times
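
A minimal sketch of daily synchronization, assuming two sensors that report at different intervals. It uses small in-memory timetables with invented values; which options carry over to tall timetables should be checked against the documentation for the release in use.

```matlab
% Ozone every 6 hours over two days (hypothetical readings).
t1 = datetime(2017, 1, 1) + hours(0:6:42)';
ozone = timetable(t1, (1:8)', 'VariableNames', {'Ozone'});

% Temperature every 12 hours over the same two days.
t2 = datetime(2017, 1, 1) + hours(0:12:36)';
temperature = timetable(t2, [10; 14; 12; 16], ...
    'VariableNames', {'Temperature'});

% Put both on a common daily time vector, averaging within each day.
daily = synchronize(ozone, temperature, 'daily', 'mean');
```

Averaging is an assumption here; another aggregation such as the daily maximum may be more appropriate for some pollutants.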

Page 31

Save the preprocessed data so you don't have to repeat these steps each time
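
Saving and reloading might be sketched as follows, assuming a temporary output folder and a tiny hypothetical table. The write function evaluates the tall array and stores it as a set of files that a datastore can read back later.

```matlab
% Run tall computations in the local session.
mapreducer(0);

% Stand-in for the preprocessed data.
tt = tall(table((1:4)', [31; 28; 40; 35], ...
    'VariableNames', {'Day', 'Ozone'}));

% Hypothetical output folder; write creates it.
location = tempname;
write(location, tt);

% A later session can start from the saved files.
ds = datastore(location);
restored = gather(tall(ds));
```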

Page 32

You don’t need to leave MATLAB to monitor large jobs

Page 33

Building machine learning models with big data

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Page 34

How do you know which model to use?

Try them all ☺

Page 35

Predict air quality

Air Quality Index: Regression

Air Quality Label: Classification
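
The two targets are related: a numeric index can be binned into a label. The sketch below assumes the standard US EPA AQI category breakpoints; the slides do not say which labels were used in the case study.

```matlab
% Hypothetical index values, one per category.
aqi = [12; 55; 130; 180; 250; 320];

% Assumed category boundaries (US EPA AQI bands).
edges = [0 51 101 151 201 301 Inf];
names = {'Good', 'Moderate', 'Unhealthy for Sensitive Groups', ...
    'Unhealthy', 'Very Unhealthy', 'Hazardous'};

% The numeric index is a regression target; the label is a classification target.
label = discretize(aqi, edges, 'categorical', names);
```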

Page 36

Use apps for easy model exploration

Page 37

Validate and compare models
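
One way to compare models in code, sketched on a synthetic in-memory subset with a holdout split. The predictors, the coefficients, and the choice of a linear model versus a regression tree are illustrative assumptions rather than the models from the talk.

```matlab
% Synthetic subset; the linear relationship is invented.
rng(0);
n = 500;
temp = 20 + 5*randn(n, 1);
wind = 3 + randn(n, 1);
aqi = 40 + 1.5*temp - 2*wind + 2*randn(n, 1);
X = [temp wind];

% Hold out 30% of the rows for validation.
c = cvpartition(n, 'HoldOut', 0.3);
Xtrain = X(training(c), :);  ytrain = aqi(training(c));
Xtest = X(test(c), :);       ytest = aqi(test(c));

% Fit two candidate models on the training rows.
linMdl = fitlm(Xtrain, ytrain);
treeMdl = fitrtree(Xtrain, ytrain);

% Compare root-mean-square error on the held-out rows.
rmseLin = sqrt(mean((predict(linMdl, Xtest) - ytest).^2));
rmseTree = sqrt(mean((predict(treeMdl, Xtest) - ytest).^2));
fprintf('Linear RMSE: %.2f, tree RMSE: %.2f\n', rmseLin, rmseTree);
```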

Page 38

Select the most important features

Page 39

Building machine learning models with big data

Access, Preprocessing, and Exploration

Model Development

Model Validation and Scaling Up

Page 40

Scale up with tall machine learning models

▪ Linear Regression (fitlm)

▪ Logistic & Generalized Linear Regression (fitglm)

▪ Discriminant Analysis Classification (fitcdiscr)

▪ K-means Clustering (kmeans)

▪ Principal Component Analysis (pca)

▪ Partition for Cross Validation (cvpartition)

▪ Linear Support Vector Machine (SVM) Classification (fitclinear)

▪ Naïve Bayes Classification (fitcnb)

▪ Random Forest Ensemble Classification (TreeBagger)

▪ Lasso Linear Regression (lasso)

▪ Linear Support Vector Machine (SVM) Regression (fitrlinear)

▪ Single Classification Decision Tree (fitctree)

▪ Linear SVM Classification with Random Kernel Expansion (fitckernel)

Page 41

Big data machine learning models also include goodness-of-fit measures and convenient functions to explore and validate the model
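
A minimal sketch of fitting one of the listed models to tall data and reading a goodness-of-fit measure. The data are synthetic, with an assumed true slope of 1.5, and fitlm stands in for whichever model the case study actually used.

```matlab
% Run tall computations in the local session.
mapreducer(0);

% Synthetic data; variable names and coefficients are invented.
rng(0);
temp = 20 + 5*randn(2000, 1);
aqi = 40 + 1.5*temp + 2*randn(2000, 1);
tt = tall(table(temp, aqi, 'VariableNames', {'Temperature', 'AQI'}));

% Fitting a model is an execution trigger for the tall inputs.
mdl = fitlm(tt.Temperature, tt.AQI);

% Goodness of fit and coefficients are available on the fitted model.
disp(mdl.Rsquared.Ordinary)
slope = mdl.Coefficients.Estimate(2);

% Predictions on tall input stay tall until gathered.
pred = gather(predict(mdl, tt.Temperature));
```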

Page 42

Scale up. But not all at once

Use tall arrays in code

Apply model to subset of data

Apply model to all data

Apply model to new data

Deploy/Compile
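
The staged approach can be sketched by changing only the execution environment while the tall array code stays the same. The cluster lines are commented out because they assume Parallel Computing Toolbox plus a configured Hadoop and Spark installation, neither of which is described in the slides.

```matlab
% Stage 1: develop against a subset in the local MATLAB session.
mapreducer(0);
x = tall((1:100)');
localResult = gather(mean(x));

% Stage 2 (assumption, not run here): point the same code at a cluster.
% Requires Parallel Computing Toolbox and a cluster whose Hadoop and
% Spark locations are already configured for MATLAB.
% cluster = parallel.cluster.Hadoop;
% mapreducer(cluster);
% clusterResult = gather(mean(x));
```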

Page 43

Big data without big changes

One file

One hundred files
