Presentación de PowerPoint - AI & Big Data Congress

45
portada

Transcript of Presentación de PowerPoint - AI & Big Data Congress

Page 1: Presentación de PowerPoint - AI & Big Data Congress

portada

Page 2: Presentación de PowerPoint - AI & Big Data Congress

WORKSHOP 2MODELOS DE MACHINE LEARNING EN PRODUCCIÓN:

DIFERENTES ENFOQUES Y EL ROL DE LAS HERRAMIENTAS DE GESTIÓN DEL CICLO DE VIDA ML

Page 3: Presentación de PowerPoint - AI & Big Data Congress

Rohit KumarResearcher, Data Science and Big Data Analytics UnitEurecateurecat.org | @Eurecat_News | @Eurecat_Events

Page 4: Presentación de PowerPoint - AI & Big Data Congress

BIG DATA & AI Congress 2019

MACHINE LEARNING MODELS IN PRODUCTION

Dr. Rohit Kumar

Researcher, Eurecat

Page 5: Presentación de PowerPoint - AI & Big Data Congress

5BigData & AI Congress 2019

Who is this guy!!

Physicist Java Developer/System Architect (6+ years) Big Data Researcher

(5+ years)

I am not a ML expert or a Data Scientist!!I am just an engineer who can follow complex maths…

Page 6: Presentación de PowerPoint - AI & Big Data Congress

6BigData & AI Congress 2019

Quick Poll?

How many Data scientist?

How many Data engineer?

How many DevOps?

How many managers/architects?

Rockstars who does everything?

Page 7: Presentación de PowerPoint - AI & Big Data Congress

7BigData & AI Congress 2019

Quick Feedback: How many have this issue?

Page 8: Presentación de PowerPoint - AI & Big Data Congress

8BigData & AI Congress 2019

What I am going to talk about!!

Data wrangling Training Prediction

ML Life Cycle!! Not Data Science Life Cycle..

Acknowledgment: https://www.kdnuggets.com/2019/06/approaches-deploying-machine-learning-production.html by Julien Kervizic

Page 9: Presentación de PowerPoint - AI & Big Data Congress

9BigData & AI Congress 2019

Training

1. One Off training

Where do you save your notebook/code?Where and how do you save your models?

Page 10: Presentación de PowerPoint - AI & Big Data Congress

10BigData & AI Congress 2019

Training

2. Batch training (continuous learning system)

Not simply retraining the model weights !!

Train Predict

But also feature processing, feature selection, model selections and parameter optimization.

Page 11: Presentación de PowerPoint - AI & Big Data Congress

11BigData & AI Congress 2019

AutoML

AutoWEKA is an approach for the simultaneous selection of a machine learning algorithm and its hyperparameters; combined withthe WEKA package it automatically yields good models for a wide variety of data sets.

TPOT is a data-science assistant which optimizes machine learningpipelines using genetic programming. (Good for Python users)

H2O AutoML provides automated model selection and ensembling forthe H2O machine learning and data analytics platform. (Nice for Java users)

TransmogrifAI is an AutoML library running on top of Spark.

Page 12: Presentación de PowerPoint - AI & Big Data Congress

12BigData & AI Congress 2019

To manage different data workflows

To automate Machine Learning

Page 13: Presentación de PowerPoint - AI & Big Data Congress

13BigData & AI Congress 2019

Page 14: Presentación de PowerPoint - AI & Big Data Congress

14BigData & AI Congress 2019

Azure ML

Offcourse there are easy to use fancy platforms !! (if you have money and do not want to get your hands too dirty !!)

Page 15: Presentación de PowerPoint - AI & Big Data Congress

15BigData & AI Congress 2019

Training

3. Real time trainingTrain

..

Incremental Training using “Online Machine Learning Algorithms”

Page 16: Presentación de PowerPoint - AI & Big Data Congress

16BigData & AI Congress 2019

Quick Poll: What kind of ML training approach you have used?

A. One Off training.

B. Batch training adhoc

C. Batch training using some platform.

D. Real time training

Page 17: Presentación de PowerPoint - AI & Big Data Congress

17BigData & AI Congress 2019

Saving your Models

In Python world:1) Pickle: 2) Joblib: significantly faster on large numpy arrays

In R world:1) saveRDS/readRDS – can be assigned to a new model variable2) save/load – returns the same model variable

In Java (H20 world):1) Pojo : Plan old java object2) Mojo: Model ObJect, Optimized

Page 18: Presentación de PowerPoint - AI & Big Data Congress

18BigData & AI Congress 2019

Saving your Models

“Change, which is the only constant in life, can be secured and facilitated bystandardization.”

Page 19: Presentación de PowerPoint - AI & Big Data Congress

19BigData & AI Congress 2019

Saving your Models

ONNX (Open Neural Network Exchange) format: developed byFacebook and Microsoft in collaboration it is an open-source standard for representing deep learning models that will let you transfer them between CNTK, Caffe2 and PyTorch.

Link: https://github.com/onnx/tutorials

TensorFlow PyTorch

Page 20: Presentación de PowerPoint - AI & Big Data Congress

20BigData & AI Congress 2019

Saving your Models

Page 21: Presentación de PowerPoint - AI & Big Data Congress

21BigData & AI Congress 2019

Page 22: Presentación de PowerPoint - AI & Big Data Congress

22BigData & AI Congress 2019

Saving your Models

PMML (Predictive model markup language): Developed by Data Mining Group (DMG) a consortium of commercial and open-source data mining companies and supported by IBM, AWS and Google.

PFA (Portable Format for Analytics): it is more flexible (using JSON instead of XML) and also handling data preparation along with the modelsthemselves.

Page 23: Presentación de PowerPoint - AI & Big Data Congress

23BigData & AI Congress 2019

Saving your Models

MLWriter/MLReader: Inherente to Spark

Supports not only saving a Model but a ML Pipeline.

The result of save for pipeline model is a JSON file for metadata whileParquet for model data, e.g. coefficients.

Page 24: Presentación de PowerPoint - AI & Big Data Congress

24BigData & AI Congress 2019

Is a model or Pipeline saved using Apache Spark ML persistence in Sparkversion X loadable by Spark version Y?• Major versions: No guarantees, but best-effort.

• Minor and patch versions: Yes; (backwards compatible)

• Note about the format: There are no guarantees for a stable persistence format, but model loading itself is designed to be backwards compatible.

Does a model or Pipeline in Spark version X behave identically in Spark versionY?

• Major versions: No guarantees, but best-effort.

• Minor and patch versions: Identical behavior, except for bug fixes.

Page 25: Presentación de PowerPoint - AI & Big Data Congress

25BigData & AI Congress 2019

Quick Poll: Machine Learning Deployment

A. One time PPT/PDF/Report for Managers.

B. Deployed in Server for Batch Prediction.

C. Deployed in a streaming platform for real time prediction.

Page 26: Presentación de PowerPoint - AI & Big Data Congress

26BigData & AI Congress 2019

Batch vs. Real-time Prediction

Load implications Infrastructure Implications

Cost Implications Evaluation Implication

Page 27: Presentación de PowerPoint - AI & Big Data Congress

27BigData & AI Congress 2019

Batch Prediction

Data

Load Model

Featureextracted

SaveApply to the modelon data

Used by other systems

Use workflow managers like Airflow, or as simple as crontab etc.

Page 28: Presentación de PowerPoint - AI & Big Data Congress

28BigData & AI Congress 2019

Real Time Prediction

1. Database Trigger based prediction

PostGres supprts native Python integration and can run Python script on trig

Ms SQL Server has Machine Learning Services (In-Database)

Teradata can run R/Python script through an external script command.

Oracle supports PMML model through its data mining extension.

Page 29: Presentación de PowerPoint - AI & Big Data Congress

29BigData & AI Congress 2019

Real Time Prediction

2. Message queue based prediction service

A combination of Kafka for message queue and Spark Streaming forprocessing.

Azure eventhub with Azure functions

Google Pub/Sub with Apache Beam.

Page 30: Presentación de PowerPoint - AI & Big Data Congress

30BigData & AI Congress 2019

Real Time Prediction

3. Web services based deployment

Get unique ids and let web service pull related data from backend system and make prediction.

Get the complete payload with data and make the prediction.

A combination of both.

Page 31: Presentación de PowerPoint - AI & Big Data Congress

31BigData & AI Congress 2019

Real Time Prediction

3.1 Using Functions!!

+ ML Net

Page 32: Presentación de PowerPoint - AI & Big Data Congress

32BigData & AI Congress 2019

Page 33: Presentación de PowerPoint - AI & Big Data Congress

33BigData & AI Congress 2019

Real Time Prediction

3.2 Using Container

Ununiform environments across models. Ununiform library requirements across models. Ununiform resource requirements across models. Scaling at model level

Page 34: Presentación de PowerPoint - AI & Big Data Congress

34BigData & AI Congress 2019

Real Time Prediction

3.3 Directly In-App

Google ML Kit or the likes of Caffe2or Apple’s core ML allows to leveragemodels within native applications

Tensorflow.js or ONNXJs allow running modelsdirectly on browser

Page 35: Presentación de PowerPoint - AI & Big Data Congress

35BigData & AI Congress 2019

Real Time Prediction

3.4 Using Notebooks!!!!

Page 36: Presentación de PowerPoint - AI & Big Data Congress

36BigData & AI Congress 2019

So Far So Good!!

Page 37: Presentación de PowerPoint - AI & Big Data Congress

37BigData & AI Congress 2019

Why ML Life Cycle is more complicated than SDLC!

Variety of tools and packages at disposal.

Hard to track useful experiments.

Reproducibility is hard.

Deploying is hard.

Page 38: Presentación de PowerPoint - AI & Big Data Congress

38BigData & AI Congress 2019

ML Life Cycle Management Platforms

Azure ML services (launched in 2015)

Page 39: Presentación de PowerPoint - AI & Big Data Congress

39BigData & AI Congress 2019

ML Life Cycle Management Platforms

Amazon SageMaker (launched in 2017)

Page 40: Presentación de PowerPoint - AI & Big Data Congress

40BigData & AI Congress 2019

ML Life Cycle Management Platforms

Google AI (launched in 2018)

Page 41: Presentación de PowerPoint - AI & Big Data Congress

41BigData & AI Congress 2019

ML Life Cycle Management Platforms

Cloudera DS Workbench (launched in 2018)

Page 42: Presentación de PowerPoint - AI & Big Data Congress

42BigData & AI Congress 2019

ML Life Cycle Management Platforms

Databricks Managed ML Flow (launched in 2019) : ML Flow in itself is a library and not a hosted platform.

Page 43: Presentación de PowerPoint - AI & Big Data Congress

43BigData & AI Congress 2019

ML Life Cycle Management Platforms

Page 44: Presentación de PowerPoint - AI & Big Data Congress

44BigData & AI Congress 2019

Page 45: Presentación de PowerPoint - AI & Big Data Congress

portada