Azure Machine Learning - Using Python, R, Spark, and CNTK Deep Learning

Azure Machine Learning: Other Topics. Microsoft Taiwan, Technical Evangelist 吳宏彬, 8/25/2016

Transcript of Azure Machine Learning - Using Python, R, Spark, and CNTK Deep Learning

Page 1: Azure Machine Learning - Using Python, R, Spark, and CNTK Deep Learning

Azure Machine Learning: Other Topics

Microsoft Taiwan

Technical Evangelist

吳宏彬

8/25/2016

Page 4

What Is the R Language?

• Language: open source "lingua franca" for analytics, computing, and modeling
• Global community: millions of users
• Ecosystem: 7000+ algorithms, test data & evaluations
• Scalability: can be scaled to big data and big analytics

Page 5

Polls of data miners and analytics professionals on their software choices since 2007

Source: http://blog.revolutionanalytics.com/2013/10/r-usage-skyrocketing-rexer-poll.html

Page 6

R is developed and maintained by the open source community

CRAN (the Comprehensive R Archive Network) is the package repository for R

7500+ packages, covering all aspects of statistical analysis, machine learning, natural language processing …

Still growing exponentially

Free!

Source: http://r4stats.com/2014/04/07/r-continues-its-rapid-growth/

Page 12

1. Seasonal ARIMA
2. Non-seasonal ARIMA
3. Seasonal ETS
4. Non-seasonal ETS
5. Average of seasonal ETS and seasonal ARIMA

Page 13

Mean Error (ME) - Average forecasting error (an error is the difference between the predicted value and the actual value) on the test dataset

Root Mean Squared Error (RMSE) - The square root of the average of squared errors of predictions made on the test dataset

Mean Absolute Error (MAE) - The average of absolute errors

Mean Percentage Error (MPE) - The average of percentage errors

Mean Absolute Percentage Error (MAPE) - The average of absolute percentage errors

Mean Absolute Scaled Error (MASE)

Symmetric Mean Absolute Percentage Error (sMAPE)
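
The measures listed above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the code behind the slide: the sign convention (error = actual - predicted), the MASE scaling (in-sample MAE of a one-step naive forecast on the training series), the sMAPE form (200 * |error| / (|actual| + |predicted|)), and the function name `forecast_errors` are all choices made here, since the slide does not define them.

```python
import math

def forecast_errors(actual, predicted, train):
    """Return forecast accuracy measures for a test set as a dict.

    Assumptions (not stated on the slide): error = actual - predicted,
    MASE is scaled by the in-sample MAE of a one-step naive forecast on
    `train`, and actual values are nonzero so percentage errors exist.
    """
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    pct_errors = [100.0 * e / a for e, a in zip(errors, actual)]
    # Scaling term for MASE: average absolute one-step change in the training data
    naive_mae = sum(abs(train[i] - train[i - 1]) for i in range(1, len(train))) / (len(train) - 1)
    mae = sum(abs(e) for e in errors) / n
    smape_terms = [200.0 * abs(a - p) / (abs(a) + abs(p)) for a, p in zip(actual, predicted)]
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": mae,
        "MPE": sum(pct_errors) / n,
        "MAPE": sum(abs(x) for x in pct_errors) / n,
        "MASE": mae / naive_mae,
        "sMAPE": sum(smape_terms) / n,
    }

# Toy data, made up for illustration
train = [10.0, 12.0, 11.0, 13.0]
m = forecast_errors(actual=[100.0, 200.0], predicted=[110.0, 180.0], train=train)
```

In this toy case the two errors cancel in ME and MPE while MAE and MAPE stay positive, which is why the signed and absolute measures are usually reported together.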

Page 15

Feature | Open source R | Microsoft R Open | Microsoft R Server
Data size | In-memory | In-memory | In-memory or disk based
Speed of analysis | Single threaded | Multi-threaded | Multi-threaded, parallel processing 1:N servers
Support | Community | Community | Community + commercial
Analytic breadth & depth | 7500+ innovative analytic packages | 7500+ innovative analytic packages | 7500+ innovative packages + commercial parallel high-speed functions
License | Open source | Open source | Commercial license. Supported release with indemnity

Page 17

Supports standard Python library types such as Pandas data frames and NumPy arrays.

Python code is executed on Anaconda 2.1, which comes with close to 200 of the most common Python packages (such as NumPy, SciPy, and scikit-learn).

Outputs images generated with Matplotlib.

Page 19

KNN
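
The slide only names KNN (k-nearest neighbors). As a reminder of the idea, here is a minimal sketch in plain Python: a query point is assigned the majority label among its k closest training points. The function name `knn_predict`, the Euclidean distance, and the toy data are assumptions made for illustration, not the demo shown in the talk.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training point
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest neighbours and count their labels
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters, made up for illustration
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
prediction_near_a = knn_predict(points, labels, (1.1, 1.0))
prediction_near_b = knn_predict(points, labels, (5.0, 5.2))
```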

Page 21

What is Spark?

Page 22

Data is growing faster than processing speeds

The only solution is to parallelize data processing on large clusters

Example: HDInsight

Page 23

What is Spark?

Fast, expressive cluster computing system compatible with Apache Hadoop
• Works with any Hadoop-supported storage system (HDFS, S3, Avro, …)

Improves efficiency through (up to 100× faster):
• In-memory computing primitives
• General computation graphs

Improves usability through (often 2-10× less code):
• Rich APIs in Java, Scala, Python
• Interactive shell

Spark was initially started by Matei Zaharia at UC Berkeley AMPLab in 2009, was open sourced in 2010, and was donated to Apache in 2013

Page 24

Spark for Azure HDInsight

[Diagram: clients (decision makers) connect to a Spark cluster made up of several Spark nodes, which sits on top of a storage layer]

Page 25

Spark Notebooks

• Using the Spark shell to run interactive queries
• Using the Spark shell to run Spark SQL queries
• Using a standalone Scala program

Page 26

Spark Notebooks

• Zeppelin: for Scala users
• Jupyter: for Python users

Page 27

Programming Spark

Page 30

2015 System

Human Error Rate 4%

Speech Recognition could reach human parity in the next 3 years


Page 34

Microsoft used deep learning to win first place in every ImageNet 2015 competition category

ImageNet classification top-5 error (%)

Year | Entry | Top-5 error (%)
ILSVRC 2010 | NEC America | 28.2
ILSVRC 2011 | Xerox | 25.8
ILSVRC 2012 | AlexNet | 16.4
ILSVRC 2013 | Clarifai | 11.7
ILSVRC 2014 | VGG | 7.3
ILSVRC 2014 | GoogLeNet | 6.7
ILSVRC 2015 | MSRA ResNet | 3.5

All 5 of Microsoft's entries took first place this year: ImageNet classification, ImageNet localization, ImageNet detection, COCO detection, and COCO segmentation

Page 35

CNTK at the Heart: Computational Networks

• A generalization of machine learning models that can be described as a series of computational steps.
  • E.g., DNN, CNN, RNN, LSTM, DSSM, Seq2Seq, log-linear model
• Representation:
  • A list of computational nodes denoted as n = {node name : operation name}
  • The parent-children relationship describing the operands: {n : c_1, ..., c_{K_n}}
    • K_n is the number of children of node n. For leaf nodes, K_n = 0.
    • Order of the children matters: e.g., XY is different from YX
    • Given the inputs (operands), the value of the node can be computed.
• Can flexibly describe deep learning models.
  • Adopted by many other popular tools as well
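
The representation above can be sketched with two plain Python dictionaries: one mapping node names to operation names, and one mapping each node to its ordered children. This is an illustration of the idea only, not CNTK code; the operation names ("Times", "Plus"), the "Input" placeholder for leaves, and the `evaluate` helper are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Add two matrices of the same shape element by element."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

# Hypothetical operation table for this sketch
OPERATIONS = {"Times": matmul, "Plus": add}

# n = {node name : operation name}; leaves carry the placeholder "Input"
nodes = {"X": "Input", "Y": "Input", "XY": "Times", "YX": "Times", "Sum": "Plus"}

# {n : c_1, ..., c_Kn}; child order matters, and leaves have no children (K_n = 0)
children = {"X": [], "Y": [], "XY": ["X", "Y"], "YX": ["Y", "X"], "Sum": ["XY", "YX"]}

def evaluate(name, inputs, cache=None):
    """Compute a node's value from its operands, evaluating children first."""
    cache = {} if cache is None else cache
    if name not in cache:
        if not children[name]:
            cache[name] = inputs[name]  # leaf node: value is supplied directly
        else:
            operands = [evaluate(c, inputs, cache) for c in children[name]]
            cache[name] = OPERATIONS[nodes[name]](*operands)
    return cache[name]

inputs = {"X": [[1, 2], [3, 4]], "Y": [[0, 1], [1, 0]]}
xy = evaluate("XY", inputs)
yx = evaluate("YX", inputs)
total = evaluate("Sum", inputs)
```

Here `xy` and `yx` differ even though both nodes use the same operation and the same two children, which is the point of the "order of the children matters" bullet.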


Page 37

"CNTK is production-ready: State-of-the-art accuracy, efficient, and scales to multi-GPU/multi-server."

[Chart: speed comparison (samples/second), higher = better; note: December 2015. Toolkits: CNTK, Theano, TensorFlow, Torch 7, Caffe. Configurations: 1 GPU, 1 x 4 GPUs, 2 x 4 GPUs (8 GPUs). Vertical axis from 0 to 80,000.]

Chart annotations:
• Theano only supports 1 GPU
• Achieved with the 1-bit gradient quantization algorithm

* TensorFlow added distributed compute support in April 2016
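
The slide names 1-bit gradient quantization without describing it. The general idea is that each gradient value is sent as a single bit, and the quantization error is kept locally and added to the next gradient (error feedback) so that information is delayed rather than lost. The sketch below illustrates that idea in plain Python. The zero threshold, the use of group means as reconstruction values, and the function names are assumptions made here; this is not CNTK's implementation.

```python
def one_bit_quantize(gradient, residual):
    """Quantize a gradient vector to one bit per element with error feedback.

    Returns (bits, low, high, new_residual). A 1 bit is reconstructed as
    `high` and a 0 bit as `low`, the means of the two groups of values.
    """
    # Add back the error left over from the previous quantization step
    corrected = [g + r for g, r in zip(gradient, residual)]
    bits = [1 if v >= 0.0 else 0 for v in corrected]
    positives = [v for v, b in zip(corrected, bits) if b]
    negatives = [v for v, b in zip(corrected, bits) if not b]
    high = sum(positives) / len(positives) if positives else 0.0
    low = sum(negatives) / len(negatives) if negatives else 0.0
    reconstructed = [high if b else low for b in bits]
    # Whatever the bits could not express is carried into the next step
    new_residual = [v - q for v, q in zip(corrected, reconstructed)]
    return bits, low, high, new_residual

def dequantize(bits, low, high):
    """Rebuild the approximate gradient on the receiving side."""
    return [high if b else low for b in bits]

residual = [0.0, 0.0, 0.0, 0.0]
bits, low, high, residual = one_bit_quantize([0.5, -1.0, 1.5, -3.0], residual)
sent = dequantize(bits, low, high)
```

Only `bits`, `low`, and `high` would need to cross the network in this sketch, which is where the communication savings for multi-GPU training would come from.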

Page 38

Microsoft researcher Slawek Smyl wins CIF 2016 by using an LSTM neural network

Powered by CNTK

Page 39

CIF Competition 2016 - Final Results

• Contestant 1 - Slawek Smyl (LSTM-based NN on deseasonalized data)
• Contestant 2 - Slawek Smyl (weighted average of his 3 methods)
• Contestant 3 - Prof. Sven Crone (Multilayer Perceptron with a thorough feature search)
• Contestant 4 - Mikhail Artyukhov (previous competition winner, ensemble models)
• Contestant 5 - Joerg Wichard, Bayer Healthcare AG (Adaptive Forecasting Strategy with Hybrid Ensemble Models)
• Contestant 6 - Slawek Smyl (LSTM-based NN)

Page 40

CNTK Demo

Page 41

CNTK Architecture

[Diagram. Components: CN (computational network), CNBuilder, Lambda (network description), IDataReader (task-specific reader), ILearner (SGD, AdaGrad, etc.), IExecutionEngine (CPU/GPU). Edge labels: Description, Use, Build, Load, Get data, Features & Labels, Evaluate, Compute Gradient.]

Page 42

(1) Kai Chen and Qiang Huo, "Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering", in International Conference on Acoustics, Speech and Signal Processing, March 2016, Shanghai, China.

Page 43

CNTK is a powerful tool that supports CPU/GPU and runs under Windows/Linux

CNTK is extensible thanks to its low-coupling modular design: adding new readers and new computation nodes is easy with a new reader design

The network definition language, macros, and model editing language (as well as Python and C++ bindings in the future) make network design and modification easy

Compared to other tools, CNTK strikes a good balance between efficiency, performance, and flexibility

Page 44

microsoft.com/cognitive

Page 45

Criterion | Mahout | Spark ML | Azure ML | R Server
Shared Service | No | No | Yes | No
Deployment Model | PaaS | PaaS | PaaS | IaaS
Extensibility | High | High | Medium | High
Deployment Complexity | Medium | High | Low | Medium
Cost | High | High | Low | High
Programming Languages | Java/Scala | Scala/Java/Python | Python/R | R
Algorithms | Limited (growing) | MLlib/scikit | Many (scikit/CRAN) | Many (CRAN)
Scalability | High | High | Medium | Medium

Page 48

xgboost Vowpal Wabbit

Rattle

CNTK

*Copy

Page 51

On-demand cloud; all kinds of data; rapid service deployment; data sharing and collaboration

Open; supports the complete data analysis workflow

Page 52

https://gallery.cortanaintelligence.com/

Page 53

The only vendor offering a complete solution from data ingestion through to action and data presentation

Page 55

R Open and Microsoft R Server components (diagram labels also include DeployR and DevelopR)

ConnectR
• High-speed & direct connectors
Available for:
• High-performance XDF
• SAS, SPSS, delimited & fixed format text data files
• Hadoop HDFS (text & XDF)
• Teradata Database & Aster
• EDWs and ADWs
• ODBC

ScaleR
• Ready-to-use high-performance big data big analytics
• Fully-parallelized analytics
• Data prep & data distillation
• Descriptive statistics & statistical tests
• Range of predictive functions
• User tools for distributing customized R algorithms across nodes

DistributedR
• Distributed computing framework
• Delivers cross-platform portability
Available on:
• Windows Servers
• Red Hat and SuSE Linux Servers
• Teradata Database
• Cloudera Hadoop
• Hortonworks Hadoop
• MapR Hadoop

R+CRAN
• Open source R interpreter
• R 3.2.2
• Freely-available huge range of R algorithms
• Algorithms callable by RevoR
• 100% compatible with existing R scripts, functions and packages

RevoR
• Performance enhanced R interpreter
• Based on open source R
• Adds high-performance math library to speed up linear algebra functions

Page 56

Data Step
• Data import: delimited, fixed, SAS, SPSS, ODBC
• Variable creation & transformation
• Recode variables
• Factor variables
• Missing value handling
• Sort, Merge, Split
• Aggregate by category (means, sums)

Descriptive Statistics
• Min / Max, Mean, Median (approx.)
• Quantiles (approx.)
• Standard Deviation
• Variance
• Correlation
• Covariance
• Sum of Squares (cross product matrix for set variables)
• Pairwise Cross tabs
• Risk Ratio & Odds Ratio
• Cross-Tabulation of Data (standard tables & long form)
• Marginal Summaries of Cross Tabulations

Statistical Tests
• Chi Square Test
• Kendall Rank Correlation
• Fisher's Exact Test
• Student's t-Test

Sampling
• Subsample (observations & variables)
• Random Sampling

Predictive Models
• Sum of Squares (cross product matrix for set variables)
• Multiple Linear Regression
• Generalized Linear Models (GLM): exponential family distributions (binomial, Gaussian, inverse Gaussian, Poisson, Tweedie); standard link functions (cauchit, identity, log, logit, probit); user defined distributions & link functions
• Covariance & Correlation Matrices
• Logistic Regression
• Classification & Regression Trees
• Predictions/scoring for models
• Residuals for all models

Variable Selection
• Stepwise Regression

Simulation
• Simulation (e.g. Monte Carlo)
• Parallel Random Number Generation

Cluster Analysis
• K-Means

Classification
• Decision Trees
• Decision Forests
• Gradient Boosted Decision Trees
• Naïve Bayes

Combination
• rxDataStep
• rxExec
• PEMA-R API Custom Algorithms
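
PEMA stands for parallel external memory algorithm: data is processed chunk by chunk, each chunk yields a small intermediate result, and the intermediate results are merged at the end. The Python sketch below shows that pattern for the mean and variance from the list above. It is an illustration only, not the ScaleR or PEMA-R implementation; the function names and the toy chunks are made up.

```python
def summarize_chunk(chunk):
    """Intermediate result for one chunk: (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2

def combine(a, b):
    """Merge two intermediate results without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

def chunked_mean_variance(chunks):
    """Mean and sample variance over an iterable of chunks, one chunk in memory at a time."""
    total = None
    for chunk in chunks:
        part = summarize_chunk(chunk)
        total = part if total is None else combine(total, part)
    n, mean, m2 = total
    return mean, m2 / (n - 1)

# Toy chunks standing in for blocks read from disk or held by different nodes
chunks = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0, 7.0, 8.0, 9.0]]
mean, variance = chunked_mean_variance(chunks)
```

Because `combine` depends only on the intermediate results, the chunks could equally be summarized on different workers and merged in any order.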

Page 57

Additional Resources

• CNTK: https://github.com/Microsoft/CNTK
  • Contains all the source code and example setups
  • You may understand better how CNTK works by reading the source code
  • New features are added constantly
• How to contact:
  • CNTK team: ask a question on CNTK GitHub!
  • Alexey:
    • Email: [email protected]
    • LinkedIn: https://www.linkedin.com/in/alexeykamenev