Transcript of PowerPoint Presentation
[Architecture diagram: Event production (Devices & Gateways, Applications) feeds Event Queuing & Stream Ingestion (Event Hubs, IoT Hubs, Blobs); Stream Analytics processes the stream, joined with Reference Data and Machine Learning; results flow to Storage & Batch Analysis (archiving for long-term storage/batch analytics) and Presentation & Action (real-time dashboards in Power BI, automation to kick off workflows).]
Programmer Productivity
• Declarative SQL-like language
• Built-in temporal semantics

Ease of Getting Started
• Integrations with sources, sinks, & ML
• Build real-time dashboards in minutes

Lowest Total Cost of Ownership (TCO)
• Fully managed service
• No cluster topology management required
• Seamless scalability
• Usage-based pricing
1,915 lines of code with Apache Storm
@ApplicationAnnotation(name="WordCountDemo")
public class Application implements StreamingApplication
{
  protected String fileName = "com/datatorrent/demos/wordcount/samplefile.txt";
  private Locality locality = null;

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    locality = Locality.CONTAINER_LOCAL;
    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    input.setFileName(fileName);
    UniqueCounter<String> wordCount = dag.addOperator("count", new UniqueCounter<String>());
    dag.addStream("wordinput-count", input.outputPort, wordCount.data).setLocality(locality);
    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}
3 lines of SQL in Azure Stream Analytics
SELECT Score, Avg(Purchase), Count(*)
FROM GameDataStream
GROUP BY TumblingWindow(minute, 5), Score
Data Manipulation: SELECT, FROM, WHERE, HAVING, GROUP BY, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC, DESC
Scaling Extensions: WITH, PARTITION BY, OVER
Date and Time Functions: DateName, DatePart, Day, Month, Year, DateDiff, DateTimeFromParts, DateAdd
Windowing Extensions: TumblingWindow, HoppingWindow, SlidingWindow
Aggregate Functions: SUM, COUNT, AVG, MIN, MAX, STDEV, STDEVP, VAR, VARP, TopOne
String Functions: Len, Concat, CharIndex, Substring, Lower, Upper, PatIndex
Temporal Functions: Lag, IsFirst, Last, CollectTop
Mathematical Functions: ABS, CEILING, EXP, FLOOR, POWER, SIGN, SQUARE, SQRT
Geospatial Functions (preview): CreatePoint, CreatePolygon, CreateLineString, ST_DISTANCE, ST_WITHIN, ST_OVERLAPS, ST_INTERSECTS
[Summary diagram: Azure Stream Analytics at the center, surrounded by its four value propositions: Mission-critical reliability (enterprise-grade SLA); Lowest TCO (fully managed, no cluster provisioning, pay as you go); Programmer productivity (SQL-like query language); Ease of getting started (source/sink integrations).]
[Cortana Intelligence Suite diagram:
• Data Sources: Apps, Sensors and devices, Data
• Information Management: Event Hubs, Data Catalog, Data Factory
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Intelligence: Cortana, Bot Framework, Cognitive Services
• Dashboards & Visualizations: Power BI
• Apps: Web, Mobile, Bots
• Action: People, Automated Systems]
Data: example data, past experience
• Supervised Learning
• Unsupervised Learning (similar to density estimation in statistics)
Microsoft Hadoop Stack
• Analysis: Azure HDInsight, Machine Learning
• Storage: Local (HDFS) or Cloud (Azure Blob/Azure Data Lake Store)
• Classification
• Clustering
• Regression
• Collaborative Filtering
• Feature Extraction
• Statistics/Linear Algebra
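As a rough map (a sketch, not from the deck; class availability varies by Spark version), these task families correspond to pyspark.ml modules as follows:

# Where these task families live in Spark ML (a sketch; module names from pyspark.ml):
from pyspark.ml.classification import RandomForestClassifier  # Classification
from pyspark.ml.clustering import KMeans                      # Clustering
from pyspark.ml.regression import LinearRegression            # Regression
from pyspark.ml.recommendation import ALS                     # Collaborative Filtering
from pyspark.ml.feature import PCA, VectorAssembler           # Feature Extraction
from pyspark.ml.stat import Correlation                       # Statistics/Linear Algebra (Spark 2.2+)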
Walkthrough roadmap:
• Create the Spark session and the related DataFrame: SparkSession
• Read the data: spark.read.csv
• Assemble the raw data into a vector and transform it into labeledPoints: VectorAssembler
• Convert the label into an index for Spark classification: StringIndexer
• Split the data into train, test, and validation sets: randomSplit
• Create the random forest model: RandomForestClassifier
• Create the Spark ML pipeline: Pipeline
• Set up the parameter grid: ParamGridBuilder
• Set up cross-validation, train the model, and output the results: CrossValidator
SparkSession
• In Spark 2.0, becomes the main API for creating the compute context (replacing SparkContext)
spark = SparkSession \
    .builder \
    .appName("MNIST Classifier") \
    .config('spark.sql.warehouse.dir',
            'file:///random/path/as/we/need/to/config/this/but/dont/use/it') \
    .config('spark.executor.instances', 10) \
    .getOrCreate()

# fileNameTrain = 'adl://xiaoyzhusparkadls20.azuredatalakestore.net/train.csv'
fileNameTrain = 'C:\\Users\\xiaoyzhu\\Downloads\\train.csv'
# Read the data from the raw csv file.
# We use "inferSchema" as True since we want to import the data as integers;
# otherwise Spark will treat the columns as strings.
mnist_train = spark.read.csv(fileNameTrain, header=True, inferSchema=True)
mnist_train.show()
DataFrames
• Spark's new abstraction for data science
• Supports PB-scale "tables"
• Easier development experience (than the RDD-based Spark API) and less code
• Contains named columns
• Contains types (Scala primitives)
spark = SparkSession \
    .builder \
    .appName("MNIST Classifier") \
    .config('spark.sql.warehouse.dir',
            'file:///random/path/as/we/need/to/config/this/but/dont/use/it') \
    .config('spark.executor.instances', 10) \
    .getOrCreate()

# fileNameTrain = 'adl://xiaoyzhusparkadls20.azuredatalakestore.net/train.csv'
fileNameTrain = 'C:\\Users\\xiaoyzhu\\Downloads\\train.csv'
# Read the data from the raw csv file.
# We use "inferSchema" as True since we want to import the data as integers;
# otherwise Spark will treat the columns as strings.
mnist_train = spark.read.csv(fileNameTrain, header=True, inferSchema=True)
mnist_train.select("label").show()
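The "named columns" and "types" bullets above can be checked directly; printSchema is part of the standard DataFrame API:

# Inspect the inferred column names and types of the DataFrame:
mnist_train.printSchema()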
[Figure: data volume split into three sets: Training (50%): train your model over this dataset; Validation (25%): use this data to validate the model; Test (25%): assess the generalization of the model.]
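The roadmap performs this split with DataFrame.randomSplit. A minimal sketch matching the 50/25/25 figure above (the seed is illustrative; labeledPoints is the DataFrame produced by the VectorAssembler step, shown in detail later):

# Split into train/validation/test using the ratios from the figure.
# randomSplit takes a list of weights and an optional seed for reproducibility.
trainData, validationData, testData = labeledPoints.randomSplit([0.5, 0.25, 0.25], seed=42)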
Two important parameters of the random forest algorithm:
• The number of trees
• The depth of each decision tree
• So... how do we choose these two hyperparameters?
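For orientation, these two hyperparameters map to the numTrees and maxDepth arguments of RandomForestClassifier. A minimal sketch with hand-picked, purely illustrative values (the grid search below is how they are actually chosen here):

# Fixing the two hyperparameters by hand (values are illustrative only):
rfc_manual = RandomForestClassifier(labelCol="labelIndex", featuresCol="features",
                                    numTrees=10, maxDepth=5)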
PIPELINE: Transform → Estimate → Evaluate
# define the classifier here
rfc = RandomForestClassifier(labelCol="labelIndex", featuresCol="features", impurity='gini', maxBins=32)
pipeline = Pipeline(stages=[rfc])
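A pipeline can also be fit directly; a sketch (the walkthrough instead wraps it in a CrossValidator later, and trainData/testData come from the randomSplit step):

# Pipeline.fit() returns a PipelineModel, which is itself a Transformer.
model = pipeline.fit(trainData)
predictions = model.transform(testData)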
Transformer
• DataFrame -> new DataFrame
• Extraction of values into a feature vector
• Map from one column to another column
• Append an additional column
• Predict a value and append it
• Implements the transform() method
# total 784 features
FEATURE_NUM = 784
# assemble those features into a vector to consume in Spark
assembler = VectorAssembler(
    inputCols=["pixel{0}".format(i) for i in range(FEATURE_NUM)],
    outputCol="features")
# Transform pixel0, pixel1 ... pixel783 into one column named "features"
labeledPoints = assembler.transform(mnist_train).select("label", "features")
Estimator
• Implements a fit() method
• Takes a DataFrame as input
• Produces a Model as output
• The Model is a Transformer
• Predicts a value and appends it
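The roadmap's StringIndexer step is a compact example of this contract; a minimal sketch (assuming the labeledPoints DataFrame from the VectorAssembler step above):

# StringIndexer is an Estimator: fit() scans the label column and returns a
# StringIndexerModel (a Transformer) that appends the "labelIndex" column
# the classifier below expects.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
labeledPoints = indexer.fit(labeledPoints).transform(labeledPoints)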
# define the classifier here
rfc = RandomForestClassifier(labelCol="labelIndex", featuresCol="features", impurity='gini', maxBins=32)
pipeline = Pipeline(stages=[rfc])

# define the param grid to search for the best hyper-parameters
paramGrid = ParamGridBuilder() \
    .addGrid(rfc.numTrees, range(3, 10)) \
    .addGrid(rfc.maxDepth, range(4, 10)) \
    .build()
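Note the size of this grid: range(3, 10) gives 7 candidate tree counts and range(4, 10) gives 6 candidate depths, so 42 parameter combinations; with the 3-fold cross-validation below, that means 126 model fits, which is why the walkthrough times the search.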
Evaluator
• Determines how closely your model fits the data
• Produces a score measuring the effectiveness of the model
• Precision, recall, F-measures
• Area under ROC
• MSE/RMSE
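Each metric family above has its own evaluator class in pyspark.ml.evaluation; a sketch of the three (the metric names are the library's own):

from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   RegressionEvaluator)
auc = BinaryClassificationEvaluator(metricName="areaUnderROC")    # area under ROC
f1 = MulticlassClassificationEvaluator(metricName="f1")           # precision/recall/F-measures
rmse = RegressionEvaluator(metricName="rmse")                     # MSE/RMSE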
# Define the cross validator. We need a model to be validated, an evaluator
# used to score the model, and a param grid to be searched.
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)

# Run the cross-validation and choose the best set of parameters.
start_time = time.time()
# we can now fit the model using the above configuration
cvModel = crossval.fit(trainData)
print("grid search time --- %s seconds ---" % (time.time() - start_time))

# Then we can take the best model: either save it for later use,
# or apply it to the test dataset.
bestModel = cvModel.bestModel
# bestModel.save('bestModel')
# We can print out the best model to see more details.

testprediction = bestModel.transform(testData)
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="labelIndex", metricName="f1")
print("F1 score: " + str(evaluator.evaluate(testprediction)))
https://github.com/zxzxy1988/MNISTAnalysis/blob/master/MNIST%20Classifier.py
[Figure: sample MNIST digit images with labels 5, 0, 4, 1]
Ingest Data → Feature Extraction → Train Model → Evaluate Model