PowerPoint Presentationdownload.microsoft.com/download/B/D/A/BDA7513D-F0A9-487A-83… · GROUP BY...


[Diagram: end-to-end streaming pipeline]

• Event production: Applications; Devices & Gateways
• Event Queuing & Stream Ingestion: Event Hubs; IoT Hubs
• Stream Analytics: Stream Analytics; Machine Learning; Reference Data (Blobs)
• Storage & Batch Analysis: Archiving for long term storage / batch analytics
• Presentation & Action: Real-time dashboard (PowerBI); Automation to kick-off workflows


Programmer Productivity
• Declarative SQL like language
• Built-in temporal semantics

Ease of Getting Started
• Integrations with sources, sinks, & ML
• Build real-time dashboards in minutes

Lowest Total Cost of Ownership (TCO)
• Fully managed service
• No cluster topology management required
• Seamless scalability
• Usage based pricing

1,915 lines of code with Apache Storm

@ApplicationAnnotation(name="WordCountDemo")
public class Application implements StreamingApplication
{
  protected String fileName = "com/datatorrent/demos/wordcount/samplefile.txt";
  private Locality locality = null;

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    locality = Locality.CONTAINER_LOCAL;
    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    input.setFileName(fileName);
    UniqueCounter<String> wordCount = dag.addOperator("count", new UniqueCounter<String>());
    dag.addStream("wordinput-count", input.outputPort, wordCount.data).setLocality(locality);
    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}

3 lines of SQL in Azure Stream Analytics

SELECT Avg(Purchase), Score, Count(*)
FROM GameDataStream
GROUP BY TumblingWindow(minute, 5), Score
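
To make the windowing semantics concrete, here is a minimal pure-Python sketch of what a 5-minute tumbling-window aggregation computes: each event falls into exactly one fixed, non-overlapping window, and the aggregates are computed per (window, Score) group. The event tuples and the tumbling_window_avg helper are hypothetical and only for illustration; this is not how Azure Stream Analytics is implemented, and the service's exact window-boundary rules may differ from the simple floor used here.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # 5-minute tumbling window


def tumbling_window_avg(events):
    """Group (timestamp_seconds, score, purchase) events by window and score.

    Returns {(window_start, score): (avg_purchase, count)}.
    """
    groups = defaultdict(list)
    for ts, score, purchase in events:
        # Each event belongs to exactly one window: floor to a multiple of the window size.
        window_start = (ts // WINDOW_SECONDS) * WINDOW_SECONDS
        groups[(window_start, score)].append(purchase)
    return {key: (sum(vals) / len(vals), len(vals)) for key, vals in groups.items()}


# Hypothetical sample events: (seconds since start, Score, Purchase)
sample_events = [(10, "A", 4.0), (200, "A", 6.0), (299, "B", 1.0), (300, "A", 10.0)]
result = tumbling_window_avg(sample_events)
```

The event at second 300 starts a new window, so it is aggregated separately from the first two "A" events.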


Data Manipulation: SELECT, FROM, WHERE, HAVING, GROUP BY, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST INTO, ORDER BY ASC, DESC

Scaling Extensions: WITH, PARTITION BY, OVER

Date and Time Functions: DateName, DatePart, Day, Month, Year, DateDiff, DateTimeFromParts, DateAdd

Windowing Extensions: TumblingWindow, HoppingWindow, SlidingWindow

Aggregate Functions: SUM, COUNT, AVG, MIN, MAX, STDEV, STDEVP, VAR, VARP, TopOne

String Functions: Len, Concat, CharIndex, Substring, Lower, Upper, PatIndex

Temporal Functions: Lag, IsFirst, Last, CollectTop

Mathematical Functions: ABS, CEILING, EXP, FLOOR, POWER, SIGN, SQUARE, SQRT

Geospatial Functions (preview): CreatePoint, CreatePolygon, CreateLineString, ST_DISTANCE, ST_WITHIN, ST_OVERLAPS, ST_INTERSECTS
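
The windowing extensions differ mainly in how events are assigned to windows. As a simplified model (an assumption for illustration, not the service's implementation), a hopping window of a given size that advances every hop can contain the same event several times, while a tumbling window is the special case where the hop equals the size. The hopping_window_starts helper below is hypothetical.

```python
def hopping_window_starts(ts, size, hop):
    """Return start times of all windows [start, start + size) containing ts.

    Windows start at multiples of `hop`; a tumbling window is the case hop == size.
    """
    starts = []
    # Latest window start at or before ts.
    start = (ts // hop) * hop
    while start > ts - size and start >= 0:
        starts.append(start)
        start -= hop
    return sorted(starts)


# A 10-second window hopping every 5 seconds: this event falls in two windows.
hopping = hopping_window_starts(12, size=10, hop=5)
# With hop == size the windows do not overlap (tumbling behaviour).
tumbling = hopping_window_starts(12, size=10, hop=10)
```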


Azure Stream Analytics

• Programmer Productivity: SQL like query language
• Ease of getting started: Source/sink integrations
• Lowest TCO, Fully managed: No cluster provisioning; Pay as you go
• Mission critical reliability: Enterprise grade SLA


[Diagram: from data to action]

• Data Sources: Apps; Sensors and devices; Data
• Information Management: Event Hubs; Data Catalog; Data Factory
• Big Data Stores: SQL Data Warehouse; Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark); Stream Analytics; Data Lake Analytics; Machine Learning
• Intelligence: Cortana; Bot Framework; Cognitive Services
• Dashboards & Visualizations: Power BI
• Action: People; Automated Systems; Apps (Web, Mobile, Bots)


Learning from example data or past experience:

• Supervised Learning

• Unsupervised Learning (similar to Density Estimates in Statistics)


Microsoft Hadoop Stack

• Analytics: Azure HDInsight; Machine Learning
• Storage: Local (HDFS) or Cloud (Azure Blob/Azure Data Lake Store)


• Classification

• Clustering

• Regression

• Collaborative Filtering

• Feature Extraction

• Statistics/Linear Algebra


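[Diagram: steps of the Spark ML workflow, with the API used at each step]

• Create the Spark session and the related DataFrame: SparkSession
• Read the data: spark.read.csv
• Assemble the raw data into a Vector and transform it into labeledPoints: VectorAssembler
• Convert the label into an index so that Spark can classify it: StringIndexer
• Split the data into train, test, and validation sets: randomSplit
• Create the RandomForest model: RandomForestClassifier
• Create the Spark ML Pipeline: Pipeline
• Set up the parameter grid: ParamGridBuilder
• Set up cross-validation, train the model, and output the results: CrossValidator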


• SparkSession
• In Spark 2.0, it became the main API for creating the compute context (replacing SparkContext)


from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("MNIST Classifier") \
    .config('spark.sql.warehouse.dir',
            'file:///random/path/as/we/need/to/config/this/but/dont/use/it') \
    .config('spark.executor.instances', 10) \
    .getOrCreate()

# fileNameTrain = 'adl://xiaoyzhusparkadls20.azuredatalakestore.net/train.csv'
fileNameTrain = 'C:\\Users\\xiaoyzhu\\Downloads\\train.csv'
# Read the data from the raw CSV file.
# We set inferSchema=True because we want to import the data as integers;
# otherwise Spark will treat the columns as strings.
mnist_train = spark.read.csv(fileNameTrain, header=True, inferSchema=True)
mnist_train.show()


• DataFrames
• Spark's new abstraction for data science

• Supports PB-scale "tables"

• Easier development experience (than RDD based Spark API) and less code

• Contains named columns

• Contains types (Scala primitives)


# Same session setup and CSV read as on the previous slide; here only the "label" column is shown.
mnist_train.select("label").show()


DATA VOLUME

• Training (50%): train your model over this dataset
• Validation (25%): use this data to validate the model
• Test (25%): assess the generalization of the model
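
A 50/25/25 split like this can be sketched in plain Python as follows. This is only an illustration of the idea of assigning each row at random according to the weights, which is also why the resulting sizes are approximate rather than exact; the random_split helper and the seed are hypothetical. In PySpark the corresponding call on a DataFrame is randomSplit.

```python
import random


def random_split(rows, weights, seed=42):
    """Randomly partition rows into len(weights) lists, roughly proportional to weights."""
    total = float(sum(weights))
    # Cumulative boundaries in [0, 1], e.g. [0.5, 0.75, 1.0] for weights 50/25/25.
    bounds = []
    acc = 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    rng = random.Random(seed)
    parts = [[] for _ in weights]
    for row in rows:
        r = rng.random()
        for i, b in enumerate(bounds):
            # The last bucket also catches any floating-point rounding at the top end.
            if r < b or i == len(bounds) - 1:
                parts[i].append(row)
                break
    return parts


train, validation, test = random_split(range(1000), [0.5, 0.25, 0.25])
```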


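Two important parameters of the random forest algorithm

• Number of trees
• Depth of the decision trees

• So... how do we choose these two hyperparameters?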


PIPELINE: Transform -> Estimate -> Evaluate


from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier

# define the classifier here
rfc = RandomForestClassifier(labelCol="labelIndex", featuresCol="features", impurity='gini', maxBins=32)

pipeline = Pipeline(stages=[rfc])


• Transformer: DataFrame -> new DataFrame

• Extraction of values into feature vector

• Map from one column to another column

• Append an additional column

• Predict a value and append value

• Implements transform() method
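
The transformer contract in the bullets above can be sketched without Spark. In this toy version a "DataFrame" is just a list of dicts, and the AssembleFeatures class is a hypothetical stand-in that only mirrors the idea of combining columns into one feature vector; it is not Spark's VectorAssembler.

```python
class AssembleFeatures:
    """Toy transformer: combines several input columns into one list-valued column."""

    def __init__(self, input_cols, output_col):
        self.input_cols = input_cols
        self.output_col = output_col

    def transform(self, rows):
        # Return new rows; the input "DataFrame" (a list of dicts here) is left unchanged.
        return [
            {**row, self.output_col: [row[c] for c in self.input_cols]}
            for row in rows
        ]


rows = [{"label": 5, "pixel0": 0, "pixel1": 255}]
assembled = AssembleFeatures(["pixel0", "pixel1"], "features").transform(rows)
```

The key design point is that transform() appends a column and returns a new dataset instead of mutating its input.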


from pyspark.ml.feature import VectorAssembler

# total 784 features
FEATURE_NUM = 784
# assemble those features into a vector for Spark to consume
assembler = VectorAssembler(
    inputCols=["pixel{0}".format(i) for i in range(FEATURE_NUM)],
    outputCol="features")

# Transform pixel0, pixel1 ... pixel783 into one column named "features"
labeledPoints = assembler.transform(mnist_train).select("label", "features")


• Estimator: implements a method fit()

• Takes in a DataFrame as input

• Produces a Model as Output

• Model is a Transformer

• Predict a value and append value
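
The estimator contract can be sketched the same way: fit() takes data in and returns a model, and that model is itself a transformer. The LabelIndexer classes below are hypothetical and only loosely modeled on the idea of indexing labels by frequency; they are not Spark's StringIndexer, and rows are plain dicts.

```python
from collections import Counter


class LabelIndexerModel:
    """Fitted model: a transformer that appends the learned index for each label."""

    def __init__(self, mapping, input_col, output_col):
        self.mapping = mapping
        self.input_col = input_col
        self.output_col = output_col

    def transform(self, rows):
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class LabelIndexer:
    """Toy estimator: fit() learns a label -> index mapping from the data."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # Most frequent label gets index 0; ties are broken by label order.
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return LabelIndexerModel(mapping, self.input_col, self.output_col)


data = [{"label": "7"}, {"label": "3"}, {"label": "7"}]
model = LabelIndexer("label", "labelIndex").fit(data)
indexed = model.transform(data)
```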


from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import ParamGridBuilder

# define the classifier here
rfc = RandomForestClassifier(labelCol="labelIndex", featuresCol="features", impurity='gini', maxBins=32)

pipeline = Pipeline(stages=[rfc])

# define the param grid to search for the best hyper-parameters
paramGrid = ParamGridBuilder() \
    .addGrid(rfc.numTrees, range(3, 10)) \
    .addGrid(rfc.maxDepth, range(4, 10)) \
    .build()
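
To get a feel for the cost of this grid search, the sketch below enumerates the same two value ranges in plain Python. It assumes the 3-fold cross-validation used later in this deck; the variable names are illustrative, and the count ignores any final refit of the best parameters on the full training data.

```python
from itertools import product

num_trees_values = range(3, 10)   # 3..9, as in the grid above
max_depth_values = range(4, 10)   # 4..9

# Every combination of the two hyperparameters is one candidate model.
param_grid = [
    {"numTrees": n, "maxDepth": d}
    for n, d in product(num_trees_values, max_depth_values)
]

num_folds = 3  # each candidate is trained once per fold
total_fits = len(param_grid) * num_folds
```

With 7 values for the number of trees and 6 for the depth, the grid holds 42 candidates, so widening either range quickly multiplies the training time.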


• Evaluator: determine how closely your model fits the data

• Get a score to determine the effectiveness of the model

• Precision, recall, F-Measures

• Area Under ROC

• MSE/RMSE
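
As a small worked example of the first group of metrics, the sketch below computes precision, recall, and F1 for a single class from lists of true and predicted labels. The precision_recall_f1 helper and the sample data are hypothetical; a multiclass evaluator would typically combine such per-class scores, for example by weighting them.

```python
def precision_recall_f1(labels, predictions, positive):
    """Compute precision, recall and F1 for one class treated as the positive class."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, predictions) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, predictions) if y == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


labels = [1, 1, 1, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions, positive=1)
```

Here there are 2 true positives, 1 false positive, and 1 false negative, so precision, recall, and F1 all equal 2/3.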


import time

from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator

# Define the cross validator. It needs an estimator to validate, an evaluator
# used to score each candidate, and a param grid to search.
# The evaluator must read the same label column the classifier was trained on.
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(labelCol="labelIndex"),
                          numFolds=3)

# Run cross-validation and choose the best set of parameters.
start_time = time.time()
# we can now fit the model using the above configuration
cvModel = crossval.fit(trainData)
print("grid search time --- %s seconds ---" % (time.time() - start_time))

# Then we can take the best model and either save it for later use
# or use it to predict on the test dataset.
bestModel = cvModel.bestModel
# bestModel.save('bestModel')
# We can print out the best model to see more details.

testprediction = bestModel.transform(testData)
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="labelIndex", metricName="f1")
print("F1 score: " + str(evaluator.evaluate(testprediction)))
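
What the 3-fold setting means can be sketched in plain Python: the training rows are divided into folds, and each fold serves once as the validation set while the remaining folds are used for training. The k_fold_indices helper is hypothetical, and its deterministic round-robin assignment is a simplification; a real cross validator would normally assign rows to folds at random.

```python
def k_fold_indices(n_rows, num_folds):
    """Yield (train_indices, validation_indices) for each of num_folds folds."""
    indices = list(range(n_rows))
    # Fold i takes every num_folds-th row starting at i, so fold sizes differ by at most 1.
    folds = [indices[i::num_folds] for i in range(num_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(10, 3))
```

Every row appears in exactly one validation fold, so each candidate parameter set is scored on all of the training data across the folds.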


https://github.com/zxzxy1988/MNISTAnalysis/blob/master/MNIST%20Classifier.py


[Figure: sample handwritten digits labeled 5, 0, 4, 1]


Ingest Data -> Feature Extraction -> Train Model -> Evaluate Model
