Transcript of PowerPoint Presentation
[Architecture diagram: Event production (Devices & Gateways, Applications) feeds Event Queuing & Stream Ingestion (Event Hubs, IoT Hubs, Blobs); Stream Analytics processes the stream, joined with Reference Data and Machine Learning; results flow to Storage & Batch Analysis (archiving for long-term storage/batch analytics) and Presentation & Action (real-time dashboards in Power BI, automation to kick off workflows).]
Programmer Productivity
• Declarative SQL-like language
• Built-in temporal semantics

Ease of Getting Started
• Integrations with sources, sinks, & ML
• Build real-time dashboards in minutes

Lowest Total Cost of Ownership (TCO)
• Fully managed service
• No cluster topology management required
• Seamless scalability
• Usage-based pricing
1,915 lines of code with Apache Storm
@ApplicationAnnotation(name="WordCountDemo")
public class Application implements StreamingApplication
{
  protected String fileName = "com/datatorrent/demos/wordcount/samplefile.txt";
  private Locality locality = null;

  @Override
  public void populateDAG(DAG dag, Configuration conf)
  {
    locality = Locality.CONTAINER_LOCAL;
    WordCountInputOperator input = dag.addOperator("wordinput", new WordCountInputOperator());
    input.setFileName(fileName);
    UniqueCounter<String> wordCount = dag.addOperator("count", new UniqueCounter<String>());
    dag.addStream("wordinput-count", input.outputPort, wordCount.data).setLocality(locality);
    ConsoleOutputOperator consoleOperator = dag.addOperator("console", new ConsoleOutputOperator());
    dag.addStream("count-console", wordCount.count, consoleOperator.input);
  }
}
3 lines of SQL in Azure Stream Analytics
SELECT Score, Avg(Purchase), Count(*)
FROM GameDataStream
GROUP BY TumblingWindow(minute, 5), Score
Data Manipulation: SELECT, FROM, WHERE, HAVING, GROUP BY, CASE WHEN THEN ELSE, INNER/LEFT OUTER JOIN, UNION, CROSS/OUTER APPLY, CAST, INTO, ORDER BY ASC, DESC
Scaling Extensions: WITH, PARTITION BY, OVER
Date and Time Functions: DateName, DatePart, Day, Month, Year, DateDiff, DateTimeFromParts, DateAdd
Windowing Extensions: TumblingWindow, HoppingWindow, SlidingWindow
Aggregate Functions: SUM, COUNT, AVG, MIN, MAX, STDEV, STDEVP, VAR, VARP, TopOne
String Functions: Len, Concat, CharIndex, Substring, Lower, Upper, PatIndex
Temporal Functions: Lag, IsFirst, Last, CollectTop
Mathematical Functions: ABS, CEILING, EXP, FLOOR, POWER, SIGN, SQUARE, SQRT
Geospatial Functions (preview): CreatePoint, CreatePolygon, CreateLineString, ST_DISTANCE, ST_WITHIN, ST_OVERLAPS, ST_INTERSECTS
[Summary diagram: Azure Stream Analytics at the center, surrounded by its four value propositions: Mission-critical reliability (enterprise-grade SLA); Lowest TCO (fully managed, no cluster provisioning, pay as you go); Programmer productivity (SQL-like query language); Ease of getting started (source/sink integrations).]
[Cortana Intelligence Suite diagram:
• Data Sources: Apps, Sensors and devices, Data
• Information Management: Event Hubs, Data Catalog, Data Factory
• Big Data Stores: SQL Data Warehouse, Data Lake Store
• Machine Learning and Analytics: HDInsight (Hadoop and Spark), Stream Analytics, Data Lake Analytics, Machine Learning
• Intelligence: Cortana, Bot Framework, Cognitive Services
• Dashboards & Visualizations: Power BI
• Apps: Web, Mobile, Bots
• Action: People, Automated Systems]
Data: example data, past experience
• Supervised Learning
• Unsupervised Learning (similar to density estimation in statistics)
Microsoft Hadoop Stack
• Analysis: Azure HDInsight, Machine Learning
• Storage: Local (HDFS) or Cloud (Azure Blob/Azure Data Lake Store)
• Classification
• Clustering
• Regression
• Collaborative Filtering
• Feature Extraction
• Statistics/Linear Algebra
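As a rough map (a sketch, not from the deck; class availability varies by Spark version), these task families correspond to pyspark.ml modules as follows:

# Where these task families live in Spark ML (a sketch; module names from pyspark.ml):
from pyspark.ml.classification import RandomForestClassifier  # Classification
from pyspark.ml.clustering import KMeans                      # Clustering
from pyspark.ml.regression import LinearRegression            # Regression
from pyspark.ml.recommendation import ALS                     # Collaborative Filtering
from pyspark.ml.feature import PCA, VectorAssembler           # Feature Extraction
from pyspark.ml.stat import Correlation                       # Statistics/Linear Algebra (Spark 2.2+)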
Walkthrough roadmap:
• Create the Spark session and the related DataFrame: SparkSession
• Read the data: spark.read.csv
• Assemble the raw data into a vector and transform it into labeledPoints: VectorAssembler
• Convert the label into an index for Spark classification: StringIndexer
• Split the data into train, test, and validation sets: randomSplit
• Create the random forest model: RandomForestClassifier
• Create the Spark ML pipeline: Pipeline
• Set up the parameter grid: ParamGridBuilder
• Set up cross-validation, train the model, and output the results: CrossValidator
SparkSession
• In Spark 2.0, becomes the main API for creating the compute context (replacing SparkContext)
spark = SparkSession \
    .builder \
    .appName("MNIST Classifier") \
    .config('spark.sql.warehouse.dir',
            'file:///random/path/as/we/need/to/config/this/but/dont/use/it') \
    .config('spark.executor.instances', 10) \
    .getOrCreate()

# fileNameTrain = 'adl://xiaoyzhusparkadls20.azuredatalakestore.net/train.csv'
fileNameTrain = 'C:\\Users\\xiaoyzhu\\Downloads\\train.csv'
# Read the data from the raw csv file.
# We use "inferSchema" as True since we want to import the data as integers;
# otherwise Spark will treat the columns as strings.
mnist_train = spark.read.csv(fileNameTrain, header=True, inferSchema=True)
mnist_train.show()
DataFrames
• Spark's new abstraction for data science
• Supports PB-scale "tables"
• Easier development experience (than the RDD-based Spark API) and less code
• Contains named columns
• Contains types (Scala primitives)
spark = SparkSession \
    .builder \
    .appName("MNIST Classifier") \
    .config('spark.sql.warehouse.dir',
            'file:///random/path/as/we/need/to/config/this/but/dont/use/it') \
    .config('spark.executor.instances', 10) \
    .getOrCreate()

# fileNameTrain = 'adl://xiaoyzhusparkadls20.azuredatalakestore.net/train.csv'
fileNameTrain = 'C:\\Users\\xiaoyzhu\\Downloads\\train.csv'
# Read the data from the raw csv file.
# We use "inferSchema" as True since we want to import the data as integers;
# otherwise Spark will treat the columns as strings.
mnist_train = spark.read.csv(fileNameTrain, header=True, inferSchema=True)
mnist_train.select("label").show()
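The "named columns" and "types" bullets above can be checked directly; printSchema is part of the standard DataFrame API:

# Inspect the inferred column names and types of the DataFrame:
mnist_train.printSchema()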
[Figure: data volume split into three sets: Training (50%): train your model over this dataset; Validation (25%): use this data to validate the model; Test (25%): assess the generalization of the model.]
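The roadmap performs this split with DataFrame.randomSplit. A minimal sketch matching the 50/25/25 figure above (the seed is illustrative; labeledPoints is the DataFrame produced by the VectorAssembler step, shown in detail later):

# Split into train/validation/test using the ratios from the figure.
# randomSplit takes a list of weights and an optional seed for reproducibility.
trainData, validationData, testData = labeledPoints.randomSplit([0.5, 0.25, 0.25], seed=42)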
Two important parameters of the random forest algorithm:
• The number of trees
• The depth of each decision tree
• So... how do we choose these two hyperparameters?
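For orientation, these two hyperparameters map to the numTrees and maxDepth arguments of RandomForestClassifier. A minimal sketch with hand-picked, purely illustrative values (the grid search below is how they are actually chosen here):

# Fixing the two hyperparameters by hand (values are illustrative only):
rfc_manual = RandomForestClassifier(labelCol="labelIndex", featuresCol="features",
                                    numTrees=10, maxDepth=5)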
PIPELINE: Transform → Estimate → Evaluate
# define the classifier here
rfc = RandomForestClassifier(labelCol="labelIndex", featuresCol="features", impurity='gini', maxBins=32)
pipeline = Pipeline(stages=[rfc])
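A pipeline can also be fit directly; a sketch (the walkthrough instead wraps it in a CrossValidator later, and trainData/testData come from the randomSplit step):

# Pipeline.fit() returns a PipelineModel, which is itself a Transformer.
model = pipeline.fit(trainData)
predictions = model.transform(testData)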
Transformer
• DataFrame -> new DataFrame
• Extraction of values into a feature vector
• Map from one column to another column
• Append an additional column
• Predict a value and append it
• Implements the transform() method
# total 784 features
FEATURE_NUM = 784
# assemble those features into a vector to consume in Spark
assembler = VectorAssembler(
    inputCols=["pixel{0}".format(i) for i in range(FEATURE_NUM)],
    outputCol="features")
# Transform pixel0, pixel1 ... pixel783 into one column named "features"
labeledPoints = assembler.transform(mnist_train).select("label", "features")
Estimator
• Implements a fit() method
• Takes a DataFrame as input
• Produces a Model as output
• The Model is a Transformer
• Predicts a value and appends it
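The roadmap's StringIndexer step is a compact example of this contract; a minimal sketch (assuming the labeledPoints DataFrame from the VectorAssembler step above):

# StringIndexer is an Estimator: fit() scans the label column and returns a
# StringIndexerModel (a Transformer) that appends the "labelIndex" column
# the classifier below expects.
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="label", outputCol="labelIndex")
labeledPoints = indexer.fit(labeledPoints).transform(labeledPoints)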
# define the classifier here
rfc = RandomForestClassifier(labelCol="labelIndex", featuresCol="features", impurity='gini', maxBins=32)
pipeline = Pipeline(stages=[rfc])

# define the param grid to search for the best hyper-parameters
paramGrid = ParamGridBuilder() \
    .addGrid(rfc.numTrees, range(3, 10)) \
    .addGrid(rfc.maxDepth, range(4, 10)) \
    .build()
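Note the size of this grid: range(3, 10) gives 7 candidate tree counts and range(4, 10) gives 6 candidate depths, so 42 parameter combinations; with the 3-fold cross-validation below, that means 126 model fits, which is why the walkthrough times the search.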
Evaluator
• Determines how closely your model fits the data
• Produces a score measuring the effectiveness of the model
• Precision, recall, F-measures
• Area under ROC
• MSE/RMSE
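Each metric family above has its own evaluator class in pyspark.ml.evaluation; a sketch of the three (the metric names are the library's own):

from pyspark.ml.evaluation import (BinaryClassificationEvaluator,
                                   MulticlassClassificationEvaluator,
                                   RegressionEvaluator)
auc = BinaryClassificationEvaluator(metricName="areaUnderROC")    # area under ROC
f1 = MulticlassClassificationEvaluator(metricName="f1")           # precision/recall/F-measures
rmse = RegressionEvaluator(metricName="rmse")                     # MSE/RMSE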
# Define the cross validator. We need a model to be validated, an evaluator
# used to score the model, and a param grid to be searched.
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=3)

# Run the cross-validation and choose the best set of parameters.
start_time = time.time()
# we can now fit the model using the above configuration
cvModel = crossval.fit(trainData)
print("grid search time --- %s seconds ---" % (time.time() - start_time))

# Then we can take the best model: either save it for later use,
# or apply it to the test dataset.
bestModel = cvModel.bestModel
# bestModel.save('bestModel')
# We can print out the best model to see more details.

testprediction = bestModel.transform(testData)
evaluator = MulticlassClassificationEvaluator(predictionCol="prediction", labelCol="labelIndex", metricName="f1")
print("F1 score: " + str(evaluator.evaluate(testprediction)))
https://github.com/zxzxy1988/MNISTAnalysis/blob/master/MNIST%20Classifier.py
[Figure: sample MNIST digit images with labels 5, 0, 4, 1]
Ingest Data → Feature Extraction → Train Model → Evaluate Model