NightClazz Spark - Machine Learning / Introduction to Spark and Zeppelin
Spark & Zeppelin
Introduction
#NightClazz Spark & ML
10/03/16
Fabrice Sznajderman
Agenda
● Apache Spark
● Apache Zeppelin
Introduction
Who am I?
● Fabrice Sznajderman
  ○ Java/Scala/Web developer
  ○ Java/Scala trainer
● BrownBagLunch.fr
Spark
Introduction
Big picture
Spark introduction
What is it about?
● A cluster computing framework
● Open source
● Written in Scala
History
2009 : Project started at UC Berkeley's AMPLab research lab
2010 : Project open-sourced
2013 : Becomes an Apache project; the Databricks company is founded
2014 : Becomes a top-level Apache project and the most active project in the Apache Foundation (500+ contributors)
2014 : Release of Spark 1.0, 1.1 and 1.2
2015 : Release of Spark 1.3, 1.4, 1.5 and 1.6
2015 : IBM, SAP and others invest in Spark
2015 : 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam
2016 : new Spark Summit in San Francisco in June 2016
Multi-language
● Scala
● Java
● Python
● R
Spark Shell
● REPL
● Learn the API
● Interactive analysis (example below)
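As a hedged illustration of such an interactive session, the lines below could be typed into spark-shell, which pre-creates a SparkContext named sc; the data is only an example.

// Typed into spark-shell; `sc` is provided by the shell
val even = sc.parallelize(1 to 100).filter(_ % 2 == 0)
even.count()   // the REPL evaluates and prints the result (here: 50)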
RDD
Core concept
Definition
● Resilient
● Distributed
● Datasets
Properties
● Immutable
● Serializable
● Can be persisted in RAM and/or on disk
● Simple or complex types
Used as a collection
● DSL
● Monadic type
● Several operators
  ○ map, filter, count, distinct, flatMap, ...
  ○ join, groupBy, union, ...
Created from
● A collection (List, Set)
● Various file formats
  ○ json, text, Hadoop SequenceFile, ...
● Various databases
  ○ JDBC, Cassandra, ...
● Other RDDs
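A minimal sketch of these creation paths, assuming an existing SparkContext `sc`; the file name is hypothetical.

val fromCollection = sc.parallelize(List(1, 2, 3, 4))   // from a Scala collection
val fromFile = sc.textFile("events.txt")                // from a text file
val fromOtherRdd = fromCollection.map(_ * 2)            // from another RDD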
Sample
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)

val rdd = sc.textFile("data.csv")
val nb = rdd.map(s => s.length).filter(i => i > 10).count()
Lazy evaluation
● Intermediate operators
  ○ map, filter, distinct, flatMap, ...
● Final operators
  ○ count, mean, fold, first, ...

val nb = rdd.map(s => s.length).filter(i => i > 10).count()
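To make the laziness explicit, here is a sketch built on the `rdd` from the earlier sample: the intermediate operators only describe the computation, and nothing runs until a final operator is called.

// Intermediate operators: only the lineage graph is built, no job runs yet
val longLines = rdd.map(s => s.length).filter(i => i > 10)
// Final operator: count() triggers the actual computation
val nb = longLines.count()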
Caching
● Reuse an intermediate result
● cache operator
● Avoids re-computing

val r = rdd.map(s => s.length).cache()
val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()
Distributed architecture
Core concept
Run locally
val master = "local"      // run with a single worker thread
val master = "local[*]"   // one worker thread per available core
val master = "local[4]"   // four worker threads
val conf = new SparkConf().setAppName("sample")
.setMaster(master)
val sc = new SparkContext(conf)
Run on cluster
val master = "spark://..."
val conf = new SparkConf().setAppName("sample")
.setMaster(master)
val sc = new SparkContext(conf)
Standalone Cluster
[Diagram: Spark clients connect to a Spark Master, which schedules executors (E) on the Spark Slave nodes]
Modules
Core concept
Composed of
[Diagram: Spark Core at the base, with Spark SQL, Spark Streaming, MLlib (ML Pipelines / DataFrames) and GraphX on top, reading from several data sources]
http://prog3.com/article/2015-06-18/2824958
Spark SQL
● Structured data processing
● SQL language (example below)
● DataFrame
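A minimal sketch of the SQL side, assuming a SQLContext `sqlContext` and an existing DataFrame `df`; the table and column names are hypothetical.

df.registerTempTable("people")   // expose the DataFrame to SQL (Spark 1.x API)
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age > 21")
adults.show()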
DataFrame 1/3
● A distributed collection of rows organized into named columns
● An abstraction for selecting, filtering, aggregating and plotting structured data
● Provides a schema
● Not an RDD replacement
What?
DataFrame 1/3
● RDDs are more efficient than what came before (Hadoop)
● But RDDs are still too complicated for common tasks
● DataFrames are simpler and faster
Why?
DataFrame 2/3
Optimized
DataFrame 3/3
● Available since Spark 1.3
● The DataFrame API is just an interface
  ○ The implementation is done once, in the Spark engine
  ○ All languages benefit from the optimizations without rewriting anything
How?
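As a hedged example of the DataFrame API on the Scala side, assuming the SparkContext `sc` from earlier; the file and column names are hypothetical.

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")   // schema is inferred from the JSON
people.printSchema()
people.filter(people("age") > 21).groupBy("age").count().show()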
Spark Streaming
● Framework on top of the RDD and DataFrame APIs
● Real-time data processing
● The RDD becomes a DStream here
● Same as before, but the dataset is not static (sketch below)
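A minimal streaming word-count sketch, assuming the SparkConf `conf` from earlier with a master of at least local[2] (the socket receiver needs its own thread); the host and port are hypothetical.

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(conf, Seconds(10))      // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)    // DStream of incoming lines
val counts = lines.flatMap(_.split(" ")).map(w => (w, 1)).reduceByKey(_ + _)
counts.print()
ssc.start()
ssc.awaitTermination()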
Spark Streaming
Internal flow
http://spark.apache.org/docs/latest/img/streaming-flow.png
Spark Streaming
Inputs / Outputs
http://spark.apache.org/docs/latest/img/streaming-arch.png
Spark MLlib
● Makes practical machine learning scalable and easy
● Provides common learning algorithms & utilities
Spark MLlib
● Divided into 2 packages
  ○ spark.mllib
  ○ spark.ml
Spark MLlib
● Original API, based on RDDs
● Each model has its own interface
spark.mllib
Spark MLlib
spark.mllib
// spark.mllib: random forest
val sc = ... // init SparkContext
val Array(trainingData, checkData) =
  sc.textFile("train.csv") /*transform*/ .randomSplit(Array(0.98, 0.02))
val model = RandomForest.trainClassifier(
  trainingData, 10, Map[Int, Int](), 30, "auto", "gini", 7, 100, 0)
val prediction = model.predict(...)

// spark.mllib: logistic regression
val Array(trainingData, checkData) =
  sc.textFile("train.csv") /*transform*/ .randomSplit(Array(0.98, 0.02))
val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)
val prediction = model.predict(...)
Each model exposes its own interface
Spark MLlib
● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
  ○ Transformer
  ○ Estimator
  ○ Pipeline
spark.ml
Spark MLlib
spark.ml
● Transformer : transform(DF)
  ○ maps a DataFrame by adding a new column
  ○ e.g. predicts the label and adds the result in a new column
● Estimator : fit(DF)
  ○ learning algorithm
  ○ produces a model from a DataFrame
Spark MLlib
spark.ml
● Pipeline
  ○ a sequence of stages (Transformers or Estimators)
  ○ executed in a specific order
Spark MLlib
spark.ml
val training: DataFrame = ???
val test: DataFrame = ???

val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)

// training the model
val model1 = lr.fit(training)

// prediction on the test data
model1.transform(test)
Spark MLlib
spark.ml
val training: DataFrame = ???
val test: DataFrame = ???

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new RandomForestClassifier() /*.add parameter*/

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
val training: DataFrame = ???
val test: DataFrame = ???

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new LogisticRegression() /*.add parameter*/

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)
model.transform(test)
Different models
Same way of creating the pipeline
Zeppelin
Introduction
Big picture
Zeppelin introduction
What is it about?
● “A web-based notebook that enables interactive data analytics”
● 100% open source
● Undergoing incubation at Apache
Multi-purpose
● Data Ingestion
● Data Discovery
● Data Analytics
● Data Visualization & Collaboration
Multiple language backends
● Scala
● shell
● python
● markdown
● your own language, by creating your own interpreter (see the sketch below)
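As a hedged sketch of a Zeppelin note paragraph, the %spark interpreter runs Scala code against the SparkContext and SQLContext that Zeppelin provides; the file path and column name are hypothetical.

%spark
// Runs as Scala inside a Zeppelin paragraph; `sqlContext` is provided by Zeppelin
val events = sqlContext.read.json("/tmp/events.json")
events.groupBy("type").count().show()   // results can be rendered as a table or chart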
Data visualization
An easy way to build graphs from data
Demo
Thank you