NightClazz Spark - Machine Learning / Introduction to Spark and Zeppelin

Spark & Zeppelin Introduction #NightClazz Spark & ML 10/03/16 Fabrice Sznajderman

Spark & Zeppelin

Introduction

#NightClazz Spark & ML

10/03/16

Fabrice Sznajderman

Agenda

● Apache Spark
● Apache Zeppelin

Introduction

Who am I?

● Fabrice Sznajderman
  ○ Java/Scala/Web developer
    ■ Java/Scala trainer
● BrownBagLunch.fr

Spark

Introduction

Big picture

Spark introduction

What is it about?

● A cluster computing framework
● Open source
● Written in Scala

History

2009: Project started at UC Berkeley's AMPLab

2010: Project open-sourced

2013: Becomes an Apache project; creation of the Databricks company

2014: Becomes a top-level Apache project and the most active project in the Apache Software Foundation (500+ contributors)

2014: Release of Spark 1.0, 1.1 and 1.2

2015: Release of Spark 1.3, 1.4, 1.5 and 1.6

2015: IBM, SAP… invest in Spark

2015: 2,000 registrations at Spark Summit SF, 1,000 at Spark Summit Amsterdam

2016: Next Spark Summit in San Francisco in June 2016

Multi-language

● Scala
● Java
● Python
● R

Spark Shell

● REPL
● Learn the API
● Interactive analysis

RDD

Core concept

Definition

● Resilient
● Distributed
● Dataset

Properties

● Immutable
● Serializable
● Can be persisted in RAM and/or on disk
● Simple or complex types

Use as a collection

● DSL
● Monadic type
● Several operators
  ○ map, filter, count, distinct, flatMap, ...
  ○ join, groupBy, union, ...
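
The operators listed above can be tried on a small in-memory RDD. What follows is a minimal sketch, assuming Spark is on the classpath and a local master is acceptable; the sample data and all value names are made up for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("operators-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Hypothetical sample data: a few sentences and two key/value datasets
val lines = sc.parallelize(Seq("spark is fast", "zeppelin is a notebook", "spark is fun"))
val cities = sc.parallelize(Seq(("alice", "Paris"), ("bob", "Lyon")))
val ages = sc.parallelize(Seq(("alice", 34), ("bob", 45)))

// flatMap + distinct: the set of words used across all lines
val distinctWords = lines.flatMap(_.split(" ")).distinct().collect().toSet

// map + filter + count: how many words are longer than 4 characters
val longWordCount = lines.flatMap(_.split(" ")).map(_.length).filter(_ > 4).count()

// join: combine two key/value RDDs on their key
val joined = cities.join(ages).collect().toMap

// union: concatenate two RDDs (duplicates are kept)
val unionSize = lines.union(lines).count()

sc.stop()
```

The calls read like operations on a Scala collection, but each one is executed by Spark, possibly across several machines.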

Created from

● A collection (List, Set)
● Various file formats
  ○ json, text, Hadoop SequenceFile, ...
● Various databases
  ○ JDBC, Cassandra, ...
● Other RDDs
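
Three of these sources can be sketched in a few lines. This is only an illustration, assuming Spark on the classpath and a local master; the temporary file and every value name are hypothetical, and database sources (JDBC, Cassandra) are left out because they need extra connectors.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("creation-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// 1. From a local collection
val fromCollection = sc.parallelize(List(1, 2, 3, 4))

// 2. From a text file (a temporary file is written here to keep the sketch self-contained)
val path = Files.createTempFile("sample", ".txt")
Files.write(path, "first line\nsecond line\n".getBytes("UTF-8"))
val fromFile = sc.textFile(path.toString)

// 3. From another RDD: every transformation returns a new RDD
val fromRdd = fromCollection.map(_ * 2)

val collectionSize = fromCollection.count()
val lineCount = fromFile.count()
val doubled = fromRdd.collect().toList

sc.stop()
```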

Sample

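import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("sample")
  .setMaster("local")
val sc = new SparkContext(conf)

// Count the lines of data.csv that are longer than 10 characters
val rdd = sc.textFile("data.csv")
val nb = rdd.map(s => s.length).filter(i => i > 10).count()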

Lazy evaluation

● Intermediate operators (transformations)
  ○ map, filter, distinct, flatMap, …
● Final operators (actions)
  ○ count, mean, fold, first, ...

val nb = rdd.map(s => s.length).filter(i => i > 10).count()
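
One way to observe the laziness is to count how often the function passed to map is actually called. The sketch below is an illustration under assumptions: Spark 2.0 or later (for longAccumulator), a local master, and made-up names and data. Accumulator updates inside a transformation are only reliable enough for a demonstration like this one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("lazy-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Counts how many times the map function has really been executed
val calls = sc.longAccumulator("map calls")

// Intermediate operator: only describes the computation, nothing runs yet
val lengths = sc.parallelize(Seq("spark", "zeppelin", "rdd"))
  .map { s => calls.add(1); s.length }

val callsBeforeAction = calls.sum

// Final operator: triggers the execution of the whole chain
val nb = lengths.filter(_ > 4).count()

val callsAfterAction = calls.sum

sc.stop()
```

In this sketch the accumulator stays at zero until count() is called, which is the point where Spark schedules the job.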

Caching

● Reuse an intermediate result
● cache operator
● Avoids recomputation

val r = rdd.map(s => s.length).cache()

val nb = r.filter(i => i > 10).count()
val sum = r.filter(i => i > 10).sum()

Distributed Architecture

Core concept

Run locally

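// Pick one master URL
val master = "local"       // a single worker thread
// val master = "local[*]" // one worker thread per logical core
// val master = "local[4]" // four worker threads

val conf = new SparkConf()
  .setAppName("sample")
  .setMaster(master)
val sc = new SparkContext(conf)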

Run on cluster

val master = "spark://..."

val conf = new SparkConf()
  .setAppName("sample")
  .setMaster(master)
val sc = new SparkContext(conf)

Standalone Cluster

[Diagram: one Spark master, three Spark slaves running executors (E), and three Spark clients]

Modules

Core concept

Composed of

● Spark Core
● Spark SQL, Spark Streaming, MLlib, GraphX
● DataFrames, ML Pipeline
● Several data sources

Several data sources

http://prog3.com/article/2015-06-18/2824958

Spark SQL

● Structured data processing
● SQL language
● DataFrame

DataFrame 1/3

● A distributed collection of rows organized into named columns

● An abstraction for selecting, filtering, aggregating and plotting structured data

● Provides a schema
● Not an RDD replacement

What?
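
To make the idea of rows organized into named columns concrete, here is a small sketch. It assumes Spark 2.0 or later, where SparkSession is the entry point (with the Spark 1.x releases current at the time of the talk, SQLContext played that role); the people dataset and all names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dataframe-sketch")
  .master("local[2]")
  .getOrCreate()
import spark.implicits._

// A DataFrame built from a local collection, with named columns
val people = Seq(("Alice", "Paris", 34), ("Bob", "Lyon", 45), ("Chloe", "Paris", 29))
  .toDF("name", "city", "age")

// Selecting and filtering through the DataFrame DSL
val parisianCount = people.filter($"city" === "Paris").select("name", "age").count()

// Aggregating: average age per city
val avgAgeByCity = people.groupBy("city").avg("age")
  .collect()
  .map(row => row.getString(0) -> row.getDouble(1))
  .toMap

// The same data queried with SQL
people.createOrReplaceTempView("people")
val over30 = spark.sql("SELECT name FROM people WHERE age > 30 ORDER BY name")
  .collect()
  .map(_.getString(0))
  .toList

spark.stop()
```

The DSL calls and the SQL query go through the same engine, so choosing one or the other is mostly a matter of taste.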

DataFrame 1/3

● RDDs are more efficient than what came before (Hadoop)

● But RDDs are still too complicated for common tasks

● DataFrames are simpler and faster

Why?

DataFrame 2/3

Optimized

DataFrame 3/3

● Available since Spark 1.3
● The DataFrame API is just an interface
  ○ The implementation is done once, in the Spark engine
  ○ All languages benefit from the optimizations without rewriting anything

How?

Spark Streaming

● Framework built on top of the RDD and DataFrame APIs
● Real-time data processing
● The RDD becomes a DStream here (a continuous sequence of RDDs)
● Same as before, but the dataset is not static
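
The "same as before" point can be illustrated with a word count written against the DStream API. This is a rough sketch, assuming the spark-streaming module is on the classpath; queueStream stands in for a real source (socket, Kafka, ...) so the example runs locally, and the data, names and timings are arbitrary choices.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1)) // one micro-batch per second

// Test input: each RDD pushed into the queue is consumed as one batch
val batches = mutable.Queue[RDD[String]]()
batches += ssc.sparkContext.parallelize(Seq("spark streaming", "spark sql"))

// Same operators as on an RDD, applied to every batch of the stream
val counts = ssc.queueStream(batches)
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

// Driver-side map collecting the per-batch results
val seen = mutable.Map[String, Int]()
counts.foreachRDD { rdd =>
  rdd.collect().foreach { case (word, n) => seen(word) = seen.getOrElse(word, 0) + n }
}

ssc.start()
ssc.awaitTerminationOrTimeout(3000) // let a few batches run
ssc.stop(stopSparkContext = true, stopGracefully = true)
```

Apart from the StreamingContext and the explicit start and stop, the transformation chain is the one that would be written for a static RDD.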

Spark Streaming

Internal flow

http://spark.apache.org/docs/latest/img/streaming-flow.png

Spark Streaming

Inputs / Outputs

http://spark.apache.org/docs/latest/img/streaming-arch.png

Spark MLlib

● Makes practical machine learning scalable and easy

● Provides common learning algorithms & utilities

Spark MLlib

● Divided into 2 packages
  ○ spark.mllib
  ○ spark.ml

Spark MLlib

● Original API, based on RDDs
● Each model has its own interface

spark.mllib

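Spark MLlib

spark.mllib

import org.apache.spark.SparkContext
import org.apache.spark.mllib.tree.RandomForest

val sc: SparkContext = ??? // init SparkContext

// randomSplit returns an Array of RDDs, hence the Array pattern
val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform */
  .randomSplit(Array(0.98, 0.02))

// 10 classes, no categorical features, 30 trees, maxDepth 7, maxBins 100, seed 0
val model = RandomForest.trainClassifier(
  trainingData, 10, Map[Int, Int](), 30, "auto", "gini", 7, 100, 0)

val prediction = model.predict(...)

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

// init SparkContext

val Array(trainingData, checkData) = sc.textFile("train.csv")
  /* transform */
  .randomSplit(Array(0.98, 0.02))

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(10)
  .run(trainingData)

val prediction = model.predict(...)

Each model exposes its own interface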

Spark MLlib

● Provides a uniform set of high-level APIs
● Built on top of DataFrames
● Pipeline concepts
  ○ Transformer
  ○ Estimator
  ○ Pipeline

spark.ml

Spark MLlib

spark.ml

● Transformer: transform(DF)
  ○ maps a DataFrame by adding a new column
  ○ e.g. predicts the label and adds the result as a new column
● Estimator: fit(DF)
  ○ learning algorithm
  ○ produces a model from a DataFrame

Spark MLlib

spark.ml

● Pipeline
  ○ sequence of stages (transformer or estimator)
  ○ specific order

Spark MLlib

spark.ml

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.sql.DataFrame

val training: DataFrame = ???
val test: DataFrame = ???

val lr = new LogisticRegression()
lr.setMaxIter(10).setRegParam(0.01)

// Train the model
val model1 = lr.fit(training)

// Predict on the test data
model1.transform(test)

Spark MLlib

spark.ml

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, RandomForestClassifier}
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.DataFrame

// Pipeline ending with a random forest
val training: DataFrame = ???
val test: DataFrame = ???

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val classifier = new RandomForestClassifier()
/* .add parameter */

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, classifier))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

model.transform(test)

// Pipeline ending with a logistic regression
val training: DataFrame = ???
val test: DataFrame = ???

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val classifier = new LogisticRegression()
/* .add parameter */

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, classifier))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)

model.transform(test)

Different models

Same way to create the pipeline

Zeppelin

Introduction

Big picture

Zeppelin introduction

What is it about?

● “A web-based notebook that enables interactive data analytics”

● 100% open source
● Undergoing incubation

Multi-purpose

● Data Ingestion
● Data Discovery
● Data Analytics
● Data Visualization & Collaboration

Multiple language backends

● Scala
● shell
● python
● markdown
● your own language, by creating your own interpreter

Data visualization

Easy way to build graphs from data
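
One simple route to a chart goes through Zeppelin's display system: paragraph output that starts with %table, with tab-separated columns and one row per line, is rendered as a table that can then be switched to a chart in the notebook UI. A minimal sketch in plain Scala, with made-up sales figures and hypothetical names:

```scala
// Hypothetical sample data
val sales = Seq(("Paris", 120), ("Lyon", 75), ("Nantes", 42))

// Header and rows: columns separated by tabs, rows separated by newlines
val header = "city\tamount"
val rows = sales.map { case (city, amount) => s"$city\t$amount" }
val table = (s"%table $header" +: rows).mkString("\n")

// In a Zeppelin Scala paragraph, this output is displayed as a table
println(table)
```

Outside Zeppelin the same code just prints the raw text, which makes the format easy to inspect.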

Demo

Thank you
