
Page 1: Large Scale Machine learning with Spark

Large scale machine learning with Apache Spark

Md. Mahedi Kaysar (Research Master), Insight Centre for Data Analytics [DCU] [email protected]

Page 2: Large Scale Machine learning with Spark

Agenda

• Spark overview
• Installing Spark and deploying applications
• Machine learning with Spark 2.0
  – Typical machine learning workflow
  – Spark MLlib
• Developing a machine learning application
  – Spam filtering

• Develop a machine learning application– Spam Filtering

Page 3: Large Scale Machine learning with Spark

Spark Overview

• Open-source large-scale data processing engine
• Up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk
• Applications can be written in Java, Scala, Python and R
• Runs on Mesos, YARN or its own standalone cluster manager
• Can access diverse data sources including HDFS, Cassandra, HBase and S3

Page 4: Large Scale Machine learning with Spark

Spark Overview

• MapReduce: distributed execution model
  – Map tasks read data from disk, process it and write it back to disk; before the shuffle, the map output is sent to the reducers
  – Reduce tasks read data from disk, process it and write the result back to disk

Page 5: Large Scale Machine learning with Spark

Spark Overview

• MapReduce execution model
  – Iterative jobs incur lots of disk I/O

Page 6: Large Scale Machine learning with Spark

Spark Overview

• Spark execution model
  – Uses memory instead of disk

Page 7: Large Scale Machine learning with Spark

Spark Overview

• RDD: Resilient Distributed Dataset
  – We write programs in terms of operations on distributed datasets
  – A partitioned collection of objects across the cluster, stored in memory or on disk
  – RDDs are built and manipulated through a diverse set of parallel transformations (map, filter, join) and actions (save, count, collect)
  – RDDs are automatically rebuilt on machine failure
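
A minimal sketch of these operations in the Scala shell (sc is the shell's SparkContext; the numbers are made up for illustration):

    val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))   // partitioned collection across the cluster
    val evens = nums.filter(_ % 2 == 0)             // transformation: filter
    val doubled = evens.map(_ * 2)                  // transformation: map
    println(doubled.count())                        // action: count   => 2
    println(doubled.collect().mkString(","))        // action: collect => 4,8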

Page 8: Large Scale Machine learning with Spark

Spark Overview

• RDD: Resilient Distributed Dataset
  – Immutable; the programmer specifies the number of partitions for an RDD

Page 9: Large Scale Machine learning with Spark

Spark Overview

• RDD transformations
  – Create a new dataset from an existing one

Page 10: Large Scale Machine learning with Spark

Spark Ecosystem

• Apache Spark 2.0
  – Spark Core: the underlying general execution engine; it provides in-memory computing, and the higher-level APIs are built on top of it
  – Spark SQL
  – Spark MLlib
  – Spark GraphX
  – Spark Streaming

Page 11: Large Scale Machine learning with Spark

Apache Spark 2.0

• Spark SQL
  – Module for structured or tabular data processing
  – Built around a data abstraction originally called SchemaRDD (now the DataFrame)
  – Internally it has more information about the structure of both the data and the computation being performed
  – Two ways to interact with Spark SQL (see the sketch below):
    • SQL queries: "SELECT * FROM PEOPLE"
    • Dataset/DataFrame: a domain-specific language
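
A minimal sketch of both styles, assuming a JSON file of people with name and age fields (the names and path are illustrative):

    import spark.implicits._

    val df = spark.read.json("people.json")
    df.createOrReplaceTempView("people")

    // 1) SQL query
    val adults1 = spark.sql("SELECT name FROM people WHERE age > 18")

    // 2) Dataset/DataFrame domain-specific language
    val adults2 = df.filter($"age" > 18).select("name")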

Page 12: Large Scale Machine learning with Spark


Apache Spark 2.0

• Spark SQL

Page 13: Large Scale Machine learning with Spark

Apache Spark 2.0

• Spark MLlib
  – Machine learning library
  – ML algorithms: common learning algorithms such as classification, regression, clustering and collaborative filtering (e.g. SVM, decision trees)
  – Featurization: feature extraction, transformation, dimensionality reduction and selection (e.g. term frequency, document frequency)
  – Pipelines: tools for constructing, evaluating and tuning ML pipelines
  – Persistence: saving and loading algorithms, models and pipelines
  – Utilities: linear algebra, statistics, data handling, etc.
  – The DataFrame-based API (spark.ml) is the primary API

Page 14: Large Scale Machine learning with Spark

Apache Spark 2.0

• Spark Streaming
  – Provides a high-level abstraction called a discretized stream, or DStream, which represents a continuous stream of data
  – A DStream is represented as a sequence of RDDs
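
A minimal DStream word count, assuming text lines arrive on a local socket (the host and port are illustrative):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))        // 1-second micro-batches
    val lines = ssc.socketTextStream("localhost", 9999)   // DStream[String]
    val counts = lines.flatMap(_.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)                 // each batch is an RDD
    counts.print()
    ssc.start()
    ssc.awaitTermination()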

Page 15: Large Scale Machine learning with Spark

Apache Spark 2.0

• Structured Streaming (experimental in 2.0)
  – Scalable and fault-tolerant stream processing engine built on the Spark SQL engine
  – The Spark SQL engine takes care of running the query incrementally and continuously, updating the final result as streaming data continues to arrive
  – Can be used with the Dataset and DataFrame APIs
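
The same word count as an incrementally updated streaming query, again assuming a local socket source:

    import spark.implicits._

    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
    val words = lines.as[String].flatMap(_.split(" "))
    val counts = words.groupBy("value").count()   // result table, updated as data arrives

    val query = counts.writeStream
      .outputMode("complete")
      .format("console")
      .start()
    query.awaitTermination()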

Page 16: Large Scale Machine learning with Spark

Apache Spark 2.0

• GraphX
  – Extends the Spark RDD with a new Graph abstraction
  – A directed multigraph with properties attached to each vertex and edge
  – Built-in algorithms:
    • PageRank: measures the importance of each vertex
    • Connected components
    • Triangle counting
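
A minimal PageRank run, assuming an edge-list file of follower pairs (the path is illustrative):

    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "data/followers.txt")
    val ranks = graph.pageRank(0.0001).vertices   // tolerance-based convergence
    ranks.take(5).foreach(println)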

Page 17: Large Scale Machine learning with Spark

Apache Spark 2.0

• RDD vs. DataFrame vs. Dataset
  – All are immutable, distributed datasets
  – The RDD (resilient distributed dataset) is the main building block of Apache Spark; it processes data in memory for efficiency
  – DataFrame and Dataset are more abstract than RDD; they are optimized and work well when you have structured data such as CSV, JSON or Hive tables
  – When you have raw data such as a text file, you can use an RDD and transform it into structured data with the help of the DataFrame and Dataset APIs

Page 18: Large Scale Machine learning with Spark

Apache Spark 2.0

• RDD
  – Immutable, partitioned collections of objects
  – Two main kinds of operations: transformations (lazy) and actions (eager)
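
A tiny illustration of the two kinds of operations:

    val nums = sc.parallelize(1 to 10)
    val doubled = nums.map(_ * 2)      // transformation: lazy, nothing executes yet
    val total = doubled.reduce(_ + _)  // action: triggers the actual job => 110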

Page 19: Large Scale Machine learning with Spark

Apache Spark 2.0

• DataFrame
  – A dataset organized into named columns
  – Conceptually equivalent to a table in a relational database
  – Can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases, or existing RDDs

Page 20: Large Scale Machine learning with Spark

Apache Spark 2.0

• Dataset
  – A distributed collection of data
  – Has the benefits of both RDDs and DataFrames, with more optimization
  – You can convert between a Dataset and the other forms of data
  – It is the newest API for data collections
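
A minimal typed Dataset, following the pattern from the Spark documentation (the Person class is illustrative):

    import spark.implicits._

    case class Person(name: String, age: Long)
    val ds = Seq(Person("Andy", 32), Person("Justin", 19)).toDS()
    ds.filter(_.age > 21).show()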

Page 21: Large Scale Machine learning with Spark

Apache Spark 2.0

• DataFrame/Dataset
  – Reading JSON data as a Dataset (see the sketch below)
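
A sketch, assuming a people.json file with name and age fields:

    import spark.implicits._

    val df = spark.read.json("people.json")   // DataFrame (untyped)
    case class Person(name: String, age: Long)
    val people = df.as[Person]                // Dataset[Person] (typed)
    people.show()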

Page 22: Large Scale Machine learning with Spark

Apache Spark 2.0

• DataFrame/Dataset
  – Connecting to Hive and querying it with HiveQL (see the sketch below)
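
A minimal sketch, assuming a Hive table named src already exists in the metastore:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("HiveExample")
      .enableHiveSupport()   // connect to the Hive metastore
      .getOrCreate()

    spark.sql("SELECT key, value FROM src").show()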

Page 23: Large Scale Machine learning with Spark

Apache Spark 2.0

• Dataset
  – You can convert a Dataset to an RDD and an RDD to a Dataset
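
Both directions in one sketch, reusing the people Dataset from the previous example:

    import spark.implicits._

    val rdd = people.rdd   // Dataset[Person] -> RDD[Person]
    val ds = rdd.toDS()    // RDD[Person] -> Dataset[Person]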

Page 24: Large Scale Machine learning with Spark

Spark Cluster Overview

• Spark uses a master/slave architecture
• One central coordinator, called the driver, communicates with many distributed workers (executors)
• The driver and the executors each run in their own Java process
• The driver is the process where the main method runs; it converts the user program into tasks and schedules them onto executors with the help of the cluster manager
• The cluster manager launches the executors and manages the worker nodes
• Executors run the Spark tasks and send the results back to the driver; they also provide in-memory storage for RDDs cached by the user program
• The workers are in charge of reporting the availability of their resources to the cluster manager

Page 25: Large Scale Machine learning with Spark

Spark Cluster Overview

• Standalone: a simple cluster manager included with Spark that makes it easy to set up a cluster
• Example: a standalone cluster with 2 worker instances, each with 2 cores
  – On a local machine or on cloud EC2 instances

    conf/spark-env.sh:
        export SPARK_WORKER_MEMORY=1g
        export SPARK_EXECUTOR_MEMORY=1g
        export SPARK_WORKER_INSTANCES=2
        export SPARK_WORKER_CORES=2
        export SPARK_WORKER_DIR=/home/work/sparkdata

    Start the master:   ./sbin/start-master.sh
    conf/slaves:        list the worker host IPs, one per line (for a single-machine setup, the master node's own IP)
    Start the workers:  ./sbin/start-slaves.sh

Page 26: Large Scale Machine learning with Spark


Application Deployment

• Standalone mode

Page 27: Large Scale Machine learning with Spark

Application Deployment

• Standalone mode (see the sketch below)
  – Client deploy mode
  – Cluster deploy mode
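
A hedged spark-submit sketch for each mode (the class name, jar and master URL are placeholders):

    # Client mode: the driver runs in the submitting process
    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --deploy-mode client \
      myapp.jar

    # Cluster mode: the driver is launched on one of the worker nodes
    ./bin/spark-submit \
      --class com.example.MyApp \
      --master spark://master-host:7077 \
      --deploy-mode cluster \
      myapp.jar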

Page 28: Large Scale Machine learning with Spark

Machine Learning with Spark

• Typical machine learning workflow:
  1. Load the sample data
  2. Parse the data into the input format for the algorithm
  3. Pre-process the data and handle missing values
  4. Split the data into two sets: one for building the model (the training dataset) and one for testing the model (the validation dataset)
  5. Run the algorithm to build or train your ML model

Page 29: Large Scale Machine learning with Spark

Machine Learning with Spark

• Typical machine learning workflow (continued):
  6. Make predictions with the training data and observe the results
  7. Test and evaluate the model with the test data, or alternatively validate the model with a cross-validation technique using a third dataset, called the validation dataset
  8. Tune the model for better performance and accuracy
  9. Scale the model up so that it can handle massive datasets in the future
  10. Deploy the ML model in production

Page 30: Large Scale Machine learning with Spark

Machine Learning with Spark

• Pre-processing
  – The three most common data preprocessing steps are:
    • Formatting: the data may not be in a usable shape
    • Cleaning: the data may have unwanted records, or records with missing entries; cleaning deals with removing or fixing missing data
    • Sampling: useful when the available data is very large
  – Data transformation: between Dataset, RDD and DataFrame

Page 31: Large Scale Machine learning with Spark

Machine Learning with Spark

• Feature engineering
  – Extraction: extracting features from "raw" data
  – Transformation: scaling, converting or modifying features
  – Selection: selecting a subset from a larger set of features

Page 32: Large Scale Machine learning with Spark

Machine Learning with Spark

• ML algorithms
  – Classification
  – Regression
  – Tuning

Page 33: Large Scale Machine learning with Spark

Machine Learning with Spark

• ML Pipeline
  – A higher-level API built on top of DataFrames
  – Combines multiple algorithms into a complete workflow
  – For example, text analytics (see the sketch below):
    • Split the text into words
    • Convert the words into numerical feature vectors
    • Combine the feature vectors with labels
    • Build an ML model as a prediction model from the vectors and labels
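
A minimal pipeline for exactly this text workflow, following the standard spark.ml pattern (the column names are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(training)   // training: DataFrame with "text" and "label" columns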

Page 34: Large Scale Machine learning with Spark

Machine Learning with Spark

• ML Pipeline components (see the snippet below)
  – Transformer
    • An abstraction that includes feature transformers and learned models
    • An algorithm for transforming one DataFrame/Dataset into another
    • Example: HashingTF
  – Estimator
    • An algorithm that can be fit on a DataFrame/Dataset to produce a Transformer (a model)
    • Example: LogisticRegression
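
The distinction in two lines (trainingFeatures and testFeatures are placeholder DataFrames with "features" and "label" columns):

    val lrModel = lr.fit(trainingFeatures)        // Estimator.fit         => Model (a Transformer)
    val scored = lrModel.transform(testFeatures)  // Transformer.transform => new DataFrame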

Page 35: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Given some e-mails in an inbox, the task is to identify the e-mails that are spam and those that are non-spam (often called ham)

Page 36: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Reading the dataset (see the sketch below)
  – SparkSession is the single entry point for interacting with the underlying Spark functionality; it enables DataFrame and Dataset programming
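
A sketch, assuming the SMS Spam Collection file (tab-separated label and message) sits at a local path:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("SpamFilter")
      .master("local[*]")
      .getOrCreate()

    val raw = spark.read.text("data/SMSSpamCollection")   // one string column named "value"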

Page 37: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Pre-processing the dataset (see the sketch below)
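
A hedged parsing step, reusing raw from the previous sketch and assuming each line has the form "<label>\t<message>" as in the SMS Spam Collection:

    import spark.implicits._

    val data = raw.as[String].map { line =>
      val Array(label, text) = line.split("\t", 2)
      (if (label == "spam") 1.0 else 0.0, text)   // spam => 1.0, ham => 0.0
    }.toDF("label", "text")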

Page 38: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Feature extraction: building feature vectors
  – TF: term frequency is the number of times a term appears in a document
    • A feature vectorization method

Page 39: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Tokenizer: a Transformer that tokenizes text into words
  – HashingTF: a Transformer that builds feature vectors using the TF technique (see the sketch below)
    • Takes a set of terms
    • Converts them into a feature vector
    • Uses the hashing trick to index terms
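
The two stages for this dataset (the numFeatures value is an illustrative choice):

    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    val tokenizer = new Tokenizer()
      .setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF()
      .setInputCol("words").setOutputCol("features")
      .setNumFeatures(1000)   // size of the hashed feature space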

Page 40: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Train a model (see the sketch below)
  – Define the classifier
  – Fit the training set
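
Putting the stages together and fitting on a training split, reusing data, tokenizer and hashingTF from the previous sketches (the split ratio and hyperparameters are illustrative):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42)

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
    val model = pipeline.fit(train)   // fitted PipelineModel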

Page 41: Large Scale Machine learning with Spark

Machine Learning with Spark

• Spam detection (spam filtering)
  – Test the model (see the sketch below)
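
Scoring the held-out split and evaluating it (area under ROC is one reasonable metric here):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

    val predictions = model.transform(test)
    val evaluator = new BinaryClassificationEvaluator()
      .setLabelCol("label")   // default metric: areaUnderROC
    println(s"AUC = ${evaluator.evaluate(predictions)}")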

Page 42: Large Scale Machine learning with Spark

Machine Learning with Spark

• Tuning
  – Model selection:
    • Hyperparameter tuning
    • Finding the best model or parameters for a given task
    • Tuning can be done for an individual Estimator, such as logistic regression, or for an entire Pipeline
  – Model selection via cross-validation
  – Model selection via train-validation split

Page 43: Large Scale Machine learning with Spark

Machine Learning with Spark

• Tuning
  – Model selection workflow:
    • Split the input data into separate training and test sets
    • For each (training, test) pair, iterate through a set of ParamMaps:
      – For each ParamMap, fit the Estimator with those parameters
      – Take the fitted model and evaluate its performance with the Evaluator
    • Select the model produced by the best-performing set of parameters

Page 44: Large Scale Machine learning with Spark

Machine Learning with Spark

• Tuning
  – Model selection workflow:
    • The Evaluator can be a RegressionEvaluator, a BinaryClassificationEvaluator, and so on

Page 45: Large Scale Machine learning with Spark

Machine Learning with Spark

• Tuning
  – Model selection via cross-validation (see the sketch below)
    • CrossValidator begins by splitting the dataset into a set of folds, e.g. k = 3, which gives 3 (training, test) dataset pairs
    • Each pair uses 2/3 of the data for training and 1/3 for testing
    • To evaluate a particular ParamMap, it computes the average evaluation metric over the 3 models fitted by the Estimator
    • It is a well-established method for choosing parameters, and more statistically sound than heuristic hand-tuning
  – Model selection via train-validation split
    • Evaluates each combination of parameters only once
    • Less expensive, but produces less reliable results when the training dataset is not sufficiently large
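
A cross-validation sketch over the spam pipeline defined earlier (the grid values are illustrative):

    import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000))
      .addGrid(lr.regParam, Array(0.1, 0.01))
      .build()

    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)   // k = 3 folds

    val cvModel = cv.fit(train)   // refits the best model on the full training set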

Page 46: Large Scale Machine learning with Spark

Spam Filtering Application

• What have we done so far?
  – Reading the dataset
  – Cleaning
  – Feature engineering
  – Training
  – Testing
  – Tuning
  – Deploying
  – Persisting the model (see the sketch below)
  – Reusing an existing model
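
Persisting and reloading the fitted pipeline via the spark.ml persistence API (the path is illustrative):

    import org.apache.spark.ml.PipelineModel

    model.write.overwrite().save("models/spam-model")   // save the fitted PipelineModel
    val reloaded = PipelineModel.load("models/spam-model")
    reloaded.transform(test).show(5)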

Page 47: Large Scale Machine learning with Spark


Spam Filtering Application

• Deployment

Page 48: Large Scale Machine learning with Spark


Thanks

Questions??