Machine Learning by Example - Apache Spark
Service Symphony Ltd
Apache Spark Machine Learning by Example
Meeraj Kunnumpurath
25th of February 2017
Introduction
❖ Working as a technologist and software architect for a couple of decades, at a number of leading financial institutions in the UK
❖ Authored a number of books on Enterprise Java, Web Services and SOA
❖ Spoken at a number of technology conferences
❖ Founded Service Symphony Ltd in 2009 serving leading financial services customers building mission critical middleware
❖ Engineer with a keen interest in ML, AI and Data Science
❖ Blog: http://www.servicesymphony.com/blog
❖ Email: [email protected]
❖ Presentation: https://www.slideshare.net/MeerajKunnumpurath/machine-learning-by-example-apache-spark
❖ GitHub: https://github.com/kunnum/sandbox/tree/master/notebooks
Agenda
❖ Introduction to using ML with Apache Spark
❖ Hands-on example driven approach
❖ Not a deep dive into Apache Spark Architecture
❖ Nor a deep dive into ML algorithms
❖ Examples built using Apache Zeppelin
❖ Some of the examples are from the Apache Spark documentation
Apache Spark - Overview
❖ Open source, large scale distributed data processing fabric
❖ Offers multiple components addressing different facets of data science for big and fast data processing, ML, analytics and data ingestion
❖ Ability to process large amounts of data in memory, spanning multiple process spaces
❖ Initially started as a research project at UC Berkeley
❖ Originally released under a BSD licence; a top level ASF project since 2014, licensed under ASL 2.0
❖ One of the most active open source projects, arguably the most active ASF project
❖ Adopted, extended and commercialised by multiple vendors playing in the data science realm
Apache Spark - Architecture
Scala - Spark Natural Transition
❖ Interest in Spark stemmed from a deep interest in Scala and functional programming
❖ Data processing ecosystem built around Scala, with a strong synergy with Scala’s design motivations
❖ Extends Scala’s idiomatic functional programming model beyond process boundaries
❖ Spark RDDs - Scala collections on steroids
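
The "Scala collections on steroids" point can be illustrated with a small word count. This is only a sketch for a Zeppelin paragraph or spark-shell session; the word list and the application name are made up for illustration, and the local master setting is an assumption.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin or spark-shell a session usually exists already; getOrCreate() reuses it.
val spark = SparkSession.builder()
  .master("local[*]")              // assumption: local mode, for illustration only
  .appName("rdd-vs-collections")   // hypothetical application name
  .getOrCreate()

// Made-up input data.
val words = Seq("spark", "scala", "spark", "ml", "scala", "spark")

// Plain Scala collection: map and aggregate inside a single JVM.
val localCounts: Map[String, Int] =
  words.map(w => (w, 1))
    .groupBy(_._1)
    .map { case (word, pairs) => (word, pairs.map(_._2).sum) }

// RDD: the same functional style, but the partitions may live in different processes.
val rddCounts: Map[String, Int] =
  spark.sparkContext.parallelize(words)
    .map(w => (w, 1))
    .reduceByKey(_ + _)
    .collect()
    .toMap

println(localCounts)
println(rddCounts)
```

Both pipelines are written as a chain of higher-order functions; only the RDD version can be spread across executors.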
Spark - Scala Notebook
ML Components
❖ Data Structures
  ❖ Vectors and Matrices
  ❖ Data Frames
❖ Feature Extractors and Transformers
❖ Estimators
❖ Models
❖ Pipelines
❖ Evaluators
❖ Tuning Aids
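
A minimal sketch of how some of the components listed above fit together, assuming Spark 2.x and a Zeppelin or spark-shell session. The column names (x1, x2, features, scaled) and the data are invented for illustration and are not taken from the original notebooks.

```scala
import org.apache.spark.ml.feature.{StandardScaler, VectorAssembler}
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")            // assumption: local mode
  .appName("ml-components")      // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Data structures: a dense and a sparse vector holding the same values.
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sparse: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))

// DataFrame with made-up numeric columns.
val df = Seq((1.0, 10.0), (2.0, 20.0), (3.0, 30.0)).toDF("x1", "x2")

// Transformer: VectorAssembler only reshapes columns, so it has transform() and no fit().
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")
val assembled = assembler.transform(df)

// Estimator: StandardScaler has to see the data first; fit() returns a model.
val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)
  .setWithStd(true)
val scalerModel = scaler.fit(assembled)

// Model: the fitted StandardScalerModel is itself a Transformer.
val scaled = scalerModel.transform(assembled)
scaled.show(false)
```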
ML Components - Notebook
Spark ML - Pipeline Architecture
❖ DataFrame
❖ Estimator
❖ Transformer
❖ Pipeline
❖ Parameter
Training time flow: Pipeline in estimator mode
Pipeline.fit() creates a pipeline model
Test time flow: Pipeline in transformer mode
PipelineModel.transform() creates a DataFrame with augmented prediction columns
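
The two flows can be sketched with a small text classification pipeline, closely following the pipeline example in the Apache Spark documentation. It is an illustration rather than the original notebook; the application name and the local master setting are assumptions.

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")           // assumption: local mode
  .appName("pipeline-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Labelled training documents (toy data).
val training = Seq(
  (0L, "a b c d e spark", 1.0),
  (1L, "b d", 0.0),
  (2L, "spark f g h", 1.0),
  (3L, "hadoop mapreduce", 0.0)
).toDF("id", "text", "label")

// Stages: two transformers followed by one estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.001)
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// Training time: the Pipeline acts as an estimator and fit() returns a PipelineModel.
val model: PipelineModel = pipeline.fit(training)

// Test time: the PipelineModel acts as a transformer and appends prediction columns.
val test = Seq(
  (4L, "spark i j k"),
  (5L, "l m n"),
  (6L, "mapreduce spark"),
  (7L, "apache hadoop")
).toDF("id", "text")
val predictions = model.transform(test)
predictions.select("id", "text", "probability", "prediction").show(false)
```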
ML Pipeline Notebook
Regression
❖ Supervised Learning Algorithm for predicting continuous labels
❖ Multiple Algorithms
  ❖ Linear Regression
  ❖ Generalised Linear Regression
  ❖ Decision Tree Regression
  ❖ Random Forest Regression
  ❖ Gradient Boosted Tree Regression
  ❖ Survival Regression
  ❖ Isotonic Regression
❖ Works with input feature vectors and labelled points
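
As a rough illustration of the simplest algorithm in the list, the sketch below fits a linear regression to synthetic points generated from y = 2x + 1. The data, the application name and the parameter values are assumptions chosen for the example, not values from the talk.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")             // assumption: local mode
  .appName("regression-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Synthetic labelled points: label = 2 * x + 1, with a single feature x.
val data = (1 to 10).map { i =>
  val x = i.toDouble
  (2.0 * x + 1.0, Vectors.dense(x))
}.toDF("label", "features")

// Estimator with default (zero) regularisation.
val lr = new LinearRegression().setMaxIter(50)

// fit() returns a LinearRegressionModel holding the learned weights.
val model = lr.fit(data)
println(s"coefficients = ${model.coefficients}, intercept = ${model.intercept}")
println(s"training RMSE = ${model.summary.rootMeanSquaredError}")

// The model is a transformer: transform() appends a "prediction" column.
val predictions = model.transform(data)
predictions.show(false)
```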
Linear Regression - Notebook
Classification
❖ Supervised learning for predicting discrete labels
❖ Multiple algorithms
  ❖ Binomial and multinomial logistic regression
  ❖ Decision tree classifier
  ❖ Random forest classifier
  ❖ Gradient boosted tree classifier
  ❖ Multi-layer neural network classifier
  ❖ Naive Bayes Classifier
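
A minimal sketch of multinomial logistic regression on three well separated toy classes, assuming Spark 2.1 or later (where setFamily and coefficientMatrix are available). The data points, the application name and the regularisation value are made up for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")                 // assumption: local mode
  .appName("classification-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Three toy classes clustered around (0,0), (10,0) and (0,10).
val training = Seq(
  (0.0, Vectors.dense(0.0, 0.2)), (0.0, Vectors.dense(0.3, 0.1)), (0.0, Vectors.dense(0.1, 0.4)),
  (1.0, Vectors.dense(10.0, 0.1)), (1.0, Vectors.dense(9.7, 0.3)), (1.0, Vectors.dense(10.2, 0.2)),
  (2.0, Vectors.dense(0.2, 10.0)), (2.0, Vectors.dense(0.1, 9.8)), (2.0, Vectors.dense(0.4, 10.3))
).toDF("label", "features")

// family = "multinomial" requests softmax regression; "binomial" covers the two-class case.
val lr = new LogisticRegression()
  .setFamily("multinomial")
  .setMaxIter(100)
  .setRegParam(0.01)

val model = lr.fit(training)
println(s"coefficients:\n${model.coefficientMatrix}")
println(s"intercepts: ${model.interceptVector}")

// Evaluated on the training data for brevity; a real notebook would hold out a test set.
val predictions = model.transform(training)
val accuracy = new MulticlassClassificationEvaluator()
  .setMetricName("accuracy")
  .evaluate(predictions)
println(s"training accuracy = $accuracy")
```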
Classification - Notebook
Clustering
❖ Unsupervised learning that groups feature vectors by similarity
❖ Multiple algorithms
  ❖ K-Means Clustering
  ❖ LDA - Latent Dirichlet Allocation
  ❖ Bisecting K-Means
  ❖ Gaussian Mixture Model
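
K-Means, the first algorithm in the list, might be exercised as follows. This is a sketch on six invented points forming two obvious groups; k, the seed and the application name are illustrative assumptions.

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")             // assumption: local mode
  .appName("clustering-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Unlabelled feature vectors: one group near (0,0), one near (9,9).
val points = Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1), Vectors.dense(0.2, 0.0),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.2), Vectors.dense(9.2, 9.0)
).map(Tuple1.apply).toDF("features")

// Estimator: k is chosen by hand here; the seed makes the run repeatable.
val kmeans = new KMeans().setK(2).setSeed(1L)
val model = kmeans.fit(points)

// The fitted model exposes the learned centres.
model.clusterCenters.foreach(println)

// transform() appends a "prediction" column holding the cluster index.
val clustered = model.transform(points)
clustered.show(false)
```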
Clustering - Notebook
Collaborative Filtering
❖ Commonly used for recommender systems
❖ Uses ALS (Alternating Least Squares) to learn latent factors in user-to-item associations
❖ By default, the matrix factorization assumes explicit feedback, such as ratings
❖ You can explicitly enable implicit preferences
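
A small sketch of both modes, assuming the DataFrame-based ALS in Spark 2.x. The ratings, the column names (userId, itemId, rating) and the hyperparameter values are invented for illustration.

```scala
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")      // assumption: local mode
  .appName("als-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Made-up explicit ratings: (user, item, rating).
val ratings = Seq(
  (0, 0, 5.0f), (0, 1, 4.0f), (0, 2, 1.0f),
  (1, 0, 4.0f), (1, 1, 5.0f), (1, 3, 1.0f),
  (2, 2, 5.0f), (2, 3, 4.0f), (2, 0, 1.0f),
  (3, 2, 4.0f), (3, 3, 5.0f), (3, 1, 1.0f)
).toDF("userId", "itemId", "rating")

// Explicit feedback is the default: the rating column is treated as a score to reproduce.
val als = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setRank(2)        // number of latent factors
  .setMaxIter(5)
  .setRegParam(0.1)
  .setSeed(42L)

val model = als.fit(ratings)

// Scoring known (user, item) pairs avoids NaN predictions for unseen ids.
val predictions = model.transform(ratings)
predictions.show(false)

// Implicit preferences (clicks, views and so on) have to be switched on explicitly.
val implicitAls = new ALS()
  .setUserCol("userId")
  .setItemCol("itemId")
  .setRatingCol("rating")
  .setImplicitPrefs(true)
```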
Collaborative Filtering - Notebook
Model Tuning
❖ API to tune an individual estimator or the entire pipeline using a normalised parameter model
❖ API to support k-fold cross validation
❖ API to evaluate performance on regression, as well as binary and multiclass classification
❖ API for performing a train-validation split
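
The tuning APIs can be sketched together as follows, assuming Spark 2.x. The synthetic data, the grid values, the fold count and the train ratio are illustrative assumptions; the estimator could equally be a whole Pipeline.

```scala
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder, TrainValidationSplit}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")         // assumption: local mode
  .appName("tuning-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Synthetic data: label = 3 * x plus a small deterministic perturbation.
val data = (1 to 40).map { i =>
  val x = i.toDouble
  (3.0 * x + ((i % 3) - 1) * 0.5, Vectors.dense(x))
}.toDF("label", "features")

val lr = new LinearRegression().setMaxIter(20)

// Parameter grid: 3 x 2 = 6 candidate parameter maps.
val grid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.0, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5))
  .build()

// Evaluator: root mean squared error on the held-out part.
val evaluator = new RegressionEvaluator().setMetricName("rmse")

// k-fold cross validation over the grid.
val cv = new CrossValidator()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(evaluator)
  .setNumFolds(3)
  .setSeed(7L)
val cvModel = cv.fit(data)
println(s"average RMSE per candidate: ${cvModel.avgMetrics.mkString(", ")}")

// Cheaper alternative: a single train-validation split.
val tvs = new TrainValidationSplit()
  .setEstimator(lr)
  .setEstimatorParamMaps(grid)
  .setEvaluator(evaluator)
  .setTrainRatio(0.75)
  .setSeed(7L)
val tvsModel = tvs.fit(data)
println(s"validation RMSE per candidate: ${tvsModel.validationMetrics.mkString(", ")}")
```

Cross validation fits every candidate once per fold, so the train-validation split is the faster option when the grid is large.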
Model Tuning - Notebook
Questions