Introduction to Spark ML Pipelines Workshop


Transcript of Introduction to Spark ML Pipelines Workshop

Page 1: Introduction to Spark ML Pipelines Workshop

Introduction to Spark ML: Machine learning at scale

ApacheCon 2016

Hella-Legit

Page 2: Introduction to Spark ML Pipelines Workshop

Who am I?

Holden
● I prefer she/her for pronouns
● Co-author of the Learning Spark book
● Software Engineer at IBM’s Spark Technology Center
● @holdenkarau
● http://www.slideshare.net/hkarau
● https://www.linkedin.com/in/holdenkarau

Page 3: Introduction to Spark ML Pipelines Workshop

What we are going to explore together!

● Who I think you all are
● Spark’s two different ML APIs
● Running through a simple example with one
● A brief detour into some codegen funtimes
● Exercises!
● Model save/load
● Discussion of “serving” options

Page 4: Introduction to Spark ML Pipelines Workshop

The different pieces of Spark

[Diagram: the Apache Spark stack. Core language APIs (Scala, Java, Python, & R); SQL & DataFrames; Streaming; graph tools (Bagel & GraphX); Spark ML; MLLib; community packages.]

Page 5: Introduction to Spark ML Pipelines Workshop

Who do I think you all are?

● Nice people*
● Some knowledge of Apache Spark core & maybe SQL
● Interested in using Spark for Machine Learning
● Familiar-ish with Scala or Java or Python


Page 6: Introduction to Spark ML Pipelines Workshop

Skipping intro & set-up time :)

Page 7: Introduction to Spark ML Pipelines Workshop

But maybe time to upgrade...

● Spark 1.5+ (Spark 1.6 would be best!)
  ○ (built with Hive support if building from source)


Page 9: Introduction to Spark ML Pipelines Workshop

Getting some data for working with:

● Census data: https://archive.ics.uci.edu/ml/datasets/Adult
● Goal: predict income > 50k
● Also included in the github repo
● Download that now if you haven’t already
● We will add a header to the data
  ○ http://pastebin.ca/3318687


Page 10: Introduction to Spark ML Pipelines Workshop

So what are the two APIs?

● Traditional and Pipeline
  ○ Pipeline is the new shiny future which will fix all problems*
● Traditional API works on RDDs
  ○ Data preparation work is generally done in traditional Spark transformations
● Pipeline API works on DataFrames
  ○ Often we want to apply some transformations to our data before feeding to the machine learning algorithm
  ○ Makes it easy to chain these together

(*until replaced by a newer shinier future)


Page 11: Introduction to Spark ML Pipelines Workshop

So what are DataFrames?

● Spark SQL’s version of RDDs (it’s for more than just SQL)
● Restricted data types, schema information, compile time untyped
● Restricted operations (more relational style)
● Allow lots of fun extra optimizations
  ○ Tungsten etc.
● We’ll talk more about them (& Datasets) when we do the Spark SQL component of this workshop
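As a minimal sketch (not on the slide) of what that buys us, assuming the sc & sqlContext the shell provides:

rdd = sc.parallelize([("a", 1), ("b", 2)])
df_example = sqlContext.createDataFrame(rdd, ["letter", "count"])
df_example.printSchema()  # schema information comes along for free
df_example.filter(df_example["count"] > 1).show()  # relational-style ops the optimizer understands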

Page 12: Introduction to Spark ML Pipelines Workshop

Transformers, Estimators and Pipelines

● Transformers transform a DataFrame into another
● Estimators can be trained on a DataFrame to produce a transformer
● Pipelines chain together multiple transformers and estimators
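To make the fit/transform split concrete, a minimal sketch (assuming a DataFrame df with a string “category” column, as in the example later on):

from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="category", outputCol="category-index")  # an estimator
indexer_model = indexer.fit(df)        # fit() returns a transformer (a StringIndexerModel)
indexed = indexer_model.transform(df)  # transform() just maps one DataFrame to another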

Page 13: Introduction to Spark ML Pipelines Workshop

Let’s start with loading some data

● We’ve got some CSV data, we could use textFile and parse by hand
● spark-packages saves us from that: the spark-csv package by Hossein Falaki
  ○ If we were building a Java project we could include the maven coordinates
  ○ For the Spark shell when launching add:

--packages com.databricks:spark-csv_2.10:1.3.0
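For example, launching the PySpark shell with the package (the _2.10 in the coordinates assumes a Scala 2.10 build of Spark):

./bin/pyspark --packages com.databricks:spark-csv_2.10:1.3.0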


Page 14: Introduction to Spark ML Pipelines Workshop

Loading with sparkSQL & spark-csv

sqlContext.read returns a DataFrameReader
We can specify general properties & data specific options:
● option(“key”, “value”)
  ○ spark-csv ones we will use are header & inferSchema
● format(“formatName”)
  ○ built in formats include parquet, jdbc, etc.; today we will use com.databricks.spark.csv
● load(“path”)


Page 15: Introduction to Spark ML Pipelines Workshop

Loading with sparkSQL & spark-csv

df = sqlContext.read \
    .format("com.databricks.spark.csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("resources/adult.data")
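A quick sanity check (not on the slide) that inferSchema did its job:

df.printSchema()  # age & education-num should be numeric, not string
df.show(5)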


Page 16: Introduction to Spark ML Pipelines Workshop

Let’s explore training a Decision Tree

● Step 1: Data loading (done!)
● Step 2: Data prep (select features, etc.)
● Step 3: Train
● Step 4: Predict

Page 17: Introduction to Spark ML Pipelines Workshop

Data prep / cleaning

● We need to predict a double (can be 0.0, 1.0, but type must be double)
● We need to train with a vector of features

Imports:

from pyspark.mllib.linalg import Vectors
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.param import Param, Params
from pyspark.ml.feature import Bucketizer, VectorAssembler, StringIndexer
from pyspark.ml import Pipeline


Page 18: Introduction to Spark ML Pipelines Workshop

Data prep / cleaning continued

# Combines a list of double input features into a vector
assembler = VectorAssembler(inputCols=["age", "education-num"],
                            outputCol="features")

# String indexer converts a set of strings into doubles
indexer = StringIndexer(inputCol="category") \
    .setOutputCol("category-index")

# Can be used to combine pipeline components together
pipeline = Pipeline().setStages([assembler, indexer])


Page 19: Introduction to Spark ML Pipelines Workshop

So a bit more about that pipeline

● Each of our previous components has “fit” & “transform” stages

● Constructing the pipeline this way makes it easier to work with (only need to call one fit & one transform)

● Can re-use the fitted model on future data

model = pipeline.fit(df)
prepared = model.transform(df)
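A quick peek at what the fitted pipeline produced (a sketch; the column names follow the assembler & indexer defined above):

prepared.select("features", "category-index").show(5)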


Page 20: Introduction to Spark ML Pipelines Workshop

What does our pipeline look like so far?

[Diagram] Input Data → Assembler → Input Data + Vectors → StringIndexer → Input Data + Category Index + Vectors

The Assembler is a regular transformer - no fitting required. The StringIndexer, while not an ML learning algorithm, still needs to be fit (it has to see the data to build the string-to-index mapping).

Page 21: Introduction to Spark ML Pipelines Workshop

Let's train a model on our prepared data:

# Specify model
dt = DecisionTreeClassifier(labelCol="category-index",
                            featuresCol="features")

# Fit it
dt_model = dt.fit(prepared)

# Or as part of the pipeline
pipeline_and_model = Pipeline().setStages([assembler, indexer, dt])
pipeline_model = pipeline_and_model.fit(df)

Page 22: Introduction to Spark ML Pipelines Workshop

And predict the results on the same data:

pipeline_model.transform(df).select("prediction", "category-index").take(20)

Page 23: Introduction to Spark ML Pipelines Workshop

Exercise 1: Go from the index to something useful

● We could manually look up the labels and then write a select statement

● Or we could look at the labels on the StringIndexerModel and use IndexToString

● Our pipeline has an array of stages we can use for this

Page 24: Introduction to Spark ML Pipelines Workshop

Solution:

from pyspark.ml.feature import IndexToString

labels = list(pipeline_model.stages[1].labels())
inverter = IndexToString(inputCol="prediction",
                         outputCol="prediction-label", labels=labels)
inverter.transform(pipeline_model.transform(df)).select(
    "prediction-label", "category").take(20)
# Pre Spark 1.6 use SQL if/else or similar

Page 25: Introduction to Spark ML Pipelines Workshop

So what could we do for other types of data?

● org.apache.spark.ml.feature has a lot of options (a sketch chaining two of them follows)
  ○ HashingTF
  ○ Tokenizer
  ○ Word2Vec
  ○ etc.
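For instance, a minimal sketch (not from the slides) wiring two of these together for text data, assuming a hypothetical "text" column:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF

tok = Tokenizer(inputCol="text", outputCol="words")
tf = HashingTF(inputCol="words", outputCol="tf-features", numFeatures=1000)
# Both are plain transformers, so they slot straight into a Pipeline
text_pipeline = Pipeline().setStages([tok, tf])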

Page 26: Introduction to Spark ML Pipelines Workshop

Exercise 2: Add more features to your tree

● Finished quickly? Help others!
● Or tell me if adding these features helped or not…
  ○ We can download a reserved “test” dataset, but how would we know if we couldn’t do that?


Page 27: Introduction to Spark ML Pipelines Workshop

And not just for getting data into doubles...

● Maybe a customer’s cat food preference only matters if the owns_cats boolean is true
● Maybe the scale is _way_ off
● Maybe we’ve got stop words
● Maybe we know one component has a non-linear relation (see the Bucketizer sketch below)
● etc.
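As one hedged example of that last point (the split values are made up), the Bucketizer we imported earlier can discretize a numeric column whose relationship to the label isn’t linear:

from pyspark.ml.feature import Bucketizer

# Hypothetical split points; tune these for your data
bucketizer = Bucketizer(splits=[float("-inf"), 18.0, 35.0, 65.0, float("inf")],
                        inputCol="age", outputCol="age-bucket")
bucketed = bucketizer.transform(df)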

Page 28: Introduction to Spark ML Pipelines Workshop

Cross-validation (because saving a test set is effort)

● Automagically* fit your model params
● Because thinking is effort
● org.apache.spark.ml.tuning has the tools
  ○ (not in Python yet so skipping for now, but see the sketch below)
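For reference, pyspark.ml.tuning did gain CrossValidator & ParamGridBuilder (around Spark 1.4); on a version that has them, a rough sketch reusing the names from earlier might look like:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Try a few tree depths; pick the best by (default) F1 via 3-fold CV
grid = ParamGridBuilder().addGrid(dt.maxDepth, [3, 5, 10]).build()
cv = CrossValidator(estimator=pipeline_and_model,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="category-index"),
                    numFolds=3)
cv_model = cv.fit(df)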


Page 30: Introduction to Spark ML Pipelines Workshop

Exercise 3: Train a new model type

● Your choice!
● If you want to do regression - change what we are predicting

Page 31: Introduction to Spark ML Pipelines Workshop

So serving...

● Generally refers to using your model online
  ○ Generating recommendations...
● In batch mode you can “just” save & use the Spark bits (sketch after this list)
● Spark’s “native” formats (often parquet w/ metadata)
  ○ Understood by Spark libraries and that’s pretty much it
  ○ If you are serving in the JVM you can load these, but you need Spark dependencies (albeit often not a Spark cluster)
● Some models support PMML export
  ○ https://github.com/jpmml/openscoring etc.
● We can also write our own export & serving by hand :(
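On the “just save & use the Spark bits” option, a hedged sketch: Spark 2.0+ lets Python pipeline models persist directly (1.6-era PySpark lacks this; the path below is hypothetical):

from pyspark.ml import PipelineModel

pipeline_model.save("models/income-dt")  # hypothetical path
reloaded = PipelineModel.load("models/income-dt")
reloaded.transform(df).take(5)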


Page 32: Introduction to Spark ML Pipelines Workshop

So what models are PMML exportable?

● Right now “old” style models
  ○ KMeans, LinearRegression, RidgeRegression, Lasso, SVM, and Binary LogisticRegression
  ○ However if we look in the code we can sometimes find converters to the old style models and use this to export our “new” style model
● Waiting on https://issues.apache.org/jira/browse/SPARK-11171 / https://github.com/apache/spark/pull/9207 for pipeline models
● Not getting in for 2.0 :(

Page 33: Introduction to Spark ML Pipelines Workshop

How to PMML export

toPMML
● returns a string or
● takes a path to local fs and saves results or
● takes a SparkContext & a distributed path and saves or
● takes a stream and writes result to stream

Page 34: Introduction to Spark ML Pipelines Workshop

Optional* exercise time

● Take a model you trained and save it to PMML
  ○ You will have to dig around in the Spark code to be able to do this
● Look at the file
● Load it into a serving system and try some predictions
● Note: PMML export currently only includes the model - not any transformations beforehand
● Also: you might need to train a new model
● If you don’t get it don’t worry - hints to follow :)