Machine Learning - Spark Hadoop User Group Munich Meetup 2016


Transcript of Machine Learning - Spark Hadoop User Group Munich Meetup 2016

Page 1: Machine Learning - Spark Hadoop User Group Munich Meetup 2016


Machine Learning with Spark

Scikit Learn Cheat Sheet

Load basic dependencies

inputCsvDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/csv
ouputParquetDir: String = s3a://bigpicture-guild/nyctaxi/sample_1_month/parquet/

import java.util.Base64
import java.nio.charset.StandardCharsets
encB64: (str: String)String
decB64: (str: String)String

import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.conf.Configuration
import java.net.URI
import org.apache.hadoop.fs.FileStatus
listS3: (s3Path: String)Array[org.apache.hadoop.fs.FileStatus]
ls3: (s3FolderPath: String)Unit
rm3: (s3Path: String)Boolean

s3a://bigpicture-guild/nyctaxi/sample_1_month/csv/trip_data_and_fare.csv.gz [967.71 MiB]

Read taxi data as a DataFrame from Parquet

%run "/meetup/kickoff/connect_s3"

// read Parquet files
import org.apache.spark.sql.functions.udf
val parquetTable = sqlContext.read.parquet(ouputParquetDir)
val toDouble = udf[Double, Float](_.toDouble)
val taxiData = parquetTable.withColumn("tip_amount_d", toDouble(parquetTable.col("tip_amount")))
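For context: the Parquet files read above were presumably produced once from the gzipped CSV listed earlier. A minimal sketch of that conversion, assuming the Databricks spark-csv package (Spark 1.x era) is attached to the cluster:

// Assumed one-time conversion (not shown in the original notebook):
// read the gzipped CSV with spark-csv and persist it as Parquet.
val csvData = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line holds the column names
  .option("inferSchema", "true")  // derive column types by scanning the data
  .load(inputCsvDir)
csvData.write.parquet(ouputParquetDir)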



Page 2: Machine Learning - Spark Hadoop User Group Munich Meetup 2016


medallion                        hack_license                     vendor_id rate_code store_and_fwd_flag pickup_datetime
2D4B95E2FA7B2E85118EC5CA4570FA58 CD2F522EEE1FF5F5A8D8B679E23576B3 CMT       1         N                  2013-01-07T15:33:28.000+0000
0C5296F3C8B16E702F8F2E06F5106552 D2363240A9295EF570FC6069BC4F4C92 CMT       1         N                  2013-01-07T22:25:46.000+0000
312E0CB058D7FC1A6494EDB66D360CD2 7B5156F38990963332B33298C8BAE25E CMT       1         N                  2013-01-05T11:54:49.000+0000
DD98E2C3AF5C47B4449F720ECC5778D4 79807332B275653A2473554C7328500A CMT       1         N                  2013-01-02T06:58:08.000+0000
0B57B9633A2FECD3D3B1944AFC7471CF CCD4367B417ED6634D986F573A552A62 CMT       1         N                  2013-01-07T14:46:55.000+0000

Showing the first 1000 rows.

Scatter plot for tip amount and fare amount

[Scatter plot: tip_amount (y) over fare_amount (x); showing sample based on the first 1000 rows.]

Transformation of data with standard DataFrame operations

The pipeline concept of Spark ML


taxiData.registerTempTable("ml_nyc_taxi")

%sql SELECT * FROM ml_nyc_taxi

%sql SELECT tip_amount, fare_amount FROM ml_nyc_taxi WHERE tip_amount > 0 AND tip_amount < 10 AND fare_amount < 50

import org.apache.spark.mllib.linalg.{Vector, Vectors}
val toVec = udf[Vector, Int, Float] { (a, b) => Vectors.dense(a, b) }
val trainingData = taxiData
  .filter(toDouble(taxiData.col("tip_amount")) > 0.0)
  .withColumn("label", toDouble(taxiData.col("tip_amount")))
  .withColumn("features", toVec(taxiData.col("passenger_count"), taxiData.col("fare_amount")))
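Note: this manual udf-based assembly works, but the same label/features preparation is expressed more idiomatically below with SQLTransformer and VectorAssembler, which plug directly into a Pipeline.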

Page 3: Machine Learning - Spark Hadoop User Group Munich Meetup 2016


A Pipeline chains Transformers and Estimators. A Transformer can also be an estimator from a previously trained model. This is important for easily:

- training with different model parameters, e.g. for cross-validation (see the sketch below)
- training with different test and training data (train-validation split)
- repeating the transformation steps before estimation
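As a concrete illustration of the cross-validation point, here is a minimal sketch. It assumes the pipeline and linearRegressionEstimator defined further down in this notebook; CrossValidator then refits every pipeline stage for each parameter combination and fold:

import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Try two candidate values for each regularization parameter.
val paramGrid = new ParamGridBuilder()
  .addGrid(linearRegressionEstimator.regParam, Array(0.1, 0.3))
  .addGrid(linearRegressionEstimator.elasticNetParam, Array(0.0, 0.8))
  .build()

// 3-fold cross-validation over the whole pipeline; RegressionEvaluator
// scores each candidate model by RMSE (its default metric).
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new RegressionEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)
val cvModel = cv.fit(trainingTaxiData)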

Watch out for KeyStoneML (http://keystone-ml.org), an ML pipeline framework with a richer set of operators on Spark.

SQL transformer:

Select and filter the relevant data

VectorAssembler:

Transform the data into labeled data as needed for ML estimators

+------------------+----------+
|             label|  features|
+------------------+----------+
|1.2000000476837158| [1.0,5.5]|
| 4.199999809265137|[1.0,20.5]|
| 5.900000095367432|[1.0,29.0]|
| 5.380000114440918|[1.0,21.0]|
| 1.399999976158142| [6.0,6.5]|
|               1.0| [1.0,5.0]|
|              1.25| [1.0,4.5]|
|               3.0|[6.0,26.0]|
|               1.0|[1.0,14.5]|
|1.2999999523162842| [1.0,6.5]|
| 1.899999976158142| [5.0,9.5]|
|1.6200000047683716| [1.0,6.5]|
| 1.899999976158142| [1.0,9.0]|
|               2.0|[1.0,22.0]|
|               6.0|[1.0,25.0]|
|3.5999999046325684|[1.0,17.5]|
|1.2000000476837158| [1.0,6.0]|
|               7.5|[1.0,24.5]|

Initialize the estimator

import org.apache.spark.ml.feature.SQLTransformer
val taxiDataSelector = new SQLTransformer().setStatement(
  "SELECT tip_amount_d as label, passenger_count, fare_amount FROM ml_nyc_taxi WHERE tip_amount_d > 0")
val selectedTaxiData = taxiDataSelector.transform(taxiData)

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.mllib.linalg.Vectors

val trainingDataAssembler = new VectorAssembler()
  .setInputCols(Array("passenger_count", "fare_amount"))
  .setOutputCol("features")

val assembledTaxiData = trainingDataAssembler.transform(selectedTaxiData)
assembledTaxiData.select("label", "features").show()

Page 4: Machine Learning - Spark Hadoop User Group Munich Meetup 2016


LinearRegression parameters:
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
featuresCol: features column name (default: features)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
solver: the solver algorithm for optimization. If this is not set or empty, default value is 'auto'. (default: auto)
standardization: whether to standardize the training features before fitting the model (default: true)
tol: the convergence tolerance for iterative algorithms (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0. (default: )

import org.apache.spark.ml.regression.LinearRegression
linearRegressionEstimator: org.apache.spark.ml.regression.LinearRegression = linReg_54024ee673fd

Split the data into training and test set

Setup the transformation and estimation PIPELINE

Use the pipeline to train the model

Predict with the trained model on the test data

[Scatter plot comparing prediction and label on the test data; showing sample based on the first 1000 rows.]

How to get started with Spark ML

Set up your laptop (16+ GB RAM recommended)

import org.apache.spark.ml.regression.LinearRegression
// Create a LinearRegression instance. This instance is an Estimator.
val linearRegressionEstimator = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)
// Print out the parameters, documentation, and any default values.
println("LinearRegression parameters:\n" + linearRegressionEstimator.explainParams() + "\n")

val Array(trainingTaxiData, testTaxiData) = taxiData.randomSplit(Array(0.9, 0.1), seed = 12345)

import org.apache.spark.ml.{Pipeline, PipelineModel}
val pipeline = new Pipeline().setStages(Array(taxiDataSelector, trainingDataAssembler, linearRegressionEstimator))

// Learn a LinearRegression model using the whole pipeline.
// val lrModel = linearRegressionEstimator.fit(trainingData)
val lrModel = pipeline.fit(trainingTaxiData)

display(lrModel.transform(testTaxiData)
  .select("label", "prediction"))
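The scatter plot gives a visual impression only; a quantitative check is a one-liner. A small sketch (assuming the lrModel and testTaxiData from above): score the pipeline model on the held-out data with RegressionEvaluator.

import org.apache.spark.ml.evaluation.RegressionEvaluator

// Root-mean-square error of the predictions on the test split.
val predictions = lrModel.transform(testTaxiData)
val rmse = new RegressionEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("rmse")
  .evaluate(predictions)
println(s"RMSE on test data: $rmse")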

Page 5: Machine Learning - Spark Hadoop User Group Munich Meetup 2016


mac$ brew install apache-spark

or get a Databricks Community Edition notebook (wait list)
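Once installed, a quick smoke test (assuming spark-shell ends up on the PATH):

mac$ spark-shell
scala> sc.parallelize(1 to 1000).sum()
res0: Double = 500500.0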

Get data:
- Join an ML competition and get BIG data from Kaggle
- Analyze the Panama Papers: https://github.com/amaboura/panama-papers-dataset-2016

Visualize the data (Databricks or Zeppelin notebook: https://zeppelin.incubator.apache.org/)
Throw some algorithms at it!

- have a coffee
- and maybe read the docs: http://spark.apache.org/docs/latest/mllib-guide.html
- read the Kaggle competition forums and blog

Graphs from the Panama Papers
