SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Scalable Machine Learning, by Alton Alexander (@10altoids)

Description

An overview of Spark, SparkR, and scalable machine learning. A talk and discussion given to the Utah R users group at the University of Utah.

Transcript of SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Page 1: SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Scalable Machine Learning

Alton Alexander @10altoids

R

Page 2

Motivation to use Spark

• http://spark.apache.org/
– Speed
– Ease of use
– Generality
– Integrated with Hadoop
• Scalability

Page 3

Performance

Page 4
Page 6

Learn more about Spark

• http://spark.apache.org/documentation.html
– Great documentation (with video tutorials)
– June 2014 conference: http://spark-summit.org

• Keynote talk at STRATA 2014
– Use cases by Yahoo and other companies
– http://youtu.be/KspReT2JjeE
– Matei Zaharia, core developer, now at Databricks
– 30-minute talk in detail: http://youtu.be/nU6vO2EJAb4?t=20m42s

Page 7

Motivation to use R

• Great community
– R: the most powerful and most widely used statistical software
– https://www.youtube.com/watch?v=TR2bHSJ_eck

• Statistics
• Packages

– There’s an R package for that
– Roger Peng, Johns Hopkins
– https://www.youtube.com/watch?v=yhTerzNFLbo

• Plots

Page 8
Page 9
Page 10
Page 11

Example: Word Count

library(SparkR)
sc <- sparkR.init(master="local")

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
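As a rough illustration, the same flatMap / map / reduceByKey pipeline can be traced in plain Python without a cluster. The input list here is a made-up stand-in for the HDFS text file in the slide:

```python
# Pure-Python sketch of the word-count pipeline from the SparkR example,
# so the flatMap / reduceByKey flow can be followed without Spark.
from collections import defaultdict

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words, flattening into one list
words = [word for line in lines for word in line.split(" ")]

# lapply/map: pair each word with a count of 1
word_pairs = [(word, 1) for word in words]

# reduceByKey with "+": sum the counts for each distinct word
counts = defaultdict(int)
for word, n in word_pairs:
    counts[word] += n

print(dict(counts))
```

Spark performs the reduceByKey step in parallel across partitions; the loop above is the single-machine equivalent of that aggregation.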

Page 12
Page 13
Page 14
Page 15
Page 16

Learn more about SparkR

• GitHub repository
– https://github.com/amplab-extras/SparkR-pkg
– How to install
– Examples

• An old but still good talk introducing SparkR
– http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
– Shows the MNIST demo

Page 17

Backup Slides

Page 18

Hands on Exercises

• http://spark-summit.org/2013/exercises/index.html
– Walk through the tutorial
– Set up a cluster on EC2
– Data exploration
– Stream processing with Spark Streaming
– Machine learning

Page 19

Local box

• Start with a micro dev box using the latest public build on Amazon EC2
– spark.ami.pvm.v9 - ami-5bb18832

• Or start by just installing it on your laptop
– wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz

• Add AWS keys as environment variables
– AWS_ACCESS_KEY_ID=
– AWS_SECRET_ACCESS_KEY=

Page 20

Run the examples

• Load pyspark and work interactively
– /root/spark-0.9.1-bin-hadoop1/bin/pyspark
– >>> help(sc)

• Estimate pi
– ./bin/pyspark python/examples/pi.py local[4] 20
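The pi.py example estimates pi by Monte Carlo sampling; Spark spreads the samples across workers. A minimal single-machine sketch of the same logic, with a fixed seed so the run is reproducible:

```python
# Pure-Python sketch of Spark's pi.py example: sample random points in the
# unit square and count the fraction that land inside the unit circle.
import random

random.seed(42)  # fixed seed so the estimate is reproducible
n = 100_000

# A point (x, y) is inside the quarter circle when x^2 + y^2 <= 1
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)

# The fraction inside approximates pi/4
pi_estimate = 4.0 * inside / n
print(pi_estimate)
```

In the Spark version, the `local[4] 20` arguments control the number of local worker threads and sampling partitions; the math is the same.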

Page 21

Start Cluster

• Configure the cluster and start it
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster

• Log onto the master
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 login spark-test-cluster

Page 22

Ganglia: The Cluster Dashboard

Page 23

Run these Demos

• http://spark.apache.org/docs/latest/mllib-guide.html
– Talks about each of the algorithms
– Gives some demos in Scala
– More demos in Python

Page 24

Clustering

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
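The WSSSE evaluation can be traced without a cluster. Below is a pure-Python sketch of the same error computation, with hard-coded 2-D points and centers standing in for parsedData and the trained KMeans model:

```python
# Pure-Python sketch of the WSSSE step: for each point, find the nearest
# cluster center and accumulate the distance to it, mirroring the
# clusters.predict + error logic from the MLlib example.
from math import sqrt

# Hypothetical data: two tight groups near (0, 0) and (9, 9)
points = [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.2)]
centers = [(0.05, 0.05), (9.05, 9.1)]

def error(point):
    # Euclidean distance from the point to its nearest center
    return min(sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))
               for center in centers)

# The map + reduce(add) from the slide, written as a plain sum
wssse = sum(error(p) for p in points)
print("Within Set Sum of Squared Error = " + str(wssse))
```

A lower value means points sit closer to their assigned centers, which is why the metric is used to compare clusterings with different k.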

Page 25

Python Code

• http://spark.incubator.apache.org/docs/latest/api/pyspark/index.html

• Python API for Spark

• Package Mllib– Classification– Clustering– Recommendation– Regression

Page 26

Clustering Skullcandy Followers

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("../skullcandy.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

Page 27

Clustering Skullcandy Followers

Page 28

Clustering Skullcandy Followers

Page 29

Apply model to all followers

• predictions = parsedData.map(lambda follower: clusters.predict(follower))

• Save this out for visualization
– predictions.saveAsTextFile("predictions.csv")

Page 30

Predicted Groups

Page 31

Skullcandy Dashboard

Page 32

Backup

Page 33

• Upgrade to Python 2.7
• https://spark-project.atlassian.net/browse/SPARK-922

Page 34

Correlation Matrix
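The original slide showed the matrix as an image. As a sketch of how such a matrix can be computed, here is a plain-Python Pearson correlation over a few hypothetical numeric columns (the actual follower features are not in the transcript):

```python
# Pure-Python sketch of a pairwise Pearson correlation matrix over a few
# made-up columns; values range from -1 (inverse) to +1 (perfectly aligned).
from math import sqrt

def pearson(xs, ys):
    # Standard Pearson formula: covariance over the product of std devs
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical follower features for illustration only
columns = {
    "followers": [10.0, 20.0, 30.0, 40.0],
    "posts":     [1.0, 2.0, 3.0, 4.0],
    "age":       [40.0, 30.0, 20.0, 10.0],
}

# Correlate every column against every other column
matrix = {a: {b: pearson(va, vb) for b, vb in columns.items()}
          for a, va in columns.items()}
print(matrix["followers"]["posts"], matrix["followers"]["age"])
```

In practice one would compute this with R's cor() or NumPy's corrcoef over the real follower data; the loop above just makes the definition explicit.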