SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Scalable Machine Learning, by Alton Alexander (@10altoids)

Description

An overview of Spark, SparkR, and scalable machine learning. A talk and discussion given to the Utah R users group at the University of Utah.

Transcript of SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Page 1: SparkR - Scalable machine learning - Utah R Users Group - U of U - June 17th

Scalable Machine Learning

Alton Alexander @10altoids

R

Page 2

Motivation to use Spark

• http://spark.apache.org/
– Speed
– Ease of use
– Generality
– Integrated with Hadoop
• Scalability

Page 3

Performance

Page 4
Page 6

Learn more about Spark

• http://spark.apache.org/documentation.html
– Great documentation (with video tutorials)
– June 2014 conference: http://spark-summit.org

• Keynote talk at STRATA 2014
– Use cases by Yahoo and other companies
– http://youtu.be/KspReT2JjeE
– Matei Zaharia, core developer, now at Databricks
– 30-minute talk in detail: http://youtu.be/nU6vO2EJAb4?t=20m42s

Page 7

Motivation to use R

• Great community
– R: the most powerful and most widely used statistical software
– https://www.youtube.com/watch?v=TR2bHSJ_eck

• Statistics
• Packages

– There’s an R package for that
– Roger Peng, Johns Hopkins
– https://www.youtube.com/watch?v=yhTerzNFLbo

• Plots

Page 8
Page 9
Page 10
Page 11

Example: Word Count

library(SparkR)
sc <- sparkR.init(master="local")

lines <- textFile(sc, "hdfs://my_text_file")
words <- flatMap(lines,
                 function(line) { strsplit(line, " ")[[1]] })
wordCount <- lapply(words, function(word) { list(word, 1L) })
counts <- reduceByKey(wordCount, "+", 2L)
output <- collect(counts)
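As a rough illustration, the same flatMap / map / reduceByKey pipeline can be traced in plain Python without a cluster. The input list here is a made-up stand-in for the HDFS text file in the slide:

```python
# Pure-Python sketch of the word-count pipeline from the SparkR example,
# so the flatMap / reduceByKey flow can be followed without Spark.
from collections import defaultdict

lines = ["to be or not to be", "to see or not to see"]

# flatMap: split each line into words, flattening into one list
words = [word for line in lines for word in line.split(" ")]

# lapply/map: pair each word with a count of 1
word_pairs = [(word, 1) for word in words]

# reduceByKey with "+": sum the counts for each distinct word
counts = defaultdict(int)
for word, n in word_pairs:
    counts[word] += n

print(dict(counts))
```

Spark performs the reduceByKey step in parallel across partitions; the loop above is the single-machine equivalent of that aggregation.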

Page 12
Page 13
Page 14
Page 15
Page 16

Learn more about SparkR

• GitHub repository
– https://github.com/amplab-extras/SparkR-pkg
– How to install
– Examples

• An old but still good talk introducing SparkR
– http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a
– Shows the MNIST demo

Page 17

Backup Slides

Page 18

Hands on Exercises

• http://spark-summit.org/2013/exercises/index.html
– Walk through the tutorial
– Set up a cluster on EC2
– Data exploration
– Stream processing with Spark Streaming
– Machine learning

Page 19

Local box

• Start with a micro dev box using the latest public build on Amazon EC2
– spark.ami.pvm.v9 - ami-5bb18832

• Or start by just installing it on your laptop
– wget http://d3kbcqa49mib13.cloudfront.net/spark-0.9.1-bin-hadoop1.tgz

• Add AWS keys as environment variables
– AWS_ACCESS_KEY_ID=
– AWS_SECRET_ACCESS_KEY=

Page 20

Run the examples

• Load pyspark and work interactively
– /root/spark-0.9.1-bin-hadoop1/bin/pyspark
– >>> help(sc)

• Estimate pi
– ./bin/pyspark python/examples/pi.py local[4] 20
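The pi.py example estimates pi by Monte Carlo sampling; Spark spreads the samples across workers. A minimal single-machine sketch of the same logic, with a fixed seed so the run is reproducible:

```python
# Pure-Python sketch of Spark's pi.py example: sample random points in the
# unit square and count the fraction that land inside the unit circle.
import random

random.seed(42)  # fixed seed so the estimate is reproducible
n = 100_000

# A point (x, y) is inside the quarter circle when x^2 + y^2 <= 1
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1.0)

# The fraction inside approximates pi/4
pi_estimate = 4.0 * inside / n
print(pi_estimate)
```

In the Spark version, the `local[4] 20` arguments control the number of local worker threads and sampling partitions; the math is the same.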

Page 21

Start Cluster

• Configure the cluster and start it
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 -k spark-key -i ~/spark-key.pem -s 1 launch spark-test-cluster

• Log onto the master
– spark-0.9.1-bin-hadoop1/ec2/spark-ec2 login spark-test-cluster

Page 22

Ganglia: The Cluster Dashboard

Page 23

Run these Demos

• http://spark.apache.org/docs/latest/mllib-guide.html
– Talks about each of the algorithms
– Gives some demos in Scala
– More demos in Python

Page 24

Clustering

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("data/kmeans_data.txt")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))
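The WSSSE evaluation can be traced without a cluster. Below is a pure-Python sketch of the same error computation, with hard-coded 2-D points and centers standing in for parsedData and the trained KMeans model:

```python
# Pure-Python sketch of the WSSSE step: for each point, find the nearest
# cluster center and accumulate the distance to it, mirroring the
# clusters.predict + error logic from the MLlib example.
from math import sqrt

# Hypothetical data: two tight groups near (0, 0) and (9, 9)
points = [(0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.2)]
centers = [(0.05, 0.05), (9.05, 9.1)]

def error(point):
    # Euclidean distance from the point to its nearest center
    return min(sqrt(sum((x - c) ** 2 for x, c in zip(point, center)))
               for center in centers)

# The map + reduce(add) from the slide, written as a plain sum
wssse = sum(error(p) for p in points)
print("Within Set Sum of Squared Error = " + str(wssse))
```

A lower value means points sit closer to their assigned centers, which is why the metric is used to compare clusterings with different k.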

Page 25

Python Code

• http://spark.incubator.apache.org/docs/latest/api/pyspark/index.html

• Python API for Spark

• Package Mllib– Classification– Clustering– Recommendation– Regression

Page 26

Clustering Skullcandy Followers

from pyspark.mllib.clustering import KMeans
from numpy import array
from math import sqrt

# Load and parse the data
data = sc.textFile("../skullcandy.csv")
parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))

# Build the model (cluster the data)
clusters = KMeans.train(parsedData, 2, maxIterations=10, runs=30, initializationMode="random")

# Evaluate clustering by computing Within Set Sum of Squared Errors
def error(point):
    center = clusters.centers[clusters.predict(point)]
    return sqrt(sum([x**2 for x in (point - center)]))

WSSSE = parsedData.map(lambda point: error(point)).reduce(lambda x, y: x + y)
print("Within Set Sum of Squared Error = " + str(WSSSE))

Page 27

Clustering Skullcandy Followers

Page 28

Clustering Skullcandy Followers

Page 29

Apply model to all followers

• predictions = parsedData.map(lambda follower: clusters.predict(follower))

• Save this out for visualization
– predictions.saveAsTextFile("predictions.csv")

Page 30

Predicted Groups

Page 31

Skullcandy Dashboard

Page 32

Backup

Page 33

• Upgrade to Python 2.7
• https://spark-project.atlassian.net/browse/SPARK-922

Page 34

Correlation Matrix
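The original slide showed the matrix as an image. As a sketch of how such a matrix can be computed, here is a plain-Python Pearson correlation over a few hypothetical numeric columns (the actual follower features are not in the transcript):

```python
# Pure-Python sketch of a pairwise Pearson correlation matrix over a few
# made-up columns; values range from -1 (inverse) to +1 (perfectly aligned).
from math import sqrt

def pearson(xs, ys):
    # Standard Pearson formula: covariance over the product of std devs
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical follower features for illustration only
columns = {
    "followers": [10.0, 20.0, 30.0, 40.0],
    "posts":     [1.0, 2.0, 3.0, 4.0],
    "age":       [40.0, 30.0, 20.0, 10.0],
}

# Correlate every column against every other column
matrix = {a: {b: pearson(va, vb) for b, vb in columns.items()}
          for a, va in columns.items()}
print(matrix["followers"]["posts"], matrix["followers"]["age"])
```

In practice one would compute this with R's cor() or NumPy's corrcoef over the real follower data; the loop above just makes the definition explicit.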