Spark MLlib and Viral Tweets

SPARK MLLIB AND VIRAL TWEETS

BY ASIM JALIS
GALVANIZE

WHO AM I?

Asim Jalis
Galvanize/Zipfian, Data Engineering Lead Instructor

Previously
Cloudera, Senior Technical Instructor
Salesforce, Microsoft, Hewlett-Packard Senior Software Engineer
MS in Computer Science from University of Virginia
https://www.linkedin.com/in/asimjalis

WHAT IS GALVANIZE’S DATA ENGINEERING PROGRAM?

12-week Data Engineering Immersive
Capstone Project

WHAT IS THIS TALK ABOUT?

How can we predict how viral a tweet will get using Spark and MLlib?
An example of how to use Spark’s Machine Learning libraries.

SPOT SURVEY

HOW MANY PEOPLE HERE ARE FAMILIAR WITH APACHE SPARK?

HOW MANY PEOPLE HERE ARE FAMILIAR WITH MLLIB?

HOW MANY PEOPLE HERE ARE FAMILIAR WITH MACHINE LEARNING?

HOW MANY PEOPLE HERE ARE FAMILIAR WITH RANDOM FORESTS?

WHAT IS SPARK?

Framework for dividing up data and processing it across a cluster in a fault-tolerant way.

WHY DOES SPARK EXIST?

Using Spark you can process datasets larger than what can fit on a single machine.
You can process the data in parallel.
The code is executed on the machine where the data is stored.

WHAT ARE THE ALTERNATIVES TO SPARK?

HADOOP MAPREDUCE

WHY USE SPARK INSTEAD OF MAPREDUCE?

Spark has a clean, elegant API.
Spark can keep the intermediate data in memory between stages.
This dramatically speeds up Machine Learning.

WHY DOES THIS SPEED UP MACHINE LEARNING?

Machine Learning algorithms are frequently iterative.
They have this flavor:

Start with a guess
Calculate the error
Tweak the model
Try again
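The loop above can be sketched in plain Scala. This is a minimal illustration, not code from the talk: the toy data, the one-parameter model y = w * x, and the names mse, gradient, and fit are all assumptions made for the example. Every iteration walks over the same data again, which is where keeping that data in memory pays off.

```scala
// Hypothetical toy data: y is exactly 3 * x.
val xs = Array(1.0, 2.0, 3.0, 4.0)
val ys = Array(3.0, 6.0, 9.0, 12.0)

// Calculate the error: mean squared error of the model y = w * x.
def mse(w: Double): Double =
  xs.zip(ys).map { case (x, y) => (w * x - y) * (w * x - y) }.sum / xs.length

// Direction in which the error grows as w grows (derivative of mse).
def gradient(w: Double): Double =
  xs.zip(ys).map { case (x, y) => 2 * (w * x - y) * x }.sum / xs.length

// Start with a guess, then tweak the model and try again, many times.
def fit(initialGuess: Double, learningRate: Double, iterations: Int): Double =
  (1 to iterations).foldLeft(initialGuess) { (w, _) =>
    w - learningRate * gradient(w)
  }

val w = fit(initialGuess = 0.0, learningRate = 0.01, iterations = 1000)
```

Calling mse(0.0) and then mse(w) shows the error before and after the loop has run.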

SPARK SHELL

WHAT IS THE SPARK SHELL?

Lets you interact with your data and code.
Provides a REPL (Read-Eval-Print Loop).
Great development environment.

WHY SPARK SHELL IS NEAT

Lets you include dependencies on the command line.
No pom.xml or build.sbt files required.

HOW DO I START SPARK WITH ALL DEPENDENCIES?

SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)

spark-shell --packages $SPARK_PKGS

WHAT ARE RDDS?

RDDs are Resilient Distributed Datasets.
These are the fundamental particles of Spark.

WHAT ARE RDDS INTUITIVELY?

Sequence of records.
Managed from the Driver.
Distributed across Executors.

WHAT WILL THIS OUTPUT?

sc.parallelize(Array(1,2,3,4)).
  map(x => x + 1).
  filter(x => x % 2 == 0).
  collect

Array(2,4)
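To trace the result step by step, the same map and filter can be run on a plain Scala Array, with no cluster involved. This is only a local sketch: the names input, incremented, and evens are made up for illustration, and on an RDD the same two functions would be shipped out to wherever the data lives.

```scala
// The same records, held in a local Array instead of an RDD.
val input = Array(1, 2, 3, 4)

// map adds 1 to every record.
val incremented = input.map(x => x + 1)

// filter keeps only the even records.
val evens = incremented.filter(x => x % 2 == 0)
```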

WHERE IS IT EXECUTING?

The Driver defines what is to be done.
The data operations happen on the Executors.
+ and % are executing on the Executors.

WHAT’S HAPPENING UNDER THE HOOD

WHAT IS MLLIB?

Library for Machine Learning.
Builds on top of Spark RDDs.
Provides RDDs for Machine Learning.
Implements common Machine Learning algorithms.

MACHINE LEARNING

WHAT ARE CLASSIFICATION AND REGRESSION?

Classification and Regression are forms of supervised learning.
You build a model using labeled features.
Then you test it with unlabeled features.

WHAT ARE FEATURES AND LABELS?

Features are input variables.
Labels are output variables.
These terms are useful when you use MLlib.

WHAT IS THE DIFFERENCE BETWEEN REGRESSION AND CLASSIFICATION?

In Regression the labels are continuous.
In Classification they are discrete.

WHAT ARE SOME EXAMPLES OF REGRESSION?

Predict sales of an item in a store based on:
Day of week
Weather forecast
Season
Population density

WHAT ARE SOME EXAMPLES OF REGRESSION?

Predict how many retweets a tweet will get.

PREDICTING RETWEETS

WHAT PROBLEM ARE WE TRYING TO SOLVE?

Predict how viral a tweet will get.

WHY IS THIS INTERESTING?

What features play the biggest role in this?
How well can we predict the virality of a tweet?
What is the best way to build a model for virality?

WHY RETWEETS?

Retweets feed into themselves.
Unlike favs, retweets can trigger a cascade of retweets.

WHAT FEATURES ARE WE GOING TO USE?

t.getUser.getFollowersCount
t.getMediaEntities.size
t.getUserMentionEntities.size
t.getHashtagEntities.size
t.getText.length

DO THESE FEATURES MULTIPLY OR ADD?

Suppose your followers double.
Is that going to add a constant number to your retweets?
Or will it multiply them by 2?

USING LOG FEATURES

Intuitively the features have a multiplicative effect.
So let's apply log to them.
This turns multiplication into addition.

WHY IS THIS A GOOD IDEA?

Regression works better with addition.
Using log lets us convert multiplication to addition.
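A small sketch of the log idea in plain Scala. The TweetFeatures case class and the logFeature and toLogVector helpers are hypothetical names for this example, not taken from the talk's code. Using log1p (that is, log(1 + x)) is also an assumption here: it keeps counts of zero, such as a tweet with no hashtags, from turning into negative infinity.

```scala
// log1p(x) = log(1 + x), so a count of 0 maps to 0.0 (assumption, see above).
def logFeature(count: Int): Double = java.lang.Math.log1p(count.toDouble)

// Hypothetical record holding the five raw features from the earlier slide.
case class TweetFeatures(
  followers: Int, media: Int, mentions: Int, hashtags: Int, textLength: Int)

// Turn the raw counts into log-scaled features.
def toLogVector(t: TweetFeatures): Array[Double] =
  Array(t.followers, t.media, t.mentions, t.hashtags, t.textLength).map(logFeature)

// Doubling is a multiplication in raw space but a constant shift in log
// space, no matter how many followers you started with.
val shiftSmall = math.log(200.0) - math.log(100.0)
val shiftLarge = math.log(200000.0) - math.log(100000.0)
```

Both shifts come out as log(2), which is the sense in which log turns "multiply by 2" into "add a constant".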

RANDOM FORESTS

WHAT ARE RANDOM FORESTS?

Random Forests are a Classification and Regression algorithm.

WHO INVENTED THEM?

Random Forests were invented by Leo Breiman and Adele Cutler.
At Berkeley in 2001.

LEO BREIMAN

ADELE CUTLER

WHAT IS THE BASIC IDEA?

Random Forests are an ensemble method.
A forest is made up of a collection of decision trees.
Each decision tree looks only at a subset of the features.
The final value is the average of the predictions of all the decision trees.
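The averaging step can be sketched in a few lines of plain Scala. The three hand-written "trees" below are hypothetical stand-ins (a real forest learns its trees from data), and forestPredict is a name invented for this example.

```scala
// A "tree" here is just a function from a feature vector to a prediction.
// Each hypothetical tree looks at only one of the two features.
val trees: Seq[Array[Double] => Double] = Seq(
  (features: Array[Double]) => if (features(0) > 5.0) 10.0 else 2.0,
  (features: Array[Double]) => if (features(1) > 1.0) 8.0 else 4.0,
  (features: Array[Double]) => if (features(0) > 8.0) 12.0 else 3.0
)

// The forest's prediction is the average of the individual tree predictions.
def forestPredict(trees: Seq[Array[Double] => Double],
                  features: Array[Double]): Double =
  trees.map(tree => tree(features)).sum / trees.size

// The three trees predict 10.0, 4.0 and 3.0 for this input.
val prediction = forestPredict(trees, Array(6.0, 0.0))
```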

WHAT ARE DECISION TREES?

Decision trees play 20 questions on your data.
At each point they ask which feature separates the labels best.
The model consists of these features and their cutpoints.
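One way to picture "separates the labels best" for a single feature is sketched below, assuming a regression setting where "best" means the smallest combined squared error of the two groups. The sse and bestCutpoint helpers and the sample points are hypothetical; MLlib's actual split search is more elaborate.

```scala
// Sum of squared deviations from the mean; 0.0 for an empty group.
def sse(labels: Seq[Double]): Double =
  if (labels.isEmpty) 0.0
  else {
    val mean = labels.sum / labels.size
    labels.map(y => (y - mean) * (y - mean)).sum
  }

// Try each observed feature value as a cutpoint and keep the one whose two
// groups ("<= cut" and "> cut") have the smallest combined squared error.
def bestCutpoint(points: Seq[(Double, Double)]): Double = {
  val candidates = points.map(_._1).distinct
  candidates.minBy { cut =>
    val (left, right) = points.partition { case (x, _) => x <= cut }
    sse(left.map(_._2)) + sse(right.map(_._2))
  }
}

// Hypothetical (feature, label) pairs: labels jump once the feature passes 3.
val points = Seq((1.0, 1.0), (2.0, 1.2), (3.0, 0.9), (4.0, 5.0), (5.0, 5.2))
val cut = bestCutpoint(points)
```

A full tree repeats this question for every feature, picks the winner, and then recurses into each of the two groups.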

WHAT ARE FEATURE IMPORTANCES?

Random Forests calculate feature importances.
The values show, in relative terms, how detrimental removing that feature would be to the error.

HOW CAN I PLAY WITH THE CODE?

GET TWITTER CREDENTIALS

Go to https://apps.twitter.com/
Click on Create New App.
For Website use your GitHub account.
Leave Callback URL blank.

SAVE THESE 4 TWITTER KEYS

Consumer Key (API Key)
Consumer Secret (API Secret)
Access Token
Access Token Secret

CREATE twitter4j.properties FILE

debug=true
http.prettyDebug=true
oauth.consumerKey=xxxx
oauth.consumerSecret=xxxx
oauth.accessToken=xxxx
oauth.accessTokenSecret=xxxx

START SPARK SHELL IN SAME DIRECTORY AS THIS FILE

SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)

spark-shell --packages $SPARK_PKGS

SPARK MLLIB CODE

Grab Scala code for Spark MLlib from GitHub:
https://gist.github.com/asimjalis/965bd44657b90aeab887
Paste into Spark Shell.

RESULTS

WHAT WERE THE RESULTS?

WHAT WAS THE ROOT MEAN SQUARE ERROR?

RMSE = 2.084
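For reference, this is how a root mean square error is computed, sketched in plain Scala. The rmse helper and the sample numbers are illustrative only and are not the talk's data: square each prediction error, average the squares, then take the square root.

```scala
// Root mean square error between predictions and the true labels.
def rmse(predictions: Seq[Double], labels: Seq[Double]): Double = {
  require(predictions.nonEmpty && predictions.size == labels.size,
    "predictions and labels must be non-empty and of equal length")
  val squaredErrors =
    predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
  math.sqrt(squaredErrors.sum / squaredErrors.size)
}

// Hypothetical example: the errors are 1, 0 and -2.
val example = rmse(Seq(2.0, 4.0, 6.0), Seq(1.0, 4.0, 8.0))
```

The result is in the same units as the labels, so it reads as a typical size of the prediction error.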

WHAT WERE THE FEATURE IMPORTANCES?

Feature                          Importance
t.getText.length                 0.340
t.getHashtagEntities.size        0.328
t.getUserMentionEntities.size    0.168
t.getUser.getFollowersCount      0.085
t.getMediaEntities.size          0.076

WHAT ARE NEXT STEPS?

Predict an unpublished tweet’s virality.
Analyze more features.
For example, the text of the tweet.
Analyze features of most viral tweets within a user’s timeline.

WANT MORE INFORMATION ON GALVANIZE’S DATA ENGINEERING IMMERSIVE?

Talk to me.
We are also hiring for faculty positions.

QUESTIONS
