Spark MLlib and Viral Tweets
SPARK MLLIB AND VIRAL TWEETS
BY ASIM JALIS
GALVANIZE
INTRO
WHO AM I?
Asim Jalis
Galvanize/Zipfian, Data Engineering Lead Instructor
Previously:
Cloudera, Senior Technical Instructor
Salesforce, Microsoft, Hewlett-Packard, Senior Software Engineer
MS in Computer Science from University of Virginia
https://www.linkedin.com/in/asimjalis
WHAT IS GALVANIZE’S DATA ENGINEERING PROGRAM?
12-week Data Engineering Immersive
Capstone Project
WHAT IS THIS TALK ABOUT?
How can we predict how viral a tweet will get using Spark and MLlib?
An example of how to use Spark’s Machine Learning libraries.
SPOT SURVEY
HOW MANY PEOPLE HERE ARE FAMILIAR WITH APACHE SPARK?
HOW MANY PEOPLE HERE ARE FAMILIAR WITH MLLIB?
HOW MANY PEOPLE HERE ARE FAMILIAR WITH MACHINE LEARNING?
HOW MANY PEOPLE HERE ARE FAMILIAR WITH RANDOM FORESTS?
SPARK
WHAT IS SPARK?
Framework for dividing up data and processing it across a cluster in a fault-tolerant way.
WHY DOES SPARK EXIST?
Using Spark you can process datasets larger than what can fit on a single machine.
You can process the data in parallel.
The code is executed on the machine where the data is stored.
WHAT ARE THE ALTERNATIVES TO SPARK?
HADOOP MAPREDUCE
WHY USE SPARK INSTEAD OF MAPREDUCE?
Spark has a clean, elegant API.
Spark can keep the intermediate data in memory between stages.
This dramatically speeds up Machine Learning.
WHY DOES THIS SPEED UP MACHINE LEARNING?
Machine Learning algorithms are frequently iterative.
They have this flavor:
Start with a guess
Calculate the error
Tweak the model
Try again
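The guess, error, tweak, repeat loop can be sketched in plain Scala, with no Spark required. This is a minimal illustration, not code from the talk: it fits a single weight w in y = w * x by gradient descent, and the data, learning rate, and iteration count are all assumptions chosen for the example. Every pass reads the same data again, which is why keeping that data in memory between passes pays off.

```scala
// Toy data generated from y = 3x; the loop should recover a weight near 3.
val xs = Array(1.0, 2.0, 3.0, 4.0)
val ys = xs.map(x => 3.0 * x)

// Mean squared error of the model y = w * x.
def error(w: Double): Double = {
  xs.zip(ys).map { case (x, y) => math.pow(w * x - y, 2) }.sum / xs.length
}

// Gradient of the mean squared error with respect to w.
def gradient(w: Double): Double = {
  xs.zip(ys).map { case (x, y) => 2.0 * (w * x - y) * x }.sum / xs.length
}

var w = 0.0              // start with a guess
val learningRate = 0.01  // assumed step size
for (i <- 1 to 200) {
  val g = gradient(w)    // measure how the error changes with w
  w -= learningRate * g  // tweak the model, then try again
}
println(f"w = $w%.4f, error = ${error(w)}%.6f")
```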
SPARK SHELL
WHAT IS THE SPARK SHELL?
Lets you interact with your data and code.
Provides a REPL (Read-Eval-Print Loop).
Great development environment.
WHY SPARK SHELL IS NEAT
Lets you include dependencies on the command line.
No pom.xml or build.sbt files required.
HOW DO I START SPARK WITH ALL DEPENDENCIES?
SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)
spark-shell --packages $SPARK_PKGS
RDDS
WHAT ARE RDDS?
RDDs are Resilient Distributed Datasets.
These are the fundamental particles of Spark.
WHAT ARE RDDS INTUITIVELY?
Sequence of records.
Managed from Driver.
Distributed across executors.
WHAT WILL THIS OUTPUT?
sc.parallelize(Array(1,2,3,4)).
  map(x => x + 1).
  filter(x => x % 2 == 0).
  collect
Array(2, 4)
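To trace how that answer comes about, the same two steps can be run on a local Scala array. This is only a sketch for following the intermediate values: it uses ordinary Scala collections rather than an RDD, so nothing is distributed, and the variable names are made up for the example.

```scala
val input = Array(1, 2, 3, 4)

// map adds 1 to every element: 2, 3, 4, 5
val incremented = input.map(x => x + 1)

// filter keeps only the even elements: 2, 4
val evens = incremented.filter(x => x % 2 == 0)

println(evens.mkString("Array(", ", ", ")")) // prints Array(2, 4)
```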
WHERE IS IT EXECUTING?
The driver defines what is to be done.
The data operations happen on the Executors.
+ and % are executing on Executors.
WHAT’S HAPPENING UNDER THE HOOD
MLLIB
WHAT IS MLLIB?
Library for Machine Learning.
Builds on top of Spark RDDs.
Provides RDDs for Machine Learning.
Implements common Machine Learning algorithms.
MACHINE LEARNING
WHAT ARE CLASSIFICATION AND REGRESSION?
Classification and Regression are forms of supervised learning.
You build a model from features that come with labels.
Then you use it to predict labels for unlabeled features.
WHAT ARE FEATURES AND LABELS?
Features are input variables.
Labels are output variables.
These terms are useful when you use MLlib.
WHAT IS THE DIFFERENCE BETWEEN REGRESSION AND CLASSIFICATION?
In Regression the labels are continuous.
In Classification they are discrete.
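A small sketch of how features and labels pair up, and how the two kinds of label differ. The LabeledExample case class is a hypothetical stand-in written for this example; MLlib has its own LabeledPoint class for the same purpose. All numbers, and the threshold for calling a tweet viral, are made up.

```scala
// Hypothetical stand-in for a labeled example: one label plus a feature vector.
case class LabeledExample(label: Double, features: Array[Double])

// Made-up feature values describing one tweet.
val features = Array(1200.0, 1.0, 2.0, 3.0, 118.0)

// Regression: the label is a continuous quantity, e.g. a retweet count.
val regressionExample = LabeledExample(label = 42.0, features = features)

// Classification: the label is a discrete class, e.g. 1.0 = viral, 0.0 = not viral.
val viralThreshold = 10.0 // assumed cutoff, for illustration only
val classificationExample = LabeledExample(
  label = if (regressionExample.label >= viralThreshold) 1.0 else 0.0,
  features = features)
```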
WHAT ARE SOME EXAMPLES OF REGRESSION?
Predict sales of an item in a store based on:
Day of week
Weather forecast
Season
Population density
WHAT ARE SOME EXAMPLES OF REGRESSION?
Predict how many retweets a tweet will get.
PREDICTING RETWEETS
WHAT PROBLEM ARE WE TRYING TO SOLVE?
Predict how viral a tweet will get.
WHY IS THIS INTERESTING?
What features play the biggest role in this?
How well can we predict the virality of a tweet?
What is the best way to build a model for virality?
WHY RETWEETS?
Retweets feed into themselves.
Unlike favs, retweets can trigger a cascade of retweets.
WHAT FEATURES ARE WE GOING TO USE?
t.getUser.getFollowersCount
t.getMediaEntities.size
t.getUserMentionEntities.size
t.getHashtagEntities.size
t.getText.length
DO THESE FEATURES MULTIPLY OR ADD?
Suppose your followers double.
Is that going to add a constant number to your retweets?
Or will it multiply them by 2?
USING LOG FEATURES
Intuitively the features have a multiplicative effect.
So let’s apply log to them.
This turns multiplication into addition.
WHY IS THIS A GOOD IDEA?
Regression works better with additive effects.
Using log lets us convert multiplication into addition.
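A short numeric check of the log argument, in plain Scala. It relies on the identity log(a * b) = log(a) + log(b): doubling the follower count shifts the log feature by the constant log(2), whatever the starting count. The use of log1p (log of 1 + x) to keep zero counts finite is an assumption of this sketch and not necessarily what the talk's code does; the feature values are made up.

```scala
// Doubling followers adds a constant in log space.
val followers = 500.0
val logFollowers = math.log(followers)
val logDoubled = math.log(2 * followers)
val shift = logDoubled - logFollowers // equals log(2) for any follower count

// log1p(x) = log(1 + x) maps a count of 0 to 0.0 instead of negative infinity.
def logFeature(count: Int): Double = math.log1p(count.toDouble)

val raw = Array(1200, 0, 2, 3, 118) // made-up feature values
val logged = raw.map(logFeature)
println(logged.mkString(", "))
```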
RANDOM FORESTS
WHAT ARE RANDOM FORESTS?
Random Forests are a Classification and Regression algorithm.
WHO INVENTED THEM?
Random Forests were invented by Leo Breiman and Adele Cutler.
At Berkeley in 2001.
LEO BREIMAN
ADELE CUTLER
WHAT IS THE BASIC IDEA?
Random Forests are an ensemble method.
A forest is made up of a collection of decision trees.
Each decision tree looks only at a subset of features.
For regression, the final value is the average of all the trees’ predictions.
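The averaging step can be sketched in a few lines of plain Scala. This is an illustration only, not how MLlib builds or stores its forests: each "tree" is a hand-written one-question function with hypothetical cutpoints, and each looks at only one of the two features.

```scala
// A "tree" here is any function from a feature vector to a predicted value.
val trees: Seq[Array[Double] => Double] = Seq(
  f => if (f(0) > 7.0) 3.0 else 1.0, // splits on feature 0
  f => if (f(1) > 0.5) 2.5 else 1.5, // splits on feature 1
  f => if (f(0) > 9.0) 4.0 else 2.0  // splits on feature 0 at another cutpoint
)

// Regression forest: the prediction is the mean of the individual tree predictions.
def forestPredict(features: Array[Double]): Double = {
  trees.map(tree => tree(features)).sum / trees.size
}

// The three trees predict 3.0, 2.5 and 2.0, so the forest predicts 2.5.
val prediction = forestPredict(Array(8.0, 1.0))
```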
WHAT ARE DECISION TREES?
Decision trees play 20 questions on your data.
At each point they ask which feature separates the labels best.
The model consists of these features and their cutpoints.
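One way to read "separates the labels best" for a single numeric feature: try each candidate cutpoint and keep the one that leaves the labels on each side most alike. The sketch below scores a cutpoint by the summed squared deviation of the labels on each side, a common choice for regression trees; it is a simplified illustration with made-up data, not the MLlib implementation, and it assumes at least two distinct feature values.

```scala
// Sum of squared deviations from the mean: lower means the labels are more alike.
def sse(labels: Seq[Double]): Double = {
  if (labels.isEmpty) 0.0
  else {
    val mean = labels.sum / labels.size
    labels.map(y => (y - mean) * (y - mean)).sum
  }
}

// Try each midpoint between sorted feature values; keep the cutpoint with the lowest total error.
def bestCutpoint(points: Seq[(Double, Double)]): Double = {
  val sortedXs = points.map(_._1).distinct.sorted
  val candidates = sortedXs.zip(sortedXs.tail).map { case (a, b) => (a + b) / 2 }
  candidates.minBy { cut =>
    val (left, right) = points.partition(_._1 <= cut)
    sse(left.map(_._2)) + sse(right.map(_._2))
  }
}

// Made-up (feature, label) pairs: the labels jump between x = 3 and x = 10.
val points = Seq((1.0, 0.1), (2.0, 0.2), (3.0, 0.1), (10.0, 2.0), (11.0, 2.1))
val cut = bestCutpoint(points) // the midpoint between 3 and 10
```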
WHAT ARE FEATURE IMPORTANCES?
Random Forests calculate feature importances.
The values show, relative to each other, how much removing that feature would hurt the error.
DEMO
HOW CAN I PLAY WITH THE CODE?
GET TWITTER CREDENTIALS
Go to https://apps.twitter.com/
Click on Create New App.
For Website use your GitHub account.
Leave Callback URL blank.
SAVE THESE 4 TWITTER KEYS
Consumer Key (API Key)
Consumer Secret (API Secret)
Access Token
Access Token Secret
CREATE TWITTER4J.PROPERTIES FILE
debug=true
http.prettyDebug=true
oauth.consumerKey=xxxx
oauth.consumerSecret=xxxx
oauth.accessToken=xxxx
oauth.accessTokenSecret=xxxx
START SPARK SHELL IN SAME DIRECTORY AS THIS FILE
SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)
spark-shell --packages $SPARK_PKGS
SPARK MLLIB CODE
Grab the Scala code for Spark MLlib from GitHub: https://gist.github.com/asimjalis/965bd44657b90aeab887
Paste it into Spark Shell.
RESULTS
WHAT WERE THE RESULTS?
WHAT WAS THE ROOT MEAN SQUARE ERROR?
RMSE = 2.084
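For reference, root mean square error is the square root of the average squared difference between predictions and actual labels. The sketch below computes it in plain Scala on made-up numbers; it does not reproduce the figure above, and the function and variable names are chosen for this example.

```scala
// Root mean square error between predictions and actual labels.
def rmse(predictions: Seq[Double], labels: Seq[Double]): Double = {
  require(predictions.size == labels.size && predictions.nonEmpty,
    "need equal-length, non-empty inputs")
  val squaredErrors = predictions.zip(labels).map { case (p, y) => (p - y) * (p - y) }
  math.sqrt(squaredErrors.sum / squaredErrors.size)
}

// Made-up values, unrelated to the talk's data.
val predicted = Seq(1.0, 2.0, 3.0)
val actual = Seq(1.0, 4.0, 3.0)
val rmseValue = rmse(predicted, actual) // sqrt((0 + 4 + 0) / 3)
```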
WHAT WERE THE FEATURE IMPORTANCES?
Feature                          Importance
t.getText.length                 0.340
t.getHashtagEntities.size        0.328
t.getUserMentionEntities.size    0.168
t.getUser.getFollowersCount      0.085
t.getMediaEntities.size          0.076
WHAT ARE NEXT STEPS?
Predict an unpublished tweet’s virality.
Analyze more features.
For example, the text of the tweet.
Analyze features of most viral tweets within a user’s timeline.
WANT MORE INFORMATION ON GALVANIZE’S DATA ENGINEERING IMMERSIVE?
Talk to me.
We are also hiring for faculty positions.
QUESTIONS