Spark MLlib and Viral Tweets

SPARK MLLIB AND VIRAL TWEETS BY ASIM JALIS GALVANIZE

Transcript of Spark MLlib and Viral Tweets

Page 1: Spark MLlib and Viral Tweets

SPARK MLLIB AND VIRAL TWEETS

BY ASIM JALIS
GALVANIZE

Page 2: Spark MLlib and Viral Tweets

INTRO

Page 3: Spark MLlib and Viral Tweets

WHO AM I?

Page 4: Spark MLlib and Viral Tweets

Asim Jalis
Galvanize/Zipfian, Data Engineering Lead Instructor

Previously:
Cloudera, Senior Technical Instructor
Salesforce, Microsoft, Hewlett-Packard, Senior Software Engineer
MS in Computer Science from University of Virginia
https://www.linkedin.com/in/asimjalis

Page 5: Spark MLlib and Viral Tweets

WHAT IS GALVANIZE’S DATA ENGINEERING PROGRAM?

12-week Data Engineering Immersive
Capstone Project

Page 6: Spark MLlib and Viral Tweets
Page 7: Spark MLlib and Viral Tweets

WHAT IS THIS TALK ABOUT?

Page 8: Spark MLlib and Viral Tweets

How can we predict how viral a tweet will get using Spark and MLlib?
An example of how to use Spark’s Machine Learning libraries.

Page 9: Spark MLlib and Viral Tweets

SPOT SURVEY

Page 10: Spark MLlib and Viral Tweets

HOW MANY PEOPLE HERE ARE FAMILIAR WITH APACHE SPARK?

Page 11: Spark MLlib and Viral Tweets

HOW MANY PEOPLE HERE ARE FAMILIAR WITH MLLIB?

Page 12: Spark MLlib and Viral Tweets

HOW MANY PEOPLE HERE ARE FAMILIAR WITH MACHINE LEARNING?

Page 13: Spark MLlib and Viral Tweets

HOW MANY PEOPLE HERE ARE FAMILIAR WITH RANDOM FORESTS?

Page 14: Spark MLlib and Viral Tweets

SPARK

Page 15: Spark MLlib and Viral Tweets

WHAT IS SPARK?

Page 16: Spark MLlib and Viral Tweets
Page 17: Spark MLlib and Viral Tweets

Framework for dividing up data and processing it across a cluster in a fault-tolerant way

Page 18: Spark MLlib and Viral Tweets
Page 19: Spark MLlib and Viral Tweets

WHY DOES SPARK EXIST?

Page 20: Spark MLlib and Viral Tweets

Using Spark you can process datasets larger than what can fit on a single machine.
You can process the data in parallel.
The code is executed on the machine where the data is stored.

Page 21: Spark MLlib and Viral Tweets

WHAT ARE THE ALTERNATIVES TO SPARK?

Page 22: Spark MLlib and Viral Tweets

HADOOP MAPREDUCE

Page 23: Spark MLlib and Viral Tweets

WHY USE SPARK INSTEAD OF MAPREDUCE?

Page 24: Spark MLlib and Viral Tweets

Spark has a clean, elegant API.
Spark can keep the intermediate data in memory between stages.
This dramatically speeds up Machine Learning.

Page 25: Spark MLlib and Viral Tweets

WHY DOES THIS SPEED UP MACHINE LEARNING?

Machine Learning algorithms are frequently iterative.
They have this flavor:

Start with a guess
Calculate the error
Tweak the model
Try again
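The loop above can be sketched in plain Scala. This is a minimal illustration, not code from the talk: it fits a one-parameter model y = w * x by gradient descent, and the names IterativeFit, error, gradient, and fit are made up for this example. Every pass reads the same data again, which is why keeping that data in memory helps.

```scala
object IterativeFit {
  // Mean squared error of the model y = w * x on the given points.
  def error(w: Double, xs: Seq[Double], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) => math.pow(w * x - y, 2) }.sum / xs.size

  // Gradient of the mean squared error with respect to w.
  def gradient(w: Double, xs: Seq[Double], ys: Seq[Double]): Double =
    xs.zip(ys).map { case (x, y) => 2 * (w * x - y) * x }.sum / xs.size

  // Start with a guess, then repeatedly use the error's gradient to tweak the model.
  def fit(xs: Seq[Double], ys: Seq[Double], steps: Int, rate: Double): Double = {
    var w = 0.0                          // start with a guess
    for (_ <- 1 to steps) {
      w -= rate * gradient(w, xs, ys)    // tweak the model, then try again
    }
    w
  }
}
```

For points that lie on y = 2x, such as xs = Seq(1.0, 2.0, 3.0) and ys = Seq(2.0, 4.0, 6.0), calling IterativeFit.fit(xs, ys, 100, 0.05) moves w from the initial guess toward 2.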

Page 26: Spark MLlib and Viral Tweets

SPARK SHELL

Page 27: Spark MLlib and Viral Tweets

WHAT IS THE SPARK SHELL?

Lets you interact with your data and code.
Provides a REPL (Read-Eval-Print Loop).
Great development environment.

Page 28: Spark MLlib and Viral Tweets

WHY SPARK SHELL IS NEAT

Lets you include dependencies on the command line.
No pom.xml or build.sbt files required.

Page 29: Spark MLlib and Viral Tweets

HOW DO I START SPARK WITH ALL DEPENDENCIES?

Page 30: Spark MLlib and Viral Tweets

SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)

spark-shell --packages $SPARK_PKGS

Page 31: Spark MLlib and Viral Tweets

RDDS

Page 32: Spark MLlib and Viral Tweets

WHAT ARE RDDS?

RDDs are Resilient Distributed Datasets.
They are the fundamental particles of Spark.

Page 33: Spark MLlib and Viral Tweets

WHAT ARE RDDS INTUITIVELY?

Sequence of records.
Managed from the Driver.
Distributed across Executors.

Page 34: Spark MLlib and Viral Tweets

WHAT WILL THIS OUTPUT?

sc.parallelize(Array(1,2,3,4)).
  map(x => x + 1).
  filter(x => x % 2 == 0).
  collect

Page 35: Spark MLlib and Viral Tweets

Array(2,4)
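To see where that answer comes from, the same map and filter steps can be traced on a local Scala Array, which has the same method names as an RDD. This is only a local stand-in for illustration (the object name LocalPipeline is made up); nothing here is distributed.

```scala
object LocalPipeline {
  val input: Array[Int] = Array(1, 2, 3, 4)

  // map adds 1 to every element: 2, 3, 4, 5
  val incremented: Array[Int] = input.map(x => x + 1)

  // filter keeps only the even elements: 2, 4
  val evens: Array[Int] = incremented.filter(x => x % 2 == 0)
}
```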

Page 36: Spark MLlib and Viral Tweets

WHERE IS IT EXECUTING?

Page 37: Spark MLlib and Viral Tweets

WHERE IS IT EXECUTING?

The Driver is defining what is to be done.
The data operations happen on the Executors.
+ and % are executing on the Executors.

Page 38: Spark MLlib and Viral Tweets

WHAT’S HAPPENING UNDER THE HOOD

Page 39: Spark MLlib and Viral Tweets
Page 40: Spark MLlib and Viral Tweets
Page 41: Spark MLlib and Viral Tweets

MLLIB

Page 42: Spark MLlib and Viral Tweets

WHAT IS MLLIB?

Library for Machine Learning.
Builds on top of Spark RDDs.
Provides RDDs for Machine Learning.
Implements common Machine Learning algorithms.

Page 43: Spark MLlib and Viral Tweets
Page 44: Spark MLlib and Viral Tweets

MACHINE LEARNING

Page 45: Spark MLlib and Viral Tweets

WHAT ARE CLASSIFICATION AND REGRESSION?

Classification and Regression are forms of supervised learning.
You build a model using labeled features.
Then you test it with unlabeled features.

Page 46: Spark MLlib and Viral Tweets

WHAT ARE FEATURES AND LABELS?

Features are input variables.
Labels are output variables.
These terms are useful when you use MLlib.

Page 47: Spark MLlib and Viral Tweets

WHAT IS THE DIFFERENCE BETWEEN REGRESSION AND CLASSIFICATION?

In Regression the labels are continuous.
In Classification they are discrete.

Page 48: Spark MLlib and Viral Tweets

WHAT ARE SOME EXAMPLES OF REGRESSION?

Predict sales of an item in a store based on:
Day of week
Weather forecast
Season
Population density

Page 49: Spark MLlib and Viral Tweets

WHAT ARE SOME EXAMPLES OF REGRESSION?

Predict how many retweets a tweet will get.

Page 50: Spark MLlib and Viral Tweets

PREDICTING RETWEETS

Page 51: Spark MLlib and Viral Tweets

WHAT PROBLEM ARE WE TRYING TO SOLVE?

Predict how viral a tweet will get.

Page 52: Spark MLlib and Viral Tweets

WHY IS THIS INTERESTING?

What features play the biggest role in this?
How well can we predict the virality of a tweet?
What is the best way to build a model for virality?

Page 53: Spark MLlib and Viral Tweets

WHY RETWEETS?

Retweets feed into themselves.
Unlike favs, retweets can trigger a cascade of retweets.

Page 54: Spark MLlib and Viral Tweets

WHAT FEATURES ARE WE GOING TO USE?

t.getUser.getFollowersCount
t.getMediaEntities.size
t.getUserMentionEntities.size
t.getHashtagEntities.size
t.getText.length
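As a rough sketch of how these five values could be packed into one numeric feature vector, the block below uses a hypothetical case class Tweet in place of the twitter4j status object t, so that it runs without any dependencies. The field names and the features function are assumptions made for this example; the talk's actual code is in the gist linked later in the deck.

```scala
object TweetFeatures {
  // Hypothetical stand-in for the fields read from the twitter4j object `t`.
  final case class Tweet(
    followers: Int,     // t.getUser.getFollowersCount
    mediaCount: Int,    // t.getMediaEntities.size
    mentionCount: Int,  // t.getUserMentionEntities.size
    hashtagCount: Int,  // t.getHashtagEntities.size
    text: String        // t.getText
  )

  // One Double per feature, in the same order as the list above.
  def features(t: Tweet): Array[Double] = Array(
    t.followers.toDouble,
    t.mediaCount.toDouble,
    t.mentionCount.toDouble,
    t.hashtagCount.toDouble,
    t.text.length.toDouble
  )
}
```

An array like this is the kind of input that could be wrapped in an MLlib vector and paired with the retweet count as the label.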

Page 55: Spark MLlib and Viral Tweets

DO THESE FEATURES MULTIPLY OR ADD?

Suppose your followers double.
Is that going to add a constant number to your retweets?
Or will it multiply them by 2?

Page 56: Spark MLlib and Viral Tweets

USING LOG FEATURES

Intuitively the features have a multiplicative effect.
So let's apply log to them.
This turns multiplication into addition.

Page 57: Spark MLlib and Viral Tweets

WHY IS THIS A GOOD IDEA?

Regression works better with addition.
Using log lets us convert multiplication to addition.
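The identity behind this is log(a * b) = log(a) + log(b), so doubling a feature adds the constant log(2) to its log. A small sketch follows. Using log1p, which computes log(1 + x), is an assumption of this example and not something the slides state; it is used here because counts such as hashtags are often zero and log(0) is negative infinity.

```scala
object LogFeatures {
  // log1p(x) = log(1 + x), so a count of zero maps to 0.0 instead of -Infinity.
  def logFeature(x: Double): Double = math.log1p(x)

  // Apply the transform to every entry of a feature vector.
  def logFeatures(xs: Array[Double]): Array[Double] = xs.map(logFeature)

  // How much the plain log of a value grows when the value doubles: always log(2).
  def doublingShift(x: Double): Double = math.log(2 * x) - math.log(x)
}
```

For example, LogFeatures.doublingShift(500.0) and LogFeatures.doublingShift(50000.0) are both log(2), up to floating-point rounding, regardless of the starting follower count.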

Page 58: Spark MLlib and Viral Tweets

RANDOM FORESTS

Page 59: Spark MLlib and Viral Tweets

WHAT ARE RANDOM FORESTS?

Random Forests are a Classification and Regression algorithm.

Page 60: Spark MLlib and Viral Tweets

WHO INVENTED THEM?

Random Forests were invented by Leo Breiman and Adele Cutler.
At Berkeley in 2001.

Page 61: Spark MLlib and Viral Tweets

LEO BREIMAN

Page 62: Spark MLlib and Viral Tweets

ADELE CUTLER

Page 63: Spark MLlib and Viral Tweets

WHAT IS THE BASIC IDEA?

Random Forests are an ensemble method.
A forest is made up of a collection of decision trees.
Each decision tree looks only at a subset of the features.
For regression, the final value is the average of the predictions of all the decision trees.
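The averaging step can be sketched in a few lines of Scala. This is a toy illustration under simplifying assumptions: each "tree" is a hand-written one-split function over a single feature, and no training happens. The names TinyForest, Tree, stump, and predict are made up for this example and are not MLlib APIs.

```scala
object TinyForest {
  // A "tree" here is any function from a feature vector to a predicted value.
  type Tree = Array[Double] => Double

  // A one-split tree (a stump) that looks at a single feature only.
  def stump(feature: Int, cutpoint: Double, below: Double, above: Double): Tree =
    xs => if (xs(feature) <= cutpoint) below else above

  // The forest's prediction is the average of its trees' predictions.
  def predict(trees: Seq[Tree], xs: Array[Double]): Double =
    trees.map(tree => tree(xs)).sum / trees.size
}
```

With trees = Seq(stump(0, 5.0, 1.0, 3.0), stump(1, 2.0, 2.0, 6.0)) and the input Array(10.0, 1.0), the first tree predicts 3.0, the second predicts 2.0, and the forest returns their average.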

Page 64: Spark MLlib and Viral Tweets

WHAT ARE DECISION TREES?

Decision trees play 20 questions on your data.
At each point they ask which feature separates the labels best.
The model consists of these features and their cutpoints.
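One way to make "separates the labels best" concrete for regression is to try every feature and every observed value as a cutpoint, and keep the pair whose two groups have the lowest total squared error around their own means. The sketch below does exactly that for a single split. It is a brute-force illustration with made-up names (BestSplit, sse, bestSplit), not how MLlib implements it.

```scala
object BestSplit {
  // Sum of squared deviations from the mean; 0.0 for an empty group.
  def sse(ys: Seq[Double]): Double =
    if (ys.isEmpty) 0.0
    else {
      val mean = ys.sum / ys.size
      ys.map(y => (y - mean) * (y - mean)).sum
    }

  // rows holds (features, label) pairs. Returns the (feature index, cutpoint)
  // whose split "feature <= cutpoint" gives the lowest total squared error.
  def bestSplit(rows: Seq[(Array[Double], Double)]): (Int, Double) = {
    val numFeatures = rows.head._1.length
    val candidates = for {
      f   <- 0 until numFeatures
      cut <- rows.map(_._1(f)).distinct
    } yield {
      val (left, right) = rows.partition(_._1(f) <= cut)
      (f, cut, sse(left.map(_._2)) + sse(right.map(_._2)))
    }
    val (f, cut, _) = candidates.minBy(_._3)
    (f, cut)
  }
}
```

A full tree would apply the same search again inside each of the two groups until some stopping rule is reached.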

Page 65: Spark MLlib and Viral Tweets

WHAT ARE FEATURE IMPORTANCES?

Random Forests calculate feature importances.
The values show, in relative terms, how much worse the error would get if that feature were removed.

Page 66: Spark MLlib and Viral Tweets

DEMO

Page 67: Spark MLlib and Viral Tweets

HOW CAN I PLAY WITH THE CODE?

Page 68: Spark MLlib and Viral Tweets

GET TWITTER CREDENTIALS

Go to https://apps.twitter.com/
Click on Create New App.
For Website use your GitHub account.
Leave Callback URL blank.

Page 69: Spark MLlib and Viral Tweets

SAVE THESE 4 TWITTER KEYS

Consumer Key (API Key)
Consumer Secret (API Secret)
Access Token
Access Token Secret

Page 70: Spark MLlib and Viral Tweets

CREATE TWITTER4J.PROPERTIES FILE

debug=true
http.prettyDebug=true
oauth.consumerKey=xxxx
oauth.consumerSecret=xxxx
oauth.accessToken=xxxx
oauth.accessTokenSecret=xxxx

Page 71: Spark MLlib and Viral Tweets

START SPARK SHELL IN SAME DIRECTORY AS THIS FILE

SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
org.apache.hadoop:hadoop-aws:2.7.1
com.amazonaws:aws-java-sdk-s3:1.10.30
com.databricks:spark-csv_2.10:1.3.0
com.databricks:spark-avro_2.10:2.0.1
org.apache.spark:spark-streaming-twitter_2.10:1.5.2
org.twitter4j:twitter4j-core:4.0.4
END
)

spark-shell --packages $SPARK_PKGS

Page 72: Spark MLlib and Viral Tweets

SPARK MLLIB CODE

Grab Scala code for Spark MLlib from GitHub:
https://gist.github.com/asimjalis/965bd44657b90aeab887
Paste into Spark Shell.

Page 73: Spark MLlib and Viral Tweets

RESULTS

Page 74: Spark MLlib and Viral Tweets

WHAT WERE THE RESULTS?

Page 75: Spark MLlib and Viral Tweets

WHAT WAS THE ROOT MEAN SQUARE ERROR?

Page 76: Spark MLlib and Viral Tweets

RMSE = 2.084
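For reference, the root mean square error is the square root of the average squared difference between predicted and actual labels. A minimal sketch of that formula in plain Scala follows. The object and function names are made up for this example, and it does not reproduce the number above, which comes from the talk's own data.

```scala
object Rmse {
  // Square root of the mean of the squared prediction errors.
  def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
    require(predicted.nonEmpty && predicted.size == actual.size,
      "need two non-empty sequences of equal length")
    val squaredErrors = predicted.zip(actual).map { case (p, a) => (p - a) * (p - a) }
    math.sqrt(squaredErrors.sum / squaredErrors.size)
  }
}
```

Because the errors are squared before averaging, a few badly mispredicted tweets raise the RMSE more than many small misses do.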

Page 77: Spark MLlib and Viral Tweets

WHAT WERE THE FEATURE IMPORTANCES?

Page 78: Spark MLlib and Viral Tweets

Feature                          Importance
t.getText.length                 0.340
t.getHashtagEntities.size        0.328
t.getUserMentionEntities.size    0.168
t.getUser.getFollowersCount      0.085
t.getMediaEntities.size          0.076

Page 79: Spark MLlib and Viral Tweets

WHAT ARE NEXT STEPS?

Predict an unpublished tweet’s virality.
Analyze more features. For example, the text of the tweet.
Analyze features of most viral tweets within a user’s timeline.

Page 80: Spark MLlib and Viral Tweets

WANT MORE INFORMATION ON GALVANIZE’S DATA ENGINEERING IMMERSIVE?

Page 81: Spark MLlib and Viral Tweets

Talk to me.
We are also hiring for faculty positions.

Page 82: Spark MLlib and Viral Tweets

QUESTIONS