Spark MLlib and Viral Tweets

  • SPARK MLLIB AND VIRAL TWEETS

    BY ASIM JALIS
    GALVANIZE

  • INTRO

  • WHO AM I?

  • Asim Jalis

    Galvanize/Zipfian, Data Engineering Lead Instructor
    Previously:
    Cloudera, Senior Technical Instructor
    Salesforce, Microsoft, Hewlett-Packard, Senior Software Engineer
    MS in Computer Science from University of Virginia
    https://www.linkedin.com/in/asimjalis

  • WHAT IS GALVANIZE'S DATA ENGINEERING PROGRAM?

    12-week Data Engineering Immersive
    Capstone Project

  • WHAT IS THIS TALK ABOUT?

  • How can we predict how viral a tweet will get using Spark and MLlib?
    An example of how to use Spark's Machine Learning libraries.

  • SPOT SURVEY

  • HOW MANY PEOPLE HERE ARE FAMILIAR WITH APACHE SPARK?

  • HOW MANY PEOPLE HERE ARE FAMILIAR WITH MLLIB?

  • HOW MANY PEOPLE HERE ARE FAMILIAR WITH MACHINE LEARNING?

  • HOW MANY PEOPLE HERE ARE FAMILIAR WITH RANDOM FORESTS?

  • SPARK

  • WHAT IS SPARK?

  • Framework for dividing up data and processing it across a cluster in a fault-tolerant way.

  • WHY DOES SPARK EXIST?

  • Using Spark you can process datasets larger than what can fit on a single machine.
    You can process the data in parallel.
    The code is executed on the machine where the data is stored.

  • WHAT ARE THE ALTERNATIVES TO SPARK?

  • HADOOP MAPREDUCE

  • WHY USE SPARK INSTEAD OF MAPREDUCE?

  • Spark has a clean, elegant API.
    Spark can keep the intermediate data in memory between stages.
    This dramatically speeds up Machine Learning.

  • WHY DOES THIS SPEED UP MACHINE LEARNING?

    Machine Learning algorithms are frequently iterative.
    They have this flavor:

    Start with a guess
    Calculate the error
    Tweak the model
    Try again
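
The guess/error/tweak/retry loop can be sketched in plain Scala as a tiny gradient descent. This is only an illustration: the toy data, the one-parameter model y = w * x, and the step size are assumptions, not anything from the talk. Note that every pass rereads the same data, which is where keeping it in memory pays off.

```scala
// Toy data: y is exactly 3 * x (values invented for illustration).
val xs = Array(1.0, 2.0, 3.0, 4.0)
val ys = Array(3.0, 6.0, 9.0, 12.0)

// Mean squared error of the model y = w * x.
def error(w: Double): Double =
  xs.zip(ys).map { case (x, y) => math.pow(w * x - y, 2) }.sum / xs.length

// Derivative of the mean squared error with respect to w.
def gradient(w: Double): Double =
  xs.zip(ys).map { case (x, y) => 2 * (w * x - y) * x }.sum / xs.length

var w = 0.0              // start with a guess
val learningRate = 0.01  // hypothetical step size

for (_ <- 1 to 200) {        // try again, 200 times
  val g = gradient(w)        // measure how the error changes around w
  w -= learningRate * g      // tweak the model
}

println(s"w = $w, error = ${error(w)}")
```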

  • SPARK SHELL

  • WHAT IS THE SPARK SHELL?

    Lets you interact with your data and code.
    Provides a REPL (Read-Eval-Print Loop).
    Great development environment.

  • WHY SPARK SHELL IS NEAT

    Lets you include dependencies on the command line.
    No pom.xml or build.sbt files required.

  • HOW DO I START SPARK WITH ALL DEPENDENCIES?

  • SPARK_PKGS=$(cat
  • RDDS

  • WHAT ARE RDDS?

    RDDs are Resilient Distributed Datasets.
    They are the fundamental particles of Spark.

  • WHAT ARE RDDS INTUITIVELY?

    Sequence of records.
    Managed from Driver.
    Distributed across executors.

  • WHAT WILL THIS OUTPUT?

    sc.parallelize(Array(1,2,3,4)).
      map(x => x + 1).
      filter(x => x % 2 == 0).
      collect

  • Array(2,4)

  • WHERE IS IT EXECUTING?

  • WHERE IS IT EXECUTING?

    The driver is defining what is to be done.
    The data operations happen on the Executors.
    + and % are executing on Executors.
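
The same pipeline can be split into steps to show which part is definition and which part is execution. This sketch assumes it is pasted into a running spark-shell, where `sc` is predefined; the intermediate names are made up for illustration.

```scala
// Assumes spark-shell, where `sc` (the SparkContext) already exists.
val numbers = sc.parallelize(Array(1, 2, 3, 4))  // driver describes a distributed dataset
val plusOne = numbers.map(x => x + 1)            // lazy: only recorded, nothing runs yet
val evens   = plusOne.filter(x => x % 2 == 0)    // lazy: still only a plan on the driver

// collect is an action: the executors apply + and % to their partitions
// and send the results back to the driver.
val result = evens.collect()
println(result.mkString("Array(", ", ", ")"))    // prints Array(2, 4)
```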

  • WHAT'S HAPPENING UNDER THE HOOD

  • MLLIB

  • WHAT IS MLLIB?

    Library for Machine Learning.
    Builds on top of Spark RDDs.
    Provides RDDs for Machine Learning.
    Implements common Machine Learning algorithms.

  • MACHINE LEARNING

  • WHAT ARE CLASSIFICATION AND REGRESSION?

    Classification and Regression are forms of supervised learning.
    You build a model using labeled features.
    Then you apply it to unlabeled features to predict their labels.

  • WHAT ARE FEATURES AND LABELS?

    Features are input variables.
    Labels are output variables.
    These terms are useful when you use MLlib.
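
In MLlib's RDD-based API, a label and its features travel together as a `LabeledPoint`. The sketch below assumes spark-shell (so the MLlib classes are on the classpath); the numbers describe a made-up tweet and are not data from the talk.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical tweet: 12 retweets (the label) and five invented feature values.
val example = LabeledPoint(12.0, Vectors.dense(1500.0, 1.0, 2.0, 3.0, 98.0))

println(example.label)     // the output variable
println(example.features)  // the input variables
```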

  • WHAT IS THE DIFFERENCE BETWEEN REGRESSION AND CLASSIFICATION?

    In Regression the labels are continuous.
    In Classification they are discrete.

  • WHAT ARE SOME EXAMPLES OF REGRESSION?

    Predict sales of an item in a store based on:
    Day of week
    Weather forecast
    Season
    Population density

  • WHAT ARE SOME EXAMPLES OF REGRESSION?

    Predict how many retweets a tweet will get.

  • PREDICTING RETWEETS

  • WHAT PROBLEM ARE WE TRYING TO SOLVE?

    Predict how viral a tweet will get.

  • WHY IS THIS INTERESTING?

    What features play the biggest role in this?
    How well can we predict the virality of a tweet?
    What is the best way to build a model for virality?

  • WHY RETWEETS?

    Retweets feed into themselves.
    Unlike favs, retweets can trigger a cascade of retweets.

  • WHAT FEATURES ARE WE GOING TO USE?

    t.getUser.getFollowersCount
    t.getMediaEntities.size
    t.getUserMentionEntities.size
    t.getHashtagEntities.size
    t.getText.length

  • DO THESE FEATURES MULTIPLY OR ADD?

    Suppose your followers double.
    Is that going to add a constant number to your retweets?
    Or will it multiply them by 2?

  • USING LOG FEATURES

    Intuitively the features have a multiplicative effect.
    So let's apply log to them.
    This turns multiplication into addition.

  • WHY IS THIS A GOOD IDEA?

    Regression works better with additive effects.
    Using log lets us convert multiplication to addition.
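
The effect of the log transform can be checked in a few lines of plain Scala. The `TweetCounts` holder and the choice of `log1p` (so that zero counts stay finite) are assumptions made for this sketch; the source only says that log is applied to the features.

```scala
// Hypothetical holder for the five raw counts used as features.
case class TweetCounts(followers: Int, media: Int, mentions: Int, hashtags: Int, textLength: Int)

// log1p(x) = log(1 + x); an assumption here, chosen so a count of 0 maps to 0.
def logFeatures(t: TweetCounts): Array[Double] =
  Array(t.followers, t.media, t.mentions, t.hashtags, t.textLength)
    .map(c => math.log1p(c.toDouble))

// Doubling a value multiplies it by 2, but on the log scale it only adds log(2).
val before = math.log(1000.0)
val after  = math.log(2000.0)
val shift  = after - before  // log(2), whatever the starting value was

println(logFeatures(TweetCounts(1500, 1, 2, 3, 98)).mkString(", "))
println(shift)
```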

  • RANDOM FORESTS

  • WHAT ARE RANDOM FORESTS?

    Random Forests are a Classification and Regression algorithm.
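
For regression, MLlib's RDD-based API exposes `RandomForest.trainRegressor`. The sketch below is not the code from the talk; it assumes spark-shell (with `sc` predefined), and the training points and parameter values are invented purely to show the shape of the call.

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// Assumes spark-shell, where `sc` is predefined. Toy data, invented for illustration.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(2.0, Vectors.dense(2.0, 1.0)),
  LabeledPoint(3.0, Vectors.dense(3.0, 0.0)),
  LabeledPoint(4.0, Vectors.dense(4.0, 1.0))
))

val model = RandomForest.trainRegressor(
  training,
  Map[Int, Int](),  // categoricalFeaturesInfo: empty, so all features are continuous
  10,               // numTrees
  "auto",           // featureSubsetStrategy
  "variance",       // impurity, the setting used for regression
  4,                // maxDepth
  32                // maxBins
)

// Predict the label for one unseen feature vector.
val predicted = model.predict(Vectors.dense(2.5, 1.0))
println(predicted)
```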

  • WHO INVENTED THEM?

    Random Forests were invented by Leo Breiman and Adele Cutler.
    At Berkeley in 2001.

  • LEO BREIMAN

  • ADELE CUTLER

  • WHAT IS THE BASIC IDEA?

    Random Forests are an ensemble method.
    A forest is made up of a collection of decision trees.
    Each decision tree looks only at a subset of the features.
    The final value is the average of all the decision trees' predictions.
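
The averaging step can be sketched in plain Scala. The three "trees" below are hand-written stand-ins rather than trained models, and their thresholds and outputs are invented; the point is only that each one reads a subset of the features and the forest returns the mean.

```scala
// A "tree" here is just a function from a feature vector to a prediction.
type Tree = Array[Double] => Double

// Hand-written stand-ins for trained trees (thresholds and outputs are made up).
val trees = Seq[Tree](
  f => if (f(0) > 5.0) 10.0 else 2.0,        // looks at feature 0 only
  f => if (f(1) > 1.0) 8.0 else 4.0,         // looks at feature 1 only
  f => if (f(0) + f(2) > 7.0) 9.0 else 3.0   // looks at features 0 and 2
)

// For regression, the forest's prediction is the mean of the tree predictions.
def forestPredict(features: Array[Double]): Double =
  trees.map(tree => tree(features)).sum / trees.size

val prediction = forestPredict(Array(6.0, 0.0, 2.0))  // (10 + 4 + 9) / 3
println(prediction)
```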

  • WHAT ARE DECISION TREES?

    Decision trees play 20 questions on your data.
    At each point they ask which feature separates the labels best.
    The model consists of these features and their cutpoints.
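
One "question" in that game is a search for the cutpoint that separates the labels best. The plain-Scala sketch below does this for a single feature, scoring each candidate split by the squared spread of the labels on either side; the data is invented, and a real tree would repeat the search over every feature and recurse into each side.

```scala
// Toy (feature value, label) pairs, invented for illustration.
val points = Seq((1.0, 2.0), (2.0, 2.5), (3.0, 2.0), (10.0, 9.0), (11.0, 9.5))

// Sum of squared deviations from the mean: how mixed a group of labels is.
def spread(labels: Seq[Double]): Double =
  if (labels.isEmpty) 0.0
  else {
    val mean = labels.sum / labels.size
    labels.map(l => math.pow(l - mean, 2)).sum
  }

// Cost of a cutpoint: total spread left over after splitting the points at it.
def splitCost(cut: Double): Double = {
  val (left, right) = points.partition { case (x, _) => x <= cut }
  spread(left.map(_._2)) + spread(right.map(_._2))
}

// Try each observed value as a cutpoint and keep the one with the lowest cost.
val bestCut = points.map(_._1).minBy(splitCost)
println(bestCut)  // prints 3.0
```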

  • WHAT ARE FEATURE IMPORTANCES?

    Random Forests calculate feature importances.
    The values show, in relative terms, how much removing each feature would hurt the error.

  • DEMO

  • HOW CAN I PLAY WITH THE CODE?

  • GET TWITTER CREDENTIALS

    Go to https://apps.twitter.com/
    Click on Create New App.
    For Website use your GitHub account.
    Leave Callback URL blank.

  • SAVE THESE 4 TWITTER KEYS

    Consumer Key (API Key)
    Consumer Secret (API Secret)
    Access Token
    Access Token Secret

  • CREATE twitter4j.properties FILE

    debug=true
    http.prettyDebug=true
    oauth.consumerKey=xxxx
    oauth.consumerSecret=xxxx
    oauth.accessToken=xxxx
    oauth.accessTokenSecret=xxxx

  • START SPARK SHELL IN SAME DIRECTORY AS THIS FILE

    SPARK_PKGS=$(cat

  • SPARK MLLIB CODE

    Grab Scala code for Spark MLlib from GitHub:
    https://gist.github.com/asimjalis/965bd44657b90aeab887
    Paste into Spark Shell.

  • RESULTS

  • WHAT WERE THE RESULTS?

  • WHAT WAS THE ROOT MEAN SQUARE ERROR?

  • RMSE = 2.084
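
For reference, root mean square error is the square root of the average squared difference between predictions and actual values. The plain-Scala sketch below shows the calculation on made-up numbers; the source does not say whether the reported value was computed on raw or log-scaled retweet counts, so this is only the general formula.

```scala
// Root mean square error between predicted and actual values.
def rmse(predicted: Seq[Double], actual: Seq[Double]): Double = {
  require(predicted.size == actual.size && predicted.nonEmpty,
    "need equal-length, non-empty inputs")
  val squaredErrors = predicted.zip(actual).map { case (p, a) => math.pow(p - a, 2) }
  math.sqrt(squaredErrors.sum / squaredErrors.size)
}

// Invented example: errors of 3 and 4 give sqrt((9 + 16) / 2).
val toyRmse = rmse(Seq(0.0, 0.0), Seq(3.0, 4.0))
println(toyRmse)
```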

  • WHAT WERE THE FEATURE IMPORTANCES?

  • Feature                          Importance
    t.getText.length                 0.340
    t.getHashtagEntities.size        0.328
    t.getUserMentionEntities.size    0.168
    t.getUser.getFollowersCount      0.085
    t.getMediaEntities.size          0.076

  • WHAT ARE NEXT STEPS?

    Predict an unpublished tweet's virality.
    Analyze more features, for example the text of the tweet.
    Analyze features of the most viral tweets within a user's timeline.

  • WANT MORE INFORMATION ON GALVANIZE'S DATA ENGINEERING IMMERSIVE?

  • Talk to me.
    We are also hiring for faculty positions.

  • QUESTIONS