Chicago Spark Meetup 03 01 2016 - Spark and Recommendations


  • date post

    13-Apr-2017
  • Category

    Software


Transcript of Chicago Spark Meetup 03 01 2016 - Spark and Recommendations

  • Power of data. Simplicity of design. Speed of innovation. IBM Spark, spark.tc

    Spark & Recommendations

    Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP

    Chicago Spark Users Meetup

    Thanks, Expedia/Orbitz and SpringCM!

    Mar 1st, 2016

    Chris Fregly, Principal Data Solutions Engineer

    We're Hiring! (Only Nice People) advancedspark.com

  • Who Am I?

    Streaming Data Engineer, Netflix OSS Committer

    Data Solutions Engineer

    Apache Contributor

    Principal Data Solutions Engineer, IBM Technology Center

    Meetup Organizer, Advanced Apache Meetup

    Book Author, Advanced .

    Due 2016

  • Recent World Tour: Freg-a-Palooza!

    London Spark Meetup (Oct 12th)

    Scotland Data Science Meetup (Oct 13th)

    Dublin Spark Meetup (Oct 15th)

    Barcelona Spark Meetup (Oct 20th)

    Madrid Big Data Meetup (Oct 22nd)

    Paris Spark Meetup (Oct 26th)

    Amsterdam Spark Summit (Oct 27th)

    Brussels Spark Meetup (Oct 30th)

    Zurich Big Data Meetup (Nov 2nd)

    Geneva Spark Meetup (Nov 5th)

    Oslo Big Data Hadoop Meetup (Nov 19th)

    Helsinki Spark Meetup (Nov 20th)

    Stockholm Spark Meetup (Nov 23rd)

    Copenhagen Spark Meetup (Nov 25th)

    Istanbul Spark Meetup (Nov 26th)

    Budapest Spark Meetup (Nov 28th)

    Singapore Spark Meetup (Dec 1st)

    Sydney Spark Meetup (Dec 8th)

    Melbourne Spark Meetup (Dec 9th)

    Toronto Spark Meetup (Dec 14th)

  • Advanced Apache Spark Meetup, http://advancedspark.com

    Meetup Metrics

    Top 5 Most-active Spark Meetup!

    2600+ Members in just 6 mos!

    2600+ Docker downloads (demos)

    Meetup Mission

    Deep-dive into Spark and related open source projects

    Surface key patterns and idioms

    Focus on distributed systems, scale, and performance

  • Live, Interactive Demo!

    Audience Participation Required (cell phone or laptop)

  • http://demo.advancedspark.com

    End User ->

    ElasticSearch ->

    Spark ML ->

    Data Scientist ->

  • Presentation Outline

    Scaling with Parallelism and Composability

    Similarity and Recommendations

    When to Approximate

    Common Approximation Algos and Data Structs

    Common Approximation Libraries and Tools

    Netflix Recommendations and Data Pipeline

  • Scaling with Parallelism

    Peter

    O(log n)

    O(log n)
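The slide gives only the O(log n) label. Assuming it refers to the depth of a pairwise (tree-shaped) reduction, a minimal sketch of that idea could look like the following. The function name `tree_reduce` is hypothetical and is not a Spark API; the loop simulates on one machine what a cluster would do in parallel at each level.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def tree_reduce(values: List[T], op: Callable[[T, T], T]) -> Tuple[T, int]:
    """Combine values pairwise, level by level; return (result, levels used)."""
    if not values:
        raise ValueError("tree_reduce needs at least one value")
    level = list(values)
    depth = 0
    while len(level) > 1:
        # Pairs on the same level are independent of each other,
        # so a cluster could combine all of them at the same time.
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2 == 1:
            nxt.append(level[-1])  # the odd element carries over unchanged
        level = nxt
        depth += 1
    return level[0], depth


total, levels = tree_reduce(list(range(1, 17)), lambda a, b: a + b)
print(total, levels)  # 136 4: sixteen values need only four levels
```

The number of levels grows with log2(n), which is only useful when the combining operation can be regrouped freely.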

  • Scaling with Composability

    Max: (a max b max c max d) == (a max b) max (c max d)

    Set Union: (a U b U c U d) == (a U b) U (c U d)

    Addition: (a + b + c + d) == (a + b) + (c + d)

    Multiply: (a * b * c * d) == (a * b) * (c * d)

    Division??
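The identities on this slide can be checked directly. The sketch below splits a small list into two hypothetical partitions and compares a left-to-right reduction with a per-partition reduction followed by a merge. The helper name `composes` and the sample data are made up for illustration.

```python
from functools import reduce
import operator


def composes(op, left, right):
    """True if reducing each partition first gives the same answer as one pass."""
    sequential = reduce(op, left + right)
    partitioned = op(reduce(op, left), reduce(op, right))
    return sequential == partitioned


numbers = [3, 4, 7, 8]
sets = [{1}, {2}, {1, 3}, {4}]

results = {
    "max": composes(max, numbers[:2], numbers[2:]),
    "addition": composes(operator.add, numbers[:2], numbers[2:]),
    "multiply": composes(operator.mul, numbers[:2], numbers[2:]),
    "set union": composes(operator.or_, sets[:2], sets[2:]),
}
print(results)  # every entry is True
```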

  • What about Division?

    Division: (a / b / c / d) != (a / b) / (c / d)

    (3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)

    (((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))

    0.0134 != 0.857

    What were the Egyptians thinking?! Not Composable

    Divide like an Egyptian
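A quick way to confirm that division does not regroup is to evaluate both sides with exact fractions. This is only an arithmetic check of the slide's example, using Python's standard `fractions` module.

```python
from fractions import Fraction
from functools import reduce
import operator

values = [Fraction(3), Fraction(4), Fraction(7), Fraction(8)]

# Left to right: ((3 / 4) / 7) / 8
sequential = reduce(operator.truediv, values)

# Regrouped as two partitions: (3 / 4) / (7 / 8)
partitioned = (values[0] / values[1]) / (values[2] / values[3])

print(sequential, round(float(sequential), 4))    # 3/224 0.0134
print(partitioned, round(float(partitioned), 3))  # 6/7 0.857
```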

  • What about Average?

    Overall AVG of [value, count] pairs

    AVG([3, 1], [5, 1], [5, 1], [7, 1]) == ((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5

    Add, Add, Add? Composable!

    Single Divide at the End? Doesn't need to be Composable!

    AVG (3, 5, 5, 7) == 5

    Pairwise AVG

    (3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5

    Divide, Add, Divide? Not Composable
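The [value, count] trick from this slide can be sketched as follows: carry a (sum, count) pair through the composable part and divide once at the end. The function names `to_pair`, `merge`, and `finish` are illustrative only.

```python
from functools import reduce


def to_pair(x):
    """Lift a raw value into a (sum, count) pair."""
    return (x, 1)


def merge(a, b):
    """Composable step: add sums and add counts."""
    return (a[0] + b[0], a[1] + b[1])


def finish(pair):
    """Single, non-composable divide at the very end."""
    return pair[0] / pair[1]


values = [3, 5, 5, 7]
left = reduce(merge, map(to_pair, values[:2]))   # (8, 2)
right = reduce(merge, map(to_pair, values[2:]))  # (12, 2)
overall = finish(merge(left, right))
print(left, right, overall)  # (8, 2) (12, 2) 5.0
```

Because `merge` is just addition on both fields, the partitions can be any size and merged in any grouping.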

  • Presentation Outline

    Scaling with Parallelism and Composability

    Similarity and Recommendations

    When to Approximate

    Common Approximation Algos and Data Structs

    Common Approximation Libraries and Tools

    Netflix Recommendations and Data Pipeline

  • Similarity

  • Euclidean Similarity

    Exists in Euclidean, flat space

    Based on Euclidean distance

    Linear measure

    Bias towards magnitude
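As a rough illustration of the magnitude bias, the sketch below computes Euclidean distance and turns it into a similarity with 1 / (1 + distance). That mapping is an assumption chosen for the example (the slide does not say how distance becomes similarity), and the two rating vectors are made up.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a, b):
    """Assumed mapping from distance to (0, 1]: identical vectors score 1.0."""
    return 1.0 / (1.0 + euclidean_distance(a, b))


light_user = [1, 2, 1]     # hypothetical activity counts
heavy_user = [10, 20, 10]  # same proportions, ten times the activity

# The two users have identical tastes, yet the score is low
# because the vectors differ in magnitude.
print(euclidean_similarity(light_user, heavy_user))
```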

  • Cosine Similarity

    Angular measure

    Adjusts for Euclidean magnitude bias

    Normalize to unit vectors
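A minimal sketch of the normalize-then-compare idea: scale both vectors to unit length, then take the dot product. The vectors are made-up examples.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Dot product of the unit vectors, i.e. the cosine of the angle between them."""
    ua, ub = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


# Same direction, different magnitude: the angle between them is zero.
print(cosine_similarity([1, 2, 1], [10, 20, 10]))
# Orthogonal vectors share nothing.
print(cosine_similarity([1, 0], [0, 1]))
```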

  • Jaccard Similarity

    Set similarity measurement

    Set intersection / set union

    Based on Jaccard distance

    Bias towards popularity
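The intersection-over-union definition translates almost directly into code. In this sketch the item names are placeholders, and treating two empty sets as fully similar is a convention assumed here, not something the slide states.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def jaccard_distance(a, b):
    """Jaccard distance is the complement of the similarity."""
    return 1.0 - jaccard_similarity(a, b)


user_1 = {"movie_a", "movie_b", "movie_c"}  # hypothetical watch histories
user_2 = {"movie_a", "movie_b", "movie_d"}

print(jaccard_similarity(user_1, user_2))  # 0.5 (2 shared out of 4 distinct)
print(jaccard_distance(user_1, user_2))    # 0.5
```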

  • Log Likelihood Similarity

    Adjusts for popularity bias

    Netflix Shawshank problem
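The slide does not give a formula. Assuming it means Dunning's log-likelihood ratio over a 2x2 co-occurrence table (the entropy-based formulation commonly used in recommenders), a sketch could look like this. The counts are invented: k11 = users with both items, k12 = first item only, k21 = second item only, k22 = neither.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy term used by the log-likelihood ratio."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """Higher means the co-occurrence is less likely to be chance."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))


# A very popular item co-occurs with almost everything by chance alone:
# 9 of 10 viewers of the niche item saw it, but so did 90 of all 100 users.
popular = log_likelihood_ratio(9, 1, 81, 9)

# Two items that are always seen together and never apart.
associated = log_likelihood_ratio(10, 0, 0, 10)

print(popular, associated)  # popular is ~0, associated is large
```

A raw co-occurrence count would rank the popular pair highly; the ratio scores it near zero because the overlap is what independence predicts.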

  • Word Similarity

    Based on edit distance

    Calculate char differences between words

    Deletes, transposes, replaces, inserts
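The four operations listed (deletes, transposes, replaces, inserts) match the restricted Damerau-Levenshtein distance, also called optimal string alignment. The slide does not name a specific algorithm, so the choice here is an assumption.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of deletes, inserts, replaces, and adjacent transposes."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete every char of a
    for j in range(cols):
        d[0][j] = j  # insert every char of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(edit_distance("spark", "sprak"))     # 1: one adjacent transposition
print(edit_distance("kitten", "sitting"))  # 3
```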

  • Document Similarity

    TF/IDF

    Term Freq / Inverse Document Freq

    Used by most search engines

    Word2Vec

    Words embedded in vector space nearby similars
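A small TF/IDF sketch using only the standard library. The weighting used here (term frequency times log(N / document frequency)) is one common variant and is an assumption; search engines and Spark's own feature transformers use their own variants. The documents are invented.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / doc_freq[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    "spark streaming pipeline",
    "spark machine learning pipeline",
    "movie recommendations",
]
vecs = tf_idf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # positive: shared terms
print(cosine(vecs[0], vecs[2]))  # 0.0: no shared terms
```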

  • Similarity Pathway

    i.e., closest recommendations between 2 people

  • Calculating Similarity

    Exact Brute-Force

    All-pairs similarity

    aka Pair-wise similarity, Similarity join

    Cartesian O(n^2) shuffle and comparison

    Approximate

    Sampling

    Bucketing (aka Partitioning, Clustering)

    Remove data with low probability of similarity

    Reduce shuffle and comparisons
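To make the trade-off concrete, the sketch below counts candidate pairs for the exact all-pairs approach and for a bucketed one. The bucketing key (first two letters of a word) is a deliberately crude stand-in chosen for this example; the function names and data are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(items):
    """Exact brute force: every item is compared with every other item."""
    return list(combinations(sorted(items), 2))


def bucketed_pairs(items, bucket_of):
    """Approximate: only items that land in the same bucket are compared."""
    buckets = defaultdict(list)
    for item in items:
        buckets[bucket_of(item)].append(item)
    pairs = []
    for members in buckets.values():
        pairs.extend(combinations(sorted(members), 2))
    return pairs


words = ["spark", "sparks", "spare", "graph", "grape", "netflix"]
exact = all_pairs(words)
approx = bucketed_pairs(words, bucket_of=lambda w: w[:2])

print(len(exact), len(approx))  # 15 4
```

The bucketed version never compares "spark" with "graph", which is the point: pairs unlikely to be similar are dropped before any comparison or shuffle happens, at the cost of missing similar pairs that fall into different buckets.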

  • Bonus: Document Summary

    Text Rank

    aka Sentence Rank

    TF/IDF + Similarity Graph + PageRank

    Intuition

    Surface summary sentences (abstract)

    Most similar to all others (TF/IDF + Similarity Graph)

    Most influential sentences (PageRank)
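A compact sketch of the sentence-ranking pipeline. To stay short it swaps the TF/IDF similarity for plain word overlap (Jaccard), which is a simplification and not what the slide specifies; the PageRank step is a basic weighted power iteration. The sentences are invented.

```python
def tokens(sentence):
    return set(sentence.lower().replace(".", "").split())


def similarity(a, b):
    """Word-overlap similarity (stand-in for TF/IDF cosine)."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def text_rank(sentences, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    w = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping
            * sum(scores[j] * w[j][i] / out_weight[j] for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    return scores


sentences = [
    "Spark makes large scale data processing fast.",
    "Spark streaming handles data processing in real time.",
    "Spark ML brings machine learning to large scale data.",
    "The meetup served pizza.",
]
scores = text_rank(sentences)
best = scores.index(max(scores))
print(sentences[best])
```

The sentence with no words in common with the others gets no incoming weight and ends up with the lowest score.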

  • Similarity Graph

    Vertex is movie, tag, actor, plot summary, etc.

    Edges are relationships and weights

  • Topic-Sensitive PageRank

    Graph diffusion algorithm

    Pre-process graph, add vector of probabilities to each vertex

    Probability of landing at this vertex from every other vertex
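One way to picture the diffusion is PageRank with a biased restart: the random walk teleports back to a chosen set of topic vertices instead of to a uniformly random vertex. The sketch below computes the vector for a single topic set, whereas the slide describes precomputing such probabilities for every vertex. The graph, the vertex names, and the function name are all hypothetical, and the topic vertices are assumed to be present in the graph.

```python
def personalized_pagerank(edges, topic_vertices, damping=0.85, iterations=100):
    """Power iteration where restarts land only on the topic vertices."""
    vertices = sorted({v for edge in edges for v in edge})
    out = {v: [] for v in vertices}
    for src, dst in edges:
        out[src].append(dst)
    teleport = {
        v: (1.0 / len(topic_vertices) if v in topic_vertices else 0.0) for v in vertices
    }
    rank = dict(teleport)
    for _ in range(iterations):
        nxt = {v: (1 - damping) * teleport[v] for v in vertices}
        for v in vertices:
            if out[v]:
                share = damping * rank[v] / len(out[v])
                for dst in out[v]:
                    nxt[dst] += share
            else:
                # Dangling vertex: hand its rank back to the topic set.
                for t in vertices:
                    nxt[t] += damping * rank[v] * teleport[t]
        rank = nxt
    return rank


# Hypothetical movie/tag graph with edges in both directions.
links = [("tag_scifi", "movie_a"), ("tag_scifi", "movie_b"),
         ("tag_drama", "movie_b"), ("tag_drama", "movie_c")]
edges = links + [(dst, src) for src, dst in links]

rank = personalized_pagerank(edges, topic_vertices={"tag_scifi"})
print(sorted(rank, key=rank.get, reverse=True))
```

Vertices close to the topic tag accumulate more probability than vertices reachable only through longer paths.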

  • Recommendations