# Chicago Spark Meetup 03 01 2016 - Spark and Recommendations

Date posted: 13-Apr-2017

Category: Software


### Transcript of Chicago Spark Meetup 03 01 2016 - Spark and Recommendations

Power of data. Simplicity of design. Speed of innovation. IBM Spark, spark.tc

Spark & Recommendations

Spark, Streaming, Machine Learning, Graph Processing, Approximations, Probabilistic Data Structures, NLP

Chicago Spark Users Meetup

Thanks, Expedia/Orbitz and SpringCM!

Mar 1st, 2016

Chris Fregly, Principal Data Solutions Engineer

We're Hiring! (Only Nice People)

advancedspark.com


Who Am I?

- Streaming Data Engineer, Netflix OSS Committer
- Data Solutions Engineer, Apache Contributor
- Principal Data Solutions Engineer, IBM Technology Center
- Meetup Organizer, Advanced Apache Spark Meetup
- Book Author, Advanced . (Due 2016)


Recent World Tour: Freg-a-Palooza!

- London Spark Meetup (Oct 12th)
- Scotland Data Science Meetup (Oct 13th)
- Dublin Spark Meetup (Oct 15th)
- Barcelona Spark Meetup (Oct 20th)
- Madrid Big Data Meetup (Oct 22nd)
- Paris Spark Meetup (Oct 26th)
- Amsterdam Spark Summit (Oct 27th)
- Brussels Spark Meetup (Oct 30th)
- Zurich Big Data Meetup (Nov 2nd)
- Geneva Spark Meetup (Nov 5th)
- Oslo Big Data Hadoop Meetup (Nov 19th)
- Helsinki Spark Meetup (Nov 20th)
- Stockholm Spark Meetup (Nov 23rd)
- Copenhagen Spark Meetup (Nov 25th)
- Istanbul Spark Meetup (Nov 26th)
- Budapest Spark Meetup (Nov 28th)
- Singapore Spark Meetup (Dec 1st)
- Sydney Spark Meetup (Dec 8th)
- Melbourne Spark Meetup (Dec 9th)
- Toronto Spark Meetup (Dec 14th)


Advanced Apache Spark Meetup

http://advancedspark.com

Meetup Metrics

- Top 5 Most-active Spark Meetup!
- 2600+ Members in just 6 mos!!
- 2600+ Docker downloads (demos)

Meetup Mission

- Deep-dive into Spark and related open source projects
- Surface key patterns and idioms
- Focus on distributed systems, scale, and performance


Live, Interactive Demo!!

Audience Participation Required (cell phone or laptop)

http://demo.advancedspark.com

End User -> ElasticSearch -> Spark ML -> Data Scientist ->

Presentation Outline

- Scaling with Parallelism and Composability
- Similarity and Recommendations
- When to Approximate
- Common Approximation Algos and Data Structs
- Common Approximation Libraries and Tools
- Netflix Recommendations and Data Pipeline

Scaling with Parallelism

[Figure labels: Peter, O(log n)]

Scaling with Composability

- Max: (a max b max c max d) == (a max b) max (c max d)
- Set Union: (a U b U c U d) == (a U b) U (c U d)
- Addition: (a + b + c + d) == (a + b) + (c + d)
- Multiply: (a * b * c * d) == (a * b) * (c * d)
- Division??
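The composability property on this slide can be sketched in plain Python: reduce each chunk on its own, then reduce the partial results, and check that the answer matches a single left-to-right reduce. This is only an illustrative sketch; `chunked_reduce` and the sample values are hypothetical, not taken from the talk.

```python
from functools import reduce
import operator


def chunked_reduce(op, values, chunk_size=2):
    """Reduce each chunk independently, then reduce the partial results.

    For an associative op this matches a plain left-to-right reduce,
    which is what lets the chunks be processed in parallel.
    """
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    partials = [reduce(op, chunk) for chunk in chunks]  # each chunk is independent
    return reduce(op, partials)


numbers = [3, 4, 7, 8]          # hypothetical sample data
sets = [{1}, {1, 2}, {3}, {2, 4}]

for op in (max, operator.add, operator.mul):
    assert chunked_reduce(op, numbers) == reduce(op, numbers)

assert chunked_reduce(operator.or_, sets) == reduce(operator.or_, sets)
```

Because each chunk is reduced without looking at the others, the partial reductions could run on separate workers and only the small partial results would need to be combined.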

What about Division?

Division: (a / b / c / d) != (a / b) / (c / d)

(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)

(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))

0.0134 != 0.857

What were the Egyptians thinking?!

Not Composable

Divide like an Egyptian
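To see why division does not compose, the two groupings of 3, 4, 7, 8 can be evaluated exactly. The sketch below uses Python's `fractions.Fraction` to avoid rounding; the variable names are illustrative only.

```python
from fractions import Fraction
from functools import reduce
import operator

values = [Fraction(3), Fraction(4), Fraction(7), Fraction(8)]

# Left-to-right: ((3 / 4) / 7) / 8
sequential = reduce(operator.truediv, values)

# Pairwise grouping: (3 / 4) / (7 / 8)
grouped = (values[0] / values[1]) / (values[2] / values[3])

print(sequential, float(sequential))  # 3/224, roughly 0.0134
print(grouped, float(grouped))        # 6/7, roughly 0.857

assert sequential != grouped  # regrouping changes the answer
```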

What about Average?

Overall AVG over [value, count] pairs [3, 1], [5, 1], [5, 1], [7, 1]:

((3 + 5) + (5 + 7)) / ((1 + 1) + (1 + 1)) == 20 / 4 == 5

Pairwise AVG:

(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5

Divide, Add, Divide? Not Composable

Single Divide at the End? Doesn't need to be Composable!

AVG (3, 5, 5, 7) == 5

Add, Add, Add? Composable!
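A minimal sketch of the "add, add, add, then divide once" idea: carry a (value, count) pair through the composable additions and divide only at the very end. The helper names `to_pair`, `combine`, and `finish` are hypothetical.

```python
from functools import reduce


def to_pair(value):
    """Wrap a raw value as a (sum, count) pair."""
    return (value, 1)


def combine(a, b):
    """Associative merge of two (sum, count) pairs: just two additions."""
    return (a[0] + b[0], a[1] + b[1])


def finish(pair):
    """The single, non-composable divide, done once at the end."""
    total, count = pair
    return total / count


values = [3, 5, 5, 7]

# Two partitions reduced independently, then merged.
left = reduce(combine, map(to_pair, values[:2]))   # (8, 2)
right = reduce(combine, map(to_pair, values[2:]))  # (12, 2)
overall = finish(combine(left, right))             # 20 / 4 == 5.0

# Dividing inside each partition and then adding gives the wrong answer.
naive_pairwise = (3 + 5) / 2 + (5 + 7) / 2         # 10.0

assert overall == 5.0
assert naive_pairwise != overall
```

Only `combine` has to be composable; `finish` runs once on the merged pair, so it does not matter that division itself is not associative.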



Similarity


Euclidean Similarity

- Exists in Euclidean, flat space
- Based on Euclidean distance
- Linear measure
- Bias towards magnitude

Cosine Similarity

- Angular measure
- Adjusts for Euclidean magnitude bias
- Normalize to unit vectors
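The magnitude bias of Euclidean similarity, and how normalizing to unit vectors removes it, can be illustrated with two vectors that point in the same direction but differ in scale. This is a sketch under assumptions: the `1 / (1 + distance)` mapping is just one common convention for turning a distance into a similarity, and the vectors are made-up sample data.

```python
import math


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def euclidean_similarity(a, b):
    # Assumed convention: map a distance in [0, inf) into (0, 1].
    return 1.0 / (1.0 + euclidean_distance(a, b))


def normalize(v):
    """Scale v to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    # Dot product of unit vectors == cosine of the angle between a and b.
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


light = [1.0, 2.0, 1.0]     # hypothetical ratings from a light user
heavy = [10.0, 20.0, 10.0]  # same taste profile, ten times the volume

print(euclidean_similarity(light, heavy))  # low: dominated by magnitude
print(cosine_similarity(light, heavy))     # close to 1: same direction
```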

Jaccard Similarity

- Set similarity measurement
- Set intersection / set union
- Based on Jaccard distance
- Bias towards popularity
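Jaccard similarity as described here is the size of the intersection divided by the size of the union, and Jaccard distance is one minus that. A minimal sketch follows; returning 1.0 for two empty sets is an assumed convention, and the sample sets are hypothetical.

```python
def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


def jaccard_distance(a, b):
    return 1.0 - jaccard_similarity(a, b)


alice = {"a", "b", "c"}  # hypothetical items liked by one user
bob = {"b", "c", "d"}    # hypothetical items liked by another

print(jaccard_similarity(alice, bob))  # 2 shared / 4 total == 0.5
print(jaccard_distance(alice, bob))    # 0.5
```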

Log Likelihood Similarity

- Adjusts for popularity bias
- Netflix Shawshank problem
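One way to sketch the popularity adjustment is the entropy-based form of Dunning's log-likelihood ratio over a 2x2 co-occurrence table, a formulation also used in Apache Mahout. The slide does not say which formula the speaker had in mind, so treat this as an assumption. The counts below are invented: a very popular item co-occurs with a niche item exactly as often as chance predicts, so its score is near zero despite the large raw overlap.

```python
import math


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy of a list of counts."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: both items, k12: only A, k21: only B, k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


# Hypothetical counts over 10,000 users: item A has 100 viewers, the
# popular item has 9,000, and 90 saw both, exactly what chance predicts.
popular = log_likelihood_ratio(90, 10, 8910, 990)

# Two items that are always watched together and never apart.
linked = log_likelihood_ratio(100, 0, 0, 100)

print(popular)  # close to 0: the co-occurrence is explained by popularity
print(linked)   # large: the co-occurrence is surprising
```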

Word Similarity

- Based on edit distance
- Calculate char differences between words
- Deletes, transposes, replaces, inserts
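The four operations listed (deletes, transposes, replaces, inserts) correspond to a Damerau-Levenshtein style edit distance. The sketch below implements the restricted "optimal string alignment" variant, which is an assumption on my part since the slide does not name a specific algorithm.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    Counts the minimum number of deletes, inserts, replaces, and
    adjacent transposes needed to turn a into b.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete
                d[i][j - 1] + 1,         # insert
                d[i - 1][j - 1] + cost,  # replace (or match)
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose
    return d[-1][-1]


print(edit_distance("spark", "spakr"))     # 1: one transpose
print(edit_distance("kitten", "sitting"))  # 3
```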

Document Similarity

- TF/IDF: Term Freq / Inverse Document Freq; used by most search engines
- Word2Vec: words embedded in a vector space near similar words
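A minimal TF/IDF sketch, assuming whitespace tokenization and the plain `tf * log(N / df)` weighting; real search engines use many variants of both. The function name `tf_idf` and the sample documents are hypothetical, and each document is assumed to be non-empty.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the spark", "the graph", "the stream"]  # hypothetical documents
vectors = tf_idf(docs)

# "the" appears in every document, so it carries no weight;
# "spark" appears in only one, so it stands out there.
print(vectors[0])
```

The resulting sparse vectors can then be compared with a measure such as cosine similarity to score how close two documents are.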

Similarity Pathway

i.e. closest recommendations between 2 people

Calculating Similarity

Exact Brute-Force

- All-pairs similarity
- aka Pair-wise similarity, Similarity join
- Cartesian O(n^2) shuffle and comparison

Approximate

- Sampling
- Bucketing (aka Partitioning, Clustering)
- Remove data with low probability of similarity
- Reduce shuffle and comparisons
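The difference between the exact and approximate approaches can be sketched by counting candidate pairs. The bucketing key used here, a single MinHash value per item, is my own choice for illustration; the slide only says "bucketing". Two sets share a MinHash bucket with probability roughly equal to their Jaccard similarity, so dissimilar pairs are mostly never compared. All names and sample data are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def stable_hash(token, seed):
    """Deterministic hash (Python's built-in hash() is salted per process)."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def min_hash(tokens, seed=0):
    return min(stable_hash(token, seed) for token in tokens)


def all_pairs(items):
    """Exact brute force: every one of the n * (n - 1) / 2 pairs."""
    return set(combinations(sorted(items), 2))


def bucketed_pairs(items, seed=0):
    """Approximate: only pairs that land in the same MinHash bucket."""
    buckets = defaultdict(list)
    for name, tokens in items.items():
        buckets[min_hash(tokens, seed)].append(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


items = {  # hypothetical item -> tag sets
    "a": {"x", "y"},
    "b": {"x", "y"},
    "c": {"z"},
    "d": {"q", "r"},
}

print(len(all_pairs(items)))       # 6 comparisons
print(bucketed_pairs(items))       # far fewer candidates
```

A single hash can miss pairs that are similar but not identical; using several seeds and taking the union of the candidate pairs trades extra comparisons for better recall.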

Bonus: Document Summary

Text Rank (aka Sentence Rank)

- TF/IDF + Similarity Graph + PageRank

Intuition

- Surface summary sentences (abstract)
- Most similar to all others (TF/IDF + Similarity Graph)
- Most influential sentences (PageRank)
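The three ingredients named above can be wired together in a small sketch: TF/IDF vectors per sentence, cosine similarity as edge weights, and a weighted PageRank over the resulting graph. Everything here is an assumption made for illustration, including the smoothed `log(1 + N / df)` weighting, the damping factor of 0.85, the fixed iteration count, and the sample sentences.

```python
import math
from collections import Counter


def tf_idf_vectors(sentences):
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(t for tokens in tokenized for t in set(tokens))
    # Smoothed IDF (an assumption) so common terms keep a small weight.
    return [{t: c * math.log(1 + n / df[t]) for t, c in Counter(tokens).items()}
            for tokens in tokenized]


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def text_rank(sentences, damping=0.85, iterations=50):
    """Return one PageRank-style score per sentence."""
    vectors = tf_idf_vectors(sentences)
    n = len(sentences)
    # Similarity graph: weighted edges between every pair of sentences.
    weights = [[cosine(vectors[i], vectors[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = [(1.0 - damping) / n] * n
        for i in range(n):
            out = sum(weights[i])
            for j in range(n):
                # Sentences with no similar neighbors spread their score evenly.
                share = weights[i][j] / out if out else 1.0 / n
                new_scores[j] += damping * scores[i] * share
        scores = new_scores
    return scores


sentences = [  # hypothetical input
    "spark makes streaming simple",
    "spark streaming handles live data",
    "graph processing with spark streaming",
    "bananas are yellow",
]
scores = text_rank(sentences)
summary = sentences[scores.index(max(scores))]
print(summary)
```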

Similarity Graph

- Vertex is movie, tag, actor, plot summary, etc.
- Edges are relationships and weights

Topic-Sensitive PageRank

- Graph diffusion algorithm
- Pre-process graph, add vector of probabilities to each vertex
- Probability of landing at this vertex from every other vertex
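A rough sketch of the diffusion idea, assuming the standard personalized form of PageRank in which random jumps return only to a chosen set of source vertices. Running it once per source set yields the kind of per-vertex probability vector the slide describes. The graph, the function name, and the parameters are hypothetical, and every edge target is assumed to also appear as a key in the graph.

```python
def personalized_pagerank(graph, sources, damping=0.85, iterations=100):
    """graph: {vertex: [out-neighbors]}; sources: non-empty set of vertices."""
    nodes = sorted(graph)
    # Teleport only to the source (topic) vertices, not to every vertex.
    teleport = {v: (1.0 / len(sources) if v in sources else 0.0) for v in nodes}
    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) * teleport[v] for v in nodes}
        for v in nodes:
            targets = graph[v]
            if targets:
                share = damping * rank[v] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling vertex: return its mass to the source vertices.
                for s in nodes:
                    new_rank[s] += damping * rank[v] * teleport[s]
        rank = new_rank
    return rank


graph = {  # hypothetical toy graph
    "a": ["b"],
    "b": ["c"],
    "c": ["a"],
    "d": ["a"],
}
rank = personalized_pagerank(graph, sources={"a"})
print(rank)  # vertices closer to "a" along the edges score higher
```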


Recommendations