Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Flux Capacitor AI - Bringing AI Back to the Future! - advancedspark.com

Transcript of Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Page 1: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

advancedspark.com

Page 2: Who Am I?

Streaming Data Engineer, Netflix OSS Committer

Data Solutions Engineer, Apache Contributor

Principal Data Solutions Engineer, IBM Technology Center

Meetup Organizer, Advanced Apache Meetup

Book Author, Advanced . (due 2016)

Page 3: Advanced Apache Spark Meetup

http://advancedspark.com

Meetup Metrics
  Top 10 most-active Spark Meetup!
  3200+ members in just 9 months!
  3700+ Docker downloads (demos)

Meetup Mission
  Code deep-dive into Spark and related open source projects
  Surface key patterns and idioms
  Focus on distributed systems, scale, and performance

Page 4: Live, Interactive Demo!

Audience participation required! Cell phone compatible!

demo.advancedspark.com

Page 5: http://demo.advancedspark.com

[Architecture diagram labels]
Left side: End User ->, ElasticSearch ->, Spark ML ->, Data Scientist ->
Right side: <- Kafka, <- Spark Streaming, <- Cassandra, Redis, <- Zeppelin, iPython

Page 6: Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 7: Scaling with Parallelism

[Figure labels: Peter; O(log n); Worker Nodes]

Page 8: Parallelism with Composability

[Figure labels: Worker 1; Worker 2; Collect at Driver]

Max:       (a max b max c max d) == (a max b) max (c max d)
Set Union: (a U b U c U d) == (a U b) U (c U d)
Addition:  (a + b + c + d) == (a + b) + (c + d)
Multiply:  (a * b * c * d) == (a * b) * (c * d)

What about Division and Average?
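A minimal sketch of the idea on this slide, in plain Python rather than Spark: each "worker" reduces its own partition, and the "driver" combines the partial results. The helper name `distributed_reduce` and the two-way split are illustrative assumptions, not code from the talk.

```python
from functools import reduce
import operator


def distributed_reduce(op, partitions):
    """Reduce each partition on its own, then combine the partial results."""
    partials = [reduce(op, part) for part in partitions]  # one partial per worker
    return reduce(op, partials)                           # combine at the driver


values = [3.0, 4.0, 7.0, 8.0]
partitions = [values[:2], values[2:]]  # worker 1 holds (a, b); worker 2 holds (c, d)

# Associative operations give the same answer no matter how the work is split.
for op in (max, operator.add, operator.mul):
    assert distributed_reduce(op, partitions) == reduce(op, values)

# Division is not associative, so the split changes the answer.
sequential_div = reduce(operator.truediv, values)                # ((3 / 4) / 7) / 8
pairwise_div = distributed_reduce(operator.truediv, partitions)  # (3 / 4) / (7 / 8)
print(round(sequential_div, 4), round(pairwise_div, 4))
```

The same check works for set union with `operator.or_` on Python sets.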

Page 9: What about Division?

Division: (a / b / c / d) != (a / b) / (c / d)

(3 / 4 / 7 / 8) != (3 / 4) / (7 / 8)
(((3 / 4) / 7) / 8) != ((3 * 8) / (4 * 7))
0.0134 != 0.857

Not Composable

“Divide like an Egyptian”
What were the Egyptians thinking?!

Page 10: What about Average?

AVG (3, 5, 5, 7) == 5

Overall AVG, carried as (value, count) pairs:
(3, 1) + (5, 1) + (5, 1) + (7, 1) == (3 + 5 + 5 + 7) / (1 + 1 + 1 + 1) == 20 / 4 == 5
Add, Add, Add? Composable!
Single-Node Divide at the End? Doesn’t need to be Composable!

Pairwise AVG:
(3 + 5) / 2 + (5 + 7) / 2 == 8 / 2 + 12 / 2 == 20 / 2 == 10 != 5
Divide, Add, Divide? Not Composable
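The (value, count) trick can be sketched in a few lines of plain Python. The function names `to_pair`, `combine`, and `finish` are made up for illustration; the point is that only the additions are distributed, and the single division happens once at the end.

```python
from functools import reduce


def to_pair(value):
    """Lift a raw value into a (sum, count) pair."""
    return (value, 1)


def combine(a, b):
    """Associative merge: add the sums and add the counts."""
    return (a[0] + b[0], a[1] + b[1])


def finish(pair):
    """The one division, done on a single node after all merging."""
    total, count = pair
    return total / count


values = [3, 5, 5, 7]
worker1 = reduce(combine, map(to_pair, values[:2]))  # partial for (3, 5)
worker2 = reduce(combine, map(to_pair, values[2:]))  # partial for (5, 7)
overall = finish(combine(worker1, worker2))
print(worker1, worker2, overall)
```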

Page 11: Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 12: Similarities

Page 13: Euclidean Similarity

Exists in Euclidean, flat space
Based on Euclidean distance
Linear measure
Bias towards magnitude

Page 14: Cosine Similarity

Angular measure
Adjusts for Euclidean magnitude bias
Normalize to unit vectors in all dimensions
Used with real-valued vectors (versus binary)

org.jblas.DoubleMatrix
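To make the magnitude bias concrete, here is a small plain-Python sketch (not the jblas code the slide refers to). The two viewer vectors are invented: they point in the same direction but differ tenfold in magnitude.

```python
import math


def euclidean_distance(u, v):
    """Straight-line distance; grows with differences in magnitude."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; ignores magnitude."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


light_viewer = [1.0, 2.0, 0.5]    # hypothetical per-genre viewing hours
heavy_viewer = [10.0, 20.0, 5.0]  # same taste profile, 10x the volume

distance = euclidean_distance(light_viewer, heavy_viewer)
similarity = cosine_similarity(light_viewer, heavy_viewer)
print(distance, similarity)
```

The Euclidean distance is large even though the two viewers have identical tastes, while the cosine similarity is (numerically) 1.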

Page 15: Jaccard Similarity

Set similarity measurement
Set intersection / set union
Bias towards popularity
Works with binary vectors
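A minimal sketch of "set intersection / set union" in plain Python. The tag sets are invented for illustration, and returning 0.0 for two empty sets is a convention chosen here, not something the slide specifies.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        return 0.0  # assumed convention for two empty sets
    return len(a & b) / len(union)


top_gun_tags = {"action", "military", "jets"}  # hypothetical tags
taps_tags = {"military", "drama", "jets"}      # hypothetical tags

score = jaccard_similarity(top_gun_tags, taps_tags)
print(score)
```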

Page 16: Log Likelihood Similarity

Adjusts for popularity bias
Netflix “Shawshank” problem

Page 17: Word Similarity

Edit Distance
  Misspellings and autocorrect

Word2Vec
  Similar words are defined by similar contexts in vector space

[Figure labels: English; Spanish]
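Edit distance is commonly computed as Levenshtein distance with dynamic programming. The slide does not say which variant the talk used, so the sketch below is the textbook version, keeping only two rows of the table at a time.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes, and substitutions."""
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]                   # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


print(edit_distance("recomend", "recommend"))
```

An autocorrect feature could then suggest the dictionary words within a small distance of the typed word.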

Page 18: Demo! Find Synonyms with Word2Vec

Page 19: Find Synonyms using Word2Vec

Page 20: Document Similarity

TF/IDF
  Term Freq / Inverse Document Freq
  Used by most search engines

Doc2Vec
  Similar documents are determined by similar contexts
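TF/IDF has several weighting variants. The sketch below assumes one of the simplest: term frequency as count divided by document length, and inverse document frequency as log(N / document frequency). The three tiny documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "spark streaming with kafka",
    "spark ml with cassandra",
    "spark graph pagerank",
]
vectors = tf_idf(docs)
print(vectors[0])
```

A term that appears in every document ("spark" here) gets weight zero, while a term unique to one document ("kafka") gets the highest weight in that document.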

Page 21: Bonus! Text Rank Document Summary

Text Rank (aka Sentence Rank)
  Surface summary sentences
  TF/IDF + Similarity Graph + PageRank

Most similar sentence to all other sentences
  TF/IDF + Similarity Graph

Most influential sentences
  PageRank

Page 22: Similarity Pathways (Recommendations)

Best recommendations for 2 (or more) people

“You like Mad Max. I like Message in a Bottle. We might like a movie similar to both.”

Item-to-Item Similarity Graph + Dijkstra Heaviest Path
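The deck leaves the Dijkstra step as a TODO, so the following is only one possible reading. It assumes edge weights are similarities in (0, 1] and that the "heaviest" path is the one with the largest product of similarities. Dijkstra finds shortest paths, so each weight is converted to the cost -log(similarity), which turns "maximize the product" into "minimize the sum". The movie graph and its similarity values are made up.

```python
import heapq
import math


def strongest_path(graph, source, target):
    """Path maximizing the product of edge similarities (each in (0, 1])."""
    best_cost = {source: 0.0}
    previous = {}
    heap = [(0.0, source)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == target:
            break
        if cost > best_cost.get(node, math.inf):
            continue  # stale heap entry
        for neighbor, similarity in graph.get(node, {}).items():
            new_cost = cost - math.log(similarity)  # non-negative edge cost
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                previous[neighbor] = node
                heapq.heappush(heap, (new_cost, neighbor))
    if target not in best_cost:
        return None, 0.0
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return path, math.exp(-best_cost[target])


# Hypothetical item-to-item similarity graph (symmetric edges).
edges = [
    ("Mad Max", "The Road Warrior", 0.8),
    ("Mad Max", "Waterworld", 0.5),
    ("The Road Warrior", "Cast Away", 0.1),
    ("Waterworld", "Cast Away", 0.6),
    ("Cast Away", "Message in a Bottle", 0.7),
]
graph = {}
for a, b, similarity in edges:
    graph.setdefault(a, {})[b] = similarity
    graph.setdefault(b, {})[a] = similarity

path, strength = strongest_path(graph, "Mad Max", "Message in a Bottle")
print(path, round(strength, 3))
```

The movies in the middle of the returned path are the candidates that sit between both people's tastes.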

Page 23: Demo! Similarity Pathway for Movie Recommendations

Page 24: Load Movies with Tags into DataFrame

[Screenshot callouts: My Choice; Their Choice]

Page 25: Item-to-Item Tag Jaccard Similarity

Based on Tags

[Screenshot callouts: Calculate Jaccard Similarity (Tag Set Similarity); Must be Above the Given Jaccard Similarity Threshold]

Page 26: Item-to-Item Tag Similarity Graph

Edge Weights == Jaccard Similarity (Based on Tag Sets)

Page 27: TODO: Use Dijkstra to Find Heaviest Pathway

Page 28: Calculating Exact Similarity

Brute-Force Similarity
  Cartesian Product
  O(n^2) shuffle and compute
  aka All-pairs, Pair-wise, Similarity Join

Page 29: Calculating Approximate Similarity

Goal: Reduce Shuffle

Approximate Similarity
  Sampling
  Bucketing or Clustering
  Ignore low-similarity probability

Locality Sensitive Hashing
  Twitter Algebird MinHash

[Figure label: Bucket By Genre]

Page 30: Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 31: Recommendations

Page 32: Basic Terminology

User: User seeking recommendations
Item: Item being recommended
Explicit User Feedback: User knows they are rating or liking, can choose to dislike
Implicit User Feedback: User not explicitly aware, cannot dislike (click, hover, etc.)
Instances: Rows of user feedback/input data
Overfitting: Training a model too closely to the training data & hyperparameters
Hold Out Split: Holding out some of the instances to avoid overfitting
Features: Columns of instance rows (of feedback/input data)
Cold Start Problem: Not enough data to personalize (new)
Hyperparameter: Model-specific config knobs for tuning (tree depth, iterations)
Model Evaluation: Compare predictions to actual values of hold out split
Feature Engineering: Modify, reduce, combine features

Page 33: Features

Binary: True or False
Numeric Discrete: Integers
Numeric: Real Values
Binning: Convert Continuous into Discrete (Time of Day -> Morning, Afternoon)
Categorical Ordinal: Size (Small -> Medium -> Large), Ratings (1 -> 5)
Categorical Nominal: Independent, Favorite Sports Teams, Dating Spots
Temporal: Time-based, Time of Day, Binge Viewing
Text: Movie Titles, Genres, Tags, Reviews (Tokenize, Stop Words, Stemming)
Media: Images, Audio, Video
Geographic: (Longitude, Latitude), Geohash
Latent: Hidden Features within Data (Collaborative Filtering)
Derived: Age of Movie, Duration of User Subscription

Page 34: Feature Engineering

Dimension Reduction
  Reduce number of features in feature space

Principal Component Analysis (PCA)
  Find principal features that best describe data variance
  Peel dimensional layers back

One-Hot Encoding
  Convert nominal categorical feature values into 0’s and 1’s
  Remove any numerical relationship between categories
  Convert each item to a binary vector with a single 1.0 column

  Bears    -> 1          Bears    -> [1.0, 0.0, 0.0]
  49’ers   -> 2   -->    49’ers   -> [0.0, 1.0, 0.0]
  Steelers -> 3          Steelers -> [0.0, 0.0, 1.0]
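The team example above can be reproduced with a few lines of plain Python. This is a hand-rolled sketch to show the mechanics; `one_hot_encoder` is a name invented here, not Spark's encoder API.

```python
def one_hot_encoder(categories):
    """Build an encoder that maps each category to a vector with a single 1.0."""
    index = {category: position for position, category in enumerate(categories)}

    def encode(value):
        vector = [0.0] * len(index)
        vector[index[value]] = 1.0
        return vector

    return encode


encode = one_hot_encoder(["Bears", "49ers", "Steelers"])
for team in ["Bears", "49ers", "Steelers"]:
    print(team, encode(team))
```

With integer labels (1, 2, 3) a distance-based model would treat Steelers as "further" from Bears than 49ers is. After encoding, every pair of teams is equally far apart.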

Page 35: Feature Normalization & Standardization

Goal
  Scale features to standard size
  Required by many ML algos

Normalize Features
  Calculate L1 (or L2, etc.) norm, then divide each element by it
  org.apache.spark.ml.feature.Normalizer

Standardize Features
  Apply standard normal transformation (mean -> 0, stddev -> 1)
  org.apache.spark.ml.feature.StandardScaler

http://www.mathsisfun.com/data/standard-normal-distribution.html
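The two transformations are easy to confuse, so here is a plain-Python sketch of each. It does not call the Spark classes named on the slide. Normalization works per row (one feature vector), standardization works per column (one feature across all rows). Using the sample standard deviation (n - 1 in the denominator) is an assumption made for this example.

```python
import math


def l2_normalize(vector):
    """Divide each element by the vector's L2 norm, giving unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def standardize(column):
    """Shift to mean 0 and scale to standard deviation 1 (sample stddev)."""
    mean = sum(column) / len(column)
    variance = sum((x - mean) ** 2 for x in column) / (len(column) - 1)
    stddev = math.sqrt(variance)
    return [(x - mean) / stddev for x in column]


unit = l2_normalize([3.0, 4.0])
scaled = standardize([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
print(unit, scaled)
```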

Page 36: Non-Personalized Recommendations

Page 37: Cold Start Problem

“Cold Start” problem
  New user, don’t know their preferences, must show something!

Movies with highest-rated actors
  Top K aggregations

Facebook social graph
  Friend-based recommendations

Most desirable singles
  PageRank of likes and dislikes

Page 38: Demo! GraphFrame PageRank

Page 39: Example: Dating Site “Like” Graph

Page 40: PageRank of Top Influencers

Page 41: Personalized Recommendations

Page 42: Demo! Personalized PageRank

Page 43: Personalized PageRank: Outbound Links

0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random

[Figure label: 85% Among Outbound Network]

Page 44: Personalized PageRank: No Outbound

0.15 = (1 - 0.85 “Damping Factor”)
85% Probability: Choose Among Outbound Network
15% Probability: Choose Self or Random

[Figure label: 85% Among No Outbound Network!!]
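A rough power-iteration sketch of the random walk described on these two slides, in plain Python. It is not the GraphX or GraphFrames implementation, and it makes two assumptions: the 15% jump always returns to the source user (that is what makes the ranking "personalized"), and a node with no outbound links sends its 85% back to the source as well. The "like" graph is invented.

```python
def personalized_pagerank(graph, source, damping=0.85, iterations=50):
    """Rank nodes by how often a walk that restarts at `source` visits them."""
    nodes = sorted(set(graph) | {t for targets in graph.values() for t in targets})
    rank = {node: (1.0 if node == source else 0.0) for node in nodes}
    for _ in range(iterations):
        next_rank = {node: 0.0 for node in nodes}
        for node in nodes:
            outbound = graph.get(node, [])
            if outbound:
                # 85%: spread evenly among the outbound network.
                share = damping * rank[node] / len(outbound)
                for target in outbound:
                    next_rank[target] += share
            else:
                # No outbound network: assumed to restart at the source.
                next_rank[source] += damping * rank[node]
            # 15%: jump back to the source.
            next_rank[source] += (1.0 - damping) * rank[node]
        rank = next_rank
    return rank


# Hypothetical dating-site "like" graph: user -> users they liked.
likes = {
    "alice": ["bob", "carol"],
    "bob": ["carol"],
    "carol": [],
    "dave": ["alice"],
}
ranks = personalized_pagerank(likes, source="alice")
print(ranks)
```

Users who cannot be reached from the source ("dave" here) end with rank 0, which is the main difference from global PageRank.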

Page 45: User-to-User Clustering

User Similarity
  Time-based
    Pattern of viewing (binge or casual)
    Time of viewing (am or pm)
  Ratings-based
    Content ratings or number of views
    Average rating relative to others (critical or lenient)
  Search-based
    Search terms

Page 46: Item-to-Item Clustering

Item Similarity
  Profile text (TF/IDF, Word2Vec, NLP)
  Categories, tags, interests (Jaccard Similarity, LSH)
  Images, facial structures (Neural Nets, Eigenfaces)

Dating Site Example…

[Figure labels: Cluster Similar Eigen-faces; Cluster Similar Profiles; Cluster Similar Categories]

Page 47: Bonus: NLP Conversation Starter Bot

“If your responses to my generic opening lines are positive, I may read your profile.”

Spark ML, Stanford CoreNLP, TF/IDF, DecisionTrees, Sentiment

http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Page 48: Bonus: Demo! Spark + Stanford CoreNLP Sentiment Analysis

Page 49: Bonus: Top 100 Country Song Sentiment

Page 50: Bonus: Surprising Results…?!

Page 51: Item-to-Item Based Recommendations

Based on Metadata: Genre, Description, Cast, City

Page 52: Demo! Item-to-Item-based Recommendations

One-Hot Encoding + K-Means Clustering

Page 53: One-Hot Encode Tag Feature Vectors

Page 54: Cluster Movie Tag Feature Vectors

[Screenshot callout: Hyperparameter Tuning (K Clusters?)]

Page 55: Analyze Movie Tag Clusters

Page 56: User-to-Item Collaborative Filtering

Matrix Factorization
① Factor the large matrix (left) into 2 smaller matrices (right)
② Lower-rank matrices approximate original when multiplied
③ Fill in the missing values of the large matrix
④ Surface k (rank) latent features from user-item interactions
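The four steps can be illustrated with a toy factorization in plain Python. The demos in this deck use ALS from Spark; the sketch below swaps in stochastic gradient descent because it fits in a few lines, so treat it as an illustration of the idea rather than the method used in the talk. The ratings, rank, learning rate, and regularization values are all made up.

```python
import random


def predict(user_factors, item_factors, user, item):
    """Dot product of one user's and one item's latent feature vectors."""
    return sum(a * b for a, b in zip(user_factors[user], item_factors[item]))


def rmse(ratings, user_factors, item_factors):
    """Root mean squared error over the observed ratings."""
    squared = [(r - predict(user_factors, item_factors, u, i)) ** 2
               for u, i, r in ratings]
    return (sum(squared) / len(squared)) ** 0.5


def factorize(ratings, n_users, n_items, rank=2, epochs=200,
              learning_rate=0.02, regularization=0.02, seed=42):
    """Factor the sparse ratings into user and item factor matrices (SGD)."""
    rng = random.Random(seed)
    user_factors = [[rng.uniform(0.0, 0.5) for _ in range(rank)]
                    for _ in range(n_users)]
    item_factors = [[rng.uniform(0.0, 0.5) for _ in range(rank)]
                    for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            error = r - predict(user_factors, item_factors, u, i)
            for f in range(rank):
                uf, vf = user_factors[u][f], item_factors[i][f]
                user_factors[u][f] += learning_rate * (error * vf - regularization * uf)
                item_factors[i][f] += learning_rate * (error * uf - regularization * vf)
    return user_factors, item_factors


# (user, item, rating) triples; the other cells of the 3 x 3 matrix are missing.
ratings = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0),
           (1, 2, 1.0), (2, 1, 5.0), (2, 2, 2.0)]

untrained = factorize(ratings, n_users=3, n_items=3, epochs=0)
trained = factorize(ratings, n_users=3, n_items=3)
filled_in = predict(*trained, 0, 2)  # estimate for a cell that was never rated
print(rmse(ratings, *untrained), rmse(ratings, *trained), filled_in)
```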

Page 57: Item-to-Item Collaborative Filtering

Famous Amazon Paper circa 2003

Problem
  As users grew, user-to-item collaborative filtering didn’t scale

Solution
  Item-to-item similarity, nearest neighbors
  Offline (Batch)
    Generate itemId->List[userId] vectors
  Online (Real-time)
    From cart, recommend nearest-neighbors in vector space
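A small sketch of the offline and online halves in plain Python. It assumes binary purchase data, so each item vector is simply the set of users who bought the item, and it uses cosine similarity between those sets. The purchase log and function names are invented.

```python
import math
from collections import defaultdict


def build_item_vectors(purchases):
    """Offline step: itemId -> set of userIds who bought the item."""
    item_users = defaultdict(set)
    for user, item in purchases:
        item_users[item].add(user)
    return dict(item_users)


def cosine(users_a, users_b):
    """Cosine similarity between two binary vectors stored as sets."""
    return len(users_a & users_b) / math.sqrt(len(users_a) * len(users_b))


def recommend(cart, item_users, k=2):
    """Online step: score items outside the cart by similarity to cart items."""
    scores = defaultdict(float)
    for item in cart:
        for other, users in item_users.items():
            if other not in cart:
                scores[other] += cosine(item_users[item], users)
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, score in ranked[:k] if score > 0]


purchases = [  # hypothetical (user, item) purchase log
    ("u1", "Top Gun"), ("u1", "Taps"),
    ("u2", "Top Gun"), ("u2", "Taps"), ("u2", "A Few Good Men"),
    ("u3", "Top Gun"), ("u3", "A Few Good Men"),
    ("u4", "Mad Max"),
]
item_users = build_item_vectors(purchases)
suggestions = recommend(["Taps"], item_users)
print(suggestions)
```

In a real system the pairwise similarities would be precomputed in the batch step, so the online step only looks up nearest neighbors.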

Page 58: Demo! Collaborative Filtering-based Recommendations

Page 59: Fitting the Matrix Factorization Model

Page 60: Show ItemFactors Matrix from ALS

Page 61: Show UserFactors Matrix from ALS

Page 62: Generating Individual Recommendations

Page 63: Generating Batch Recommendations

Page 64: Clustering + Collaborative Filtering Recs

Cluster matrix output from Matrix Factorization
Latent features derived from user-item interaction

Item-to-Item Similarity
  Cluster item-factor matrix

User-to-User Similarity
  Cluster user-factor matrix

Page 65: Demo! Clustering + Collaborative Filtering-based Recommendations

Page 66: Show ItemFactors Matrix from ALS

Page 67: Convert Item Factors to mllib.Vector

Required by K-Means Clustering Algorithm

Page 68: Fit and Evaluate K-Means Cluster Model

[Screenshot callouts: K = 5 Clusters; Measures Closeness Of Points Within Clusters]

Page 69: Netflix Genres and Clusters

Typical Genres
  Documentary, Romance, Comedy, Horror, Action, Adventure

Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Page 70: Presentation Outline

① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 71: When to Approximate?

Memory or time constrained queries
  Relative vs. exact counts are OK (approx # errors after a release)

Using machine learning or graph algos
  Inherently probabilistic and approximate

Streaming aggregations
  Inherently sloppy collection (exactly once?)

Approximate as much as you can get away with!
Ask for forgiveness later!!

Page 72: When NOT to Approximate?

If you’ve ever heard the term “Sarbanes-Oxley” at the office.

Page 73: A Few Good Algorithms

You can’t handle the approximate!

Page 74: Common to These Algos & Data Structs

Low, fixed size in memory
Store large amount of data
Known error bounds
Tunable tradeoff between size and error
Less memory than Java/Scala collections
Rely on multiple hash functions or operations
Size of hash range defines error

Page 75: Bloom Filter

Set.contains(key): Boolean

“Hash Multiple Times and Flip the Bits Wherever You Land”

Page 76: Bloom Filter

Approximate Set.contains(key)
  No means No, Yes means Maybe

Elements can only be added
  Never updated or removed

Page 77: Bloom Filter in Action

[Figure panels: set(key); contains(key): Boolean. Images by @avibryant]

Set.contains(key): TRUE -> maybe contains (other key hashes may overlap)
Set.contains(key): FALSE -> definitely does not contain (no key flipped all bits)
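"Hash multiple times and flip the bits wherever you land" translates almost directly into code. The class below is a teaching sketch, not Algebird's BloomFilter: the bit-array size, the number of hashes, and the use of salted SHA-256 as the hash family are all arbitrary choices made here.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Simulate several hash functions by salting one hash with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for position in self._positions(key):
            self.bits[position] = True

    def might_contain(self, key):
        # False: definitely absent. True: present, or other keys set the same bits.
        return all(self.bits[position] for position in self._positions(key))


viewers = BloomFilter()
for user in ["user1001", "user2009", "user3005"]:
    viewers.add(user)

print(viewers.might_contain("user1001"), viewers.might_contain("user9999"))
```

Bits are only ever switched on, which is why elements can be added but never removed: clearing a bit could erase evidence of some other key.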

Page 78: CountMin Sketch

Frequency Count and TopK

“Hash Multiple Times and Add 1 Wherever You Land”

Page 79: CountMin Sketch (CMS)

Approximate frequency count and TopK for key
  i.e. “Heavy Hitters” on Twitter

[Figure labels: Matei Zaharia; Martin Odersky; Donald Trump]

Page 80: CountMin Sketch In Action (TopK Count)

Use Case: TopK movies using total views

add(Top Gun, 2)   (x 2 occurrences of “Top Gun” for slightly additional complexity)
add(A Few Good Men, 1)
add(Taps, 1)
getCount(Top Gun): Long

Multiple hash functions (1 hash function per row)
Binary hash output (1 element per column)
Find minimum of all rows
Can overestimate, but never underestimate

[Figure labels: Top Gun (x 2); A Few Good Men; Taps; Overlap Top Gun; Overlap A Few Good Men. Images derived from @avibryant]
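The add/getCount walkthrough can be mirrored in a short plain-Python class. This is a sketch for intuition, not Algebird's or Spark's CountMinSketch: the depth and width are arbitrary, and salted SHA-256 stands in for the per-row hash functions.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counts in a fixed-size table of counters."""

    def __init__(self, depth=4, width=256):
        self.depth = depth  # number of rows, one hash function per row
        self.width = width  # number of counters per row
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{key}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, key, count=1):
        for row, column in self._cells(key):
            self.table[row][column] += count

    def get_count(self, key):
        # Collisions only ever inflate a counter, so the minimum across rows
        # is the least-contaminated estimate and never falls below the truth.
        return min(self.table[row][column] for row, column in self._cells(key))


views = CountMinSketch()
views.add("Top Gun", 2)
views.add("A Few Good Men", 1)
views.add("Taps", 1)
print(views.get_count("Top Gun"))
```

A TopK list can be kept alongside the sketch by re-checking `get_count` for each key as it is added and retaining the K largest.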

Page 81: HyperLogLog

Count Distinct

“Hash Multiple Times and Uniformly Distribute Where You Land”

Page 82: HyperLogLog (HLL)

Approximate count distinct

Slight twist
  Special hash function creates uniform distribution
  Hash subsets of data with single, special hash func

Error estimate
  14 bits for size of range
  m = 2^14 = 16,384 hash slots
  error = 1.04 / sqrt(16,384) = 0.81%

[Figure callout: Not many of these]
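The error estimate above, plus a bare-bones HyperLogLog, can be sketched in plain Python. This follows the basic algorithm only (register index from the first 14 bits of a 64-bit hash, rank of the leftmost 1-bit in the rest, harmonic-mean estimate with the small-range linear-counting correction). It omits the refinements found in production versions such as HyperLogLogPlus, and SHA-1 is just a convenient stand-in hash.

```python
import hashlib
import math


def standard_error(precision):
    """Relative error of HLL with m = 2^precision registers: 1.04 / sqrt(m)."""
    return 1.04 / math.sqrt(1 << precision)


class HyperLogLog:
    def __init__(self, precision=14):
        self.p = precision
        self.m = 1 << precision  # 2^14 = 16,384 registers
        self.registers = [0] * self.m

    def add(self, key):
        digest = hashlib.sha1(key.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")           # 64-bit hash value
        index = h >> (64 - self.p)                      # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)           # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1    # position of leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def merge(self, other):
        """Union of two sketches: element-wise max of the registers."""
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)  # bias constant for m >= 128
        estimate = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)  # linear counting
        return round(estimate)


hour1 = HyperLogLog()
for i in range(1000):
    hour1.add(f"user{i}")          # users 0..999 watched in hour 1
hour2 = HyperLogLog()
for i in range(500, 1500):
    hour2.add(f"user{i}")          # users 500..1499 watched in hour 2
merged = hour1.merge(hour2)        # true distinct count is 1500
print(standard_error(14), merged.count())
```

Because merging is an element-wise max, it is associative and commutative, which is what lets per-hour sketches be combined on any worker in any order.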

Page 83: HyperLogLog In Action (Count Distinct)

Use Case: Number of distinct users who view a movie

Uniform Distribution: Estimate distinct # of users by inspecting just the beginning

[Figure panels]
Top Gun: Hour 1 (scale 0 to 16): user1001, user2009, user3005, user3003, user3001, user7009
Top Gun: Hour 2 (scale 0 to 32): user2001, user4009, user3002, user7002, user1005, user6001, user8001, user8002
Top Gun: Hour 1 + 2 (scale 0 to 32): all users from both hours
Combine across different scales

Page 84: Locality Sensitive Hashing

Set Similarity

“Pre-process Items into Buckets, Compare Within Buckets”

Page 85: Locality Sensitive Hashing (LSH)

Approximate set similarity

Pre-process m rows into b buckets
  b << m; b = buckets, m = rows

Hash items multiple times
  ** Similar items hash to overlapping buckets
  ** Hash designed to cluster similar items

Compare just contents of buckets
  Much smaller cartesian compare
  ** Compare in parallel !!

Avoids huge cartesian all-pairs compare

Chapter 3: LSH
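One common way to build the buckets for set similarity is MinHash signatures with banding. The sketch below is plain Python, not Algebird's MinHasher: the number of hashes, the number of bands, the salted SHA-256 hash family, and the movie tag sets are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes=20):
    """For each salted hash function, keep the minimum hash over the set."""
    return [
        min(int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big")
            for item in items)
        for seed in range(num_hashes)
    ]


def lsh_candidate_pairs(signatures, bands=10):
    """Bucket each band of each signature; only bucket-mates get compared."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, signature in signatures.items():
        for band in range(bands):
            key = (band, tuple(signature[band * rows:(band + 1) * rows]))
            buckets[key].add(name)
    pairs = set()
    for members in buckets.values():
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


movie_tags = {  # hypothetical tag sets
    "Top Gun": {"action", "military", "jets"},
    "Iron Eagle": {"action", "military", "jets"},
    "The Notebook": {"romance", "drama"},
}
signatures = {name: minhash_signature(tags) for name, tags in movie_tags.items()}
candidates = lsh_candidate_pairs(signatures)
print(candidates)
```

Exact Jaccard similarity is then computed only for the candidate pairs. More bands with fewer rows each catch less-similar pairs, at the cost of more comparisons.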

Page 86: DIMSUM

Set Similarity

“Pre-process and ignore data that is unlikely to be similar.”

Page 87: DIMSUM

“Dimension Independent Matrix Square Using MR”

Remove vectors with low probability of similarity
  RowMatrix.columnSimilarities(threshold)

Twitter DIMSUM Case Study
  40% efficiency gain over brute-force Cosine Sim

Page 88: Common Tools to Approximate

Twitter Algebird: Composable Library
Redis: Distributed Cache
Apache Spark: Big Data Processing

Page 89: Twitter Algebird

Algebraic Fundamentals
  Parallel
  Associative
  Composable

Examples
  Min, Max, Avg
  BloomFilter (Set.contains(key))
  HyperLogLog (Count Distinct)
  CountMin Sketch (TopK Count)

Page 90: Redis

Implementation of HyperLogLog (Count Distinct)
  12KB per item count
  2^64 max # of items
  0.81% error

Add user views for given movie
  PFADD TopGun_Hour1_HLL user1001 user2009 user3005
  PFADD TopGun_Hour1_HLL user3003 user1001

Get distinct count (cardinality) of set
  PFCOUNT TopGun_Hour1_HLL
  Returns: 4 (distinct users viewed this movie)

Union 2 HyperLogLog Data Structures
  PFMERGE TopGun_Hour1_HLL TopGun_Hour2_HLL

[Slide callouts: ignore duplicates; Tunable]

Page 91: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Approximations in Spark Libraries

Spark Core
  countByKeyApprox(timeout: Long, confidence: Double)
  PartialResult

Spark SQL
  approxCountDistinct(column: Column)
  HyperLogLogPlus

Spark ML
  Stratified sampling
    sampleByKey(fractions: Map[K, Double])
  DIMSUM sampling
    Probabilistic sampling reduces amount of shuffle
    RowMatrix.columnSimilarities(threshold: Double)
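Stratified sampling by key can be illustrated without a cluster. The Python below is a local sketch of the idea behind sampleByKey, not Spark's API: each record is kept with the probability assigned to its key. The function name, the seed, and the sample data are assumptions for this example.

```python
import random

def sample_by_key(pairs, fractions, seed=42):
    """Keep each (key, value) pair with probability fractions[key]."""
    rng = random.Random(seed)
    # Keys missing from `fractions` are dropped (fraction 0.0).
    return [(k, v) for k, v in pairs if rng.random() < fractions.get(k, 0.0)]

# Hypothetical ratings keyed by genre: keep every drama, about half the comedies.
ratings = [("drama", i) for i in range(5)] + [("comedy", i) for i in range(1000)]
sample = sample_by_key(ratings, {"drama": 1.0, "comedy": 0.5})
```

Giving rare keys a higher fraction than common ones keeps small strata from disappearing, which plain uniform sampling cannot guarantee.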

Page 92: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Demo!
Exact Count vs. Approximate HLL and CMS Count

Page 93: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

HashSet vs. HyperLogLog (Memory)

Page 94: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

HashSet vs. CountMin Sketch (Memory)
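The memory trade-off of a CountMin Sketch can be seen in a few lines. This Python sketch is an illustration, not the demo's code: the class, the hashing scheme, and the default width and depth are assumptions. The table has a fixed size no matter how many distinct items arrive, and an estimate can overcount because of collisions but never undercounts.

```python
import hashlib
from collections import Counter

class CountMinSketch:
    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # Fixed-size table: depth rows of width counters.
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # The smallest counter is the one least inflated by collisions.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

# Hypothetical play events, counted exactly and approximately.
plays = ["TopGun"] * 50 + ["Inception"] * 20 + [f"movie{i}" for i in range(500)]
exact = Counter(plays)
cms = CountMinSketch()
for title in plays:
    cms.add(title)
```

An exact counter such as a hash map grows with the number of distinct titles, while the sketch above always holds width × depth counters.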

Page 95: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Demo!
Exact Similarity vs. Approximate LSH Similarity

Page 96: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Brute Force Cartesian All Pair Similarity
47 seconds

Page 97: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Locality Sensitive Hash All Pair Similarity
6 seconds
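The speed-up comes from comparing only items that hash to the same bucket instead of every pair. The Python below is a minimal random-hyperplane LSH sketch for cosine similarity, written for illustration; it is not the demo's code, and the function name, the number of hyperplanes, and the sample vectors are assumptions.

```python
import random
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(vectors, n_planes=8, seed=7):
    """Return pairs of names whose vectors share a hyperplane signature."""
    dim = len(next(iter(vectors.values())))
    rng = random.Random(seed)
    # Random hyperplanes through the origin.
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        # One bit per hyperplane: which side of the plane the vector falls on.
        signature = tuple(
            sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes
        )
        buckets[signature].append(name)
    pairs = set()
    for names in buckets.values():
        # Only items in the same bucket are compared later.
        pairs.update(combinations(sorted(names), 2))
    return pairs

vectors = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],     # same direction as a
    "c": [-1.0, -2.0, -3.0],  # opposite direction
}
candidates = lsh_candidate_pairs(vectors)
```

Vectors pointing the same way always share a signature, while dissimilar vectors usually land in different buckets and are never compared. Similar but not identical vectors can be missed, which is the accuracy given up for speed.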

Page 98: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Many More Demos!

Download Docker
or
Clone on Github

http://advancedspark.com

Page 99: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Presentation Outline
① Scaling
② Similarities
③ Recommendations
④ Approximations
⑤ Netflix Recommendations

Page 100: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Recommendations
From Ratings to Real-time

Page 101: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Has a Lot of Data

Netflix has a lot of data about a lot of users and a lot of movies.
Netflix can use this data to buy new movies.
Netflix is global.
Netflix can use this data to choose original programming.
Netflix knows that a lot of people like politics and Kevin Spacey.

My favorite movie: “Harold and Kumar Go to White Castle”
The UK doesn’t have White Castle. Renamed my favorite movie to: “Harold and Kumar Get the Munchies”
This broke my unit tests!

Summary: Buy NFLX Stock!

Page 102: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Data Pipeline - Then

v1.0
v2.0

Page 103: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Data Pipeline – Now (Keystone)
v3.0

9 million events per second
22 GB per second!!

EC2 D2XL
  Disk: 6 TB, 475 MB/s
  RAM: 30 GB
  Network: 700 Mbps

Auto-scaling, Fault tolerance
A/B Tests, Trending Now
SAMZA
Splits high and normal priority

Page 104: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Recommendation Data Pipeline

Throw away batch user factors (U)

Keep batch video factors (V)

Page 105: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Trending Now (Time-based Recs)
Uses Spark Streaming
Personalized to user (viewing history, past ratings)
Learns and adapts to events (Valentine’s Day)

“VHS”

Calculate Take Rate
  Number of Plays
  Number of Impressions
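Take rate is the number of plays divided by the number of impressions for a video. The Python below is a batch sketch of that calculation over a list of events; the event format, the function name, and the sample events are assumptions for illustration, and a streaming job would apply the same arithmetic per time window.

```python
from collections import Counter

def take_rates(events):
    """events: iterable of (video_id, kind), kind is 'play' or 'impression'."""
    plays, impressions = Counter(), Counter()
    for video_id, kind in events:
        if kind == "play":
            plays[video_id] += 1
        elif kind == "impression":
            impressions[video_id] += 1
    # Videos that were never shown have no defined take rate and are skipped.
    return {v: plays[v] / n for v, n in impressions.items() if n > 0}

# Hypothetical window of events.
events = [
    ("TopGun", "impression"), ("TopGun", "impression"),
    ("TopGun", "impression"), ("TopGun", "impression"),
    ("TopGun", "play"),
    ("Inception", "impression"), ("Inception", "play"),
]
rates = take_rates(events)
```

Ranking videos by take rate within a recent window, rather than by raw play count, is what lets a short-lived spike surface as trending.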

Page 106: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Bonus: Pandora Time-based Recs

Work Days
  Play familiar music
  User is less likely to accept new music

Evenings and Weekends
  Play new music
  User is more likely to accept new music

Page 107: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

$1 Million Netflix Prize (2006-2009)

Goal
  Improve movie predictions by 10% (Root Mean Sq Error)
  Test data withheld to calculate RMSE upon submission

5-star Ratings Dataset
  (userId, movieId, rating, timestamp)

Winning algorithm(s)
  10.06% improvement (RMSE)
  Ensemble of 500+ ML models combined with GBDTs
  Computationally impractical
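For reference, RMSE and the percentage improvement between two scores can be computed as follows. This is a plain Python sketch; the function names and the sample numbers are illustrative and are not the contest's data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def improvement_pct(baseline_rmse, new_rmse):
    """Percentage reduction in RMSE relative to a baseline."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse

# Made-up ratings: every prediction is off by exactly one star.
error = rmse([1.0, 1.0], [2.0, 2.0])
gain = improvement_pct(1.0, 0.9)
```

Because errors are squared before averaging, a few badly wrong predictions hurt the score more than many slightly wrong ones.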

Page 108: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Secrets to the Winning Algorithms
Adjust for the following human biases…
① Alice effect: user rates lower than avg
② Inception effect: movie rated higher than avg
③ Overall mean rating of a movie
④ Number of people who have rated a movie
⑤ Number of days since user’s first rating
⑥ Number of days since movie’s first rating
⑦ Mood, time of day, day of week, season, weather
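The first two adjustments are commonly modeled as a baseline predictor: the global mean plus a per-user offset plus a per-movie offset. The Python below is a simple unregularized sketch of that idea, not the winning team's code; the function names and the four sample ratings are assumptions for illustration.

```python
from collections import defaultdict

def fit_baseline(ratings):
    """ratings: list of (user, movie, rating). Returns (mean, user_bias, movie_bias)."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    # Movie bias: how far a movie's ratings sit from the global mean.
    by_movie = defaultdict(list)
    for _, movie, r in ratings:
        by_movie[movie].append(r - mu)
    b_movie = {m: sum(v) / len(v) for m, v in by_movie.items()}
    # User bias: what is left after removing the mean and the movie bias.
    by_user = defaultdict(list)
    for user, movie, r in ratings:
        by_user[user].append(r - mu - b_movie[movie])
    b_user = {u: sum(v) / len(v) for u, v in by_user.items()}
    return mu, b_user, b_movie

def predict(mu, b_user, b_movie, user, movie):
    # Unknown users or movies fall back to a zero offset.
    return mu + b_user.get(user, 0.0) + b_movie.get(movie, 0.0)

# Alice rates everything a bit low; Inception is rated high by everyone.
ratings = [
    ("alice", "Inception", 4.0), ("bob", "Inception", 5.0),
    ("alice", "TopGun", 2.0), ("bob", "TopGun", 3.0),
]
mu, b_user, b_movie = fit_baseline(ratings)
```

Subtracting this baseline before training leaves a model to explain only the part of a rating that reflects taste rather than habit or popularity.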

Page 109: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Common ML Algorithms
Logistic Regression
Linear Regression
Gradient Boosted Decision Trees
Random Forest
Matrix Factorization
SVD
Restricted Boltzmann Machines
Deep Neural Nets
Markov Models
LDA
Clustering

Ensembles!

Page 110: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Genres and Clusters

Typical Genres
  Documentaries, Romantic Comedies, Horror, Action, Adventure

Latent (Hidden) Clusters
  Emotionally-Independent Dramas for Hopeless Romantics
  Witty Dysfunctional-Family TV Animated Comedies
  Romantic Crime Movies based on Classic Literature
  Latin American Forbidden-Love Movies
  Critically-acclaimed Emotional Drug Movie
  Cerebral Military Movie based on Real Life
  Sentimental Movies about Horses for Ages 11-12
  Gory Canadian Revenge Movies
  Raunchy Mad Scientist Comedy

Page 111: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Social Integration
Post to Facebook after movie start (5 mins)
Recommend to new users based on friends
Helps with Cold Start problem

Page 112: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Search
No results? No problem… Show similar results!
  Utilize extensive DVD Catalog
  Metadata search (ElasticSearch)
  Named entity recognition (NLP)

Empty searches are an opportunity!
  Explicit feedback for future recommendations
  Content to buy and produce!

Page 113: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix A/B Tests
Users tend to click on images featuring…
  Faces with strong emotional expressions
  Villains over heroes
  Small number of cast members

Page 114: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Netflix Recommendation Serving Layer
Use Case: Recommendation service depends on EVCache
Problem: EVCache cluster goes down or becomes latent!?
Answer: github.com/Netflix/Hystrix Circuit Breaker!

Circuit States
  Closed: Service OK
  Open: Service DOWN
    Fallback to Static
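The closed and open states can be sketched in a few lines. The Python below illustrates the circuit breaker pattern only; it is not Hystrix's API, and the class, its parameters, and the injected clock are assumptions for this example. While the circuit is open, the primary call is skipped and the static fallback is returned immediately; after a quiet period one trial call is allowed through.

```python
import time

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    @property
    def state(self):
        return "closed" if self.opened_at is None else "open"

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # open: do not touch the failing service
            # Quiet period elapsed: fall through and try the primary once.
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # trip (or re-trip) the circuit
            return fallback()
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result

def static_recommendations():
    # Hypothetical static fallback, e.g. a precomputed popular list.
    return ["popular-1", "popular-2"]
```

Failing fast while the cache is down keeps request threads from piling up on timeouts, so the recommendation service degrades to static results instead of going down with its dependency.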

Page 115: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Why Higher Average Ratings 2004+?
In 2004, Netflix noticed higher ratings on average
Some possible reasons why…

① Significant UI improvements deployed
② New recommendation engine deployed
③

Page 116: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

Thank You, Everyone!!
Chris Fregly @cfregly
IBM Spark Technology Center
San Francisco, California, USA

http://advancedspark.com
Sign up for the Meetup and Book
Contribute to Github Repo
Run all Demos using Docker

Find me on LinkedIn, Twitter, Github, Email, Fax

Image derived from http://www.duchess-france.org/

Page 117: Spark After Dark 2.0 - Apache Big Data Conf - Vancouver - May 11, 2016

http://advancedspark.com
@cfregly