Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

After Dark Generating High-Quality Recommendations using Real-time Advanced Analytics and Machine Learning with Chris Fregly [email protected]

Transcript of Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Page 1: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Spark After Dark
Generating High-Quality Recommendations using Real-time Advanced Analytics and Machine Learning with Spark

Chris Fregly
[email protected]

Page 2: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Who am I?

Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer

advancedspark.com

Page 3: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Why After Dark?

Playboy After Dark

Late 1960’s TV Show

Progressive Show For Its Time

And it rhymes!!

Page 4: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

What is Spark?

Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
BlinkDB: approx queries

Page 5: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Spark in Production

Page 6: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

What is Databricks?

Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring

Page 7: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Goals of After Dark?

① Generate high-quality recommendations
② Demonstrate Spark high-level libraries:
   Spark Streaming -> Kafka, Approximates
   Spark SQL -> DataFrames, Cassandra
   GraphX -> PageRank, Shortest Path
   MLlib -> Matrix Factor, Word2Vec

Images courtesy of tinder.com. Not affiliated with Tinder in any way!

Page 8: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Popular Dating Sites


Page 9: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Focus of This Talk

Spark and…

① Parallelism
② Performance
③ Real-time Streaming
④ Approximations
⑤ Similarity Measures

Page 10: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Parallelism


Page 11: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Brady Bunch circa 1974

Season 5, Episode 18: “Two Petes in a Pod”

Page 12: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Parallel Algorithm : O(log n)


Page 13: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Non-parallel Algorithm : O(n)

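The O(n) versus O(log n) contrast on the last two slides can be sketched in plain Python. This is an illustration only, not Spark code: the function names are made up, and the tree version reaches O(log n) rounds only under the assumption that every pair within a round is combined by a separate worker at the same time.

```python
def sequential_sum(values):
    """Fold the values one at a time: n dependent steps."""
    total, steps = 0, 0
    for v in values:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Combine neighbours pairwise. Each round halves the list, and all
    pairs inside one round are independent, so they could run in parallel."""
    level = list(values)
    rounds = 0
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return (level[0] if level else 0), rounds


data = list(range(1, 17))            # 16 hypothetical values
seq_total, seq_steps = sequential_sum(data)
tree_total, tree_rounds = tree_sum(data)
print(seq_total, seq_steps)          # same total, 16 sequential steps
print(tree_total, tree_rounds)       # same total, 4 parallel rounds
```

Both functions return the same total; only the number of dependent steps differs.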

Page 14: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Spark is Parallel


Page 15: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Performance


Page 16: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Daytona Gray Sort Contest

On-disk only
250,000 partitions
No in-memory caching


Page 17: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Improved Shuffle and Network Layer

① “Sort-based shuffle”
② Minimize OS resources
③ Switched to async Netty
④ Keep CPUs hot
⑤ Reuse byte buffers to minimize GC
⑥ Use epoll for I/O to stay in kernel space

Page 18: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Project Tungsten: CPU and Memory

① More JVM bytecode generation, JIT optimize
② CPU-cache-aware data structs and algos
③ Custom memory management
   Serializers
   HashMap

Page 19: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

DataFrames and Catalyst

Please Use DataFrames!!

--> JVM bytecode generation

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Page 20: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Columnar Storage Format

*Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
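A minimal sketch of the min-max idea, in plain Python rather than any real columnar reader. The `Chunk` class, the chunk size, and the sample column are all assumptions made for the example; the point is only that a chunk whose stored maximum cannot satisfy the predicate is never scanned.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    values: list
    lo: int   # minimum value stored in the chunk
    hi: int   # maximum value stored in the chunk


def build_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks and record min/max per chunk."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        block = sorted_values[i:i + chunk_size]
        chunks.append(Chunk(block, min(block), max(block)))
    return chunks


def scan_greater_than(chunks, threshold):
    """Return matching values and how many chunks were actually read."""
    matches, scanned = [], 0
    for chunk in chunks:
        if chunk.hi <= threshold:
            continue          # no value in this chunk can match: skip it
        scanned += 1
        matches.extend(v for v in chunk.values if v > threshold)
    return matches, scanned


ages = list(range(18, 58))                    # hypothetical sorted column
chunks = build_chunks(ages, chunk_size=10)    # 4 chunks of 10 values
matches, scanned = scan_greater_than(chunks, 50)
print(len(chunks), scanned, matches)
```

On unsorted data the min-max ranges of the chunks overlap heavily, which is why the slide limits the trick to sorted data.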

Page 21: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Parquet File Format

① Based on Google Dremel Paper
② Implemented by Twitter and Cloudera
③ Columnar storage format
④ Optimized for fast columnar aggregations
⑤ Tight compression
⑥ Supports pushdowns
⑦ Nested, self-describing, evolving schema

Page 22: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Types of Compression

① Run Length Encoding
   Repeated data
② Dictionary Encoding
   Fixed set of values
③ Delta, Prefix Encoding
   Sorted dataset
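The three encodings can be sketched in a few lines of plain Python. These are textbook versions written for illustration, not Parquet's actual on-disk encodings, and the sample values are made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [index[value] for value in values]


def delta_encode(sorted_values):
    """Keep the first value, then only the difference to the previous one."""
    if not sorted_values:
        return []
    deltas = [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return [sorted_values[0]] + deltas


print(run_length_encode(["F", "F", "F", "M", "M"]))
print(dictionary_encode(["NY", "SF", "NY"]))
print(delta_encode([1000, 1003, 1004, 1010]))
```

Each function matches one line of the slide: runs for repeated data, small indexes for a fixed set of values, and small differences for a sorted dataset.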

Page 23: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Types of Pushdowns

① Column, Partition Pruning
② Row, Predicate Filtering

Page 24: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Real-time Streaming


Page 25: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Direct Kafka Streaming (KafkaRDD)

① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL

At least once delivery guarantee
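Why stored offsets are enough to rebuild after a failure can be shown with a small simulation. Nothing here talks to Kafka or Spark: the `OffsetRange` dataclass, the in-memory `logs` dictionary, and the topic name are stand-ins invented for the sketch. The only property it relies on is that a Kafka partition is an append-only log, so the same offset range always yields the same records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OffsetRange:
    topic: str
    partition: int
    from_offset: int    # inclusive
    until_offset: int   # exclusive


def plan_batch(topic, committed, latest):
    """One range per partition, covering everything not yet processed."""
    return [OffsetRange(topic, p, committed[p], latest[p])
            for p in sorted(latest)]


def read_range(log, offset_range):
    """Re-reading the same range of an append-only log is deterministic."""
    return log[offset_range.from_offset:offset_range.until_offset]


# Hypothetical in-memory stand-in for a two-partition topic.
logs = {0: ["a0", "a1", "a2"], 1: ["b0", "b1"]}
committed = {0: 1, 1: 0}
latest = {p: len(log) for p, log in logs.items()}

batch = plan_batch("likes", committed, latest)
first_try = [read_range(logs[r.partition], r) for r in batch]
# Simulated worker loss: recompute the batch from the stored offsets alone.
retry = [read_range(logs[r.partition], r) for r in batch]
print(first_try == retry, retry)
```

Because a retried batch re-reads the same records, any side effects from the failed attempt may be repeated, which is where the at-least-once guarantee on the slide comes from.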

Page 26: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Approximations


Page 27: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Count Min Sketch

① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
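A minimal Count-Min Sketch in plain Python, to make the fixed-memory and error-bound points concrete. This is not Algebird's implementation; the class, its default depth and width, and the SHA-1 based hashing are choices made for the example.

```python
import hashlib


class CountMinSketch:
    """depth x width grid of counters; memory is fixed no matter how many
    distinct items are added."""

    def __init__(self, depth=4, width=1000):
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, chosen by a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.sha1(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum over the rows
        # is the tightest estimate and is never below the true count.
        return min(self.table[row][col] for row, col in self._cells(item))


sketch = CountMinSketch()
for user in ["alice"] * 5 + ["bob"] * 2:
    sketch.add(user)
print(sketch.estimate("alice"), sketch.estimate("bob"))
```

A wider table lowers the expected overcount and a deeper one lowers the chance of a bad estimate, which is the trade-off behind the known error bounds.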

Page 28: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

HyperLogLog

① Measures set cardinality
   Approx count distinct
② Low memory
   1.5KB @ 2% error
   10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
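A compact HyperLogLog in plain Python shows how a fixed array of small registers can estimate a distinct count. This is a simplified sketch, not Algebird's or Spark's implementation: it uses SHA-1 as the hash, applies only the small-range correction, and its bias constant assumes at least 128 registers (p >= 7).

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=10):
        self.p = p                      # the first p hash bits pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Position of the leftmost 1-bit in the remaining 64 - p bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)          # valid for m >= 128
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)   # small-range fix
        return raw


hll = HyperLogLog(p=10)
for i in range(10000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")    # duplicates leave the registers unchanged
print(round(hll.count()))
```

With 1,024 registers the expected relative error is roughly 1.04 / sqrt(1024), about 3%, regardless of how many elements are added.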

Page 29: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

10 Recommendations


Page 30: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Types of Recommendations

① Non-personalized (2 out of 10)
   Cold Start: No preference or behavior data for user, yet
② Personalized (8 out of 10)
   User-Item Similarity: Items that others with similar prefs have liked
   Item-Item Similarity: Items similar to your previously-liked items

Page 31: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Interactive Demo!


Page 32: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Audience Participation Needed!

① Navigate to sparkafterdark.com
② Click 3 actors and 3 actresses

-> You are here

Page 33: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Non-personalized Recommendations


Page 34: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Summary Statistics and Aggregations

① Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame: Aggregations
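The aggregation itself is a group-by and count followed by a descending sort. A plain-Python sketch with a made-up list of (from_user, to_user) like events, standing in for the DataFrame used in the demo:

```python
from collections import Counter


def top_users_by_like_count(likes, n):
    """Count likes received per user and return the n most-liked users."""
    counts = Counter(to_user for _from_user, to_user in likes)
    return counts.most_common(n)


# Hypothetical like events: (from_user, to_user)
likes = [("u1", "u3"), ("u2", "u3"), ("u1", "u2"), ("u4", "u3"), ("u3", "u2")]
print(top_users_by_like_count(likes, 2))
```

The same shape of query on a DataFrame would group by the liked user's column, count, and order by that count descending.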

Page 35: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Like Graph Analysis

② Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank
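The random walk in the quote is what PageRank computes. Below is a small power-iteration version in plain Python, assuming a damping factor of 0.85 and a made-up like graph; it normalizes ranks to sum to 1 and is not a reproduction of GraphX's implementation.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power iteration over a directed graph given as (src, dst) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank


# Hypothetical like graph: a likes c, b likes c, c likes a.
ranks = pagerank([("a", "c"), ("b", "c"), ("c", "a")])
print(max(ranks, key=ranks.get), ranks)
```

User c collects likes from both a and b and ends up with the highest rank, while b, liked by nobody, keeps only the teleport share.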

Page 36: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Demo! Spark SQL + DataFrames + GraphX


Page 37: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Similarity Measures


Page 38: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Types of Similarity

① Euclidean: linear measure
   Magnitude bias
② Cosine: angle measure
   Adjust for magnitude bias
③ Jaccard: Set intersection divided by union
   Popularity bias
④ Log Likelihood
   Adjust for pop. bias

Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
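The first three measures fit in a few lines of plain Python. The vectors and sets below are made up for illustration, and log likelihood is left out to keep the sketch short.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


light_rater = [1, 1, 0]   # hypothetical users with identical taste
heavy_rater = [3, 3, 0]   # but different rating intensity
print(euclidean_distance(light_rater, heavy_rater))
print(cosine_similarity(light_rater, heavy_rater))
print(jaccard_similarity({"Ali", "Matei"}, {"Matei", "Andy"}))
```

The two raters point in the same direction, so cosine treats them as identical while Euclidean distance keeps them apart: that is the magnitude bias on the slide.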

Page 39: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

All-pairs Similarity Measure

① Compare everything to everything
② aka. “pair-wise similarity” or “similarity join”
③ Naïve shuffle: O(m*n^2); m=rows, n=cols
④ Minimize shuffle: reduce data size & approx

Reduce m (rows)
   Sampling and bucketing
Reduce n (cols)
   Remove most frequent value (0?)
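For reference, the naive all-pairs computation looks like this in plain Python: an exact, brute-force cosine similarity between every pair of columns. It is the baseline that sampling and bucketing try to avoid, not an implementation of either, and the small like matrix is made up.

```python
import math
from itertools import combinations


def column_cosine_similarities(rows):
    """Exact cosine similarity for every pair of columns.
    n*(n-1)/2 pairs, each costing m multiplications: O(m*n^2)."""
    n_cols = len(rows[0])
    columns = [[row[j] for row in rows] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]

    similarities = {}
    for i, j in combinations(range(n_cols), 2):
        dot = sum(a * b for a, b in zip(columns[i], columns[j]))
        if dot and norms[i] and norms[j]:
            similarities[(i, j)] = dot / (norms[i] * norms[j])
    return similarities


# Hypothetical like matrix: rows are users, columns are liked profiles.
rows = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
print(column_cosine_similarities(rows))
```

Only pairs with a non-zero dot product are kept, so columns 0 and 1 are reported as similar and column 2 pairs with nothing.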

Page 40: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Sampling Algo: DIMSUM

① “Dimension Independent Matrix Square Using MR”
② Remove rows with low similarity probability
③ MLlib: RowMatrix.columnSimilarities(…)
④ Twitter: 40% efficiency gain over Cosine

Page 41: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Bucket Algo: Locality Sensitive Hashing

① Split into b buckets using similarity hash algo
   Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix
   O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
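One common way to build such buckets is MinHash with banding, sketched here in plain Python. This is an assumed variant chosen for illustration and is not taken from the spark-hash project; the number of hash functions, the number of bands, and the sample like sets are made up, and every set is assumed to be non-empty.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def minhash_signature(items, num_hashes):
    """For each seeded hash function, keep the smallest hash over the set."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{item}".encode()).digest()[:8], "big")
            for item in items))
    return signature


def lsh_candidate_pairs(sets_by_id, num_hashes=12, bands=4):
    """Bucket ids by each band of their signature; only ids that share a
    bucket are compared later."""
    rows_per_band = num_hashes // bands
    buckets = defaultdict(set)
    for key, items in sets_by_id.items():
        signature = minhash_signature(items, num_hashes)
        for band in range(bands):
            chunk = tuple(signature[band * rows_per_band:
                                    (band + 1) * rows_per_band])
            buckets[(band, chunk)].add(key)

    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(sorted(members), 2))
    return candidates


likes = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b", "c"},
    "u3": {"x", "y", "z"},
}
print(lsh_candidate_pairs(likes))
```

Identical sets always share every bucket, and sets with nothing in common share none, so only u1 and u2 are compared in full. Partially overlapping sets become candidates with a probability that rises with their Jaccard similarity.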

Page 42: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

MLlib: SparseVector vs. DenseVector

① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n

Tip: Choose most frequent value … may not be 0
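A plain-Python sketch of why sparse vectors cut the work: only entries that differ from the most frequent value are stored, and a dot product touches only those entries. The dictionary representation and helper names are assumptions for illustration, not MLlib's SparseVector.

```python
def to_sparse(dense, default=0):
    """Keep only the entries that differ from the most frequent value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def sparse_dot(a, b):
    """Dot product that visits only stored entries: O(nnz) instead of O(n).
    Valid as written only when the dropped value is 0."""
    if len(a) > len(b):
        a, b = b, a                     # iterate over the smaller vector
    return sum(v * b[i] for i, v in a.items() if i in b)


dense_a = [0, 2, 0, 0, 3]
dense_b = [1, 4, 0, 0, 5]
sparse_a, sparse_b = to_sparse(dense_a), to_sparse(dense_b)
print(sparse_a, sparse_b, sparse_dot(sparse_a, sparse_b))
```

If the most frequent value is not 0, as the tip warns, it can still be dropped from storage, but arithmetic such as the dot product then has to account for it explicitly.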

Page 43: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Personalized Recommendations


Page 44: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Personalized Recommendation Terms

① User
   User seeking likeable recommendations
② Item
   User who has been liked
   *Also a user seeking likeable recommendations!
③ Types of Feedback
   Explicit: rating, like
   Implicit: search, click, hover, view, scroll

Page 45: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Collaborative Filtering Personalized Recs

③ Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
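The quote describes neighbourhood-style collaborative filtering, which can be sketched in plain Python. Note the swap: this uses Jaccard overlap between users' like sets, not the matrix factorization that MLlib provides, and the user names and like data are made up.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, k=2):
    """Score people liked by similar users that `user` has not liked yet."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue                      # no shared likes, no signal
        for item in theirs - mine:
            if item != user:              # never recommend users to themselves
                scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:k]


likes = {
    "me": {"ann", "bea"},
    "you": {"ann", "bea", "cat"},
    "them": {"ann", "dee"},
    "other": {"zed"},
}
print(recommend(likes, "me"))
```

"you" overlaps most with "me", so the extra person "you" liked is ranked first; "other" shares no likes and contributes nothing.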

Page 46: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Text-based Personalized Recs

④ Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
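The TF/IDF plus document-similarity part can be sketched in plain Python. This uses single words and the unsmoothed textbook weighting tf * log(N / df), so its numbers will not match MLlib's output, and it covers neither Word2Vec nor k-skip n-grams. The profile texts are made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn each document into a {term: tf * idf} dictionary."""
    tokenized = [doc.lower().split() for doc in docs]
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({term: count * math.log(n_docs / doc_freq[term])
                        for term, count in term_freq.items()})
    return vectors


def cosine(a, b):
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


profiles = [
    "hiking coffee dogs",
    "hiking dogs travel",
    "opera chess wine",
]
vectors = tfidf_vectors(profiles)
print(cosine(vectors[0], vectors[1]), cosine(vectors[0], vectors[2]))
```

The first two profiles share terms and get a positive similarity; the third shares none with the first and scores exactly zero.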

Page 47: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

More Text-based Personalized Recs

⑤ Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity

Page 48: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

More Text-based Personalized Recs

⑥ Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition

^ Her Email
< My Profile

Page 49: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Personalized Recommendations: The Future

Page 50: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Facial Recognition

⑦ Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Page 51: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Conversation Starter Bot

⑧ NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis

Positive response ->
Negative response <-

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Page 52: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark


Maintaining the

Page 53: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

Compromise Recommendations (Couples)

⑨ Pathway of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path

similar plots -> … … <- similar actors

Page 54: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark


⑩ The Final Recommendation

Page 55: Spark After Dark:  Real time Advanced Analytics and Machine Learning with Spark

⑩ Get Off The Computer and Meet People!

linkedin.com/in/cfregly
github.com/[email protected]
@cfregly

Thank you!

Image courtesy of http://www.duchess-france.org/

Free trial at databricks.com
Try Databricks!!