Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark

Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark

Chris [email protected]

Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer

advancedspark.com

Why After Dark?

Playboy After Dark

Late 1960’s TV Show

Progressive Show For Its Time

And it rhymes!!

What is Spark?

Spark Core

Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
BlinkDB: approx queries
…

Spark in Production

What is Databricks?

Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring

Goals of After Dark?

① Generate high-quality recommendations
② Demonstrate Spark high-level libraries:
    Spark Streaming -> Kafka, Approximations
    Spark SQL -> DataFrames, Cassandra
    GraphX -> PageRank, Shortest Path
    MLlib -> Matrix Factor, Word2Vec

Images courtesy of tinder.com. Not affiliated with Tinder in any way!

Popular Dating Sites


Focus of This Talk

① Parallelism
② Performance
③ Real-time Streaming
④ Approximations
⑤ Similarity Measures

Spark and…

Parallelism


Brady Bunch circa 1974

Season 5, Episode 18: “Two Petes in a Pod”

Parallel Algorithm : O(log n)


Non-parallel Algorithm : O(n)
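The O(n) versus O(log n) contrast can be sketched with a simple sum. This is a minimal illustration, not Spark code: `sequential_sum` and `tree_sum` are hypothetical names, and the "parallel" version only counts rounds, assuming each round's independent additions could run at the same time on enough workers.

```python
def sequential_sum(values):
    """Fold left to right: every addition depends on the previous one."""
    total, steps = values[0], 0
    for v in values[1:]:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise reduction: additions within a round are independent,
    so with enough workers each round costs one step."""
    level, rounds = list(values), 0
    while len(level) > 1:
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


data = list(range(1, 1025))                 # 1024 values
seq_total, seq_steps = sequential_sum(data)  # 1023 dependent steps
par_total, par_rounds = tree_sum(data)       # 10 rounds = log2(1024)
```

Both functions return the same total; only the number of dependent steps differs.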


Spark is Parallel


Performance


Daytona Gray Sort Contest

On-disk only
250,000 partitions
No in-memory caching

(2014) (2013) (2014)

Improved Shuffle and Network Layer

① “Sort-based shuffle”
② Minimize OS resources
③ Switched to async Netty
④ Keep CPUs hot
⑤ Reuse byte buffers to minimize GC
⑥ Use epoll for I/O to stay in kernel space

Project Tungsten: CPU and Memory

① More JVM bytecode generation, JIT optimize
② CPU-cache-aware data structs and algos
③ Custom memory management: Serializers, HashMap

DataFrames and Catalyst

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Please Use DataFrames!!

--> JVM bytecode generation

Columnar Storage Format

*Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
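A minimal sketch of the min-max idea, assuming a single sorted column split into fixed-size chunks. The chunk layout and the names `build_chunks` and `scan_equal` are hypothetical and do not reflect any real columnar file format's internals.

```python
def build_chunks(sorted_values, chunk_size):
    """Split a sorted column into chunks, storing min and max with each one."""
    chunks = []
    for i in range(0, len(sorted_values), chunk_size):
        vals = sorted_values[i:i + chunk_size]
        # The column is sorted, so first and last are the min and max.
        chunks.append({"min": vals[0], "max": vals[-1], "values": vals})
    return chunks


def scan_equal(chunks, target):
    """Return matching values and how many chunks actually had to be read."""
    matches, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk["min"] or target > chunk["max"]:
            continue  # the statistics prove this chunk cannot match
        chunks_read += 1
        matches.extend(v for v in chunk["values"] if v == target)
    return matches, chunks_read


ages = sorted([18, 21, 21, 25, 30, 30, 30, 42, 55, 61, 61, 70])
chunks = build_chunks(ages, chunk_size=4)
matches, chunks_read = scan_equal(chunks, 30)  # only the middle chunk is read
```

On unsorted data the min-max ranges of the chunks overlap heavily, so far fewer chunks can be skipped.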

Parquet File Format

① Based on Google Dremel Paper
② Implemented by Twitter and Cloudera
③ Columnar storage format
④ Optimized for fast columnar aggregations
⑤ Tight compression
⑥ Supports pushdowns
⑦ Nested, self-describing, evolving schema

Types of Compression

① Run Length Encoding: Repeated data
② Dictionary Encoding: Fixed set of values
③ Delta, Prefix Encoding: Sorted dataset
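The three encodings can be sketched on small in-memory lists. These are simplified illustrations with hypothetical function names, not the byte-level encodings any real file format uses, and prefix encoding is left out.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse runs of repeated values into (value, run_length) pairs."""
    return [(v, len(list(group))) for v, group in groupby(values)]


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index = {v: i for i, v in enumerate(dictionary)}
    return dictionary, [index[v] for v in values]


def delta_encode(sorted_values):
    """Store the first value, then only the differences between neighbors."""
    return [sorted_values[0]] + [
        b - a for a, b in zip(sorted_values, sorted_values[1:])
    ]


rle = run_length_encode(["F", "F", "F", "M", "M", "F"])
dictionary, codes = dictionary_encode(["NYC", "SF", "NYC", "LA", "SF"])
deltas = delta_encode([1000, 1003, 1004, 1010])
```

Delta encoding pays off on sorted data because the differences are small numbers that need fewer bits than the original values.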

Types of Pushdowns

① Column, Partition Pruning
② Row, Predicate Filtering

Real-time Streaming

Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL

At least once delivery guarantee
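The offset-based recovery idea can be sketched without Kafka or Spark. Everything here is a stand-in: `kafka_log` is a hypothetical in-memory dictionary playing the role of a topic, and the `OffsetRange` tuple only mimics the idea of a partition remembering which offsets it covers.

```python
from collections import namedtuple

# Hypothetical stand-in for the offsets a KafkaRDD partition would carry.
OffsetRange = namedtuple(
    "OffsetRange", "topic partition from_offset until_offset"
)

# Hypothetical stand-in for a Kafka topic: partition id -> ordered log.
kafka_log = {0: ["a", "b", "c", "d"], 1: ["e", "f", "g"]}


def compute_partition(offset_range):
    """Pull exactly the records in [from_offset, until_offset) from the log."""
    log = kafka_log[offset_range.partition]
    return log[offset_range.from_offset:offset_range.until_offset]


batch = [OffsetRange("likes", 0, 1, 4), OffsetRange("likes", 1, 0, 2)]
first_attempt = [compute_partition(r) for r in batch]

# Simulated worker failure: the result for partition 0 is lost and is
# rebuilt by re-reading the same offsets, with no write ahead log involved.
rebuilt = compute_partition(batch[0])
```

Because a rebuilt partition re-reads the same records, downstream output may see them more than once unless it is idempotent, which is consistent with the at-least-once guarantee on the slide.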

Approximations

26

Count Min Sketch

① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
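A minimal Count-Min Sketch, assuming MD5-derived hash functions and arbitrary `width` and `depth` defaults. This is a teaching sketch and not Algebird's implementation.

```python
import hashlib


class CountMinSketch:
    """Fixed-size table of counters; estimates never undercount."""

    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One column per row, derived from a row-salted hash of the item.
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the best guess.
        return min(self.table[row][col] for row, col in self._cells(item))


cms = CountMinSketch()
for user in ["kim"] * 100 + ["leslie"] * 10 + ["lisa"]:
    cms.add(user)
kim_likes = cms.estimate("kim")  # at least 100, never less
```

Memory stays at `width * depth` counters no matter how many distinct items are added, which is the "low, fixed memory" point above.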

HyperLogLog

① Measures set cardinality: Approx count distinct
② Low memory: 1.5KB @ 2% error, 10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
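A compact HyperLogLog sketch, assuming 2^11 = 2048 registers and a SHA-1 based 64-bit hash. With 6 bits per register that would be about 1.5KB and a standard error of roughly 1.04 / sqrt(2048), or about 2%; linking those numbers to the slide's figure is an assumption. This version stores registers in a plain Python list and is not Algebird's implementation.

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p                       # number of registers
        self.registers = [0] * self.m
        self.alpha = 0.7213 / (1 + 1.079 / self.m)  # valid for m >= 128

    def add(self, item):
        digest = hashlib.sha1(str(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        bits = 64 - self.p
        idx = h >> bits                              # first p bits pick a register
        rest = h & ((1 << bits) - 1)
        rank = bits - rest.bit_length() + 1          # position of leftmost 1-bit
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return estimate


hll = HyperLogLog()
for i in range(100_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # duplicates leave the registers unchanged
estimate = hll.count()    # close to 100,000 distinct users
```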

10 Recommendations


Types of Recommendations

① Non-personalized (2 out of 10)
    Cold Start: No preference or behavior data for user, yet
② Personalized (8 out of 10)
    User-Item Similarity: Items that others with similar prefs have liked
    Item-Item Similarity: Items similar to your previously-liked items

Interactive Demo!


Audience Participation Needed!

① Navigate to sparkafterdark.com
② Click 3 actors and 3 actresses

-> You are here

Non-personalized Recommendations


Summary Statistics and Aggregations

① Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame: Aggregations
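The aggregation itself is a group-and-count. Here is a plain-Python sketch on hypothetical like events; on a Spark DataFrame the same idea would roughly be a `groupBy(...).count()` followed by a descending sort, with column names that depend on the actual schema.

```python
from collections import Counter

# Hypothetical (from_user, to_user) like events.
likes = [
    ("ali", "kim"), ("matei", "kim"), ("reynold", "leslie"),
    ("andy", "kim"), ("ali", "leslie"), ("patrick", "lisa"),
]

# Count how many likes each user has received, then take the top two.
like_counts = Counter(to_user for _, to_user in likes)
top_users = like_counts.most_common(2)
```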

Like Graph Analysis

② Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank
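The random-walk intuition can be sketched as a power iteration over like edges. This is a textbook-style PageRank on a hypothetical graph with an assumed damping factor of 0.85; it is not the GraphX implementation and its ranks are normalized to sum to 1.

```python
def pagerank(edges, damping=0.85, iterations=50):
    nodes = sorted({n for edge in edges for n in edge})
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Random jump: every node gets a small baseline share.
        new_rank = {n: (1 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            # A node with no outgoing likes spreads its rank evenly.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical like graph: (liker, liked).
likes = [
    ("ali", "kim"), ("matei", "kim"), ("reynold", "kim"),
    ("kim", "leslie"), ("leslie", "kim"),
]
ranks = pagerank(likes)
top_influencer = max(ranks, key=ranks.get)
```

Unlike a plain like count, a like from a highly ranked user is worth more than a like from a user nobody likes.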

Demo! Spark SQL + DataFrames + GraphX


Similarity Measures


Types of Similarity

① Euclidean: linear measure. Magnitude bias
② Cosine: angle measure. Adjust for magnitude bias
③ Jaccard: Set intersection divided by union. Popularity bias
④ Log Likelihood: Adjust for pop. bias
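The first three measures are short enough to write out. This is a minimal sketch on hypothetical vectors and sets; log likelihood is omitted.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with the magnitude of the vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores their magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Same taste, different activity level (hypothetical rating vectors).
light_user = [1, 1, 0]
heavy_user = [5, 5, 0]
distance = euclidean_distance(light_user, heavy_user)  # far apart
angle_sim = cosine_similarity(light_user, heavy_user)  # same direction
overlap = jaccard_similarity({"ali", "matei", "reynold"},
                             {"matei", "reynold", "andy"})
```

The two users point in the same direction, so cosine treats them as identical while Euclidean distance separates them; that is the magnitude bias named above.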

Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1

All-pairs Similarity Measure

① Compare everything to everything
② aka. “pair-wise similarity” or “similarity join”
③ Naïve shuffle: O(m*n^2); m=rows, n=cols
④ Minimize shuffle: reduce data size & approx
    Reduce m (rows): Sampling and bucketing
    Reduce n (cols): Remove most frequent value (0?)

Sampling Algo: DIMSUM

① “Dimension Independent Matrix Square Using MR”
② Remove rows with low similarity probability
③ MLlib: RowMatrix.columnSimilarities(…)
④ Twitter: 40% efficiency gain over Cosine

Bucket Algo: Locality Sensitive Hashing

① Split into b buckets using similarity hash algo
    Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix
    O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
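The example figures can be checked by plugging the slide's values into its two expressions. The second expression is evaluated exactly as written, which under standard operator precedence works out to m*n*b.

```python
m = 500_000   # rows
n = 500_000   # cols
b = 50        # buckets

naive_cost = m * n**2              # O(m*n^2)
bucketed_cost = m * n / b * b**2   # O(m*n/b*b^2), read left to right

reduction_factor = naive_cost / bucketed_cost
```

With these inputs the naive cost is 1.25E17 and the bucketed cost is 1.25E13, a 10,000x reduction.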

MLlib: SparseVector vs. DenseVector

① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n

Tip: Choose most frequent value … may not be 0
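Why sparse vectors cut the work can be seen in a dot product that only visits stored entries. This sketch uses plain dictionaries with hypothetical helper names; it is not MLlib's SparseVector class.

```python
def to_sparse(dense, default=0):
    """Keep only entries that differ from the most frequent (default) value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def sparse_dot(a, b):
    """Dot product that touches only the nonzeros of the smaller vector."""
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b[i] for i, v in a.items() if i in b)


dense_a = [0, 0, 3, 0, 0, 0, 1, 0]
dense_b = [0, 2, 3, 0, 0, 0, 0, 4]
sparse_a, sparse_b = to_sparse(dense_a), to_sparse(dense_b)

dense_result = sum(x * y for x, y in zip(dense_a, dense_b))  # 8 multiplications
sparse_result = sparse_dot(sparse_a, sparse_b)               # 2 lookups
```

The `default` parameter reflects the tip above: if the most frequent value is not 0, that value is the one to leave out. The dot-product shortcut shown here is only valid when the omitted value is 0.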

Personalized Recommendations


Personalized Recommendation Terms

① User: User seeking likeable recommendations
② Item: User who has been liked
    *Also a user seeking likeable recommendations!
③ Types of Feedback
    Explicit: rating, like
    Implicit: search, click, hover, view, scroll

Collaborative Filtering Personalized Recs

③ Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
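The quoted intuition maps directly onto a neighborhood-style recommender. This sketch scores unseen candidates by the Jaccard overlap between like sets; it is a swapped-in, simpler technique than MLlib's matrix factorization, and the data and function names are hypothetical.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, top_n=3):
    """Rank people liked by similar users that `user` has not liked yet."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        similarity = jaccard(mine, theirs)
        if similarity == 0:
            continue  # no shared likes, so no evidence of similar taste
        for candidate in theirs - mine:
            if candidate != user:
                scores[candidate] = scores.get(candidate, 0.0) + similarity
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))[:top_n]


likes = {
    "me": {"kim", "leslie"},
    "you": {"kim", "leslie", "meredith"},
    "other": {"kim", "lisa"},
    "stranger": {"holden"},
}
recs = recommend(likes, "me")
```

"you" shares both of my likes, so "meredith" outranks "lisa", who comes from a user sharing only one.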

Text-based Personalized Recs

④ Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
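A small sketch of the k-skip n-gram idea, restricted to bigrams (n = 2) and compared with Jaccard overlap. It stands in for the MLlib tools named above rather than reproducing them, and whitespace tokenization is an assumption.

```python
def skip_bigrams(tokens, k):
    """All ordered token pairs with at most k tokens skipped between them."""
    grams = set()
    for i, first in enumerate(tokens):
        for j in range(i + 1, min(i + k + 2, len(tokens))):
            grams.add((first, tokens[j]))
    return grams


def profile_similarity(text_a, text_b, k=1):
    """Jaccard overlap of the two profiles' k-skip bigram sets."""
    a = skip_bigrams(text_a.lower().split(), k)
    b = skip_bigrams(text_b.lower().split(), k)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


grams = skip_bigrams(["i", "love", "spark"], k=1)
similarity = profile_similarity("I love hiking and Spark",
                                "I love Spark and hiking")
```

Allowing skips lets "I love Spark" and "I really love Spark" share grams that plain adjacent bigrams would miss.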

More Text-based Personalized Recs

⑤ Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity

More Text-based Personalized Recs

⑥ Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition

^ Her Email
< My Profile

Personalized Recommendations:The Future


Facial Recognition

⑦ Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Conversation Starter Bot

⑧ NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis

Positive response ->
<- Negative response

Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html

Maintaining the Spark

Compromise Recommendations (Couples)

⑨ Pathway of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path

similar plots -> … … <- similar actors
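The "something in between" idea can be sketched as a shortest path through an item-item similarity graph. The edges below are hypothetical, and this breadth-first search is a plain-Python stand-in for a graph library's shortest-path routine.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """Breadth-first search; returns the first (fewest-hops) path found."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Hypothetical similarity edges (similar plots or shared actors).
similar = {
    "Mad Max": ["The Road Warrior", "Waterworld"],
    "The Road Warrior": ["Mad Max"],
    "Waterworld": ["Mad Max", "The Postman"],
    "The Postman": ["Waterworld", "Message In a Bottle"],
    "Message In a Bottle": ["The Postman"],
}
path = shortest_path(similar, "Mad Max", "Message In a Bottle")
compromises = path[1:-1]  # the titles between the two starting points
```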


⑩ The Final Recommendation

⑩ Get Off The Computer and Meet People!

linkedin.com/in/cfregly
github.com/[email protected]
@cfregly

Thank you!

Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try !!
