Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Transcript of Spark After Dark: Real time Advanced Analytics and Machine Learning with Spark
Spark After Dark
Generating High-Quality Recommendations using
Real-time Advanced Analytics and Machine Learning with Spark
Chris Fregly
[email protected]
Who am I?
Streaming Platform Engineer
Streaming Data Engineer
Netflix Open Source Committer
Data Solutions Engineer
Apache Spark Contributor
Spark Author
Consultant, Trainer
advancedspark.com
Why After Dark?
Playboy After Dark
Late 1960s TV Show
Progressive Show For Its Time
And it rhymes!!
What is Spark?
Spark Core
Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph analytics
…
BlinkDB: approx queries
in Production
What is Databricks?
Founded by the creators of Spark
Spark as a Service
Amazon AWS based
Powerful Visualizations
Collaborative Notebooks
Scala/Java, Python, SQL, R
Flexible Cluster Management
Job Scheduling and Monitoring
Goals of After Dark?
① Generate high-quality recommendations
② Demonstrate Spark high-level libraries:
   Spark Streaming -> Kafka, Approximations
   Spark SQL -> DataFrames, Cassandra
   GraphX -> PageRank, Shortest Path
   MLlib -> Matrix Factorization, Word2Vec
Popular Dating Sites
Images courtesy of tinder.com. Not affiliated with Tinder in any way!
Focus of This Talk
Spark and…
① Parallelism
② Performance
③ Real-time Streaming
④ Approximations
⑤ Similarity Measures
Parallelism
Brady Bunch circa 1980
Season 5, Episode 18: “Two Petes in a Pod”
Parallel Algorithm: O(log n)
Non-parallel Algorithm: O(n)
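A minimal Python sketch of the O(log n) versus O(n) contrast above, using a sum as the example. The function names are made up for illustration, and the "parallel" rounds are only simulated in one thread: the point is the number of dependent steps, not real speed.

```python
def sequential_sum(values):
    """Fold left to right: one addition per element, so O(n) steps."""
    total, steps = 0, 0
    for v in values:
        total += v
        steps += 1
    return total, steps


def tree_sum(values):
    """Pairwise (tree) reduction.

    Every pair within a round is independent of the others, so with
    enough workers each round costs one time step and the number of
    rounds is O(log n). The rounds are simulated sequentially here.
    """
    level, rounds = list(values), 0
    while len(level) > 1:
        # Combine neighbours; an odd element out is carried forward.
        level = [sum(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return (level[0] if level else 0), rounds
```

For 1,024 elements the sequential version takes 1,024 dependent steps, while the tree version needs 10 rounds.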
Spark is Parallel
Performance
Daytona Gray Sort Contest
On-disk only
250,000 partitions
No in-memory caching
(2014) (2013) (2014)
Improved Shuffle and Network Layer
① “Sort-based shuffle”
② Minimize OS resources
③ Switched to async Netty
④ Keep CPUs hot
⑤ Reuse byte buffers to minimize GC
⑥ Use epoll for I/O to stay in kernel space
Project Tungsten: CPU and Memory
① More JVM bytecode generation, JIT optimization
② CPU-cache-aware data structures and algorithms
③ Custom memory management: Serializers, HashMap
DataFrames and Catalyst
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!!
--> JVM bytecode generation
Columnar Storage Format
*Skip whole chunks with min-max heuristics stored in each chunk (sorted data only)
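A rough Python sketch of the min-max chunk skipping described above. `Chunk`, `build_chunks` and `scan_equal` are hypothetical names, not part of Spark or Parquet; real columnar formats keep these statistics in file metadata rather than in Python objects.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    values: List[int]   # one column's values for this chunk
    min_value: int      # statistics stored alongside the chunk
    max_value: int


def build_chunks(column, chunk_size):
    """Split a column into fixed-size chunks and record min/max for each."""
    chunks = []
    for i in range(0, len(column), chunk_size):
        part = column[i:i + chunk_size]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def scan_equal(chunks, target):
    """Return matching values and how many chunks were actually read."""
    hits, chunks_read = [], 0
    for chunk in chunks:
        if target < chunk.min_value or target > chunk.max_value:
            continue  # target cannot be in this chunk: skip it unread
        chunks_read += 1
        hits.extend(v for v in chunk.values if v == target)
    return hits, chunks_read
```

On a sorted column each chunk covers a narrow range, so a lookup reads very few chunks. On unsorted data most chunks span nearly the whole value range and little can be skipped, which is why the slide restricts the trick to sorted data.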
Parquet File Format
① Based on Google Dremel Paper
② Implemented by Twitter and Cloudera
③ Columnar storage format
④ Optimized for fast columnar aggregations
⑤ Tight compression
⑥ Supports pushdowns
⑦ Nested, self-describing, evolving schema
Types of Compression
① Run Length Encoding: repeated data
② Dictionary Encoding: fixed set of values
③ Delta, Prefix Encoding: sorted dataset
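Toy Python versions of three of the encodings listed above, to show why each one suits the kind of data named next to it. These are illustrative sketches with made-up function names, not Parquet's on-disk encodings, and prefix encoding is not covered.

```python
def run_length_encode(values):
    """[a, a, a, b] -> [(a, 3), (b, 1)]; pays off on repeated data."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1] = (v, runs[-1][1] + 1)
        else:
            runs.append((v, 1))
    return runs


def dictionary_encode(values):
    """Replace each value with a small integer code; suits a fixed set of values."""
    dictionary, codes = {}, []
    for v in values:
        # A new value gets the next free code; a known value reuses its code.
        codes.append(dictionary.setdefault(v, len(dictionary)))
    return dictionary, codes


def delta_encode(sorted_values):
    """Store the first value, then differences; yields small numbers on sorted data."""
    if not sorted_values:
        return []
    deltas = [sorted_values[0]]
    deltas += [b - a for a, b in zip(sorted_values, sorted_values[1:])]
    return deltas
```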
Types of Pushdowns
① Column, Partition Pruning
② Row, Predicate Filtering
Real-time Streaming
Direct Kafka Streaming (KafkaRDD)
① No single Receiver, no Write Ahead Log (WAL)
② Workers pull from Kafka in parallel
③ Each KafkaRDD partition stores relevant offsets
④ Upon Worker Node failure, rebuild from offsets
⑤ Optimizes happy path by avoiding the WAL
<-- At-least-once delivery guarantee
Approximations
Count Min Sketch
① Approximate counters
② Better than HashMap
③ Low, fixed memory
④ Known error bounds
⑤ Large num of counters
⑥ Available in Twitter’s Algebird
⑦ Streaming example in Spark codebase
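A small Python sketch of a Count-Min Sketch, to make the "low, fixed memory" and "approximate counters" points above concrete. This is not Algebird's implementation; the class, the default `width` and `depth`, and the salted SHA-1 hashing are assumptions chosen for readability.

```python
import hashlib


class CountMinSketch:
    """Approximate counters in fixed memory: `depth` rows of `width` counters."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha1(f"{row}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions only ever inflate a counter, so the minimum across
        # rows is the tightest estimate and never undercounts.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))
```

Memory stays at `width * depth` counters no matter how many distinct keys are added, which is the advantage over a HashMap with one entry per key. The price is that an estimate can be too high, never too low.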
HyperLogLog
① Measures set cardinality: approx count distinct
② Low memory: 1.5KB @ 2% error, 10^9 elements!
③ From Twitter’s Algebird
④ Streaming example in Spark codebase
⑤ RDD: countApproxDistinctByKey()
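A compact Python sketch of the HyperLogLog estimator, under stated assumptions: `p=11` (2,048 registers) and a 64-bit slice of SHA-1 as the hash are choices made here, not taken from Algebird or Spark. With 2,048 registers the usual standard error estimate of about 1.04/sqrt(m) comes to roughly 2.3%, and 2,048 six-bit registers would occupy about 1.5 KB in a packed implementation (a Python list uses far more).

```python
import hashlib
import math


class HyperLogLog:
    def __init__(self, p=11):
        self.p = p
        self.m = 1 << p                    # number of registers
        self.registers = [0] * self.m
        # Bias-correction constant for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash
        index = h >> (64 - self.p)                   # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)        # remaining 64 - p bits
        # rank = 1-based position of the leftmost 1-bit in `rest`
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        raw = self.alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Adding the same item twice never changes a register, so duplicates do not affect the estimate. That is what makes the structure a distinct counter.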
10 Recommendations
Types of Recommendations
① Non-personalized (2 out of 10)
   Cold Start: no preference or behavior data for user, yet
② Personalized (8 out of 10)
   User-Item Similarity: items that others with similar prefs have liked
   Item-Item Similarity: items similar to your previously-liked items
Interactive Demo!
Audience Participation Needed!
① Navigate to sparkafterdark.com
② Click 3 actors and 3 actresses
-> You are here
Non-personalized Recommendations
Summary Statistics and Aggregations
① Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
Spark SQL + DataFrame: Aggregations
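The aggregation behind "top users by like count" can be sketched in plain Python. The `(liker, liked)` event shape and the function name are assumptions for illustration; with DataFrames the same idea would roughly be a group-by on the liked user, a count, and a descending sort.

```python
from collections import Counter


def top_liked(likes, n=10):
    """likes: iterable of (liker, liked) pairs. Returns the n most-liked users."""
    counts = Counter(liked for _liker, liked in likes)
    return counts.most_common(n)
```

This recommendation needs no information about the person asking, which is why it still works for a cold-start user.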
Like Graph Analysis
② Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
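A power-iteration PageRank in plain Python, to illustrate the "random walk over the like graph" idea above. This is a textbook formulation with ranks normalized to sum to 1, not GraphX's implementation; the damping factor of 0.85 and the iteration count are conventional defaults assumed here.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) 'like' edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by users who like nobody (dangling nodes) is spread evenly.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n
                    for node in nodes}
        for src in nodes:
            for dst in out_links[src]:
                # Each user splits their rank equally among the users they like.
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank
```

A user's rank is the long-run probability that a random walker following likes, with occasional random jumps, is standing on that user.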
Demo! Spark SQL + DataFrames + GraphX
Similarity Measures
Types of Similarity
① Euclidean: linear measure; magnitude bias
② Cosine: angle measure; adjusts for magnitude bias
③ Jaccard: set intersection divided by union; popularity bias
④ Log Likelihood: adjusts for popularity bias
Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
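The first three measures above can be sketched in a few lines of Python. The function names are made up, the vectors in the usage note are illustrative rather than taken from the slide's like matrix, and log likelihood is not covered.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between the vectors; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def jaccard_similarity(a, b):
    """Treat each 0/1 vector as a set of liked indices: |A & B| / |A | B|."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0
```

For example, `[1, 1]` and `[2, 2]` point in the same direction, so their cosine similarity is 1 even though their Euclidean distance is not 0: that is the magnitude bias cosine adjusts for. A user who likes everyone, such as `[1, 1, 1, 1, 1]`, overlaps with every other user under Jaccard, which hints at the popularity bias.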
All-pairs Similarity Measure
① Compare everything to everything
② aka “pair-wise similarity” or “similarity join”
③ Naïve shuffle: O(m*n^2); m=rows, n=cols
④ Minimize shuffle: reduce data size & approximate
   Reduce m (rows): sampling and bucketing
   Reduce n (cols): remove most frequent value (0?)
Sampling Algo: DIMSUM
① “Dimension Independent Matrix Square Using MR”
② Remove rows with low similarity probability
③ MLlib: RowMatrix.columnSimilarities(…)
④ Twitter: 40% efficiency gain over Cosine
Bucket Algo: Locality Sensitive Hashing
① Split into b buckets using similarity hash algo
   Requires pre-processing of data
② Compare bucket contents in parallel
③ Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
④ Example: 500k x 500k matrix
   O(1.25E17) -> O(1.25E13); b=50
⑤ github.com/mrsqueeze/spark-hash
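The example figures above can be checked with a few lines of Python. The function names are hypothetical, and the bucketed expression is evaluated left to right exactly as written on the slide, which is the reading that reproduces its numbers.

```python
def naive_cost(m, n):
    """All-pairs comparison cost from the slide: m * n^2."""
    return m * n ** 2


def bucketed_cost(m, n, b):
    """The slide's m*n/b*b^2, evaluated left to right."""
    return m * n // b * b ** 2


m = n = 500_000
b = 50
print(f"{naive_cost(m, n):.2E} -> {bucketed_cost(m, n, b):.2E}")
```

With these inputs the estimated cost drops by a factor of 10,000.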
MLlib: SparseVector vs. DenseVector
① Remove columns using sparse vectors
② Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n
Tip: Choose most frequent value … may not be 0
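A plain-Python sketch of why sparse vectors cut the work from n to nnz per row. The dict-based representation and function names are assumptions made for illustration and are not MLlib's SparseVector API; `sparse_dot` assumes the omitted value is 0.

```python
def to_sparse(dense, default=0):
    """Keep only the entries that differ from the default (most frequent) value."""
    return {i: v for i, v in enumerate(dense) if v != default}


def sparse_dot(a, b):
    """Dot product that only touches indices stored in both vectors.

    Assumes every omitted entry is 0.
    """
    if len(a) > len(b):
        a, b = b, a            # iterate over the smaller vector
    return sum(v * b[i] for i, v in a.items() if i in b)
```

The `default` parameter mirrors the tip above: if most users like almost everyone, storing the rare non-likes is the smaller representation.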
Personalized Recommendations
Personalized Recommendation Terms
① User: user seeking likeable recommendations
② Item: user who has been liked
   *Also a user seeking likeable recommendations!
③ Types of Feedback
   Explicit: rating, like
   Implicit: search, click, hover, view, scroll
Collaborative Filtering Personalized Recs
③ Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?”
MLlib: Matrix Factorization, User-Item Similarity
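The quoted idea can be sketched as neighborhood-based collaborative filtering in plain Python. This is a simpler technique than the matrix factorization named on the slide, swapped in here because it fits in a few lines; the data layout and function names are made up for illustration.

```python
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(likes, user, top_n=3):
    """likes: dict mapping each user to the set of people they liked."""
    mine = likes[user]
    scores = {}
    for other, theirs in likes.items():
        if other == user:
            continue
        weight = jaccard(mine, theirs)    # how similar our likes are
        if weight == 0:
            continue
        # Candidates: people the similar user liked that I have not seen.
        for candidate in theirs - mine - {user}:
            scores[candidate] = scores.get(candidate, 0.0) + weight
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in ranked[:top_n]]
```

Each candidate is scored by the summed similarity of the users who liked them, so likes from users with tastes close to mine count for more.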
Text-based Personalized Recs
④ Similar profiles to each other
“Our profiles have similar, unique k-skip n-grams. We might like each other.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
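A minimal TF-IDF plus cosine sketch in Python for comparing profile text. It uses whitespace tokens rather than k-skip n-grams or Word2Vec, and the smoothed IDF formula and function names are assumptions chosen for this example, not a claim about MLlib's internals.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many profiles each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
                  * math.log((n_docs + 1) / (doc_freq[term] + 1))
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Terms that appear in every profile get a weight of 0, so only the distinctive words two profiles share contribute to their similarity.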
More Text-based Personalized Recs
⑤ Similar profiles from my past likes
“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.”
MLlib: Word2Vec, TF/IDF, Doc Similarity
More Text-based Personalized Recs
⑥ Relevant, High-Value Emails
“Your initial email has similar named entities to my profile. I might like you just for making the effort.”
MLlib: Word2Vec, TF/IDF, Entity Recognition
^ Her Email
< My Profile
Personalized Recommendations: The Future
Facial Recognition
⑦ Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Conversation Starter Bot
⑧ NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I might actually read your profile.”
MLlib: TF/IDF, DecisionTree, Sentiment Analysis
Positive response ->
<- Negative response
Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Maintaining the
Compromise Recommendations (Couples)
⑨ Pathway of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity
GraphX: Nearest Neighbors, Shortest Path
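The "pathway of similarity" idea can be sketched as a breadth-first shortest path over an item-similarity graph. This is plain Python rather than GraphX, and the graph in the usage note uses placeholder titles: which movies are actually similar to each other is not something the slide states.

```python
from collections import deque


def shortest_path(graph, start, goal):
    """graph: dict mapping each item to the items it is similar to.

    Returns the list of items from start to goal with the fewest hops,
    or None if the two are not connected.
    """
    previous = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:       # walk back to the start
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in graph.get(node, ()):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None
```

With a hypothetical graph such as `{"Mad Max": ["Title A"], "Title A": ["Mad Max", "Title B"], "Title B": ["Title A", "Message In a Bottle"], "Message In a Bottle": ["Title B"]}`, the items in the middle of the returned path are the compromise candidates.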
similar plots -> <- similar actors
… …
⑩ The Final Recommendation
⑩ Get Off The Computer and Meet People!
linkedin.com/in/cfregly
github.com/[email protected]
@cfregly
Thank you!
Image courtesy of http://www.duchess-france.org/
Free trial at databricks.com
Try !!