Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...
-
Upload
chris-fregly -
Category
Software
-
view
1.357 -
download
5
Transcript of Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...
Click to edit Master text styles
Click to edit Master text styles
After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations
Barcelona Spark Meetup
Oct 20th, 2015
Chris Fregly Principal Data Solutions Engineer
IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Who Am I?
2
Streaming Data Engineer Netflix Open Source Committer
Data Solutions Engineer
Apache Contributor
Principal Data Solutions Engineer IBM Technology Center
Meetup Organizer Advanced Apache Meetup
Book Author Advanced (2016)
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Advanced Apache Spark Meetup Total Spark Experts: ~1350+ in 3 mos! #4 most active Spark Meetup in the world! Main Goals Dig deep into the Spark & extended-Spark codebase Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc
Surface and share the patterns and idioms of these well-designed, distributed, big data components
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 4
Core
Spark Streaming
real-time Spark SQL structured data
MLlib machine learning
GraphX graph
analytics
…
BlinkDB approx queries
What is Spark?
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Spark Deployments In Production
5
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Tools of the Talk
6
Redis Docker Cassandra MLlib, GraphX Parquet, JSON Apache Zeppelin Spark Streaming, Kafka Spark SQL, DataFrames Spark JDBC/ODBC Hive ThriftServer ElasticSearch, Logstash, Kibana (ELK)
and…
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
SMACK Stack!
7
S park (Data Processing) M esos (Cluster Manager) A kka (Actors) C assandra (NoSQL) K afka (Streaming)
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Themes of this Talk
Parallelism Performance Streaming Approximations Similarity Measures Recommendations
8
and…
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Goals of Spark After Dark Generate high-quality recommendations
Demonstrate Spark high-level libraries Spark Streaming -> Kafka, Approximates
Spark SQL -> DataFrames, Cassandra
GraphX -> PageRank, Shortest Path
MLlib -> Matrix Factor, Word2Vec
9
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Popular Dating Sites
10
Click to edit Master text styles
Click to edit Master text styles Parallelism
11
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
My First Experience With Parallelism Brady Bunch circa 1980 Season 5, Episode 18: “Two Pete’s in a Pod”
12
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Parallel Algorithm: O(log n)
13
O(log n)
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Non-Parallel Algorithm: O(n)
14
O(n)
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Spark is Parallel!
15
Click to edit Master text styles
Click to edit Master text styles Performance
16
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Spark Beats Hadoop @ 100 TB GraySort
17
On-disk only 28,000 partitions No in-memory caching
(2014) (2013) (2014)
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Improved Shuffle and Network Layer “Sort-based shuffle”
Minimize OS resources
Switched to async Netty
Keep CPUs hot
Reuse byte buffers to minimize GC
Use epoll for I/O to stay in kernel space 18
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Project Tungsten: CPU and Memory More JVM bytecode generation, JIT optimize
CPU-cache-aware data structs and algos -->
Custom memory management Serializers Performance New HashMap
19
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
DataFrames and Catalyst Optimizer
20
20
https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/
Please Use DataFrames!
--> -->
JVM bytecode generation
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Columnar Storage Format
21
Skip whole chunks with min-max heuristicsstored in each chunk (sorted data only)
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Parquet File Format Based on Google Dremel
Implemented by Twitter and Cloudera
Columnar storage format
Optimized for fast columnar aggregations
Tight compression
Supports pushdowns
Nested, self-describing, evolving schema 22
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Types of Compression Run Length Encoding: Repeated data Dictionary Encoding: Fixed set of values
Delta, Prefix Encoding: Sorted data
23
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Types of Query Optimizations Column, Partition Pruning Row, Predicate Pushdown
SELECT b FROM table WHERE a in [a2,a3]
24
Click to edit Master text styles
Click to edit Master text styles Streaming
25
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Direct Kafka Streaming – KafkaRDD No single Receiver, no Write Ahead Log (WAL) Workers pull from Kafka in parallel Each KafkaRDD partition stores relevant offsets Upon Worker Node failure, rebuild from offsets Optimizes happy path by avoiding the WAL
26
At least once delivery guarantee <--
Click to edit Master text styles
Click to edit Master text styles Approximations
27
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Count Min Sketch Approximate counters
Better than HashMap
Low, fixed memory Known error bounds Large num of counters From Twitter’s Algebird Streaming example in Spark codebase
28
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
HyperLogLog Approximate cardinality Approx count distinct ! From Twitter’s Algebird!
Low memory 1.5KB @ 2% error, 10^9 elements !
Streaming example in Spark codebase
RDD: countApproxDistinctByKey() 29
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Monte Carlo Simulations From Manhattan Project (A-bomb) Simulate movement of neutrons Law of Large Numbers (LLN) Average of results of many trials Converge on expected value SparkPi example in Spark codebase Pi ~ (# red dots / # total dots * 4)
30
Click to edit Master text styles
Click to edit Master text styles Recommendations
31
Click to edit Master text styles
Click to edit Master text styles Interactive Demo!
32
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Audience Participation Needed!
33
Navigate to sparkafterdark.com
Click 3 actresses and 3 actors
-> You are here
->
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Types of Recommendations Non-personalized Cold Start No preference or behavior data for user, yet Personalized User-Item Similarity Items that others with similar prefs have liked
Item-Item Similarity Items similar to your previously-liked items
34
Click to edit Master text styles
Click to edit Master text styles Non-Personalized Recommendations
35
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Summary Statistics and Aggregations Top Users by Like Count
“I might like users with the highest sum aggregation of likes overall.”
SparkSQL + DataFrame = Aggregations
36
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Graph Analytics Top Influencers by Like Graph
“I might like users who have the highest probability of me liking them randomly while walking the like graph.”
GraphX: PageRank
37
Click to edit Master text styles
Click to edit Master text styles Demo!
Spark SQL/DataFrames + GraphX/PageRank
38
Click to edit Master text styles
Click to edit Master text styles Similarities
39
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Types of Similarity Euclidean: linear measure Magnitude bias Cosine: angle measure Adjust for magnitude bias Jaccard: (intersection / union) Popularity bias Log Likelihood Adjust for popularity bias 40
Ali Matei Reynold Patrick AndyKimberly 1 1 1 1Leslie 1 1!Meredith 1 1 1Lisa 1 1 1Holden 1 1 1 1 1
z!
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
All-Pairs Similarity Comparison Compare everything to everything aka. “pair-wise similarity” or “similarity join” Naïve shuffle: O(m*n^2); m=rows, n=cols Minimize shuffle through approximations! Reduce m (rows) Sampling and bucketing Reduce n (cols) Remove most frequent value (ie.0) Principle Component Analysis
41
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Reduce m: DIMSUM Sampling “Dimension Independent Matrix Square Using MR” Remove rows with low similarity probability MLlib: RowMatrix.columnSimilarities(…) Twitter: 40% efficiency gain over Cosine Similarity 42
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Reduce m: LSH Bucketing “Locality Sensitive Hashing” Split m into b buckets Use similarity hash algorithm Requires pre-processing of data Compare bucket contents in parallel Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets
ie. 500k x 500k matrix O(1.25e17) -> O(1.25e13); b=50
github.com/mrsqueeze/spark-hash 43
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Reduce n: Remove Most Frequent Value Eliminate most-frequent value Represent other values with (index,value) pairs Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n Note: Choose most frequent value (may not be 0)
44
(index,value)
(index,value)
Click to edit Master text styles
Click to edit Master text styles Personalized Recommendations
45
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Recommendation Terminology User User seeking recommendations Item
Item that has been liked or rated Feedback Explicit: like, rating Implicit: search, click, hover, view, scroll Feature Engineering
Dimension reduction
46
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Collaborative Filtering Personalized Recs Like behavior of similar users
“I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity
47
Click to edit Master text styles
Click to edit Master text styles Demo!
Spark SQL/DataFrames + MLlib/Alternating Least Squares
48
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Text-based Personalized Recs (1/3) Similar profiles to me“Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity
49
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Text Based Personalized Recs (2/3)
50
Similar profiles from my past likes“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Text-based Personalized Recs (3/3) Relevant, High-Value Emails “Your initial email has similar named entities to my profile.
I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition
51
^ Her Email < My Profile
Click to edit Master text styles
Click to edit Master text styles The Future of Recommendations!
52
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Facial Recognition Eigenfaces
“Your face looks similar to others that I’ve liked. I might like you.”
MLlib: RowMatrix, PCA, Item-Item Similarity
53 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Natural Language Processing: Convo Bot NLP and DecisionTrees
“If your responses to my trite opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis
54
Positive Negative
Click to edit Master text styles
Click to edit Master text styles
55
Maintaining the Spark!
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Recommendations for Couples Pathways of Similarity
“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”
MLlib: RowMatrix, Item-Item Similarity GraphX: Nearest Neighbors, Shortest Path similar similar • plots -> <- actors
56
Click to edit Master text styles
Click to edit Master text styles Final Recommendation!
57
Click to edit Master text styles
Click to edit Master text styles
spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark
Get Off the Computer & Meet People! Thank you!!
Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA
Relevant Links advancedspark.com
Signup for the book and meetup! github.com/fluxcapacitor/pipeline
Clone all code used today! hub.docker.com/r/fluxcapacitor/pipeline
Run all demos presented today!
58
Image courtesy of http://www.duchess-france.org/
Click to edit Master text styles
Click to edit Master text styles
Power of data. Simplicity of design. Speed of innovation.
IBM Spark