Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...

Click to edit Master text styles


After Dark Real-time Advanced Analytics, Machine Learning, Graph Analytics, Text NLP, and Recommendations

Barcelona Spark Meetup

Oct 20th, 2015

Chris Fregly Principal Data Solutions Engineer

IBM Spark Technology Center ** We’re Hiring!! Nice People Only, Please. **



spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark

Who Am I?

2

Streaming Data Engineer Netflix Open Source Committer

Data Solutions Engineer

Apache Contributor

Principal Data Solutions Engineer IBM Technology Center

Meetup Organizer Advanced Apache Meetup

Book Author Advanced (2016)




Advanced Apache Spark Meetup Total Spark Experts: ~1350+ in 3 mos! #4 most active Spark Meetup in the world! Main Goals Dig deep into the Spark & extended-Spark codebase Study integrations such as Cassandra, ElasticSearch, Tachyon, S3, BlinkDB, Mesos, YARN, Kafka, R, etc

Surface and share the patterns and idioms of these well-designed, distributed, big data components



spark.tc Power of data. Simplicity of design. Speed of innovation. IBM Spark 4

Core

Spark Streaming

real-time Spark SQL structured data

MLlib machine learning

GraphX graph

analytics

…

BlinkDB approx queries

What is Spark?




Spark Deployments In Production

5




Tools of the Talk

6

  Redis   Docker   Cassandra   MLlib, GraphX   Parquet, JSON   Apache Zeppelin   Spark Streaming, Kafka   Spark SQL, DataFrames   Spark JDBC/ODBC Hive ThriftServer   ElasticSearch, Logstash, Kibana (ELK)

and…




SMACK Stack!

7

S park (Data Processing) M esos (Cluster Manager) A kka (Actors) C assandra (NoSQL) K afka (Streaming)




Themes of this Talk

 Parallelism  Performance  Streaming  Approximations  Similarity Measures  Recommendations

8

and…




Goals of Spark After Dark   Generate high-quality recommendations

  Demonstrate Spark high-level libraries Spark Streaming -> Kafka, Approximates

Spark SQL -> DataFrames, Cassandra

  GraphX -> PageRank, Shortest Path

  MLlib -> Matrix Factor, Word2Vec

9




Popular Dating Sites

10


Click to edit Master text styles Parallelism

11




My First Experience With Parallelism Brady Bunch circa 1980 Season 5, Episode 18: “Two Pete’s in a Pod”

12




Parallel Algorithm: O(log n)

13

O(log n)




Non-Parallel Algorithm: O(n)

14

O(n)




Spark is Parallel!

15


Click to edit Master text styles Performance

16




Spark Beats Hadoop @ 100 TB GraySort

17

  On-disk only   28,000 partitions   No in-memory caching

(2014) (2013) (2014)




Improved Shuffle and Network Layer   “Sort-based shuffle”

  Minimize OS resources

  Switched to async Netty

  Keep CPUs hot

  Reuse byte buffers to minimize GC

  Use epoll for I/O to stay in kernel space 18




Project Tungsten: CPU and Memory   More JVM bytecode generation, JIT optimize

  CPU-cache-aware data structs and algos -->

  Custom memory management Serializers Performance New HashMap

19




DataFrames and Catalyst Optimizer

20

20

https://ogirardot.wordpress.com/2015/05/29/rdds-are-the-new-bytecode-of-apache-spark/

Please Use DataFrames!

--> -->

JVM bytecode generation




Columnar Storage Format

21

Skip whole chunks with min-max heuristicsstored in each chunk (sorted data only)




Parquet File Format  Based on Google Dremel

 Implemented by Twitter and Cloudera

 Columnar storage format

 Optimized for fast columnar aggregations

 Tight compression

 Supports pushdowns

 Nested, self-describing, evolving schema 22




Types of Compression   Run Length Encoding: Repeated data   Dictionary Encoding: Fixed set of values

  Delta, Prefix Encoding: Sorted data

23




Types of Query Optimizations   Column, Partition Pruning   Row, Predicate Pushdown

SELECT b FROM table WHERE a in [a2,a3]

24


Click to edit Master text styles Streaming

25




Direct Kafka Streaming – KafkaRDD   No single Receiver, no Write Ahead Log (WAL)   Workers pull from Kafka in parallel   Each KafkaRDD partition stores relevant offsets   Upon Worker Node failure, rebuild from offsets   Optimizes happy path by avoiding the WAL

26

At least once delivery guarantee <--


Click to edit Master text styles Approximations

27




Count Min Sketch   Approximate counters

  Better than HashMap

  Low, fixed memory   Known error bounds   Large num of counters   From Twitter’s Algebird   Streaming example in Spark codebase

28




HyperLogLog   Approximate cardinality Approx count distinct !  From Twitter’s Algebird!

  Low memory 1.5KB @ 2% error, 10^9 elements !

  Streaming example in Spark codebase

  RDD: countApproxDistinctByKey() 29




Monte Carlo Simulations From Manhattan Project (A-bomb) Simulate movement of neutrons Law of Large Numbers (LLN) Average of results of many trials Converge on expected value SparkPi example in Spark codebase Pi ~ (# red dots / # total dots * 4)

30


Click to edit Master text styles Recommendations

31


Click to edit Master text styles Interactive Demo!

32




Audience Participation Needed!

33

  Navigate to sparkafterdark.com

  Click 3 actresses and 3 actors

-> You are here

->




Types of Recommendations Non-personalized Cold Start No preference or behavior data for user, yet Personalized User-Item Similarity Items that others with similar prefs have liked

Item-Item Similarity Items similar to your previously-liked items

34


Click to edit Master text styles Non-Personalized Recommendations

35




Summary Statistics and Aggregations   Top Users by Like Count

“I might like users with the highest sum aggregation of likes overall.”

SparkSQL + DataFrame = Aggregations

36




Graph Analytics   Top Influencers by Like Graph

“I might like users who have the highest probability of me liking them randomly while walking the like graph.”

GraphX: PageRank

37


Click to edit Master text styles Demo!

Spark SQL/DataFrames + GraphX/PageRank

38


Click to edit Master text styles Similarities

39




Types of Similarity Euclidean: linear measure Magnitude bias Cosine: angle measure Adjust for magnitude bias Jaccard: (intersection / union) Popularity bias Log Likelihood Adjust for popularity bias 40

Ali Matei Reynold Patrick AndyKimberly 1 1 1 1Leslie 1 1!Meredith 1 1 1Lisa 1 1 1Holden 1 1 1 1 1

z!




All-Pairs Similarity Comparison Compare everything to everything aka. “pair-wise similarity” or “similarity join” Naïve shuffle: O(m*n^2); m=rows, n=cols Minimize shuffle through approximations! Reduce m (rows) Sampling and bucketing Reduce n (cols) Remove most frequent value (ie.0) Principle Component Analysis

41




Reduce m: DIMSUM Sampling “Dimension Independent Matrix Square Using MR” Remove rows with low similarity probability MLlib: RowMatrix.columnSimilarities(…) Twitter: 40% efficiency gain over Cosine Similarity 42




Reduce m: LSH Bucketing “Locality Sensitive Hashing” Split m into b buckets Use similarity hash algorithm Requires pre-processing of data Compare bucket contents in parallel Converts O(m*n^2) -> O(m*n/b*b^2); m=rows, n=cols, b=buckets

ie. 500k x 500k matrix O(1.25e17) -> O(1.25e13); b=50

github.com/mrsqueeze/spark-hash 43




Reduce n: Remove Most Frequent Value Eliminate most-frequent value Represent other values with (index,value) pairs Converts O(m*n^2) -> O(m*nnz^2); nnz=num nonzeros, nnz << n Note: Choose most frequent value (may not be 0)

44

(index,value)

(index,value)


Click to edit Master text styles Personalized Recommendations

45




Recommendation Terminology User User seeking recommendations Item

Item that has been liked or rated Feedback Explicit: like, rating Implicit: search, click, hover, view, scroll Feature Engineering

Dimension reduction

46




Collaborative Filtering Personalized Recs   Like behavior of similar users

“I like the same people that you like. What other people did you like that I haven’t seen?” MLlib: Matrix Factorization, User-Item Similarity

47


Click to edit Master text styles Demo!

Spark SQL/DataFrames + MLlib/Alternating Least Squares

48




Text-based Personalized Recs (1/3)   Similar profiles to me“Our profiles have similar, unique k-skip n-grams. We might like each other.” MLlib: Word2Vec, TF/IDF, Doc Similarity

49




Text Based Personalized Recs (2/3)

50

 Similar profiles from my past likes“Your profile shares a similar feature vector space to others that I’ve liked. I might like you.” MLlib: Word2Vec, TF/IDF, Doc Similarity




Text-based Personalized Recs (3/3)   Relevant, High-Value Emails “Your initial email has similar named entities to my profile.

I might like you just for making the effort.” MLlib: Word2Vec, TF/IDF, Entity Recognition

51

^ Her Email < My Profile


Click to edit Master text styles The Future of Recommendations!

52




Facial Recognition   Eigenfaces

“Your face looks similar to others that I’ve liked. I might like you.”

MLlib: RowMatrix, PCA, Item-Item Similarity

53 Image courtesy of http://crockpotveggies.com/2015/02/09/automating-tinder-with-eigenfaces.html




Natural Language Processing: Convo Bot   NLP and DecisionTrees

“If your responses to my trite opening lines are positive, I may read your profile.” MLlib: TF/IDF, DecisionTree, Sentiment Analysis

54

Positive Negative



55

Maintaining the Spark!




Recommendations for Couples   Pathways of Similarity

“I want Mad Max. You want Message In a Bottle. Let’s find something in between to watch tonight.”

MLlib: RowMatrix, Item-Item Similarity GraphX: Nearest Neighbors, Shortest Path similar similar •  plots -> <- actors

56


Click to edit Master text styles Final Recommendation!

57




 Get Off the Computer & Meet People! Thank you!!

Chris Fregly @cfregly IBM Spark Tech Center San Francisco, CA, USA

Relevant Links advancedspark.com

Signup for the book and meetup! github.com/fluxcapacitor/pipeline

Clone all code used today! hub.docker.com/r/fluxcapacitor/pipeline

Run all demos presented today!

58

Image courtesy of http://www.duchess-france.org/



Power of data. Simplicity of design. Speed of innovation.

IBM Spark

Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...

Software

Transcript of Barcelona Spain Apache Spark Meetup Oct 20, 2015: Spark Streaming, Kafka, MLlib, SQL, Project...