Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

34
Spark Meetup chez Viadeo Mercredi 4 février 2015 • 19h-19h45 : Présentation de la technologie Spark et exemple de nouveaux cas métiers pouvant être traités par du BigData temps réel. Cédric Carbone - Cofondateur d'Influans – [email protected] -Spark vs Hadoop MapReduce -Spark Streaming vs Storm -Le Machine Learning avec Spark -Use case métier : NextProductToBuy • 19h45-20h : Extension de Spark (Tachyon / Spark JobServer). Jonathan Lamiel - Talend Labs – [email protected] -La mémoire partagée de Spark avec Tachyon -Rendre Spark Interactif avec Spark JobServer • 20h-21h : Big Data analytics avec Spark & Cassandra. DuyHai DOAN - Technical Advocate at DataStax – duy_hai.doan @datastax.com Apache Spark is a general data processing framework which allows you perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark flexible API and Cassandra performance, we get an interesting combo for both real-time and batch processing.

Transcript of Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Page 1: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Spark Meetup chez ViadeoMercredi 4 février 2015

• 19h-19h45 : Présentation de la technologie Spark et exemple de nouveaux cas métiers pouvant être traités par du BigData temps réel. Cédric Carbone - Cofondateur d'Influans – [email protected] -Spark vs HadoopMapReduce-Spark Streaming vs Storm -Le Machine Learning avec Spark -Use case métier : NextProductToBuy

• 19h45-20h : Extension de Spark (Tachyon / Spark JobServer). Jonathan Lamiel - Talend Labs – [email protected] mémoire partagée de Spark avec Tachyon -Rendre Spark Interactif avec Spark JobServer

• 20h-21h : Big Data analytics avec Spark & Cassandra. DuyHai DOAN - Technical Advocate at DataStax – [email protected] Spark is a general data processing framework which allows you perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data-store. By combining Spark flexible API and Cassandra performance, we get an interesting combo for both real-time and batch processing.

Page 2: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Map Reduce

➜ Map() : parse inputs and generate 0 to n <key, value>

➜ Reduce() : sums all values of the same key and generate a <key, value>

WordCount Example

➜ Each map take a line as an input and break into words

• It emits a key/value pair of the word and 1

➜ Each Reducer sums the counts for each word

• It emits a key/value pair of the word and sum

Page 3: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Map Reduce

Page 4: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop MapReduce v1

Page 5: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop MapReduce v1

Page 6: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop MapReduce v1

Page 7: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

MapReduce v1

Not good for low-latency jobs on smallest dataset

Page 8: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

MapReduce v1

Good for off-line batch jobs on massive data

Page 9: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop 1

➜ Batch ONLY

• High latency jobs

HDFS (Redundant, Reliable Storage)

MapReduce1Cluster Resource Management + Data Processing

BATCH

HIVEQuery

PigScripting

CascadingAccelerate Dev.

Page 10: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop2 : Big Data

Operating System

➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS

• Simultaneously & with predictable levels of service

• Data analysts and real-time applications

HDFS (Redundant, Reliable Storage)

MapReduce1Data Processing

BATCH

YARN (Cluster Resource Management)

OtherData Processing

Page 11: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop2 : Big Data

Operating System

➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS

• Simultaneously & with predictable levels of service

• Data analysts and real-time applications

HDFS (Redundant, Reliable Storage)

BATCH INTERACTIVE STREAMING GRAPH ML IN-MEMORYONLINE SEARCH

YARN (Cluster Resource Management)

Page 12: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Hadoop2 : Big Data

Operating System

➜ Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS

• Simultaneously & with predictable levels of service

• Data analysts and real-time applications

HDFS (Redundant, Reliable Storage)

YARN (Cluster Resource Management)

BATCH(MapReduce)

INTERACTIVE(Tez)

STREAMING(Storm, SamzaSpark Streaming)

GRAPH(Giraph,GraphX)

MachineLearning(MLLIb)

In-Memory(Spark)

ONLINE(Hbase HOYA)

OTHER(ElasticSearch)

Page 13: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

https://spark.apache.org

Apache Spark™ is a fast and general engine for large-scale data processing.

Page 14: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

The most active project

0

50

100

150

200

250

Patches

MapReduce Storm

Yarn Spark

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

Lines Added

MapReduce Storm

Yarn Spark

Page 15: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Spark won the

Daytona GraySort contest!

Run programs up to 100x faster than HadoopMapReduce in memory, or 10x faster on disk.

Sort on disk 100TB of data 3x faster than HadoopMapReduce using 10x fewer machines.

Page 16: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

RDD & Operation

Resilient Distributed Datasets (RDDs)

Operations

➜ Transformations (e.g. map, filter, groupBy)

➜ Actions (e.g. count, collect, save)

Page 17: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Spark

scala> val textFile = sc.textFile("README.md")

➜ textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count()

➜ res0: Long = 126

scala> textFile.first()

➜ res1: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line =>

line.contains("Spark"))

➜ linesWithSpark: spark.RDD[String]=spark.FilteredRDD@7dd4

scala>

textFile.filter(line=>line.contains("Spark")).count()

➜ res3: Long = 15

Page 18: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Streaming

Streaming

Page 19: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Storm

Page 20: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Storm

Page 21: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Storm vs Spark

Spark Streaming Storm Storm Trident

Processing model Micro batches Record-at-a-time Micro batches

Thoughput ++++ ++ ++++

Latency Second Sub-second Second

Reliability Models Exactly once At least once Exactly once

Embedded Hadoop Distro HDP, CDH, MapR HDP HDP

Support Databricks N/A N/A

Community ++++ ++ ++

Spark Storm

Scope Batch, Streaming, Graph, ML, SQL Streaming only

Page 22: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Machine Learning Library (Mllib)

Page 23: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering

Page 24: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering

(learning)

Page 25: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering

(learning)

Page 26: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering

(learning)

Page 27: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering :

Let’s use the model

Page 28: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering :

similar behaviors

Page 29: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Collaborative Filtering

Prediction

Page 30: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Netflix Prize (2009)

Netflix is a provider of on-demand Internet streaming media

Page 31: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Input Data

UserID::MovieID::Rating::Timestamp1::1193::5::9783007601::661::3::9783021091::914::3::978301968Etc…

2::1357::5::9782987092::3068::4::9782990002::1537::4::978299620

Page 32: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

The result

1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19

Page 33: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Real Time Big Data

Use Case

Next Product To Buy

➜ Right Person

➜ Right Product

➜ Right Price

➜ Right Time

➜ Right Channel

Page 34: Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy

Questions?

Cédric Carbone

[email protected]

@carbone

www.hugfrance.fr

[email protected]

@hugfrance