Paris Spark Meetup (Feb2015) ccarbone : SPARK Streaming vs Storm / MLLib / NextProductToBuy



Spark Meetup at Viadeo, Wednesday 4 February 2015. 19h00-19h45: presentation of the Spark technology and examples of new business use cases that can be addressed with real-time Big Data. Cédric Carbone, cofounder of Influans, cedric@influans.com. Topics: Spark vs Hadoop MapReduce; Spark Streaming vs Storm; machine learning with Spark; business use case: NextProductToBuy.

19h45-20h00: extending Spark (Tachyon / Spark JobServer). Jonathan Lamiel, Talend Labs, jlamiel@talend.com. Topics: shared memory for Spark with Tachyon; making Spark interactive with Spark JobServer.

20h00-21h00: Big Data analytics with Spark & Cassandra. DuyHai Doan, Technical Advocate at DataStax, duy_hai.doan@datastax.com. Apache Spark is a general data processing framework which allows you to perform data processing tasks in memory. Apache Cassandra is a highly available and massively scalable NoSQL data store. By combining Spark's flexible API with Cassandra's performance, we get an interesting combo for both real-time and batch processing. During this talk we will highlight the tight integration between Spark and Cassandra and demonstrate some usages with a live code demo.

Why Spark? Who has already developed with Hadoop? The Hadoop M/R concept, its limits, => Spark.

Map Reduce. Map(): parses the inputs and generates 0 to n intermediate key/value pairs. Reduce(): sums all the values of the same key and generates a result key/value pair.

WordCount example (Map Reduce): each map takes a line as input and breaks it into words, emitting a key/value pair of (word, 1). Each reducer sums the counts for each word, emitting a key/value pair of (word, sum).
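The WordCount flow above can be sketched in plain Python (no Hadoop needed; the map and reduce phases are simulated with stdlib tools, and all names here are illustrative):

```python
from collections import defaultdict

def mapper(line):
    # Map phase: break a line into words and emit (word, 1) pairs
    return [(word, 1) for word in line.split()]

def reducer(pairs):
    # Reduce phase: sum the counts for each word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["to be or not to be"]
pairs = [p for line in lines for p in mapper(line)]
print(reducer(pairs))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

In real Hadoop the pairs are also shuffled and sorted by key between the two phases; that step is folded into the dictionary grouping here.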

Hadoop MapReduce v1

Reads/writes extensively on HDFS: intermediate results (between Map and Reduce) are written to HDFS, and data between jobs is written to HDFS. Writing to HDFS is not low latency. (Hadoop MapReduce v1)

Too rigid: execution decisions are needed up front, and a single simple Hive query will generate multiple MR jobs. (Hadoop MapReduce v1)

Too low level. MapReduce v1 is not good for low-latency jobs on small datasets.

MapReduce v1 is good for off-line batch jobs on massive data.

Hadoop 1: batch ONLY, high-latency jobs. HDFS (redundant, reliable storage) with MapReduce1 on top, handling both cluster resource management and batch data processing.

Hive (query), Pig (scripting) and Cascading accelerate development. Hadoop 2: a Big Data operating system. Customers want to store ALL DATA in one place and interact with it in MULTIPLE WAYS, simultaneously and with predictable levels of service, for both data analysts and real-time applications.

HDFS (redundant, reliable storage) with MapReduce1 on top for batch data processing.

YARN (cluster resource management) supports other data processing engines. YARN splits up the 2 major functions of the JobTracker: cluster resource management and application life-cycle management.

YARN allows other processing paradigms: it provides a flexible API for implementing YARN apps. MapReduce becomes just one YARN app; a Storm topology is another YARN app.

YARN containers: fine-grained resource allocation (RAM, CPU, disk, GPU, network) thanks to YARN containers, vs the fixed map/reduce slots of MapReduce v1.

The Hadoop 2 stack: YARN (cluster resource management) on top of HDFS (redundant, reliable storage), hosting many workloads side by side:

BATCH (MapReduce), INTERACTIVE (Tez), STREAMING (Storm, Samza, Spark Streaming), GRAPH (Giraph, GraphX), Machine Learning (MLlib), In-Memory (Spark), ONLINE (HBase HOYA), OTHER (ElasticSearch).

Stream processing / real-time processing: https://spark.apache.org

Apache Spark is a fast and general engine for large-scale data processing.

Spark comes from UC Berkeley AMPLab (6 years). Databricks, created by the creators of Spark at the end of 2013, has raised $47M ($14M in 09/2013, then a $33M Series B in June). Spark is OEMed in Cloudera, Hortonworks and MapR. Two weeks ago, Google announced that DataFlow will run on Spark (and a partnership with Cloudera).

The most active project

Spark won the Daytona GraySort contest!

Run programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.

Sort on disk 100TB of data 3x faster than Hadoop MapReduce using 10x fewer machines.

RDD & Operation

Resilient Distributed Datasets (RDDs)

Resilient Distributed Datasets (RDDs): collections of objects spread across a cluster, stored in RAM or on disk (or both); via the API, the user can decide when to keep data in memory and when to persist it. RDDs are built through parallel transformations and automatically rebuilt on failure.

Operations: transformations (e.g. map, filter, groupBy) and actions (e.g. count, collect, save).

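The transformation/action split can be sketched in plain Python (a hypothetical analogy, not Spark's API): transformations build lazy pipelines and nothing runs until an action asks for a result.

```python
def transform_filter(data, pred):
    # "Transformation": returns a lazy generator, nothing computed yet
    return (x for x in data if pred(x))

def transform_map(data, f):
    # Also lazy
    return (f(x) for x in data)

def action_count(data):
    # "Action": forces evaluation of the pipeline and returns a value
    return sum(1 for _ in data)

lines = ["# Apache Spark", "Spark is fast", "Hello world"]
with_spark = transform_filter(lines, lambda l: "Spark" in l)
print(action_count(with_spark))  # 2
```

Spark additionally records the lineage of transformations so a lost partition can be recomputed on failure; generators do not give you that part.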

Spark shell:

scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

scala> textFile.count()
res0: Long = 126

scala> textFile.first()
res1: String = # Apache Spark

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
linesWithSpark: spark.RDD[String] = spark.FilteredRDD@7dd4

scala> textFile.filter(line => line.contains("Spark")).count()
res3: Long = 15

RDD = Resilient Distributed Dataset. count is an action; first is an action; filter is a transformation; a filter (transformation) is followed by an action to get a result.

Streaming

Streaming: Storm

Nimbus = master node (like a JobTracker: the conductor that distributes the workloads). Supervisor = worker node (executes a subset of a topology). Zookeeper = coordination between master and worker nodes.

Storm

A Storm topology is like a MapReduce job (but a Storm topology processes messages forever, until you kill it). In a stream you have spouts and bolts.

Spout = source of data (e.g. connect to the Twitter API and emit a stream of tweets). Bolt = consumes input streams, does some processing (runs functions, filters tuples, does streaming aggregations, does streaming joins, talks to a DB) and possibly emits new streams.
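The spout/bolt idea can be sketched in plain Python (hypothetical names; real Storm topologies are written against the Storm API, typically in Java or Clojure):

```python
def tweet_spout():
    # Spout: a source that emits a stream of tuples
    # (a real spout would pull from e.g. the Twitter API forever)
    for text in ["spark is fast", "storm does streaming"]:
        yield {"text": text}

def split_bolt(stream):
    # Bolt: consumes an input stream, processes it, emits a new stream
    for tweet in stream:
        for word in tweet["text"].split():
            yield word

words = list(split_bolt(tweet_spout()))
print(words)  # ['spark', 'is', 'fast', 'storm', 'does', 'streaming']
```

Bolts can be chained, so a counting bolt could consume `split_bolt`'s output stream in the same way.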

Storm vs Spark:

                        Spark Streaming    Storm              Storm Trident
Processing model        Micro batches      Record-at-a-time   Micro batches
Throughput              ++++               ++                 ++++
Latency                 Second             Sub-second         Second
Reliability model       Exactly once       At least once      Exactly once
Embedded Hadoop distro  HDP, CDH, MapR     HDP                HDP
Support                 Databricks         N/A                N/A
Community               ++++               ++                 ++

                        Spark                               Storm
Scope                   Batch, Streaming, Graph, ML, SQL    Streaming only

With Trident it is possible to do mini batches, get higher throughput, and have exactly-once record processing after recovery from a fault.
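The two processing models compared above can be sketched in plain Python (illustrative helpers, not either framework's API): record-at-a-time handles each event as it arrives, while micro-batching groups events into small windows and processes each window at once.

```python
def record_at_a_time(events, handle):
    # One handler call per record: lowest latency, more per-record overhead
    return [handle(e) for e in events]

def micro_batches(events, handle_batch, size):
    # One handler call per small batch: higher throughput, latency ~ batch size
    out = []
    for i in range(0, len(events), size):
        out.append(handle_batch(events[i:i + size]))
    return out

events = [1, 2, 3, 4, 5]
print(record_at_a_time(events, lambda e: e * 2))  # [2, 4, 6, 8, 10]
print(micro_batches(events, sum, 2))              # [3, 7, 5]
```

This is why the table shows second-level latency but higher throughput for the micro-batch systems (Spark Streaming, Storm Trident) versus sub-second latency for plain Storm.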

Machine Learning Library (MLlib)

Collaborative Filtering

Collaborative Filtering (learning)


Collaborative Filtering: let's use the model

Collaborative Filtering: similar behaviors. Collaborative Filtering: prediction.
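A minimal sketch of how a matrix-factorization recommender (such as MLlib's ALS) produces a prediction once the model is learned: each user and each item gets a latent-factor vector, and the predicted rating is their dot product. The factor values below are made up for illustration.

```python
# Hypothetical learned latent factors (2 dimensions) for users and movies
user_factors = {1: [0.9, 0.2], 2: [0.1, 0.8]}
item_factors = {318: [1.0, 0.1], 356: [0.2, 0.9]}

def predict(user, item):
    # Predicted rating = dot product of user and item factor vectors
    u, v = user_factors[user], item_factors[item]
    return sum(a * b for a, b in zip(u, v))

print(round(predict(1, 318), 2))  # 0.92
```

Users with similar behaviors end up with similar factor vectors, which is how the model transfers one user's ratings to another.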

Netflix Prize (2009)

Netflix is a provider of on-demand Internet streaming media.

Input data (UserID::MovieID::Rating::Timestamp):
1::1193::5::978300760
1::661::3::978302109
1::914::3::978301968
etc.

2::1357::5::978298709
2::3068::4::978299000
2::1537::4::978299620

The result:
1 ; Lyndon Wilson ; 4.608531808535918 ; 858 ; Godfather, The (1972)
1 ; Lyndon Wilson ; 4.596556961095434 ; 318 ; Shawshank Redemption, The (1994)
1 ; Lyndon Wilson ; 4.575789377957803 ; 527 ; Schindler's List (1993)
1 ; Lyndon Wilson ; 4.549694932928024 ; 593 ; Silence of the Lambs, The (1991)
1 ; Lyndon Wilson ; 4.46311974037361 ; 919 ; Wizard of Oz, The (1939)
2 ; Benjamin Harrison ; 4.99545499047152 ; 318 ; Shawshank Redemption, The (1994)
2 ; Benjamin Harrison ; 4.94255532354725 ; 356 ; Forrest Gump (1994)
2 ; Benjamin Harrison ; 4.80168679606128 ; 527 ; Schindler's List (1993)
2 ; Benjamin Harrison ; 4.7874247577586795 ; 1097 ; E.T. the Extra-Terrestrial (1982)
2 ; Benjamin Harrison ; 4.7635998147872325 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.962687467351026 ; 110 ; Braveheart (1995)
3 ; Richard Hoover ; 4.8316542374095315 ; 318 ; Shawshank Redemption, The (1994)
3 ; Richard Hoover ; 4.7307103243995385 ; 356 ; Forrest Gump (19

Real-time Big Data use case: Next Product To Buy. Right person, right product, right price, right time, right channel.
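Parsing the `UserID::MovieID::Rating::Timestamp` input lines shown above is straightforward; a minimal sketch (the function name is ours, not from the slides):

```python
def parse_rating(line):
    # Split a "UserID::MovieID::Rating::Timestamp" line into typed fields
    user, movie, rating, ts = line.split("::")
    return int(user), int(movie), float(rating), int(ts)

print(parse_rating("1::1193::5::978300760"))  # (1, 1193, 5.0, 978300760)
```

In a Spark job this would typically run inside a map over the raw text RDD to build the (user, movie, rating) tuples fed to the recommender.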

Questions? Cédric Carbone, cedric@influans.com, @carbone

www.hugfrance.fr

Join us! Speakers welcome: hug-france-orga@googlegroups.com, @hugfrance

I'm hiring: #Data lovers, #UX front engineers