In-Memory Processing with Apache Spark
Vincent Leroy

Transcript of "In-Memory Processing with Apache Spark" (harschalig-membres.imag.fr/.../uploads/sites/125/2016/11/Spark.pdf)

Page 1:

In-Memory Processing with Apache Spark

Vincent Leroy

Page 2:

Sources

•  Resilient Distributed Datasets, Henggang Cui
•  Coursera Introduction to Apache Spark, University of California, Databricks

Page 3:

Datacenter Organization

[Diagram: speeds and costs inside a datacenter.
 CPUs to memory: 10 GB/s.
 Fast storage (flash/SSD): 600 MB/s, 0.1 ms random access, $0.45 per GB.
 Hard disk: 100 MB/s, 3-12 ms random access, $0.05 per GB.
 Network: 1 Gb/s or 125 MB/s between nodes in the same rack, 0.1 Gb/s to nodes in another rack.]

Page 4:

MapReduce Execution (Map Reduce: Distributed Execution)

[Diagram: the MAP phase followed by the REDUCE phase, distributed across nodes. Each stage passes through the hard drives.]

Page 5:

Iterative Jobs (Map Reduce: Iterative Jobs)

•  Iterative jobs involve a lot of disk I/O for each repetition → slow when executing many small iterations

[Diagram: Stage 1 → Stage 2 → Stage 3, with disk I/O between stages. Disk I/O is very slow!]

Page 6:

Memory Cost (Tech Trend: Cost of Memory)

[Chart: price per MB over the years for memory, disk, and flash (source: http://www.jcmit.com/mem2014.htm). In 2010, memory cost about 1 ¢/MB.]

Lower cost means you can put more memory in each server.

Page 7:

In-Memory Processing

•  Many datasets fit in memory (of a cluster)
•  Memory is fast and avoids disk I/O
→ Spark distributed execution engine

Page 8:

Replace Disk with Memory (Use Memory Instead of Disk)

[Diagram: with MapReduce, an iterative job does an HDFS read and an HDFS write around every iteration (iteration 1, iteration 2, ...), and each interactive query (query 1, query 2, query 3, ...) re-reads the input from HDFS to produce its result.]

Page 9:

Replace Disk with Memory (In-Memory Data Sharing)

[Diagram: the input is read from HDFS once (one-time processing); iterations and queries (query 1, 2, 3, ...) then share the data through distributed memory, which is 10-100x faster than network and disk.]
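A minimal PySpark sketch of this sharing pattern, assuming a SparkContext named sc and an illustrative input path that is not from the slides:

# One-time processing: read the input and mark it to be kept in
# distributed memory (it is materialized on first use)
logs = sc.textFile("/path/to/input.log").cache()

# Several queries reuse the same in-memory data instead of re-reading HDFS
errors = logs.filter(lambda line: "Error" in line).count()    # query 1
warnings = logs.filter(lambda line: "Warn" in line).count()   # query 2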

Page 10:

Spark Architecture (Apache Spark Components)

Apache Spark (core engine), with libraries on top of it:
•  Spark Streaming
•  Spark SQL
•  MLlib & ML (machine learning)
•  GraphX (graph)

Page 11:

Resilient Distributed Datasets (RDDs)

•  Data Collection
  –  Distributed
  –  Read-only
  –  In-memory
  –  Built from stable storage or other RDDs

Page 12:

RDD Creation

Create a Base RDD

Parallelize
# Parallelize in Python
wordsRDD = sc.parallelize(["fish", "cats", "dogs"])

Take an existing in-memory collection and pass it to SparkContext's parallelize method.

Read from Text File
# Read a local txt file in Python
linesRDD = sc.textFile("/path/to/README.md")

There are other methods to read data from HDFS, C*, S3, HBase, etc.
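These snippets assume a SparkContext already bound to the name sc, as in the interactive shells. A minimal sketch of creating one yourself in a standalone Python script (the application name and the "local[*]" master are illustrative choices, not from the slides):

from pyspark import SparkConf, SparkContext

# Configure and create the SparkContext used as `sc` above
conf = SparkConf().setAppName("rdd-creation-demo").setMaster("local[*]")
sc = SparkContext(conf=conf)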

Page 13:

Operations on RDDs

•  Transformations: lazy execution
  –  map, filter, intersection, groupByKey, zipWithIndex, ...
•  Actions: trigger execution of transformations
  –  collect, count, reduce, saveAsTextFile, ...
•  Caching
  –  Avoid multiple executions of transformations when re-using an RDD (see the sketch below)
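A minimal sketch of these three ideas in PySpark (assumes the sc SparkContext from the earlier slides; the data is made up):

# Transformations are lazy: nothing runs yet
nums = sc.parallelize(range(10))
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Caching: keep the computed RDD in memory so later actions reuse it
evens.cache()

# Actions trigger execution of the transformations above
print(evens.count())      # first action: computes and caches evens
print(evens.collect())    # second action: served from the cache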

Page 14:

RDD Example

[Diagram: logLinesRDD (the input/base RDD) holds log records such as (Error, ts, msg1), (Warn, ts, msg2), (Info, ts, msg8), (Info, ts, msg5), ... spread over four partitions. Applying .filter(λ) keeps only the Error records and yields errorsRDD, containing only records like (Error, ts, msg1), (Error, ts, msg3), (Error, ts, msg4).]

Page 15:

RDD Example

[Diagram: .coalesce(2) merges the partitions of errorsRDD into cleanedRDD, which holds the same Error records in only two partitions. .collect() then sends the records back to the Driver.]
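A rough PySpark version of this example (the log lines here are shortened stand-ins for the slides' records):

# Base RDD of log lines, split into 4 partitions as in the diagram
logLinesRDD = sc.parallelize([
    "Error, ts, msg1", "Warn, ts, msg2", "Error, ts, msg1",
    "Info, ts, msg8", "Warn, ts, msg2", "Error, ts, msg3",
    "Info, ts, msg5", "Error, ts, msg4", "Warn, ts, msg9",
], 4)

# Transformation: keep only the error lines (the λ in the diagram)
errorsRDD = logLinesRDD.filter(lambda line: line.startswith("Error"))

# Transformation: shrink to 2 partitions
cleanedRDD = errorsRDD.coalesce(2)

# Action: bring the remaining lines back to the driver
print(cleanedRDD.collect())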

Page 16:

RDD Lineage

[Diagram: the lineage DAG recorded on the Driver: logLinesRDD → errorsRDD (via .filter(λ)) → cleanedRDD (via .coalesce(2)), followed by the .collect() action.]
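Spark can print the lineage it has recorded for an RDD. For the pipeline above, a debug dump looks roughly like this (the exact operator names and numbers depend on the Spark version):

# Ask Spark for the recorded lineage of cleanedRDD
print(cleanedRDD.toDebugString())
# (2) CoalescedRDD[...] at coalesce at ...
#  |  PythonRDD[...] at ...                 <- the filter
#  |  ParallelCollectionRDD[...] at parallelize at ...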

Page 17:

Execution

[Diagram: the Driver coordinates execution of the logLinesRDD → errorsRDD → cleanedRDD pipeline across the nodes of the cluster.]

Page 18:

RDD Fault Tolerance

•  Hadoop: conservative/pessimistic approach
  → go to disk / stable storage (HDFS)
•  Spark: optimistic approach
  → don't "waste" time writing to disk; re-compute lost data after a crash using the lineage

Page 19:

RDD Dependencies

[Figure: narrow vs. wide dependencies between the partitions of an RDD and its parent RDD(s).]

Page 20:

RDD Dependencies

•  Narrow dependencies
  –  allow for pipelined execution on one cluster node
  –  easy fault recovery
•  Wide dependencies
  –  require data from all parent partitions to be available and to be shuffled across the nodes
  –  a single failed node might cause a complete re-execution
Both kinds are illustrated in the sketch below.
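A small PySpark sketch of the distinction (the key/value pairs are made up for illustration):

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)], 2)

# Narrow dependency: each output partition depends on exactly one parent
# partition, so this can be pipelined on the node holding that partition
doubled = pairs.mapValues(lambda v: v * 2)

# Wide dependency: groupByKey needs data from all parent partitions,
# which must be shuffled across the nodes
grouped = pairs.groupByKey()
print(grouped.mapValues(list).collect())   # e.g. [('a', [1, 3]), ('b', [2])]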

Page 21:

Job Scheduling

•  To execute an action on an RDD
  –  the scheduler derives the stages from the RDD's lineage graph
  –  each stage contains as many pipelined transformations with narrow dependencies as possible
The word count sketch below shows where such a stage boundary appears.
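A PySpark sketch under an assumed file path: flatMap and map have narrow dependencies, so the scheduler pipelines them into one stage, while reduceByKey introduces a wide dependency (a shuffle) and therefore starts a new stage.

lines = sc.textFile("/path/to/some.txt")

# Stage 1: narrow dependencies, pipelined together
pairs = lines.flatMap(lambda l: l.split(" ")).map(lambda w: (w, 1))

# Stage boundary: reduceByKey shuffles data, so Stage 2 begins here
counts = pairs.reduceByKey(lambda a, b: a + b)

counts.count()   # the action that triggers both stages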

Page 22:

Job Scheduling

[Figure: a lineage graph broken into stages, with stage boundaries at the wide (shuffle) dependencies.]

Page 23:

Spark Interactive Shell

Word count in Spark

scala> val wc = lesMiserables.flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_)
wc: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:14

scala> wc.take(5).foreach(println)
(créanciers;,1)
(abondent.,1)
(plaisir,,5)
(déplaçaient,1)
(sociale,,7)

scala> val cw = wc.map(p => (p._2, p._1))
cw: org.apache.spark.rdd.RDD[(Int, String)] = MappedRDD[5] at map at <console>:16

scala> val sortedCW = cw.sortByKey(false)
sortedCW: org.apache.spark.rdd.RDD[(Int, String)] = ShuffledRDD[11] at sortByKey at <console>:18

scala> sortedCW.take(5).foreach(println)
(16757,de)
(14683,)
(11025,la)
(9794,et)
(8471,le)

scala> sortedCW.filter(x => "Cosette".equals(x._2)).collect.foreach(println)
(353,Cosette)