High-Speed In-Memory Analytics over Hadoop and Hive...

Slides adopted from Matei Zaharia (MIT) and Oliver Vagner (NCR) Spark & Spark SQL High-Speed In-Memory Analytics over Hadoop and Hive Data CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech Instructor: Duen Horng (Polo) Chau 1

Transcript of High-Speed In-Memory Analytics over Hadoop and Hive...

Page 1: High-Speed In-Memory Analytics over Hadoop and Hive Datapoloclub.gatech.edu/.../2017fall/slides/CSE6242-620-ScalingUp-spark… · Spark & Spark SQL High-Speed In-Memory Analytics

Slides adopted from Matei Zaharia (MIT) and Oliver Vagner (NCR)

Spark & Spark SQL
High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech

Instructor: Duen Horng (Polo) Chau

Page 2: What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org

Page 3: What is Spark SQL? (Formerly called Shark)

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Page 4: Project History [latest: v2.1]

Spark project started in 2009 at UC Berkeley AMPLab, open sourced 2010

Became Apache Top-Level Project in Feb 2014

Shark/Spark SQL started summer 2011

Built by 250+ developers and people from 50 companies

Scales to 1000+ nodes in production

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …

http://en.wikipedia.org/wiki/Apache_Spark

Pages 5-6: Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g. iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

These require faster data sharing across parallel jobs

Pages 8-9: Data Sharing in MapReduce

[Figure: two workloads. Iterative job: Input → HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → HDFS write → ... Interactive queries: Input → HDFS read → query 1, query 2, query 3, ... → result 1, result 2, result 3, ...]

Slow due to replication, serialization, and disk I/O

Pages 10-11: Data Sharing in Spark

[Figure: two workloads sharing data through distributed memory. Iterative job: Input → iter. 1 → iter. 2 → ... Interactive queries: Input → one-time processing → distributed memory → query 1, query 2, query 3, ...]

10-100× faster than network and disk

Page 12: Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from Scala, Python console
» Supported languages: Java, Scala, Python, R
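The RDD idea above (a collection that is described lazily, computed on demand, and optionally kept in memory) can be sketched on a single machine in plain Python. This is only an illustration of the concept, not Spark's API: the class name ToyRDD, the helper read_source, and the call counter are all made up for this sketch.

```python
class ToyRDD:
    """A single-machine stand-in for an RDD: lazy, cacheable, rebuilt from its source."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function that produces the elements
        self._cached = None          # filled in on first use if cache() was called
        self._use_cache = False

    def map(self, f):
        # Nothing runs here; we only describe how to derive the new collection.
        return ToyRDD(lambda: [f(x) for x in self.collect()])

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self.collect() if pred(x)])

    def cache(self):
        self._use_cache = True
        return self

    def collect(self):
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return self._cached
        return self._compute()

    def count(self):
        return len(self.collect())


calls = []                                    # records every time the source is read

def read_source():                            # hypothetical data source
    calls.append("read")
    return [1, 2, 3, 4, 5, 6]

evens = ToyRDD(read_source).filter(lambda x: x % 2 == 0).cache()
print(len(calls))                             # 0: nothing has run yet
print(evens.map(lambda x: x * 10).collect())  # [20, 40, 60]
print(evens.count())                          # 3
print(len(calls))                             # 1: the second query reused the cache
```

The source is read once even though two queries ran, which is the behavior the slides attribute to cached RDDs.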

Page 13:

http://www.scala-lang.org/old/faq/4

Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html

Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations

Pages 14-32: Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns

val lines = spark.textFile("hdfs://...")            // base RDD
val errors = lines.filter(_.startsWith("ERROR"))    // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count          // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the Driver sends tasks to three Workers. Each Worker reads one block of the file (Block 1, Block 2, Block 3), keeps its part of cachedMsgs in memory (Cache 1, Cache 2, Cache 3), and sends results back to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html
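For readers who want to trace the log-mining steps without a cluster, here is the same filter / map / cache / count sequence on a local Python list. The sample log lines are invented, and the tab-separated layout with the message in the third field is an assumption taken from the split('\t')(2) call on the slide.

```python
# Hypothetical log lines: level <TAB> date <TAB> message
log_lines = [
    "ERROR\t2017-10-01\tfoo service timed out",
    "INFO\t2017-10-01\tfoo service started",
    "ERROR\t2017-10-02\tbar disk full",
    "ERROR\t2017-10-03\tfoo and bar both failed",
]

# Same steps as the Scala pipeline, on a local list instead of an RDD.
errors = [line for line in log_lines if line.startswith("ERROR")]
messages = [line.split("\t")[2] for line in errors]
cached_msgs = list(messages)        # kept in memory and reused by every query below

def count_matching(pattern):
    """Count cached messages containing the pattern (the 'action' step)."""
    return sum(1 for msg in cached_msgs if pattern in msg)

print(count_matching("foo"))   # 2
print(count_matching("bar"))   # 2
```

Only the first three statements touch the raw log; each later search scans the in-memory list, which is the point of calling cache() in the Spark version.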

Page 33: Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
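A rough way to picture lineage-based recovery: each dataset remembers its parent and the function applied to it, so a lost partition can be rebuilt by replaying that chain for just that partition. The sketch below is plain Python with invented names (Lineage, blocks, in_memory); it mirrors the source → filter → map chain on the slide but is not how Spark is implemented.

```python
class Lineage:
    """One step in a lineage chain: a parent step plus the function applied to it."""

    def __init__(self, parent, kind, func):
        self.parent = parent   # previous step, or None for the source
        self.kind = kind       # "source", "filter", or "map"
        self.func = func

    def compute(self, partition_id):
        """Rebuild one partition by replaying the chain from the source."""
        if self.kind == "source":
            return list(self.func(partition_id))
        rows = self.parent.compute(partition_id)
        if self.kind == "filter":
            return [r for r in rows if self.func(r)]
        return [self.func(r) for r in rows]


# Hypothetical input, split into two partitions (stand-ins for HDFS blocks).
blocks = {
    0: ["error\ta\tdisk full", "info\tb\tok"],
    1: ["error\tc\ttimeout"],
}

source = Lineage(None, "source", lambda pid: blocks[pid])
filtered = Lineage(source, "filter", lambda line: "error" in line)
messages = Lineage(filtered, "map", lambda line: line.split("\t")[2])

in_memory = {pid: messages.compute(pid) for pid in blocks}
del in_memory[1]                           # simulate losing one partition
in_memory[1] = messages.compute(1)         # recompute only that partition
print(in_memory)                           # {0: ['disk full'], 1: ['timeout']}
```

Partition 0 is never recomputed; only the lost partition is replayed from its block.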

Pages 34-37: Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once

var w = Vector.random(D)                                // initial parameter vector

// repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)
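The gradient formula on this slide can be checked on a tiny dataset in plain Python. The sketch assumes labels y in {-1, +1}, starts w at zero instead of a random vector so the run is repeatable, and uses four made-up points; train and dot are hypothetical helper names.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def train(points, dims, iterations):
    """points is a list of (x, y) pairs with x a list of floats and y in {-1, +1}."""
    w = [0.0] * dims                       # assumption: start at zero, not random
    for _ in range(iterations):
        gradient = [0.0] * dims
        for x, y in points:                # the "map" step: one term per point
            scale = (1.0 / (1.0 + math.exp(-y * dot(w, x))) - 1.0) * y
            for j in range(dims):          # the "reduce" step: sum the terms
                gradient[j] += scale * x[j]
        w = [wj - gj for wj, gj in zip(w, gradient)]
    return w

# Made-up, linearly separable points.
points = [([1.0, 2.0], 1), ([2.0, 1.5], 1), ([-1.0, -1.0], -1), ([-2.0, -0.5], -1)]
w = train(points, dims=2, iterations=10)
predictions = [1 if dot(w, x) > 0 else -1 for x, _ in points]
print(predictions)   # [1, 1, -1, -1]
```

Every iteration rereads all of points, which is why keeping the data in memory across iterations matters so much for this workload.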

Page 38: Logistic Regression Performance

[Figure: bar chart of running time (s), 0 to 4000, against number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s / iteration

Spark: first iteration 174 s, further iterations 6 s
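Taking the timings quoted on the chart (127 s per Hadoop iteration; 174 s for the first Spark iteration and 6 s for each later one), the totals can be worked out directly. The sketch assumes the per-iteration costs stay constant, which is a simplification of the measured bars.

```python
HADOOP_PER_ITERATION_S = 127
SPARK_FIRST_ITERATION_S = 174
SPARK_LATER_ITERATION_S = 6

def hadoop_total(iterations):
    return HADOOP_PER_ITERATION_S * iterations

def spark_total(iterations):
    return SPARK_FIRST_ITERATION_S + SPARK_LATER_ITERATION_S * (iterations - 1)

# Columns: iterations, Hadoop total (s), Spark total (s), Hadoop/Spark ratio
for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n), round(hadoop_total(n) / spark_total(n), 1))
# 1 127 174 0.7
# 5 635 198 3.2
# 10 1270 228 5.6
# 20 2540 288 8.8
# 30 3810 348 10.9
```

Under this model Spark is slower for a single pass, because loading the cache costs more than one Hadoop iteration, and pulls ahead as iterations accumulate.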

Page 39: Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
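Two of the listed operators, groupByKey and reduceByKey, are easy to confuse. The plain-Python sketch below shows what each one returns for a list of (key, value) pairs; the function names and sample pairs are invented, and real Spark distributes this work across partitions instead of using one dictionary.

```python
from collections import defaultdict
from functools import reduce

pairs = [("spark", 1), ("hive", 1), ("spark", 1), ("hadoop", 1), ("spark", 1)]

def group_by_key(kv_pairs):
    """Collect all values that share a key into one list per key."""
    groups = defaultdict(list)
    for key, value in kv_pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_by_key(kv_pairs, func):
    """Combine the values of each key with a binary function."""
    return {key: reduce(func, values) for key, values in group_by_key(kv_pairs).items()}

print(group_by_key(pairs))                        # {'spark': [1, 1, 1], 'hive': [1], 'hadoop': [1]}
print(reduce_by_key(pairs, lambda a, b: a + b))   # {'spark': 3, 'hive': 1, 'hadoop': 1}
```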

Page 40: Spark Users

Page 41: Spark SQL: Hive on Spark

Page 42: Motivation

Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Page 43: Hive Architecture

[Figure: a Client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, Physical Plan, and Execution stages. The diagram also shows the Metastore, with execution on MapReduce over HDFS.]

Page 44: Spark SQL Architecture

[Figure: a Client (CLI, JDBC) talks to the Driver, which contains the SQL Parser, Query Optimizer, Physical Plan, and Execution stages plus a Cache Mgr. The diagram also shows the Metastore, with execution on Spark over HDFS.]

[Engle et al, SIGMOD 2012]

Page 45: Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark

Page 46: Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

Page 47: Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;
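To see what the two benchmark queries compute, they can be run against a few rows in SQLite from Python. SQLite is swapped in here purely for illustration; it is neither Hive nor Spark SQL. The table contents are invented, the column types are guesses, and the join is written with an explicit JOIN ... ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE grep (field TEXT);
    INSERT INTO grep VALUES ('aaXYZbb'), ('no match'), ('XYZ');

    CREATE TABLE rankings (pageURL TEXT, pageRank INTEGER);
    CREATE TABLE userVisits (sourceIP TEXT, destURL TEXT, visitDate TEXT, adRevenue REAL);
    INSERT INTO rankings VALUES ('a.com', 10), ('b.com', 20);
    INSERT INTO userVisits VALUES
        ('1.1.1.1', 'a.com', '1999-03-01', 1.5),
        ('1.1.1.1', 'b.com', '1999-06-01', 2.0),
        ('2.2.2.2', 'a.com', '1999-07-01', 3.0),
        ('2.2.2.2', 'b.com', '2001-01-01', 9.0);
""")

# Query 1: a selection with a substring match.
query1 = "SELECT * FROM grep WHERE field LIKE '%XYZ%'"
rows1 = conn.execute(query1).fetchall()

# Query 2: join, date filter, per-IP aggregation, then the top earner.
query2 = """
    SELECT V.sourceIP, AVG(R.pageRank), SUM(V.adRevenue) AS earnings
    FROM rankings AS R JOIN userVisits AS V ON R.pageURL = V.destURL
    WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
    GROUP BY V.sourceIP
    ORDER BY earnings DESC
    LIMIT 1
"""
rows2 = conn.execute(query2).fetchall()

print(sorted(rows1))   # [('XYZ',), ('aaXYZbb',)]
print(rows2)           # [('1.1.1.1', 15.0, 3.5)]
```

The 2001 visit carries the largest ad revenue but falls outside the date range, so it does not affect the result.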

Page 48: Behavior with Not Enough RAM

Iteration time (s) by % of working set in memory:

Cache disabled    68.8
25%               58.1
50%               40.7
75%               29.7
Fully cached      11.5

Page 49: What's Next?

Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Page 50: Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split)
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Figure: at each time step (T=1, T=2, ...) a batch flows through map and then reduceByWindow]

[Zaharia et al, HotCloud 2012]
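The "series of small batch jobs" model can be imitated in plain Python: count words per micro-batch, then combine the counts of the most recent batches. This sketch measures the window in batches, not seconds, which is a simplification; the function name and the sample tweets are invented.

```python
from collections import Counter

def windowed_word_counts(batches, window):
    """Yield, after each batch, the word counts over the most recent `window` batches."""
    per_batch = []                                   # one Counter per micro-batch
    for tweets in batches:
        # flatMap: every tweet becomes several lower-cased words
        words = [w for tweet in tweets for w in tweet.lower().split()]
        per_batch.append(Counter(words))             # per-batch (word, count) pairs
        total = Counter()
        for counts in per_batch[-window:]:           # reduce across the window
            total.update(counts)
        yield dict(total)

batches = [["Spark is fast"], ["spark streaming"], ["hive on spark"], ["hello"]]
for t, counts in enumerate(windowed_word_counts(batches, window=2), start=1):
    print(t, counts)
# 1 {'spark': 1, 'is': 1, 'fast': 1}
# 2 {'spark': 2, 'is': 1, 'fast': 1, 'streaming': 1}
# 3 {'spark': 2, 'streaming': 1, 'hive': 1, 'on': 1}
# 4 {'hive': 1, 'on': 1, 'spark': 1, 'hello': 1}
```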

Page 51: map() vs flatMap()

The best explanation:

https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey

flatMap = map + flatten
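The identity "flatMap = map + flatten" can be seen with ordinary Python lists. The helper flat_map below is a hypothetical name built from the standard library, not a Spark or PySpark call.

```python
from itertools import chain

def flat_map(func, items):
    """Apply func to every item, then flatten the resulting lists by one level."""
    return list(chain.from_iterable(map(func, items)))

lines = ["to be or", "not to be"]
print(list(map(str.split, lines)))   # [['to', 'be', 'or'], ['not', 'to', 'be']]
print(flat_map(str.split, lines))    # ['to', 'be', 'or', 'not', 'to', 'be']
```

map keeps one output per input (here, one list of words per line), while flat_map yields a single flat sequence of words.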

Page 52: Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermix seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLower.split)
  .map(word => (word, 1))
  .reduceByWindow(5, _ + _)

[Figure: at each time step (T=1, T=2, ...) a batch flows through map and then reduceByWindow]

[Zaharia et al, HotCloud 2012]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

Page 53: Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as a part of processing or analyzed with higher level transforms
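Dividing a live stream into batches at set intervals amounts to bucketing events by arrival time. A minimal plain-Python sketch, assuming each event carries a timestamp in seconds; to_batches and the sample events are invented, and intervals with no events are simply skipped here.

```python
from collections import defaultdict

def to_batches(events, interval_s):
    """Group (timestamp_seconds, value) events into consecutive fixed-length batches."""
    batches = defaultdict(list)
    for timestamp, value in events:
        batches[int(timestamp // interval_s)].append(value)
    return [batches[i] for i in sorted(batches)]

events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
print(to_batches(events, interval_s=1))   # [['a', 'b'], ['c'], ['d', 'e']]
```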

Page 54: Spark Platform

[Figure: layered stack, top to bottom]
Spark SQL / Shark, Spark Streaming, MLlib, GraphX
Scala / Python / Java
Execution: RDD
Resource Management: YARN / Spark / Mesos
Data Storage: Standard FS / HDFS / CFS / S3

Page 55: GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge

Limited algorithms
» PageRank
» Connected Components
» Triangle Counts

Alpha component
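Of the algorithms listed, PageRank is the simplest to sketch. The version below is a plain-Python iteration over an edge list, not GraphX's API; the damping factor 0.85, the iteration count, and the sample graph are assumptions, and it expects every node to have at least one outgoing edge.

```python
def pagerank(edges, iterations=20, damping=0.85):
    """edges is a list of (src, dst) pairs; parallel edges are allowed (multigraph)."""
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, _ in edges:
        out_degree[src] += 1
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        incoming = {n: 0.0 for n in nodes}
        for src, dst in edges:                      # each edge forwards a share of rank
            incoming[dst] += ranks[src] / out_degree[src]
        ranks = {n: (1 - damping) / len(nodes) + damping * incoming[n] for n in nodes}
    return ranks

# Made-up directed graph: a -> b, b -> c, c -> a, a -> c
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")]
ranks = pagerank(edges)
print(max(ranks, key=ranks.get))   # c
```

Node c ranks highest because it receives all of b's rank and half of a's.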

Page 56: MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent

Page 57: MLlib 2.0 (part of Spark 2.0)

https://spark.apache.org/docs/latest/mllib-guide.html

Page 58: Spark 2 (very new still)

New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

Spark 2.0.0 has API breaking changes

Partly why HW3 uses Spark 1.6 (also, Cloudera distribution's Spark 2 support is in beta)

More details: https://spark.apache.org/releases/spark-release-2-0-0.html