
Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (NCR)

Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech

Instructor: Duen Horng (Polo) Chau

What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org

What is Spark SQL? (Formerly called Shark)

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Project History [latest: v2.1]

Spark project started in 2009 at the UC Berkeley AMPLab; open sourced in 2010

Became an Apache Top-Level Project in Feb 2014

Shark/Spark SQL started summer 2011

Built by 250+ developers and people from 50 companies

Scales to 1000+ nodes in production

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …

http://en.wikipedia.org/wiki/Apache_Spark


Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g., iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

Both require faster data sharing across parallel jobs


Data Sharing in MapReduce

[Figure: an iterative job reads from and writes to HDFS between every iteration (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …), and each ad-hoc query (query 1, query 2, query 3) re-reads the input from HDFS to produce its result.]

Slow due to replication, serialization, and disk IO


Data Sharing in Spark

[Figure: after one-time processing, the input is held in distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3) read from memory instead of HDFS.]

10-100x faster than network and disk

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala or Python console
» Supported languages: Java, Scala, Python, R
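As a quick taste of the language-integrated API, here is a minimal sketch that runs in the Scala shell (spark-shell), where `sc` is the SparkContext the shell provides:

// Build a distributed collection (an RDD), transform it in parallel,
// cache it in cluster memory, and run an action to get a result back.
val nums = sc.parallelize(1 to 1000000)    // RDD[Int] spread across the cluster
val squares = nums.map(n => n.toLong * n)  // parallel transformation
squares.cache()                            // keep the RDD in memory for reuse

println(squares.reduce(_ + _))             // action: triggers the computation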

http://www.scala-lang.org/old/faq/4

Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html

Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations



Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

[Figure: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), caches its filtered messages in memory (Cache 1-3), and returns results to the driver.]

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html
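For reference, a version of this example that runs as-is in the Spark 1.x Scala shell (spark-shell), where the entry point is the SparkContext `sc` rather than the slide's `spark` handle; the HDFS path is a placeholder:

// Runnable sketch of the log-mining example for spark-shell (Spark 1.x)
val lines = sc.textFile("hdfs://namenode:8020/logs/app.log")  // placeholder path
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))   // keep the 3rd tab-separated field
val cachedMsgs = messages.cache()             // cached when the first action runs

cachedMsgs.filter(_.contains("foo")).count()  // action: triggers computation
cachedMsgs.filter(_.contains("bar")).count()  // reuses the in-memory cache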

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Figure: lineage chain — HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
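You can inspect this lineage directly; a minimal sketch, assuming `sc` is an existing SparkContext and the path is a placeholder:

// toDebugString prints the chain of RDDs Spark would replay
// to rebuild any lost partitions of `messages`.
val messages = sc.textFile("hdfs://namenode:8020/logs/app.log")
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

println(messages.toDebugString)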

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)  // initial parameter vector

for (i <- 1 to ITERATIONS) {  // repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic Regression Performance

[Figure: running time (s), 0-4000, vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s per iteration
Spark: first iteration 174 s, further iterations 6 s

Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
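A minimal sketch exercising a few of these operators in spark-shell (Spark 1.x), where `sc` is the provided SparkContext:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).collect()    // (a,4), (b,2) — order not guaranteed
pairs.groupByKey().collect()          // (a,[1,3]), (b,[2])

val other = sc.parallelize(Seq(("a", "x"), ("c", "y")))
pairs.join(other).collect()           // inner join: (a,(1,x)), (a,(3,x))
pairs.leftOuterJoin(other).collect()  // keeps ("b",2) with None on the right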

Spark Users

[Figure: logos of organizations using Spark]

Spark SQL: Hive on Spark

Motivation

Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Hive Architecture

[Figure: a Client (CLI, JDBC) talks to the Driver, whose SQL Parser, Query Optimizer, and Physical Plan stages feed Execution on MapReduce; tables live in HDFS, with a Metastore alongside.]

Spark SQL Architecture

[Figure: the same stack, but the Physical Plan executes on Spark instead of MapReduce, with an added Cache Manager.]

[Engle et al, SIGMOD 2012]

Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark (see the sketch below)
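A minimal sketch of that mixing, using the Spark 1.x HiveContext API; the table name `rankings` and the threshold are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveOnSpark"))
val hc = new HiveContext(sc)  // reads the existing Hive metastore

// Run HiveQL; the result comes back as a DataFrame
val topPages = hc.sql("SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100")

// Drop down to the RDD API to mix in arbitrary Spark code
val urls = topPages.rdd.map(row => row.getString(0))
println(urls.count())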

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

Behavior with Not Enough RAM

[Figure: logistic regression iteration time (s) as the fraction of the working set held in memory varies]

  Cache disabled:  68.8 s
  25% in memory:   58.1 s
  50% in memory:   40.7 s
  75% in memory:   29.7 s
  Fully cached:    11.5 s

What's Next?

Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Figure: batches at T = 1, T = 2, … flow through map and reduceByWindow steps]

[Zaharia et al, HotCloud 2012]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

map() vs flatMap()

The best explanation: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey

flatMap = map + flatten
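A quick illustration (plain Scala collections behave the same way RDDs do here):

val lines = Seq("to be", "or not")

// map: one output element per input element (an Array per line)
lines.map(_.split(" "))      // Seq(Array(to, be), Array(or, not))

// flatMap: map each element to a collection, then flatten the results
lines.flatMap(_.split(" "))  // Seq(to, be, or, not)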


Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as part of processing, or analyzed with higher-level transforms
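A minimal sketch of those set intervals, using the Spark 1.x streaming API; the host and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Word counts over a text stream from a socket, batched every 1 second
val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // print a few counts from each batch
ssc.start()             // start receiving and processing
ssc.awaitTermination()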

Spark Platform

[Figure: the Spark stack, bottom to top]

  Data Storage:         Standard FS / HDFS / CFS / S3
  Resource Management:  YARN / Spark / Mesos
  Execution:            RDD
  Libraries:            Spark SQL (Shark), Spark Streaming, GraphX, MLlib
  Language APIs:        Scala / Python / Java

GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge

Limited algorithms
» PageRank
» Connected Components
» Triangle Counts

Alpha component
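A minimal sketch of GraphX's PageRank, assuming `sc` is an existing SparkContext and "edges.txt" is a hypothetical edge-list file (one "srcId dstId" pair per line):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "edges.txt")

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices

// Show the ten highest-ranked vertices
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)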

MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
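A minimal sketch of one of these, K-means, with the Spark 1.x RDD-based MLlib API; `sc` is an existing SparkContext and "points.txt" is a hypothetical file with one space-separated numeric vector per line:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("points.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()  // iterative algorithm: keep the points in memory

val model = KMeans.train(points, 3, 20)  // k = 3 clusters, up to 20 iterations

model.clusterCenters.foreach(println)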

MLlib 2.0 (part of Spark 2.0)

https://spark.apache.org/docs/latest/mllib-guide.html

Spark 2 (very new still)

New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

Spark 2.0.0 has API-breaking changes

Partly why HW3 uses Spark 1.6 (also, the Cloudera distribution's Spark 2 support is in beta)

More details: https://spark.apache.org/releases/spark-release-2-0-0.html

Commercial Support

Databricks
» Not to be confused with DataStax
» Founded by members of the AMPLab
» Offerings:
  • Certification
  • Training
  • Support
  • Databricks Cloud