
Slides adapted from Matei Zaharia (MIT) and Oliver Vagner (NCR)

Spark & Spark SQL: High-Speed In-Memory Analytics over Hadoop and Hive Data

CSE 6242 / CX 4242 Data and Visual Analytics | Georgia Tech

Instructor: Duen Horng (Polo) Chau

What is Spark?

Not a modified version of Hadoop

Separate, fast, MapReduce-like engine
» In-memory data storage for very fast iterative queries
» General execution graphs and powerful optimizations
» Up to 40x faster than Hadoop

Compatible with Hadoop's storage APIs
» Can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

http://spark.apache.org

What is Spark SQL? (Formerly called Shark)

Port of Apache Hive to run on Spark

Compatible with existing Hive data, metastores, and queries (HiveQL, UDFs, etc.)

Similar speedups of up to 40x

Project History [latest: v2.1]

Spark project started in 2009 at the UC Berkeley AMPLab; open sourced in 2010

Became an Apache Top-Level Project in Feb 2014

Shark/Spark SQL started summer 2011

Built by 250+ developers and people from 50 companies

Scales to 1000+ nodes in production

In use at Berkeley, Princeton, Klout, Foursquare, Conviva, Quantifind, Yahoo! Research, …

http://en.wikipedia.org/wiki/Apache_Spark


Why a New Programming Model?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:
» More complex, multi-stage applications (e.g., iterative graph algorithms and machine learning)
» More interactive ad-hoc queries

Both require faster data sharing across parallel jobs


Data Sharing in MapReduce

[Figure: an iterative job reads from and writes to HDFS between every iteration (HDFS read → iter. 1 → HDFS write → HDFS read → iter. 2 → …), and each ad-hoc query (query 1, query 2, query 3) re-reads the input from HDFS to produce its result.]

Slow due to replication, serialization, and disk IO


Data Sharing in Spark

[Figure: after one-time processing, the input is held in distributed memory; iterations (iter. 1, iter. 2, …) and queries (query 1, query 2, query 3) read from memory instead of HDFS.]

10-100x faster than network and disk

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)
» Distributed collections of objects that can be cached in memory across cluster nodes
» Manipulated through various parallel operators
» Automatically rebuilt on failure

Interface
» Clean language-integrated API in Scala
» Can be used interactively from the Scala or Python console
» Supported languages: Java, Scala, Python, R
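As a quick taste of the language-integrated API, here is a minimal sketch that runs in the Scala shell (spark-shell), where `sc` is the SparkContext the shell provides:

// Build a distributed collection (an RDD), transform it in parallel,
// cache it in cluster memory, and run an action to get a result back.
val nums = sc.parallelize(1 to 1000000)    // RDD[Int] spread across the cluster
val squares = nums.map(n => n.toLong * n)  // parallel transformation
squares.cache()                            // keep the RDD in memory for reuse

println(squares.reduce(_ + _))             // action: triggers the computation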

http://www.scala-lang.org/old/faq/4

Functional programming in D3: http://sleptons.blogspot.com/2015/01/functional-programming-d3js-good-example.html

Scala vs Java 8: http://kukuruku.co/hub/scala/java-8-vs-scala-the-difference-in-approaches-and-mutual-innovations



Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns.

lines = spark.textFile("hdfs://...")           // base RDD
errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

[Figure: the driver sends tasks to three workers; each worker reads one block of the file (Block 1-3), caches its filtered messages in memory (Cache 1-3), and returns results to the driver.]

cachedMsgs.filter(_.contains("foo")).count     // action
cachedMsgs.filter(_.contains("bar")).count
. . .

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

http://www.slideshare.net/normation/scala-dreaded
http://ananthakumaran.in/2010/03/29/scala-underscore-magic.html
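For reference, a version of this example that runs as-is in the Spark 1.x Scala shell (spark-shell), where the entry point is the SparkContext `sc` rather than the slide's `spark` handle; the HDFS path is a placeholder:

// Runnable sketch of the log-mining example for spark-shell (Spark 1.x)
val lines = sc.textFile("hdfs://namenode:8020/logs/app.log")  // placeholder path
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split('\t')(2))   // keep the 3rd tab-separated field
val cachedMsgs = messages.cache()             // cached when the first action runs

cachedMsgs.filter(_.contains("foo")).count()  // action: triggers computation
cachedMsgs.filter(_.contains("bar")).count()  // reuses the in-memory cache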

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:
messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

[Figure: lineage chain — HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))]
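You can inspect this lineage directly; a minimal sketch, assuming `sc` is an existing SparkContext and the path is a placeholder:

// toDebugString prints the chain of RDDs Spark would replay
// to rebuild any lost partitions of `messages`.
val messages = sc.textFile("hdfs://namenode:8020/logs/app.log")
  .filter(_.contains("error"))
  .map(_.split('\t')(2))

println(messages.toDebugString)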

Example: Logistic Regression

val data = spark.textFile(...).map(readPoint).cache()  // load data in memory once

var w = Vector.random(D)  // initial parameter vector

for (i <- 1 to ITERATIONS) {  // repeated MapReduce steps to do gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Logistic Regression Performance

[Figure: running time (s), 0-4000, vs number of iterations (1, 5, 10, 20, 30) for Hadoop and Spark]

Hadoop: 127 s per iteration
Spark: first iteration 174 s, further iterations 6 s

Supported Operators

map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
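A minimal sketch exercising a few of these operators in spark-shell (Spark 1.x), where `sc` is the provided SparkContext:

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

pairs.reduceByKey(_ + _).collect()    // (a,4), (b,2) — order not guaranteed
pairs.groupByKey().collect()          // (a,[1,3]), (b,[2])

val other = sc.parallelize(Seq(("a", "x"), ("c", "y")))
pairs.join(other).collect()           // inner join: (a,(1,x)), (a,(3,x))
pairs.leftOuterJoin(other).collect()  // keeps ("b",2) with None on the right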

Spark Users

[Figure: logos of organizations using Spark]

Spark SQL: Hive on Spark

Motivation

Hive is great, but Hadoop's execution engine makes even the smallest queries take minutes

Scala is good for programmers, but many data users only know SQL

Can we extend Hive to run on Spark?

Hive Architecture

[Figure: a Client (CLI, JDBC) talks to the Driver, whose SQL Parser, Query Optimizer, and Physical Plan stages feed Execution on MapReduce; tables live in HDFS, with a Metastore alongside.]

Spark SQL Architecture

[Figure: the same stack, but the Physical Plan executes on Spark instead of MapReduce, with an added Cache Manager.]

[Engle et al, SIGMOD 2012]

Using Spark SQL

CREATE TABLE mydata_cached AS SELECT …

Run standard HiveQL on it, including UDFs
» A few esoteric features are not yet supported

Can also call from Scala to mix with Spark (see the sketch below)
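A minimal sketch of that mixing, using the Spark 1.x HiveContext API; the table name `rankings` and the threshold are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("HiveOnSpark"))
val hc = new HiveContext(sc)  // reads the existing Hive metastore

// Run HiveQL; the result comes back as a DataFrame
val topPages = hc.sql("SELECT pageURL, pageRank FROM rankings WHERE pageRank > 100")

// Drop down to the RDD API to mix in arbitrary Spark code
val urls = topPages.rdd.map(row => row.getString(0))
println(urls.count())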

Benchmark Query 1

SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark Query 2

SELECT sourceIP, AVG(pageRank), SUM(adRevenue) AS earnings
FROM rankings AS R, userVisits AS V ON R.pageURL = V.destURL
WHERE V.visitDate BETWEEN '1999-01-01' AND '2000-01-01'
GROUP BY V.sourceIP
ORDER BY earnings DESC
LIMIT 1;

Behavior with Not Enough RAM

[Figure: logistic regression iteration time (s) as the fraction of the working set held in memory varies]

  Cache disabled:  68.8 s
  25% in memory:   58.1 s
  50% in memory:   40.7 s
  75% in memory:   29.7 s
  Fully cached:    11.5 s

What's Next?

Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps)

Another emerging use case that needs fast data sharing is stream processing
» Track and update state in memory as events arrive
» Large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark

Extends Spark to perform streaming computations

Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs

Intermixes seamlessly with batch and ad-hoc queries

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Figure: batches at T = 1, T = 2, … flow through map and reduceByWindow steps]

[Zaharia et al, HotCloud 2012]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency

map() vs flatMap()

The best explanation: https://www.linkedin.com/pulse/difference-between-map-flatmap-transformations-spark-pyspark-pandey

flatMap = map + flatten
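A quick illustration (plain Scala collections behave the same way RDDs do here):

val lines = Seq("to be", "or not")

// map: one output element per input element (an Array per line)
lines.map(_.split(" "))      // Seq(Array(to, be), Array(or, not))

// flatMap: map each element to a collection, then flatten the results
lines.flatMap(_.split(" "))  // Seq(to, be, or, not)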


Spark Streaming

Create and operate on RDDs from live data streams at set intervals

Data is divided into batches for processing

Streams may be combined as part of processing, or analyzed with higher-level transforms
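A minimal sketch of those set intervals, using the Spark 1.x streaming API; the host and port are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Word counts over a text stream from a socket, batched every 1 second
val conf = new SparkConf().setAppName("StreamingWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val lines = ssc.socketTextStream("localhost", 9999)
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)

counts.print()          // print a few counts from each batch
ssc.start()             // start receiving and processing
ssc.awaitTermination()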

Spark Platform

[Figure: the Spark stack, bottom to top]

  Data Storage:         Standard FS / HDFS / CFS / S3
  Resource Management:  YARN / Spark / Mesos
  Execution:            RDD
  Libraries:            Spark SQL (Shark), Spark Streaming, GraphX, MLlib
  Language APIs:        Scala / Python / Java

GraphX

Parallel graph processing

Extends RDD -> Resilient Distributed Property Graph
» Directed multigraph with properties attached to each vertex and edge

Limited algorithms
» PageRank
» Connected Components
» Triangle Counts

Alpha component
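A minimal sketch of GraphX's PageRank, assuming `sc` is an existing SparkContext and "edges.txt" is a hypothetical edge-list file (one "srcId dstId" pair per line):

import org.apache.spark.graphx.GraphLoader

val graph = GraphLoader.edgeListFile(sc, "edges.txt")

// Run PageRank until the ranks converge within the given tolerance
val ranks = graph.pageRank(0.0001).vertices

// Show the ten highest-ranked vertices
ranks.sortBy(_._2, ascending = false).take(10).foreach(println)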

MLlib

Scalable machine learning library

Interoperates with NumPy

Available algorithms in 1.0
» Linear Support Vector Machine (SVM)
» Logistic Regression
» Linear Least Squares
» Decision Trees
» Naïve Bayes
» Collaborative Filtering with ALS
» K-means
» Singular Value Decomposition
» Principal Component Analysis
» Gradient Descent
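A minimal sketch of one of these, K-means, with the Spark 1.x RDD-based MLlib API; `sc` is an existing SparkContext and "points.txt" is a hypothetical file with one space-separated numeric vector per line:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val points = sc.textFile("points.txt")
  .map(line => Vectors.dense(line.split(" ").map(_.toDouble)))
  .cache()  // iterative algorithm: keep the points in memory

val model = KMeans.train(points, 3, 20)  // k = 3 clusters, up to 20 iterations

model.clusterCenters.foreach(println)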

MLlib 2.0 (part of Spark 2.0)

https://spark.apache.org/docs/latest/mllib-guide.html

Spark 2 (very new still)

New feature highlights: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

Spark 2.0.0 has API-breaking changes

Partly why HW3 uses Spark 1.6 (also, the Cloudera distribution's Spark 2 support is in beta)

More details: https://spark.apache.org/releases/spark-release-2-0-0.html

Commercial Support

Databricks
» Not to be confused with DataStax
» Founded by members of the AMPLab
» Offerings:
  • Certification
  • Training
  • Support
  • Databricks Cloud