Apache Spark Overview

Learning Apache Spark, Part 1

Presenter Introduction

• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZONE MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead

http://www.slideshare.net/bunkertor
http://sparkdeveloper.com/
http://www.twitter.com/PaasDev

airis.DATA

airis.DATA is a next-generation system integrator that specializes in rapidly deployable machine learning and graph solutions.

Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.

We offer predictive modeling and machine learning solutions at petabyte scale, utilizing advanced technologies and frameworks including Spark, H2O, Mahout, and Flink.

Our data pipelining solutions can be deployed in batch, real-time or near-real-time settings to fit your specific business use case.

Agenda

• Overview
• What is MapReduce?
• Hands-On:
  • Installation
  • Spark MapReduce
  • Build with IntelliJ/SBT
  • Deploy Local

Overview

Spark is a fast cluster computing system that supports Java, Scala, Python and R APIs. It supports multiple workloads using the same system and the same code.

One-stop shopping for your big data processing needs at scale.

It works well with existing Hadoop clusters, on AWS, or on its own.

http://spark.apache.org/docs/latest/index.html

What is MapReduce?

TRANSFORMATION

map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

ACTION

reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
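To see why reduce wants a commutative and associative function, here is a minimal sketch using plain Scala collections rather than a Spark RDD (the method names map and reduce are the same on both). The helper partitionedReduce and the object name are hypothetical; the helper only imitates how a cluster might reduce each partition separately and then combine the partial results.

```scala
object MapReduceSketch {
  // Hypothetical helper: reduce each chunk ("partition") on its own,
  // then reduce the partial results, roughly as a cluster would.
  def partitionedReduce[A](data: Seq[A], partitionSize: Int)(f: (A, A) => A): A =
    data.grouped(partitionSize).map(_.reduce(f)).reduce(f)

  val numbers: Seq[Int] = 1 to 8

  // map: apply a function to every element, producing a new dataset
  val doubled: Seq[Int] = numbers.map(_ * 2)

  // Addition is commutative and associative, so partitioning does not matter
  val sequentialSum: Int = numbers.reduce(_ + _)                  // 36
  val partitionedSum: Int = partitionedReduce(numbers, 3)(_ + _)  // 36

  // Subtraction is neither, so the answer depends on how the data is split
  val sequentialDiff: Int = numbers.reduceLeft(_ - _)              // -34
  val partitionedDiff: Int = partitionedReduce(numbers, 3)(_ - _)  // 4

  def main(args: Array[String]): Unit = {
    println(s"sum: $sequentialSum vs $partitionedSum")
    println(s"difference: $sequentialDiff vs $partitionedDiff")
  }
}
```

With subtraction, the result changes with the partition size, which is the kind of silent error the rule is meant to prevent.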

Problem Definition

We have Apache logs from our website. They follow a standard pattern, and we want to parse them to gain some insights on usage.

114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] "GET /wp HTTP/1.1" 200 5279 "http://sparkdeveloper.com/" "Mozilla/5.0"

Fields, in order: IP Address, Client ID, User ID, Date Time Stamp, Request String, HTTP Status Code, Bytes Sent, HTTP Referer, User Agent

Map Function

logFile.map(parseLogLine)

LogRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))

Our mapping function is parseLogLine, which takes a log String and splits it into the fields of a case class using regular expressions.
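The slides do not show the regular expression or the case class definition, so the following is only a sketch of what parseLogLine could look like. The pattern is an assumption, chosen so that the group numbers line up with the constructor call above (groups 6 and 7, the method and path, are nested inside group 5, the full request string). All field names except bytesSent are illustrative. Returning an Option instead of a bare LogRecord is a choice of this sketch, so that malformed lines do not throw.

```scala
import scala.util.matching.Regex

// Field names other than bytesSent are illustrative, not taken from the slides.
case class LogRecord(ipAddress: String, clientId: String, userId: String,
                     dateTime: String, request: String, statusCode: Int,
                     bytesSent: Long, referer: String, userAgent: String)

object LogParser {
  // Assumed pattern for the log format shown in the problem definition.
  // Group 5 is the whole request string; groups 6 and 7 sit inside it.
  val LogPattern: Regex =
    """^(\S+) (\S+) (\S+) \[([^\]]+)\] "((\S+) (\S+) [^"]*)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // Returns None for lines that do not match, rather than throwing.
  def parseLogLine(line: String): Option[LogRecord] =
    LogPattern.findFirstMatchIn(line).map { m =>
      LogRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5),
        m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))
    }
}
```

Real access logs sometimes record "-" instead of a byte count; this pattern would reject such lines, so a production parser would need to decide how to handle them.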

val contentSizes = accessLogs.map(log => log.bytesSent)

Our second mapping function maps each record to just its bytes-sent field.

Reduce

contentSizes.reduce(_ + _)

We reduce by summing up all the bytes in the dataset. The result is a final sum of all sizes.
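One caveat worth sketching: reduce needs at least one element. The example below uses a plain Scala Seq with made-up sizes to stay runnable without a cluster; as far as I know an RDD behaves the same way (reduce fails on an empty dataset, and fold with a zero value is available there too), but treat that as an assumption to check against the Spark API docs. The object name and sample values are hypothetical.

```scala
import scala.util.Try

object ContentSizeSketch {
  // Made-up byte counts standing in for accessLogs.map(log => log.bytesSent)
  val contentSizes: Seq[Long] = Seq(5279L, 1024L, 312L)

  // The reduce from the slide: sum all sizes
  val total: Long = contentSizes.reduce(_ + _)   // 6615

  // A derived statistic built from the same reduce
  val average: Long = total / contentSizes.size  // 2205

  // reduce on an empty collection throws; fold with a zero value does not
  val emptySizes: Seq[Long] = Seq.empty
  val reduceOnEmptyFails: Boolean = Try(emptySizes.reduce(_ + _)).isFailure
  val safeTotal: Long = emptySizes.fold(0L)(_ + _)

  def main(args: Array[String]): Unit =
    println(s"total=$total average=$average safeTotal=$safeTotal")
}
```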

Spark 1.6.1 Stack

Libraries: Spark SQL, Spark Streaming, MLlib, GraphX

Engine: Spark Core

Cluster managers: Standalone, YARN, Mesos

Hands-On

• Spark MapReduce
• Build with IntelliJ/SBT
• Deploy Local
• Run History Server

spark-1.6.1-bin-hadoop2.6/sbin/start-history-server.sh

Installation

• Install JDK
• Install Scala 2.10
• Install SBT
• Install Maven (Optional)
• Unzip Spark 1.6.1

Environment variables (example values)

Unix/Linux/Mac
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin

Windows
set SCALA_HOME=c:\Progra~1\Scala
set PATH=%PATH%;%SCALA_HOME%\bin
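With Scala 2.10, SBT and Spark 1.6.1 installed, a build definition for the hands-on project might look roughly like the sketch below. The slides do not show the actual build file; the project name and version here are hypothetical, and 2.10.6 is just one Scala 2.10 release.

```scala
// build.sbt: a minimal sketch, not the presenter's actual build file
name := "spark-log-analysis" // hypothetical project name

version := "0.1.0" // hypothetical version

// The Spark 1.6.1 prebuilt binaries target Scala 2.10
scalaVersion := "2.10.6"

// "provided" keeps Spark out of the application jar;
// the Spark installation supplies these classes at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"
```

Marking Spark as provided keeps the application jar small, but running the job directly from an IDE may then need the Spark classes added to the run classpath.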

Spark Resources

• https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/info
• http://airisdata.com/scala-spark-resources-setup-learning/
• http://spark.apache.org/docs/latest/monitoring.html
• http://spark.apache.org/docs/latest/submitting-applications.html

Spark Cluster

http://spark.apache.org/docs/latest/cluster-overview.html

Glossary

The following table summarizes terms you'll see used to refer to cluster concepts:

Term: Meaning

Application: User program built on Spark. Consists of a driver program and executors on the cluster.

Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however; these will be added at runtime.

Driver program: The process running the main() function of the application and creating the SparkContext.

Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).

Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

Worker node: Any node that can run application code in the cluster.

Executor: A process launched for an application on a worker node that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

Task: A unit of work that will be sent to one executor.

Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.

Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
