Apache Spark Overview

  • Date posted: 12-Apr-2017

Transcript of Apache Spark Overview

  • Learning Apache Spark, part 1

  • Presenter Introduction

    Tim Spann, Senior Solutions Architect, airis.DATA

    ex-Pivotal Senior Field Engineer
    DZone MVB and Zone Leader
    ex-Startup Senior Engineer / Team Lead

    http://www.slideshare.net/bunkertor
    http://sparkdeveloper.com/
    http://www.twitter.com/PaasDev

  • airis.DATA

    airis.DATA is a next generation system integrator that specializes in rapidly deployable machine learning and graph solutions.

    Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.

    We offer predictive modeling and machine learning solutions at petabyte scale, utilizing the most advanced, best-in-class technologies and frameworks, including Spark, H2O, Mahout, and Flink.

    Our data pipelining solutions can be deployed in batch, real-time, or near-real-time settings to fit your specific business use case.

  • Agenda

    Overview

    What is MapReduce?

    Hands-On:
      Installation
      Spark MapReduce
      Build with IntelliJ / SBT
      Deploy Local

  • Overview

    Spark is a fast cluster computing system that supports Java, Scala, Python, and R APIs. It allows for multiple workloads using the same system and coding.

    One-stop shopping for your big data processing at scale needs.

    It works well with existing Hadoop clusters, with AWS, or on its own.

    http://spark.apache.org/docs/latest/index.html

  • What is MapReduce?

    TRANSFORMATION

    map(func)  Return a new distributed dataset formed by passing each element of the source through a function func.

    ACTION

    reduce(func)  Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
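
    A minimal sketch of the two operations, using plain Scala collections as a stand-in for a Spark dataset (an assumption made so the example runs without a cluster; the names MapReduceSketch, numbers, and chunked are illustrative and not from the slides). It also shows why the reduce function should be commutative and associative: addition gives the same answer however the elements are grouped, while subtraction does not.

```scala
object MapReduceSketch {
  val numbers: Vector[Int] = Vector(1, 2, 3, 4, 5, 6)

  // Transformation: map builds a new collection by applying a function to each element.
  val doubled: Vector[Int] = numbers.map(n => n * 2)

  // Action: reduce combines the elements pairwise into a single value.
  val total: Int = doubled.reduce(_ + _)

  // Simulate parallel execution: reduce each chunk separately,
  // then reduce the partial results.
  def chunked(data: Vector[Int], chunkSize: Int)(f: (Int, Int) => Int): Int =
    data.grouped(chunkSize).map(_.reduce(f)).reduce(f)

  // Addition is commutative and associative, so chunking does not change the answer.
  val chunkedSum: Int = chunked(doubled, 2)(_ + _)

  // Subtraction is neither, so the result depends on how the data is split.
  val sequentialDiff: Int = doubled.reduce(_ - _)
  val chunkedDiff: Int = chunked(doubled, 2)(_ - _)
}
```

    Here total and chunkedSum agree, while sequentialDiff and chunkedDiff differ, which is the failure mode the slide warns about when work is split across a cluster.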

  • Problem Definition

    We have Apache logs from our website. They follow a standard pattern, and we want to parse them to gain some insights on usage.

    114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] "GET /wp HTTP/1.1" 200 5279 "http://sparkdeveloper.com/" "Mozilla/5.0"

    Fields, in order: IP Address, Client ID, User ID, Date Time Stamp, Request String, HTTP Status Code, Bytes Sent, HTTP Referer, User Agent

  • Map Function

    logFile.map(parseLogLine)

    LogRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5), m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))

    Our mapping function is parseLogLine, which takes a log string and splits it into the fields of a case class using regular expressions.

    val contentSizes = accessLogs.map(log => log.bytesSent)

    Our second mapping function maps each record to just the bytes field.
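
    The slides do not show the regular expression or the case class definition, so the following is only a sketch of what parseLogLine could look like. The pattern, the field names, and the Option return type are assumptions: the pattern is a common one for this log layout with 11 capture groups, chosen so that the group numbers match the LogRecord call above (group 5 captures only the request method here, and groups 6 and 7, the path and protocol, are not stored).

```scala
import scala.util.matching.Regex

object LogParserSketch {
  // Hypothetical case class: one field per label on the Problem Definition slide.
  case class LogRecord(
    ipAddress: String, clientId: String, userId: String, dateTime: String,
    request: String, statusCode: Int, bytesSent: Long,
    referer: String, userAgent: String)

  // Assumed pattern with 11 capture groups (not taken from the slides).
  val LogPattern: Regex =
    """^(\S+) (\S+) (\S+) \[([\w:/]+\s[+\-]\d{4})\] "(\S+) (\S+) (\S+)" (\d{3}) (\d+) "([^"]*)" "([^"]*)"$""".r

  // Returns None for lines that do not match, so malformed input can be skipped.
  def parseLogLine(line: String): Option[LogRecord] =
    LogPattern.findFirstMatchIn(line).map { m =>
      LogRecord(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5),
        m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))
    }

  // The sample line from the Problem Definition slide.
  val sample: String =
    "114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] \"GET /wp HTTP/1.1\" 200 5279 " +
      "\"http://sparkdeveloper.com/\" \"Mozilla/5.0\""

  val parsed: Option[LogRecord] = parseLogLine(sample)
}
```

    Because this version returns an Option, a caller would use flatMap rather than map to drop lines that fail to parse; the version in the talk appears to return a LogRecord directly.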

  • Reduce

    contentSizes.reduce(_ + _)

    We reduce by summing up all the bytes in the dataset. The result is the total of all content sizes.
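
    To make the summing step concrete, here is a small sketch with made-up byte counts (the values, the two-partition split, and the single-field LogRecord are illustrative assumptions, not data from the talk). It uses plain Scala collections in place of a Spark dataset: each partition is reduced on its own and the partial sums are reduced again, which is the shape of work that a commutative, associative function allows.

```scala
object ReduceSketch {
  // Hypothetical record holding only the field the reduce step needs.
  case class LogRecord(bytesSent: Long)

  // Two made-up partitions of parsed log records.
  val partitions: Seq[Seq[LogRecord]] = Seq(
    Seq(LogRecord(5279L), LogRecord(1024L)),
    Seq(LogRecord(300L), LogRecord(97L)))

  // Map step: keep just the byte field of every record.
  val contentSizes: Seq[Seq[Long]] = partitions.map(_.map(log => log.bytesSent))

  // Reduce step: sum within each partition, then sum the partial results.
  val partialSums: Seq[Long] = contentSizes.map(_.reduce(_ + _))
  val totalBytes: Long = partialSums.reduce(_ + _)

  // Reducing the flattened data in one pass gives the same total.
  val totalBytesSinglePass: Long = contentSizes.flatten.reduce(_ + _)
}
```

    On an empty Scala collection reduce throws an exception; fold(0L)(_ + _) returns 0 instead, which may be the safer choice when a log file could be empty.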

  • Spark 1.6.1 Stack

    Spark SQL    Spark Streaming    MLlib    GraphX

    Spark Core

    Standalone    YARN    Mesos

  • Hands-On

    Spark MapReduce
    Build with IntelliJ / SBT
    Deploy Local
    Run History Server

    spark-1.6.1-bin-hadoop2.6/sbin/start-history-server.sh

  • Installation

    Install JDK
    Install Scala 2.10
    Install SBT
    Install Maven (Optional)
    Unzip Spark 1.6.1

    Environment variables (example values)

    Unix/Linux/Mac:
      export SCALA_HOME=/usr/local/share/scala
      export PATH=$PATH:$SCALA_HOME/bin

    Windows:
      set SCALA_HOME=c:\Progra~1\Scala
      set PATH=%PATH%;%SCALA_HOME%\bin

  • Spark Resources

    https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/info
    http://airisdata.com/scala-spark-resources-setup-learning/
    http://spark.apache.org/docs/latest/monitoring.html
    http://spark.apache.org/docs/latest/submitting-applications.html

  • Spark Cluster

    http://spark.apache.org/docs/latest/cluster-overview.html

  • Glossary

    The following table summarizes terms you'll see used to refer to cluster concepts:

    Term: Meaning

    Application: User program built on Spark. Consists of a driver program and executors on the cluster.

    Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries, however; these will be added at runtime.

    Driver program: The process running the main() function of the application and creating the SparkContext.

    Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).

    Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.

    Worker node: Any node that can run application code in the cluster.

    Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.

    Task: A unit of work that will be sent to one executor.

    Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.

    Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.