Apache Spark Crash Course

Robert Hryniewicz, Developer Advocate

@RobertH8z

Apache Spark Crash Course - DataWorks Summit - Munich 2017

© Hortonworks Inc. 2011–2016. All Rights Reserved

“Big Data”
• Internet of Anything (IoT)
  – Wind Turbines, Oil Rigs
  – Beacons, Wearables
  – Smart Cars
• User Generated Content (Social, Web & Mobile)
  – Twitter, Facebook, Snapchat
  – Clickstream
  – Paypal, Venmo

44 ZB in 2020


Visualizing 44 ZB

100 pixels = 1M TB

100 px -> 1M TB assumes 5 Mpixel resolution screen


The “Big Data” Problem

Problem
• A single machine cannot process or even store all the data!

Solution
• Distribute data over large clusters

Difficulty
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?


Spark Background


What Is Apache Spark?
• Apache open source project originally developed at AMPLab (University of California Berkeley)
• Unified data processing engine that operates across varied data workloads and platforms


Why Apache Spark?
• Elegant Developer APIs
  – Single environment for data munging, data wrangling, and Machine Learning (ML)
• In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
• Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)


Spark SQL: Structured Data

Spark Streaming: Near Real-time

Spark MLlib: Machine Learning

GraphX: Graph Analysis


Spark Basics


SparkSession

What is it?
• Main entry point for Spark functionality
• Allows programming with DataFrame and Dataset APIs
  – Fewer concepts and constructs a developer has to juggle while interacting with Spark
• Represented as spark and auto-initialized in the Zeppelin environment
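Outside Zeppelin the session has to be built explicitly. The following is a minimal sketch in script-style Scala (for spark-shell or a Zeppelin paragraph), assuming Spark 2.x is on the classpath; the local master and the application name are illustrative choices, not taken from the slides.

```scala
import org.apache.spark.sql.SparkSession

// In Zeppelin `spark` already exists; elsewhere it is built like this.
val spark = SparkSession.builder()
  .master("local[*]")               // assumption: run locally on all cores
  .appName("spark-session-sketch")  // hypothetical application name
  .getOrCreate()

// The session is the entry point to the Dataset/DataFrame APIs.
val numbers = spark.range(1, 6)     // Dataset with ids 1, 2, 3, 4, 5
val total = numbers.selectExpr("sum(id)").first().getLong(0)

println(s"Spark ${spark.version}, sum of ids = $total")
```

Because `getOrCreate()` returns an existing session when there is one, the same paragraph can be re-run without starting a second session.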







More Flexible / Better Storage and Performance


Spark SQL Overview
• Spark module for structured data processing (e.g. DB tables, JSON files, CSV)
• Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API


DataFrames
• Distributed collection of data organized into named columns
• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R

[Diagram: a DataFrame laid out as a grid of rows and columns Col1, Col2, …, ColN]

Data is described as a DataFrame with rows, columns, and a schema


Sources

[Diagram: CSV, Avro, Hive and JSON sources are read through Spark SQL into a DataFrame of rows and columns Col1, Col2, …, ColN]


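Create a DataFrame

Example

// Path to the sample JSON data set
val path = "examples/flights.json"

// Read the JSON file into a DataFrame; the schema is inferred from the data
val flights = spark.read.json(path)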


Register a Temporary View (SQL API)

Example

flights.createOrReplaceTempView("flightsView")


Two API Examples: DataFrame and SQL APIs

DataFrame API

flights.select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15).show(5)

SQL API

SELECT Origin, Dest, DepDelay
FROM flightsView
WHERE DepDelay > 15 LIMIT 5

Results

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
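To see that the two APIs express the same query, here is a small self-contained sketch in script-style Scala. It assumes Spark 2.x on the classpath, and the three in-memory rows are hypothetical stand-ins for examples/flights.json; only the column and view names come from the slides.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")           // assumption: local mode
  .appName("two-apis-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the flights data set.
val flights = Seq(
  ("IAD", "TPA", 19),
  ("IND", "BWI", 34),
  ("SFO", "LAX", 5)
).toDF("Origin", "Dest", "DepDelay")

flights.createOrReplaceTempView("flightsView")

// DataFrame API
val viaDataFrame = flights
  .select("Origin", "Dest", "DepDelay")
  .filter($"DepDelay" > 15)

// SQL API, issued through the same SparkSession
val viaSql = spark.sql(
  "SELECT Origin, Dest, DepDelay FROM flightsView WHERE DepDelay > 15")

viaDataFrame.show()
viaSql.show()
```

Both values are ordinary DataFrames, so a result obtained through SQL can be refined further with DataFrame operations and vice versa.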







What is Stream Processing?

Batch Processing
• Ability to process and analyze data at-rest (stored data)
• Request-based, bulk evaluation and short-lived processing
• Enabler for Retrospective, Reactive and On-demand Analytics

Stream Processing
• Ability to ingest, process and analyze data in-motion in real- or near-real-time
• Event or micro-batch driven, continuous evaluation and long-lived processing
• Enabler for real-time Prospective, Proactive and Predictive Analytics for Next Best Action

Stream Processing (real-time, now) + Batch Processing (historical, past) = All Data Analytics


Modern Data Applications: approach to Insights

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis, test against selected data
• Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way: explore all data, identify correlations
• Analyze in motion…


Spark Streaming

Overview
• Extension of Spark Core API
• Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant

ZeroMQ and MQTT: no longer supported in Spark 2.x


Spark Streaming

Discretized Streams (DStreams)
• High-level abstraction representing a continuous stream of data
• Internally represented as a sequence of RDDs
• An operation applied on a DStream translates to operations on the underlying RDDs


Spark Streaming

Example: flatMap operation
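The idea that a DStream operation becomes the same operation on each underlying RDD can be sketched without a live stream. The script-style Scala below models two micro-batches as plain RDDs; it assumes Spark 2.x on the classpath, and the input lines are hypothetical. On a real DStream the equivalent call would be `lines.flatMap(_.split(" "))`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")               // assumption: local mode
  .appName("dstream-model-sketch")  // hypothetical application name
  .getOrCreate()
val sc = spark.sparkContext

// Two hypothetical micro-batches of text lines, each held in its own RDD.
val batches = Seq(
  sc.parallelize(Seq("to be or", "not to be")),
  sc.parallelize(Seq("that is the question"))
)

// flatMap on the "stream" is flatMap applied to every underlying RDD:
// each line is split into words and the words are flattened per batch.
val wordsPerBatch = batches.map(_.flatMap(_.split(" ")).collect().toSeq)

wordsPerBatch.foreach(words => println(words.mkString(", ")))
```

This is only a model of the batching: a real streaming job would create a StreamingContext with a batch interval and let Spark produce the RDDs as data arrives.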


Spark Streaming

Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
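A rolling average over a sliding window can be illustrated with plain Scala collections. This sketch deliberately does not use Spark Streaming: each inner sequence stands for one micro-batch, the readings are hypothetical, and the window is measured in batches. In Spark Streaming itself, `DStream.window` takes a window duration and a slide duration instead of batch counts.

```scala
// Hypothetical readings; each inner Seq is one micro-batch.
val batches: Seq[Seq[Double]] = Seq(
  Seq(10.0, 20.0),
  Seq(30.0),
  Seq(40.0, 50.0),
  Seq(60.0)
)

val windowLength = 3   // number of batches covered by one window
val slideInterval = 1  // number of batches between window starts

// For every window of consecutive batches, average all values inside it.
val rollingAverages: Seq[Double] = batches
  .sliding(windowLength, slideInterval)
  .map { window =>
    val values = window.flatten
    values.sum / values.size
  }
  .toList

println(rollingAverages)
```

With four batches, a window of three and a slide of one, two windows are produced: batches 1 to 3 and batches 2 to 4.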


Challenges in Streaming Data
• Consistency
• Fault tolerance
• Out-of-order data


Structured Streaming: Basics


Structured Streaming: Model


Handling late arriving data







AI in Media & Pop Culture


Machine Learning use cases

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud Detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels


Scatter 2D Data Visualized

scatterData ← DataFrame

+-----+--------+
|label|features|
+-----+--------+
|-12.0|  [-4.9]|
| -6.0|  [-4.5]|
| -7.2|  [-4.1]|
| -5.0|  [-3.2]|
| -2.0|  [-3.0]|
| -3.1|  [-2.1]|
| -4.0|  [-1.5]|
| -2.2|  [-1.2]|
| -2.0|  [-0.7]|
|  1.0|  [-0.5]|
| -0.7|  [-0.2]|
...


Linear Regression Model Training (one feature)

Training Result

Coefficients: 2.81 Intercept: 3.05

y = 2.81x + 3.05
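How such coefficients are obtained can be sketched with Spark ML's LinearRegression. The script-style Scala below assumes Spark 2.x with the ML library on the classpath. The training points are hypothetical and lie exactly on y = 2x + 1, so the fitted values differ from the 2.81 and 3.05 on the slide, which came from the presenter's scatter data.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")                   // assumption: local mode
  .appName("linear-regression-sketch")  // hypothetical application name
  .getOrCreate()

// Hypothetical training set in the same (label, features) layout as
// scatterData, with every point exactly on y = 2x + 1.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.0)),
  (3.0, Vectors.dense(1.0)),
  (5.0, Vectors.dense(2.0)),
  (7.0, Vectors.dense(3.0))
)).toDF("label", "features")

// Fit with default parameters (no regularization).
val model = new LinearRegression().fit(training)

println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
```

The single coefficient is the slope and the intercept is the offset, which is how the slide's result reads as y = 2.81x + 3.05.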


Linear Regression (two features)

Coefficients: [0.464, 0.464] Intercept: 0.0563


Spark API for building ML pipelines

Pipeline: Feature transform 1 → Feature transform 2 → Combine features → Linear Regression

Train: Input DataFrame → Pipeline → PipelineModel

Predict: Input DataFrame → PipelineModel → Output DataFrame

Export Model
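The train and predict flow can be sketched with the Spark ML Pipeline API. This script-style Scala assumes Spark 2.x with the ML library on the classpath. The data and column names are hypothetical, and the two feature-transform stages from the diagram are left out, so the pipeline has only the "combine features" stage (a VectorAssembler) followed by LinearRegression.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")              // assumption: local mode
  .appName("ml-pipeline-sketch")   // hypothetical application name
  .getOrCreate()
import spark.implicits._

// Hypothetical input following label = x1 + 2 * x2 + 1.
val input = Seq(
  (0.0, 1.0, 3.0),
  (1.0, 0.0, 2.0),
  (2.0, 1.0, 5.0),
  (3.0, 2.0, 8.0),
  (1.0, 3.0, 8.0)
).toDF("x1", "x2", "label")

// "Combine features": pack the raw columns into one vector column.
val assembler = new VectorAssembler()
  .setInputCols(Array("x1", "x2"))
  .setOutputCol("features")

// Uses the default "features" and "label" columns.
val lr = new LinearRegression()

val pipeline = new Pipeline().setStages(Array(assembler, lr))

// Train: Input DataFrame -> Pipeline -> PipelineModel
val pipelineModel = pipeline.fit(input)

// Predict: Input DataFrame -> PipelineModel -> Output DataFrame
val output = pipelineModel.transform(input)
output.select("x1", "x2", "label", "prediction").show()
```

Keeping the stages in one Pipeline means the same transformations are applied at training and at prediction time, which is the point of the PipelineModel in the diagram.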





GraphX: Graph Analysis
• PageRank
• Topic Modeling (LDA)
• Community Detection

Source: ampcamp.berkeley.edu
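As an illustration of the first item, here is a minimal PageRank sketch with GraphX in script-style Scala. It assumes Spark 2.x with GraphX on the classpath; the four-page link graph is hypothetical.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")            // assumption: local mode
  .appName("pagerank-sketch")    // hypothetical application name
  .getOrCreate()
val sc = spark.sparkContext

// Hypothetical link graph: pages 1 -> 2 -> 3 -> 1 form a cycle,
// and page 4 links to page 1 but receives no links itself.
val edges = sc.parallelize(Seq(
  Edge(1L, 2L, 1),
  Edge(2L, 3L, 1),
  Edge(3L, 1L, 1),
  Edge(4L, 1L, 1)
))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Iterate until ranks change by less than the given tolerance,
// then sort pages from highest to lowest rank.
val ranks = graph.pageRank(0.0001).vertices.collect().sortBy(-_._2)

ranks.foreach { case (id, rank) => println(f"page $id%d rank $rank%.3f") }
```

Page 1 collects links from two pages and ends up ranked first, while page 4, which nothing links to, ends up last.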


Zeppelin & HDP


What’s Apache Zeppelin?

Web-based notebook that enables interactive data analytics.

You can make beautiful data-driven, interactive and collaborative documents with SQL, Python, Scala and more.


[Figure: Simple line chart]


[Figure: Horizontal plot of three line charts]


[Figure: Streaming data into a line chart]


[Figure: Plotting Iris data features in one plot]


[Figure: Comparing Iris data distributions]


What is a Note/Notebook?

• A web-based GUI for small code snippets
• Write code snippets in browser
• Zeppelin sends code to backend for execution
• Zeppelin gets data back from backend
• Zeppelin visualizes data
• Zeppelin Note = Set of (Paragraphs/Cells)
• Other Features - Sharing/Collaboration/Reports/Import/Export


How does Zeppelin work?

[Diagram: the Notebook Author and Collaborators/Report viewers use Zeppelin, which talks to a cluster running Spark | Hive | HBase or any of 30+ backends]


Big Data Lifecycle

Collect → ETL/Process → Analysis → Report → Data Product

Roles: Data Engineer, Data Scientist, Business user, Customer

All in Zeppelin!


• Zeppelin → Interactive notebook
• Spark
• YARN → Resource Management
• HDFS → Distributed Storage Layer

[Diagram: Scala, Java, Python and R APIs on top of Spark SQL, Spark Streaming, MLlib and GraphX, which sit on the Spark Core Engine, running on YARN over HDFS nodes 1 … N]


Access patterns enabled by YARN

YARN: Data Operating System

HDFS (Hadoop Distributed File System), nodes 1 … N

Applications
• Batch: needs to happen, but with no timeframe limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time


Why Apache Spark on YARN?
• Resource management
• Utilizes existing HDP cluster infrastructure
• Scheduling and queues

[Diagram: Client; Spark Driver; Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container and running two tasks]


Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
• Not just storage but computation

[Diagram: a logical file is split into blocks 1 to 4, and three copies of each block are spread across the cluster nodes]
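The block-and-replica idea can be modelled in a few lines of plain Scala. This is a toy sketch, not HDFS code: the file size, node names and random seed are hypothetical, the 128 MB block size is a common HDFS default, and replicas are placed uniformly at random as the slide describes, whereas the real HDFS placement policy is rack-aware.

```scala
import scala.util.Random

val blockSizeMb = 128   // assumption: a common HDFS default block size
val fileSizeMb = 512    // hypothetical logical file
val replication = 3     // "3 copies" from the slide
val nodes = (1 to 6).map(i => s"node$i")   // hypothetical cluster nodes

// Number of blocks needed to hold the file (last block may be partial).
val numBlocks = math.ceil(fileSizeMb.toDouble / blockSizeMb).toInt

// For each block, pick `replication` distinct nodes at random.
val rng = new Random(42)
val placement: Map[Int, Seq[String]] =
  (1 to numBlocks).map { block =>
    block -> rng.shuffle(nodes).take(replication)
  }.toMap

placement.toSeq.sortBy(_._1).foreach { case (block, replicas) =>
  println(s"block $block -> ${replicas.mkString(", ")}")
}
```

Because every block lives on three different nodes, losing one node leaves two copies of each of its blocks, and a task can be scheduled on any node that already holds the block it needs.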


Spark and HDP


HDCloud


Hortonworks Cloud Solutions

Cloud providers: Microsoft, AWS, Google
• Managed: Azure HDInsight
• Non-Managed / Marketplace: Hortonworks Data Cloud for AWS
• Cloud IaaS: Hortonworks Data Platform (via Ambari and via Cloudbreak)


Hortonworks Cloud Solutions: Flexibility and Choice

Hortonworks Data Cloud for AWS → Cloudbreak → HDP on Cloud IaaS

More Prescriptive, More Ephemeral ↔ More Options, More Long Running


HDP 2.6 and New Cluster Types
• Spark 2.1
• Druid (TP)
• Interactive Hive


Multitenancy with Zeppelin


Livy
• Livy is the open source REST interface for interacting with Apache Spark from anywhere
• Installed as Spark Ambari Service

[Diagram: Livy Client → HTTP → Livy Server → HTTP (RPC) → Spark Interactive Session (SparkContext) and Spark Batch Session (SparkContext)]


Security Across Zeppelin-Livy-Spark

[Diagram: Zeppelin (Shiro, LDAP) with the Ispark Group Interpreter → SPNego: Kerberos → Livy APIs on the Livy Server → Kerberos → Spark on YARN (Driver)]


Reasons to Integrate with Livy
• Bring Sessions to Apache Zeppelin
  – Isolation
  – Session sharing
• Enable efficient cluster resource utilization
  – Default Spark interpreter keeps YARN/Spark job running forever
  – Livy interpreter recycled after 60 minutes of inactivity (controlled by livy.server.session.timeout)
• Identity Propagation
  – Send user identity from Zeppelin > Livy > Spark on YARN


SparkSession Sharing

[Diagram: on the Livy Server, Client 1 and Client 2 share Session-1 and Client 3 uses Session-2; Session-1 maps to SparkSession-1 (SparkContext) and Session-2 to SparkSession-2 (SparkContext)]


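Apache Zeppelin Security: Authentication + SSL

[Diagram: user Tommy Callahan reaches Zeppelin over SSL through a firewall; Zeppelin authenticates against LDAP and runs on Spark on YARN; three numbered steps]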


Apache Zeppelin + Livy End-to-End Security

[Diagram: Tommy Callahan logs in to Zeppelin (LDAP); the Ispark Group Interpreter calls the Livy APIs on the Livy Server over SPNego: Kerberos; Livy reaches Spark on YARN over Kerberos/RPC; the job runs as Tommy Callahan]


Sample Architecture


Modern Data Apps
• HDP 2.6 – Batch Processing
• HDF 2.1 – Streaming Apps

[Diagram: Modern Data Applications combine DATA AT REST and DATA IN MOTION into ACTIONABLE INTELLIGENCE]


Modern Data Applications: Custom or Off the Shelf
• Real-Time Cyber Security protects systems with superior threat detection
• Smart Manufacturing dramatically improves yields by managing more variables in greater detail
• Connected, Autonomous Cars drive themselves and improve road safety
• Future Farming optimizing soil, seeds and equipment to measured conditions on each square foot
• Automatic Recommendation Engines match products to preferences in milliseconds

[Diagram: Hortonworks DataFlow (DATA IN MOTION) and Hortonworks Data Platform (DATA AT REST) feed Modern Data Applications, producing ACTIONABLE INTELLIGENCE]


Managed Dataflow

SOURCES → REGIONAL INFRASTRUCTURE → CORE INFRASTRUCTURE


High-Level Overview

[Diagram: IoT Devices → IoT Edge (single node) → NiFi Hub → Data Broker → Column DB (HBase/Cassandra), Data Store (HDFS/S3) and Live Dashboard, in a Data Center (on prem/cloud)]


Labs / Tutorials


Future Tutorials
• Deploying Models with Spark Structured Streaming
• Predicting Airline Delays with SparkR
• Sentiment Analysis with Apache Spark (Gradient Boosting)
• Auto Text Classification (Naïve Bayes)


Hortonworks Community Connection

Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories


Future of Data Meetups

www.futureofdata.io


FB Sort
• Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.
• Largest real-world Spark job to date!
  – Databricks’ PetaByte sort was on synthetic data.
• Multiple reliability fixes.

“Spark could reliably shuffle and sort 90 TB+ intermediate data and run 250,000 tasks in a single job [...] and it has been running in production for several months.”







What’s new in HDP 2.6 – Spark & Zeppelin

Spark
• Spark 1.6.3 GA
• Spark 2.1 GA
• REST API (Livy) GA
• Spark Thrift Server doAs GA
• Spark SQL – Row/Column Security (GA)
• Spark Streaming + Kafka over SSL
• Multi Cluster HBase support for SHC
• Package support in PySpark & SparkR

Zeppelin 0.7.x
• Spark 2.x support
• Improved Livy integration
• No password in clear
• JDBC interpreter improvements
• SmartSense integration
• Knox proxy Zeppelin UI

Robert Hryniewicz @RobertH8z

Thanks!
