Intro to Spark & Zeppelin - Crash Course - HS16SJ

Robert Hryniewicz Data Evangelist @RobHryniewicz Hands-on Intro to Spark & Zeppelin Crash Course


© Hortonworks Inc. 2011–2016. All Rights Reserved

The "Big Data" Problem

Problem → A single machine cannot process or even store all the data!
Solution → Distribute data over large clusters
Difficulty → How to split work across machines?
           → Moving data over the network is expensive
           → Must consider data & network locality
           → How to deal with failures?
           → How to deal with slow nodes?


Spark Background


Access Rates

At least an order of magnitude difference between memory and hard drive / network speed

[Figure: memory access is FAST; hard drive and network access are slow]


What is Spark?

→ Apache Open Source Project - originally developed at AMPLab (University of California, Berkeley)
→ Data Processing Engine - focused on in-memory distributed computing use-cases
→ API - Scala, Python, Java, and R


Spark Ecosystem

[Diagram: Spark SQL, Spark Streaming, MLlib, and GraphX built on Spark Core]


Why Spark?

→ Elegant Developer APIs
  – Single environment for data munging and Machine Learning (ML)
→ In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
→ Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)


History of Hadoop & Spark


Apache Spark Basics


SparkContext

What is it?

→ Main entry point for Spark functionality
→ Represents a connection to a Spark cluster
→ Represented as sc in your code
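To make the role of the context concrete, here is a minimal sketch of obtaining one outside a notebook. The application name and the local master URL are assumptions for illustration; in Zeppelin or spark-shell, sc already exists and getOrCreate simply returns it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Zeppelin and spark-shell already provide `sc`; getOrCreate reuses it if present,
// in which case the settings below are ignored.
val conf = new SparkConf()
  .setAppName("crash-course-sketch") // hypothetical application name
  .setMaster("local[*]")             // assumption: local mode using all cores
val sc = SparkContext.getOrCreate(conf)

// Every job starts from the context, e.g. by turning a local collection into an RDD.
val total = sc.parallelize(1 to 100).reduce(_ + _)
println(s"app=${sc.appName} master=${sc.master} total=$total")
```

Using getOrCreate rather than new SparkContext(conf) avoids an error when a context is already running in the same JVM.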


RDD - Resilient Distributed Dataset

→ Primary abstraction in Spark
  – An immutable collection of objects (or records, or elements) that can be operated on in parallel
→ Distributed
  – Collection of elements partitioned across nodes in a cluster
  – Each RDD is composed of one or more partitions
  – User can control the number of partitions
  – More partitions => more parallelism
→ Resilient
  – Recover from node failures
  – An RDD keeps its lineage information -> it can be recreated from parent RDDs
→ Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program
→ May be persisted in memory for efficient reuse across parallel operations (caching)
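The properties listed above can be seen in a few lines. This is a sketch under stated assumptions: the data is a made-up range of integers, the partition count of 4 is arbitrary, and the HDFS path in the comment is hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

// Create an RDD from a collection in the driver program, asking for 4 partitions.
// Reading a file instead would look like: sc.textFile("hdfs:///tmp/example.txt") (hypothetical path)
val numbers = sc.parallelize(1 to 1000, 4)
val partitionCount = numbers.partitions.length

// Transformations return new RDDs; the parent RDD is never modified (immutability).
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// Persist in memory so that the two actions below reuse the computed partitions.
squares.cache()
val howMany = squares.count()
val sumOfSquares = squares.reduce(_ + _)
println(s"partitions=$partitionCount count=$howMany sum=$sumOfSquares")

// The lineage back to the parent RDDs is what lets Spark recompute lost partitions.
println(squares.toDebugString)
```

The output of toDebugString lists the chain of parent RDDs, which is the lineage information the slide refers to.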


RDD – Resilient Distributed Dataset

[Diagram: RDD 1 (Partitions 1 to 4) and RDD 2 (Partitions 1 to 3) spread across Cluster Nodes]


Spark SQL


Spark SQL Overview

→ Spark module for structured data processing (e.g. DB tables, JSON files)
→ Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API
→ Same execution engine for all three
→ Spark SQL interfaces provide more information about both structure and computation being performed than basic Spark RDD API


DataFrames

→ Conceptually equivalent to a table in relational DB or data frame in R/Python
→ API available in Scala, Java, Python, and R
→ Richer optimizations (significantly faster than RDDs)
→ Distributed collection of data organized into named columns
→ Underneath is an RDD


DataFrames

Created from Various Sources

[Diagram: Spark SQL reads CSV, Avro, Hive, and text sources into a DataFrame (with RDD underneath) made of rows and columns Col1, Col2, …, ColN]

→ DataFrames from Hive:
  – Reading and writing Hive tables, including ORC
→ DataFrames from files:
  – Built-in: JSON, JDBC, ORC, Parquet, HDFS
  – External plug-in: CSV, HBase, Avro
→ DataFrames from existing RDDs
  – with toDF() function

Data is described as a DataFrame with rows, columns, and a schema
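Two of the sources listed above, an existing RDD and a built-in file format, can be sketched as follows. The three flight rows are made-up sample data and the output location is a temporary directory chosen for illustration; on a cluster whose default file system is HDFS, the same path would be resolved there instead of on local disk.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Reuse the notebook's contexts if they exist; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("df-sources-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._ // brings toDF() into scope for RDDs of tuples and case classes

// From an existing RDD: name the columns with toDF().
val flightsRdd = sc.parallelize(Seq(("IAD", "TPA", 8), ("IND", "BWI", -4), ("IND", "LAS", 67)))
val flights = flightsRdd.toDF("Origin", "Dest", "DepDelay")
flights.printSchema() // rows, columns, and a schema
flights.show()

// From a built-in file format: write the DataFrame out as JSON, then read it back.
val path = java.nio.file.Files.createTempDirectory("flights").resolve("json").toString
flights.write.json(path)
val fromJson = sqlContext.read.json(path)
val rowCount = fromJson.count()
println(s"rows read back from JSON: $rowCount")
```

Reading JSON infers the schema from the data, so the column types and column order of fromJson may differ from those of flights.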


SQLContext and HiveContext

SQLContext

→ Entry point into all functionality in Spark SQL
→ All you need is a SparkContext

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

HiveContext

→ Superset of functionality provided by basic SQLContext
  – Read data from Hive tables
  – Access to Hive functions → UDFs
→ Use when your data resides in Hive

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)


Spark SQL Examples


DataFrame Example

Reading Data From Table

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+


DataFrame Example

Using DataFrame API to Filter Data (show delays more than 15 min)

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


SQL Example

Using SQL to Query and Filter Data (again, show delays more than 15 min)

// Register Temporary Table
df.registerTempTable("flights")

// Use SQL to Query Dataset
sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+
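The DataFrame and SQL examples rely on a flightsTbl table that exists in the lab environment. To try the same two queries without it, the sketch below builds a tiny stand-in DataFrame from the first five rows shown earlier and checks that both routes return the same rows. The stand-in data and the variable names are assumptions for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.col

// Reuse the notebook's contexts if they exist; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("flights-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)

// Stand-in for flightsTbl: five sample rows only.
val df = sqlContext.createDataFrame(Seq(
  ("IAD", "TPA", 8), ("IAD", "TPA", 19), ("IND", "BWI", 8), ("IND", "BWI", -4), ("IND", "BWI", 34)
)).toDF("Origin", "Dest", "DepDelay")

// Route 1: DataFrame API.
val viaApi = df.select("Origin", "Dest", "DepDelay").filter(col("DepDelay") > 15)

// Route 2: SQL against a temporary table.
df.registerTempTable("flights")
val viaSql = sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15")

viaApi.show()
viaSql.show()
val sameRows = viaApi.collect().toSet == viaSql.collect().toSet
println(s"same rows from both routes: $sameRows")
```

col("DepDelay") is used here in place of the $"DepDelay" shorthand so that the sketch does not depend on import sqlContext.implicits._.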


RDD vs. DataFrame


RDDs vs. DataFrames

RDD

→ Lower-level API (more control)
→ Lots of existing code & users
→ Compile-time type-safety

DataFrame

→ Higher-level API (faster development)
→ Faster sorting, hashing, and serialization
→ More opportunities for automatic optimization
→ Lower memory pressure


Data Frames are Intuitive

Find average age by department?

dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61

[Code listings on slide: RDD Example; Equivalent Data Frame Example]
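The average-age question can be answered with either API. The following is a sketch of how the two versions might look, not a reproduction of the slide's listings; it uses the four rows from the table above and hypothetical variable names.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

// Reuse the notebook's contexts if they exist; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("avg-age-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)

val people = Seq(
  ("Bio", "H Smith", 48), ("CS", "A Turing", 54), ("Bio", "B Jones", 43), ("Phys", "E Witten", 61))

// RDD version: carry a (sum, count) pair per department, then divide.
val avgByDeptRdd = sc.parallelize(people)
  .map { case (dept, _, age) => (dept, (age, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum.toDouble / count }
  .collect()
  .toMap

// DataFrame version: say what is wanted and let Spark SQL plan the aggregation.
val df = sqlContext.createDataFrame(people).toDF("dept", "name", "age")
val avgByDeptDf = df.groupBy("dept").agg(avg("age"))

println(avgByDeptRdd)
avgByDeptDf.show()
```

The RDD version spells out how to aggregate, while the DataFrame version only names the grouping column and the aggregate.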


Spark SQL Optimizations

→ Spark SQL uses an underlying optimization engine (Catalyst)
  – Catalyst can perform intelligent optimization since it understands the schema
→ Spark SQL does not materialize all the columns (as with RDD), only what's needed


Catalyst: Spark SQL optimizer

→ Query or DataFrame operations modeled as a tree
→ Logical plan created and optimized
→ Various physical plans created; best plan chosen
→ Code generation and execution
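The stages listed above can be inspected for any query with explain. A minimal sketch, assuming a two-row sample DataFrame invented for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Reuse the notebook's contexts if they exist; otherwise start local ones (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("catalyst-sketch").setMaster("local[*]"))
val sqlContext = SQLContext.getOrCreate(sc)

val df = sqlContext.createDataFrame(Seq(("IAD", "TPA", 8), ("IND", "BWI", 34)))
  .toDF("Origin", "Dest", "DepDelay")

// Nothing runs yet: these calls only build up a query plan.
val delayed = df.filter(df("DepDelay") > 15).select("Origin", "Dest")

// explain(true) prints the parsed, analyzed, and optimized logical plans plus the physical plan.
delayed.explain(true)

// The optimized logical plan is also available as an object for closer inspection.
val optimizedPlan = delayed.queryExecution.optimizedPlan
println(optimizedPlan)
```

The exact text of the plans varies between Spark versions, so none is reproduced here.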


Spark Streaming


Spark Streaming

Overview

→ Extension of Spark Core API
→ Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant


Spark Streaming


Spark Streaming

Window Operations

→ Apply transformations over a sliding window of data, e.g. rolling average
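A rolling average over a sliding window might be sketched as below. To stay free of external sources, the stream is fed from an in-memory queue of five made-up readings, one per batch. The 1-second batch interval, the 3-second window, and the 8-second run time are arbitrary choices for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

// Reuse the notebook's context if one exists; otherwise start a local one (assumption).
val sc = SparkContext.getOrCreate(
  new SparkConf().setAppName("window-sketch").setMaster("local[2]"))
val ssc = new StreamingContext(sc, Seconds(1)) // 1-second batches

// queueStream turns each queued RDD into one batch of the stream.
val batches = mutable.Queue[RDD[Double]]()
(1 to 5).foreach(i => batches += sc.parallelize(Seq(i.toDouble)))
val readings = ssc.queueStream(batches)

// Rolling average over the last 3 seconds, recomputed every second:
// reduce (sum, count) pairs across the window, then divide.
val rollingAvg = readings
  .map(v => (v, 1L))
  .reduceByWindow((a, b) => (a._1 + b._1, a._2 + b._2), Seconds(3), Seconds(1))
  .map { case (sum, n) => sum / n }

rollingAvg.print()

ssc.start()
ssc.awaitTerminationOrTimeout(8000) // let a few windows pass, then stop
ssc.stop(stopSparkContext = false, stopGracefully = false)
```

A real job would replace queueStream with a live source and would typically run until stopped.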


Apache Zeppelin & HDP Sandbox


Apache Zeppelin – A Modern Web-based Data Science Studio

→ Data exploration and discovery
→ Visualization
→ Deeply integrated with Spark and Hadoop
→ Pluggable interpreters
→ Multiple languages in one notebook: R, Python, Scala


What's not included with Spark?

→ Resource Management
→ Storage
→ Applications

[Diagram: Spark Core Engine with Scala, Java, and Python libraries, MLlib (Machine learning), Spark SQL*, and Spark Streaming*]


HDP Sandbox

What's included in the Sandbox?

→ Zeppelin
→ Latest Hortonworks Data Platform (HDP)
  – Spark
  – YARN → Resource Management
  – HDFS → Distributed Storage Layer
  – And many more components...

[Diagram: Scala, Java, Python, and R APIs over Spark SQL, Spark Streaming, MLlib, and GraphX on the Spark Core Engine, running on YARN over HDFS nodes 1 … N]


Access patterns enabled by YARN

[Diagram: Batch, Interactive, and Real-Time applications on YARN: Data Operating System, over HDFS (Hadoop Distributed File System) nodes 1 … N]

Batch: Needs to happen, but no timeframe limitations
Interactive: Needs to happen at human time
Real-Time: Needs to happen at machine execution time


Why Spark on YARN?

→ Utilize existing HDP cluster infrastructure
→ Resource management
  – Share Spark workloads with other workloads like Pig, Hive, etc.
→ Scheduling and queues

[Diagram: the Spark Driver runs in the Client; the Spark Application Master runs in a YARN container; three Spark Executors each run in their own YARN container with two Tasks each]
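The layout in the diagram (driver in the client, three executors running two tasks each) can be expressed as Spark configuration. This is a sketch for Spark 1.x: the queue name, memory size, and application name are assumptions, and only the configuration object is built, since creating a context with this master requires a reachable YARN cluster.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("yarn-sketch")            // hypothetical application name
  .setMaster("yarn-client")             // Spark 1.x master URL: the driver stays in the client
  .set("spark.yarn.queue", "default")   // assumption: submit to the YARN queue named "default"
  .set("spark.executor.instances", "3") // three executors, each in its own YARN container
  .set("spark.executor.cores", "2")     // up to two concurrent tasks per executor
  .set("spark.executor.memory", "2g")   // assumption: 2 GB of heap per executor

// Print the settings that would be handed to YARN when a SparkContext is created.
println(conf.toDebugString)
```

The same settings can also be passed on the spark-submit command line instead of being set in code.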


Why HDFS?

Fault Tolerant Distributed Storage

• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
  – Not just storage but computation

[Diagram: a Logical File is split into Blocks 1 to 4, and three copies of each block are placed on different nodes in the Cluster]
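The cost of splitting and replicating a file can be worked through with simple arithmetic. The 128 MB block size (the Hadoop 2 default) and the 500 MB file are assumptions for illustration; the replication factor of 3 comes from the slide.

```scala
// Assumptions for illustration: 128 MB blocks (Hadoop 2 default), replication factor 3.
val blockSizeBytes = 128L * 1024 * 1024
val replication = 3

// Number of blocks needed for a file: ceiling of fileSize / blockSize.
def blockCount(fileSizeBytes: Long): Long =
  (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes

val fileSizeBytes = 500L * 1024 * 1024           // hypothetical 500 MB file
val blocks = blockCount(fileSizeBytes)           // 3 full blocks plus 1 partial block
val storedCopies = blocks * replication          // block replicas spread across the cluster
val rawStorageBytes = fileSizeBytes * replication

println(s"blocks=$blocks copies=$storedCopies rawMB=${rawStorageBytes / (1024 * 1024)}")
```

With these numbers the file occupies 4 blocks and 12 block replicas, which matches the four blocks and three copies drawn in the diagram, and it consumes 1500 MB of raw cluster storage.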


There's more to HDP

Hortonworks Data Platform 2.4.x

GOVERNANCE & INTEGRATION
  Data Lifecycle & Governance: Falcon, Atlas
  Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS

DATA ACCESS
  Batch: MapReduce
  Script: Pig
  Search: Solr
  SQL: Hive
  NoSQL: HBase, Accumulo, Phoenix
  Stream: Storm
  In-memory
  Others: ISV Engines
  Supporting layers: Tez, Slider

DATA MANAGEMENT
  YARN: Data Operating System
  HDFS: Hadoop Distributed File System (nodes 1 … N)

SECURITY
  Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption

OPERATIONS
  Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
  Scheduling: Oozie

Deployment Choice: Linux, Windows, On-Premise, Cloud


HDP 2.5 TP


View User Sessions


Hortonworks Community Connection


Hortonworks Community Connection

Read access for everyone, join to participate and be recognized

• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories


Community Engagement

Participate now at: community.hortonworks.com

7,500+ Registered Users
15,000+ Answers
20,000+ Technical Assets

One Website!


Lab Preview


Link to Tutorial with Lab Instructions

http://tinyurl.com/hwx-intro-to-spark

Robert Hryniewicz @RobHryniewicz

Thanks!