Intro to Spark & Zeppelin - Crash Course - HS16SJ

Robert Hryniewicz, Data Evangelist, @RobHryniewicz

Hands-on Intro to Spark & Zeppelin Crash Course

© Hortonworks Inc. 2011–2016. All Rights Reserved

The “Big Data” Problem

Problem
• A single machine cannot process or even store all the data!

Solution
• Distribute data over large clusters

Difficulty
• How to split work across machines?
• Moving data over network is expensive
• Must consider data & network locality
• How to deal with failures?
• How to deal with slow nodes?


Spark Background


Access Rates

At least an order of magnitude difference between memory and hard drive / network speed

[Figure: memory is FAST; hard drive and network are slow]


What is Spark?

• Apache Open Source Project - originally developed at AMPLab (University of California, Berkeley)
• Data Processing Engine - focused on in-memory distributed computing use-cases
• API - Scala, Python, Java and R


Spark Ecosystem

Spark Core

Spark SQL, Spark Streaming, MLlib, GraphX


Why Spark?

• Elegant Developer APIs
  – Single environment for data munging and Machine Learning (ML)
• In-memory computation model
  – Fast!
  – Effective for iterative computations and ML
• Machine Learning
  – Implementation of distributed ML algorithms
  – Pipeline API (Spark ML)


History of Hadoop & Spark


Apache Spark Basics


SparkContext

What is it?
• Main entry point for Spark functionality
• Represents a connection to a Spark cluster
• Represented as sc in your code


RDD - Resilient Distributed Dataset

• Primary abstraction in Spark
  – An immutable collection of objects (or records, or elements) that can be operated on in parallel
• Distributed
  – Collection of elements partitioned across nodes in a cluster
  – Each RDD is composed of one or more partitions
  – User can control the number of partitions
  – More partitions => more parallelism
• Resilient
  – Recover from node failures
  – An RDD keeps its lineage information -> it can be recreated from parent RDDs
• Created by starting with a file in Hadoop Distributed File System (HDFS) or an existing collection in the driver program
• May be persisted in memory for efficient reuse across parallel operations (caching)
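The RDD properties above can be sketched in a few lines of Scala. This is a minimal illustration rather than part of the original slides: it assumes a throwaway local SparkContext (in Zeppelin or spark-shell, `sc` already exists and the two setup lines should be skipped), and the numbers and variable names are made up for the example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Assumption: a local context just for this sketch.
// In Zeppelin or spark-shell, `sc` is provided; skip these two lines there.
val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]")
val sc = new SparkContext(conf)

// Create an RDD from an existing collection in the driver, asking for 4 partitions.
val numbers = sc.parallelize(1 to 100, 4)
val numPartitions = numbers.partitions.length
println(s"partitions: $numPartitions")

// Transformations are lazy and return new RDDs; the parents are never modified.
val evens = numbers.filter(_ % 2 == 0)
val squares = evens.map(n => n * n)

// Ask Spark to keep the result in memory so later actions can reuse it.
squares.cache()

// Actions trigger the actual computation.
val count = squares.count()
val total = squares.sum()

// The lineage Spark would replay to rebuild a lost partition.
println(squares.toDebugString)

sc.stop()
```

The second action (`sum`) reuses the cached partitions instead of recomputing the filter and map.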


RDD – Resilient Distributed Dataset

[Figure: RDD 1 (Partitions 1 to 4) and RDD 2 (Partitions 1 to 3) spread across cluster nodes]


Spark SQL


Spark SQL Overview

• Spark module for structured data processing (e.g. DB tables, JSON files)
• Three ways to manipulate data:
  – DataFrames API
  – SQL queries
  – Datasets API
• Same execution engine for all three
• Spark SQL interfaces provide more information about both the structure and the computation being performed than the basic Spark RDD API


DataFrames

• Conceptually equivalent to a table in a relational DB or a data frame in R/Python
• API available in Scala, Java, Python, and R
• Richer optimizations (significantly faster than RDDs)
• Distributed collection of data organized into named columns
• Underneath is an RDD


DataFrames

[Figure: sources (CSV, Avro, HIVE, Spark SQL, Text) feeding a DataFrame (with RDD underneath) made of Rows and Columns Col1, Col2, …, ColN]

Created from Various Sources
• DataFrames from HIVE:
  – Reading and writing HIVE tables, including ORC
• DataFrames from files:
  – Built-in: JSON, JDBC, ORC, Parquet, HDFS
  – External plug-in: CSV, HBASE, Avro
• DataFrames from existing RDDs
  – with toDF() function

Data is described as a DataFrame with rows, columns and a schema
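As a small sketch of the toDF() route, the snippet below turns an RDD of case-class records into a DataFrame. The `Flight` case class and the local SparkContext are assumptions made for illustration (Zeppelin and spark-shell already provide `sc` and `sqlContext`); the three sample rows are copied from the flight output shown later in the deck.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; its field names become the column names.
case class Flight(Origin: String, Dest: String, DepDelay: Int)

// Assumption: local context for the sketch; skip when `sc`/`sqlContext` already exist.
val sc = new SparkContext(new SparkConf().setAppName("todf-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._   // brings toDF() into scope

val flightsRdd = sc.parallelize(Seq(
  Flight("IAD", "TPA", 8),
  Flight("IAD", "TPA", 19),
  Flight("IND", "BWI", -4)))

// The schema (column names and types) is inferred from the case class.
val flightsDf = flightsRdd.toDF()
val columnNames = flightsDf.columns.toSeq

flightsDf.printSchema()
flightsDf.show()

sc.stop()
```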


SQLContext and HiveContext

SQLContext
• Entry point into all functionality in Spark SQL
• All you need is a SparkContext

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(sc)

HiveContext
• Superset of functionality provided by basic SQLContext
  – Read data from Hive tables
  – Access to Hive Functions -> UDFs
• Use when your data resides in Hive

import org.apache.spark.sql.hive.HiveContext
val hc = new HiveContext(sc)


Spark SQL Examples


DataFrame Example

val df = sqlContext.table("flightsTbl")

df.select("Origin", "Dest", "DepDelay").show(5)

Reading Data From Table

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|       8|
|   IAD| TPA|      19|
|   IND| BWI|       8|
|   IND| BWI|      -4|
|   IND| BWI|      34|
+------+----+--------+


DataFrame Example

df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15).show(5)

Using DataFrame API to Filter Data (show delays more than 15 min)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


SQL Example

// Register Temporary Table

df.registerTempTable("flights")

// Use SQL to Query Dataset

sqlContext.sql("SELECT Origin, Dest, DepDelay FROM flights WHERE DepDelay > 15 LIMIT 5").show()

Using SQL to Query and Filter Data (again, show delays more than 15 min)

+------+----+--------+
|Origin|Dest|DepDelay|
+------+----+--------+
|   IAD| TPA|      19|
|   IND| BWI|      34|
|   IND| JAX|      25|
|   IND| LAS|      67|
|   IND| MCO|      94|
+------+----+--------+


RDD vs. DataFrame


RDDs vs. DataFrames

RDD
• Lower-level API (more control)
• Lots of existing code & users
• Compile-time type-safety

DataFrame
• Higher-level API (faster development)
• Faster sorting, hashing, and serialization
• More opportunities for automatic optimization
• Lower memory pressure


Data Frames are Intuitive

Find average age by department?

dept  name      age
Bio   H Smith   48
CS    A Turing  54
Bio   B Jones   43
Phys  E Witten  61

[Slide code not preserved: RDD Example; Equivalent Data Frame Example]
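The code from this slide did not survive extraction, so what follows is not the presenter's code. It is one plausible sketch of the comparison, computing the average age by department first with the RDD API and then with the DataFrame API. The local SparkContext and all variable names are assumptions; the four rows come from the table on the slide.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.avg

// Assumption: local context for the sketch; skip when `sc`/`sqlContext` already exist.
val sc = new SparkContext(new SparkConf().setAppName("avg-age-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val people = Seq(
  ("Bio", "H Smith", 48),
  ("CS", "A Turing", 54),
  ("Bio", "B Jones", 43),
  ("Phys", "E Witten", 61))

// RDD version: carry (sum, count) per department by hand, then divide.
val avgByDeptRdd = sc.parallelize(people)
  .map { case (dept, _, age) => (dept, (age.toDouble, 1)) }
  .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
  .mapValues { case (sum, count) => sum / count }
  .collect()
  .toMap

// DataFrame version: state what is wanted and let Spark SQL plan how to do it.
val avgByDeptDf = sc.parallelize(people)
  .toDF("dept", "name", "age")
  .groupBy("dept")
  .agg(avg("age"))

avgByDeptDf.show()
println(avgByDeptRdd)

sc.stop()
```

The RDD version has to spell out the bookkeeping, while the DataFrame version reads close to the question being asked.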


Spark SQL Optimizations

• Spark SQL uses an underlying optimization engine (Catalyst)
  – Catalyst can perform intelligent optimization since it understands the schema
• Spark SQL does not materialize all the columns (as with RDD), only what’s needed


Catalyst: Spark SQL optimizer

• Query or DataFrame operations modeled as a tree
• Logical plan created and optimized
• Various physical plans created; best plan chosen
• Code generation and execution
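One way to look at these phases is `explain(true)` on a DataFrame, which prints the logical plans and the chosen physical plan. The sketch below assumes a local SparkContext and uses two rows copied from the flight output earlier in the deck. The exact plan text depends on the Spark version, so no output is shown here.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Assumption: local context for the sketch; skip when `sc`/`sqlContext` already exist.
val sc = new SparkContext(new SparkConf().setAppName("catalyst-sketch").setMaster("local[2]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Tiny stand-in for the flights table used in the earlier examples.
val df = sc.parallelize(Seq(
  ("IAD", "TPA", 8),
  ("IND", "BWI", 34))).toDF("Origin", "Dest", "DepDelay")

val delayed = df.select("Origin", "Dest", "DepDelay").filter($"DepDelay" > 15)

// Prints the parsed, analyzed and optimized logical plans, then the physical plan.
delayed.explain(true)

val delayedCount = delayed.count()
sc.stop()
```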


Spark Streaming


Spark Streaming

Overview
• Extension of Spark Core API
• Stream processing of live data streams
  – Scalable
  – High-throughput
  – Fault-tolerant


Spark Streaming

Window Operations
• Apply transformations over a sliding window of data, e.g. rolling average
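A rolling average over a sliding window might look like the sketch below. Everything specific in it is an assumption for illustration: the local master, the 1-second batch interval, the 3-second window, the made-up readings, and `queueStream` standing in for a live source such as Kafka or a socket.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("window-sketch").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(1))   // 1-second batches

// Stand-in for a live source: queueStream consumes one queued RDD per batch.
val batches = new mutable.Queue[RDD[Double]]()
val readings = ssc.queueStream(batches)

// Rolling averages seen so far, collected on the driver.
val averages = mutable.ArrayBuffer[Double]()

// Window covering the last 3 seconds, re-evaluated every second.
readings.window(Seconds(3), Seconds(1)).foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    val avg = rdd.mean()
    averages.synchronized { averages += avg }
    println(f"rolling average: $avg%.2f")
  }
}

ssc.start()
for (i <- 1 to 5) {
  // Hypothetical readings: two values per batch.
  batches.synchronized { batches += ssc.sparkContext.makeRDD(Seq(i.toDouble, i + 0.5)) }
  Thread.sleep(1000)
}
ssc.stop(stopSparkContext = true, stopGracefully = false)
```

Because the window is longer than the slide interval, each reading contributes to several consecutive averages.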


Apache Zeppelin & HDP Sandbox


Apache Zeppelin – A Modern Web-based Data Science Studio

• Data exploration and discovery
• Visualization
• Deeply integrated with Spark and Hadoop
• Pluggable interpreters
• Multiple languages in one notebook: R, Python, Scala


What’s not included with Spark?

[Figure: stack diagram with the labels Resource Management, Storage, Applications, Spark Core Engine, Scala/Java/Python libraries, MLlib (Machine learning), Spark SQL*, Spark Streaming*]


HDP Sandbox

What’s included in the Sandbox?
• Zeppelin
• Latest Hortonworks Data Platform (HDP)
  – Spark
  – YARN -> Resource Management
  – HDFS -> Distributed Storage Layer
  – And many more components...

[Figure: stack of APIs (Scala, Java, Python, R); Spark SQL, Spark Streaming, MLlib, GraphX; Spark Core Engine; YARN; HDFS across nodes 1 to N]


Access patterns enabled by YARN

[Figure: Batch, Interactive and Real-Time applications; YARN: Data Operating System; HDFS (Hadoop Distributed File System) across nodes 1 to N]

Applications
• Batch: needs to happen, but with no time-frame limitations
• Interactive: needs to happen at human time
• Real-Time: needs to happen at machine execution time


Why Spark on YARN?

• Utilize existing HDP cluster infrastructure
• Resource management
  – share Spark workloads with other workloads like PIG, HIVE, etc.
• Scheduling and queues

[Figure: Spark Driver in the Client; Spark Application Master in a YARN container; three Spark Executors, each in its own YARN container and running two Tasks]


Why HDFS?

Fault Tolerant Distributed Storage
• Divide files into big blocks and distribute 3 copies randomly across the cluster
• Processing Data Locality
  • Not just storage but computation

[Figure: a logical file split into blocks 1 to 4, with three copies of each block placed across the cluster]
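The block-and-replica idea can be sketched without a cluster. The snippet below is a plain-Scala simulation, not HDFS code: the 128 MB block size, the six node names and the 500 MB file are assumptions, and placement is uniformly random as on the slide (real HDFS placement also takes racks into account).

```scala
import scala.util.Random

// Assumptions for illustration only: sizes in MB and a 128 MB block size.
val blockSizeMb = 128
val replication = 3

// Split a logical file into block lengths; the last block may be smaller.
def splitIntoBlocks(fileSizeMb: Int): Seq[Int] = {
  val full = Seq.fill(fileSizeMb / blockSizeMb)(blockSizeMb)
  val rest = fileSizeMb % blockSizeMb
  if (rest > 0) full :+ rest else full
}

// Pick `replication` distinct nodes at random for every block.
def placeReplicas(numBlocks: Int, nodes: Seq[String], rng: Random): Map[Int, Seq[String]] =
  (1 to numBlocks).map(b => b -> rng.shuffle(nodes).take(replication)).toMap

val nodes = (1 to 6).map(n => s"node$n")   // hypothetical cluster
val blocks = splitIntoBlocks(500)          // block sizes: 128, 128, 128, 116
val placement = placeReplicas(blocks.size, nodes, new Random(42))

placement.toSeq.sortBy(_._1).foreach { case (block, replicas) =>
  println(s"block $block -> ${replicas.mkString(", ")}")
}
```

Losing any single node still leaves at least two copies of every block, and a task can be scheduled on a node that already holds the block it needs.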


There’s more to HDP

Hortonworks Data Platform 2.4.x

[Figure: HDP component stack]
Pillars: Data Access, Security, Governance & Integration, Operations, Data Management
• Data Lifecycle & Governance: Falcon, Atlas
• Data Workflow: Sqoop, Flume, Kafka, NFS, WebHDFS
• Administration, Authentication, Authorization, Auditing, Data Protection: Ranger, Knox, Atlas, HDFS Encryption
• Provisioning, Managing, & Monitoring: Ambari, Cloudbreak, Zookeeper
• Scheduling: Oozie
• Batch: MapReduce
• Script: Pig
• Search: Solr
• SQL: Hive
• NoSQL: HBase, Accumulo, Phoenix
• Stream: Storm
• In-memory
• Others: ISV Engines
• Tez, Slider
• YARN: Data Operating System
• HDFS: Hadoop Distributed File System
• Deployment Choice: Linux, Windows, On-Premise, Cloud


HDP 2.5 TP


View User Sessions


Hortonworks Community Connection


Hortonworks Community Connection

Read access for everyone, join to participate and be recognized
• Full Q&A Platform (like StackOverflow)
• Knowledge Base Articles
• Code Samples and Repositories


Lab Preview


Link to Tutorial with Lab Instructions

http://tinyurl.com/hwx-intro-to-spark

Robert Hryniewicz, @RobHryniewicz

Thanks!
