Apache Spark Operations


  • Date posted: 17-Jan-2017
  • Category: Software


Spark Operations
Kostas Sakellis

# Cloudera, Inc. All rights reserved.

Notes: Let's talk about the issues we have seen our customers run into as they try to get Spark into production.

Me
  • Software Engineer at Cloudera
  • Contributor to Apache Spark
  • Before that, contributed to Cloudera Manager

Notes: In scope: operational issues, not building the code itself. The experience comes from our customer support tickets.

Building a proof of concept!
Courtesy of: http://www.nefloridadesign.com/mbimages/6.jpg

Notes: Spark makes building a proof of concept with a subset of data relatively easy. But then things go wrong. (Plug for my talk at Hadoop Summit.)

Example

    sc.textFile("hdfs://data/u.item", 4)
      .map(Movie(_))
      .filter(_.month.equals("Nov"))
      .collect()

Notes: Let's start with an example program in Spark. The collect() call launches a job.

Partitions
[Diagram: an HDFS file split into Partition 1, Partition 2, Partition 3, Partition 4]

Notes: A partition is a chunk of data somewhere.
  • Could be on the Hadoop File System (HDFS)
  • Could be cached in Spark
  • Defines the degree of parallelism (the second argument to textFile, 4 in the example, is the minimum number of partitions)

RDDs
[Diagram, built up over several slides: HDFS -> RDD (textFile) -> RDD (map) -> RDD (filter) -> Collect; each RDD holds Partition 1 to Partition 4]

Notes: An RDD describes a way of generating input and output partitions.
  • Immutable: very important!
  • RDDs can depend on other RDDs
      • Most have a single parent
      • Joins have multiple parents
  • Lineage over replication for fault tolerance
  • https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

RDD Lineage
[Diagram: the same chain, HDFS -> RDD -> RDD -> RDD -> Collect, with the chain of RDDs labeled "Lineage"]

Task
[Diagram: one partition followed through the chain of RDDs to Collect]

A pipelined set of transformations on a single thread
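To make the pipelining idea concrete, here is a minimal sketch in plain Scala collections, with no Spark dependency. It is not how Spark implements tasks. The `Movie` case class, the `title|month` line format, and the helper names `parseMovie`, `runTask`, and `collect` are all assumptions made up for this illustration.

```scala
object PipelineSketch {
  // Hypothetical stand-in for the Movie class used on the slides.
  final case class Movie(title: String, month: String)

  // Assumed line format for this sketch only: "title|month".
  def parseMovie(line: String): Movie = {
    val fields = line.split('|')
    Movie(fields(0), fields(1))
  }

  // One task: map and filter are chained over a single iterator, so each
  // record passes through both steps before the next record is read.
  def runTask(partition: Iterator[String]): Iterator[Movie] =
    partition.map(parseMovie).filter(_.month == "Nov")

  // Stand-in for collect(): run one task per partition and gather the results.
  // The tasks run one after another here; Spark would run them in parallel.
  def collect(partitions: Seq[Seq[String]]): Seq[Movie] =
    partitions.flatMap(p => runTask(p.iterator).toList)

  // Four partitions, mirroring the "4" passed to textFile on the slides.
  val partitions: Seq[Seq[String]] = Seq(
    Seq("Toy Story|Jan", "Heat|Nov"),
    Seq("Casino|Nov"),
    Seq("Babe|Aug"),
    Seq.empty
  )
}
```

Calling `PipelineSketch.collect(PipelineSketch.partitions)` returns the two November movies. No intermediate collection is built between the map and filter steps, which is the point of pipelining within a task.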

Spark Architecture

Notes: Let's review the general Spark architecture.

Spark System Architecture

Notes:
  • A driver
      • Where the DAG scheduler lives
      • Drives the show
      • Single point of failure
  • Executors
      • Communicate with the driver
      • Run the tasks created by the driver
      • Think of this as a ThreadPoolExecutor in Java
  • Pluggable cluster managers
      • YARN, Mesos, standalone
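The ThreadPoolExecutor analogy from the notes can be sketched as follows. This is only an analogy, not Spark's executor code. The object name `ExecutorAnalogy`, the `runTasks` helper, and the choice of summing integers as the "task" are assumptions for illustration; the `cores` parameter plays the role of the executor's core count.

```scala
import java.util.concurrent.{Callable, Executors}

object ExecutorAnalogy {
  // Run one "task" per partition on a fixed-size thread pool.
  // At most `cores` tasks run at the same time, as on an executor.
  def runTasks(partitions: Seq[Seq[Int]], cores: Int): Seq[Int] = {
    val pool = Executors.newFixedThreadPool(cores)
    try {
      // Submit every task first (the "driver" handing out work) ...
      val futures = partitions.map { p =>
        pool.submit(new Callable[Int] { def call(): Int = p.sum })
      }
      // ... then wait for each result, in partition order.
      futures.map(_.get())
    } finally {
      pool.shutdown()
    }
  }
}
```

With four partitions and `cores = 2`, two tasks run at a time and the other two wait in the pool's queue, which is roughly what happens when an executor has fewer cores than there are pending tasks.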

Deployments
  • Spark supports pluggable cluster managers
      • local, Standalone, YARN and Mesos
  • In early 2014, CDH 4.x with Spark 0.9 only supported Standalone
  • CDH 5.x includes Spark on YARN support

Standalone
[Diagram labels: Client, Master, Worker (x2), Process (x2), AppMaster]

Standalone
  • On cluster

        ./sbin/start-master.sh
        ./sbin/start-slave.sh

  • Submit job

        spark-submit --master

YARN Architecture
[Diagram labels: Client, Resource Manager, Node Manager (x2), Container (x3), Process (x2), AppMaster]

Spark on YARN Architecture
[Diagram labels: Client, Resource Manager, Node Manager (x2), Container (x3), Process (x2), AppMaster]

Spark on YARN
  • Submit job

        spark-submit --master yarn-client

  • Cluster mode

        spark-submit --master yarn-cluster

  • Spark shell only works in client mode!

Customers often have shared infrastructure

Courtesy of: https://radioglobalistic.files.wordpress.com/2011/02/lagos-traffic.jpg


Multi-tenancy
  • Cluster utilization is the top metric
      • Target: 70-80% utilization
  • Mixed workloads from mixed customers
  • We recommend YARN
      • Built-in resource manager

Underutilized Clusters

Courtesy of: http://media.nbclosangeles.com/images/1200*675/60-freeway-repair-dec16-2-empty.JPG

Dynamic Allocation
  • Spark applications scale the number of executors based on load
  • Removes the need for --num-executors
  • Idle executors get killed
  • First supported in CDH 5.4
  • Ideal for:
      • Long ETL jobs with large shuffles
      • Shell applications: Hive and Spark shell
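As a rough illustration of what turning this on looks like, the sketch below builds the `--conf` arguments for spark-submit from a list of settings. The property names are Spark's own; the values (2, 20, 60) and the helper `toSubmitArgs` are assumptions for this example, not recommendations from the talk.

```scala
object DynamicAllocationConf {
  // Property names are real Spark settings; the values are illustrative only.
  val settings: Seq[(String, String)] = Seq(
    "spark.dynamicAllocation.enabled" -> "true",
    // The external shuffle service keeps shuffle files available
    // after an idle executor has been removed.
    "spark.shuffle.service.enabled" -> "true",
    "spark.dynamicAllocation.minExecutors" -> "2",
    "spark.dynamicAllocation.maxExecutors" -> "20",
    // Seconds an executor may sit idle before it is released.
    "spark.dynamicAllocation.executorIdleTimeout" -> "60"
  )

  // Hypothetical helper: render the settings as spark-submit arguments.
  def toSubmitArgs(conf: Seq[(String, String)]): Seq[String] =
    conf.flatMap { case (key, value) => Seq("--conf", s"$key=$value") }
}
```

`DynamicAllocationConf.toSubmitArgs(DynamicAllocationConf.settings).mkString(" ")` yields a string that can be appended to a spark-submit command line in place of a fixed `--num-executors`.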

Dynamic Allocation Limitations
  • Still required to specify:
      • Cores: --executor-cores
      • Memory: --executor-memory
          • Includes JVM overhead
          • Need to do the math yourself
  • Our customers still get it wrong!
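The "math" in question can be sketched as below. This assumes the YARN container request is the executor memory plus an overhead of max(384 MB, 10% of executor memory). That rule is an assumption here: the factor and floor have differed between Spark releases and can be overridden, and YARN's rounding of requests to its allocation increment is ignored. The object and function names are made up for this example.

```scala
object ExecutorSizing {
  // Assumed overhead rule; check the defaults for your Spark release.
  val MinOverheadMb = 384
  val OverheadFactor = 0.10

  // Off-heap overhead added on top of --executor-memory.
  def overheadMb(executorMemoryMb: Int): Int =
    math.max(MinOverheadMb, (executorMemoryMb * OverheadFactor).toInt)

  // Memory actually requested from YARN for one executor container.
  def containerMb(executorMemoryMb: Int): Int =
    executorMemoryMb + overheadMb(executorMemoryMb)

  // How many such executors fit in the memory YARN may hand out on one node.
  def executorsPerNode(nodeMemoryMb: Int, executorMemoryMb: Int): Int =
    nodeMemoryMb / containerMb(executorMemoryMb)
}
```

Under these assumptions, a node with 65536 MB available to YARN fits only 7 executors started with `--executor-memory 8g`, not the 8 that dividing 64 GB by 8 GB suggests, because each container is 9011 MB once the overhead is added.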

The Future of Dynamic Allocation
  • Only task size needed: --task-size
  • Eliminates:
      • --executor-cores
      • --num-executors
      • --executor-memory
  • Leads to better cluster utilization

Security, now it's getting serious.

Courtesy of: https://www.iti.illinois.edu/sites/default/files/Cybersecurity_image.jpg

Authentication
  • Kerberos: the necessary evil
  • Ubiquitous amongst other services
      • YARN, HDFS, Hive, HBase, etc.
  • Spark utilizes delegation tokens

Encryption
  • Control plane
  • File distribution
  • Block Manager
  • User UI / REST API
  • Data-at-rest (shuffle files)

Slide annotations: SPARK-6028 (replace with netty), Spark 1.4, SPARK-2750 (SSL), SPARK-5682

Authorization
  • Enterprises have sensitive data
  • Beyond HDFS file permissions
      • Partial access to data
      • Column-level granularity
  • Apache Sentry
      • HDFS-Sentry synchronization plugin
  • RecordService
      • Column-level security for Spark!

Thank you
We're Hiring!