Hadoop Tutorials Spark - CERN


Page 1: Hadoop Tutorials Spark - CERN

Hadoop Tutorials: Spark

Kacper Surdy

Prasanth Kothuri

Page 2: Hadoop Tutorials Spark - CERN

About the tutorial

• The third session in the Hadoop tutorial series, this time given by Kacper and Prasanth

• Session fully dedicated to the Spark framework

• Extensively discussed

• Actively developed

• Used in production

• Mixture of a talk and hands-on exercises


Page 3: Hadoop Tutorials Spark - CERN

What is Spark

• A framework for performing distributed computations

• Scalable, suitable for processing TBs of data

• Easy programming interface

• Supports Java, Scala, Python, R

• Varied APIs: DataFrames, SQL, MLlib, Streaming

• Multiple cluster deployment modes

• Multiple data sources: HDFS, Cassandra, HBase, S3 (see the sketch below)
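As a taste of the DataFrame API, a minimal sketch; it assumes a Spark 2.x spark-shell (where a SparkSession is available as spark), and the HDFS path and column name are hypothetical:

val df = spark.read.option("header", "true").csv("hdfs:///user/demo/events.csv")
df.printSchema()                       // columns inferred from the CSV header
df.groupBy("category").count().show()  // a simple distributed aggregation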


Page 4: Hadoop Tutorials Spark - CERN

Compared to Impala

• Similar concept of the workload distribution

• Overlap in SQL functionalities

• Spark is more flexible and offers a richer API

• Impala is fine-tuned for SQL queries (see the SQL sketch below)
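To illustrate the SQL overlap, a hedged sketch for a Spark 2.x spark-shell; the table name and toy data are made up, and the same SELECT could equally run in Impala:

import spark.implicits._  // spark-shell: enables .toDF on local collections

val df = Seq(("higgs", 1), ("top", 2), ("higgs", 3)).toDF("particle", "run")
df.createOrReplaceTempView("events")  // register a table name for SQL
spark.sql("SELECT particle, COUNT(*) AS n FROM events GROUP BY particle").show()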


(Diagram: Node 1 through Node X, each with its own CPU, memory, and disks, connected by an interconnect network.)

Page 5: Hadoop Tutorials Spark - CERN

Evolution from MapReduce

Timeline (based on a Databricks slide):

2002: MapReduce at Google
2004: MapReduce paper
2006: Hadoop at Yahoo!
2008: Hadoop Summit
2010: Spark paper
2014: Apache Spark becomes a top-level Apache project

Page 6: Hadoop Tutorials Spark - CERN

Deployment modes

Local mode:

• Makes use of multiple cores/CPUs with thread-level parallelism (see the sketch after this list)

Cluster modes:

• Standalone

• Apache Mesos

• Hadoop YARN (typical for Hadoop clusters, with centralised resource management)
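A minimal sketch of local mode in a standalone application, using the standard SparkConf API (in spark-shell sc already exists, so there you would instead start the shell with --master local[4]); the app name is illustrative:

import org.apache.spark.{SparkConf, SparkContext}

// local[4] = a single JVM with 4 worker threads
val conf = new SparkConf().setAppName("local-demo").setMaster("local[4]")
val sc   = new SparkContext(conf)
println(sc.parallelize(1 to 1000).sum())  // 500500.0, computed on the local threads
sc.stop()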


Page 7: Hadoop Tutorials Spark - CERN

How to use it

• Interactive shells (terminal): spark-shell, pyspark

• Job submission (terminal): spark-submit (also used for Python); see the sketch below

• Notebooks (web interface): jupyter, zeppelin, SWAN (centralised CERN jupyter service)
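For example (a hedged sketch; the script name is illustrative and the right --master depends on your cluster):

spark-shell --master yarn                  # interactive Scala shell on YARN
pyspark --master yarn                      # interactive Python shell
spark-submit --master yarn my_analysis.py  # batch submission, here of a Python script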


Page 8: Hadoop Tutorials Spark - CERN

Example – Pi estimation


import scala.math.random

val slices = 2
val n = 100000 * slices
// distribute the numbers 1..n over `slices` partitions
val rdd = sc.parallelize(1 to n, slices)
// map each element to 1 if a random point in the unit square
// falls inside the quarter circle x*x + y*y < 1, else 0
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
// sum the samples: count/n approximates Pi/4
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)

https://en.wikipedia.org/wiki/Monte_Carlo_method

Page 9: Hadoop Tutorials Spark - CERN

Example – Pi estimation

• Log in to one of the cluster nodes haperf1[01-12]

• Start spark-shell

• Copy or retype the content of /afs/cern.ch/user/k/kasurdy/public/pi.spark


Page 10: Hadoop Tutorials Spark - CERN

Spark concepts and terminology


Page 11: Hadoop Tutorials Spark - CERN

Transformations, actions (1)

• Transformations define how you want to transform your data. The result of a transformation of a Spark dataset is another Spark dataset. Transformations do not trigger a computation (see the sketch after these definitions).

examples: map, filter, union, distinct, join

• Actions are used to collect the result from a dataset defined previously. The result is usually an array or a number. They start the computation.

examples: reduce, collect, count, first, take
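A minimal sketch in spark-shell contrasting the two; the values in the comments assume the code runs exactly as written:

val rdd     = sc.parallelize(1 to 10)
val evens   = rdd.filter(_ % 2 == 0)       // transformation: nothing computed yet
val doubled = evens.map(_ * 2)             // transformation: still lazy
println(doubled.count())                   // action: triggers computation, prints 5
println(doubled.collect().mkString(", "))  // action: prints 4, 8, 12, 16, 20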


Page 12: Hadoop Tutorials Spark - CERN

Transformations, actions (2)


The Pi estimation code from the previous example, annotated:

• Map function: rdd.map { … } is a transformation; it turns the input elements 1, 2, 3, …, n-1, n into 0/1 samples 0, 1, 1, …, 1, 0 without running anything yet.

• Reduce function: sample.reduce(_ + _) is an action; it sums the samples into count. For example, count = 157018 with n = 200000 gives 4.0 * count / n = 3.14036.

Page 13: Hadoop Tutorials Spark - CERN

Driver, worker, executor


(Diagram: the Pi estimation code runs in the Driver, which holds the SparkContext; through the Cluster Manager it acquires Executors on the Workers, which typically run on a cluster.)

Page 14: Hadoop Tutorials Spark - CERN

Stages, tasks

• Task - a unit of work that is sent to one executor

• Stage - each job is divided into smaller sets of tasks called stages; stages depend on each other, and their boundaries are synchronization points (see the sketch below)
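A hedged sketch of where a stage boundary appears: reduceByKey requires a shuffle, so Spark splits even this small job into two stages.

val words  = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"))
val pairs  = words.map(w => (w, 1))     // narrow work: stays within the first stage
val counts = pairs.reduceByKey(_ + _)   // shuffle: stage boundary (synchronization point)
counts.collect()                        // action: Array((a,3), (b,2), (c,1)), order may vary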


Page 15: Hadoop Tutorials Spark - CERN

Your job as a directed acyclic graph (DAG)

Figure source: https://dwtobigdata.wordpress.com/2015/09/29/etl-with-apache-spark/
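In spark-shell you can print the lineage that Spark turns into such a DAG; a minimal sketch using the standard toDebugString call:

val counts = sc.parallelize(Seq("a", "b", "a"))
  .map(w => (w, 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)  // prints the RDD lineage; indentation marks shuffle boundaries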

Page 16: Hadoop Tutorials Spark - CERN

Monitoring pages

• If you can, scroll the console output up to get an application ID (e.g. application_1468968864481_0070) and open

http://haperf100.cern.ch:8088/proxy/application_XXXXXXXXXXXXX_XXXX/

• Otherwise use the link below to find your application on the list

http://haperf100.cern.ch:8088/cluster
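If scrolling is inconvenient, the application ID can also be read from the running shell; a one-line sketch using the standard SparkContext API:

println(sc.applicationId)  // e.g. application_1468968864481_0070 when running on YARN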


Page 17: Hadoop Tutorials Spark - CERN

Monitoring pages


Page 18: Hadoop Tutorials Spark - CERN

Data APIs
