
Hadoop Tutorials: Spark

Kacper Surdy

Prasanth Kothuri

About the tutorial

• The third session in the Hadoop tutorial series, this time given by Kacper and Prasanth

• Session fully dedicated to the Spark framework

• Extensively discussed

• Actively developed

• Used in production

• Mixture of a talk and hands-on exercises


What is Spark

• A framework for performing distributed computations

• Scalable, applicable for processing TBs of data

• Easy programming interface

• Supports Java, Scala, Python, R

• Varied APIs: DataFrames, SQL, MLlib, Streaming (a short sketch follows this list)

• Multiple cluster deployment modes

• Multiple data sources: HDFS, Cassandra, HBase, S3
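As a hedged illustration of the DataFrame and SQL APIs, here is a minimal spark-shell sketch; the people.json file and its columns are made-up examples, and the calls assume the sqlContext that the Spark 1.x shell provides.

val df = sqlContext.read.json("people.json")  // load JSON into a DataFrame (hypothetical file)
df.filter(df("age") > 30).show()              // DataFrame API: rows with age > 30
df.registerTempTable("people")                // make the DataFrame visible to SQL
sqlContext.sql("SELECT name FROM people WHERE age > 30").show()  // same query in SQL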


Compared to Impala

• Similar concept of workload distribution

• Overlap in SQL functionality

• Spark is more flexible and offers a richer API

• Impala is fine-tuned for SQL queries


[Diagram: a cluster of machines (Node 1 … Node X) connected by an interconnect network, each node with its own CPU, memory, and disks.]

Evolution from MapReduce

[Timeline: 2002 MapReduce at Google; 2004 MapReduce paper; 2006 Hadoop at Yahoo!; 2008 Hadoop Summit; 2010 Spark paper; 2014 Apache Spark becomes a top-level Apache project.]

Based on a Databricks slide

Deployment modes

Local mode
• Makes use of multiple cores/CPUs with thread-level parallelism

Cluster modes, with centralised resource management:

• Standalone

• Apache Mesos

• Hadoop YARN - typical for Hadoop clusters
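As a minimal sketch of how a deployment mode is selected programmatically in a standalone application (in spark-shell or spark-submit the same choice is usually made with the --master option), assuming the standard SparkConf API; the application name and host are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("DeploymentModeDemo")        // hypothetical application name
  .setMaster("local[4]")                   // local mode with 4 threads
  // .setMaster("spark://host:7077")       // standalone cluster manager (hypothetical host)
  // .setMaster("yarn-client")             // Hadoop YARN (Spark 1.x style)
val sc = new SparkContext(conf)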


How to use it

• Interactive shell (terminal): spark-shell, pyspark

• Job submission (terminal): spark-submit (use also for Python; see the example application below)

• Notebooks (web interface): Jupyter, Zeppelin, SWAN (centralised CERN Jupyter service)
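For job submission, here is a minimal sketch of a self-contained application that could be packaged into a jar and launched with spark-submit; the object name, app name, and input path are hypothetical:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile(args(0))        // input path given on the command line
      .flatMap(_.split("\\s+"))              // split each line into words
      .map(word => (word, 1))
      .reduceByKey(_ + _)                    // sum the counts per word
    counts.take(10).foreach(println)         // print a small sample on the driver
    sc.stop()
  }
}

Packaged into a jar, it would be launched with something like: spark-submit --class WordCount wordcount.jar input.txt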


Example – Pi estimation


import scala.math.random

val slices = 2
val n = 100000 * slices
val rdd = sc.parallelize(1 to n, slices)
val sample = rdd.map { i =>
  val x = random
  val y = random
  if (x*x + y*y < 1) 1 else 0
}
val count = sample.reduce(_ + _)
println("Pi is roughly " + 4.0 * count / n)

https://en.wikipedia.org/wiki/Monte_Carlo_method

Example – Pi estimation

• Log in to one of the cluster nodes haperf1[01-12]

• Start spark-shell

• Copy or retype the content of /afs/cern.ch/user/k/kasurdy/public/pi.spark


Spark concepts and terminology


Transformations, actions (1)

• Transformations define how you want to transform your data. The result of a transformation on a Spark dataset is another Spark dataset. They do not trigger a computation.

examples: map, filter, union, distinct, join

• Actions are used to collect the result from a previously defined dataset. The result is usually an array or a number. They start the computation.

examples: reduce, collect, count, first, take
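A minimal spark-shell sketch of the difference: the two transformations below only build up the computation lazily, and nothing runs until an action is called.

val rdd = sc.parallelize(1 to 10)        // distributed dataset
val evens = rdd.filter(_ % 2 == 0)       // transformation: nothing computed yet
val doubled = evens.map(_ * 2)           // transformation: still nothing computed
println(doubled.count())                 // action: triggers the job, prints 5
println(doubled.collect().mkString(",")) // action: prints 4,8,12,16,20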


Transformations, actions (2)


[Diagram: the Pi-estimation code again, annotated. The map function turns each input element 1, 2, 3, …, n-1, n into a 0 or a 1 depending on whether the random point falls inside the quarter circle; the reduce function sums them. With n = 200000, a sample run gives count = 157018, so 4.0 * count / n ≈ 3.14.]

Driver, worker, executor

[Diagram: the same Pi-estimation code seen from the architecture side. The driver hosts the SparkContext and negotiates resources with the cluster manager; worker nodes, typically on a cluster, each host an executor that runs the actual tasks.]
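A minimal spark-shell sketch of the split: the function passed to map is shipped to and runs inside the executors, while collect() brings the results back to the driver, where println runs.

val hosts = sc.parallelize(1 to 4, 4).map { _ =>
  java.net.InetAddress.getLocalHost.getHostName  // runs on an executor
}
hosts.collect().foreach(println)                 // runs on the driver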

Stages, tasks

• Task - A unit of work that will be sent to one executor

• Stage - Each job gets divided into smaller sets of tasks called stages that depend on each other; the stage boundaries act as synchronization points (see the sketch below)
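A minimal spark-shell sketch: reduceByKey requires a shuffle, so the job below is divided into two stages, each executed as one task per partition.

val words = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), 2)
val counts = words
  .map(w => (w, 1))                // stage 1: one map task per partition
  .reduceByKey(_ + _)              // shuffle: the boundary between the two stages
counts.collect().foreach(println)  // action: triggers both stages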


Your job as a directed acyclic graph (DAG)

[Diagram: an example Spark job represented as a DAG of stages; source: https://dwtobigdata.wordpress.com/2015/09/29/etl-with-apache-spark/]

Monitoring pages

• If you can, scroll the console output up to get an application ID (e.g. application_1468968864481_0070) and open

http://haperf100.cern.ch:8088/proxy/application_XXXXXXXXXXXXX_XXXX/

• Otherwise use the link below to find your application in the list

http://haperf100.cern.ch:8088/cluster


Monitoring pages

[Screenshots of the application monitoring web interface.]

Data APIs
