Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop...

38
Hadoop – Spark Overview J. Allemandou Generalities Hadoop Spark Demo Apache Hadoop & Spark – What is it ? Joseph Allemandou JoalTech 17 / 05 / 2018 1 / 23

Transcript of Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop...

Page 1: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop & Spark – What is it ?

Joseph Allemandou

JoalTech

17 / 05 / 2018

1 / 23

Page 2: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Hello!

Joseph Allemandou

2 / 23

Page 3: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Plan

1 Generalities on High Performance Computing (HPC)

2 Apache Hadoop and Spark – A Glimpse

3 Demonstration

3 / 23

Page 4: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

More computation power: Scale up vs. scale out

Scale Up Scale out

4 / 23

Page 5: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

More computation power: Scale up vs. scale out

Scale UpScale out

4 / 23

Page 6: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Parallel computing

Things to consider when doing parallel computing:

Partitioning (tasks, data)

Communications

Synchronization

Data dependencies

Load balancing

Granularity

IO

Livermore Computing Center - Tutorial

5 / 23

Page 7: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Looking back - Since 1950

Figure: Google Ngram Viewer

6 / 23

Page 8: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Looking back - Since 1950

Figure: Google Ngram Viewer

6 / 23

Page 9: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Looking back - Since 1950

Figure: Google Ngram Viewer

6 / 23

Page 10: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Looking back - Recent times

Figure: Google Trends

7 / 23

Page 11: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Same problem, different tools

Supercomputer

Dedicated hardware

Message Passing Interface

Big Data

Commodity hardware

8 / 23

Page 12: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

MPI

C / C++ / Fortran / Python

Low-level API - Send / receive messages

a lot to do manually

split the data

assign tasks to workers

handle synchronisation

handle errors

9 / 23

Page 13: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Hadoop + Spark

Java / Scala / Python / R

High-level API - dataflows

Less performant than MPI (see this scientific paper )

Easier to code than MPI (a lot!)

High-availability oriented

10 / 23

Page 14: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Plan

1 Generalities on High Performance Computing (HPC)

2 Apache Hadoop and Spark – A Glimpse

3 Demonstration

11 / 23

Page 15: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop (v2) in (almost) 3 points

Distributed storage and computation platform (Java, opensource)

HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator

Built for scalability

Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)

Working with multiple computation engines

Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)

12 / 23

Page 16: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop (v2) in (almost) 3 points

Distributed storage and computation platform (Java, opensource)

HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator

Built for scalability

Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)

Working with multiple computation engines

Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)

12 / 23

Page 17: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop (v2) in (almost) 3 points

Distributed storage and computation platform (Java, opensource)

HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator

Built for scalability

Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)

Working with multiple computation engines

Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)

12 / 23

Page 18: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop (v2) in (almost) 3 points

Distributed storage and computation platform (Java, opensource)

HDFS – Hadoop Distributed File SystemYARN – Yet Another Resource Negotiator

Built for scalability

Up to thousands of nodes (see this tweet )Can be easily extended (heterogeneous hardware)Is resilient to errors (to some extent)

Working with multiple computation engines

Hadoop MapReduce (the good old one)Apache Spark (the new kid on the block)

12 / 23

Page 19: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop - A big ecosysem

13 / 23

Page 20: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Things to know about Apache Hadoop

HDFS

Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)

YARN

Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)

Global

Execution engines exploit data locality

14 / 23

Page 21: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Things to know about Apache Hadoop

HDFS

Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)

YARN

Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)

Global

Execution engines exploit data locality

14 / 23

Page 22: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Things to know about Apache Hadoop

HDFS

Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)

YARN

Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)

Global

Execution engines exploit data locality

14 / 23

Page 23: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Things to know about Apache Hadoop

HDFS

Files are split in blocks – Partitioning !Blocks are replicated – Failure tolerance !Master - Slave architecture (NameNode & DataNodes)

YARN

Execution containers with defined resourcesScheduler manages resource sharingMaster - Slave architecture (ResourceManager &NodeManagers)

Global

Execution engines exploit data locality

14 / 23

Page 24: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop - The Frame (HDFS + YARN)

15 / 23

Page 25: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop - The Frame (HDFS + YARN)

15 / 23

Page 26: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop - The Frame (HDFS + YARN)

15 / 23

Page 27: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop - MapReduce

16 / 23

Page 28: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Hadoop - MapReduce

16 / 23

Page 29: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Spark

17 / 23

Page 30: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Apache Spark

17 / 23

Page 31: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Why is Spark preferred to MapReduce?

Programming data-flows instead of single map-reducesteps

A lot less code for complex flowsData lineage allows for error recovery

Decoupling of tasks and containers – Spark executors runmultiple tasks

Less executor management overhead

Executors can reuse RAM - Caching!

18 / 23

Page 32: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Why is Spark preferred to MapReduce?

Programming data-flows instead of single map-reducesteps

A lot less code for complex flowsData lineage allows for error recovery

Decoupling of tasks and containers – Spark executors runmultiple tasks

Less executor management overhead

Executors can reuse RAM - Caching!

18 / 23

Page 33: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Why is Spark preferred to MapReduce?

Programming data-flows instead of single map-reducesteps

A lot less code for complex flowsData lineage allows for error recovery

Decoupling of tasks and containers – Spark executors runmultiple tasks

Less executor management overhead

Executors can reuse RAM - Caching!

18 / 23

Page 34: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Spark - A big ecosysem as well

19 / 23

Page 35: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Spark - An example of (very) complex flow

Mediawiki History Reconstruction Job

20 / 23

Page 36: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Plan

1 Generalities on High Performance Computing (HPC)

2 Apache Hadoop and Spark – A Glimpse

3 Demonstration

21 / 23

Page 37: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Scala Spark / Toree / Jupyter notebook

Notebook with Spark on Docker

22 / 23

Page 38: Apache Hadoop & Spark What is it - Ifremer · Spark Overview J. Allemandou Generalities Hadoop Spark Demo Parallel computing Things to consider when doing parallel computing: Partitioning

Hadoop –Spark

Overview

J. Allemandou

Generalities

Hadoop Spark

Demo

Thank you!

23 / 23