Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

© 2014 Trace3, All rights reserved.

BIG DATA INTELLIGENCE PRACTICE

HADOOP: PAST, PRESENT AND FUTURE


Roadmap


~1 hour

1. What Makes Up Hadoop 1.x?

2. What’s New In Hadoop 2.x?

3. The Future Of Hadoop …


WHAT MAKES UP HADOOP 1.0?


What’s a “Node”?

Node aka Server

Compute

Storage

Processes / Daemons / Services

Memory

Operating System


Hadoop 1.0: HDFS + MapReduce

[Diagram: NameNode and JobTracker as master services; worker nodes each running DataNode / TaskTracker; a Client holding file blocks 1-1, 1-2 and 1-3]


Hadoop 1.0: HDFS + MapReduce

[Diagram: NameNode, JobTracker and a Client with file blocks 1-1, 1-2 and 1-3; worker nodes each running Map and Reduce tasks and holding blocks 2-1, 3-2, 3-3, 4-1 and 2-3, 4-2, 2-2, 3-1, 4-3]
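
Because each worker node runs both a DataNode and a TaskTracker, map tasks can be scheduled on a node that already stores the block they read. The sketch below illustrates that idea only; it is not JobTracker code. The node names, the block placement and the `assign_map_tasks` helper are all hypothetical, and block ids follow the slide's "file-block" style.

```python
# Hypothetical block placement: block id -> worker nodes holding a copy.
BLOCK_LOCATIONS = {
    "1-1": ["node-a", "node-b"],
    "1-2": ["node-b", "node-c"],
    "1-3": ["node-a", "node-c"],
}


def assign_map_tasks(block_locations, free_slots):
    """Assign one map task per block, preferring a node that stores the block."""
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for block, holders in sorted(block_locations.items()):
        # Data-local candidates first; otherwise fall back to any free node.
        local = [n for n in holders if slots.get(n, 0) > 0]
        remote = [n for n in sorted(slots) if slots[n] > 0]
        candidates = local or remote
        if not candidates:
            raise RuntimeError(f"no free map slot for block {block}")
        node = candidates[0]
        slots[node] -= 1
        assignment[block] = node
    return assignment


assignment = assign_map_tasks(
    BLOCK_LOCATIONS, {"node-a": 1, "node-b": 1, "node-c": 1}
)
for block, node in assignment.items():
    print(f"map task for block {block} -> {node}")
```

With one free slot per node, every task in this toy setup lands on a node that holds its block. When no holder has a free slot, the task runs remotely and the block has to be read over the network.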


MapReduce v1 Limitations

Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000.

Availability: JobTracker failure kills all queued and running jobs.

Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization.

No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else.
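
The utilization point can be made concrete with a little arithmetic. The sketch below compares a node with fixed map and reduce slots against a node whose capacity is a single pool. The slot counts and task counts are hypothetical, and the pooled model assumes every task needs one equally sized unit, which is a simplification.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Fixed slots: map tasks may only use map slots, reduce tasks only reduce slots."""
    used = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return used / (map_slots + reduce_slots)


def pooled_utilization(total_units, map_tasks, reduce_tasks):
    """Pooled capacity: any task may use any free unit (assumes equal-size tasks)."""
    used = min(total_units, map_tasks + reduce_tasks)
    return used / total_units


# Hypothetical node with 8 map slots and 4 reduce slots, during a map-heavy
# phase: 12 runnable map tasks and no reduce tasks yet.
fixed = slot_utilization(8, 4, map_tasks=12, reduce_tasks=0)
pooled = pooled_utilization(12, map_tasks=12, reduce_tasks=0)
print(f"fixed slots: {fixed:.0%}, pooled: {pooled:.0%}")
```

In this made-up case the four reduce slots sit idle while map tasks wait, so a third of the node goes unused under fixed partitioning.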


Hadoop 1.0: Single Use System

HADOOP 1.0

Single Use System: Batch Apps

HDFS (redundant, reliable storage)

MapReduce (cluster resource management and data processing)

Pig Hive


WHAT’S NEW IN HADOOP 2.0?


YARN

YARN Replaces MapReduce as the Cluster Resource Manager

Yet Another Resource Negotiator

YARN will be the de facto distributed operating system for Big Data


YARN = BIG DATA


YARN: No Longer Just Batch Apps

Store DATA in one place. Interact with that data in MULTIPLE WAYS with Predictable Performance and Quality of Service.

Applications Run Natively IN Hadoop

BATCH (MapReduce)

INTERACTIVE (Tez)

ONLINE (HBase)

STREAMING (DataTorrent)

GRAPH (Giraph)

YARN (cluster resource management)

HDFS2 (redundant, reliable storage)


YARN: Applications

All running on the same Hadoop cluster, giving applications access to the same source data!

MapReduce v2

Real-Time Stream Processing

Master-Worker Online

In-Memory

Apache Storm


YARN: Quickly Maturing

[Timeline: 2010 to 2014 (Today)]

Conceived at Yahoo!

Alpha Releases – 2.0

Beta Releases – 2.1

GA Released – 2.2

Version 2.3

Version 2.4

Version 2.5

200,000+ nodes, 800,000+ jobs daily, 10 million+ hours of compute daily


YARN: What Has Changed?

YARN                                            MRv1
ResourceManager (RM), ApplicationMaster (AM)    JobTracker (JT)
Scheduler                                       Scheduler
NodeManager (NM)                                TaskTracker (TT)
Container                                       Map & Reduce Slot

[Diagram: YARN shows a ResourceManager with its Scheduler above NodeManagers hosting an ApplicationMaster and Containers; MRv1 shows a JobTracker with its Scheduler above TaskTrackers running Map and Reduce slots]


The 6 Benefits Of YARN

• Scale

• New programming models and services

• Improved cluster utilization

• Agility

• Backwards compatible with MapReduce v1

• Mixed workloads on the same source of data


THE FUTURE OF HADOOP


SQL on Hadoop

Speed: Deliver interactive query performance.

SQL: Support an array of SQL semantics for analytic applications running against Hadoop.

Scale: SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.


SQL on Hadoop

Hive on Apache Tez (Hortonworks HDP2)

Hive on Apache Spark (Cloudera CDH5)

Apache Drill (MapR M7)

Cloudera Impala (Cloudera CDH5)

Pivotal HAWQ (Pivotal Big Data Suite)


Apache Spark

HDFS2 (redundant, reliable storage)

YARN (cluster resource management)

Apache Spark (Databricks)

Programming Languages: Java, Scala, Python, R*

Interactive Shell: Ability to write code and get output immediately.

Up to ~100x Faster: Compared with MapReduce on some workloads, due to how it handles data in memory.


Apache Spark – Wordcount
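
The code from this slide did not survive in the transcript. As a stand-in, here is a minimal plain-Python sketch of the usual word count shape: split lines into words, pair each word with 1, then sum the values per key. It does not use the Spark API. The helpers `flat_map` and `reduce_by_key` are hypothetical names chosen to mirror the RDD operations `flatMap` and `reduceByKey`, and the input lines are made up.

```python
def flat_map(func, items):
    """Apply func to each item and flatten the results (mirrors RDD.flatMap)."""
    return [out for item in items for out in func(item)]


def reduce_by_key(func, pairs):
    """Combine the values of each key with func (mirrors RDD.reduceByKey)."""
    merged = {}
    for key, value in pairs:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged


lines = ["to be or not to be", "to do is to be"]  # hypothetical input

words = flat_map(lambda line: line.split(), lines)  # lines -> words
pairs = [(word, 1) for word in words]               # word -> (word, 1)
counts = reduce_by_key(lambda a, b: a + b, pairs)   # sum the 1s per word

for word, count in sorted(counts.items()):
    print(word, count)
```

In Spark the same three steps are chained on a distributed dataset, so the per-key sums are computed in parallel across the cluster rather than in one local loop.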


HOYA: HBase (NoSQL) on YARN

Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.

Easier Deployment: APIs to create, start, stop and delete HBase clusters.

Availability: Recover from Region Server loss with a new container.


Apache REEF (Retainable Evaluator Execution Framework)

Machine Learning: Framework well suited for building machine learning jobs.

Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.

Maintain State: Users can build jobs that utilize data from where it’s needed and also maintain state after jobs are done.


Real-Time Stream Processing

Apache Storm

Streaming


Heterogeneous Storage

THEN: NameNode with a single, undifferentiated Storage tier

NOW: NameNode with SATA, SSD and Fusion IO storage types


Hadoop Roadmap

• Apache Hadoop 2.5 (Q3 2014)
  – NodeManager Restart w/o disruption

• Apache Hadoop 2.6 (Q4 2014)
  – Memory As Storage Tier
  – Dynamic Resource Configuration
  – Support For Docker Containers


I KNOW YOU HAVE QUESTIONS



THANK YOU!

http://bigdatajoe.io/

http://bigdatacentric.com/

@bigdatajoerossi

[email protected]
